As the cybersecurity landscape evolves, security leaders are turning to AI-powered security operations centers (SOCs) to combat increasingly complex multi-domain attacks. These attacks exploit weaknesses in legacy systems and trusted network connections, and they are setting new speed records for intrusions: the average breakout time for eCrime activity has dropped significantly over the past year. Attackers are leaning on generative AI, social engineering, and cloud vulnerabilities, tactics that are particularly effective against organizations with outdated or inadequate defenses. Security teams are under pressure to analyze vast amounts of data and respond to threats faster than traditional security information and event management (SIEM) systems allow, which is pushing many companies toward more efficient and cost-effective alternatives.

SOCs face several challenges that an AI-native approach could help solve: the overwhelming volume of alerts from legacy systems, many of them false positives; a shortage of experienced SOC analysts; the growing sophistication of multi-domain threats; and increasingly complex cloud configurations. AI is already being used by criminals to defeat cybersecurity measures, but it can also be a powerful tool for defenders. According to the 2024 Cisco Cybersecurity Readiness Index, AI should be integral to a company's core infrastructure rather than an add-on, and research firm Gartner predicts that by 2028, multi-agent AI will be used in 70% of threat detection and incident response implementations, primarily to augment staff rather than replace them.

AI-driven SOCs can accelerate threat detection and improve predictive accuracy using real-time telemetry data, and AI-based tools such as chatbots are already providing faster turnaround on a range of analyst queries. Graph database technologies, which visualize and analyze interconnected data in real time, are helping defenders track threats and breaches more effectively. AI is proving successful in reducing false positives, automating incident response, enhancing threat analysis, and streamlining SOC operations; combined with graph databases, it can also help SOCs track and stop multi-domain attacks.

An AI-native SOC strategy should prioritize human-in-the-loop workflows, aiming to deepen the knowledge and skills of SOC analysts and retain talent. Organizations that foster a culture of continuous learning and use AI as a tool for both training and operations are already gaining a competitive edge. AI-driven SOCs can significantly reduce incident response times, with some organizations reporting decreases of up to 50%, allowing security teams to address threats more promptly and minimize potential damage. Looking ahead, the role of AI in SOCs is expected to expand to proactive adversary simulations, continuous health monitoring of SOC ecosystems, and stronger endpoint and identity security through zero-trust integration, further strengthening organizations' defenses against evolving threats.
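As a rough illustration of how a graph view helps correlate activity across domains, the sketch below builds a small alert graph in Python with networkx and walks the connected component around a compromised identity. The entity names, alert fields, and pivot point are hypothetical; a production SOC would use a dedicated graph database and real telemetry.

```python
import networkx as nx

# Hypothetical alerts spanning identity, endpoint, and cloud domains.
alerts = [
    {"entity": "user:alice", "linked_to": "host:laptop-42", "signal": "impossible_travel"},
    {"entity": "host:laptop-42", "linked_to": "cloud:prod-bucket", "signal": "unusual_data_access"},
    {"entity": "cloud:prod-bucket", "linked_to": "ip:203.0.113.7", "signal": "exfil_volume_spike"},
    {"entity": "user:bob", "linked_to": "host:desktop-9", "signal": "failed_mfa"},
]

graph = nx.Graph()
for alert in alerts:
    # Each alert becomes an edge between the two entities it connects.
    graph.add_edge(alert["entity"], alert["linked_to"], signal=alert["signal"])

# Pivot from a suspicious identity and pull every entity reachable from it,
# which approximates tracing one multi-domain attack path.
incident = nx.node_connected_component(graph, "user:alice")
print("Entities in suspected attack path:", sorted(incident))
for u, v, data in graph.edges(data=True):
    if u in incident and v in incident:
        print(f"{u} -> {v}: {data['signal']}")
```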
In the rapidly evolving landscape of 2025, the challenge of scaling generative AI tools is more daunting than ever. Companies are striving to integrate large language models (LLMs) into their operations, but scaling goes beyond simply deploying larger models or investing in the latest tech. It involves a comprehensive transformation of operations to align AI capabilities with business objectives, optimize costs, and empower teams.

As businesses transition from early AI experimentation to large-scale deployments, they face a turning point. The initial excitement of adoption has given way to the practical challenges of efficiency, cost management, and maintaining relevance in competitive markets. The key to successful AI scaling in 2025 lies in answering tough questions: How can generative tools be made impactful across departments? What infrastructure will support AI growth without straining resources? And crucially, how can teams adapt to AI-driven workflows? Three principles are critical to success: identifying clear, high-value use cases; maintaining technological flexibility; and fostering a workforce that can adapt. Successful companies don't merely adopt generative AI; they devise strategies that align the technology with business needs, continually reassessing costs, performance, and the cultural shifts necessary for long-term impact. This approach is not just about deploying advanced tools; it's about building operational resilience and scalability in a fast-paced, evolving tech environment.

Companies like Wayfair and Expedia exemplify these lessons, demonstrating how a hybrid approach to LLM adoption can transform operations. By combining external platforms with custom solutions, they illustrate the power of agility and precision, setting a benchmark for others. The choice between building or buying generative AI tools is not binary, as their strategies show: they strike a balance between flexibility and specificity, using external AI platforms for general applications while developing proprietary tools for specific needs. They have also found that smaller, cost-effective models often outperform larger, more expensive ones on narrow tasks such as tagging product attributes. The journey of scaling LLMs recalls the enterprise resource planning (ERP) evolution of the 1990s, when companies had to choose between rigid off-the-shelf solutions and customized systems; the successful ones blended external tools with tailored developments to address specific operational challenges.

Wayfair and Expedia have shown that the true power of LLMs lies in targeted applications that deliver measurable impact. Wayfair uses generative AI to enrich its product catalog, improving search and customer recommendations. Expedia has integrated generative AI across customer service and developer workflows, significantly improving customer satisfaction and accelerating code generation and debugging. Infrastructure considerations, often overlooked, play a crucial role in the long-term sustainability of LLMs: both companies rely on cloud infrastructure to handle their AI workloads and continually assess the scalability of cloud providers while considering localized infrastructure for real-time applications. The challenge of deploying LLMs is not just technical but also cultural.
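To make the "smaller models for tagging" point concrete, here is a minimal sketch of product-attribute tagging with a compact zero-shot classifier from Hugging Face Transformers. The model choice, candidate labels, and confidence threshold are illustrative assumptions, not a description of Wayfair's or Expedia's actual pipelines.

```python
from transformers import pipeline

# A relatively small, general-purpose zero-shot classifier; no task-specific
# fine-tuning is required for a first pass at attribute tagging.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

product = "Mid-century modern walnut coffee table with hairpin legs"
candidate_attributes = ["material: walnut", "style: mid-century modern",
                        "style: industrial", "category: seating", "category: tables"]

result = classifier(product, candidate_attributes, multi_label=True)

# Keep only attributes the model is reasonably confident about (threshold is arbitrary).
tags = [label for label, score in zip(result["labels"], result["scores"]) if score > 0.7]
print(tags)
```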
Companies like Wayfair and Expedia emphasize the importance of preparing their organizations to adopt and integrate generative AI tools. Comprehensive training programs ensure employees can adapt to new workflows, while governance structures ensure responsible implementation.

Wayfair and Expedia's experiences provide valuable lessons for any organization looking to scale LLMs effectively. They demonstrate that success depends on identifying clear business use cases, maintaining flexibility in technology choices, and fostering a culture of adaptation. Their hybrid approaches show how to balance innovation with efficiency, ensuring that investments in generative AI deliver tangible results.

The pace of technological and cultural change makes scaling AI in 2025 an unprecedented challenge. The hybrid strategies, flexible infrastructures, and robust data cultures that define successful AI deployments today will pave the way for future innovation. Companies that lay these foundations now will not just scale AI; they will also scale resilience, adaptability, and competitive advantage. Looking forward, the challenges of inference costs, real-time capabilities, and evolving infrastructure needs will continue to shape the enterprise AI landscape. As Expedia's senior vice president puts it, "Gen AI and LLMs are going to be a long-term investment for us and it has differentiated us in the travel space. We have to be mindful that this will require some conscious investment prioritization and understanding of use cases."
As we conclude 2024, it's clear that artificial intelligence (AI) has made significant strides. Predicting what 2025 holds for AI is a challenge, but certain trends provide insight into what businesses can anticipate and how they can prepare to leverage these advancements.

Over the past year, the costs of cutting-edge AI models have fallen steadily. The cost per million tokens of leading large language models (LLMs) has dropped by more than 200 times over the past two years, largely due to increasing competition and technical advances in accelerator chips and specialized inference hardware, which allow AI labs to offer their models at lower prices. Businesses should capitalize on this trend by experimenting with the most advanced LLMs and building application prototypes around them, even where current costs seem high. As prices continue to fall and capabilities improve, many of those applications will become viable at scale, enabling businesses to achieve more with the same budget.

The introduction of OpenAI o1 has sparked a wave of innovation in the LLM space. Models are now capable of "thinking" longer and reviewing their responses, allowing them to tackle reasoning problems that were previously unsolvable with single inference calls. This has set off a race in the AI industry, with many open-source models now replicating o1's reasoning abilities and extending the paradigm to new areas, such as answering open-ended questions. These advances in o1-like models, also known as large reasoning models (LRMs), have two significant implications. First, because LRMs must generate large numbers of tokens for their responses, hardware companies will be motivated to build specialized AI accelerators with higher token throughput. Second, LRMs can help overcome a major hurdle for the next generation of language models: the need for high-quality training data. OpenAI is reportedly using o1 to generate training examples for its future models, and we can expect LRMs to help produce a new generation of small, specialized models trained on synthetic data for very specific tasks. To leverage these advances, businesses should allocate resources to exploring potential applications of frontier LRMs, testing the boundaries of these models and considering what becomes possible if next-generation models overcome current limitations.

The computational and memory constraints of transformers, the primary deep learning architecture used in LLMs, have also driven the development of alternative models with linear complexity. Among these, the state-space model (SSM) has made significant progress in the past year. Other promising approaches include liquid neural networks (LNNs), which use innovative mathematical formulations to accomplish more with fewer artificial neurons and compute cycles. Researchers and AI labs have released pure SSM models as well as hybrids that combine the strengths of transformers and linear models. While these models haven't yet matched the performance of top-tier transformer-based models, they are catching up quickly and are significantly faster and more efficient. If progress continues, many simpler LLM applications could move to these models and run on local servers or edge devices, allowing businesses to use proprietary data without sharing it with third parties.

The scaling laws of LLMs are also continually changing.
The release of GPT-3 in 2020 demonstrated that simply increasing a model's size could yield impressive results and enable models to perform tasks without explicit training. In 2022, DeepMind's Chinchilla paper established a new direction in data scaling laws, showing that training a model on a dataset several times larger than the number of its parameters continues to yield improvements. This allowed smaller models to compete with frontier models that have hundreds of billions of parameters.

However, there are concerns that these scaling laws are reaching their limits. Frontier labs are reportedly seeing diminishing returns from training ever-larger models. At the same time, training datasets have grown to tens of trillions of tokens, and obtaining quality data is becoming increasingly difficult and expensive. LRMs, however, point to a new direction: inference-time scaling. Where scaling model and dataset size stalls, we may be able to make progress by letting models run more inference cycles and correct their own errors.

As we move into 2025, the AI landscape continues to evolve in unforeseen ways, with new architectures, reasoning capabilities, and economic models reshaping what is possible. For businesses willing to experiment and adapt, these trends represent not just technological progress, but a fundamental shift in how AI can be used to solve real-world problems.
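As a rough sketch of the inference-time scaling idea, the snippet below spends extra compute at inference: it samples several candidate answers, asks a verifier to score each one, and combines self-verification with majority voting. The `generate` and `verify` functions are stand-ins for calls to whatever LLM API you use; this is a simplification, not a description of how o1-style systems actually work.

```python
import random
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for an LLM call that returns one candidate answer."""
    # In a real system this would call your model provider's API.
    return random.choice(["42", "41", "42", "forty-two"])

def verify(prompt: str, answer: str) -> float:
    """Placeholder for a second LLM call that scores an answer between 0 and 1."""
    return 1.0 if answer == "42" else 0.3

def solve_with_inference_scaling(prompt: str, n_samples: int = 8) -> str:
    # Spend extra compute at inference time: sample many candidates...
    candidates = [generate(prompt) for _ in range(n_samples)]
    # ...let the verifier critique each one...
    scored = [(verify(prompt, c), c) for c in candidates]
    # ...then vote among the candidates the verifier accepts.
    votes = Counter(c for score, c in scored if score > 0.5)
    if votes:
        best_answer, _ = votes.most_common(1)[0]
    else:
        best_answer = max(scored)[1]  # fall back to the highest-scored candidate
    return best_answer

print(solve_with_inference_scaling("What is 6 * 7?"))
```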
Google researchers have developed a new neural network architecture, called Titans, that could help overcome a significant challenge for large language models (LLMs): extending their memory at inference time without a steep increase in memory and computation costs. Titans combines traditional LLM attention blocks with "neural memory" layers, allowing models to handle both short- and long-term memory tasks efficiently.

LLMs typically use the transformer architecture, whose self-attention mechanism computes relationships between tokens, a method effective for learning complex patterns in token sequences. However, as the sequence length grows, the computational and memory costs of calculating and storing attention grow quadratically. Alternative architectures have been proposed to manage this issue, but these linear models often fall short of classic transformers because they compress contextual data and overlook details. The Google researchers argue that an ideal architecture would incorporate different memory components capable of drawing on existing knowledge, memorizing new facts, and learning abstractions from context.

Titans fills this gap by introducing a "neural long-term memory" module, which learns new information during inference without the inefficiencies of the full attention mechanism. The module uses the concept of "surprise" to decide which information is worth storing, retaining only data that adds useful information to what the model already knows, and it includes an adaptive forgetting mechanism that manages memory capacity by discarding information that is no longer needed. Titans is described as a family of models that blend existing transformer blocks with neural memory modules, built from three key components: a "core" module for short-term memory, a "long-term memory" module for storing information beyond the current context, and a "persistent memory" module for time-independent knowledge.

Preliminary tests of Titans models, ranging from 170 million to 760 million parameters, have shown promising results, outperforming both transformers and linear models on tasks involving long sequences, though the models still need to be tested at larger sizes. The development of Titans could lead to more efficient applications that incorporate new knowledge directly into prompts rather than relying on techniques like retrieval-augmented generation (RAG). This could shorten the development cycle for prompt-based applications and reduce inference costs for very long sequences, enabling wider deployment of LLM applications. Google plans to release the PyTorch and JAX code for training and evaluating Titans models.
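For intuition about the "surprise"-gated memory idea, here is a heavily simplified PyTorch sketch of a memory module that writes new associations in proportion to its own prediction error and decays old content over time. It is a toy illustration of the concept, not the Titans architecture or Google's forthcoming code; the layer sizes, update rule, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ToySurpriseMemory(nn.Module):
    """Toy sketch of a surprise-gated associative memory with forgetting."""

    def __init__(self, dim: int, write_lr: float = 0.1, forget: float = 0.05):
        super().__init__()
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.register_buffer("memory", torch.zeros(dim, dim))  # associative memory matrix
        self.write_lr, self.forget = write_lr, forget

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        k, v = self.key(x), self.value(x)
        recalled = k @ self.memory            # read: what the memory predicts for these keys
        surprise = v - recalled               # prediction error acts as the "surprise" signal
        # Write: outer-product update scaled by surprise, plus decay of stale content.
        update = torch.einsum("bd,be->de", k, surprise) / x.size(0)
        self.memory = ((1 - self.forget) * self.memory + self.write_lr * update).detach()
        return recalled

memory = ToySurpriseMemory(dim=16)
tokens = torch.randn(4, 16)
print(memory(tokens).shape)  # (4, 16): recalled content for each input
```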
Black Forest Labs, a startup rapidly gaining recognition as a leader in open-source AI image generation, has built models that have surpassed the quality of those offered by its competitors. Its founders previously worked at a renowned AI company, and at one point its technology served as the default image generator for a popular language model. The company is now advancing its offerings with the FLUX Pro Finetuning API, a tool that lets creators tailor generative AI models to their specific needs using their own images and concepts.

The API is aimed at professionals in creative industries such as marketing, branding, and storytelling, and offers a user-friendly way to personalize the company's flagship FLUX Pro and FLUX Ultra models. It enables users to fine-tune text-to-image models with between five and 20 training images, optionally accompanied by text descriptions. The result is a customized model that retains the generative versatility of the base FLUX Pro models while aligning outputs with a specific creative vision. With modes including "character", "product", "style" and "general", the tool is adaptable to a wide range of use cases. Trained models can be integrated with endpoints such as FLUX.1 Fill, Depth, Canny and Redux, and support high-resolution generation of up to four megapixels, making the API well suited to brand-consistent marketing visuals or detailed character art.

With the FLUX Pro Finetuning API, professionals can create customized models that preserve essential design elements, character consistency, or brand properties. A study by Black Forest Labs found that nearly 70% of users preferred the fine-tuned results of FLUX Pro over competing services. The API supports features such as:

• Inpainting: refining images with iterative edits using FLUX.1 Fill
• Structural control: guiding image generation with precise structural adjustments through integration with FLUX.1 Depth
• Visual branding: maintaining consistency across marketing materials and campaigns

Black Forest Labs has collaborated with BurdaVerlag, a leading German media and entertainment company, to showcase the API's potential. BurdaVerlag's creative teams are using the tool to develop customized FLUX models tailored to their brands, including a popular children's publication, letting design teams create visuals that reflect each brand's identity while exploring new creative directions. The API has accelerated their production workflows, enabling high-quality content generation at scale.

The FLUX Pro Finetuning API is now available through API endpoints via the FLUX.1 [dev] model, and pricing for all FLUX models on Black Forest Labs' API is competitive. The finetuning process requires minimal input: users upload training images in supported formats (JPG, JPEG, PNG or WebP), with resolutions capped at one megapixel for optimal results. Advanced configuration options allow fine control over the training process, including iteration counts, learning rates, and trigger words for precise prompt integration. Black Forest Labs has also provided comprehensive resources, including a Finetuning Beta Guide and Python scripts for easy implementation.
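A minimal sketch of what driving such a finetuning service from Python might look like is shown below. The base URL, endpoint paths, header, and field names are hypothetical placeholders, not Black Forest Labs' documented API; the parameters simply mirror the options described above (mode, iterations, learning rate, trigger word). Consult the Finetuning Beta Guide for the real interface.

```python
import base64
import os
import time
import zipfile

import requests

API_BASE = "https://api.example-bfl.test/v1"   # placeholder base URL, not the real endpoint
API_KEY = os.environ["BFL_API_KEY"]            # assumed to be set in the environment
HEADERS = {"x-key": API_KEY}                   # header name is an assumption

# Bundle 5-20 training images (JPG/PNG/WebP) into a zip and base64-encode it.
with zipfile.ZipFile("training_images.zip", "w") as archive:
    for name in os.listdir("brand_images"):
        archive.write(os.path.join("brand_images", name), arcname=name)
payload = base64.b64encode(open("training_images.zip", "rb").read()).decode()

# Kick off a finetuning job (illustrative request body only).
job = requests.post(f"{API_BASE}/finetune", headers=HEADERS, json={
    "file_data": payload,
    "mode": "style",
    "iterations": 300,
    "learning_rate": 1e-5,
    "trigger_word": "ACMEBRAND",
}).json()

# Poll until the custom model is ready, then generate with it.
while True:
    status = requests.get(f"{API_BASE}/finetune/{job['id']}", headers=HEADERS).json()
    if status.get("state") == "ready":
        break
    time.sleep(30)

image = requests.post(f"{API_BASE}/generate", headers=HEADERS, json={
    "finetune_id": job["id"],
    "prompt": "ACMEBRAND product hero shot on a pastel background",
}).json()
print(image)
```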
Users can monitor progress, adjust parameters, and test results directly via API endpoints, keeping the workflow smooth and efficient. With the FLUX Pro Finetuning API now available, Black Forest Labs aims to set a new standard for customized content creation in generative AI and to change how individuals and organizations approach personalized media generation, unlocking creative possibilities at scale.
A few months ago, Google Cloud introduced its first Arm-based CPU, Axion, powering the C4A virtual machine instances. Now it is taking the next step by launching C4A with Titanium SSDs, custom-designed local disks aimed at boosting storage performance. The move strengthens the C4A lineup with VMs that combine ultra-low latency, high-throughput storage, and cost efficiency, making them well suited to high-performance databases, analytics engines, and search applications that require real-time data processing. The new Titanium SSD-equipped C4A VMs are now available in services such as Compute Engine, Google Kubernetes Engine (GKE), Batch, and Dataproc. Standard C4A VMs are also in preview in Dataflow, with support for Cloud SQL, AlloyDB, and other services in the works.

Google Cloud's C4A instances offer three storage options: Persistent Disk, Hyperdisk, or Local SSD. Persistent Disk is a standard block storage service whose performance is shared between volumes of the same type, while Hyperdisk provides dedicated performance of up to 350,000 input/output operations per second (IOPS) and 5 GB/s of throughput per volume. For workloads that need local storage capacity, however, Hyperdisk may fall short; this is where local SSDs come in, with Titanium SSDs being the latest iteration. The newly launched C4A instances with Titanium SSDs offer up to 2.4 million random read IOPS, 10.4 GiB/s of read throughput, and 35% lower access latency than previous SSD generations. The SSDs are attached directly to the compute instances inside the host server and offload storage and networking tasks from the CPU, freeing up resources to improve application security and throughput. This is made possible by Google's Titanium system, which moves the offloading work from the host CPU into a system of custom silicon, hardware, and software running throughout the company's data centers.

The new C4A family with Titanium SSDs offers up to 72 vCPUs, 576 GB of memory, and 6 TB of local storage. Enterprises can choose between Standard (4 GB per vCPU) and High-memory (8 GB per vCPU) configurations, and connectivity scales up to 100 Gbps. These capabilities can support high-traffic workloads with real-time data processing, such as web and app servers, high-performance databases, data analytics engines, and search, as well as applications requiring in-memory caching, media streaming and transcoding, and CPU-based AI/ML.

According to Google Cloud senior product managers, C4A instances provide up to 65% better price-performance and up to 60% better energy efficiency than comparable current-generation x86-based instances, making C4A with Titanium SSDs a leading option for a broad range of Arm-compatible general-purpose workloads. Early adopters, including Couchbase and Elastic, are already reporting performance gains; Couchbase's SVP of product and partners highlighted how the combination of Google Axion C4A instances with Titanium SSDs delivers strong price-performance, ultra-low latency, and scalable compute power for analytic and operational workloads. The C4A VMs with Titanium SSDs are now available in key regions across the U.S., Europe, and Asia, with further expansion planned, and customers can access them through on-demand, Spot VM, and discounted pricing options.
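For a back-of-the-envelope sense of when local Titanium SSDs matter versus a Hyperdisk volume, the snippet below compares a workload's storage demands against the ceilings quoted above. The workload numbers are made-up assumptions, and the GB/s vs. GiB/s distinction is ignored for simplicity.

```python
# Storage ceilings quoted in the article (per Hyperdisk volume vs. per C4A instance
# with Titanium SSDs). The workload figures below are illustrative assumptions.
HYPERDISK_LIMITS = {"iops": 350_000, "throughput_gbs": 5.0}
TITANIUM_SSD_LIMITS = {"iops": 2_400_000, "throughput_gbs": 10.4}

workload = {"iops": 900_000, "throughput_gbs": 7.5}  # e.g. a hot key-value cache tier

def fits(limits: dict, demand: dict) -> bool:
    """True if every demanded metric stays under the corresponding ceiling."""
    return all(demand[key] <= limits[key] for key in demand)

for name, limits in [("Hyperdisk volume", HYPERDISK_LIMITS),
                     ("C4A + Titanium SSD", TITANIUM_SSD_LIMITS)]:
    verdict = "fits" if fits(limits, workload) else "exceeds the ceiling"
    print(f"{name}: workload {verdict} "
          f"({workload['iops']:,} IOPS requested vs {limits['iops']:,} max)")
```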
In summary, with significant improvements in performance, energy efficiency, and scalability, C4A VMs with Titanium SSDs are setting a new standard for cloud workloads, catering to the needs of modern enterprises.
Microsoft Research has unveiled a groundbreaking AI system, MatterGen, that is set to change the way new materials are discovered and designed. The tool uses AI to generate new materials with specific properties, potentially fast-tracking the creation of better batteries, more efficient solar cells, and other crucial technologies.

Traditionally, discovering new materials involves screening millions of existing compounds, a process that can take years. MatterGen introduces a different approach: it creates new materials based on specified characteristics, much like AI image generators produce images from text descriptions. The system uses a diffusion model, a type of generative AI adapted here to work with three-dimensional crystal structures, refining random atom arrangements into stable, useful materials that meet the specified criteria.

MatterGen's results have proven superior to previous methods. The materials it produces are more than twice as likely to be novel and stable, and more than 15 times closer to the local energy minimum, than those from previous AI approaches, indicating that the generated materials are more likely to be useful and physically feasible to create. In a notable demonstration of its capabilities, MatterGen designed a new material, TaCr2O6, that was then synthesized by scientists at an advanced technology institute in China; the real-world material closely matched the AI's predictions, confirming the system's practical value. MatterGen is also highly flexible: it can be fine-tuned to generate materials with particular crystal structures or desired electronic or magnetic characteristics, which could prove invaluable for designing materials for specific industrial uses.

The potential impacts of this technology are immense. New materials are key to advancing energy storage, semiconductor design, and carbon capture; improved battery materials could speed the shift to electric vehicles, while more efficient solar cell materials could make renewable energy more affordable. Microsoft has released MatterGen's source code under an open-source license, allowing researchers worldwide to build on the technology and amplifying its potential impact across scientific fields. The project is part of Microsoft's broader AI for Science initiative, which aims to accelerate scientific discovery using AI, and it integrates with Microsoft's Azure Quantum Elements platform, potentially making the technology accessible to businesses and researchers through cloud computing services.

Experts caution that while MatterGen is a significant advance, the journey from computationally designed materials to practical applications still involves rigorous testing and refinement: the system's predictions, while promising, require experimental validation before they can be deployed industrially. Even so, MatterGen represents a significant leap in using AI to accelerate scientific discovery. As one senior researcher on the project put it, "We're deeply committed to research that can have a positive, real-world impact, and this is just the beginning."
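To give a flavor of how a diffusion model "refines random atom arrangements", here is a deliberately toy PyTorch sketch: fractional atomic coordinates start as pure noise and are nudged step by step by a stand-in denoising network over a fixed schedule. Real crystal diffusion models such as MatterGen also diffuse lattice parameters and atom types and use trained, symmetry-aware score networks; everything below (the network, schedule, and shapes) is an illustrative assumption.

```python
import torch
import torch.nn as nn

n_atoms, dim = 8, 3          # toy cell: 8 atoms with 3D fractional coordinates
steps = 50                   # number of reverse-diffusion steps

# Stand-in "denoiser": in a real system this is a trained network that predicts
# the noise present in the current coordinates, conditioned on the timestep.
denoiser = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

coords = torch.rand(n_atoms, dim)          # start from random atom positions
betas = torch.linspace(1e-4, 0.02, steps)  # simple linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

with torch.no_grad():
    for t in reversed(range(steps)):
        t_embed = torch.full((n_atoms, 1), t / steps)
        eps_hat = denoiser(torch.cat([coords, t_embed], dim=-1))  # predicted noise
        # DDPM-style update: remove predicted noise, then re-add a little randomness.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        coords = (coords - coef * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            coords = coords + torch.sqrt(betas[t]) * torch.randn_like(coords)
        coords = coords % 1.0   # keep fractional coordinates inside the unit cell

print(coords)  # the "denoised" toy structure (meaningless here: the denoiser is untrained)
```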
Microsoft is intensifying its focus on small language models (SLMs) with rStar-Math, a novel reasoning technique designed to boost the performance of smaller models on mathematical problems. The method has been shown to match, and in some instances surpass, the capabilities of OpenAI's o1-preview model.

Still in its research phase and detailed in a paper on the preprint server arXiv.org, rStar-Math has been tested on several smaller open-source models, including Microsoft's Phi-3 mini, Alibaba's Qwen-1.5B (a 1.5-billion-parameter model), and Qwen-7B (a 7-billion-parameter model). The technique improved performance across all of them, even outperforming OpenAI's most advanced model on the MATH word problem-solving benchmark, which comprises 12,500 questions spanning areas such as geometry and algebra at all levels of difficulty. According to a Hugging Face post, the researchers intend to publish their code and data on GitHub, but the repository remains private while the team completes an internal review for open-source release.

The unveiling of rStar-Math comes shortly after Microsoft open-sourced its Phi-4 model, a smaller 14-billion-parameter AI system now available under the permissive MIT license on Hugging Face. While the Phi-4 release has broadened access to high-performance small models, rStar-Math presents a specialized approach, demonstrating how smaller AI systems can achieve state-of-the-art results in mathematical reasoning.

rStar-Math's success lies in its use of Monte Carlo Tree Search (MCTS), a technique that emulates deliberate, step-by-step human reasoning by iteratively refining candidate solutions to mathematical problems. Breaking a complex math problem into single-step subtasks reduces the difficulty for smaller models. The research team didn't merely apply MCTS; they also required the trained model to output its reasoning steps both as natural-language descriptions and as Python code. In addition, they trained a policy model to generate math reasoning steps and a process preference model to select the most promising steps, and improved both over four rounds of self-evolution, using publicly available word problems and their solutions as starting data and generating new problem-solving steps with the two models. After four rounds, rStar-Math reached significant milestones, outperforming OpenAI o1-preview on the MATH benchmark and solving 53.3% of problems on the American Invitational Mathematics Examination (AIME).

These results underscore the potential of SLMs in complex mathematical reasoning, a domain traditionally dominated by larger systems. Microsoft's focus on efficiency offers an alternative to the trend of ever-larger language models, which carry high computational and energy costs, and rStar-Math underscores that commitment by showing how SLMs can compete with, and in some cases outperform, much larger models. Together, the Phi-4 release and the rStar-Math paper suggest that compact, specialized models can be powerful alternatives to the industry's largest systems. By beating larger models on key benchmarks, they challenge the assumption that bigger is always better, giving mid-sized organizations and academic researchers access to advanced capabilities without the financial or environmental footprint of massive models.
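As a rough illustration of the MCTS idea behind this kind of step-by-step refinement, here is a compact, generic tree-search skeleton over reasoning steps. The step generator and the scoring function are toy stand-ins for the roles played in rStar-Math by the policy model and the process preference model; nothing here reproduces Microsoft's actual implementation.

```python
import math
import random

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps          # the partial chain of reasoning steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def propose_steps(steps):
    """Toy stand-in for a policy model proposing candidate next steps."""
    return [steps + [f"step-{len(steps)}-{i}"] for i in range(3)]

def score(steps):
    """Toy stand-in for a process reward/preference model scoring a rollout."""
    return random.random()

def ucb(child, parent, c=1.4):
    """Upper confidence bound used to balance exploration and exploitation."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(n_iterations=100, max_depth=4):
    root = Node(steps=[])
    for _ in range(n_iterations):
        node = root
        # Selection: walk down the tree picking the highest-UCB child.
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: ask the (toy) policy for candidate next steps.
        if len(node.steps) < max_depth:
            node.children = [Node(s, parent=node) for s in propose_steps(node.steps)]
            node = random.choice(node.children)
        # Evaluation and backpropagation of the (toy) reward up the path.
        reward = score(node.steps)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step as the "preferred" direction.
    return max(root.children, key=lambda ch: ch.visits).steps

print(mcts())
```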
The prominent AI orchestration framework LlamaIndex has rolled out a new architecture known as Agentic Document Workflows (ADW). The system aims to go beyond conventional retrieval-augmented generation (RAG) pipelines, boosting agent efficiency and productivity, and as orchestration frameworks continue to evolve, ADW could offer a viable way to enhance the decision-making capabilities of AI agents within organizations. LlamaIndex claims ADW lets agents handle intricate workflows that go beyond the basic extraction or matching that traditional RAG-based frameworks are limited to.

To illustrate the practical application of ADW, LlamaIndex points to contract review, where human analysts must extract crucial information, cross-check it against regulatory standards, identify potential risks, and formulate recommendations. Ideally, AI agents implemented in such a workflow would follow the same process, making decisions based on the documents they review and on knowledge derived from other documents. LlamaIndex explains that ADW addresses these challenges by treating documents as integral parts of broader business processes: an ADW system can maintain a consistent state across multiple steps, apply business rules, coordinate different components, and take actions based on document content rather than merely analyzing it.

LlamaIndex has previously argued that while RAG is a valuable technique, it is relatively primitive, particularly for enterprises seeking to improve their decision-making with AI. To address this, the company has developed reference architectures that combine its LlamaCloud parsing capabilities with AI agents, creating systems that can understand context, maintain state, and manage multi-step processes. In this design, each workflow is coordinated by an orchestrator that directs agents to use LlamaParse to extract information from documents, maintain document context and process state, and retrieve reference material from a knowledge base; the agents can then generate recommendations for contract reviews or other actionable decisions for a range of use cases. By maintaining state throughout the process, LlamaIndex emphasizes, agents can manage complex multi-step workflows that go beyond basic extraction or matching, building a comprehensive understanding of the documents they process while coordinating between different system components.

AI agent orchestration is an emerging field, and many organizations are still exploring its potential; it is expected to draw more attention as agents evolve from single systems into multi-agent ecosystems. AI agents extend the capabilities of RAG by grounding their operations in enterprise knowledge, but as more enterprises deploy them, they also expect agents to perform tasks much as human employees would. For these more complex use cases, standard RAG falls short, so enterprises are turning to advanced approaches like agentic RAG, which expands an agent's knowledge base and lets it decide whether it needs more information, which tool to use to find it, and whether the fetched context is relevant before delivering a result.
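To make the idea of a stateful, orchestrated document workflow concrete, here is a minimal plain-Python sketch of an orchestrator that parses a contract, checks it against rules, and accumulates state across steps so later decisions can build on earlier findings. It does not use LlamaIndex's actual ADW, LlamaCloud, or LlamaParse APIs; the step functions, fields, and rules are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    """State carried across steps so later decisions can use earlier findings."""
    document: str
    extracted: dict = field(default_factory=dict)
    risks: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

def extract_clauses(state: WorkflowState) -> WorkflowState:
    # Stand-in for a parsing step (LlamaParse or any parser) pulling key fields.
    state.extracted = {"liability_cap": "unlimited", "termination_notice_days": 10}
    return state

def check_against_rules(state: WorkflowState) -> WorkflowState:
    # Stand-in for cross-checking extracted terms against policy or regulation.
    if state.extracted.get("liability_cap") == "unlimited":
        state.risks.append("Liability is uncapped")
    if state.extracted.get("termination_notice_days", 0) < 30:
        state.risks.append("Termination notice below 30-day policy minimum")
    return state

def recommend(state: WorkflowState) -> WorkflowState:
    # Decisions here depend on state built up in earlier steps, not just raw text.
    for risk in state.risks:
        state.recommendations.append(f"Escalate to legal: {risk}")
    return state

def run_contract_review(document: str) -> WorkflowState:
    state = WorkflowState(document=document)
    for step in (extract_clauses, check_against_rules, recommend):
        state = step(state)   # the orchestrator threads one shared state through every step
    return state

result = run_contract_review("Example master services agreement text...")
print(result.risks)
print(result.recommendations)
```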
As the capabilities of large language models (LLMs) in coding continue to evolve, traditional benchmarks for assessing their performance are becoming increasingly inadequate. Although many LLMs score similarly on these benchmarks, it's often challenging to determine which ones are best suited for specific software development projects. A recent study by researchers at Yale University and Tsinghua University introduces an innovative method to evaluate the capacity of these models to handle "self-invoking code generation" tasks. These tasks, which demand reasoning, code generation, and the reuse of existing code, are more representative of real-world programming scenarios than traditional benchmark tests. Consequently, they offer a more accurate measure of an LLM's ability to tackle genuine coding problems.

LLMs are commonly evaluated using benchmarks such as HumanEval and MBPP (Mostly Basic Python Problems), datasets comprising handcrafted problems that necessitate code writing for simple tasks. However, these benchmarks only capture a small fraction of the challenges that software developers encounter in real-life situations. In practical scenarios, developers not only write new code but also understand and reuse existing code, and develop reusable components to address complex problems. The researchers argue that the ability to comprehend and utilize self-generated code, or self-invoking code generation, is critical for LLMs to leverage their reasoning capabilities in code generation. This capability is overlooked in current benchmarks.

To assess LLMs' proficiency in self-invoking code generation, the researchers developed two new benchmarks, HumanEval Pro and MBPP Pro. These benchmarks extend the existing datasets, with each problem building upon an existing example in the original dataset and introducing additional elements. These elements require the model to solve the base problem and use that solution to address a more complex problem.

The researchers' findings reveal a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. While top-tier LLMs are adept at generating individual code snippets, they often struggle to effectively use their own generated code to solve more intricate problems. Interestingly, while instruction fine-tuning significantly enhances performance on simple coding tasks, its efficacy diminishes in self-invoking code generation. This suggests a need to rethink how base models for coding and reasoning tasks are trained.

To further research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. This approach uses advanced LLMs to generate self-invoking problems based on the original problems, generate potential solutions, and verify their correctness by executing the code and running test cases. This process reduces the need for manual code review, enabling more examples to be generated with less effort.

This new category of benchmarks arrives as traditional coding benchmarks are rapidly being mastered by advanced models. Meanwhile, more complex benchmarks like SWE-Bench, which assesses models' abilities in comprehensive software engineering tasks, are proving challenging even for the most advanced models. Self-invoking code generation falls between simple benchmarks and SWE-Bench. It helps evaluate a specific type of reasoning ability: using existing code within a module to tackle complex problems.
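To illustrate what a self-invoking problem pair looks like, here is a hypothetical example in the spirit of HumanEval Pro and MBPP Pro (not an actual benchmark item): the model must first solve a base task, then reuse its own solution inside a harder task, with an executable test of the kind the automated pipeline would run.

```python
# Base problem: the kind of task found in the original HumanEval/MBPP datasets.
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards (ignoring case)."""
    s = s.lower()
    return s == s[::-1]

# Self-invoking extension: the harder problem is solved cleanly by
# calling the model's own solution to the base problem.
def longest_palindromic_words(sentence: str) -> list[str]:
    """Return the longest word(s) in the sentence that are palindromes."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    palindromes = [w for w in words if w and is_palindrome(w)]   # reuse of the base solution
    if not palindromes:
        return []
    longest = max(len(w) for w in palindromes)
    return [w for w in palindromes if len(w) == longest]

# A test case of the sort the automated pipeline would execute to verify correctness.
assert longest_palindromic_words("Anna saw a radar and a racecar, wow!") == ["racecar"]
```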
These benchmarks could provide a practical measure of the usefulness of LLMs in real-world settings, where human programmers are in charge, and AI copilots assist them in accomplishing specific coding tasks in the software development process. The researchers believe that HumanEval Pro and MBPP Pro could serve as valuable benchmarks for code-related evaluations and inspire future LLM development by highlighting current model shortcomings and encouraging innovation in training methodologies.