New Tests Reveal Open Source AI Closing Gap to Proprietary Leaders

AI startup Galileo released a comprehensive benchmark Monday revealing that open-source language models are rapidly closing the performance gap with their proprietary counterparts, a shift that could reshape the AI landscape by democratizing advanced AI capabilities and accelerating innovation across industries.

The second annual Galileo Hallucination Index assessed 22 leading large language models for their tendency to generate inaccurate information. While closed-source models still lead overall, that advantage has narrowed significantly in just eight months.
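
The article does not describe Galileo's scoring methodology, but the general idea of a context-adherence check can be sketched with a deliberately crude toy: flag answer sentences that share little vocabulary with the source context. Everything below (the function names, the word-overlap heuristic, the threshold) is a hypothetical stand-in, not Galileo's actual metric.

```python
# Toy illustration only: a crude lexical-overlap check standing in for a
# hallucination metric. Galileo's actual scoring method is not described in
# this article; every name and threshold here is a hypothetical stand-in.

def support_score(context: str, sentence: str) -> float:
    """Fraction of the sentence's words that also appear in the context.
    (Punctuation handling is omitted for brevity.)"""
    context_words = set(context.lower().split())
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(w in context_words for w in words) / len(words)

def flag_hallucinations(context: str, answer: str, threshold: float = 0.5):
    """Flag answer sentences whose overlap with the context falls below threshold."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return [(s, support_score(context, s) < threshold) for s in sentences]

if __name__ == "__main__":
    ctx = "The Galileo Hallucination Index assessed 22 leading large language models."
    ans = "The index assessed 22 models. It was commissioned by NASA."
    for sentence, flagged in flag_hallucinations(ctx, ans):
        print(f"{'FLAGGED' if flagged else 'ok':7} | {sentence}")
```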

“The huge improvements in open-source models have been absolutely amazing,” Vikram Chatterji, co-founder and CEO of Galileo, told VentureBeat. “Back then (in October 2023), the top five or six were closed-source API models, mostly from OpenAI. Now, open source is closing that gap.”

This trend could lower barriers to entry for startups and researchers, while putting pressure on established players to innovate faster or risk losing their advantage.

New AI royalty: Anthropic’s Claude 3.5 Sonnet dethrones OpenAI

Anthropic’s Claude 3.5 Sonnet topped the index as the best-performing model across all tasks, surpassing the OpenAI offerings that dominated last year’s rankings. The result signals a changing of the guard in the AI arms race, with new entrants challenging established leaders.

“We were very impressed with the latest Anthropic model set,” Chatterji said. “Not only did Sonnet perform well on short, medium, and long context windows, averaging 0.97, 1, and 1 on the tasks, respectively, but the model’s support for context windows of up to 200,000 tokens suggests it can handle even larger data sets than we tested.”

The index also highlighted the importance of considering value for money alongside raw performance. Google’s Gemini 1.5 Flash emerged as the most efficient option, delivering strong results at a fraction of the price of top-of-the-range models.

“The cost of prompt tokens in dollars per million for Flash was $0.35, but for Sonnet it was $3,” Chatterji told VentureBeat. “When you look at the bottom line, the cost of response tokens in dollars per million, it’s about $1 for Flash, but $15 for Sonnet. So anyone using Sonnet has to have at least 15 to 20 times more money in the bank, whereas Flash is literally not worse at all.”
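
To put those per-token prices in context, here is a back-of-envelope cost comparison in Python. The prices come from Chatterji’s quote above; the monthly token volumes and function names are purely illustrative assumptions, not figures from the article.

```python
# Back-of-envelope cost sketch using the per-million-token prices quoted above.
# The monthly token volumes below are illustrative assumptions, not from the article.

PRICES = {  # (prompt $, response $) per million tokens, per Chatterji's quote
    "Gemini 1.5 Flash": (0.35, 1.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

def monthly_cost(model: str, prompt_tokens: float, response_tokens: float) -> float:
    """Dollar cost for a given monthly token volume."""
    prompt_price, response_price = PRICES[model]
    return (prompt_tokens / 1e6) * prompt_price + (response_tokens / 1e6) * response_price

# Hypothetical workload: 100M prompt tokens and 20M response tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 100e6, 20e6):,.2f}/month")
# Gemini 1.5 Flash: $55.00/month
# Claude 3.5 Sonnet: $600.00/month
```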

This cost disparity could prove crucial for enterprises planning to deploy AI at scale, potentially driving adoption of cheaper models even if they do not top the raw performance rankings.

Global competition intensifies: Alibaba’s open-source model causes a stir

Alibaba’s Qwen2-72B-Instruct performed best among the open-source models, scoring high on short and medium-length inputs. This success signals a broader trend of non-U.S. companies making significant strides in AI development, challenging the notion of American dominance in the field.

Chatterji sees this as part of a larger democratization of AI technology. “I see this unlocking—using Llama 3, using Qwen—teams all over the world, at all different economic strata, can just start building really amazing products,” he said.

He added that we will likely see these models optimized for edge and mobile devices, with teams “creating amazing mobile apps, web apps, and edge apps using these open-source models.”

The index introduces a new approach to evaluating how models handle varying context lengths, from short snippets to long documents, reflecting the growing use of AI for tasks such as summarizing long reports or answering questions about vast data sets. This approach provides a more nuanced view of model capabilities, essential for companies considering AI deployment across a variety of scenarios.

“We focused on breaking it down based on context length—small, medium, large,” Chatterji told VentureBeat. “That, and the other big thing is cost versus performance. Because that’s really important to people.”
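
As an illustration of that breakdown, a minimal sketch of grouping evaluation prompts by context length might look like the following. The token boundaries and names are assumptions for demonstration only; the article does not specify Galileo’s actual cutoffs.

```python
# Minimal sketch: grouping evaluation prompts into short/medium/long context
# buckets so each bucket can be scored separately. The token boundaries are
# illustrative assumptions, not Galileo's published cutoffs.

from collections import defaultdict

BUCKETS = [("short", 0, 5_000), ("medium", 5_000, 25_000), ("long", 25_000, float("inf"))]

def bucket_for(token_count: int) -> str:
    """Return the name of the length bucket a prompt falls into."""
    for name, low, high in BUCKETS:
        if low <= token_count < high:
            return name
    raise ValueError(f"no bucket for {token_count}")

def group_prompts(prompts: list[tuple[str, int]]) -> dict[str, list[str]]:
    """Group (prompt_id, token_count) pairs by length bucket."""
    groups = defaultdict(list)
    for prompt_id, tokens in prompts:
        groups[bucket_for(tokens)].append(prompt_id)
    return dict(groups)

print(group_prompts([("q1", 1_200), ("q2", 18_000), ("q3", 60_000)]))
# {'short': ['q1'], 'medium': ['q2'], 'long': ['q3']}
```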

The index also revealed that bigger isn’t always better when it comes to AI models. In some cases, smaller models outperformed their larger counterparts, suggesting that efficient design can sometimes trump sheer scale.

“The Gemini 1.5 Flash model was an absolute revelation for us because it outperformed the larger models,” Chatterji said. “That suggests that if you have a lot of model design efficiency, it can trump scale.”

This discovery could lead to a shift in AI development as companies focus more on optimizing existing architectures rather than just increasing model size.

AI Crystal Ball: Predicting the Future of Language Models

Galileo’s findings could significantly impact AI adoption in enterprises. As open-source models improve and become more cost-effective, companies will be able to deploy powerful AI capabilities without relying on expensive proprietary services. This could lead to broader AI integration across industries, potentially increasing productivity and innovation.

The startup, which provides tools for monitoring and improving AI systems, is positioning itself as a key player in helping enterprises navigate the rapidly evolving landscape of language models. By offering regular, hands-on benchmarking, Galileo aims to become an essential resource for technical decision-makers.

“We want our enterprise customers and our AI team users to be able to use this as a powerful, ever-evolving resource that allows them to find the most effective way to build AI applications, rather than just wandering around trying to find a solution,” Chatterji said.

As the AI arms race intensifies, with new models being released almost weekly, the Galileo Index offers a snapshot of an industry in flux. The company plans to update the benchmark quarterly, providing ongoing insight into the changing balance between open-source and proprietary AI technologies.

Looking ahead, Chatterji predicts more progress in this area. “We’re starting to see large models that are like operating systems for this very powerful reasoning,” he said. “And that’s going to become more and more generalizable over the next maybe one to two years, and we’re also going to see the lengths of contexts that they can support, especially on the open-source side, start to increase significantly. The costs are going to come down significantly, and the laws of physics are just going to start to work.”

He also predicts a rise in the popularity of multimodal models and agent-based systems, which will require new evaluation frameworks and likely spur another round of innovation in the AI industry.

As organizations grapple with the rapid pace of AI development, tools like Galileo’s Hallucination Index will likely play an increasingly important role in informing decision-making and strategy. The democratization of AI capabilities, combined with the growing importance of cost-effectiveness, suggests a future in which advanced AI is not only more efficient, but also more accessible to a wider range of organizations.

This evolving landscape presents both opportunities and challenges for companies. While the availability of powerful, cost-effective AI models can drive innovation and efficiency, it also requires careful consideration of which technologies to adopt and how to effectively integrate them.

As the line between open-source and proprietary AI continues to blur, companies will need to stay informed and agile, ready to adapt their strategies as the technology evolves. The Galileo benchmark serves not only as a snapshot of the current state of AI, but also as a roadmap for navigating the complex and rapidly changing world of AI.