Cohere Adds Vision to Its RAG Search Capabilities


Cohere has added multimodal capabilities to its embedding model, allowing users to include images in RAG-style enterprise search.

Embed 3, released last year, is an embedding model that transforms data into numerical representations. Embeddings have become crucial in retrieval-augmented generation (RAG) because businesses can create embeddings of their documents that the model can then compare against a query to retrieve the information the prompt asks for.
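
To make that retrieval pattern concrete, the sketch below shows the basic loop: embed the documents, embed the query, and rank by cosine similarity. It follows Cohere's Python SDK conventions, but the model name, input types and placeholder API key are illustrative assumptions rather than details from the announcement.

```python
# Minimal RAG retrieval sketch: embed documents, embed the query,
# rank by cosine similarity. Model name and client setup are assumed
# for illustration only.
import numpy as np
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

documents = [
    "Q3 revenue grew 12% year over year, driven by enterprise contracts.",
    "The design system specifies a 4px spacing grid for all components.",
]

# Embed the documents once; in practice these vectors live in a vector store.
doc_resp = co.embed(
    texts=documents,
    model="embed-english-v3.0",   # assumed Embed 3 text model name
    input_type="search_document",
)
doc_vectors = np.array(doc_resp.embeddings)

# Embed the user's question with the matching query input type.
query_resp = co.embed(
    texts=["How did revenue change last quarter?"],
    model="embed-english-v3.0",
    input_type="search_query",
)
query_vector = np.array(query_resp.embeddings[0])

# Cosine similarity: the highest-scoring document is what a RAG pipeline
# would hand to the generative model as context.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(documents[int(np.argmax(scores))])
```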

The new multimodal version can generate embeddings from both images and text. Cohere claims that Embed 3 is “now the most powerful multimodal embedding model on the market.” Aidan Gomez, co-founder and CEO of Cohere, posted a chart on X showing performance improvements in image search with Embed 3.

“This advancement allows companies to derive real value from their vast amount of data stored in images,” Cohere said in a blog post. “Businesses can now create systems that can accurately and quickly search important multimodal assets such as complex reports, product catalogs and design files to improve workforce productivity.”

Cohere said a more multimodal approach increases the volume of data businesses can access through RAG search. Many organizations limit RAG searches to structured and unstructured text, even though their data libraries contain many other file formats. Customers can now pull in more tables, charts, product images and design files.
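
As a rough illustration of what indexing an image asset might look like, the sketch below base64-encodes a catalog page and sends it to the embed endpoint. The `images` parameter and `input_type="image"` are assumptions based on Cohere's embed API documentation, not code from this announcement, and the file name is hypothetical.

```python
# Hedged sketch of embedding an image so it can be searched alongside text.
# The `images` parameter and input_type="image" are assumptions about the
# multimodal Embed 3 endpoint; check Cohere's API reference before relying on them.
import base64
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

def to_data_uri(path: str, mime: str = "image/png") -> str:
    """Read an image file and return it as a base64 data URI."""
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode("utf-8")

# Embed a product-catalog page image (hypothetical file name).
image_resp = co.embed(
    model="embed-english-v3.0",                        # assumed model name
    input_type="image",
    images=[to_data_uri("product_catalog_page.png")],
)
image_vector = image_resp.embeddings[0]
print(len(image_vector))  # same dimensionality as the text embeddings
```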

Performance Improvements

Cohere said Embed 3’s encoders “share a unified latent space,” allowing users to keep both images and text in a single database. Some image embedding methods require maintaining separate databases for images and text. The company said its unified approach leads to better mixed-modality searches.

According to the company, “Other models tend to group text and image data into separate areas, leading to weak search results that are biased toward text data only. Embed 3, on the other hand, prioritizes the meaning of the data without favoring a specific modality.”
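
If the encoders really do share one latent space, text and image vectors can sit in a single index and be ranked together against one text query. The short sketch below continues the earlier examples (reusing `doc_vectors`, `image_vector` and `query_vector`) and is illustrative only.

```python
# Continues the earlier sketches: one index holding both modalities.
import numpy as np

index_vectors = np.vstack([doc_vectors, np.array(image_vector)])
index_items = documents + ["product_catalog_page.png"]

# One text query ranks text documents and images in the same list,
# which is the practical upshot of a shared latent space.
scores = index_vectors @ query_vector / (
    np.linalg.norm(index_vectors, axis=1) * np.linalg.norm(query_vector)
)
for item, score in sorted(zip(index_items, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {item}")
```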

Embed 3 is available in over 100 languages.

Cohere said multimodal Embed 3 is now available on its platform and Amazon SageMaker.

Playing catch-up

Many consumers are quickly becoming familiar with multimodal search, thanks to the introduction of image-based search on platforms like Google and chat interfaces like ChatGPT. As individual users become accustomed to searching for information from images, it makes sense that they would want the same experience in their professional lives.

Businesses have also started to see this benefit as other companies with embedding models offer multimodal options. Some model developers, like Google and OpenAI, offer some form of multimodal embeddings. Open-source models can also make it easier to embed images and other modalities. The race now is to offer a multimodal embedding model that can deliver the speed, accuracy and security that businesses demand.

Cohere, which was founded by some of the researchers behind the Transformer architecture (Gomez is one of the authors of the well-known paper “Attention Is All You Need”), has struggled to make itself the top choice for many in the business world. It updated its APIs in September to let customers switch easily from competing models to Cohere models. At the time, Cohere said the move was meant to align with industry norms, in which customers often switch between models.