Twelve Labs Launches Marengo 2.7, Introducing New Multi-Vector Approach to Video Understanding

Latest innovation yields greater than 15% improvement over previous foundation model

Twelve Labs, the video understanding company, announced Marengo 2.7, a new state-of-the-art multimodal embedding model that achieves a greater than 15% improvement over its predecessor, Marengo 2.6. Building on the success of the previous video foundation model, Marengo 2.7 represents a significant advance in multimodal video understanding: it adopts a multi-vector approach that enables more precise and comprehensive analysis of video content. It is the first video foundation model to do so, and early results are striking, including 90.6% average recall in object search (a 32.6% improvement over the previous version) and 93.2% recall in speech search (2.8% higher than specialized speech-to-text systems).

Video understanding has been a notoriously difficult problem to solve. A single video clip simultaneously contains visual elements (objects, scenes, actions), temporal dynamics (motion, transitions), audio components (speech, ambient sounds, music), and often textual information (overlays, subtitles). Traditional single-vector approaches struggle to compress all of these diverse aspects into one representation without losing critical information. Marengo 2.7 upends this thinking with a fundamentally different design.

A Novel Approach

With Marengo 2.7, Twelve Labs deploys a multi-vector representation for the first time to address the complexities inherent in video. Unlike Marengo 2.6, which compresses all information into a single embedding, Marengo 2.7 decomposes the raw inputs into multiple specialized vectors. Each vector independently captures a distinct aspect of the video content, from visual appearance and motion dynamics to OCR text and speech patterns.

For example, one vector might capture what things look like (e.g., “a man in a black shirt”), another might track movement (e.g., “waving his hand”), and a third might capture what was said (e.g., “video foundation model is fun”). This approach helps the model better understand videos that contain many different types of information, leading to more accurate analysis across every aspect: visual, motion, and audio.
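To make the idea concrete, here is a minimal, hypothetical sketch in Python. It is not Twelve Labs’ implementation or API: the class, field, and function names are invented for illustration, and scoring a query against each vector and keeping the best match is just one plausible way to search over multi-vector embeddings.

```python
# Hypothetical sketch of a multi-vector video representation.
# NOT Twelve Labs' implementation: all names here are invented.
from dataclasses import dataclass

import numpy as np


@dataclass
class VideoEmbedding:
    """One specialized vector per aspect of a clip, instead of a single
    embedding that must compress everything at once."""
    appearance: np.ndarray  # what things look like ("a man in a black shirt")
    motion: np.ndarray      # movement dynamics ("waving his hand")
    speech: np.ndarray      # spoken content ("video foundation model is fun")
    ocr: np.ndarray         # on-screen text (overlays, subtitles)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def score_query(query_vec: np.ndarray, video: VideoEmbedding) -> float:
    """Score the query against each specialized vector and keep the best
    match, so a speech-only query can still rank a clip highly even when
    its visual vectors are unrelated."""
    aspects = (video.appearance, video.motion, video.speech, video.ocr)
    return max(cosine(query_vec, v) for v in aspects)


# Usage: rank a toy index of three clips against an embedded text query.
rng = np.random.default_rng(0)
index = [VideoEmbedding(*(rng.normal(size=128) for _ in range(4)))
         for _ in range(3)]
query = rng.normal(size=128)  # stand-in for the query's embedding
best = max(range(len(index)), key=lambda i: score_query(query, index[i]))
print(f"best match: clip {best}")
```

In a real system the query vector would of course come from the embedding model rather than a random generator; the sketch only shows how per-aspect vectors keep distinct signals separate at search time.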

Marengo 2.7 demonstrates particular strength in detecting small objects while maintaining exceptional performance in general text-based search tasks. This level of granular representation enables more nuanced multimodal search capabilities. Now, with Marengo 2.7, users can search complex visual scenes, find specific brand appearances, locate exact audio moments, match images to video segments, and more.

“Twelve Labs continues to push video understanding forward in unprecedented ways, turning the concept of a multi-vector approach into reality for the very first time,” said Jae Lee, CEO of Twelve Labs. “Our R&D team is laser-focused on solving what was previously considered unsolvable. Their groundbreaking work has been rigorously tested, and the model’s performance is vastly superior to anything on the market. We look forward to seeing how our customers will use this powerful technology.”
