Multi-robot collaboration, Grok 3, smallest video language model, Generative AI Model for Gameplay, AI co-scientist, Mistral Saba, Fiverr Go, Step-Video-T2V and Step-Audio, Pikaswaps & more
Feb 21, 2025
Hey there! Welcome back to AI Brews - a concise roundup of this week's major developments in AI.
In today’s issue (Issue #94):
AI Pulse: Weekly News at a Glance
Weekly Spotlight: Noteworthy Reads and Open-source Projects
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News at a Glance
xAI unveiled an early preview of its Grok 3 reasoning model, which topped the Chatbot Arena leaderboard. Grok 3, with a context window of 1 million tokens, shows leading performance across both academic benchmarks and real-world user preferences. The model is now available for free through X and Grok.com [Details].
Figure introduced Helix, a generalist Vision-Language-Action (VLA) model that unifies perception, language understanding, and learned control to overcome multiple longstanding challenges in robotics. Video below shows collaborative grocery storage. A single set of Helix neural network weights runs simultaneously on two robots as they work together to put away groceries neither robot has ever seen before [Details].
Microsoft introduced Muse, the first World and Human Action Model (WHAM). It’s a generative AI model of a video game that can generate game visuals, controller actions, or both. Microsoft is open sourcing the weights and sample data [Details].
Hugging Face released SmolVLM2, a family of lightweight multimodal models designed to analyze video content. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models are the smallest video language models ever released [Details].
Google introduced AI co-scientist, a multi-agent AI system built with Gemini 2.0 as a virtual scientific collaborator to help scientists generate novel hypotheses and research proposals [Details].
Phind, the AI search engine, has been updated to go beyond text answers and include images, diagrams, interactive widgets, cards, and other rich visual outputs within the answer itself [Details].
StepFun released Step-Video-T2V and Step-Audio. Step-Video-T2V is a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the ability to generate videos of up to 204 frames. Step-Audio is a production-ready, open-source model family for intelligent and natural speech interaction [Details].
Perplexity AI launched a Deep Research tool that performs dozens of searches, reads hundreds of sources, and reasons through the material to autonomously deliver a comprehensive report. It’s free for up to 5 queries per day for non-subscribers and 500 queries per day for Pro users [Details].
Sakana AI introduced AI CUDA Engineer, the first comprehensive agentic framework for fully automatic CUDA kernel discovery and optimization [Details].
Pika launched Pikaswaps, a new feature to replace anything in your videos using photos you upload or scenes you describe. Pika also launched an official iOS app [Link].
OpenAI introduced SWE-Lancer, a new benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. Claude 3.5 Sonnet performs best, followed by o1 and then GPT-4o [Details].
Mistral released Mistral Saba, a 24B parameter model trained on curated datasets from across the Middle East and South Asia. It supports Arabic and many Indian-origin languages, and is particularly strong in South Indian-origin languages such as Tamil [Details].
Google released PaliGemma 2 Mix, an upgraded vision-language model in the Gemma family. PaliGemma 2 Mix can solve tasks such as short and long captioning, optical character recognition (OCR), image question answering, object detection, and segmentation [Details].
Fiverr launched Fiverr Go, which lets freelancers doing voice-over, graphic design, and related work train AI on their content and charge customers for access [Details].
Meta announced LlamaCon, a developer conference coming in 2025 that explores the potential of Llama [Details].
LangChain released LangMem SDK, a library that helps your agents learn and improve through long-term memory [Details].
Perplexity AI released R1 1776, a DeepSeek-R1 reasoning model that has been post-trained to remove censorship [Details].
🔦 Weekly Spotlight: Noteworthy Reads and Open-source Projects
OmniParser V2: Turning Any LLM into a Computer Use Agent.
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance.
Arch: an intelligent (edge and LLM) proxy designed for agentic applications, helping you protect, observe, and build agentic tasks by simply connecting (existing) APIs.
Mcp.run: Host, discover, publish, and manage Model Context Protocol servlets for LLMs + agents.
Aide: an open-source, AI-native code editor. It is a fork of VS Code and integrates tightly with the leading agentic framework on swebench-lite.
🔍 🛠️ AI Toolbox: Product Picks of the Week
Trupeer: Let AI turn your simple screen recordings into polished videos and detailed guides.
Beatoven.ai: AI composer for crafting background music.