Stable Point Aware 3D, Cosmos, Autonomous game characters and Digits by Nvidia, Qwen Chat, Hailuo's Subject Reference, rStar-Math, Text-to-Video gen with Transparency, Cohere's North, STAR, & more
Hi! Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #88):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Nvidia updates:
Cosmos - an open platform of generative world foundation models (WFMs), advanced tokenizers, guardrails, and an accelerated data processing and curation pipeline for autonomous vehicles (AVs) and robotics developers. Cosmos World Foundation Models are purpose-built for physical AI research and development, and can generate physics-based videos from a combination of inputs, like text, image and video, as well as robot sensor or motion data. The models are commercially usable [Details | GitHub].
Digits - A $3,000 personal AI supercomputer. It’s powered by the new GB10 Grace Blackwell Superchip, which packs enough processing power to run sophisticated AI models with up to 200 billion parameters, while being compact enough to fit on a desk and run from a standard power outlet [Details].
NVIDIA ACE autonomous game characters - First introduced in 2023, NVIDIA ACE is a suite of RTX-accelerated digital human technologies that bring game characters to life with generative AI. Nvidia is now expanding ACE from conversational non-playable characters (NPCs) to autonomous game characters that use AI to perceive, plan, and act like human players. Powered by generative AI, ACE will enable living, dynamic game worlds with companions that comprehend and support player goals, and enemies that adapt dynamically to player tactics [Details].
New NVIDIA AI Blueprints for building agentic AI applications. Blueprints are pre-defined, customizable AI workflows [Details].
Llama Nemotron and Cosmos Nemotron family of models. The Llama Nemotron Nano model will be offered as a NIM microservice for RTX AI PCs and workstations, and excels at agentic AI tasks [Details].
Previewed Project R2X, a vision-enabled PC avatar that can assist with desktop apps and video conference calls, analyze complex documents, create custom workflows, optimize PC settings, perceive the world around you, and more [Video].
Stability AI released Stable Point Aware 3D (SPAR3D), a large reconstruction model based on SF3D that enables real-time editing and complete structure generation of 3D objects from a single image in under a second. The model is free for both commercial and non-commercial use under the permissive Stability AI Community License [Details].
Microsoft has made its Phi-4 model, announced last month, fully open-source on Hugging Face with MIT license [Details].
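If you want to try it, the weights load like any other Hugging Face checkpoint. A minimal sketch with transformers, assuming the microsoft/phi-4 model ID and enough GPU memory for a ~14B-parameter model:

```python
# Minimal sketch: running Phi-4 locally with the transformers pipeline.
# Assumes the "microsoft/phi-4" checkpoint and a recent transformers release
# with chat-template support in text-generation pipelines.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # place layers on available devices automatically
)

messages = [{"role": "user", "content": "Why is the sky blue?"}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```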
Microsoft introduced rStar-Math, a new reasoning technique that shows small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students [Details].
The Qwen team has launched Qwen Chat, a web app for interacting with Qwen models. You can chat with their flagship model Qwen2.5-Plus, the vision-language model Qwen2-VL-Max, the reasoning models QwQ and QVQ, a coding model, and more. It lets you compare multiple models in one interface, upload documents, and includes an Artifacts interface [Link].
Hailuo AI launched S2V-01, a video model that generates character-consistent videos from just one reference image. It also lets you modify posture, expressions, lighting, and more, all with simple text-based prompts [Link].
HKUST and Adobe Research introduced TransPixar, an open-source model that can generate videos with transparent backgrounds [Details].
LM Studio’s latest update includes a Function Calling / Tool Use API through LM Studio's OpenAI compatibility API. This means you can use LM Studio with any framework that currently knows how to use OpenAI tools, and utilize local models for tool use instead [Details].
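In practice, that means pointing the standard openai client at LM Studio's local server. A minimal sketch, assuming the server is running on its default port with a tool-capable model loaded; the model name and get_weather tool below are placeholders:

```python
# Sketch of tool use through LM Studio's OpenAI-compatible local server.
# Assumes LM Studio is serving on the default port 1234; the api_key value
# is ignored by the local server but required by the client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local-model",  # placeholder: use whichever model LM Studio has loaded
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the model's requested tool call, if any
```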
Cohere launched the early access program for North, an all-in-one secure AI workspace platform that combines LLMs, search, and automation. AI agents created with North can quickly and easily connect to the workplace tools and applications that employees regularly use (including in-house applications) [Details].
Bytedance released Sa2VA - the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space [Details].
Google is testing a new Daily Listen feature that automatically generates a podcast based on your Discover feed [Details].
Bytedance introduced STAR, a novel approach that leverages Text-to-Video models for real-world video super-resolution [Details].
LlamaIndex has introduced Agent Document Workflow (ADW), a new architecture for applying agents on top of your documents. An ADW system can maintain state across steps, apply business rules, coordinate different components, and take actions based on document content - not just analyze it [Details].
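As a rough illustration of the pattern (a sketch, not ADW's actual release code), here's a two-step document workflow built on llama_index's event-driven Workflow primitives; the invoice fields and the approval rule are hypothetical:

```python
# Hypothetical two-step document workflow: extract fields, then apply a
# business rule. State is kept on the workflow Context between steps.
from llama_index.core.workflow import (
    Context,
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)

class Extracted(Event):
    fields: dict

class InvoiceWorkflow(Workflow):
    @step
    async def extract(self, ctx: Context, ev: StartEvent) -> Extracted:
        # Parse the incoming document (stubbed here) and stash state on the context.
        await ctx.set("doc_id", ev.doc_id)
        return Extracted(fields={"vendor": "ACME", "total": 120.0})  # placeholder values

    @step
    async def decide(self, ctx: Context, ev: Extracted) -> StopEvent:
        # Apply a business rule and act on the document, not just analyze it.
        action = "auto-approve" if ev.fields["total"] < 500 else "escalate"
        return StopEvent(result=action)

# Usage (inside an async context):
# result = await InvoiceWorkflow(timeout=60).run(doc_id="invoice-42")
```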
Moondream released an updated Moondream 1.9B with structured outputs, improved text understanding and a new Vision AI capability, Gaze Detection that tracks human attention [Details].
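A quick sketch of querying the model through transformers' remote-code loading; method names follow the moondream2 model card at the time of writing, and the image path and question are placeholders (see the card for the gaze-specific API):

```python
# Sketch: visual question answering with Moondream via transformers.
# The checkpoint ships its own inference code, hence trust_remote_code=True.
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    device_map="auto",
)
image = Image.open("crowd.jpg")  # placeholder image
print(model.query(image, "Where is the person in red looking?")["answer"])
```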
🔦 Weekly Spotlight
9 NVIDIA Announcements From CES 2025 And Their Impact On Blockchain [Link]
Agents Overview by Chip Huyen [Link].
smolagents: an open-source library by Hugging Face to run powerful agents in a few lines of code [Link] (see the sketch at the end of this list)
A Practical Guide to LLM Pitfalls with Open Source Software [Link]
Open Source Computer Use: A secure cloud Linux computer powered by E2B Desktop Sandbox and controlled by open-source LLMs [Link]
NVIDIA CEO Jensen Huang Keynote at CES 2025 [Link]
The latest crop of AI-enabled wearables like Bee AI and Omi listen to your conversations to help organize your life [Link]
ChatGPT search vs. Google: A deep dive analysis of 62 queries [Link]
Gemini Coder: an open-source project based on Llama Coder that turns your idea into an app [Link]
Google’s white paper on agents [Link]
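To give a feel for smolagents' "few lines of code" claim above, here's a minimal sketch along the lines of the library's README; HfApiModel defaults to a model served via the Hugging Face Inference API, so an HF token may be required:

```python
# Minimal smolagents example: a code-writing agent with a web search tool.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")
```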
🔍 🛠️ AI Toolbox: Product Picks of the Week
KoderAI (Wait list): A multi-agent coding platform to build apps, websites, and AI agents using simple natural language. KoderAI plans, designs wireframes, generates code, databases, and test scripts, and deploys apps to the cloud or app store with one click using a proprietary mixture-of-experts (MoE) model.
Simple AI: An AI phone assistant that calls businesses to make reservations, book appointments, and more.
Velocity: Create product videos from the product page link.
FreeForm by Every: An AI-powered tool for creating smarter, more adaptive forms.
Last week’s issue
Thanks for reading and have a nice weekend! 🎉 Mariam.
Have you tried the Stable Point Aware 3D demo on Hugging Face? It's a lot of fun, and you can watch each step of the reconstruction because you have to click through to the next one manually. In some cases, though, it fails to interpret the overall object and produces a "flat" rotating 3D image.