Meta Motivo behavioral foundation model, Multimodal AI Agents, Gemini 2.0 Flash, Real-time Video and screen-sharing in the Multimodal Live API and ChatGPT, Phi-4, Sora, TRELLIS, Deep Research & more
Hi. Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #86):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
From our sponsors:
200+ hours of research on AI tools & hacks packed in 3 hours
This free 3-hour Mini Course on AI & ChatGPT (worth $399) will help you become a master of 20+ AI tools & prompting techniques and save 16 hours/week.
Get it now, absolutely free! (first 100 users only) 🎁
This course will teach you how to:
Build businesses that make $10,000 using just AI tools
Make quicker & smarter decisions using AI-led data insights
Write emails, content & more in seconds using AI
Solve complex problems, research 10x faster & save 16 hours every week
Register & save your seat now! (100 free seats only)
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Google Updates [Details]:
Gemini 2.0 Flash (experimental version): a new model that outperforms 1.5 Pro on key benchmarks at twice the speed, with new capabilities including improved spatial understanding. In addition to supporting multimodal inputs like images, video and audio, 2.0 Flash now supports multimodal output such as natively generated images mixed with text and steerable, multilingual text-to-speech (TTS) audio. It can also natively call tools like Google Search and code execution, as well as third-party user-defined functions. Gemini 2.0 Flash is available now in Google AI Studio and Vertex AI, with multimodal input and text output available to all developers, and text-to-speech and native image generation available to early-access partners. General availability will follow in January, along with more model sizes. (A minimal API sketch follows at the end of this list.)
Multimodal Live API: real-time, two-way interactions that use text, audio, and video input, with audio and text output. The model's video understanding enables you to share camera input or screencasts and ask questions about them.
Deep Research: a personal AI research assistant rolling out in Gemini Advanced. Deep Research explores complex topics on your behalf and provides findings in a comprehensive, easy-to-read report [Details].
Jules: an experimental AI-powered code agent that integrates directly into a GitHub workflow.
Project Mariner: an early research prototype for browser-based agents, built with Gemini 2.0, that can understand and reason across information on your browser screen, including pixels and web elements like text, code, images and forms, and then use that information, via a Chrome extension, to complete tasks.
Project Astra (Universal AI Assistant): With Gemini 2.0, Project Astra can use Google Search, Lens and Maps. It now has up to 10 minutes of in-session memory and can remember more conversations.
Gemini-exp-1206: a new model with a 2M-token context length, available via AI Studio and the Gemini API. Gemini-exp-1206 currently ranks first overall on the Chatbot Arena LLM Leaderboard [Link].
Willow: a state-of-the-art quantum chip. Willow performed a standard benchmark computation in under five minutes that would take one of today’s fastest supercomputers 10 septillion (that is, 10²⁵) years. Willow can reduce errors exponentially as it scales up using more qubits, cracking a key challenge in quantum error correction that the field has pursued for almost 30 years [Details].
Android XR: a new Gemini-powered operating system for extended reality headsets and glasses [Details]
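For developers who want to try the new models from the API side, here is a minimal sketch using the google-generativeai Python SDK. The experimental model name gemini-2.0-flash-exp, the API-key placeholder, and the sample image file are assumptions; check Google AI Studio for the current identifiers. The same call pattern applies to Gemini-exp-1206 by swapping the model string, while the Multimodal Live API uses a separate, WebSocket-based streaming interface not shown here.

```python
# Minimal sketch, assuming the google-generativeai SDK and an experimental
# model name like "gemini-2.0-flash-exp" (verify in Google AI Studio).
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # key created in Google AI Studio
model = genai.GenerativeModel("gemini-2.0-flash-exp")

# Text-only prompt.
print(model.generate_content("Summarize this week's AI news in one sentence.").text)

# Multimodal prompt: mix an image with text in a single request.
image = PIL.Image.open("screenshot.png")  # hypothetical local file
print(model.generate_content([image, "What does this screenshot show?"]).text)
```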
Meta Updates [Details]:
Meta Motivo: a first-of-its-kind behavioral foundation model that controls the movements of a virtual embodied humanoid agent to perform complex tasks. Meta Motivo is able to solve a wide range of whole-body control tasks, including motion tracking, goal pose reaching, and reward optimization, without any additional training or planning.
Meta Video Seal: an open-source model for video watermarking that builds on Meta Audio Seal. Video Seal embeds a watermark (with an optional hidden message) into videos that is imperceptible to the naked eye and can later be uncovered to determine a video’s origin.
Meta Omni Seal Bench: a leaderboard for neural watermarking covering several modalities.
OpenAI Updates:
Advanced Voice Mode in the ChatGPT mobile app has been upgraded with video and screen sharing, enabling users to share live video or their screens during voice conversations with ChatGPT [Details].
The Sora Turbo video generation model is now available to paid subscribers. It can generate 1080p videos of up to 20 seconds from text prompts [Details].
Canvas, a collaborative workspace for drafting, editing, and getting feedback on writing & code with ChatGPT, is now available to all users. You can run Python code directly in Canvas, letting ChatGPT fix any bugs based on console errors. Canvas now works with GPTs.
ChatGPT is now integrated into Apple experiences within iOS, iPadOS, and macOS, allowing access to ChatGPT right within the OS.
Sora System Card [Details].
OpenAI's Reinforcement Fine-Tuning research program [Link].
Microsoft introduced Phi-4, a 14B-parameter state-of-the-art small language model (SLM) that excels at complex reasoning in areas such as math, in addition to conventional language processing. Phi-4 outperforms much larger models, including Gemini Pro 1.5, on math competition problems [Details]. A hedged loading sketch follows below.
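Here is a quick, hedged sketch of what running Phi-4 locally could look like with Hugging Face transformers, assuming the weights are published under an identifier like microsoft/phi-4 and you have a GPU with roughly 30 GB of memory for the bf16 weights; the identifier and the sample prompt are illustrative, not from the announcement.

```python
# Hypothetical sketch: run Phi-4 via the transformers text-generation pipeline.
# "microsoft/phi-4" is an assumed repository id; adjust to the official release.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/phi-4",      # assumed identifier
    torch_dtype=torch.bfloat16,   # 14B parameters: roughly 28 GB in bf16
    device_map="auto",
)

messages = [
    {"role": "user",
     "content": "A train covers 120 km in 1.5 hours. What is its average speed in km/h?"},
]
result = pipe(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```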
Shanghai AI Lab and others released InternLM-XComposer2.5-OmniLive (IXC2.5-OL), a comprehensive open-source multimodal system designed for long-term streaming video and audio interactions, similar to Gemini 2.0 Live Streaming [Details]
Devin, Cognition’s autonomous AI software engineer, is generally available starting at $500 a month [Details].
Microsoft Research released TRELLIS, an open-source large 3D asset generation model. It takes in text or image prompts and generates high-quality 3D assets in various formats, such as Radiance Fields, 3D Gaussians, and meshes [Details | Demo].
Meta released Llama 3.3, an open-source 70B model that delivers performance similar to Llama 3.1 405B with more cost-effective inference [Details]. A loading sketch follows below.
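For context on the cost-effective inference claim, here is a hedged sketch of loading Llama 3.3 70B Instruct with 4-bit quantization so it can fit on a single high-memory GPU. The repository id meta-llama/Llama-3.3-70B-Instruct, the bitsandbytes settings, and the prompt are assumptions for illustration; access to the weights requires accepting Meta's license on Hugging Face.

```python
# Hedged sketch: 4-bit quantized inference with transformers + bitsandbytes.
# The repository id and generation settings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # assumed identifier

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # roughly 35-40 GB vs ~140 GB in bf16
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [{"role": "user", "content": "Give three headline takeaways from this week in AI."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```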
Midjourney is launching a multiplayer collaborative worldbuilding tool called ‘Patchwork’ [Details].
Replit Agent is now out of early access, upgraded with advanced visual capabilities for app creation. Alongside this, Replit launched a new Replit Assistant and introduced a new billing model based on ‘checkpoints’ [Details].
Reddit is rolling out ‘Reddit Answers’, an AI-powered conversational tool that allows users to ask questions and receive curated summaries from relevant Reddit discussions, complete with links to related communities and posts [Details].
Lambda launched an ‘inference-as-a-service’ API, claiming the lowest costs in the AI industry [Details].
xAI's Grok now features ‘Aurora’, an image generation model capable of producing photorealistic visuals from text and images. It's available on the X platform in select countries, with global access coming soon [Details].
LM Arena launched WebDev Arena, an arena where two LLMs compete to build a web app and you can vote on which one performed better [Link].
🔦 Weekly Spotlight
Clio: A system for privacy-preserving insights into real-world AI use [Details].
Adding payments to your LLM agentic workflows [Link].
The Hugging Face community released an Open Preference Dataset for Text-to-Image Generation [Link].
ClearerVoice-Studio: an open-source, AI-powered speech processing toolkit designed for researchers, developers, and end-users [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
Hyper3D: 3D assets, HDRI and animatable 3D avatars generation.
Remento: Preserve a loved one’s stories the easy way. Remento turns their spoken words into a personalized keepsake book of their stories.
Consistent Character AI: Use a simple phrase to change a character’s expression, action, or pose while keeping maximum consistency.
AISmartCube: Build AI tools with No Code.
eSelf AI: Create real-time, video-based conversational AI agents.
Countless.dev: Compare AI models easily.
Last week’s issue
Thanks for reading and have a nice weekend! 🎉 Mariam.