New World Models, World's smallest vision language model, o1 Pro Mode, Luma Photon, Largest Open-Source video model, Amazon Nova, PaliGemma 2, Fish Speech 1.5, LTX Video and more
Hi. Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #85):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Google DeepMind introduced Genie 2, a foundation world model capable of generating action-controllable, playable 3D environments for up to a minute. Generated from a single prompt image, these environments can be played by a human or an AI agent using keyboard and mouse inputs. As a world model, Genie 2 simulates virtual worlds, including the consequences of actions such as jumping and swimming [Details].
Tencent released HunyuanVideo, an open-source text-to-video model with 13 billion parameters, making it the largest open-source video model to date. According to professional human evaluations, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generation models [Details].
World Labs, the startup founded by AI expert Fei-Fei Li, introduced an AI system that generates 3D worlds from a single image. The generated worlds are rendered live in the browser and include a controllable camera with adjustable simulated depth of field [Details].
Moondream released Moondream 0.5B. At only 0.5 billion parameters, it is the world's smallest Vision-Language Model (VLM), optimized for edge devices and mobile platforms [Details].
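The announcement doesn't include code, but as a rough illustration of how a VLM this small can be queried locally, here is a minimal sketch following the Hugging Face API of the earlier moondream2 checkpoint (the 0.5B release may ship under a different name or entry point, so treat the identifiers below as assumptions):

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# moondream2 shown for illustration; swap in the 0.5B checkpoint once published.
model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("photo.jpg")
encoded = model.encode_image(image)  # encode once, then ask multiple questions
print(model.answer_question(encoded, "What is in this image?", tokenizer))
```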
Amazon announced ‘Amazon Nova’, a new series of AI foundation models available exclusively in Amazon Bedrock. There are two major categories (a minimal API sketch follows the list) [Details]:
Amazon Nova understanding models accept text, image, or video inputs and generate text output. They excel in Retrieval-Augmented Generation (RAG), function calling, and agentic applications.
Amazon Nova creative content generation models accept text and image inputs and generate image or video output. Amazon Nova Canvas is an image generation model producing studio-quality images with precise control over style and content, including rich editing features such as inpainting, outpainting, and background removal. Amazon Nova Reel is a video generation model that can produce short videos from text prompts and images, with control over visual style and pacing.
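Since the models live in Bedrock, the understanding models should be reachable through Bedrock's standard Converse API. A minimal sketch with boto3 (the model ID below is an assumption; check the Bedrock console for the exact Nova identifiers in your region):

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# "amazon.nova-lite-v1:0" is an assumed model ID; verify in the Bedrock console.
response = client.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize RAG in one sentence."}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```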
Hailuo AI (MiniMax) introduced I2V-01-Live, a new AI video model that brings 2D illustrations to life with enhanced smoothness and vivid motion [Link].
Luma AI revealed a new generation of text-to-image models, Luma Photon and Photon Flash, that outperform leading models across several benchmarks for image quality, creativity, and prompt adherence. The Photon models are faster and cheaper at generating ultra-high-quality images than comparable models and services: 1.5 cents for a 1080p image with Photon, and 0.4 cents (in under 2 seconds) with Photon Flash [Details].
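There is no code in the announcement; the sketch below is a guess at how Photon might be called through Luma's Python SDK, and every identifier in it (client class, endpoint, model names, asset fields) is an assumption to check against the official docs:

```python
import time
from lumaai import LumaAI  # pip install lumaai

client = LumaAI()  # assumes LUMAAI_API_KEY is set in the environment

# "photon-flash-1" is an assumed model name; "photon-1" may be the full model.
generation = client.generations.image.create(
    prompt="A lighthouse on a basalt cliff at golden hour",
    model="photon-flash-1",
)
while generation.state not in ("completed", "failed"):
    time.sleep(2)
    generation = client.generations.get(id=generation.id)
print(generation.assets.image)  # URL of the rendered image
```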
Lightricks released LTX Video (LTXV), an open-source AI model capable of generating video - 121 frames at 768×512 resolution - in four seconds on Nvidia’s H100 GPUs [Details].
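LTXV is open source; assuming the diffusers LTXPipeline integration and the Lightricks/LTX-Video checkpoint (consult the repo if the API has moved), a minimal text-to-video sketch looks like this:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

# Assumes the diffusers LTXPipeline integration; see the LTXV repo if it differs.
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="A sailboat gliding across a calm alpine lake at sunset",
    width=768,
    height=512,
    num_frames=121,  # the frame count quoted above
).frames[0]
export_to_video(video, "sailboat.mp4", fps=24)
```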
OpenAI announced:
‘12 Days of OpenAI’: daily livestreams featuring launches and demos, ranging from major announcements to smaller treats [Link].
OpenAI o1 is now out of preview in ChatGPT and fully rolled out to ChatGPT Plus, Team, and Pro users. o1 now also supports image uploads and, compared to o1-preview, is a faster and more powerful reasoning model that's better at coding, math, and writing [Link].
A new $200 monthly plan - ChatGPT Pro - includes unlimited access to the OpenAI o1 model as well as o1 pro mode, a version of o1 that uses more compute to think harder and provide even better answers to the hardest problems [Details].
OpenAI o1 System Card [Link].
Google released PaliGemma 2, a new iteration of the PaliGemma vision-language model it released in May. Like its predecessor, PaliGemma 2 uses the same powerful SigLIP for vision, but upgrades the text decoder to the latest Gemma 2, and is available in 3B, 10B, and 28B sizes [Details].
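Since PaliGemma 2 reuses its predecessor's architecture, it presumably loads through the same transformers classes; a minimal captioning sketch under that assumption (the checkpoint name is illustrative):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Assumes PaliGemma 2 loads via the existing PaliGemma classes in transformers.
model_id = "google/paligemma2-3b-pt-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
# Strip the prompt tokens before decoding the generated caption.
caption = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(caption)
```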
Fish Audio released Fish Speech 1.5, an open-source Text-to-Speech model ranked #2 on TTS-Arena. It’s trained on more than 1 million hours of audio data in multiple languages [Details].
ElevenLabs launched Conversational AI, a platform for building customizable, interactive voice agents including native Twilio integration for handling calls [Details].
Google’s latest video and image-generation models, Veo and Imagen 3, are now available on Vertex AI [Details].
Hume AI launched Voice Control, a new tool that allows developers to customize AI-generated voices across ten dimensions, such as relaxedness, assertiveness, and enthusiasm, using intuitive sliders for real-time adjustments [Details].
Cohere introduced Rerank 3.5, its AI search foundation model. It delivers state-of-the-art performance with improved reasoning and multilingual capabilities to precisely search complex enterprise data like long documents, emails, tables, and code [Details].
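A minimal usage sketch with Cohere's Python SDK, assuming Rerank 3.5 is addressed as "rerank-v3.5" (verify the exact identifier in Cohere's model listing):

```python
import cohere

co = cohere.ClientV2()  # reads the API key from the CO_API_KEY environment variable

documents = [
    "Q3 revenue grew 12% year over year.",
    "The office dress code was updated in March.",
    "Gross margin declined due to higher shipping costs.",
]

# "rerank-v3.5" is an assumed model ID; check Cohere's docs for the exact name.
results = co.rerank(
    model="rerank-v3.5",
    query="How did the company perform financially?",
    documents=documents,
    top_n=2,
)
for r in results.results:
    print(r.index, round(r.relevance_score, 3), documents[r.index])
```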
Google DeepMind presented GenCast, a new high-resolution AI ensemble model that provides better forecasts of both day-to-day weather and extreme events than the top operational system, the European Centre for Medium-Range Weather Forecasts’ (ECMWF) ENS, up to 15 days in advance. The model's code, weights, and forecasts will be released [Details].
Microsoft began rolling out a limited, U.S.-only preview of Copilot Vision, a tool that can understand and respond to questions about sites you’re visiting using Microsoft Edge [Details].
Anthropic launched Anthropic Fellows Program for AI Safety Research. The program will provide funding and mentorship for a small cohort of 10-15 Fellows to work full-time on AI safety research [Details].
Nous Research is training an AI model using machines distributed across the internet and is live-streaming the pre-training process on a dedicated website, distro.nousresearch.com [Details].
Supabase launched Supabase Assistant v2 in the Dashboard - a global assistant with several new abilities [Details].
🔦 Weekly Spotlight
Certain names make ChatGPT grind to a halt, and we know why [Link].
Frontier Models are Capable of In-context Scheming [Link].
Moving generative AI into production - MIT Technology Review Insights [Link].
A smol course: a practical course by Hugging Face on aligning language models for a specific use case. The course is based on the SmolLM2 series of models [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
Exa Websets (Waitlist): Use Websets to find a comprehensive set of anything you want from the web.
Pollo.ai: an AI video generator for creating videos from text prompts, images, or video.
Lovable: an AI software engineer that enables anyone to build websites and web apps.
Tables by Playmaker: Transform unstructured data into actionable data, instantly
Last week’s issue
Thanks for reading and have a nice weekend! 🎉 Mariam.