FLUX1.1 [pro], Canvas, Realtime API from OpenAI, open-sourcing of Reverb, Digital Twin Catalog, Copilot Vision, Depth Pro, new Whisper model, Pikaffects and more
Hi. Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #79):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Black Forest Labs released FLUX1.1 [pro], an improved text-to-image model, alongside the general availability of the beta BFL API. FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity. It was tested under the codename “blueberry” in the Artificial Analysis image arena, a popular benchmark for text-to-image models, where it surpassed all other models on the leaderboard, achieving the highest overall Elo score [Details].
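For readers who want to try it, here is a minimal, unofficial sketch of generating an image through the BFL API from Python. The endpoint path, the x-key header, and the response fields are assumptions based on BFL's launch documentation, so check the current API reference before relying on them.

```python
# Minimal sketch (not official sample code): submit a FLUX1.1 [pro] generation
# request to the BFL API and poll for the result. Endpoint paths, the "x-key"
# header, and field names are assumptions based on the launch docs.
import os
import time
import requests

API_BASE = "https://api.bfl.ml/v1"  # assumed base URL
headers = {"x-key": os.environ["BFL_API_KEY"]}

# Submit the generation job.
job = requests.post(
    f"{API_BASE}/flux-pro-1.1",
    headers=headers,
    json={"prompt": "a blueberry resting on a chrome pedestal",
          "width": 1024, "height": 768},
).json()

# Poll until the image is ready, then print the signed URL of the result.
while True:
    result = requests.get(
        f"{API_BASE}/get_result", headers=headers, params={"id": job["id"]}
    ).json()
    if result["status"] == "Ready":
        print(result["result"]["sample"])
        break
    time.sleep(1)
```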
OpenAI is rolling out an early version of Canvas—a new way to work with ChatGPT on writing and coding projects that go beyond simple chat. In Canvas, ChatGPT can suggest edits, adjust length, change reading levels, and offer inline feedback. When writing code, Canvas makes it easier to track and understand ChatGPT’s changes. It can also review code, add logs and comments, fix bugs, and port code to other languages like JavaScript and Python [Details].
Rev announced the open-sourcing of its Reverb Automatic Speech Recognition (ASR) and diarization models. Trained on an unprecedented 200,000 hours of high-quality, human-transcribed English speech, Reverb outperforms all existing open-source speech recognition models across various long-form speech recognition domains [Details].
OpenAI’s DevDay announcements:
Public beta of the Realtime API for developers to build low-latency speech-to-speech experiences into their apps. Audio capabilities in the Realtime API are powered by the new GPT-4o model gpt-4o-realtime-preview. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context (a minimal connection sketch follows this list) [Details].
API Prompt Caching, which reduces costs and latency by reusing recently seen input tokens during API calls, with up to a 50% discount on cached tokens [Details].
Model Distillation suite, which includes Evals and Stored Completions—a workflow to fine-tune smaller, cost-efficient models using outputs from large models, allowing them to match the performance of advanced models on specific tasks at a much lower cost [Details].
Vision fine-tuning: developers can now fine-tune GPT-4o with images, in addition to text. Training is free until October 31, with up to 1M training tokens per day [Details].
A new Playground feature: describe what you’re using a model for, and the Playground will automatically generate prompts and valid schemas for functions and structured outputs.
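To make the Realtime API announcement above more concrete, here is a minimal, unofficial sketch of opening a WebSocket session and requesting a spoken response with a text transcript. The event names and model string follow OpenAI's beta documentation at launch and may change.

```python
# Sketch only: connect to the Realtime API over WebSocket and request a
# short response, printing the streamed text transcript.
import asyncio
import json
import os
import websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older releases of the websockets package accept `extra_headers`;
    # newer releases renamed it to `additional_headers`.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Ask the model for a short spoken greeting plus a text transcript.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the user in one short sentence.",
            },
        }))
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event.get("type") == "response.done":
                break

asyncio.run(main())
```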
Meta released the SAM 2.1 Developer Suite (new checkpoints, training code, web demo). Segment Anything Model 2 (SAM 2), introduced in July 2024, is a foundation model for promptable visual segmentation in images and videos [Details].
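For a rough sense of the promptable-segmentation workflow, here is a small, unofficial sketch using the sam2 package; the checkpoint id is a guess at the SAM 2.1 naming, and the exact call signatures should be checked against Meta's release.

```python
# Sketch: segment an object in an image from a single foreground click.
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Hypothetical checkpoint id; check the SAM 2.1 release for the exact names.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")
image = np.array(Image.open("photo.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with one foreground point at pixel (x=500, y=375).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
print(masks.shape, scores)
```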
Nvidia released NVLM-1.0-D-72B, a frontier-class multimodal LLM with a decoder-only architecture [Details].
Microsoft announced an updated version of Copilot, their AI companion, with new features like voice interaction, personalized daily briefings, and visual understanding capabilities. Copilot Vision understands the web page you’re viewing, both text and images, and can answer questions about its content, suggest next steps and help you without disrupting your workflow [Details].
Apple released Depth Pro, an open-source foundation model that synthesizes high-resolution monocular depth maps. It can generate a 2.25-megapixel depth map in 0.3 seconds on a standard GPU [Details].
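Usage is straightforward; the snippet below is a sketch modeled on the ml-depth-pro README and assumes the depth_pro package has been installed from Apple's repository.

```python
# Sketch: predict a metric depth map for a single image with Depth Pro.
import depth_pro

# Load the model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load and preprocess an image; f_px is the focal length if EXIF provides it.
image, _, f_px = depth_pro.load_rgb("example.jpg")
image = transform(image)

# Run inference: depth is in meters, focal length is in pixels.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]
focallength_px = prediction["focallength_px"]
```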
Meta released Digital Twin Catalog (DTC), the largest and highest quality 3D object model dataset for 3D reconstruction research. DTC is a highly detailed set of over 2,400 3D object models that are sub-millimeter-level accurate with respect to their physical counterparts and highly photorealistic [Details].
Pika Labs launched Pika 1.5, an updated version of its model with more realistic movement, big-screen shots, and animated special effects called ‘Pikaffects’. Some of the effects — namely crush it, squish it, and cake-ify — actually insert new props, such as a hydraulic press, human hands, and a knife, into the frame, which then interact with the objects in the image [Details].
OpenAI released a new Whisper model named large-v3-turbo, or turbo for short. It performs similarly to large-v2 while running roughly as fast as the much smaller tiny model [Details].
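With the openai-whisper package, switching to the new checkpoint should be a one-line change; the sketch below assumes the package has been updated to a release that ships the turbo alias.

```python
# Sketch: transcribe an audio file with the new large-v3-turbo checkpoint.
import whisper

# "turbo" is the short alias for large-v3-turbo in openai-whisper.
model = whisper.load_model("turbo")
result = model.transcribe("meeting.mp3")
print(result["text"])
```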
Apple presented MM1.5, a new family of multimodal large language models (MLLMs) that enhance text-rich image understanding, visual referring and grounding, and multi-image reasoning. The models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, as well as two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding [Paper].
Liquid AI introduced the first generation of Liquid Foundation Models (LFMs), which have a reduced memory footprint compared to transformer architectures. The lineup includes a dense 1.3B model ideal for highly resource-constrained environments, a dense 3.1B model optimized for edge deployment, and a 40.3B Mixture of Experts (MoE) model designed for tackling more complex tasks [Details].
Google Lens received a major upgrade, enabling users to search using video as well as voice [Details].
Researchers from MIT have created Future You - a web-based platform that enables users to have an online, text-based conversation with an AI-generated simulation of their potential future self [Details].
Start-up BrainChip announced a new chip design for milliwatt-level AI inference [Details].
Raspberry Pi launched a camera module for vision-based AI applications [Details].
🔦 Weekly Spotlight
185 real-world gen AI use cases from the world's leading organizations - by Google Cloud [Link].
Meta Llama 3.2: A Deep Dive into Vision Capabilities [Link].
Crawl4AI: an open-source, LLM-friendly web crawler and scraper [Link].
Histories of Mysteries - a new 10-episode podcast by Andrej Karpathy, made in 2 hours using NotebookLM and other AI tools [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
CharacterSDK by VideoSDK: For developers to create multimodal AI characters, capable of real-time interactions and contextual understanding
Buzzabout: AI-driven tool that extracts real-time insights from billions of online conversations.
Jars AI: Interactive AI generated shows. Prompt your show idea and it automatically generates characters, casts them into the show, and sets up their world
Townie: AI assistant that helps you build full-stack apps in seconds
Wispr Flow: a Mac dictation app that lets you speak naturally, and writes in your style, in every application — with auto-edits, command mode, and over 100 languages
Last week’s issue
Thanks for reading and have a nice weekend! 🎉 Mariam.