SceneScript, Automating the generation of foundation models, 01 Light, Stable Video 3D, AnimateDiff-Lightning, foundation models for self-driving and humanoid robots, NVIDIA NIM and more
Hi! Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #56):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Meta AI introduced SceneScript, a novel method for generating scene layouts and representing scenes using language. SceneScript allows AR and AI devices to understand the geometry of physical spaces. It uses next-token prediction like an LLM, but instead of natural language, the SceneScript model predicts the next architectural token, such as ‘wall’ or ‘door’ [Details].
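As a rough illustration of that decoding idea (a toy sketch, not Meta's released code; the stub model, token names and command syntax are hypothetical placeholders), an autoregressive loop over structured architectural tokens might look like:

```python
# Toy sketch of SceneScript-style decoding: instead of natural-language
# tokens, the model emits structured architectural commands. The model here
# is a stand-in stub; the real model conditions on visual/point-cloud
# encodings and predicts parameterized commands.
from typing import List

EOS = "<eos>"

class StubSceneModel:
    """Placeholder for the learned next-token predictor (illustrative only)."""
    def __init__(self):
        self._script = ["make_wall", "make_wall", "make_door", "make_window", EOS]

    def predict_next(self, scene_encoding, tokens: List[str]) -> str:
        return self._script[len(tokens)]

def decode_scene(model, scene_encoding, max_tokens: int = 256) -> List[str]:
    """Autoregressively predict architectural tokens until end-of-scene."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        nxt = model.predict_next(scene_encoding, tokens)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens

print(decode_scene(StubSceneModel(), scene_encoding=None))
# ['make_wall', 'make_wall', 'make_door', 'make_window']
```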
Sakana AI presented Evolutionary Model Merge, a general method that uses evolutionary techniques to automate the creation of new foundation models without extensive additional training data or compute. Sakana AI applied this method to evolve three foundation models for Japan: a Large Language Model (EvoLLM-JP), a Vision-Language Model (EvoVLM-JP) and an Image Generation Model (EvoSDXL-JP) [Details | Hugging Face].
Elon Musk's brain-chip startup Neuralink livestreamed its first patient, who used the implanted chip to play online chess with his mind [Details | video].
Stability AI released Stable Video 3D (SV3D), a generative model based on Stable Video Diffusion that takes in a still image of an object as a conditioning frame and generates an orbital video of that object. It delivers improved quality and multi-view consistency compared to the previously released Stable Zero123, and outperforms other open-source alternatives such as Zero123-XL. Stable Video 3D can be used now for commercial purposes with a Stability AI Membership [Details | Hugging Face].
Waabi introduced Copilot4D, a foundation model for self-driving. It is the first foundation model purpose-built for the physical world that can reason in 3D space and the fourth dimension, time. Copilot4D can understand the impact the self-driving vehicle's future actions have on the behavior of surrounding traffic participants [Details].
Open Interpreter launched 01 Light, a portable voice interface that controls your home computer. It can see your screen, use your apps, and learn new skills. Batch 1 sold out in 2.5 hours; profits will be redistributed to open-source contributors [Details].
NVIDIA introduced:
NVIDIA NIM, a containerized inference microservice that simplifies deployment of generative AI models across various infrastructures. Developers can test a wide range of models using cloud APIs from the NVIDIA API catalog, or self-host the models by downloading NIM and deploying with Kubernetes (see the sketch after this list) [Details].
Project GR00T, a general-purpose foundation model for humanoid robots, along with significant upgrades to the NVIDIA Isaac robotics platform. The GR00T model will enable a robot to understand multimodal instructions, such as language, video, and demonstration, and perform a variety of useful tasks. NVIDIA is building a comprehensive AI platform for several humanoid robot companies, including 1X Technologies, Agility Robotics, Boston Dynamics and Figure AI [Details].
Earth-2, a climate digital twin cloud platform for simulating and visualizing weather and climate at unprecedented scale. Earth-2’s APIs offer AI models and employ a new NVIDIA generative AI model called CorrDiff that generates images with 12.5x higher resolution than current numerical models, 1,000x faster and with 3,000x greater energy efficiency [Details].
The next-generation NVIDIA DGX SuperPOD AI supercomputer, powered by the new NVIDIA GB200 Grace Blackwell Superchip and designed to meet the demanding requirements of generative AI training and inference workloads involving trillion-parameter models [Details].
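To make the NIM workflow above concrete: models in the NVIDIA API catalog are served behind OpenAI-compatible endpoints, so a hosted model can be queried with the standard openai client. The base URL and model id below are assumptions based on the catalog's conventions; a self-hosted NIM container would expose a local endpoint instead.

```python
# Minimal sketch: querying a model from the NVIDIA API catalog through its
# OpenAI-compatible API. Base URL and model id are assumptions drawn from
# the catalog's conventions, not verbatim NIM documentation.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed catalog endpoint
    api_key="NVIDIA_API_KEY",  # obtained from the NVIDIA API catalog
)

completion = client.chat.completions.create(
    model="meta/llama2-70b",  # hypothetical example model id
    messages=[{"role": "user", "content": "Summarize what a NIM is."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```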
Google’s Gemini 1.5 Pro multimodal model with a 1M-token context window is now available to everyone in Google AI Studio, with the API being gradually rolled out [Link].
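A minimal sketch of calling the model through the google-generativeai Python SDK; the model identifier below is an assumption and may change as the API rollout completes:

```python
# Hedged sketch using the google-generativeai SDK; the model id is an
# assumption pending the gradual API rollout mentioned above.
import google.generativeai as genai

genai.configure(api_key="GOOGLE_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model id

# The 1M-token window allows very long documents in a single prompt.
response = model.generate_content("Summarize the following report: ...")
print(response.text)
```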
ByteDance released AnimateDiff-Lightning, a lightning-fast text-to-video generation model. It can generate videos more than ten times faster than the original AnimateDiff [Hugging Face | Demo].
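The sketch below follows the diffusers recipe from the AnimateDiff-Lightning model card on Hugging Face; the base model choice and scheduler settings come from that example and may need adjusting for other checkpoints:

```python
# Sketch following the AnimateDiff-Lightning model card's diffusers recipe.
# Base model and scheduler settings are taken from that example.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerDiscreteScheduler
from diffusers.utils import export_to_gif
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

device, dtype = "cuda", torch.float16
step = 4  # distilled 1-, 2-, 4- and 8-step checkpoints are available
ckpt = f"animatediff_lightning_{step}step_diffusers.safetensors"

# Load the distilled motion adapter weights into the pipeline.
adapter = MotionAdapter().to(device, dtype)
adapter.load_state_dict(
    load_file(hf_hub_download("ByteDance/AnimateDiff-Lightning", ckpt), device=device)
)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=dtype
).to(device)
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing", beta_schedule="linear"
)

# Few-step generation is what makes the model "lightning-fast".
frames = pipe("a girl smiling", guidance_scale=1.0, num_inference_steps=step).frames[0]
export_to_gif(frames, "animation.gif")
```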
Pleias, a French start-up, released Common Corpus, the largest public-domain dataset for training LLMs. It is multilingual and includes 500 billion words from a wide diversity of cultural heritage initiatives [Details].
Aether Research released Cerebrum 8x7b, a large language model (LLM) created specifically for reasoning tasks. It is based on the Mixtral 8x7b model and offers performance competitive with Gemini 1.0 Pro and GPT-3.5 Turbo on a range of tasks that require reasoning [Hugging Face].
Stability AI, Medical AI Research Center (MedARC) and others presented MindEye2, a model that can reconstruct seen images from fMRI brain activity using only 1 hour of training data. Given a sample of fMRI activity from a participant viewing an image, MindEye2 can either identify which image out of a pool of candidates was the original seen image (retrieval), or recreate the image that was seen (reconstruction) along with its text caption [Details].
Nous Research released Hermes 2 Pro 7B, an upgraded, retrained version of Nous Hermes 2. It improves several capabilities, using an updated and cleaned version of the Hermes 2 dataset, and excels at function calling and JSON structured output [Hugging Face].
Google AI introduced a generalizable user-centric interface to help radiologists leverage ML models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating along with the corresponding regions of interest [Details | GitHub].
xAI released the base model weights and network architecture of Grok-1 under the Apache 2.0 license. Grok-1 is a 314 billion parameter Mixture-of-Experts model trained from scratch [GitHub | Hugging Face].
Lighthouz AI launched the Chatbot Guardrails Arena in collaboration with Hugging Face to stress-test LLMs and privacy guardrails against leaking sensitive data. Chat with two anonymous guardrailed LLMs, try to trick them into revealing sensitive financial information, and vote for the model that shows greater privacy [Details].
Apple introduced MM1, a family of multimodal models of up to 30B parameters, consisting of both dense and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance across 12 established multimodal benchmarks [Paper].
Stability AI introduced a suite of image services on the Stability AI Developer Platform API for image generation, upscaling, outpainting and editing [Details].
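A hedged sketch of calling one of these services over REST; the endpoint path and form fields below are assumptions modeled on the platform's documented style, so check the API reference for the exact parameters:

```python
# Hedged sketch of calling a Stability AI Developer Platform image service.
# The endpoint path and form fields are assumptions, not verbatim docs.
import requests

resp = requests.post(
    "https://api.stability.ai/v2beta/stable-image/edit/outpaint",  # assumed path
    headers={"authorization": "Bearer STABILITY_API_KEY", "accept": "image/*"},
    files={"image": open("input.png", "rb")},
    data={"left": 256, "right": 256, "output_format": "png"},  # assumed fields
)
resp.raise_for_status()
with open("outpainted.png", "wb") as f:
    f.write(resp.content)  # the API returns the edited image bytes
```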
Google Research presented VLOGGER, a novel framework to synthesize humans from audio. Given a single input image and a sample audio input, it generates photorealistic and temporally coherent videos of the person talking and vividly moving [Details].
Stability AI presented SD3-Turbo, a fast text-to-image foundation model that achieves the sample quality of SD3, Midjourney and DALL-E 3 in only four steps. Code and model weights will be publicly available [Paper].
GitHub introduced Code Scanning Autofix for GitHub Advanced Security customers, powered by GitHub Copilot and CodeQL. Code Scanning Autofix covers more than 90% of alert types in JavaScript, TypeScript, Java, and Python, and delivers code suggestions shown to remediate more than two-thirds of found vulnerabilities with little or no editing [Details].
Google Research released the Skin Condition Image Network (SCIN) dataset in collaboration with physicians at Stanford Medicine. It is freely available as an open-access resource for researchers, educators, and developers [Details].
Roblox added AI-powered avatar creation (converting a 3D body mesh into a live, animated avatar) and texture generation (using text prompts to quickly change the look of 3D objects) [Details].
Buildbox announced Buildbox 4 Alpha Preview, the AI-first game engine where you simply type to create [Details].
Google Research and Fitbit are working together to build a Personal Health Large Language Model (LLM) that gives users more insights and recommendations based on their data in the Fitbit mobile app [Details].
Two of Inflection’s three co-founders, Mustafa Suleyman and Karén Simonyan, will be leaving Inflection to start Microsoft AI, a new division at Microsoft [Details].
Google DeepMind announced TacticAI, an AI assistant capable of offering insights to football experts on corner kicks [Details].
🔦 Weekly Spotlight
Open-source projects by Anthena Matrix to fortify AI systems against emerging threats: Website Prompt Injection Testing Tool, ASCII Art Prompt Injection, AI Vulnerability Assessment Framework and more [Link]
How People Are Really Using GenAI - Harvard Business Review [Link]
Under The Hood: How OpenAI's Sora Model Works [Link]
Nvidia 2024 AI Event: Everything Revealed in 16 Minutes [Video]
LLaMA Factory: a framework for efficient fine-tuning of 100+ language models, with no coding required thanks to the built-in web UI, LlamaBoard [Link].
Tutorial: Create synthetic web screenshots and their associated HTML code with Mistral and Deepseek Code Instruct [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
Eggnog: Make AI videos with consistent characters. Creators will soon be able to use and remix characters made by other Eggnog users.
Dora AI: prompt to website. Dora AI's model designs sites from beginning to end, handling everything from layout to content to visual identity.
You can support my work via BuyMeaCoffee.
Thanks for reading and have a nice weekend! 🎉 Mariam.