Open-source AI voice cloning, Meta's full-bodied photorealistic avatars from audio, Mobile-ALOHA, and more
Greetings and welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #46):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
AI Skillset: Learn & Build
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Meta and UC Berkeley introduced Audio2Photoreal, a framework for generating full-bodied photorealistic avatars whose gestures are driven by the audio of a dyadic conversation [Details | GitHub].
MyShell, along with researchers from MIT and Tsinghua University, introduced OpenVoice, an open-source voice cloning approach that works nearly instantaneously and provides granular control over tone, emotion, accent, rhythm, pauses, and intonation, using just a small audio clip [Details | Hugging Face].
Suno and Nvidia present Parakeet, a family of open-source speech recognition models that top the Open ASR Leaderboard. Parakeet models effectively prevent hallucinated transcripts and are robust to noisy audio. Available for commercial use under CC BY 4.0; a minimal usage sketch appears at the end of this news section [Details | Hugging Face].
Researchers from Stanford University introduced Mobile-ALOHA, an open-source robot hardware system that can autonomously complete complex mobile manipulation tasks requiring whole-body control, such as cooking and serving shrimp, calling and taking an elevator, and storing a 3 lb pot in a two-door cabinet, with just 50 demonstrations [Details].
Allen Institute for AI released Unified-IO 2 (open-source), the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. The model is pre-trained from scratch on an extensive variety of multimodal data -- 1 billion image-text pairs, 1 trillion text tokens, 180 million video clips, 130 million interleaved image & text, 3 million 3D assets, and 1 million agent trajectories [Details].
Alibaba Research introduced DreamTalk, a diffusion-based audio-driven expressive talking head generation framework that can produce high-quality talking head videos across diverse speaking styles [Details | GitHub].
OpenAI’s app store for GPTs will launch next week [Details].
GitHub Copilot Chat, powered by GPT-4, is now generally available for both Visual Studio Code and Visual Studio, and is included in all GitHub Copilot plans alongside the original GitHub Copilot [Details].
Microsoft Research presented a new and simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps [Paper | Hugging Face].
Google DeepMind introduced AutoRT, SARA-RT and RT-Trajectory to improve real-world robot data collection, speed, and generalization [Details].
Salesforce Research presented MoonShot, a new video generation model that conditions simultaneously on multimodal inputs of image and text, demonstrating significant improvement on visual quality and temporal consistency compared to existing models. The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing. Models will be made public here [Details].
Leonardo AI released Leonardo Motion for generating videos from images. Available to all users, paid and free [Link].
JPMorgan AI Research presented DocLLM, a layout-aware generative language model for multimodal document understanding. Spatial layout information is incorporated through the bounding box coordinates of text tokens, typically obtained via optical character recognition (OCR), without relying on any vision encoder component [Details].
Alibaba Research introduced Make-A-Character (Mach), a framework to create lifelike 3D avatars from text descriptions. Make-A-Character supports both English and Chinese prompts. [Details | Hugging Face].
Sony, Canon, and Nikon are set to combat deepfakes with digital-signature technology in future cameras [Details].
Meta AI introduced Fairy, a versatile and efficient video-to-video synthesis framework that generates high-quality videos with remarkable speed. Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44× [Details].
Apple quietly released an open source multimodal LLM, called Ferret, in October 2023 [Details].
Australian researchers introduced a non-invasive AI system, called DeWave, that can turn silent thoughts into text while only requiring users to wear a snug-fitting cap [Details].
Pika Labs' text-to-video AI platform, Pika 1.0, is now available to everyone via the web [Link].
The New York Times sued OpenAI and Microsoft for copyright infringement [Details].
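As noted in the Parakeet item above, here is a minimal sketch of transcribing a local audio file with a Parakeet checkpoint via the NVIDIA NeMo toolkit. The exact checkpoint name and the audio file are assumptions for illustration, not part of the announcement.

```python
# Minimal sketch: run a Parakeet ASR checkpoint locally with NVIDIA NeMo.
# Assumptions: the "nvidia/parakeet-ctc-1.1b" checkpoint name and a 16 kHz mono
# WAV file called sample.wav are placeholders you would swap for your own.
# Install with: pip install -U nemo_toolkit['asr']
import nemo.collections.asr as nemo_asr

# Download the pretrained model from Hugging Face / NGC.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-1.1b")

# Transcribe a list of audio files; a transcript is returned per file.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```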
🔦 Weekly Spotlight
AITube: Discover videos generated by AI [Link].
AI Industry Analysis: 50 Most Visited AI Tools and Their 24B+ Traffic Behavior [Link].
Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 [Link].
AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps [Link].
CrewAI: an open-source framework for orchestrating role-playing, autonomous AI agents (a minimal usage sketch follows this list) [Link].
LangChain State of AI 2023 - stats on what and how people are building using LLMs [Link].
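For the CrewAI item above, here is a minimal sketch of two agents collaborating on a small task. The roles, task descriptions, and constructor arguments follow CrewAI's documented pattern but are illustrative assumptions; exact parameters can differ between versions, and a configured LLM (e.g. an OpenAI API key) is assumed.

```python
# Minimal sketch of a two-agent CrewAI workflow (illustrative values, not from the newsletter).
# Install with: pip install crewai
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Summarize this week's open-source AI model releases",
    backstory="You track open-source AI announcements.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short newsletter blurb",
    backstory="You write concise, factual summaries.",
)

research = Task(
    description="List three notable open-source model releases from the past week.",
    expected_output="A bullet list of three releases with one-line descriptions.",
    agent=researcher,
)
draft = Task(
    description="Write a three-sentence blurb based on the research notes.",
    expected_output="A three-sentence paragraph.",
    agent=writer,
)

# Tasks run in order; the writer receives the researcher's output as context.
crew = Crew(agents=[researcher, writer], tasks=[research, draft])
print(crew.kickoff())
```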
🔍 🛠️ AI Toolbox: Product Picks of the Week
Assistive Video: a new generative video platform for creating videos from text and images. Available on the web and via API.
Creatify: an AI-powered app that generates marketing videos from a simple product link or text description.
LM Studio: Discover, download, and run local LLMs.
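LM Studio can also expose the model you load as a local, OpenAI-compatible server (by default at http://localhost:1234/v1), so existing OpenAI-client code can point at it with almost no changes. A minimal sketch, where the model name is a placeholder for whatever you have loaded locally:

```python
# Minimal sketch: query a model served locally by LM Studio through its
# OpenAI-compatible endpoint. The model name is a placeholder; LM Studio
# serves whichever model is currently loaded. Install with: pip install openai
from openai import OpenAI

# The API key is not checked by the local server, but the client requires a value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder identifier for the locally loaded model
    messages=[{"role": "user", "content": "Summarize this week's AI news in one sentence."}],
)
print(response.choices[0].message.content)
```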
📕 📚 AI Skillset: Learn & Build
Retrieval Augmented Generation for Production with LangChain & LlamaIndex - free course on Activeloop [Link].
How to add Llama Guard to your RAG pipelines to moderate LLM inputs and outputs and combat prompt injection [Link].
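To make the Llama Guard idea above concrete, here is a minimal sketch of an input-moderation gate placed in front of a RAG pipeline: the user query is classified by Llama Guard first, and only "safe" queries proceed to retrieval and generation. It assumes access to the gated meta-llama/LlamaGuard-7b checkpoint on Hugging Face and a GPU; the helper name and example query are illustrative.

```python
# Minimal sketch (assumptions noted above): moderate user input with Llama Guard
# before it reaches a RAG pipeline. Install with: pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # gated model; requires accepted license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def moderate(messages):
    """Return Llama Guard's verdict: 'safe', or 'unsafe' plus the violated category."""
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens (everything after the prompt).
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

user_query = "How do I reset my account password?"  # illustrative query
verdict = moderate([{"role": "user", "content": user_query}])
if verdict.startswith("safe"):
    pass  # hand the query to the RAG pipeline (retrieval + generation) here
else:
    print(f"Blocked by Llama Guard: {verdict}")
```

The same check can be applied a second time to the generated answer before it is returned to the user, which covers the output-moderation half described in the course.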
Thanks for reading and have a nice weekend! 🎉 Mariam.