Qwen 2.5, Seed-Music, StoryMaker, Jina Embeddings V3, Multimodal RAG, Luma Labs and Runway APIs, CogVideoX image-to-video generation model and More
Hi. Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #77):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Alibaba released Qwen 2.5, along with specialized models for coding (Qwen2.5-Coder) and mathematics (Qwen2.5-Math), available in sizes ranging from 0.5 to 72 billion parameters. The Qwen2.5-72B base model achieves results comparable to Llama-3-405B while using only about one-fifth of the parameters, and it significantly outperforms its peers of the same size across a wide range of tasks. Qwen2-VL-72B, which features performance enhancements over last month’s release, is also now open-source [Details].
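If you want to try the new models locally, here is a minimal sketch of running the 7B instruct checkpoint with Hugging Face transformers; any of the released sizes can be swapped in for the model name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # one of several released sizes (0.5B to 72B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Build a chat prompt and generate a short completion
messages = [{"role": "user", "content": "Write a one-line Python function that reverses a string."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```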
Nvidia introduced NVLM 1.0, a family of frontier multimodal large language models that achieve state-of-the-art results on vision-language tasks, rivaling leading multimodal LLMs, without compromising text-only performance during multimodal training. The weights and code will be released soon [Details].
Jina AI released jina-embeddings-v3, a frontier multilingual text embedding model with 570M parameters and an 8192-token context length, outperforming the latest proprietary embeddings from OpenAI and Cohere on MTEB [Details].
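A minimal sketch of computing embeddings with the open weights via sentence-transformers; the model card also documents task-specific adapter options, which are omitted here:

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is required because the model ships custom modeling code
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

docs = [
    "Jina AI released jina-embeddings-v3 with 570M parameters.",
    "The model supports an 8192-token context length.",
]
embeddings = model.encode(docs)  # one embedding vector per input text
print(embeddings.shape)
```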
ByteDance presented Seed-Music, a music generation framework that enables the creation of music with expressive vocals in multiple languages, supports fine-grained control for note-level editing, and allows users to incorporate their own voices into the output [Details].
Mistral’s Pixtral 12B, a vision-capable model with image understanding, is now available on le Chat and la Plateforme.
Mistral launched a free tier on la Plateforme, their serverless platform for tuning and building with Mistral models as API endpoints. Mistral also announced a significant price drop (up to 80%) across all their models and an updated Mistral Small model [Details].
Kyutai released the technical report, code, and weights for Moshi. Moshi is a real-time speech-to-speech model, unveiled in July 2024, that can listen and speak continuously, with no need to explicitly model speaker turns or interruptions [Details].
The CogVideoX image-to-video generation model, CogVideoX-5B-I2V, has been released [Details].
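A minimal sketch of running it with diffusers, assuming a recent release that includes the CogVideoX image-to-video pipeline; the starting-frame path and generation settings below are illustrative:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the image-to-video pipeline and move it to the GPU
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("first_frame.png")  # hypothetical local image used as the starting frame
frames = pipe(
    prompt="A sailboat drifting across a calm lake at sunset",
    image=image,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)
```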
Microsoft released GRIN MoE, a mixture-of-experts decoder-only Transformer model. It has 16x3.8B parameters, with 6.6B active parameters when using 2 experts. With only 6.6B active parameters, GRIN MoE outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data [Details].
LlamaIndex launched multimodal capabilities in LlamaCloud, their enterprise RAG platform. This new feature enables developers to build fully multimodal RAG pipelines over any unstructured data in minutes, whether marketing slide decks, legal/insurance contracts, or finance reports [Details].
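As a rough sketch of what querying such a pipeline can look like from llama_index, assuming an index has already been created and parsed in LlamaCloud with multimodal indexing enabled; the index and project names below are placeholders:

```python
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# Connect to an existing LlamaCloud index; names and key are placeholders
index = LlamaCloudIndex(
    name="quarterly-reports",
    project_name="Default",
    api_key="llx-...",
)

query_engine = index.as_query_engine()
response = query_engine.query("What revenue figure is shown in the Q2 results chart?")
print(response)
```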
Nvidia launched NVIDIA AI Aerial, a comprehensive suite of software and hardware designed to accelerate the development, simulation, training, and deployment of AI-driven radio access network (AI-RAN) technology for wireless networks [Details].
Luma Labs launched the Dream Machine API in beta with the latest family of Dream Machine v1.6 models. It includes high-quality text-to-video, image-to-video, video extension, loop creation, and camera control capabilities [Details].
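A rough sketch of calling the API through the lumaai Python SDK; method and field names are based on the public SDK and may differ slightly, so treat this as an outline rather than exact usage:

```python
import os
import time
from lumaai import LumaAI  # assumes the official lumaai Python SDK

client = LumaAI(auth_token=os.environ["LUMAAI_API_KEY"])

# Kick off a text-to-video generation
generation = client.generations.create(
    prompt="An origami crane unfolding into a paper airplane, soft studio lighting",
)

# Poll until the generation finishes; exact state/field names may differ from this sketch
while generation.state not in ("completed", "failed"):
    time.sleep(5)
    generation = client.generations.get(id=generation.id)

if generation.state == "completed":
    print(generation.assets.video)  # URL of the rendered clip
```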
Runway launched an API for its Gen-3 Alpha Turbo video generation model [Details].
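A similar sketch for the Runway API using the runwayml Python SDK; the image URL is a placeholder, and exact parameters should be checked against Runway's docs:

```python
from runwayml import RunwayML  # assumes the official runwayml Python SDK

client = RunwayML()  # reads the API secret from the RUNWAYML_API_SECRET environment variable

# Start an image-to-video task with the Gen-3 Alpha Turbo model
task = client.image_to_video.create(
    model="gen3a_turbo",
    prompt_image="https://example.com/first-frame.jpg",  # placeholder image URL
    prompt_text="Slow dolly shot through a neon-lit alley at night",
)
print(task.id)  # the task id can then be polled until the video is ready
```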
Runway released Gen-3 Alpha Video to Video on web for paid plans. Video to Video allows you to change the style of your videos by using a text prompt [Details].
Slack now lets users add AI agents from Asana, Cohere, Adobe, Workday and more [Details].
Fei-Fei Li, the Stanford professor many deem the “Godmother of AI,” has raised $230 million for her new startup, World Labs. It aims to build AI models, called “large world models”, that understand and interact with the 3D world [Details].
YouTube will integrate Veo, Google DeepMind's model for generating video, into YouTube Shorts later this year [Details].
Snap is introducing an AI video-generation tool for creators [Details].
Robotics startup 1X released generative world models to train robots [Details].
🔦 Weekly Spotlight
StoryMaker: a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and bodies in scenes with multiple characters, making it possible to tell a story through a series of images [Link].
Kolors-Virtual-Try-On demo by Kling AI on Hugging Face Spaces [Link].
Generate an entire app from a prompt using Together AI’s LlamaCoder [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
Hoop: Connects to tools like Google Meet, Zoom, and Slack to automatically capture tasks and put everything in one spot.
Otto: Lets you use AI agents to help enrich lists, research companies, or read hundreds of documents in minutes, all through a native table interface.
AIPhone: AI-powered phone call app with live translation.
Google Illuminate: Generates audio with two AI-generated voices in conversation, discussing the key points of select papers. Illuminate is currently optimized for published computer science academic papers.
Last week’s issue
Thanks for reading and have a nice weekend! 🎉 Mariam.