Qwen 2.5, Seed-Music, StoryMaker, Jina Embeddings V3, Multimodal RAG, Luma Labs and Runway APIs, CogVideoX image-to-video generation model and More
Hi. Welcome to this week's AI Brews for a concise roundup of the week's major developments in AI.
In today’s issue (Issue #77):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Alibaba released Qwen 2.5, along with specialized models for coding (Qwen2.5-Coder) and mathematics (Qwen2.5-Math), available in sizes ranging from 0.5 to 72 billion parameters. The Qwen2.5-72B base model achieves results comparable to Llama-3-405B while using only about one-fifth of the parameters, and it significantly outperforms peers of the same size across a wide range of tasks. Qwen2-VL-72B, which features performance enhancements over last month’s release, is also now open-source [Details].
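Below is a minimal sketch of querying one of the instruct checkpoints with Hugging Face Transformers; the model name and generation settings are illustrative assumptions, not the official quickstart.

```python
# Minimal sketch: querying a Qwen2.5 instruct checkpoint with Hugging Face Transformers.
# Assumes the model is published as "Qwen/Qwen2.5-7B-Instruct" and that a GPU is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this week's AI news in one sentence."},
]
# Build the chat-formatted prompt, then generate a short completion.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```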
Nvidia introduced NVLM 1.0, a family of frontier multimodal large language models that achieve state-of-the-art results on vision-language tasks, rivaling leading multimodal LLMs, without compromising text-only performance during multimodal training. The weights and code will be released soon [Details].
Jina AI released jina-embeddings-v3, a frontier multilingual text embedding model with 570M parameters and an 8192-token context length, outperforming the latest proprietary embeddings from OpenAI and Cohere on MTEB [Details].
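A minimal retrieval sketch, assuming the model is used through sentence-transformers with remote code enabled; the task names mirror the task-specific LoRA adapters described for the model and are assumptions here.

```python
# Minimal sketch: computing embeddings with jina-embeddings-v3 via sentence-transformers.
# Assumes the checkpoint is published as "jinaai/jina-embeddings-v3" and requires trust_remote_code;
# the task names below follow the model card's task-specific adapters and are assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

docs = [
    "Qwen2.5 ships in sizes from 0.5B to 72B parameters.",
    "Seed-Music generates songs with expressive vocals.",
]
query = "Which model family covers 0.5B to 72B parameters?"

doc_vecs = model.encode(docs, task="retrieval.passage")
query_vec = model.encode([query], task="retrieval.query")

# Rank documents by cosine similarity to the query.
scores = model.similarity(query_vec, doc_vecs)
print(scores)
```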
ByteDance presented Seed-Music, a music generation framework that enables the creation of music with expressive vocals in multiple languages, supports fine-grained control for note-level editing, and allows users to incorporate their own voices into the output [Details].
Mistral’s Pixtral 12B, a vision-capable model with image understanding, is now available on le Chat and la Plateforme.
Mistral launched a free tier on la Plateforme, their serverless platform for tuning and building with Mistral models as API endpoints. Mistral also announced a significant price drop (up to 80%) across all their models and an updated Mistral Small model [Details].
Kyutai released the technical report, code, and weights for Moshi, a real-time speech-to-speech model unveiled in July 2024 that can listen and speak continuously, with no need to explicitly model speaker turns or interruptions [Details].
The CogVideoX image-to-video generation model, CogVideoX-5B-I2V, has been released [Details].
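A minimal sketch of running the released checkpoint through the diffusers image-to-video pipeline; the checkpoint name, frame count, and file paths are illustrative assumptions.

```python
# Minimal sketch: image-to-video with CogVideoX-5B-I2V through the diffusers library.
# Assumes the weights are hosted as "THUDM/CogVideoX-5b-I2V" and that a recent diffusers
# release with the CogVideoX image-to-video pipeline is installed.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps VRAM usage manageable

image = load_image("input_frame.png")  # starting frame (hypothetical local file)
prompt = "The camera slowly pans across a misty mountain lake at sunrise."

video_frames = pipe(image=image, prompt=prompt, num_frames=49).frames[0]
export_to_video(video_frames, "output.mp4", fps=8)
```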
Microsoft released GRIN MoE, a mixture-of-experts decoder-only Transformer model. It has 16x3.8B parameters with 6.6B active parameters when using 2 experts. With only 6.6B active parameters, GRIN MoE outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data [Details].
LlamaIndex launched multimodal capabilities in LlamaCloud, their enterprise RAG platform. This new feature lets developers build fully multimodal RAG pipelines over any unstructured data in minutes, whether marketing slide decks, legal/insurance contracts, or finance reports [Details].
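A minimal sketch of querying a LlamaCloud-managed index from Python, assuming an index has already been created and parsed (including any multimodal configuration) in LlamaCloud; the index and project names are hypothetical.

```python
# Minimal sketch: querying an existing LlamaCloud index with llama_index.
# Assumes an index named "marketing-decks" was already created and parsed in LlamaCloud;
# names and the project are illustrative, and the API key is a placeholder.
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
    name="marketing-decks",   # hypothetical index name
    project_name="Default",
    api_key="llx-...",        # your LlamaCloud API key
)

# Standard RAG query; retrieved context comes from the cloud-hosted pipeline.
query_engine = index.as_query_engine()
response = query_engine.query("What pricing tiers are mentioned in the Q3 deck?")
print(response)
```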
Nvidia launched NVIDIA AI Aerial - a comprehensive suite of software and hardware designed to accelerate the development, simulation, training, and deployment of AI-driven radio access network (AI-RAN) technology for wireless networks [Details].
Luma Labs launched the Dream Machine API in beta with the latest family of Dream Machine v1.6 models. It includes high quality text-to-video, image-to-video, video extension, loop creation and camera control capabilities [Details].
Runway launched an API for its Gen-3 Alpha Turbo video generation model [Details].
Runway released Gen-3 Alpha Video to Video on web for paid plans. Video to Video allows you to change the style of your videos by using a text prompt [Details].
Slack now lets users add AI agents from Asana, Cohere, Adobe, Workday and more [Details].
Fei-Fei Li, the Stanford professor many deem the “Godmother of AI,” has raised $230 million for her new startup, World Labs. It aims to build AI models, called “large world models”, that understand and interact with the 3D world [Details].
YouTube will integrate Veo, Google DeepMind's model for generating video, into YouTube Shorts later this year [Details].
Snap is introducing an AI video-generation tool for creators [Details].
Robotics startup 1X released generative world models to train robots [Details].
🔦 Weekly Spotlight
StoryMaker: a personalization solution that preserves not only facial consistency but also clothing, hairstyle, and body consistency across multi-character scenes, making it possible to create a story told through a series of images [Link]
Kolors-Virtual-Try-On Demo on HuggingFace space by Kling AI [Link].
Generate an entire app from a prompt using Together AI’s LlamaCoder [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
Hoop: Connects to tools like Google Meet, Zoom, and Slack to automatically capture tasks and put everything in one spot.
Otto: Lets you use AI agents to help enrich lists, research companies, or read hundreds of documents in minutes, all through a native table interface.
AIPhone: AI-powered phone call app with live translation.
Google Illuminate: Generates audio with two AI-generated voices in conversation, discussing the key points of select papers. Illuminate is currently optimized for published computer science academic papers.
Last week’s issue
OpenAI's new reasoning model, Empathic Voice Interface 2, Covers by Suno, Pixtral Multimodal model, DataGemma, Notes to Podcast and more
Thanks for reading and have a nice weekend! 🎉 Mariam.