Fix to ‘lazy’ GPT-4, commercially permissive OSS LLaVA models, new multimodal model for digital agents, Google's new video model and more
Greetings and welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #49):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
AI Skillset: Learn & Build
From our sponsors:
Turn Your Users Into Power Users
Rehance lets your users simply tell your web app what to do.
When users get stuck or face repetitive tasks that would take forever to do by clicking around on your site, they can pull up the Rehance-powered interface, type in their request, and see it executed instantly.
Add features without building UI and help your users automate their work at rehance.ai.
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Amazon presents Diffuse to Choose, a diffusion-based image-conditioned inpainting model that allows users to virtually place any e-commerce item in any setting, ensuring detailed, semantically coherent blending with realistic lighting and shadows. Code and demo will be released soon [Details].
OpenAI announced two new embedding models, new GPT-4 Turbo and moderation models, new API usage management tools, and lower pricing on GPT-3.5 Turbo. The updated GPT-4 Turbo preview model reduces cases of “laziness” where the model doesn’t complete a task. The new embedding models include a smaller, highly efficient text-embedding-3-small model and a larger, more powerful text-embedding-3-large model (see the first sketch after this list) [Details].
Hugging Face and Google partner to support developers building AI applications [Details].
Adept introduced Adept Fuyu-Heavy, a new multimodal model designed specifically for digital agents. Fuyu-Heavy scores higher on the MMMU benchmark than Gemini Pro [Details].
Fireworks.ai has open-sourced FireLLaVA, a LLaVA multimodal model trained on instruction-following data generated by OSS LLMs, released under a commercially permissive license. Fireworks.ai also provides developers with both a completions API and a chat completions API (see the second sketch after this list) [Details].
01.AI released Yi Vision Language (Yi-VL) model, an open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. Yi-VL adopts the LLaVA architecture and is free for commercial use. Yi-VL-34B is the first open-source 34B vision language model worldwide [Details].
Tencent AI Lab introduced WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites [Paper].
Prophetic introduced MORPHEUS-1, a multimodal generative ultrasonic transformer model designed to induce and stabilize lucid dreams from brain states. Instead of generating words, Morpheus-1 generates ultrasonic holograms for neurostimulation to bring one to a lucid state [Details].
Google Research presented Lumiere – a space-time video diffusion model for text-to-video, image-to-video, stylized generation, inpainting and cinemagraphs [Details].
TikTok released Depth Anything, an image-based depth estimation method trained jointly on 1.5M labeled images and 62M+ unlabeled images [Details].
Nightshade, the free tool that ‘poisons’ AI models, is now available for artists to use [Details].
Stability AI released Stable LM 2 1.6B, a 1.6 billion parameter small language model trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch. Stable LM 2 1.6B can now be used both commercially and non-commercially with a Stability AI Membership [Details].
Etsy launched ‘Gift Mode,’ an AI-powered feature designed to match users with tailored gift ideas based on specific preferences [Details].
Google DeepMind presented AutoRT, a framework that uses foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. In AutoRT, a VLM describes the scene, an LLM generates robot goals and filters them for affordance and safety, and execution is then routed to policies (see the conceptual sketch after this list) [Details].
Google Chrome gains AI features, including a writing helper, theme creator, and tab organizer [Details].
Tencent AI Lab released VideoCrafter2 for high-quality text-to-video generation, featuring major improvements in visual quality, motion, and concept composition compared to VideoCrafter1 [Details | Demo].
Google opens beta access to the conversational experience, a new chat-based feature in Google Ads, for English-language advertisers in the U.S. & U.K. It will let advertisers create optimized Search campaigns from their website URL by generating relevant ad content, including creatives and keywords [Details].
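If you want to try the new embedding models right away, here is a minimal sketch using the official openai Python SDK (v1+); the model names come from OpenAI's announcement, while the input text is just a placeholder.

```python
# Minimal sketch: calling OpenAI's new embedding models (openai Python SDK v1+).
# Requires OPENAI_API_KEY to be set in the environment.
from openai import OpenAI

client = OpenAI()

# text-embedding-3-small: smaller and highly efficient
# text-embedding-3-large: larger and more powerful
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="AI Brews is a weekly roundup of AI news.",  # placeholder text
    # The v3 models also accept an optional dimensions parameter
    # to shorten the returned vectors, e.g. dimensions=256.
)

vector = response.data[0].embedding
print(len(vector))  # dimensionality of the returned embedding
```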
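FireLLaVA is served through Fireworks.ai's OpenAI-compatible endpoint, so the same client can be pointed at it. A sketch, assuming the model id accounts/fireworks/models/firellava-13b (check the Fireworks docs for the current id):

```python
# Sketch: querying FireLLaVA via Fireworks.ai's OpenAI-compatible chat completions API.
# Assumes a FIREWORKS_API_KEY and that the model id below is still current.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/firellava-13b",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```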
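And for AutoRT, a purely conceptual sketch of the orchestration loop described above; every name here is a hypothetical placeholder, not actual AutoRT code.

```python
# Conceptual sketch of the AutoRT orchestration loop as described in the item above.
# All objects and methods are hypothetical placeholders, not the actual AutoRT API.

def autort_step(camera_image, vlm, llm, policies):
    # 1. A VLM turns the raw camera view into a text description of the scene.
    scene_description = vlm.describe(camera_image)

    # 2. An LLM proposes candidate manipulation goals for that scene.
    goals = llm.propose_goals(scene_description)

    # 3. The goals are filtered for affordance (physically doable)
    #    and safety (consistent with the robot's constraints).
    safe_goals = [g for g in goals if llm.is_feasible_and_safe(g, scene_description)]

    # 4. Each surviving goal is routed to an execution policy
    #    (a learned policy, a scripted skill, or human teleoperation).
    for goal in safe_goals:
        policies.route(goal).execute()
```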
🔦 Weekly Spotlight
The next grand challenge for AI - TED talk by Jim Fan, research scientist at NVIDIA AI [Link]
PDFToChat: an open-source Next.js project powered by Together AI, Mixtral, Pinecone, and LangChain [Details].
Initial versions of the Ollama Python and JavaScript libraries are now available (see the sketch after this list) [Link].
An introduction to the Qwen series [Link].
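A minimal sketch of the new Ollama Python library mentioned above; it assumes a local Ollama server is running and that a model such as llama2 has already been pulled.

```python
# Minimal sketch: chatting with a local model via the new Ollama Python library.
# Assumes `ollama serve` is running locally and `ollama pull llama2` has been done.
import ollama

response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```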
🔍 🛠️ AI Toolbox: Product Picks of the Week
Rehance.ai: lets your users simply tell your web app what to do. Add features without building UI.
Dubbing Studio by ElevenLabs: a new tool that translates a video across 29 languages in seconds and can handle multiple speakers. The new audio track maintains the original voice's tone and style.
Keep It Shot: a Mac app that utilizes AI to automatically provide descriptive names for your screenshots.
VectorShift: an integrated framework of no-code, low-code, and out-of-the-box generative AI solutions for building AI chatbots and automations.
📕 📚 AI Skillset: Learn & Build
'Prompt Engineering with Llama 2' — an interactive guide by Meta AI covering prompt engineering and best practices with Llama 2 (see the prompt-format sketch after this list) [Link].
How to generate expressions using Multi Motion Brush in Gen-2 [Link].
How to Cut RAG Costs by 80% Using Prompt Compression (see the compression sketch after this list) [Link].
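To accompany the Llama 2 guide, a small sketch that assembles a single-turn prompt in the documented Llama 2 chat format (the [INST] / <<SYS>> template):

```python
# Sketch: building a single-turn prompt in the documented Llama 2 chat format.
def llama2_prompt(system_prompt: str, user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(llama2_prompt(
    "You are a helpful, concise assistant.",
    "Summarize this week's AI news in one sentence.",
))
```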
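For the prompt-compression article: one widely used library for this technique is LLMLingua; the sketch below assumes that approach (the linked article may differ in its details) and compresses retrieved context before it reaches the LLM.

```python
# Sketch: compressing retrieved RAG context with LLMLingua before calling an LLM.
# pip install llmlingua — PromptCompressor downloads its default compression model
# (a Llama-2-7B variant) on first use.
from llmlingua import PromptCompressor

compressor = PromptCompressor()

retrieved_chunks = [
    "Chunk 1 of retrieved context ...",  # placeholders for real retrieval output
    "Chunk 2 of retrieved context ...",
]

result = compressor.compress_prompt(
    retrieved_chunks,
    question="What changed in GPT-4 Turbo this week?",  # placeholder query
    target_token=300,  # token budget for the compressed context
)
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```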
Thanks for reading and have a nice weekend! 🎉 Mariam.