GPT-4o, Google I/O updates, real-time interaction with tables and charts in ChatGPT, ZeroGPU, the Gemini API Developer Competition, bilingual (Chinese and English) text-to-image generation, and more
Hi! Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #63):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
Google I/O 2024 Updates [Details]:
Gemini 1.5 Flash: a new, lighter-weight model designed to be fast and efficient to serve at scale. It is optimized for tasks where low latency and cost matter, such as chat applications and extracting data from long documents. 1.5 Flash is the fastest Gemini model served in the API.
PaliGemma: Google’s first vision-language open model, inspired by PaLI-3, that is optimized for visual Q&A and image captioning.
Significant improvements to the 1.5 Pro model. Both 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. 1.5 Pro is also available with a 2 million token context window to developers via waitlist. (A minimal sketch of calling 1.5 Flash from Python appears after this list.)
Project Astra: new project focused on building future AI assistants powered by Gemini multimodal models. These agents can better understand the context they’re being used in, and respond quickly, in conversation [Demo].
Veo, a video generation model. It generates high-quality, 1080p resolution videos that can go beyond a minute, in a wide range of cinematic and visual styles.
VideoFX, a new experimental tool powered by Veo. It also comes with a Storyboard mode that lets you iterate scene by scene and add music to your final video. VideoFX is available in private preview starting in the U.S., and you can sign up to join the waitlist.
Imagen 3, image generation model. Imagen 3 better understands natural language, incorporates small details from longer prompts and has improved text rendering capabilities.
Music AI Sandbox, a suite of music AI tools to create new instrumental sections from scratch, transfer styles between tracks and much more.
Gemma 2, the next generation of Gemma open models. It’s built on a whole new architecture and will include a larger 27B parameter instance which outperforms models twice its size and runs on a single TPU host. Gemma 2 is still pretraining.
Gems: Gemini Advanced subscribers will soon be able to create Gems, customized versions of Gemini similar to OpenAI's GPTs. Simply describe what you want your Gem to do and how you want it to respond, and Gemini will take those instructions and create a Gem for your specific needs.
Gemini Advanced has a new planning feature that goes beyond a list of suggested activities and will actually create a custom itinerary.
Gemini 1.5 Pro is now available in the side panel in Gmail, Docs, Drive, Slides and Sheets via Workspace Labs.
LearnLM, a new family of models fine-tuned for learning, based on Gemini.
Firebase Genkit, an open source framework that helps devs build, deploy, and monitor production-ready AI-powered apps.
Project IDX, Google’s next-gen, AI-centric browser-based development environment, is now in open beta.
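Since 1.5 Flash is already callable in public preview, here is the minimal Python sketch referenced above. The google-generativeai package is the official Gemini SDK; the exact model name gemini-1.5-flash-latest and the GEMINI_API_KEY variable name are assumptions.

```python
# Minimal sketch: calling Gemini 1.5 Flash via the google-generativeai SDK.
# Assumes `pip install google-generativeai` and an API key from Google AI Studio.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed env var name

# "gemini-1.5-flash-latest" is the assumed preview model identifier.
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# Low-latency, long-context tasks like document summarization are what Flash targets.
response = model.generate_content("Summarize the key points of this document: ...")
print(response.text)
```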
OpenAI unveiled GPT-4o (“o” for “omni”), a new flagship multimodal model that can reason across audio, vision, and text in real time. GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.
It was trained as a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
It can respond to audio inputs in as little as 232 milliseconds.
GPT-4o is available in the API: it is 50% cheaper and 2x faster than GPT-4 Turbo, with 5x higher rate limits.
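Trying GPT-4o from the API is a one-line model swap with the official openai Python SDK; a minimal sketch, assuming OPENAI_API_KEY is set in the environment:

```python
# Minimal sketch: calling GPT-4o through the Chat Completions API.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment by default

response = client.chat.completions.create(
    model="gpt-4o",  # the new flagship multimodal model
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "In one sentence, what is new about GPT-4o?"},
    ],
)
print(response.choices[0].message.content)
```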
OpenAI is expanding access to GPT-4o, GPTs, the GPT Store, Memory, data analysis, file uploads, and other features, previously available only to paid subscribers, to users on the free tier of ChatGPT [Details | Demos].
OpenAI is rolling out new enhancements to data analysis in ChatGPT, enabling users to upload files directly from Google Drive and Microsoft OneDrive, interact with tables and charts in real-time, and customize presentation-ready charts [Details].
ElevenLabs launched the ElevenLabs Dubbing API, enabling developers to add audio or video translation to their products while preserving the unique characteristics of the original speakers' voices [Details].
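For a feel of the developer workflow, here is a rough request sketch. Note this is a sketch only: the endpoint path, header, and field names below are assumptions modeled on ElevenLabs' REST conventions, not verified documentation.

```python
# Sketch only: the endpoint path, header, and field names are assumptions,
# not confirmed ElevenLabs documentation. Check the official API reference.
import os

import requests

resp = requests.post(
    "https://api.elevenlabs.io/v1/dubbing",  # assumed dubbing endpoint
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},  # assumed header
    data={"target_lang": "es"},  # assumed field: target language code
    files={"file": open("interview.mp4", "rb")},  # source audio/video file
)
resp.raise_for_status()
print(resp.json())  # assumed to return a dubbing job/project identifier
```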
Tencent released Hunyuan-DiT, an open text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context [Details].
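For a quick local test, Hunyuan-DiT can be driven through Hugging Face diffusers; a minimal sketch, assuming a diffusers release that ships HunyuanDiTPipeline and the Tencent-Hunyuan/HunyuanDiT-Diffusers checkpoint:

```python
# Minimal sketch: bilingual text-to-image with Hunyuan-DiT via diffusers.
# Assumes a diffusers version that includes HunyuanDiTPipeline and the
# "Tencent-Hunyuan/HunyuanDiT-Diffusers" checkpoint on the Hub.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
pipe.to("cuda")

# Prompts can be written directly in Chinese (or English):
# "a Shiba Inu wearing sunglasses, cyberpunk style"
image = pipe(prompt="一只戴着墨镜的柴犬，赛博朋克风格").images[0]
image.save("hunyuan_dit_sample.png")
```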
Hugging Face is committing $10M of free GPUs with the launch of ZeroGPU, a shared infrastructure for indie and academic AI builders to run AI demos on Spaces [Details].
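On ZeroGPU, a Space requests a GPU per function call rather than holding one for the app's lifetime; a minimal Gradio sketch, assuming the spaces package and its @spaces.GPU decorator that ZeroGPU Spaces provide:

```python
# Minimal sketch of a ZeroGPU-ready Gradio Space. On ZeroGPU hardware the
# `spaces` package is available, and @spaces.GPU attaches a GPU only while
# the decorated function is running.
import gradio as gr
import spaces  # provided on ZeroGPU Spaces
import torch

@spaces.GPU  # GPU is allocated for the duration of this call
def generate(prompt: str) -> str:
    # Placeholder for real model inference; here we just confirm CUDA is live.
    return f"CUDA available: {torch.cuda.is_available()} | prompt: {prompt}"

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```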
Apple is set to release AI-enabled Eye Tracking on the iPhone and iPad as part of a new range of accessibility tools. Eye Tracking gives users a built-in option for navigating iPad and iPhone with just their eyes [Details].
Anthropic has launched a prompt generation tool available via Anthropic Console. Describe what you want to achieve, and Claude will use prompt engineering techniques like chain-of-thought reasoning to create more effective, precise and reliable prompts [Details].
Google announced the Gemini API Developer Competition with $1 million in cash prizes. The submission deadline is Aug 12, 2024 [Details].
Amazon launched Bedrock Studio, a web-based tool to experiment with generative AI models and build generative AI-powered apps. Bedrock Studio automatically deploys the relevant Amazon Web Services (AWS) resources as developers request them [Details].
Anthropic’s Claude is now available to people and businesses across Europe via web and iOS apps [Details].
Researchers at the University of Waterloo released the MMLU-Pro dataset, a more robust and challenging massive multitask language understanding benchmark designed to more rigorously test large language models' capabilities. The dataset contains 12K complex questions across various disciplines [Details].
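If the benchmark is hosted on the Hugging Face Hub in the usual layout, it can be pulled with the datasets library; a sketch assuming the TIGER-Lab/MMLU-Pro dataset ID and the field names below:

```python
# Sketch: loading MMLU-Pro with the Hugging Face `datasets` library.
# The "TIGER-Lab/MMLU-Pro" ID and the field names are assumptions.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro")
print(mmlu_pro)  # inspect the available splits and features

example = mmlu_pro["test"][0]
print(example["question"])
print(example["options"])  # MMLU-Pro extends answer choices beyond the usual four
```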
OpenAI and Reddit are partnering to integrate Reddit content into ChatGPT and other OpenAI products. OpenAI will access Reddit’s Data API for real-time content. This collaboration will also bring AI-powered features to Reddit and make OpenAI a Reddit advertising partner [Details].
Sony Music Group warns more than 700 companies against using its content to train AI [Details].
🔦 Weekly Spotlight
Winners and project gallery of the Meta Llama 3 Hackathon [Link].
Pipecat: An open source framework for real-time, multi-modal, conversational AI applications.
Introduction to gpt-4o - OpenAI cookbook [Link].
Using LlamaIndex and llamafile to build a local, private research assistant [Link].
100 things we announced at I/O 2024 [Link].
Automate Audio Transcription with Zapier + Deepgram [Link].
Multi AI Agent Systems with crewAI - free short course on DeepLearning.ai [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
ChatLLM by Abacus AI: ChatLLM gives access to Llama-3 on Groq, GPT-4, GPT-4o, Claude Opus, and Gemini 1.5 for $10/month per user.
Pictographic: AI generated illustration library with 40K+ images and SVGs. You can also generate your custom illustration.
Glato AI: Create video ads with expressive AI creators from a product URL.
SheetMagic: AI + web scraping in Google Sheets.
Glitter AI: Turns your mouse clicks and voice into a written guide, complete with screenshots and text, that you can easily edit and share.
Last week’s issue
You can support my work via BuyMeaCoffee.
Thanks for reading and have a nice weekend! 🎉 Mariam.