Open-source reasoning models, OpenAI's Operator, Bytedance's free Cursor alternative, Spell 3D worlds, Smallest VLM, Perplexity Assistant, open-source native GUI agent model, Kling's Elements & more
Hi! Welcome to this week's AI Brews, a concise roundup of the week's major developments in AI.
In today’s issue (Issue #90):
AI Pulse: Weekly News & Insights at a Glance
AI Toolbox: Product Picks of the Week
🗞️🗞️ AI Pulse: Weekly News & Insights at a Glance
🔥 News
DeepSeek released DeepSeek-R1, a fully open-source reasoning model with performance on par with OpenAI-o1. The code and the model weights are licensed under the MIT License. You can chat with DeepSeek-R1 on DeepSeek's official website, chat.deepseek.com, by switching on "DeepThink". An OpenAI-compatible API is available on the DeepSeek Platform (platform.deepseek.com) at a significantly lower cost than o1. DeepSeek also released six dense models distilled from DeepSeek-R1, based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models [Details | Report].
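Because the platform API is OpenAI-compatible, existing OpenAI SDK code only needs a new base URL and model name. A minimal sketch (the base URL and the "deepseek-reasoner" model id are taken from DeepSeek's docs; verify both before relying on them):

```python
# Minimal sketch: calling DeepSeek-R1 through the OpenAI-compatible API.
# The base URL and "deepseek-reasoner" model id come from DeepSeek's docs;
# verify both. Expects DEEPSEEK_API_KEY in the environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # the R1 reasoning model
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)
print(response.choices[0].message.content)
```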
OpenAI released Operator, an agent that can go to the web to perform tasks for you. Using its own browser, it can look at a webpage and interact with it by typing, clicking, and scrolling. Operator is trained to proactively ask the user to take over for tasks that require login, payment details, or when solving CAPTCHAs. Operator is powered by a new model called Computer-Using Agent (CUA). Combining GPT-4o's vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs). Operator can “see” (through screenshots) and “interact” (using all the actions a mouse and keyboard allow) with a browser, enabling it to take action on the web without requiring custom API integrations. It’s available in research preview to Pro users in the U.S. at operator.chatgpt.com [Details].
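OpenAI hasn't published CUA's internals, but the screenshot-in, mouse/keyboard-action-out loop it describes follows a familiar agent pattern. A purely illustrative sketch; every function below is a hypothetical stub, not an OpenAI API:

```python
# Illustrative sketch of the observe-act loop described above.
# take_screenshot, choose_action, and execute are hypothetical stubs
# standing in for the real components.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                      # "click", "type", "scroll", or "done"
    payload: dict = field(default_factory=dict)

def take_screenshot() -> bytes:
    return b""                     # stub: capture the browser viewport

def choose_action(screenshot: bytes, goal: str) -> Action:
    # stub: the real system asks a vision-language model to pick the
    # next mouse/keyboard action given the screenshot and the goal
    return Action(kind="done")

def execute(action: Action) -> None:
    pass                           # stub: dispatch to browser automation

def run_agent(goal: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        action = choose_action(take_screenshot(), goal)
        if action.kind == "done":  # includes handing control back to the
            break                  # user for logins, payments, or CAPTCHAs
        execute(action)

run_agent("find a dinner reservation for two")
```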
Hugging Face's research team made two new additions to the SmolVLM model family: SmolVLM-256M and SmolVLM-500M. At 256M parameters, SmolVLM-256M is the smallest vision language model in the world. SmolVLM can answer questions about images, describe visual content, or transcribe text [Details].
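Since the models are on the Hugging Face Hub, inference should follow the standard transformers vision-to-sequence flow. A rough sketch, assuming the "HuggingFaceTB/SmolVLM-256M-Instruct" hub id and the usage pattern from the model card (check both there; the image URL is a placeholder):

```python
# Sketch: asking SmolVLM-256M about an image with transformers.
# The hub id and chat-template flow are assumptions from the model card.
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed hub id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Placeholder image URL; substitute any image you like.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```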
Google released an updated Gemini 2.0 Flash Thinking model, now available as the exp-01-21 variant in AI Studio and via API. The model supports a 1 million token context window, native code execution, and longer output generation [Details].
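A minimal sketch with the google-generativeai SDK; the exp-01-21 model id comes from the announcement, but experimental names rotate, so confirm it in AI Studio:

```python
# Sketch: calling the updated Flash Thinking model via google-generativeai.
# The experimental model id is from the announcement and may change.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp-01-21")
response = model.generate_content("Sum the first 50 odd numbers, showing your steps.")
print(response.text)
```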
AI-powered search engine Perplexity launched Perplexity Assistant, which uses reasoning, search, and apps to help with daily tasks. It’s available for Android devices [Details].
Anthropic launched Citations, a new API feature that lets Claude ground its answers in source documents. Claude can now provide detailed references to the exact sentences and passages it uses to generate responses, leading to more verifiable, trustworthy outputs. Citations is generally available on the Anthropic API and Google Cloud’s Vertex AI [Details].
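Per the announcement, citations are switched on per document block in the Messages API. A sketch of the request shape (field names follow Anthropic's Citations docs as I understand them; verify them there):

```python
# Sketch of a Citations request; verify block field names against
# Anthropic's docs. Reads ANTHROPIC_API_KEY from the environment.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # any Citations-capable model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": "The grass is green. The sky is blue.",
                },
                "title": "Facts",
                "citations": {"enabled": True},  # switch citations on
            },
            {"type": "text", "text": "What color is the grass?"},
        ],
    }],
)
# Text blocks in the reply can carry a `citations` list pointing back
# at the exact passages Claude used.
for block in response.content:
    print(getattr(block, "text", ""), getattr(block, "citations", None))
```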
Spline introduced Spell, an AI model that generates 3D worlds. Spell is designed to generate an entire 3D scene, or “World”, from a single image in just a few minutes. The worlds are consistent with the initial image input and are represented as a volume that can be rendered using Gaussian Splatting (or other methods, like NeRFs). Spell can visually simulate physical material properties like reflections, refractions, and surface roughness, camera properties like depth of field, and even camera/object intersections when attempting to go inside surfaces. Spell launches at an early stage, with limited access and “an intentionally high price” [Details].
Kling AI added a new Elements feature, available with the KLING AI 1.6 model for image-to-video generation. Upload 1-4 images, select the subjects (people, animals, objects, or scenes) in the images as elements, and describe their actions and interactions; a video is then created from the elements and the prompt [Details].
Stargate Project: a new company that intends to invest $500 billion over the next four years in building new AI infrastructure for OpenAI in the United States. Arm, Microsoft, NVIDIA, Oracle, and OpenAI are the key initial technology partners [Details].
Bytedance released UI-TARS, an open-source native GUI agent model that interacts with graphical user interfaces using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components (perception, reasoning, grounding, and memory) within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules [Details].
Tencent released Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. The system includes two foundation components: a large-scale shape generation model, Hunyuan3D-DiT, and a large-scale texture synthesis model, Hunyuan3D-Paint. Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open- and closed-source, in geometry detail, condition alignment, and texture quality [Details].
Bytedance launched Trae, an AI code editor forked from Visual Studio Code. It is powered by OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet and is currently free to use [Details].
MoonshotAI introduced Kimi k1.5, an o1-level multimodal model trained with reinforcement learning (RL). You can test Kimi k1.5 by applying for API access through the Kimi OpenPlatform [Paper].
Google’s Gemini Live on Android now lets you add images, files, and YouTube videos to the conversation [Details].
Bytedance introduced its closed-source multimodal model Doubao 1.5 Pro, which uses a Mixture-of-Experts (MoE) architecture and outperforms GPT-4o and Claude 3.5 Sonnet on multiple visual benchmarks [Details].
Runway’s Frames image generation model, revealed in November 2024, is now available to ‘Unlimited’ and ‘Enterprise’ plan members [Details].
Codeium updated its Windsurf IDE to Wave 2. New features include Automated Memories, where the AI coding agent ‘Cascade’ learns your coding patterns from your usage. Cascade also now has access to a tool that can search the internet and parse the resulting web pages [Details].
Perplexity launched Sonar, a search API for developers to build generative search, powered by real-time information and citations, into apps [Details].
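Sonar exposes an OpenAI-style chat completions endpoint; a minimal sketch with plain requests (the endpoint, "sonar" model id, and citations field reflect the launch docs; confirm them before use):

```python
# Sketch: a Sonar request via the OpenAI-style endpoint. URL, model id,
# and the `citations` field are taken from the launch docs; verify them.
import os

import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json={
        "model": "sonar",
        "messages": [{"role": "user", "content": "What is new in AI this week?"}],
    },
    timeout=60,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data.get("citations"))  # list of source URLs backing the answer
```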
Trump revoked Biden’s executive order on addressing AI risks [Details].
Google.org is launching a second cohort of its Generative AI Accelerator to help nonprofits and other organizations use AI for social impact [Details].
Alibaba Group introduced VideoLLaMA3, a more advanced multimodal foundation model for image and video. The model achieves state-of-the-art performance on most image and video understanding benchmarks [Details].
🔦 Weekly Spotlight
Building Towards Computer Use with Anthropic: new course on DeepLearning.AI by Colt Steele, Anthropic’s Head of Curriculum [Link].
FilmAgent: a multi-agent collaborative system for end-to-end film automation in 3D virtual spaces. FilmAgent simulates key crew roles (directors, screenwriters, actors, and cinematographers) and integrates efficient human workflows within a sandbox environment [Link].
Realtime API Agents Demo by OpenAI: a simple demonstration of more advanced, agentic patterns built on top of the Realtime API to help you quickly prototype your own multi-agent realtime voice app [Link].
Common pitfalls when building generative AI applications by Chip Huyen [Link].
🔍 🛠️ AI Toolbox: Product Picks of the Week
Trae by Bytedance: an adaptive AI IDE that collaborates with you. It offers features like AI Q&A, code auto-completion, and agent-based AI programming capabilities.
JoggAI: Create lifelike, personalized AI avatars from text prompts
/extract by Firecrawl: Firecrawl's /extract endpoint lets you get structured web data with just a prompt (a minimal request sketch follows this list)
LogicStudio.ai: Build complex AI agent systems visually. Connect, orchestrate, and deploy intelligent workflows through an intuitive canvas interface. Open source.
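For the Firecrawl /extract pick above, the endpoint takes URLs plus a natural-language prompt and returns structured JSON. A rough sketch; the path and payload fields are my reading of Firecrawl's v1 docs and may differ, so check there:

```python
# Sketch: structured extraction with Firecrawl's /extract endpoint.
# Endpoint path and payload fields are assumptions from the v1 docs.
import os

import requests

resp = requests.post(
    "https://api.firecrawl.dev/v1/extract",
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={
        "urls": ["https://example.com/pricing"],
        "prompt": "Extract each plan's name and monthly price.",
    },
    timeout=120,
)
print(resp.json())
```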
Last week’s issue
Thanks for reading and have a nice weekend! 🎉 Mariam.