LLMs & Models Articles
Browse 420 articles about LLMs & Models.
Kimi K2 Runs 300 Sub-Agents Across 4,000 Steps on 4x H100s — The Story Hermes Found That Everyone Missed
Hermes's content ideation agent surfaced Kimi K2: an open-source system orchestrating 300 sub-agents across 4,000 coordinated steps on 4x H100 GPUs.
OpenAI's Goblin Problem: How RL Training in Codex Infected GPT-5.4 with Creature References Across Model Generations
GPT started mentioning goblins and gremlins in responses. The cause: RL 'nerdy personality' training in Codex scored creature references highly and bled…
Scott Aaronson's 2029 Warning: Why the World's Top Quantum Skeptic Is Now Sounding the Alarm
Scott Aaronson, historically skeptical of quantum timelines, now says fault-tolerant quantum computers capable of breaking cryptography are expected by ~2029.
How to Use a Smart Orchestrator Model to Direct Cheaper Sub-Agent Models in Claude Code
Use Claude Opus as an orchestrator to plan and review while DeepSeek or Gemma handle heavy lifting—cutting token costs by 5-10x without losing quality.
What Is the Mistral Medium 3.5 Model? Open-Weight AI Built for Agent Harnesses
Mistral Medium 3.5 is a 128B open-weight model combining reasoning, coding, and instruction-following for agent harnesses like OpenClaw and Hermes.
AI Model Orchestration: How to Use a Smart Model to Direct Cheaper Sub-Agents
Use a frontier model as orchestrator and cheaper models like DeepSeek for heavy lifting. Learn how to build a cost-efficient multi-model agent pipeline.
Andrej Karpathy on DeepSeek's OCR Paper: Why Pixels May Beat Tokens as AI Inputs
Karpathy called DeepSeek's Oct 2025 OCR paper — 10x text compression, 97% accuracy — a sign that tokenizers are on the way out.
Andrej Karpathy's Verifiability Thesis: Why AI Is Superhuman at Code and Fails at Car Washes
Karpathy's Sequoia talk explains AI's jagged profile: RL only trains where outputs are verifiable. That's why Opus 4.7 refactors codebases but tells you to…
How to Build a Local AI Stack from Scratch: Ollama to vLLM, Step by Step
From Ollama for daily use to vLLM for serving to TensorRT-LLM for production — here's the complete local AI runtime stack and when to use each layer.
China Blocks Meta's $2B Manus Acquisition: 4 Reasons the Unwinding Problem Has No Clear Solution
China blocked Meta's $2B Manus deal after employees moved into Meta offices and capital was transferred. There's no clear legal mechanism to unwind it.
Claude Mythos and GPT-5.5 Pass the 'Last Ones' Cyberattack Benchmark: 6 Things You Need to Know
AISI's 32-step corporate network attack sim took human experts 20 hours. Claude Mythos completed it 3 times out of 10. Here's what that means.
Cursor SDK + GPT-5.5 Scores 87.2% vs Native Codex's 61.5% — The Harness Is the Bottleneck
Switching GPT-5.5 from Codex's native harness to Cursor's SDK raised its functionality score from 61.5% to 87.2%, a gain of nearly 26 points from the harness alone.
DeepSeek V4 Launch: 5 Specs That Threaten Closed Frontier Labs
DeepSeek V4 dropped with 1M token context, open weights, and pricing that undercuts GPT-5.5 by nearly 9x on output tokens.
DeepSeek V4 Vision: 10x Cheaper Multimodal AI for Your Workflows
DeepSeek V4's vision model uses about 90 KV-cache entries per image vs about 870 for Claude, making it roughly 10x cheaper. Learn how to use it in your AI workflows and agents.
DeepSeek V4 Vision Model: 10x KV-Cache Efficiency and 67% Maze Navigation vs GPT-5.4's 50%
DeepSeek's vision variant uses ~90 KV-cache entries per image vs Claude Sonnet 4.6's ~870 — and beats GPT-5.4 on maze navigation 67% to 50%.
Google AI Co-clinician vs GPT-5.4 Thinking: Which Medical AI Do Physicians Actually Prefer?
In blind physician evaluations, Google's AI Co-clinician beat GPT-5.4 thinking with search 63% to 30%. Here's what drove the gap.
Google DeepMind AI Co-clinician: 6 Benchmark Results That Redefine Medical AI in 2026
Preferred by physicians 67% of the time, with zero critical errors in 97 of 98 cases and a 63% to 30% win over GPT-5.4 thinking: here's what the numbers actually show.
Google DeepMind's AI Co-clinician Tops the RXQA Drug Knowledge Benchmark — Beating Every Frontier Model
On RXQA — open FDA drug data, open-ended questions — Google's AI Co-clinician surpassed every other frontier AI system including GPT-5.4 and Claude.
How to Use OpenRouter with Claude Code: Run Cheaper Models as a Backend
Use OpenRouter to swap Claude's backend for DeepSeek or other models at 2–5% of the cost. A step-by-step guide to setting up the free-claude-code proxy.
Karpathy's Sequoia Talk: 5 Predictions About Agentic Engineering That Should Change How You Work
Karpathy named December 2025 as the inflection point for agentic coding and says he can't remember the last time he corrected the model.