This report tracks how Hacker News reacted to 156 posts about LLM launches from 13 providers, covering 132 model versions across 24 families (March 2023 to April 2026). The selected posts drew 56,041 HN comments in total, with a labeled sample of 735 non-deleted top-level comments for sentiment analysis.
Dataset and methodology
Posts are included if the title points to a model launch or release, the source is official or a trusted third party, the post clears minimum score and comment thresholds (≥50 each), and it passes manual review. Duplicates (e.g. one post covering multiple models) are counted once.
For each post, the top 5 comments (by HN display order) are collected and labeled for sentiment and stance using an LLM. Discussion themes are identified via keyword matching.
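As a sketch, the selection, sampling, and theme-tagging rules above might look like this (field names are illustrative, not the report's actual schema):

```python
def include_post(post, min_score=50, min_comments=50):
    """Keep launch/release posts that clear both engagement thresholds."""
    return (
        post["is_launch"]              # title points to a model launch/release
        and post["trusted_source"]     # official or vetted third-party source
        and post["score"] >= min_score
        and post["num_comments"] >= min_comments
    )

def sample_top_comments(comments, k=5):
    """Top k non-deleted top-level comments, in HN display order."""
    top_level = [c for c in comments if c["parent_is_story"] and not c["deleted"]]
    return top_level[:k]

def match_themes(text, themes):
    """Keyword-based theme tagging; themes maps name -> list of keywords."""
    lowered = text.lower()
    return [name for name, kws in themes.items()
            if any(kw in lowered for kw in kws)]
```

Manual review and LLM labeling sit on top of this; the snippet only captures the mechanical filters.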
In total: 156 posts from 13 providers, covering 132 model versions (Mar 2023 to Apr 2026). These are mostly official announcements (118), plus code repositories (15), papers (7), and third-party coverage (16). The posts drew 56,041 HN comments, of which 735 non-deleted top-level comments were sampled for labeling.
Across providers
This section compares providers on launch timing, performance, and comment quality. Charts use the top 8 providers by total score.
Launch timeline
How many models shipped each month, and how the pace changed over time.
Monthly model launch cadence
Unique model versions launched per month (by release date). Peak: 2025-09 (11 models).
Launch performance
Score, comment volume, and per-post averages — which providers get the most attention and which perform best.
Overall model launch timeline
Score vs. comments
Each dot is one post. Points above the diagonal get more discussion relative to their score.
Top 10 posts by score
The top 5 posts account for 13.5% of all score — attention is highly concentrated.
Provider engagement overview
All providers ranked by post count, total score, and engagement efficiency.
| Provider | Posts | Total score | Total comments | Avg score | Avg comments | Score/comment | Score share |
|---|---|---|---|---|---|---|---|
| OpenAI | 28 | 22,602 | 16,682 | 807 | 596 | 1.35 | 21.8% |
| Google DeepMind | 24 | 21,055 | 11,043 | 877 | 460 | 1.91 | 20.3% |
| Anthropic | 11 | 13,791 | 7,114 | 1,254 | 647 | 1.94 | 13.3% |
| | 24 | 11,311 | 5,183 | 471 | 216 | 2.18 | 10.9% |
| | 20 | 11,002 | 4,258 | 550 | 213 | 2.58 | 10.6% |
| Meta | 6 | 7,488 | 3,217 | 1,248 | 536 | 2.33 | 7.2% |
| | 10 | 6,970 | 3,053 | 697 | 305 | 2.28 | 6.7% |
| | 9 | 3,227 | 1,775 | 359 | 197 | 1.82 | 3.1% |
| | 4 | 2,174 | 986 | 544 | 247 | 2.2 | 2.1% |
| | 7 | 1,871 | 2,082 | 267 | 297 | 0.9 | 1.8% |
| | 5 | 1,054 | 305 | 211 | 61 | 3.46 | 1.0% |
| | 4 | 864 | 245 | 216 | 61 | 3.53 | 0.8% |
| | 4 | 364 | 98 | 91 | 25 | 3.71 | 0.4% |
- Scale vs. efficiency: the providers that ship the most launches do not always earn the highest per-post engagement. OpenAI posts the most; Anthropic has the highest average score among providers with 3+ posts.
- Concentration risk: Cohere has the highest single-post concentration (52.5% of its score from one post), while Alibaba is the most evenly distributed (7.7%).
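The per-provider columns above reduce to a per-post aggregation. A sketch, assuming post records carry provider, score, and comment counts (field names are illustrative):

```python
from collections import defaultdict

def provider_stats(posts):
    """Recompute the engagement columns from per-post records."""
    agg = defaultdict(lambda: {"posts": 0, "score": 0, "comments": 0})
    for p in posts:
        a = agg[p["provider"]]
        a["posts"] += 1
        a["score"] += p["score"]
        a["comments"] += p["comments"]
    grand_total = sum(a["score"] for a in agg.values())
    return {
        prov: {
            "avg_score": round(a["score"] / a["posts"]),
            "avg_comments": round(a["comments"] / a["posts"]),
            "score_per_comment": round(a["score"] / a["comments"], 2),
            "score_share": round(100 * a["score"] / grand_total, 1),
        }
        for prov, a in agg.items()
    }
```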
Comments
735 labeled top comments from 152 posts (100.0% of sampled, non-deleted comments) across providers. Net sentiment is 0.169 on a −2 to +2 scale; 37.9% of comments are pro-stance, while debate-weighted controversy sits at 38.0%. Anti-stance comments average 6.8 replies vs. 6.5 for pro.
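These aggregates can be sketched in a few lines. The −2 to +2 sentiment scale comes from the report; the exact controversy weighting is not spelled out, so the reply-share version below is an assumption:

```python
SENTIMENT_SCORE = {
    "very negative": -2, "negative": -1, "mixed": 0,
    "neutral": 0, "positive": 1, "very positive": 2,
}

def net_sentiment(comments):
    """Mean sentiment label on the -2..+2 scale."""
    return sum(SENTIMENT_SCORE[c["sentiment"]] for c in comments) / len(comments)

def reply_weighted_controversy(comments):
    """Share of replies landing on anti- or mixed-stance comments --
    one plausible reading of 'debate-weighted controversy'."""
    total = sum(c["replies"] for c in comments)
    debated = sum(c["replies"] for c in comments if c["stance"] in ("anti", "mixed"))
    return debated / total if total else 0.0
```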
Comment sentiment by post
Color = net score, size = comment count, shape = post type.
Common discussion themes
Keyword matches across 735 comments.
Sentiment distribution
735 labeled comments.
Comment sentiment by provider
Aggregate sentiment and stance per provider, with per-post breakdowns.
| Provider / post | Sentiment | Stance | Pro % | Anti % | Neg % | Avg replies | Controversy % |
|---|---|---|---|---|---|---|---|
| OpenAI | 0.0 | 0.0 | 32.1% | 32.1% | 32.1% | 8.86 | 50.3% |
| GPT-4 (4,091 pts) | 0.0 | -0.2 | 40.0% | 60.0% | 60.0% | 23.0 | |
| | -0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 3.6 | |
| GPT-4o (3,138 pts) | 0.4 | 0.0 | 20.0% | 20.0% | 20.0% | 9.6 | |
| | 0.0 | 0.2 | 40.0% | 20.0% | 20.0% | 3.4 | |
| | -1.0 | -0.8 | 0.0% | 80.0% | 80.0% | 6.8 | |
| OpenAI O1-Mini (30 pts) | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.67 | |
| OpenAI O3-Mini (962 pts) | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 9.8 | |
| GPT-4.5 (1,136 pts) | -0.6 | -0.6 | 0.0% | 60.0% | 60.0% | 18.8 | |
| | -0.2 | 0.0 | 20.0% | 20.0% | 20.0% | 3.8 | |
| GPT-4.1 in the API (680 pts) | -0.2 | 0.2 | 20.0% | 0.0% | 20.0% | 7.6 | |
| OpenAI o3 and o4-mini (555 pts) | 0.0 | 0.2 | 60.0% | 40.0% | 40.0% | 6.6 | |
| OpenAI o3-pro (289 pts) | 0.0 | -0.2 | 20.0% | 40.0% | 20.0% | 4.2 | |
| GPT-5 (2,063 pts) | -0.6 | -0.8 | 0.0% | 80.0% | 60.0% | 30.6 | |
| GPT-5-Codex (396 pts) | 1.4 | 1.0 | 100.0% | 0.0% | 0.0% | 3.0 | |
| | 0.0 | -0.2 | 20.0% | 40.0% | 40.0% | 3.0 | |
| | 0.0 | -0.25 | 25.0% | 50.0% | 50.0% | 2.0 | |
| | -0.4 | -0.4 | 20.0% | 60.0% | 60.0% | 14.0 | |
| | 0.0 | 0.2 | 40.0% | 20.0% | 20.0% | 9.6 | |
| GPT-5.2 (1,195 pts) | -0.6 | -0.2 | 20.0% | 40.0% | 60.0% | 9.4 | |
| GPT Image 1.5 (522 pts) | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 6.6 | |
| GPT-5.2-Codex (589 pts) | 1.0 | 0.8 | 80.0% | 0.0% | 0.0% | 7.0 | |
| GPT-5.3-Codex (1,530 pts) | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 9.0 | |
| GPT-5.3-Codex-Spark (890 pts) | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 8.0 | |
| | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 10.6 | |
| GPT-5.3 Instant (395 pts) | -1.2 | -1.0 | 0.0% | 100.0% | 100.0% | 10.2 | |
| GPT-5.4 (1,019 pts) | -0.4 | -0.2 | 20.0% | 40.0% | 60.0% | 12.0 | |
| | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.0 | |
| GPT-5.4 Mini and Nano (248 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 0.0% | 5.2 | |
| Google DeepMind | 0.286 | 0.252 | 43.7% | 18.5% | 21.0% | 8.3 | 31.3% |
| Gemini AI (2,135 pts) | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 12.8 | |
| | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 10.8 | |
| Gemma: New Open Models (1,129 pts) | -0.4 | -0.4 | 0.0% | 40.0% | 40.0% | 12.4 | |
| | -0.6 | -0.4 | 0.0% | 40.0% | 40.0% | 3.4 | |
| Gemini Flash (442 pts) | -0.4 | -0.4 | 0.0% | 40.0% | 40.0% | 2.8 | |
| | 1.2 | 1.0 | 100.0% | 0.0% | 0.0% | 6.8 | |
| | -0.4 | -0.2 | 20.0% | 40.0% | 60.0% | 13.4 | |
| | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 4.6 | |
| | -0.5 | -0.5 | 25.0% | 75.0% | 75.0% | 1.25 | |
| Gemini 2.5 (973 pts) | 1.4 | 1.0 | 100.0% | 0.0% | 0.0% | 9.0 | |
| Gemini 2.5 Flash (1,076 pts) | 1.0 | 1.0 | 100.0% | 0.0% | 0.0% | 10.0 | |
| | 1.2 | 0.6 | 80.0% | 20.0% | 20.0% | 3.6 | |
| | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 3.6 | |
| | 0.4 | 0.4 | 60.0% | 20.0% | 20.0% | 14.2 | |
| Gemini 2.5 Flash Image (1,093 pts) | 0.6 | 0.6 | 80.0% | 20.0% | 20.0% | 7.8 | |
| | -0.4 | -0.2 | 0.0% | 20.0% | 40.0% | 6.8 | |
| | 0.4 | 0.2 | 20.0% | 0.0% | 0.0% | 9.4 | |
| Gemini 3 (1,735 pts) | 0.6 | 0.6 | 80.0% | 20.0% | 20.0% | 10.0 | |
| | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 8.0 | |
| | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 9.0 | |
| Gemini 3 Deep Think (1,081 pts) | 0.6 | 0.6 | 60.0% | 0.0% | 20.0% | 9.8 | |
| Gemini 3.1 Pro (963 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 14.4 | |
| | -0.2 | -0.4 | 20.0% | 60.0% | 40.0% | 1.6 | |
| Google releases Gemma 4 open models (1,811 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 12.4 | |
| Alibaba | 0.324 | 0.234 | 36.9% | 13.5% | 13.5% | 5.12 | 29.3% |
| Qwen2 LLM Released (261 pts) | 0.2 | 0.2 | 40.0% | 20.0% | 20.0% | 4.2 | |
| | 0.8 | 0.8 | 80.0% | 0.0% | 20.0% | 3.6 | |
| | -0.2 | -0.2 | 0.0% | 20.0% | 20.0% | 4.2 | |
| | 0.2 | 0.4 | 40.0% | 0.0% | 0.0% | 5.4 | |
| | 1.0 | 0.6 | 60.0% | 0.0% | 0.0% | 5.4 | |
| | 0.0 | 0.0 | 50.0% | 50.0% | 50.0% | 0.5 | |
| QVQ-Max: Think with Evidence (121 pts) | 0.25 | 0.25 | 25.0% | 0.0% | 0.0% | 1.75 | |
| | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 11.0 | |
| | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 8.8 | |
| | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 4.8 | |
| Qwen3-Next (569 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 4.4 | |
| | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 4.2 | |
| Qwen3-VL (434 pts) | 1.0 | 0.4 | 60.0% | 20.0% | 20.0% | 4.4 | |
| | -0.2 | 0.0 | 20.0% | 20.0% | 20.0% | 6.8 | |
| | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 2.8 | |
| | 0.4 | 0.2 | 20.0% | 0.0% | 0.0% | 5.4 | |
| Qwen3-Max-Thinking (502 pts) | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 2.8 | |
| Qwen3-Coder-Next (735 pts) | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 5.2 | |
| | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 5.4 | |
| | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 4.2 | |
| | -0.2 | -0.4 | 20.0% | 60.0% | 40.0% | 9.6 | |
| | 0.2 | 0.2 | 40.0% | 20.0% | 20.0% | 5.0 | |
| | 0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 4.4 | |
| Mistral AI | 0.187 | 0.165 | 34.1% | 17.6% | 18.7% | 3.29 | 35.9% |
| Mistral 7B (884 pts) | 0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 2.6 | |
| | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 4.2 | |
| Mixtral of experts (639 pts) | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 2.6 | |
| | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 2.8 | |
| Mistral Large (599 pts) | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 2.0 | |
| Mixtral 8x22B (533 pts) | 0.0 | -0.2 | 0.0% | 20.0% | 0.0% | 5.2 | |
| | -0.4 | -0.2 | 0.0% | 20.0% | 40.0% | 6.0 | |
| Codestral Mamba (485 pts) | -0.2 | 0.0 | 20.0% | 20.0% | 40.0% | 2.8 | |
| Mistral NeMo (418 pts) | -0.2 | -0.2 | 0.0% | 20.0% | 20.0% | 2.4 | |
| | -0.2 | -0.2 | 20.0% | 40.0% | 40.0% | 1.8 | |
| Un Ministral, Des Ministraux (216 pts) | -0.6 | -0.6 | 0.0% | 60.0% | 60.0% | 3.8 | |
| Codestral 25.01 (21 pts) | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.0 | |
| Mistral Small 3 (620 pts) | 0.0 | 0.4 | 60.0% | 20.0% | 20.0% | 3.0 | |
| Mistral OCR (1,756 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 4.6 | |
| Medium Is the New Large (70 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 0.8 | |
| Devstral (701 pts) | 1.0 | 0.8 | 80.0% | 0.0% | 0.0% | 2.0 | |
| | -0.4 | -0.2 | 20.0% | 40.0% | 40.0% | 5.4 | |
| | 1.0 | 0.8 | 80.0% | 0.0% | 0.0% | 3.4 | |
| | 0.8 | 0.4 | 40.0% | 0.0% | 0.0% | 4.4 | |
| Anthropic | -0.073 | 0.018 | 25.5% | 23.6% | 27.3% | 11.89 | 37.8% |
| Claude 3.7 Sonnet and Claude Code (2,127 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 40.2 | |
| Claude 4 (2,013 pts) | -0.2 | -0.2 | 40.0% | 60.0% | 40.0% | 7.8 | |
| Claude Opus 4.1 (841 pts) | -0.6 | -0.4 | 0.0% | 40.0% | 40.0% | 8.4 | |
| | 0.2 | 0.4 | 40.0% | 0.0% | 0.0% | 12.0 | |
| Claude Sonnet 4.5 (1,585 pts) | -0.6 | -0.4 | 20.0% | 60.0% | 60.0% | 11.2 | |
| Claude Haiku 4.5 (730 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 7.6 | |
| Claude Opus 4.5 (1,113 pts) | 0.4 | 0.6 | 60.0% | 0.0% | 20.0% | 13.6 | |
| Claude Opus 4.6 (2,346 pts) | 0.6 | 0.0 | 20.0% | 20.0% | 0.0% | 16.0 | |
| | -0.2 | 0.0 | 0.0% | 0.0% | 20.0% | 1.0 | |
| | -0.8 | -0.6 | 0.0% | 60.0% | 60.0% | 2.4 | |
| Claude Sonnet 4.6 (1,346 pts) | -0.4 | -0.2 | 0.0% | 20.0% | 40.0% | 10.6 | |
| DeepSeek | 0.267 | 0.267 | 37.8% | 11.1% | 11.1% | 4.96 | 21.6% |
| DeepSeek-R1 (1,843 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 12.2 | |
| | 1.0 | 0.6 | 60.0% | 0.0% | 0.0% | 6.6 | |
| DeepSeek-V3 Technical Report (132 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 1.4 | |
| Deepseek R1-0528 (451 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 6.8 | |
| DeepSeek-v3.1 (778 pts) | -0.2 | -0.4 | 0.0% | 40.0% | 20.0% | 4.2 | |
| DeepSeek-v3.1-Terminus (101 pts) | -0.2 | 0.0 | 20.0% | 20.0% | 20.0% | 1.0 | |
| DeepSeek-v3.2-Exp (309 pts) | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 1.2 | |
| DeepSeek OCR (1,003 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 5.0 | |
| | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 6.2 | |
| Zhipu AI | 0.462 | 0.41 | 56.4% | 15.4% | 12.8% | 3.72 | 34.8% |
| | -0.2 | -0.2 | 20.0% | 40.0% | 40.0% | 4.0 | |
| | 0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 3.8 | |
| | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 2.2 | |
| | 0.75 | 0.75 | 75.0% | 0.0% | 0.0% | 0.5 | |
| | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 4.6 | |
| GLM-4.7-Flash (378 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 3.2 | |
| | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 5.0 | |
| | 0.6 | 0.4 | 60.0% | 20.0% | 0.0% | 5.8 | |
| xAI | -0.314 | -0.286 | 25.7% | 54.3% | 54.3% | 4.51 | 67.4% |
| Grok-2 Beta Release (226 pts) | 0.2 | 0.4 | 60.0% | 20.0% | 20.0% | 3.0 | |
| | -1.0 | -1.0 | 0.0% | 100.0% | 100.0% | 4.2 | |
| Grok 4 Launch [video] (437 pts) | 0.6 | 0.4 | 60.0% | 20.0% | 20.0% | 6.6 | |
| Grok 4 (328 pts) | -0.6 | -0.6 | 0.0% | 60.0% | 60.0% | 5.6 | |
| Grok Code Fast 1 (512 pts) | -0.4 | -0.6 | 20.0% | 80.0% | 60.0% | 6.0 | |
| Grok 4 Fast (96 pts) | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 2.6 | |
| Grok 4.1 (140 pts) | -1.0 | -0.6 | 0.0% | 60.0% | 80.0% | 3.6 | |
| Meta | 0.467 | 0.367 | 53.3% | 16.7% | 16.7% | 7.27 | 27.4% |
| Llama 2 (2,268 pts) | -0.2 | -0.2 | 40.0% | 60.0% | 60.0% | 13.2 | |
| Meta Llama 3 (2,199 pts) | 0.2 | 0.0 | 40.0% | 40.0% | 20.0% | 4.2 | |
| Llama 3.1 (437 pts) | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 2.8 | |
| | 1.2 | 0.8 | 80.0% | 0.0% | 0.0% | 6.2 | |
| Llama-3.3-70B-Instruct (425 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 3.8 | |
| The Llama 4 herd (1,235 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 13.4 | |
| | 0.5 | 0.45 | 55.0% | 10.0% | 10.0% | 4.05 | 27.7% |
| | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 3.6 | |
| | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 7.2 | |
| | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 3.2 | |
| | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 2.2 | |
| Microsoft | 0.3 | 0.15 | 35.0% | 20.0% | 15.0% | 2.2 | 23.4% |
| Phi-3 Technical Report (411 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 3.4 | |
| | 0.0 | 0.0 | 33.3% | 33.3% | 33.3% | 0.0 | |
| | 0.0 | -0.2 | 20.0% | 40.0% | 20.0% | 4.4 | |
| | 1.0 | 1.0 | 100.0% | 0.0% | 0.0% | 0.5 | |
| Phi-4 Reasoning Models (131 pts) | 0.4 | 0.2 | 20.0% | 0.0% | 0.0% | 0.8 | |
| | -0.15 | -0.2 | 25.0% | 45.0% | 40.0% | 1.7 | 63.0% |
| | 0.6 | 0.4 | 60.0% | 20.0% | 0.0% | 2.2 | |
| | -0.4 | -0.4 | 0.0% | 40.0% | 40.0% | 1.2 | |
| | -0.6 | -0.6 | 20.0% | 80.0% | 80.0% | 1.6 | |
| | -0.2 | -0.2 | 20.0% | 40.0% | 40.0% | 1.8 | |
| | -0.313 | -0.25 | 18.8% | 43.8% | 43.8% | 1.0 | 62.5% |
| | -0.4 | -0.4 | 20.0% | 60.0% | 60.0% | 1.0 | |
| | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.0 | |
| | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 1.4 | |
| | -0.8 | -0.6 | 20.0% | 80.0% | 80.0% | 0.8 | |
Posts with skeptical comment sections
High-score posts where comments skew anti or negative — the biggest gaps between upvotes and reception. The gap index combines anti/negative comment share with the post's visibility (log-scaled score), so high-scoring posts with hostile threads rank highest.
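The gap index can be written out concretely. The form below, the mean of a post's anti-stance and negative-sentiment shares scaled by log score normalized against the top post (4,091 points), reproduces the listed values (e.g. GPT-5 gives 64.24). Treat it as a reconstruction from the table rather than the report's canonical definition:

```python
import math

def gap_index(anti_share, neg_share, score, max_score=4091):
    """Mean of anti-stance and negative-sentiment shares (0..1),
    scaled by log-visibility relative to the highest-scoring post."""
    skepticism = (anti_share + neg_share) / 2
    visibility = math.log10(score + 1) / math.log10(max_score + 1)
    return round(100 * skepticism * visibility, 2)
```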
| Post | Score | Anti % | Mixed % | Avg replies | Gap idx |
|---|---|---|---|---|---|
| GPT-5.3 Instant | 395 | 100.0% | 0.0% | 10.2 | 71.92 |
| GPT-5 | 2,063 | 80.0% | 0.0% | 30.6 | 64.24 |
| GPT-4 | 4,091 | 60.0% | 0.0% | 23.0 | 60.0 |
| | 132 | 100.0% | 0.0% | 4.2 | 58.8 |
| Llama 2 | 2,268 | 60.0% | 0.0% | 13.2 | 55.75 |
| | 290 | 80.0% | 20.0% | 6.8 | 54.57 |
| Claude Sonnet 4.5 | 1,585 | 60.0% | 20.0% | 11.2 | 53.16 |
| Grok Code Fast 1 | 512 | 80.0% | 0.0% | 6.0 | 52.52 |
| | 207 | 80.0% | 0.0% | 1.6 | 51.34 |
| GPT-4.5 | 1,136 | 60.0% | 20.0% | 18.8 | 50.76 |

Sampled top comments for each of these posts:

GPT-5.3 Instant (395 points):

- Very negative, anti (22 replies): The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer. Why it matters, the big picture, it's not jut you, the awful emphasis, the quotations with rhetorical questions, etc.. I don't know if it's intentional s...
- Negative, anti (7 replies): I'm a bit confused by this branding (never even noticed that there was a 5.2-Instant), it's not a super fast 1000tok/s Cerebras based model which they have for codex-spark, it's just 5.2 w/out the router / non-thinking mode? I feel like openai is ...
- Negative, anti (11 replies): Since the page mentions: Better judgment around refusals Has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, make up a joke about group and then iteratin...
- Negative, anti (5 replies): I kind of chuckled when I read the headline GPT-5.3 Instant: Smoother, more ... LLM companies starting to sound like cigarette advertisements.
- Negative, anti (6 replies): Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories on long range projectiles?

GPT-5 (2,063 points):

- Neutral, neutral (109 replies): It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the d...
- Neutral, anti (12 replies): GPT-5 knowledge cutoff: Sep 30, 2024 (10 months before release). Compare that to Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release) Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release) https://platform.openai.com/docs/model...
- Negative, anti (12 replies): Going by the system card at: https://openai.com/index/gpt-5-system-card/ GPT-5 is a unified system . . . OK . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly...
- Negative, anti (7 replies): I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year. What excites me now is that ...
- Negative, anti (13 replies): The marketing copy and the current livestream appear tautological: it's better because it's better. Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.

GPT-4 (4,091 points):

- Very positive, pro (44 replies): After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire ...
- Negative, anti (42 replies): A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example: Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowe...
- Very negative, anti (14 replies): I just finished reading the 'paper' and I'm astonished that they aren't even publishing the # of parameters or even a vague outline of the architecture changes. It feels like such a slap in the face to all the academic AI researchers that their work is buil...
- Negative, anti (9 replies): That footnote on page 15 is the scariest thing i've read about AI/ML to date. To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reas...
- Very positive, pro (6 replies): From the livestream video, the tax part was incredibly impressive. After ingesting the entire tax code and a specific set of facts for a family and then calculating their taxes for them, it then was able to turn that all into a rhyming poem. Mind blown. Here it is in its entir...

Post scoring 132:

- Negative, anti (7 replies): This article is weak and just general speculation. Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this: Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all ...
- Negative, anti (2 replies): The creation of a model which is co-state-of-the-art (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim out that xAI's failure to significantly outperform existing models despite throwing more compute a...
- Negative, anti (10 replies): Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute. If you had $3 billion, xAI would choose to invest $2.5 billion in G...
- Negative, anti (2 replies): I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good. Another nit-pick: I don't think DeepSeek had 50k H...
- Negative, anti (0 replies): The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is: Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thi...

Llama 2 (2,268 points):

- Positive, pro (7 replies): Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5! AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions. - Llama 1 (llama-65b): 57.6 - LLama 2 (llama-2-70b-chat-hf): 64.6 - GPT-3.5: 85.2 -...
- Negative, anti (25 replies): Key detail from release: If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must reques...
- Negative, anti (6 replies): This was a pretty disappointing initial exchange: what are the most common non-investor roles at early stage venture capital firms? Thank you for reaching out! I'm happy to help you with your question. However, I must point out that the term non-investor roles may be perc...
- Positive, pro (23 replies): Hey HN, we've released tools that make it easy to test LLaMa 2 and add it to your own app! Model playground here: https://llama2.ai Hosted chat API here: https://replicate.com/a16z-infra/llama13b-v2-chat If you want to just play with the mode...
- Negative, anti (5 replies): Another non-open source license. Getting better but don't let anyone tell you this is open source. http://marble.onl/posts/software-licenses-masquerading-as-op...

Post scoring 290:

- Very negative, anti (25 replies): So google will eventually be mostly indexing the output of LLMs, and at that point they might as well skip the middleman and generate all search results by themselves, which incidentally, this is how I am using Kagi today - I basically ask questions and get the answers, and I ...
- Negative, anti (3 replies): That's an inflection point, all right. OpenAI's customers can now at least break even. Of course, it means a flood of crap content.
- Negative, anti (4 replies): For example, putting in 50k page views a month, with a Finance category, gives a potential yearly earnings of $2,000. I'm going to take the median across all categories, which is an estimated annual revenue of $1,550 for 50,000 monthly page views. This is approximately ~$...
- Negative, anti (0 replies): This article is weird clickbait which, even weirder, worked. It seems to assume a world where SEO entrepreneurs were ready to churn out million-page sites, but the cost per query were blocking them. There is no marginal cost, no SEO cost to adding another page, as long as a c...
- Neutral, mixed (2 replies): This assumes a future where users are still depending on search engines or some comparative tool. Profiting off the current status quo. I would also be curious how user behavior will evolve to identify, evade, and ignore AI generated content. Some quasi arms race we'll be...

Claude Sonnet 4.5 (1,585 points):

- Positive, pro (12 replies): I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehens...
- Negative, anti (21 replies): Anecdotal evidence. I have a fairly large web application with ~200k LoC. Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI). implement a fuzzy search for conversations and reports either when selecting Go to Conversation or Go to Report and typing th...
- Very negative, anti (8 replies): I haven't shouted into the void for a while. Today is as good a day as any other to do so. I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connec...
- Negative, anti (8 replies): Practically speaking, we've observed it maintaining focus for more than 30 hours on complex, multi-step tasks. Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all...
- Mixed, mixed (7 replies): I just ran this through a simple change I've asked Sonnet 4 and Opus 4.1, and it fails too. It's a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this chan...

Grok Code Fast 1 (512 points):

- Positive, pro (10 replies): Tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which...
- Negative, anti (14 replies): It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the fast name of the model) is token output speed. I would have thought it uncontroversial view among software engineers that token quality is much important than to...
- Negative, anti (1 reply): Is this the model that is the Coding version of Grok-4 promised when Grok-4 had awful coding benchmarks? I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up was...
- Negative, anti (2 replies): On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness. Let's see this harness, then, because third party reports rate it at 57.6% https://www.vals.ai/models/grok_grok-code-fast-1
- Mixed, anti (3 replies): I've actually seen really good outputs from the regular Grok 4. The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good. I never wanted a faster version, I just wanted a bit more feedback and explanation...

Post scoring 207:

- Negative, anti (3 replies): I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have t...
- Negative, anti (2 replies): Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...
- Very positive, pro (2 replies): Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kim...
- Very negative, anti (1 reply): Hm. The benchmarks look too good to be true and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective...
- Negative, anti (0 replies): M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking

GPT-4.5 (1,136 points):

- Negative, anti (48 replies): GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens It sou...
- Negative, anti (12 replies): Is it official then? Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led to non-tech voices to suggest likewise....
- Neutral, neutral (10 replies): I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments): hn-summary.sh 43197872 -m gpt-4.5-preview Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes... Here's the result: https://gist.g...
- Negative, anti (9 replies): Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real was that all moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles t...
- Mixed, mixed (15 replies): First impression of GPT-4.5: 1. It is very very slow, for some applications where you want real time interactions is just not viable, the text attached below took 7s to generate with 4o, but 46s with GPT4.5 2. The style it writes is way better: it keeps the tone you ask and ma...
Key insights
Timing.
- Launch cadence accelerates after mid-2024, with more months containing multiple major model releases. The peak month is 2025-09 with 11 launches.
Attention.
- Top 3 providers hold 55.4% of total score from 40.4% of posts.
- Top 5 posts alone account for 13.5% of all score.
- GPT-4 leads both in score (4,091 points) and discussion (2,507 comments).
Discussion.
- Criticism gets more replies: 6.8 avg for anti-stance vs 6.5 for pro.
- Open source / weights is the top discussion theme (131 mentions), followed by Coding quality (129).
- Open-weight models get more positive reception (net sentiment 0.346) vs closed-source (0.062).
- Surface sentiment skews positive (35.8% of comments), but reply-weighted controversy sits at 38.0%.
Providers.
- OpenAI leads in volume and total attention, but draws the most pushback: 50.3% reply-weighted controversy.
- Anthropic leads in average score per post (1,254), but its threads are debate-heavy: 23.6% anti stance.
- Google DeepMind combines a 20.3% score share with positive sentiment (0.286) and low controversy.
- Rising engagement: Alibaba, Mistral AI, Zhipu AI.
- Declining engagement: OpenAI, Anthropic, DeepSeek.
Engagement data uses the full curated set. Sentiment and stance come from a smaller top-comment sample — treat close comparisons with caution.
Provider details
Each provider section shows its launch timeline and post details. For cross-provider rankings, comment patterns, and performance comparisons, see the across providers section above.
OpenAI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| Official · 2023-03-14 | GPT-4 (GPT) | 4,091 | 2,507 | |
- Very positive / Pro (44 replies): After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patie...
- Negative / Anti (42 replies): A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example: > Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not all...
- Very negative / Anti (14 replies): I just finished reading the 'paper' and I'm astonished that they aren't even publishing the # of parameters or even a vague outline of the architecture changes. It feels like such a slap in the face to all the academic AI researchers that their work is built off over the years...
- Negative / Anti (9 replies): That footnote on page 15 is the scariest thing i've read about AI/ML to date. "To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and...
- Very positive / Pro (6 replies): From the livestream video, the tax part was incredibly impressive. After ingesting the entire tax code and a specific set of facts for a family and then calculating their taxes for them, it then was able to turn that all into a rhyming poem. Mind blown. Here it is in its entir...
| Official · 2024-04-09 | GPT-4 Turbo Vision (GPT) | 220 | 107 | |
- Positive / Pro (8 replies): They also added both JSON and function support to the vision model - previously it didn't have those. This means you can now use gpt-4-turbo vision to extract structured data from an image! I was previously using a nasty hack where I'd run the image through the vision model to e...
- Neutral / Neutral (2 replies): Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities: > "I'm hoping we can get evals out shortly to help quantify this. Until ...
- Negative / Anti (2 replies): People like to say AI is moving blazingly fast these days but this has been like a year in the waiting que. Guessing sora will take equally long if not way longer before the general audience gets to touch it.
- Neutral / Pro (0 replies): Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images: https://www.datasette.cloud/blog/2024/datasette-extract/ YouTube video demo here: https://www.youtube.com/watc...
- Negative / Anti (6 replies): Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive!
| Official · 2024-05-13 | GPT-4o (GPT) | 3,138 | 2,366 | |
- Positive / Mixed (14 replies): The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that was that they had tested this over and over, but it is really hard to get that right so if they didn't fake it in some way I'd say that ...
- Mixed / Mixed (8 replies): Very interesting and extremely impressive! I tried using the voice chat in their app previously and was disappointed. The big UX problem was that it didn't try to understand when I had finished speaking. English is a second language and I paused a bit too long thinking of a wor...
- Positive / Mixed (3 replies): Very, very impressive for a "minor" release demo. The capabilities here would look shockingly advanced just 5 years ago. Universal translator, pair programmer, completely human sounding voice assistant and all in real time. Scifi tropes made real. But: Interesting next to see ho...
- Negative / Anti (8 replies): I found these videos quite hard to watch. There is a level of cringe that I found a bit unpleasant. It’s like some kind of uncanny valley of human interaction that I don’t get on nearly the same level with the text version.
- Positive / Pro (15 replies): We've had voice input and voice output with computers for a long time, but it's never felt like spoken conversation. At best it's a series of separate voice notes. It feels more like texting than talking. These demos show people talking to artificial intelligence. This is new. ...
| Official · 2024-07-18 | GPT-4o mini (GPT) | 222 | 78 | |
- Neutral / Neutral (0 replies): [dupe] Some more discussion: https://news.ycombinator.com/item?id=40996248
- Positive / Pro (4 replies): The big news for me here is the 16k output token limit. The models keep increasing the input limit to outrageous amounts, but output has been stuck at 4k. I did a project to summarize complex PDF invoices (not “unstructured” data, but “idiosyncratically structured” data, as eac...
- Neutral / Pro (3 replies): Here's something interesting to think about: In ML we do a lot of bootstrapping. If a model is 51% wrong on a binary problem you flip the answer and train a 51% correct model then work your way up from there. Small models are trained from synthetic and live data curated and gen...
- Negative / Anti (9 replies): GPT-4o mini is $0.15/1M input tokens, $0.60/1M output tokens. In comparison, Claude Haiku is $0.25/1M input tokens, $1.25/1M output tokens. There's no way this price-race-to-the-bottom is sustainable.
- Neutral / Neutral (1 reply): @dang: This post isn't on the 1st or 2nd page of hacker news. Did it trip some automated controversy detection code for too many comments in the first hour? Edit: it says 181 points, 6 hours ago, and eyeballing the 1st page it should be in the top 5 right now.
| 3rd party · 2024-07-19 | GPT-4o mini (GPT) | 290 | 236 | |
- Very negative / Anti (25 replies): So google will eventually be mostly indexing the output of LLMs, and at that point they might as well skip the middleman and generate all search results by themselves, which incidentally, this is how I am using Kagi today - I basically ask questions and get the answers, and I ...
- Negative / Anti (3 replies): That's an inflection point, all right. OpenAI's customers can now at least break even. Of course, it means a flood of crap content.
- Negative / Anti (4 replies): > For example, putting in 50k page views a month, with a Finance category, gives a potential yearly earnings of $2,000. > I'm going to take the median across all categories, which is an estimated annual revenue of $1,550 for 50,000 monthly page views. > This is approxim...
- Negative / Anti (0 replies): This article is weird clickbait which, even weirder, worked. It seems to assume a world where SEO entrepreneurs where ready to churn out million-page sites, but the cost per query were blocking them. There is no marginal cost, no SEO cost to adding another page, as long as a co...
- Neutral / Mixed (2 replies): This assumes a future where users are still depending on search engines or some comparative tool. Profiting off the current status quo. I would also be curious how user behavior will evolve to identify, evade, and ignore AI generated content. Some quasi arms race we'll be in f...
| Official · 2024-09-12 | OpenAI o1-mini (o-series) | 30 | 5 | |
- Neutral / Neutral (0 replies): related tweet: https://x.com/OpenAIDevs/status/1834278699388338427 OpenAI o1-preview and o1-mini are rolling out today in the API for developers on tier 5. o1-preview has strong reasoning capabilities and broad world knowledge. o1-mini is faster, 80% cheaper, and competitive with...
- Neutral / Neutral (0 replies): Related: OpenAI O1 Model https://news.ycombinator.com/item?id=41523070
- Neutral / Neutral (2 replies): No GPT in the name, so maybe not transformer based?
| Official · 2025-01-31 | o3-mini (o-series) | 962 | 904 | |
- Neutral / Neutral (15 replies): I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae... For 18,936 input, 2,905 output it cost 3.3612 cents. Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes...
- Neutral / Neutral (2 replies): I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21 Example usage: llm -m o3-mini 'write a poem about a pirate and a walrus' \ -o reasoning_effort high Output (com...
- Positive / Pro (4 replies): For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet. 62% $186 o1 high 60% $18 o3-mini high 57% $5 DeepSe...
- Positive / Pro (14 replies): For years I've been asking all the models this mixed up version of the classic riddle and they 99% of the time get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still co...
- Neutral / Neutral (14 replies): So far, it seems like this is the hierarchy: o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini. o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf
| Official · 2025-02-27 | GPT-4.5 (GPT) | 1,136 | 988 | |
- Negative / Anti (48 replies): GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens. GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens. It sounds like it's so expensive and t...
- Negative / Anti (12 replies): Is it official then? Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led to non-tech voices to suggest likewise. But w...
- Neutral / Neutral (10 replies): I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments): hn-summary.sh 43197872 -m gpt-4.5-preview Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes... Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c...
- Negative / Anti (9 replies): Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to s...
- Mixed / Mixed (15 replies): First impression of GPT-4.5: 1. It is very very slow, for some applications where you want real time interactions is just not viable, the text attached below took 7s to generate with 4o, but 46s with GPT4.5. 2. The style it writes is way better: it keeps the tone you ask and make...
| Official · 2025-03-19 | o1-pro (o-series) | 131 | 129 | |
- Mixed / Mixed (5 replies): Pricing: $150 / 1M input tokens, $600 / 1M output tokens. (Not a typo.) Very expensive, but I've been using it with my ChatGPT Pro subscription and it's remarkably capable. I'll give it 100,000 token codebases and it'll find nuanced bugs I completely overlooked. (Now I almost fe...
- Neutral / Neutral (4 replies): This is their first model to only be available via the new Responses API - if you have code that uses Chat Completions you'll need to upgrade to Responses in order to support this. Could take me a while to add support for it to my LLM tool: https://github.com/simonw/llm/issues/839
- Neutral / Neutral (5 replies): It cost me 94 cents to render a pelican riding a bicycle SVG with this one! Notes and SVG output here: https://simonwillison.net/2025/Mar/19/o1-pro/
- Neutral / Pro (3 replies): Assuming a highly motivated office worker spends 6 hours per day listening or speaking, at a salary of $160k per year, that works out to a cost of ≈$10k per 1M tokens. OpenAI is now within an order of magnitude of a highly skilled humans with their frontier model pricing. o3 pr...
- Negative / Anti (2 replies): It has a 2023 knowledge cut-off, and 200k context window... ? That's pretty underwhelming.
| Official · 2025-04-14 | GPT-4.1 (GPT) | 680 | 492 | |
- Negative / Neutral (16 replies): As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between: - 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought) - o3-mini (web search, CoT, canvas, but no i...
- Neutral / Neutral (8 replies): Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year: SWE Aider Cost Fast Fresh Claude 3.7 70% 65% $15 77 8/24 Gemini 2.5 64% 69% $10 200 1/25 GPT-4.1 55% 53% $8 169 6/24 DeepSeek R1 49% 57% $...
- Neutral / Neutral (8 replies): don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for: - telling the model to be persistent (+20%) - dont self-inject/parse toolcalls (+2%) - prompted planning (+4%) - JSON BAD - use X...
- Mixed / Mixed (2 replies): I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus. My take aways: - This is the first model from OpenAI that f...
- Neutral / Pro (4 replies): From OpenAI's announcement: > Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better s...
| Official · 2025-04-16 | o3 (o-series) | 555 | 507 | |
- Very negative / Anti (13 replies): Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549... With right knowledge and web searches one can answer this question in a matte...
- Positive / Pro (5 replies): Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA ...
- Positive / Pro (7 replies): Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1] Incredible how resilient Claude models have been for best-in-coding class. [1] But by only about 1%, and inclusive of C...
- Positive / Pro (3 replies): I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model). Using o4-min...
- Negative / Anti (5 replies): To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025. GPT-4o mini: The new moon in August 2025 will occur on August 12. Llama 3.3 70B: The new moon in August...
| Official · 2025-06-10 | o3-pro (o-series) | 289 | 199 | |
- Mixed / Mixed (6 replies): I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing sl...
- Neutral / Neutral (5 replies): The guys in the other thread who said that OpenAI might have quantized o3 and that's how they reduced the price might be right. This o3-pro might be the actual o3-preview from the beginning and the o3 might be just a quantized version. I wish someone benchmarks all of these mo...
- Neutral / Anti (4 replies): The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?
- Negative / Anti (4 replies): I am still not willing to upgrade to a Pro account. I pay $20 a month for both Gemini and ChatGPT, and for what I need this is currently enough. I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on...
- Positive / Pro (2 replies): here's a nice user review we published: https://www.latent.space/p/o3-pro sama's highlight[0]: > "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future." I kept nu...
| Official · 2025-08-07 | GPT-5 (GPT) | 2,063 | 2,482 | |
- Neutral / Neutral (109 replies): It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the differ...
- Neutral / Anti (12 replies): GPT-5 knowledge cutoff: Sep 30, 2024 (10 months before release). Compare that to Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release). Claude Opus 4.1 knowledge cutoff: Mar 2025 (4 months before release). https://platform.openai.com/docs/models/compare https://deepmin...
- Negative / Anti (12 replies): Going by the system card at: https://openai.com/index/gpt-5-system-card/ > GPT‑5 is a unified system . . . OK > . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which mod...
- Negative / Anti (7 replies): I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year. What excites me now is that Gemini...
- Negative / Anti (13 replies): The marketing copy and the current livestream appear tautological: "it's better because it's better." Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
| Official · 2025-09-15 | GPT-5-Codex (GPT) | 396 | 137 | |
- Positive / Pro (5 replies): Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1]. SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%). As someon...
- Very positive / Pro (3 replies): I've been a hardcore claude-4-sonnet + Cursor fan for a long time, but in the last 2 months my usage went through the roof. I started with the basic Cursor subscription, then upgraded to pro, until I hit usage limits again. Then I started using my own Claude API key but I was ...
- Very positive / Pro (4 replies): Codex CLI IDE just works, very impressed with the quality. If you tried it a while back and didn’t like it, try it again via the vscode extension generous usage included with plus. Ditched my Claude code max sub for the ChatGPT pro $200 plan. So much faster, and not hit any lim...
- Positive / Pro (2 replies): From my observation of the past 2 weeks is that Claude Code is getting dramatically worse and super low usage quota's while OpenAI Codex is getting great and has a very generous usage quota in comparison. For people that have not tried it in say ~1 month, give Codex CLI a try.
- Positive / Pro (1 reply): It's been interesting reading this thread and seeing that others have also switched to using Codex over Claude Code. I kept running into a huge issue with Claude Code creating mock implementations and general fakery when it was overwhelmed. I spent so much time tuning my input...
| Official · 2025-09-15 | GPT-5-Codex (GPT) | 250 | 141 | |
- Very positive / Pro (9 replies): I've been extremely impressed (and actually had quite a good time) with GPT-5 and Codex so far. It seems to handle long context well, does a great job researching the code, never leaves things half-done (with long tasks it may leave some steps for later, but it never does 50% ...
- Neutral / Neutral (0 replies): This should probably be merged with the other GPT-5-Codex thread at https://news.ycombinator.com/item?id=45252301 since nobody in this thread is talking about the system card addendum.
- Negative / Anti (2 replies): My problem _still_ with all of the codex/gpt based offerings is that they think for way too long. After using Claude 4 models through cursor max/ampcode I feel much more effective given it's speed. Ironically, Claude Code feels just as slow as codex/gpt (even with my company p...
- Mixed / Mixed (2 replies): Interesting, the new model uses a different prompt in Codex CLI that's ~half the size (10KB vs. 23KB) of the previous prompt[0][1]. SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchm...
- Negative / Anti (2 replies): I signed up to OpenAI, verified my identity, and added my credit card, bought $10 of credits. But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took. Are you meant to only use Codex with the...
| Repo · 2025-11-08 | Codex Mini (GPT) | 56 | 54 | |
- Positive / Neutral (2 replies): I managed to get GPT-5-Codex-Mini to draw me a pelican. It's not a very good one! https://static.simonwillison.net/static/2025/codex-hacking-m... For comparison, here's GPT-5-Codex (not mini) https://static.simonwillison.net/static/2025/codex-hacking-d... and full GPT-5: https:...
- Positive / Pro (0 replies): Since they also limited the usage of codex CLI quite a bit, this might help. I really like the gpt-5-codex model. It has the most impressive dynamic reasoning effort I've seen. Responding instantly to simple questions while thinking very long when necessary. It's also interest...
- Negative / Anti (2 replies): Codex isn’t even a good model.
- Negative / Anti (4 replies): GPT-5 and GPT-5-Codex are already not clever enough for anything interesting.
| Official · 2025-11-12 | GPT-5.1 (GPT) | 555 | 726 | |
- Negative / Anti (27 replies): I don’t want more conversational, I want more to the point. Less telling me how great my question is, less about being friendly, instead I want more cold, hard, accurate, direct, and factual results. It’s a machine and a tool, not a person and definitely not my friend.
- Negative / Anti (22 replies): All the examples of "warmer" generations show that OpenAI's definition of warmer is synonymous with sycophantic, which is a surprise given all the criticism against that particular aspect of ChatGPT. I suspect this approach is a direct response to the backlash against removing 4o.
- Negative / Anti (12 replies): > what romanian football player won the premier league > The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that’s incorrect; he never won the league. > ... > No Romanian footballer has ever won...
- Mixed / Neutral (3 replies): I’ve seen various older people that I’m connected with on Facebook posting screenshots of chats they’ve had with ChatGPT. It’s quite bizarre from that small sample how many of them take pride in “baiting” or “bantering” with ChatGPT and then post screenshots showing how they “g...
- Positive / Pro (6 replies): Seems like people here are pretty negative towards a "conversational" AI chatbot. Chatgpt has a lot of frustrations and ethical concerns, and I hate the sycophancy as much as everyone else, but I don't consider being conversational to be a bad thing. It's just preference I guess...
| Official · 2025-11-19 | GPT-5.1 Codex Max (GPT) | 483 | 319 | |
- Mixed / Pro (22 replies): I've been using a lot of Claude and Codex recently. One huge difference I notice between Codex and Claude code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character...
- Neutral / Neutral (14 replies): Rest assured that we are better at training models than naming them ;D - New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0 - Natively trained to work across many hours across multiple context windows via compaction - 30% mor...
- Positive / Pro (4 replies): Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI. - As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini ...
- Negative / Anti (5 replies): OpenAI likes to time their announcements alongside major competitor announcements to suck up some of the hype. (See for instance the announcement of GPT-4o a single day before Google's IO conference.) They were probably sitting on this for a while. That makes me think this is a ...
- Neutral / Neutral (3 replies): Thinking level medium: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
| Official · 2025-12-11 | GPT-5.2 (GPT) | 1,195 | 1,083 | |
- Neutral / Neutral (18 replies): In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine. The thing that would now make the biggest difference isn't "more intelligence", whatever that...
- Negative / Anti (10 replies): Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image? 0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnO...
- Negative / Mixed (9 replies): I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "pro...
- Negative / Anti (5 replies): Looks like they've begun censoring posts at r/Codex and not allowing complaint threads so here is my honest take: - It is faster which is appreciated but not as fast as Opus 4.5 - I see no changes, very little noticeable improvements over 5.1 - I do not see any value in exchange ...
- Neutral / Pro (5 replies): I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/): The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9. The medium-reasoning version also improves: 62.7 → 72.1. The no-reasoning version also improves: ...
| Official · 2025-12-16 | gpt-image-1.5 (gpt-image) | 522 | 256 | |
- Positive / Pro (16 replies): Okay results are in for GenAI Showdown with the new gpt-image 1.5 model for the editing portions of the site! https://genai-showdown.specr.net/image-editing Conclusions: - OpenAI has always had some of the strongest prompt understanding alongside the weakest image fidelity. This u...
- Mixed / Mixed (4 replies): I have a Nano Banana Pro blog post in the works expanding on my experiments with Nano Banana (https://news.ycombinator.com/item?id=45917875). Running a few of my test cases from that post and the upcoming blog post through this new ChatGPT Image model, this new model is better...
- Negative / Anti (7 replies): If this was a farm of sweatshop Photoshopers in 2010, who download all images from the internet and provide a service of combining them on your request, this would escalate pretty quickly. Question: with copyright and authorship dead wrt AI, how do I make (at least) new content...
- Positive / Pro (2 replies): I am very impressed. A benchmark I like to run is have it create sprite maps, uv texture maps for an imagined 3d model. Noticed it captured a megaman legends vibe .... https://x.com/AgentifySH/status/2001037332770615302 and here it generated a texture map from a 3d character https:/...
- Negative / Anti (4 replies): It's really weird to see "make images from memories that aren't real" as a product pitch
| Official · 2025-12-18 | GPT-5.2 Codex (GPT) | 589 | 318 | |
- Very positive / Pro (15 replies): If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities! Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, metho...
- Positive / Pro (9 replies): I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging t...
- Mixed / Mixed (4 replies): Since they are not showing you how this model compares against the benchmarks they are showing, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context: SWE-Bench (Pro / Verified) Model | Pro (%) | Verified (%) -----------------...
- Positive / Pro (4 replies): The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is more clear, and code that is more maintainable. A pattern I use is, setup a Github issue with Claude plan mode, then have Codex execute it...
- Positive / Pro (3 replies): I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex s...
| Official · 2026-02-05 | GPT-5.3 Codex (GPT) | 1,530 | 605 | |
- Neutral / Pro (23 replies): Whats interesting to me is that these gpt-5.3 and opus-4.6 are diverging philosophically and really in the same way that actual engineers and orgs have diverged philosophically. With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the...
- Positive / Pro (6 replies): I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex! The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 from GPT-5.2-codex. GPT-5.3-codex scores 77.3.
- Mixed / Mixed (6 replies): "GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks...
- Positive / Pro (2 replies): Something that caught my eye from the announcement: > GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training. I'm happy to see the Codex team moving to this kind of dogfooding. I think this was cr...
- Neutral / Neutral (8 replies): I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
| Official · 2026-02-12 | GPT-5.3 Codex Spark (GPT) | 890 | 387 | |
- Positive / Pro (13 replies): Wow, I wish we could post pictures to HN. That chip is HUGE!!!! The WSE-3 is the largest AI chip ever built, measuring 46,255 mm² and containing 4 trillion transistors. It delivers 125 petaflops of AI compute through 900,000 AI-optimized cores — 19× more transistors and 28× mor...
- Positive / Pro (8 replies): I love this! I use coding agents to generate web-based slide decks where “master slides” are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it’s straightforward to generate a clean, predefined presentation. What I’d r...
- Mixed / Mixed (7 replies): First thoughts using gpt-5.3-codex-spark in Codex CLI: Blazing fast but it definitely has a small model feel. It's tearing up bluey bench (my personal agent speed benchmark), which is a file system benchmark where I have the agent generate transcripts for untitled episodes of a ...
- Very positive / Pro (10 replies): Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing
- Neutral / Neutral (2 replies): My stupid pelican benchmark proves to be genuinely quite useful here, you get a visual representation of the quality difference between GPT-5.3-Codex-Spark and full GPT-5.3-Codex: https://simonwillison.net/2026/Feb/12/codex-spark/
Official · 2026-02-13 — GPT-5.2 (GPT) — score 574 · 401 comments

- Neutral / Mixed · 26 replies: The headline may make it seem like AI just discovered some new result in physics all on its own, but reading the post, humans started off trying to solve some problem, it got complex, GPT simplified it and found a solution with the simpler representation. It took 12 hours for ...
- Positive / Pro · 15 replies: It's interesting to me that whenever a new breakthrough in AI use comes up, there's always a flood of people who come in to handwave away why this isn't actually a win for LLMs. Like with the novel solutions GPT 5.2 has been able to find for erdos problems - many users here (e...
- Positive / Pro · 2 replies: "An internal scaffolded version of GPT‑5.2 then spent roughly 12 hours reasoning through the problem, coming up with the same formula and producing a formal proof of its validity." When I use GPT 5.2 Thinking Extended, it gave me the impression that it's consistent enough/has a...
- Positive / Mixed · 9 replies: AI can be an amazing productivity multiplier for people who know what they're doing. This result reminded me of the C compiler case that Anthropic posted recently. Sure, agents wrote the code for hours but there was a human there giving them directions, scoping the problem, fin...
- Mixed / Mixed · 1 reply: It would be more accurate to say that humans using GPT-5.2 derived a new result in theoretical physics (or, if you're being generous, humans and GPT-5.2 together derived a new result). The title makes it sound like GPT-5.2 produced a complete or near-complete paper on its own,...
Official · 2026-03-03 — GPT-5.3 Chat (GPT) — score 395 · 301 comments

- Very negative / Anti · 22 replies: The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer. "Why it matters", "the big picture", "it's not just you", the awful emphasis, the quotations with rhetorical questions, etc.. I don't know if it's intentional so you can ea...
- Negative / Anti · 7 replies: I'm a bit confused by this branding (never even noticed that there was a 5.2-Instant), it's not a super fast 1000tok/s Cerebras based model which they have for codex-spark, it's just 5.2 w/out the router / "non-thinking" mode? I feel like openai is going to get right back to wh...
- Negative / Anti · 11 replies: Since the page mentions: "Better judgment around refusals" — has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, "make up a joke about <group>" and then ...
- Negative / Anti · 5 replies: I kind of chuckled when I read the headline "GPT‑5.3 Instant: Smoother, more ..." LLM companies starting to sound like cigarette advertisements.
- Negative / Anti · 6 replies: Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories on long range projectiles?
Official · 2026-03-05 — GPT-5.4 (GPT) — score 1,019 · 807 comments

- Mixed / Mixed · 13 replies: The marquee feature is obviously the 1M context window, compared to the ~200k other models support with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/ Also per...
- Negative / Anti · 11 replies: I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a team mate: "Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m...
- Positive / Pro · 7 replies: I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Cod...
- Negative / Anti · 19 replies: I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post f...
- Negative / Mixed · 10 replies: So let me get this straight, OpenAi previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5 which was more like a router that put all these models under the hood so you only had to prompt to GPT-5, and it w...
3rd party · 2026-03-05 — GPT-5.4 Pro (GPT) — score 93 · 2 comments

- Neutral / Neutral · 0 replies: Comments moved to https://news.ycombinator.com/item?id=47265045.
- Neutral / Neutral · 0 replies: More discussion: https://news.ycombinator.com/item?id=47265005
Official · 2026-03-17 — GPT-5.4 mini (GPT) — score 248 · 145 comments

- Positive / Pro · 7 replies: I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on the release day, but right now: - Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost). - GPT-...
- Positive / Pro · 6 replies: To me, mini releases matter much more and better reflect the real progress than SOTA models. The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them. Meanwhile, when a smaller / less powerful model releases a new ...
- Mixed / Mixed · 7 replies: I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them. They're incredibly slow (via official API or openrouter), but most of all they seem not to understand the instructions that I...
- Neutral / Neutral · 4 replies: Here's a grid of pelicans for the different models and reasoning levels: https://static.simonwillison.net/static/2026/gpt-5.4-pelican...
- Neutral / Anti · 2 replies: According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper? GPT 5 mini: Input $0.25 / Output $2.00. GPT 5 nano: Input $0.05 / Output $0.40. GPT 5.4 mini: Input $0.75 / Output $4.50. G...
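The pricing complaint in the comment above is easy to sanity-check with quick arithmetic. A minimal sketch using the per-1M-token prices quoted in that comment; the 100k-input / 10k-output workload is an arbitrary illustration, not a figure from this report:

```python
# Per-1M-token prices as quoted in the comment above: (input $, output $).
PRICES = {
    "gpt-5-mini":   (0.25, 2.00),
    "gpt-5-nano":   (0.05, 0.40),
    "gpt-5.4-mini": (0.75, 4.50),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the quoted per-million-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

old = request_cost("gpt-5-mini", 100_000, 10_000)    # $0.045
new = request_cost("gpt-5.4-mini", 100_000, 10_000)  # $0.12
print(f"{new / old:.2f}x")  # ~2.67x at this mix
```

At this input-heavy mix the mini tier comes out roughly 2.7× more expensive generation-over-generation, which supports the commenter's point that the small models are getting pricier rather than cheaper.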
Google DeepMind

Model and post engagement timeline

Comment reception over time
Each bubble is one post. Size = comment count, color = net score.

For each post below: source type and date, model (family), score and comment count, followed by the sampled top-level comments with their sentiment / stance labels and reply counts.
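The "net score" behind the bubble colors can be sketched as a weighted sum over a post's sampled comment labels. A minimal sketch, assuming a −2…+2 weighting of the five sentiment labels; the exact mapping is this sketch's assumption, not a documented part of the methodology:

```python
# Sketch of a per-post net sentiment score from sampled comment labels.
# The -2..+2 weights are an assumption for illustration; the report does
# not document the exact mapping behind the bubble colors.
WEIGHTS = {
    "Very positive": 2, "Positive": 1, "Neutral": 0,
    "Mixed": 0, "Negative": -1, "Very negative": -2,
}

def net_score(labels):
    """Sum of sentiment weights over a post's sampled top-level comments."""
    return sum(WEIGHTS[label] for label in labels)

# Example: the five sampled comments on the GPT-5.3 Chat post above.
print(net_score(["Very negative", "Negative", "Negative", "Negative", "Negative"]))  # -6
```

Under this weighting, a uniformly negative thread like the GPT-5.3 Chat post lands at −6, while a mostly "Pro" thread scores well above zero.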
Official · 2023-12-06 — Gemini (Gemini) — score 2,135 · 1,602 comments

- Neutral / Neutral · 0 replies: Related blog post: https://blog.google/technology/ai/google-gemini-ai/ (via https://news.ycombinator.com/item?id=38544746, but we merged the threads)
- Positive / Pro · 8 replies: Very impressive! I noticed two really notable things right off the bat: 1. I asked it a question about a feature that TypeScript doesn't have[1]. GPT4 usually does not recognize that it's impossible (I've tried asking it a bunch of times, it gets it right with like 50% probabil...
- Neutral / Neutral · 5 replies: For others that were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro. For the differences, looking at the technical report [1] on selected benchmarks, rounded score in %...
- Positive / Pro · 21 replies: This demo is nuts: https://youtu.be/UIZAiXYceBI?si=8ELqSinKHdlGlNpX
- Mixed / Mixed · 30 replies: One observation: Sundar's comments in the main video seem like he's trying to communicate "we've been doing this ai stuff since you (other AI companies) were little babies" - to me this comes off kind of badly, like it's trying too hard to emphasize how long they've been doing...
Official · 2024-02-15 — Gemini 1.5 Pro (Gemini) — score 1,244 · 588 comments

- Very positive / Pro · 36 replies: The white paper is worth a read. The things that stand out to me are: 1. They don't talk about how they get to 10M token context 2. They don't talk about how they get to 10M token context 3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creat...
- Neutral / Neutral · 0 replies: One interesting tidbit from the technical report: "HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conser...
- Positive / Pro · 8 replies: Massive whoa if true from technical report: "Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens" https://storage.googleapis.com/deepmind-media/gemini/g...
- Negative / Anti · 6 replies: 0 trust to what they put out until I see it live. After the last "launch" video which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form.
- Very negative / Anti · 4 replies: Personally, I've given up on Gemini, as it seems to have been censored to the point of uselessness. I asked it yesterday [0] about C++ 20 Concepts, and it refused to give actual code because I'm under 18 (I'm 17, and AFAIK that's what the age on my Google account is set to). I...
Official · 2024-02-21 — Gemma (Gemma) — score 1,129 · 509 comments

- Neutral / Neutral · 9 replies: Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B:
    | Benchmark | Gemma 7B | Mistral 7B | Llama-2 7B |
    | MMLU      | 64.3     | 60.1       | 45.3       |
    | HellaSwag | 81.2     | ...
- Negative / Anti · 15 replies: The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy Something that caught my eye in the terms: "Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma." One of the...
- Neutral / Neutral · 29 replies: Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models. Opinions are our own and not of Google DeepMind.
- Neutral / Neutral · 3 replies: I notice a few divergences to common models: - The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x; - The vocabulary size is 10x (256K vs. Mistral’s 32K); - The training token count is tripled (6T vs. Llama2's 2T). Apart from that, it uses the ...
- Negative / Anti · 6 replies: Is there a chance we'll get a model without the "alignment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine tuning.
3rd party · 2024-03-04 — Gemini 1.5 Pro (Gemini) — score 51 · 72 comments

- Mixed / Neutral · 13 replies: Within a couple sentences of each other he says (paraphrasing), “if you test any model (Gemini, ChatGPT, Grok) you’ll see it says far left things” and “we don’t know why [Gemini] is far left leaning”. I think he’s being honest here. At least at the executive level of having no ...
- Neutral / Neutral · 0 replies: Transcript: https://lifearchitect.ai/sergey/
- Negative / Anti · 0 replies: "The Gemini art? Yeah, with the… Okay. That wasn’t really expected [...] So, once again, we haven’t fully understood why it means “left” in many cases. And that’s not our intention." Riiiight.
- Neutral / Neutral · 0 replies: Anyone except me thinks he doesn't look very healthy? It's strange he is kind of slow on the video where he enters the room. Maybe some biohacking.
- Very negative / Anti · 4 replies: Let's be clear. This isn't a left or right thing. That's just a convenient argument to avoid the real problem. It's about moral relativism. I am about as left wing as you can be and gemini is simply offensive. It didn't want to say if Hitler was worse than Elon or Trump. It did...
Official · 2024-05-14 — Gemini Flash Latest (Gemini) — score 442 · 146 comments

- Neutral / Neutral · 0 replies: I upgraded my llm-gemini plugin to provide CLI access to Gemini Flash:
    pipx install llm  # or brew install llm
    llm install llm-gemini --upgrade
    llm keys set gemini  # paste API key here
    llm -m gemini-1.5-flash-latest 'a short poem about otters'
  https://github.com/simonw/llm-gemi...
- Mixed / Mixed · 9 replies: Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context. Not bad compared to rolling your own, but among frontier models the main competitive differentiator...
- Neutral / Mixed · 5 replies: 1M token context by default is the big feature here IMO, but we need better benchmarks to measure what that really means. My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and wil...
- Negative / Anti · 0 replies: A lightweight model that you can only use in the cloud? That is amusing. These tech megacorps are really intent on owning your usage of AI. But we must not let that be the future.
- Negative / Anti · 0 replies: One thing OpenAI is beating Google at is actually publishing pricing for its APIs (and being consistent as to what they are called.) Google has I think 10 models available (there’s more than ten model names, but several of the models have multiple aliases) through what the Goog...
Official · 2024-12-11 — Gemini 2.0 Flash (Gemini) — score 1,015 · 490 comments

- Very positive / Pro · 10 replies: https://aistudio.google.com/live is by far the coolest thing here. You can just go there and share your screen or camera and have a running live voice conversation with Gemini about anything you're looking at. As much as you want, for free. I just tried having it teach me how t...
- Positive / Pro · 5 replies: I released a new llm-gemini plugin with support for the Gemini 2.0 Flash model, here's how to use that in the terminal:
    llm install -U llm-gemini
    llm -m gemini-2.0-flash-exp 'prompt goes here'
  LLM installation: https://llm.datasette.io/en/stable/setup.html Worth noting that the...
- Positive / Pro · 12 replies: Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction. But, once they do get moving in the right direction they can achieve things that smaller companies can't. Google has an insane amount of talent in this sp...
- Positive / Pro · 3 replies: Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability. They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about ...
- Positive / Pro · 4 replies: Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
3rd party · 2025-02-05 — Gemini 2.0 Flash (Gemini) — score 1,303 · 447 comments

- Positive / Pro · 33 replies: I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate that multi-modal, large...
- Negative / Anti · 17 replies: This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result. You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images. You...
- Negative / Neutral · 6 replies: Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.
- Very negative / Anti · 5 replies: We are driving full speed into a xerox 2.0 moment and this time we are doing so knowingly. At least with xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I won...
- Positive / Mixed · 6 replies: (disclaimer I am CEO of llamaindex, which includes LlamaParse) Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so we...
Paper · 2025-03-12 — Gemma 3 27B (Gemma) — score 488 · 245 comments

- Neutral / Neutral · 6 replies: Gemma 3 is out! Multimodal (image + text), 128K context, supports 140+ languages, and comes in 1B, 4B, 12B, and 27B sizes with open weights & commercial use. Gemma 3 model overview: https://ai.google.dev/gemma/docs/core Huggingface collection: https://huggingface.co/collecti...
- Neutral / Neutral · 11 replies: Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP. (Opinions our own and not of Google DeepMind.) PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957
- Positive / Pro · 3 replies: Lots to be excited about here - in particular new architecture that allows subquadratic scaling of memory needs for long context; looks like 128k+ context is officially now available on a local model. The charts make it look like if you have the RAM the model is pretty good ou...
- Neutral / Neutral · 0 replies: Linking to the announcement (which links to his PDF) would probably be more useful. Introducing Gemma 3: The most capable model you can run on a single GPU or TPU — https://blog.google/technology/developers/gemma-3/
- Mixed / Mixed · 3 replies: Very cool open release. Impressive that a 27b model can be as good as the much bigger state of the art models (according to their table of Chatbot Arena, tied with O1-preview and above Sonnet 3.7). But the example image shows that this model still makes dumb errors or has a poo...
Official · 2025-03-12 — Gemini 2.0 Flash (Gemini) — score 101 · 12 comments

- Negative / Anti · 4 replies: It just tells me 'content not permitted'. For context, I was attempting to put a cup of hot chocolate into the hands of an anime character.
- Positive / Pro · 1 reply: Ever since OpenAI showed (but did not release) this type of multimodal output with 4o, I have been waiting for this to be available to the general public. It seems like really combining visuals at the level of generation capability means language understanding is fully grounded...
- Negative / Anti · 0 replies: Curious that they use "Use Gemini 2.0 Flash to tell a story and it will illustrate it with pictures, keeping the characters and settings consistent throughout", and their example is not consistent at all between two pictures.
- Negative / Anti · 0 replies: I was really hoping that there would be more character consistency, given the fact they mention it in the blog. It also doesn't seem to reliably follow styles like "watercolor illustration" or "line and wash".
Official · 2025-03-25 — Gemini 2.5 Flash (Gemini) — score 973 · 485 comments

- Very positive / Pro · 14 replies: One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it. Imagine for instance you give the LLM the profile of the love interest for your...
- Very positive / Pro · 22 replies: I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand. Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. ...
- Positive / Pro · 4 replies: I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those. Plus it drew me a very decent pelican r...
- Positive / Pro · 3 replies: Tops our benchmark in an unprecedented way. https://help.kagi.com/kagi/ai/llm-benchmark.html High quality, to the point. Bit on the slow side. Indeed a very strong model. Google is back in the game big time.
- Positive / Pro · 2 replies: Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%. This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats. [0] https://aider.c...
Official · 2025-04-17 — Gemini 2.5 Flash (Gemini) — score 1,076 · 560 comments

- Very positive / Pro · 20 replies: Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past. Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's m...
- Neutral / Pro · 3 replies: An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API. My llm-gemini plugin supports that: https://github.com/simonw/llm-gemini
    uv tool install llm
    llm install llm-gemini
    llm keys set gemini  # paste key here
    llm -...
- Positive / Pro · 18 replies: Gemini flash models have the least hype, but in my experience in production have the best bang for the buck and multimodal tooling. Google is silently winning the AI race.
- Positive / Pro · 6 replies: One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generate 2D bounding boxes of relevant subjects, but it can also create segmentation masks! https://ai.google.dev/gemini-api/docs/image-und...
- Positive / Pro · 3 replies: For a non programmer like me google is becoming shockingly good. It is giving working code the first time. I was playing around with it asked it to write code to scrape some data of a website to analyse. I was expecting it to write something that would scrape the data and late...
Official · 2025-04-20 — Gemma 3 27B (Gemma) — score 602 · 276 comments

- Very positive / Pro · 11 replies: I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B. I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of mem...
- Very positive / Pro · 1 reply: I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I...
- Negative / Anti · 3 replies: First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn'...
- Very positive / Pro · 3 replies: Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac. gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also. I still like Gemini 2.5 Pro ...
- Positive / Pro · 0 replies: Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of ent...
Official · 2025-05-20 — Gemma 3n 4B (Gemma) — score 441 · 163 comments

- Positive / Pro · 12 replies: You can try it on Android right now: Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0 Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6... Import the .task file in ...
- Neutral / Neutral · 3 replies: Probably a better link: https://developers.googleblog.com/en/introducing-gemma-3n/ Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model. At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot ...
- Mixed / Mixed · 2 replies: According to the readme here - https://huggingface.co/google/gemma-3n-E4B-it-litert-preview E4B has a score of 44.4 in the Aider polyglot dashboard. Which means its on-par with gemini-2.5-flash (not the latest preview but the version used for the bench on aider's website), gpt4...
- Positive / Pro · 1 reply: On Hugging face I see 4B and 2B versions now - https://huggingface.co/collections/google/gemma-3n-preview-6... Gemma 3n Preview: google/gemma-3n-E4B-it-litert-preview, google/gemma-3n-E2B-it-litert-preview. Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and or MoE model...
- Mixed / Pro · 0 replies: It seems to work quite well on my phone. One funny side effect I've found is that it's much easier to bypass the censorship in these smaller models than in the larger ones, and with the complexity of the E4B variant I wouldn't have expected the "roleplay as my father who is ex...
Official · 2025-08-14 — Gemma 3 270M (Gemma) — score 824 · 317 comments

- Positive / Pro · 30 replies: Hi all, I built these models with a great team and am thrilled to get them out to you. They're available for download across the open model ecosystem so give them a try! From our side we designed these models to be strong for their size o...
- Negative / Anti · 18 replies: My lovely interaction with the 270M-F16 model:
    > what's second tallest mountain on earth?
    The second tallest mountain on Earth is Mount Everest.
    > what's the tallest mountain on earth?
    The tallest mountain on Earth is Mount Everest.
    > whats the second tallest mountain?
    The ...
- Positive / Pro · 3 replies: I've got a very real world use case I use DistilBERT for - learning how to label wordpress articles. It is one of those things where it's kind of valuable (tagging) but not enough to spend loads on compute for it. The great thing is I have enough data (100k+) to fine-tune and r...
- Mixed / Mixed · 14 replies: This model is a LOT of fun. It's absolutely tiny - just a 241MB download - and screamingly fast, and hallucinates wildly about almost everything. Here's one of dozens of results I got for "Generate an SVG of a pelican riding a bicycle". For this one it decided to write a poem: ...
- Positive / Pro · 6 replies: Apple should be doing this. Unless their plan is to replace their search deal with an AI deal -- it's just crazy to me how absent Apple is. Tim Cook said, "it's ours to take" but they really seem to be grasping at the wind right now. Go Google!
Official · 2025-08-26 — Gemini 2.5 Flash Image (Gemini) — score 1,093 · 475 comments

- Very positive / Pro · 19 replies: This is the gpt 4 moment for image editing models. Nano banana aka gemini 2.5 flash is insanely good. It made a 171 elo point jump in lmarena! Just search nano banana on Twitter to see the crazy results. An example: https://x.com/D_studioproject/status/1958019251178267111
- Positive / Pro · 7 replies: I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana). https://genai-showdown.specr.net This model gets 8 of the 12 prompts correct and easily comes within ...
- Very negative / Anti · 4 replies: Unfortunately, it suffers from the same safetyism as many other releases. Half of the prompts get rejected. How can you have character consistency if the model is forbidden from editing any human? And most of my photo editing involves humans, so basically this is just a usel...
- Positive / Pro · 3 replies: There is one thing Gemini 2.5 Flash Image can do that no other edit model can do: incorporate multiple images simultaneously without shenanigans due to its multimodality, e.g. for Flux Kontext, if you want to "put the person in the first image into the second image", you have ...
- Positive / Pro · 6 replies: I digitised our family photos but a lot of them were damaged (shifted colours, spills, fingerprints on film, spots) that are difficult to correct for so many images. I've been waiting for image gen to catch up enough to be able to repair them all in bulk without changing detai...
Official · 2025-09-25 — Gemini 2.5 Flash (Gemini) — score 540 · 279 comments

- Negative / Anti · 14 replies: This really captures something I've been experiencing with Gemini lately. The models are genuinely capable when they work properly, but there's this persistent truncation issue that makes them unreliable in practice. I've been running into it consistently, responses that just s...
- Neutral / Neutral · 2 replies: I added support to these models to my llm-gemini plugin, so you can run them like this (using uvx so no need to install anything first):
    export LLM_GEMINI_KEY='...'
    uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks'
  Re...
- Negative / Neutral · 5 replies: Serious question: If it's an improved 2.5 model, why don't they call it version 2.6? Seems annoying to have to remember if you're using the old 2.5 or the new 2.5. Kind of like when Apple released the third-gen iPad many years ago and simply called it the "new iPad" without a ...
- Mixed / Mixed · 8 replies: Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflo...
- Neutral / Neutral · 5 replies: Non-AI Summary: Both models have improved intelligence on Artificial Analysis index with lower end-to-end response time. Also 24% to 50% improved output token efficiency (resulting in lower cost). Gemini 2.5 Flash-Lite improvements include better instruction following, reduced v...
Paper · 2025-11-18 — Gemini 3 Pro Preview (Gemini) — score 280 · 335 comments

- Neutral / Neutral · 19 replies: Benchmarks from page 4 of the model card:
    | Benchmark            | 3 Pro | 2.5 Pro | Sonnet 4.5 | GPT-5.1 |
    | Humanity's Last Exam | 37.5% | 21.6%   | 13.7%      | 26.5%   |
    | ARC-AGI-2            | 31.1% | 4.9%    | 13.6%      | 17.6%   |
    | GPQ...
- Positive / Mixed · 13 replies: It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
- Positive / Pro · 1 reply: One benchmark I would really like to see: instruction adherence. For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outp...
- Neutral / Neutral · 8 replies: There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
- Neutral / Neutral · 6 replies: Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/[email protected]` which obviously fails... Anyone happen to know why? Is this website by any chance sharing information on safe medical abo...
**Official · 2025-11-18 · Gemini 3 Pro Preview (Gemini)** · score 1,735 · 1,056 comments

- **Positive / Pro** (19 replies): Out of curiosity, I gave it the latest project euler problem published on 11/16/2025, very likely out of the training data. Gemini thought for 5m10s before giving me a python snippet that produced the correct answer. The leaderboard says that the 3 fastest human to solve this pr...
- **Very positive / Pro** (1 reply): This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... I spent years building a compiler that takes our custom XML format and generat...
- **Positive / Pro** (11 replies): Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, e...
- **Neutral / Pro** (5 replies): Static Pelican is boring. First attempt: Generate SVG animation of following: 1 - There is High fantasy mage tower with a top window a dome; 2 - Green goblin come in front of tower with a torch; 3 - Grumpy old mage with beard appear in a tower window in high purple hat; 4 - Mage sends...
- **Negative / Anti** (14 replies): I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong). For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, ...
**Official · 2025-12-05 · Gemini 3 Pro Preview (Gemini)** · score 566 · 295 comments

- **Mixed / Mixed** (27 replies): Well. It is the first model to get partial-credit on an LLM image test I have. Which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs. In fact GPT5 wrote an edge detection...
- **Positive / Pro** (4 replies): I do some electrical drafting work for construction and throw basic tasks at LLMs. I gave it a shitty harness and it almost 1 shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers jobs very soon
- **Positive / Pro** (2 replies): These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this fro...
- **Neutral / Neutral** (3 replies): Interesting "ScreenSpot Pro" results: 72.7% Gemini 3 Pro, 11.4% Gemini 2.5 Pro, 49.9% Claude Opus 4.5, 3.50% GPT-5.1. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use: https://arxiv.org/abs/2504.07981
- **Neutral / Neutral** (4 replies): In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
**Official · 2025-12-17 · Gemini 3 Flash Preview (Gemini)** · score 1,102 · 580 comments

- **Very positive / Pro** (24 replies): Don’t let the “flash” name fool you, this is an amazing model. I have been playing with it for the past few weeks, it’s genuinely my new favorite; it’s so fast and it has such a vast world knowledge that it’s more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fra...
- **Mixed / Mixed** (8 replies): This is awesome. No preview release either, which is great to production. They are pushing the prices higher with each release though: API pricing is up to $0.5/M for input and $3/M for output. For comparison: Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output; Gemini 2.5 Fl...
- **Positive / Pro** (4 replies): Feels like Google is really pulling ahead of the pack here. A model that is cheap, fast and good, combined with Android and gsuite integration seems like such powerful combination. Presumably a big motivation for them is to be first to get something good and cheap enough they c...
- **Mixed / Mixed** (6 replies): These flash models keep getting more expensive with every release. Is there an OSS model that's better than 2.0 flash with similar pricing, speed and a 1m context window? Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world ...
- **Positive / Pro** (3 replies): This model is breaking records on my benchmark of choice, which is 'the fraction of Hacker News comments that are positive.' Even people who avoid Google products on principle are impressed. Hardly anyone is arguing that ChatGPT is better in any respect (except brand recogniti...
**Official · 2026-02-12 · Gemini 3 Pro Preview (Gemini)** · score 1,081 · 693 comments

- **Positive / Pro** (15 replies): Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6). Wow. https://blog.google/innovation-and-ai/models-and-research/ge...
- **Negative / Neutral** (11 replies): Is it me or is the rate of model release is accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had K...
- **Positive / Pro** (3 replies): I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a f...
- **Very positive / Pro** (17 replies): Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.
- **Neutral / Neutral** (3 replies): Here is the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_... The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved". > Submi...
**Official · 2026-02-19 · Gemini 3.1 Pro Preview (Gemini)** · score 963 · 914 comments

- **Negative / Anti** (31 replies): I hope this works better than 3.0 Pro. I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development. It's stunningly good at reasoning, design, and generatin...
- **Positive / Pro** (22 replies): People underrate Google's cost effectiveness so much. Half price of Opus. HALF. Think about ANY other product and what you'd expect from the competition thats half the price. Yet people here act like Gemini is dead weight. Update: 3.1 was 40% of the cost to run AA index vs Opu...
- **Mixed / Mixed** (2 replies): If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts). So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too. While it gives me hope, I am going to pl...
- **Neutral / Neutral** (9 replies): Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3 Compare to Op...
- **Mixed / Mixed** (8 replies): These models are so powerful. It's totally possible to build entire software products in the fraction of the time it took before. But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent. It feels lik...
**Official · 2026-03-03 · Gemini 3.1 Flash Lite Preview (Gemini)** · score 60 · 30 comments

- **Neutral / Mixed** (2 replies): Lots of comments about the price change, but Artifical Analysis reports that 3.1 Flash-Lite (reasoning) used fewer than half of the tokens of 2.5 Flash-Lite (reasoning). This will likely bring the cost below 2.5 flash-lite for many tasks (depends on the ratio of input to output...
- **Negative / Anti** (0 replies): Unfortunate, significant price increase for a 'lite' model: $0.25 IN / $1.50 OUT vs. Gemini 2.5 Flash-Lite $0.10 IN / $0.40 OUT.
- **Negative / Anti** (4 replies): For the last 2 years, startup wisdom has been that models will continue to get cheaper and better. Claude first, and now Gemini has shown that it's not the case. We priced an enterprise contract using Flash 1.5 pricing last summer, and today that contract would be unit economic...
- **Neutral / Anti** (0 replies): That's a 150% increase in the input costs and 275% increase on output costs over the same sized previous generation (2.5-flash-lite) model
- **Positive / Pro** (2 replies): You can test Gemini 3.1 Lite transcription capabilities in https://ottex.ai — the only dictation app supporting Gemini models with native audio input. We benchmarked it for real-life voice-to-text use cases:

  | | <10s | 10-30s | 30s-1m | 1-2m | 2-3m |
  |---|---|---|---|---|---|
  | Flash | 2548 | 2732 | 3177 | 4583 | 5961 |
  | Flash L...
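The percentage increases quoted in these comments can be checked directly from the per-million-token prices the commenters cite ($0.10/$0.40 for 2.5 Flash-Lite, $0.25/$1.50 for 3.1 Flash-Lite); a minimal sketch, using only those quoted figures:

```python
# Check the cost-increase percentages quoted for Gemini 3.1 Flash-Lite
# vs. Gemini 2.5 Flash-Lite (prices in $/M tokens, as quoted in the comments).
old_in, old_out = 0.10, 0.40   # 2.5 Flash-Lite: input, output
new_in, new_out = 0.25, 1.50   # 3.1 Flash-Lite: input, output

def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100

print(f"input:  +{pct_increase(old_in, new_in):.0f}%")   # input:  +150%
print(f"output: +{pct_increase(old_out, new_out):.0f}%") # output: +275%
```

Both figures match the "150% / 275%" comment above, so the labeled Anti stance rests on accurate arithmetic.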
**Official · 2026-04-02 · Gemma 4 26B (Gemma)** · score 1,811 · 474 comments

- **Positive / Pro** (21 replies): Thinking / reasoning + multimodal + tool calling. We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well! Guide for those interested: https://unsloth.ai/docs/models/gemma-4 Also note to use temperature = 1.0, top_p ...
- **Mixed / Mixed** (9 replies): I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop. https://simonwillison.net/2026/Apr/2/gemma-4/ The gemma-4-31b model is compl...
- **Neutral / Neutral** (3 replies): Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

  | Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t |
  |---|---|---|---|---|---|---|---|---|
  | G4 31B | 85.2% | ...

- **Negative / Mixed** (6 replies): Prompt: > what is the Unix timestamp for this: 2026-04-01T16:00:00Z. Qwen 3.5-27b-dwq: > Thought for 8 minutes 34 seconds. 7074 tokens. > The Unix timestamp for 2026-04-01T16:00:00Z is: > 1775059200 (my comment: Wednesday, 1 April 2026 at 16:00:00). Gemma-4-26b-a4b: > Thou...
- **Neutral / Neutral** (23 replies): Hi all! I work on the Gemma team, one of many as this one was a bigger effort given it was a mainline release. Happy to answer whatever questions I can
Anthropic
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
**Official · 2025-02-24 · Claude Sonnet 3.7 (Claude)** · score 2,127 · 963 comments

- **Positive / Pro** (18 replies): Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING. Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5. Aider 0.75.0 is out with support for 3.7 Sonnet [1]. Thinking supp...
- **Neutral / Neutral** (82 replies): Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product.
- **Positive / Pro** (8 replies): Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7. https://help.kagi.com/kagi/ai/llm-benchmark.html Appears to be second most capable general purpose LLM we tried (second to gemini 2.0 pro, in front of gpt-4o). Less impressive in thinking mode, abo...
- **Positive / Pro** (91 replies): You can get your HN profile analyzed by it and it's pretty funny :) https://hn-wrapped.kadoa.com/ I'm using this to test the humor of new models.
- **Positive / Pro** (2 replies): I'm somewhat impressed from the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a CloudFlare pages function would return 500 + nonsensical error and an empty response in prod. Tried to figure this out all Friday. It was...
**Official · 2025-05-22 · Claude 4 (Claude)** · score 2,013 · 1,170 comments

- **Neutral / Pro** (10 replies): An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025) https://docs.anthropic.com/en/docs/about-claude/models/overv...
- **Positive / Pro** (7 replies): “GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.” Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handl...
- **Negative / Anti** (8 replies): > Users requiring raw chains of thought for advanced prompt engineering can contact sales. So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly ref...
- **Negative / Anti** (13 replies): I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases "feature" are more or less just gimmicks.
- **Mixed / Anti** (1 reply): Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing ...
**Official · 2025-08-05 · Claude Opus 4.1 (Claude)** · score 841 · 328 comments

- **Neutral / Neutral** (10 replies): All three major labs released something within hours of each other. This anime arc is insane.
- **Mixed / Mixed** (5 replies): Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, sp...
- **Negative / Anti** (22 replies): I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certa...
- **Neutral / Neutral** (1 reply): They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon (He had been stuck in the Team Rocket hideout (I believe) for weeks)
- **Very negative / Anti** (4 replies): Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. Wish they let you tell somehow. At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day. I've basically wasted the morning o...
**Official · 2025-08-12 · Claude Sonnet 4 (Claude)** · score 1,324 · 692 comments

- **Mixed / Mixed** (16 replies): This is definitely one of my CORE problem as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context and it's not actually that interesting to see a new model that's marginally better than the next one (...
- **Neutral / Pro** (12 replies): A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max): 1. Build context for the work you're doing. Put lots of your codebase into the context window. 2. Do work, but ...
- **Mixed / Mixed** (4 replies): This is definitely good to have this as an option but at the same time having more context reduces the quality of the output because it's easier for the LLM to get "distracted". So, I wonder what will happen to the quality of code produced by tools like Claude Code if users do...
- **Mixed / Mixed** (20 replies): My experience with the current tools so far: 1. It helps to get me going with new languages, frameworks, utilities or full green field stuff. After that I expend a lot of time parsing the code to understand what it wrote that I kind of "trust" it because it is too tedious but "...
- **Positive / Pro** (8 replies): One of the most helpful usages of CC so far is when I simply ask: "Are there any bugs in the current diff". It analyzes the changes very thoroughly, often finds very subtle bugs that would cost hours of time/deployments down the line, and points out a bunch of things to think thr...
**Official · 2025-09-29 · Claude Sonnet 4.5 (Claude)** · score 1,585 · 786 comments

- **Positive / Pro** (12 replies): I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of bench...
- **Negative / Anti** (21 replies): Anecdotal evidence. I have a fairly large web application with ~200k LoC. Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI): "implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing ...
- **Very negative / Anti** (8 replies): I haven't shouted into the void for a while. Today is as good a day as any other to do so. I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection b...
- **Negative / Anti** (8 replies): > Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks. Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all ...
- **Mixed / Mixed** (7 replies): I just ran this through a simple change I’ve asked Sonnet 4 and Opus 4.1, and it fails too. It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this chang...
**Official · 2025-10-15 · Claude Haiku 4.5 (Claude)** · score 730 · 287 comments

- **Mixed / Mixed** (6 replies): Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being t...
- **Negative / Neutral** (19 replies): Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...
- **Neutral / Neutral** (1 reply): I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.
- **Positive / Pro** (8 replies): Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...
- **Neutral / Neutral** (4 replies): I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.
**Official · 2025-11-24 · Claude Opus 4.5 (Claude)** · score 1,113 · 506 comments

- **Positive / Pro** (19 replies): The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads. Also notable: they're claiming SOTA prompt injection resi...
- **Negative / Mixed** (15 replies): This is gonna be game-changing for the next 2-4 weeks before they nerf the model. Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”. Then a sacrificial Anthropic engineer will “discover” a couple obscure bugs that “in some cases”...
- **Positive / Pro** (24 replies): I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else. I ...
- **Mixed / Mixed** (1 reply): The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks. There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded...
- **Positive / Pro** (9 replies): Seeing these benchmarks makes me so happy. Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent. This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting. Ive been hol...
**Official · 2026-02-05 · Claude Opus 4.6 (Claude)** · score 2,346 · 1,031 comments

- **Very positive / Pro** (37 replies): Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books. All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it) For now you can fit the first 4 ...
- **Positive / Anti** (7 replies): 5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!
- **Mixed / Mixed** (12 replies): I'm still not sure I understand Anthropic's general strategy right now. They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding. Meanwhile, Claude's general use cases are... fine. For generic r...
- **Neutral / Neutral** (1 reply): Claude Code release notes: > Version 2.1.32: • Claude Opus 4.6 is now available! • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) • Claude now automatically records ...
- **Mixed / Mixed** (23 replies): The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
**Official · 2026-02-05 · Claude Opus 4.6 (Claude)** · score 154 · 50 comments

- **Mixed / Mixed** (4 replies): Lately my company has been doing a lot of complex accounting and reporting in spreadsheets. Overall was surprised by how well both GPT and Claude handled some of these extremely tedious tasks. Not uncommon to have an hours-long task compressed to minutes. My anecdotal experienc...
- **Neutral / Neutral** (0 replies): Based on the article... is this basically just making Claude better at formatting and data presentation, or does it also get better at analysis? I get the impression it's the former.
- **Mixed / Mixed** (0 replies): The benchmarks look good. Slide decks and spreadsheets look better. The people must use Claude Cowork and have their Claude Code moment and figure out the consequences. It will be really interesting to see articles like this (https://mitchellh.com/writing/my-ai-adoption-journe...
- **Mixed / Mixed** (1 reply): And then you hand it to your boss who takes a 20 second look at it and asks why you made a projection that assume massive revenue growth and 3 years of perfectly flat utilities, insurance, G&A - no inflation etc. It does look really promising as a skeleton starting point th...
- **Negative / Neutral** (0 replies): > The side-by-side outputs below show how output quality has improved from Claude Opus 4.5 to Opus 4.6. Disclaimer: I use AI to code (and I code for finance) and I love Anthropic. But: for f-ck's sake, I cannot click on the picture and have it show up in full. It stays at its...
**Official · 2026-02-05 · Claude Opus 4.6 (Claude)** · score 212 · 75 comments

- **Negative / Anti** (8 replies): Considering all the problems they've been having with over-charging Claude Code users over the past few weeks it's the very least they could do. Max subscribers are hitting their 5 hour usage limits in 30-40 minutes with a single instance doing light work, while Anthropic have...
- **Negative / Anti** (1 reply): Is this one of those, Hey turn off the overcharge protection because you can go $50 into debt for free, and then maybe you'll just keep going and not notice you owe us an extra $500 type of situations?
- **Neutral / Neutral** (1 reply): Meanwhile codex is running a free tier for a month right now for anyone who wants to experiment with it: https://openai.com/index/introducing-the-codex-app/ Doesn't appear to include the new model though, only the state-of-yesterdays-art (literally yesterdays).
- **Very negative / Anti** (2 replies): I'll pass on this $50, but please hire real human and fix your crappy app, Claude! This bug has been for years: in Claude (web or app), if you create a new chat at the middle of existing chat thinking or tool calling, the existing chat will be broken, either losing data, or bec...
- **Neutral / Neutral** (0 replies): You can use this if you started your Pro or Max subscription before Wednesday, February 4, 2026 at 11:59 PM PT. Go to https://claude.ai/settings/usage, turn on extra usage and enable the promo from the notification afterwards. I received €42, top up was not required and auto-rel...
**Official · 2026-02-17 · Claude Sonnet 4.6 (Claude)** · score 1,346 · 1,226 comments

- **Very negative / Anti** (15 replies): I see a big focus on computer use - you can tell they think there is a lot of value there and in truth it may be as big as coding if they convincingly pull it off. However I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their o...
- **Negative / Neutral** (4 replies): They use the word "Sonnet" 60+ times on that page but never give the casual reader any context of what a "Sonnet model" actually is. Neither does their landing page. You have to scroll all the way to the footer to find a link under the "Models" section. You click it and you fi...
- **Mixed / Mixed** (9 replies): I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems which spans ~16 years. It is a far cry from Opus 4.6. Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 pro. Didn't hallucinate anything and produced honestly mind-blowing ana...
- **Neutral / Neutral** (16 replies): The demise of saas has been overplayed imho. When companies buy software they are essentially buying something that solves a problem and the insurance that comes with that. Part of that means they get to pick up the phone and complain if something doesn't work and someone on t...
- **Positive / Neutral** (9 replies): I always grew up hearing “competition is good for the consumer.” But I never really internalized how good fierce battles for market share are. The amount of competition in a space is directly proportional to how good the results are for consumers.
Alibaba
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
**Official · 2024-06-06 · Qwen2 (Qwen)** · score 261 · 130 comments

- **Positive / Pro** (6 replies): A 0.5B parameter model with a 32k context length that also makes good use of that full window?! That's very interesting. The academic benchmarks on that particular model relative to 1.5B-2B models are what you would expect, but it would make for an excellent base for finetuning...
- **Neutral / Neutral** (2 replies): Is it a common practice in LLMs to give different weights to different training data sources? For instance I might want to say that all training data that comes from my inhouse emails take precedence over anything that comes from the internet?
- **Positive / Pro** (0 replies): This model has: 1. On par or better performance than Llama-3-70B-Instruct 2. A much more comfortable context length of 128k (vs the tiny 8k that really hinders Llama-3). These 2 feats together will probably make it the first serious OS rival to GPT-4!
- **Negative / Anti** (5 replies): Weird, every time I try asking what happened at Tiananmen Square, or why Xi is an outlier with 3 terms as party secretary, it errors. "All hail Glorious Xi :)" works though. https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct
- **Neutral / Neutral** (8 replies): Given the restrictions on GPUs to China, I'm curious what their training cluster looks like. (not saying this out of any support or non-support for such a GPU blockade; I'm just genuinely curious)
**Official · 2024-11-11 · Qwen2.5-Coder 32B Instruct (Qwen)** · score 23 · 0 comments · no labeled comments
**Official · 2024-11-27 · QwQ 32B (Qwen)** · score 438 · 421 comments

- **Very positive / Pro** (3 replies): This one is crazy. I made up a silly topology problem which I guessed wouldn't be in a textbook (given X create a shape with Euler characteristic X) and set it to work. Its first effort was a program that randomly generated shapes, calculated X and hoped it was right. I went a...
- **Positive / Pro** (5 replies): This one is pretty impressive. I'm running it on my Mac via Ollama - only a 20GB download, tokens spit out pretty fast and my initial prompts have shown some good results. Notes here: https://simonwillison.net/2024/Nov/27/qwq/
- **Positive / Pro** (1 reply): QwQ can solve a reverse engineering problem [0] in one go that only o1-preview and o1-mini have been able to solve in my tests so far. Impressive, especially since the reasoning isn't hidden as it is with o1-preview. [0] https://news.ycombinator.com/item?id=41524263
- **Positive / Pro** (2 replies): 32B is a good choice of size, as it allows running on a 24GB consumer card at ~4 bpw (RTX 3090/4090) while using most of the VRAM. Unlike llama 3.1, which had 8b, 70B (much too big to fit), and 405B.
- **Negative / Mixed** (7 replies): I asked the classic 'How many of the letter “r” are there in strawberry?' and I got an almost never ending stream of second guesses. The correct answer was ultimately provided but I burned probably 100x more clockcycles than needed. See the response here: https://pastecode.io/s...
**Official · 2025-01-26 · Qwen2.5 14B Instruct (Qwen)** · score 307 · 108 comments

- **Neutral / Neutral** (0 replies): Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/ (via https://news.ycombinator.com/item?id=42832838, but we merged that thread hither)
- **Negative / Anti** (13 replies): In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, et...
- **Neutral / Neutral** (4 replies): Ollama has a num_ctx parameter that controls the context window length - it defaults to 2048. At a guess you will need to set that.
- **Neutral / Neutral** (1 reply): Here are tips for running it on macOS using MLX: https://twitter.com/awnihannun/status/1883611098081099914 - using https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-...
- **Neutral / Neutral** (3 replies): What's the SOTA for memory-centric computing? I feel like maybe we need a new paradigm or something to bring the price of AI memory down. Maybe they can take some of those hundreds of billions and invest in new approaches. Because racks of H100s are not sustainable. But it's cle...
|
Official
2025-03-05
|
QwQ 32B
Qwen
|
480 | 169 |
|
|
Mixed
Mixed
10 repliesr
Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply.EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't beli... |
||||
|
Mixed
Pro
6 replies
The Chinese strategy is to open-source the software part and earn on the robotics part. And they are already ahead of everyone in that game. These things are pretty interesting as they are developing. What will the US do to retain its power? BTW I am Indian and we are not even in the race as coun... |
||||
|
Positive
Pro
4 replies
I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too. 20x smaller than DeepSeek! How small can these go? What kind of hardware can run this? |
||||
|
Neutral
Neutral
5 replies
To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ. |
||||
|
Mixed
Mixed
2 replies
It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported - it sometimes loses the plot after producing a lot of reasoning tokens. Overall though quite impressive i... |
||||
|
Official
2025-03-24
|
Qwen2.5-VL-32B
Qwen
|
544 | 293 |
|
|
Neutral
Neutral
5 replies
Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/ |
||||
|
Very positive
Pro
1 reply
This model is available for MLX now, in various different sizes. I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this: uv run --with 'numpy<2' --with mlx-vlm \ ... |
||||
|
Very positive
Pro
2 replies
We were using Llama vision 3.2 a few months back and were very frustrated with it (both in terms of speed and result quality). One day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our ... |
||||
|
Positive
Pro
9 replies
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced M... |
||||
|
Neutral
Neutral
10 replies
Silly question: how can OpenAI, Claude and all, have a valuation so large considering all the open source models? Not saying they will disappear or be tiny (closed models), but why so so so valuable? |
||||
|
3rd party
2025-03-27
|
Qwen2.5-Omni 7B
Qwen
|
29 | 3 |
|
|
Positive
Pro
0 replies
Qwen has been on fire lately. First QwQ now this! |
||||
|
Negative
Anti
1 reply
Dunno if the article author is in this thread, but there's an issue with the links. Hugging Face goes to GitHub, GitHub goes to nowhere. |
||||
|
Official
2025-04-03
|
QVQ Max
Qwen
|
121 | 26 |
|
|
Positive
Pro
1 reply
I think this is old news, but this model does better than llama 4 maverick on coding. |
||||
|
Neutral
Neutral
5 replies
I wonder why we are getting these drops during the weekend. Is the AI race truly that heated? |
||||
|
Neutral
Neutral
1 reply
The about page doesn’t shed light on the composition of the core team, nor their sources of income or funding. Am I overlooking something? > We are a group of people with diverse talents and interests. |
||||
|
Mixed
Mixed
0 replies
I’ve tried QVQ before; one issue I always stumbled upon is that it entered an endless loop of reasoning. It felt wrong letting it run for too long, I didn’t want to over-consume Alibaba's GPUs. I’ll try QVQ-Max when I get home. It’s kinda fascinating how these models can interpr... |
||||
|
Official
2025-04-28
|
Qwen3 14B
Qwen
|
869 | 388 |
|
|
Negative
Anti
18 replies
I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Q... |
||||
|
Positive
Pro
4 replies
They have got pretty good documentation too[1]. And it looks like we have day 1 support for all major inference stacks, plus so many size choices. Quants are also up because they have already worked with many community quant makers. Not even going into performance, need to test fi... |
||||
|
Mixed
Mixed
6 replies
As is now traditional for new LLM releases, I used Qwen 3 (32B, run via Ollama on a Mac) to summarize this Hacker News conversation about itself - run at the point when it hit 112 comments. The results were kind of fascinating, because it appeared to confuse my system prompt te... |
||||
|
Neutral
Neutral
12 replies
Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a ... |
||||
|
Neutral
Neutral
15 replies
With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU? I.e. I have a Quadro RTX 4000 with 8G VRAM, and seeing all the models https://ollama.com/search here with all... |
||||
|
Official
2025-07-22
|
Qwen3 Coder Flash
Qwen
|
765 | 366 |
|
|
Neutral
Neutral
13 replies
I'm currently making 2bit to 8bit GGUFs for local deployment! Will be up in an hour or so at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc... Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder |
||||
|
Positive
Pro
4 replies
> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first. I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But ... |
||||
|
Neutral
Neutral
8 replies
The "qwen-code" app seems to be a gemini-cli fork. https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE I hope these OSS CC clones converge at some point. Actually it is mentioned in the page: we’re also open-sourcing a command-line tool for a... |
||||
|
Neutral
Neutral
14 replies
At my work, here is a typical breakdown of time spent by work areas for a software engineer. Which of these areas can be sped up by using agentic coding? 05%: Making code changes; 10%: Running build pipelines; 20%: Learning about changed process and people via zoom calls, teams cha... |
||||
|
Very positive
Pro
5 replies
I've been using it all day, it rips. Had to bump up the tool-calling limit in Cline to 100 and it just went through the app no issues, got the mobile app built, fixed through the linter errors... wasn't even hosting it with the toolcall template on with the vllm nightly, just stock... |
||||
|
Official
2025-08-04
|
Qwen-Image
Qwen
|
544 | 159 |
|
|
Positive
Pro
13 replies
Not sure why this isn’t a bigger deal - it seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge. |
||||
|
Mixed
Mixed
1 reply
Good release! I've added it to the GenAI Showdown site. Overall a pretty good model, scoring around 40% - and definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when it's quantized). That being said, it still lags prett... |
||||
|
Positive
Pro
2 replies
The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This only seems to apply those recognizable AI artifacts to only the elements needing to be edited. |
||||
|
Neutral
Neutral
8 replies
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on t... |
||||
|
Neutral
Neutral
0 replies
A silly question: do any of these models generate pixels and also vector overlays? I don't see why we need to solve the text problem pixel-for-pixel if we can just generate higher-level descriptions of the text (text, font, font size, etc). Ofc, it won't work in all situations... |
||||
|
Official
2025-09-12
|
Qwen3-Next 80B-A3B (Thinking)
Qwen
|
569 | 230 |
|
|
Positive
Pro
3 replies
Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix. DeepSeek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod... But DeepSeek R1 adds embed_to... |
||||
|
Positive
Mixed
4 replies
Alibaba keeps releasing gold content. I just tried Qwen3-Next-80B-A3B on Qwen chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis. According to Qwen Chat, Qwen3-Next has the f... |
||||
|
Negative
Anti
5 replies
llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob" Here's a classic ASCII art representation of SpongeBob SquarePants: .------. / o o \ | | | \___/ | \_______/ llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ "An ASCII of spongebob" Here's an ASCII art of SpongeB... |
||||
|
Positive
Pro
2 replies
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model depending on how you offload your VRAM and CPU. That’s insane. |
||||
|
Neutral
Neutral
8 replies
The same week Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true then this could lead to a lot less demand for Nvidia, Oracle, Coreweave etc |
||||
|
Repo
2025-09-22
|
Qwen3-Omni Flash
Qwen
|
571 | 142 |
|
|
Positive
Pro
12 replies
Interesting, the pacing seemed very slow when conversing in English, but when I spoke to it in Spanish, it sounded much faster. It's really impressive that these models are going to be able to do real-time translation and much more. The Chinese are going to end up owning the AI... |
||||
|
Neutral
Neutral
3 replies
You can try it out on https://chat.qwen.ai/ - sign in with Google or GitHub (signed-out users can't use the voice mode) and then click on the voice icon. It has an entertaining selection of different voices, including: *Dylan* - a teenager who grew up in Beijing's hutongs; *Peter*... |
||||
|
Neutral
Neutral
4 replies
The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.I wonder if we'll see a macOS port soon - currently it very much needs an N... |
||||
|
Very positive
Pro
0 replies
Here is the demo video on it. The video w/ sound input -> sound output while doing translation from the video to another language was the most impressive display I've seen yet. https://www.youtube.com/watch?v=_zdOrPju4_g |
||||
|
Positive
Pro
2 replies
Speech input + speech output is a big deal. In theory you can talk to it using voice, and it can respond in your language, or translate for someone else, without intermediary technologies. Right now you need wakeword, speech to text, and then text to speech, in addition to you... |
||||
|
Official
2025-09-23
|
Qwen3-VL 235B-A22B
Qwen
|
434 | 160 |
|
|
Very positive
Pro
11 replies
As I mentioned yesterday - I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used pil/opencv, pytesseract, and OpenAI as a fallback. It still has a staggering number of failures. Today I tried a handfu... |
||||
|
Very positive
Pro
4 replies
The Chinese are doing what they have been doing to the manufacturing industry as well. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said,... |
||||
|
Neutral
Neutral
2 replies
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage ... |
||||
|
Very positive
Pro
2 replies
The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of proprietary models and still released it as open-weights. My first tests suggest this might actually be true, will continue testing. Wow |
||||
|
Negative
Anti
3 replies
Sadly it still fails the "extra limb" test. I have a few images of animals with an extra limb photoshopped onto them. A dog with a leg coming out of its stomach, or a cat with two front right legs. Like every other model I have tested, it insists that the animals have their an... |
||||
|
3rd party
2025-11-02
|
Tongyi DeepResearch
Qwen
|
365 | 153 |
|
|
Neutral
Neutral
7 replies
It makes me wonder if we'll see an explosion of purpose-trained LLMs because we hit diminishing returns on investment with pre-training, or if it takes a couple of months to fold these advantages back into the frontier models. Given the size of frontier models I would assume that th... |
||||
|
Negative
Anti
11 replies
Has anyone found these deep research tools useful? In my experience, they generate really bland reports that don't go much further than a summarization of what a search engine would return. |
||||
|
Neutral
Neutral
2 replies
I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch |
||||
|
Mixed
Pro
1 reply
This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though. |
||||
|
Neutral
Neutral
13 replies
Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like. Constra... |
||||
|
Official
2025-12-10
|
Qwen3-Omni Flash Realtime
Qwen
|
316 | 106 |
|
|
Positive
Pro
7 replies
This is a 30B parameter MoE with 3B active parameters and is the successor to their previous 7B omni model. [1] You can expect this model to have similar performance to the non-omni version. [2] There aren't many open-weights omni models so I consider this a big deal. I would us... |
||||
|
Neutral
Neutral
4 replies
Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does. Are there any open weight models that do? Not talking about speech to text -> LLM -> text to speech btw, I mean a real voice <-> language model. ed... |
||||
|
Neutral
Neutral
2 replies
Is there a way to run these Omni models on a Macbook quantized via GGUF or MLX? I know I can run it in LMStudio or Llama.cpp but they don't have streaming microphone support or streaming webcam support. Qwen usually provides example code in Python that requires Cuda and a non-q... |
||||
|
Neutral
Neutral
0 replies
Wayback for those that can't reach https://web.archive.org/web/20251210164048/https://qwen.ai/b... |
||||
|
Neutral
Neutral
1 reply
The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user. I'm curious how anyone has solved this. |
||||
|
Official
2026-01-22
|
Qwen3-TTS
Qwen
|
744 | 225 |
|
|
Neutral
Neutral
9 replies
If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then pas... |
||||
|
Neutral
Neutral
4 replies
I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423 Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py You can try it with uv (downloads a 4.5GB model on first run) like ... |
||||
|
Mixed
Mixed
2 replies
Interesting model. I've managed to get the 0.6B param model running on my old 1080 and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically i... |
||||
|
Very positive
Pro
3 replies
It isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and I thought the stuff from two years ago was about the best we were going to get. I don't know the size of these; I scrolled to the samples. I am going to get the mode... |
||||
|
Neutral
Neutral
9 replies
Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5. Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they are in terms of politics. |
||||
|
Official
2026-01-26
|
Qwen3 Max
Qwen
|
502 | 424 |
|
|
Mixed
Mixed
5 replies
One thing I’m becoming curious about with these models are the token counts to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense, they’re techniques for using the... |
||||
|
Neutral
Neutral
3 replies
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible that the Chinese internet has better quality content available? My problem with deep research tends to be that what it does is it searche... |
||||
|
Neutral
Neutral
4 replies
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/... |
||||
|
Neutral
Neutral
2 replies
Hacker News strongly believes Opus 4.5 is the de facto standard and China was consistently 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks. |
||||
|
Positive
Pro
0 replies
The most important benchmark: https://boutell.dev/misc/qwen3-max-pelican.svg I used Simon Willison's usual prompt. It thought for over 2 minutes (free account). The commentary was even more glowing than the image. It has a certain charm. |
||||
|
Official
2026-02-03
|
Qwen3 Coder Next
Qwen
|
735 | 429 |
|
|
Positive
Pro
14 replies
This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher-end laptops. I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough ... |
||||
|
Neutral
Neutral
7 replies
For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next |
||||
|
Neutral
Neutral
2 replies
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this: brew upgrade llama.cpp # or brew install if you don't have it yet Then: llama-cli \ -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \ --fit on \ --seed 3407 \ --temp 1.0 \ --top-p ... |
||||
|
Positive
Mixed
3 replies
It’s hard to overstate just how wild this model might be if it performs as claimed. The claim is that it can perform close to Sonnet 4.5 for assisted coding (SWE-bench) while using only 3B active parameters. This is obscenely small for the claimed performance. |
||||
|
Positive
Pro
0 replies
I got the Qwen3 Coder 30B running locally on a Mac M4 Max 36GB. It was slow, but it worked and did do some decent stuff: https://www.youtube.com/watch?v=7mAPaRbsjTU Video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: ht... |
||||
|
Official
2026-02-10
|
Qwen-Image-2.0
Qwen
|
422 | 192 |
|
|
Neutral
Neutral
10 replies
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, when the ren... |
||||
|
Positive
Pro
2 replies
Couple of thoughts: 1. I’d wager that given their previous release history, this will be open-weight within 3-4 weeks. 2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can... |
||||
|
Positive
Pro
3 replies
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation. |
||||
|
Neutral
Neutral
1 reply
The "horse riding man" prompt is wild: """A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult br... |
||||
|
Neutral
Neutral
11 replies
I recently tried out LMStudio on Linux for local models. So easy to use! What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen? |
||||
|
Official
2026-02-16
|
Qwen3.5 397B-A17B
Qwen
|
434 | 214 |
|
|
Neutral
Neutral
1 repliesr
For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5 |
||||
|
Negative
Anti
9 repliesr
You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question. |
||||
|
Positive
Pro
0 replies
"the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive." I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training obje... |
||||
|
Negative
Anti
8 replies
Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8... |
||||
|
Positive
Pro
3 replies
Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder. |
||||
|
3rd party
2026-02-28
|
Qwen3.5 122B
Qwen
|
461 | 272 |
|
|
Mixed
Anti
18 replies
If you're new to this: all of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, then they always disappoint in actual use. I've been playing with Qwen3-Coder-... |
||||
|
Negative
Anti
19 replies
I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge. So far Opus 4.6 and Gemini Pro are ve... |
||||
|
Neutral
Neutral
4 replies
I recently wrote a guide on getting llama.cpp, OpenCode, and Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization) working on an M1 MacBook Pro (e.g. using brew). It was a bit finicky to get all of the pieces together, so hopefully this can be used with these newer models.... |
||||
|
Positive
Pro
4 replies
I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous. I then discovered what quantization is by reading a blog post about binary quantizat... |
||||
|
Negative
Anti
3 replies
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.Up until relatively recently, while people had already long been making these claim... |
||||
|
3rd party
2026-03-04
|
Qwen Turbo
Qwen
|
783 | 360 |
|
|
Positive
Pro
8 replies
I really hope this doesn't hinder development too much. As Simon says, Qwen3.5 is very impressive. I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've h... |
||||
|
Negative
Anti
2 replies
There has been tension between Qwen's research team and Alibaba's product team, say the Qwen App. And recently, Alibaba tried to impose DAU as a KPI. It's understandable that a company like Alibaba would force a change of product strategy for any number of reasons. What puzzle... |
||||
|
Positive
Pro
1 reply
One thing I’ve noticed with local models is that people tolerate a lot more trial-and-error behavior. When a hosted model wastes tokens it feels expensive, but when a local model loops a bit it just feels like it’s “thinking.” If models like Qwen can get good enough for coding ... |
||||
|
Neutral
Neutral
9 replies
I wonder how a US lab hasn't dumped truckloads of cash into various laps to ensure these researchers have a place at their lab |
||||
|
Neutral
Neutral
5 replies
How do those companies make money? Qwen, GLM, Kimi, etc all released for free. I have no experience in the field, but from reading HN alone my impression was training is exceptionally costly and inference can be barely made profitable. How/why do they fund ongoing development ... |
||||
|
Official
2026-04-02
|
Qwen3.6 Plus
Qwen
|
594 | 213 |
|
|
Negative
Anti
11 repliesr
This is their hosted-only model, not an open weight model like they’ve become known for. They got a lot of good publicity for their open weight model releases, which was the goal. The hard part is pivoting from an open weight provider to being considered as a competitor to Cla... |
||||
|
Positive
Mixed
2 replies
I understand people's reactions to the Qwen team comparing against Opus 4.5 instead of 4.6, and them comparing against Gemini Pro 3.0 instead of 3.1. But calling it misleading is a bit of a stretch in my eyes; people here are acting like we immediately forgot how previous generations... |
||||
|
Positive
Pro
2 replies
Pretty solid Pelican: https://gist.github.com/simonw/ca081b679734bc0e5997a43d29fad... I used the https://modelstudio.alibabacloud.com/ API to generate that one, which required signing up for an account and attaching PayPal billing - but it looks like OpenRouter are offering it ... |
||||
|
Negative
Anti
5 replies
Worth noting that this model, unlike almost all qwen models, is not open-weight, nor is the parameter count exposed. Also odd that it is compared against opus 4.5 even though 4.6 was released like 2 months ago. |
||||
|
Positive
Pro
2 replies
I'll diverge from some of these comments: I don't find it misleading to compare to Opus 4.5. I can remember how good Opus 4.5 was. If I'm considering using this, it's most informative to me to compare to the model it's closest to that I have familiarity with. I'm obviously not s... |
||||
Mistral AI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
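The net score that colors each bubble is not defined explicitly in this report. A minimal sketch of one plausible mapping, assuming the five sentiment labels used in the tables below are weighted on a symmetric scale (the weights are illustrative assumptions, not the report's actual formula):

```python
# Hypothetical sentiment weights - an assumption for illustration,
# not the formula used to produce the charts in this report.
WEIGHTS = {
    "Very positive": 2, "Positive": 1, "Neutral": 0,
    "Mixed": 0, "Negative": -1, "Very negative": -2,
}

def net_score(labels):
    """Average sentiment weight across a post's labeled top-level comments."""
    if not labels:
        return 0.0
    return sum(WEIGHTS[label] for label in labels) / len(labels)

# Example: the five labeled comments on the QwQ 32B post above.
print(net_score(["Mixed", "Mixed", "Positive", "Neutral", "Mixed"]))  # 0.2
```

Averaging rather than summing keeps posts with different numbers of labeled comments comparable on the same color scale.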
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Official
2023-09-27
|
Mistral 7B
Mistral
|
884 | 618 |
|
|
Very positive
Pro
5 replies
Major kudos to Mistral for being the first company to Apache-license a model of this class. Meta wouldn't make Llama open source. DeciLM wouldn't make theirs open source. All of them wanted to claim they were open source, while putting in place restrictions and not using an open ... |
||||
|
Negative
Anti
1 reply
I asked it "Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner." and it responded: To stack these items on top of each other in a stable manner, follow these steps: 1. Start with the 9 eggs. Arrange the... |
||||
|
Positive
Pro
5 replies
I eat my initial words; this works really well on my MacBook Air M1 and feels comparable to GPT-3.5 - which is actually an amazing feat! Question: is there something like this, but with the "function calling api" finetuning? 95% of my uses nowadays deal with input/output of stru... |
||||
|
Neutral
Neutral
0 replies
For anyone who missed it, the twitter announcement of the model was just a torrent tracker uri: https://twitter.com/MistralAI/status/1706877320844509405 |
||||
|
Negative
Anti
2 replies
They don't mention what datasets were used. I've come across too many models in the past which gave amazing results because benchmarks leaked into their training data. How are we supposed to verify one of these HuggingFace datasets didn't leak the benchmarks into the training ... |
||||
|
3rd party
2023-12-08
|
Mixtral 8x7B
Mixtral
|
546 | 239 |
|
|
Positive
Pro
10 replies
In other LLM news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... very good: https://huggingface.co/fblgit/una-xaberius-... |
||||
|
Neutral
Neutral
4 replies
Andrej Karpathy's take: New open weights LLM from @MistralAI. params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2. Likely related code: https://github.com/mistralai/megablocks... |
||||
|
Positive
Pro
2 replies
Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini! |
||||
|
Positive
Pro
3 replies
Do you need some fancy announcement? Let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph... |
||||
|
Neutral
Neutral
2 replies
Looks to be Mixture of Experts, here is the params.json: { "dim": 4096, "n_layers": 32, "head_dim": 128, "hidden_dim": 14336, "n_heads": 32, "n_kv_heads": 8, "norm_eps": 1e-05, "vocab_size": 32000, "moe": { "num_experts_per_tok": 2, "num_experts": 8 } } |
||||
|
Official
2023-12-11
|
Mixtral 8x7B
Mixtral
|
639 | 300 |
|
|
Neutral
Neutral
1 reply
Andrej Karpathy's take: Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/ Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6... New HuggingFace explainer on MoE, very nice: https://huggingface.co/bl... |
||||
|
Neutral
Neutral
2 replies
More models available at Huggingface now: https://huggingface.co/search/full-text?q=mixtral Already available from both Mistralai and TheBloke: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF |
||||
|
Positive
Pro
6 replies
> We’re currently using Mixtral 8x7B behind our endpoint mistral-small... So 45 billion parameters is what they consider their "small" model? I'm excited to see what/if their larger models will be. |
||||
|
Neutral
Pro
3 replies
> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given. Mistral does not censor its models and is committed to a hands-off approach, according to their CEO: https://www.you... |
||||
|
Positive
Pro
1 reply
This is very exciting and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think we will get a lot better until that happens and the effort will go into making the models a lot cheaper to run without sacrificing too much... |
||||
|
Paper
2024-01-09
|
Mixtral 8x7B
Mixtral
|
359 | 150 |
|
|
Very positive
Pro
4 replies
This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It's roughly sized at 13B, and ranked much, much higher than competitively sized models by, e.g. https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com...... |
||||
|
Positive
Pro
7 replies
I’d like to note that this model’s parameter usage is low enough (13b) to run smoothly at high quality on a 3090 while beating GPT-3.5 on humaneval and sporting 32k context. 3090s are consumer grade and common on gaming rigs. I’m hoping game devs start experimenting with locall... |
||||
|
Neutral
Neutral
1 reply
If anyone wants to try out this model, I believe it's one of the ones released as a Llamafile by Mozilla/jart[0]. 1) Download llamafile[1] (30.03 GB): https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-ll... 2) chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile 3) ./mixt... |
||||
|
Neutral
Neutral
2 replies
On mac silicon: https://ollama.ai/ then ollama pull mixtral. For a ChatGPT-esque web UI: https://github.com/ollama-webui/ollama-webui then docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama... |
||||
|
Neutral
Neutral
0 replies
Recent and related: Mixtral of experts - https://news.ycombinator.com/item?id=38598559 - Dec 2023 (300 comments); Mistral-8x7B-Chat - https://news.ycombinator.com/item?id=38594578 - Dec 2023 (69 comments); Mistral "Mixtral" 8x7B 32k model [magnet] - https://news.ycombinator.com/ite... |
||||
|
Official
2024-02-26
|
Mistral Large (latest)
Mistral
|
599 | 267 |
|
|
Positive
Pro
4 replies
I appreciate the honesty in the marketing materials. Showing the product scoring below the market leader in a big benchmark is better than the Google way of cherry picking benchmarks. |
||||
|
Mixed
Mixed
1 reply
Very nice! I know they've already done a lot, but I would've liked some language in there re-affirming a commitment to contributing to the open source community. I had thought that was a major part of their brand. I've been staying tuned[0] since the miqu[1] debacle thinking th... |
||||
|
Neutral
Neutral
2 replies
Changelog is also updated: [1] Feb. 26, 2024. API endpoints: We renamed 3 API endpoints and added 2 model endpoints. open-mistral-7b (aka mistral-tiny-2312): renamed from mistral-tiny. The endpoint mistral-tiny will be deprecated in three months. open-mixtral-8x7B (aka mistral-smal... |
||||
|
Neutral
Neutral
1 reply
I just added support for the new models to my https://github.com/simonw/llm-mistral plugin for my LLM CLI tool. You can now do this: pipx install llm llm install llm-mistral llm keys set mistral < paste your API key here > llm -m mistral-large 'prompt goes here' |
||||
|
Positive
Pro
2 replies
Just tried Le Chat for some coding issues I had today that ChatGPT (with GPT-4) wasn't able to solve, and Le Chat actually gave way better answers. Not sure if ChatGPT quality has gone down to save costs as some people suggest, but for these few problems the quality of the ans... |
||||
|
Official
2024-04-17
|
Mixtral 8x22B
Mixtral
|
533 | 242 |
|
|
Mixed
Anti
5 replies
First test I tried to run a random taxation question through it. Output: https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467... Within the first paragraph, it outputs: > GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE Thought that was hilarious. |
||||
|
Neutral
Neutral
11 replies
Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all? |
||||
|
Neutral
Mixed
5 replies
"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 TurboMaybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex... |
||||
|
Mixed
Mixed
2 replies
Great to see such free to use and self-hostable models, but it's said that open now means only that. One cannot replicate this model without access to the training data. |
||||
|
Neutral
Neutral
3 replies
What's the best way to run this on my Macbook Pro? I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is... |
||||
|
Official
2024-05-29
|
Codestral (latest)
Codestral
|
457 | 214 |
|
|
Negative
Anti
8 replies
The license for this [1] prohibits use of the model and its outputs for any commercial activity, or even any "live" (whatever that means) conditions, commercial or not. There seems to be an exclusion for using the code outputs as part of "development". But wait! It also prohibi... |
||||
|
Negative
Neutral
9 replies
My favorite thing to ask the models designed for programming is: "Using Python write a pure ASGI middleware that intercepts the request body, response headers, and response body, stores that information in a dict, and then JSON encodes it to be sent to an external program usin... |
||||
|
Neutral
Neutral
7 replies
i've been noticing that there's a divergence in philosophy between Llama style LLMs (Mistral are Meta alums so I'm counting them in tehre) and OpenAI/GPT style LLMs when it comes to code.GPT3.5+ prioritized code very heavily - there's no CodeGPT, its just GPT4, and every versi... |
||||
|
Neutral
Neutral
6 replies
Is there a way to use this within VSCode like copilot , meaning having the "shadow code" appear while you code instead of having to tho back-and-forth between the editor and a chat-like interface ?For me, a significant component of the quality of these tools resides on the "cl... |
||||
|
Neutral
Neutral
0 replies
>Usage Limitation: - You shall only use the Mistral Models and Derivatives (whether or not created by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-Production Environments; - Subject to the foregoing, You shall not supply the Mistral Models, Deriva... |
||||
|
Official
2024-07-16
|
Codestral Mamba
Codestral
|
485 | 138 |
|
|
Negative
Mixed
7 replies
What are the steps required to get this running in VS Code? If they had linked to the instructions in their post (or better yet a link to a one click install of a VS Code Extension), it would help a lot with adoption. (BTW I consider it malpractice that they are at the top of ha... |
||||
|
Negative
Neutral
3 replies
I kinda just want something that can keep up with the original version of Copilot. It was so much better than the crap they’re pumping out now (keeps messing up syntax and only completing a few characters at a time). |
||||
|
Neutral
Neutral
2 replies
Does anyone have a favorite FIM capable model? I've been using codellama-13b through ollama w/ a vim extension i wrote and it's okay but not amazing, I definitely get better code most of the time out of Gemma-27b but no FIM (and for some reason codellama-34b has broken inferen... |
||||
|
Positive
Pro
0 replies
It's great to see a high-profile model using Mamba2! |
||||
|
Neutral
Anti
2 replies
The MBPP column should bold DeepSeek as it has a better score than Codestral. |
||||
|
Official
2024-07-18
|
Mistral Nemo
Mistral Nemo
|
418 | 162 |
|
|
Mixed
Mixed
6 replies
> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on s... |
||||
|
Neutral
Neutral
3 replies
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models. Does anyone have a good a... |
||||
|
Neutral
Neutral
0 replies
Nvidia has a blogpost about Mistral Nemo, too. https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/ > Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines. > *Designed to fit on the... |
||||
|
Neutral
Neutral
2 replies
These big models are getting pumped out like crazy, that is the business of these companies. But basically, it feels like private/industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/fund... |
||||
|
Negative
Anti
1 reply
I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data. |
||||
|
3rd party
2024-09-11
|
Pixtral 12B
Pixtral
|
163 | 40 |
|
|
Mixed
Mixed
5 replies
The "Mistral Pixtral multimodal model" really rolls off the tongue.> It’s unclear which image data Mistral might have used to develop Pixtral 12B.The days of free web scraping especially for the richer sources of material are almost gone, with anything between technical (AP... |
||||
|
Negative
Anti
1 reply
Couple notes for newcomers: 1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back. 2. It seems like Pixtral 12B benchmarks significantly below Qwen2-VL-7B [1], so if you want the best local model for unde... |
||||
|
Negative
Anti
2 replies
Mistral being more open than 'openai' is kind of a meme. How can a company call itself open while it refuses to openly distribute it's product and when competitor are actually doing it. |
||||
|
Neutral
Neutral
0 replies
Related earlier: New Mistral AI Weights https://news.ycombinator.com/item?id=41508695 |
||||
|
Positive
Pro
1 reply
I’d love to know how much money Mistral is taking in versus spending. I’m very happy for all these open weights models, but they don’t have Instagram to help pay for it. These models are expensive to build. |
||||
|
Official
2024-10-16
|
Ministral 3B (latest)
Ministral
|
216 | 99 |
|
|
Negative
Anti
8 replies
3b is API-only so you won’t be able to run it on-device, which is the killer app for these smaller edge models. I’m not opposed to licensing but “email us for a license” is a bad sign for indie developers, in my experience. 8b weights are here https://huggingface.co/mistralai... |
||||
|
Negative
Anti
2 replies
This press release is a big change in branding and ethos for Mistral. What was originally a vibey, insurgent contender that put out magnet links is now a PR-crafting team that has to fight to pitch their utility to the public. |
||||
|
Neutral
Neutral
3 replies
Has anyone put together a good and regularly updated decision tree for what model to use in different circumstances (VRAM limitations, relative strengths, licensing, etc.)? Given the enormous zoo of models in circulation, there must be certain models that are totally obsolete. |
||||
|
Negative
Anti
2 replies
They didn't add a comparison to Qwen 2.5 3b, which seems to surpass Ministral 3b MMLU, HumanEval, GSM8K: https://qwen2.org/qwen2-5/#qwen25-05b15b3b-performance These benchmarks don't really matter that much, but it is funny how this blog post conveniently forgot to compare with... |
||||
|
Neutral
Neutral
4 replies
For anybody wondering about the title, that's a sort-of pun in French about how words get pluralized following French rules. The quintessential example is "cheval" (horse) which becomes "chevaux" (horses), which is the rule they're following (or being cute about). Un mistral, d... |
||||
|
Official
2025-01-13
|
Codestral 25.01
Codestral
|
21 | 1 |
|
|
Mixed
Mixed
0 replies
256k context window and FIM support sounds good. I can't see it mentioned on the release page how big the model is, they say sub-100B but they are comparing it to 22B and the 22B seems to hold up quite well against it. If we're talking 80B vs 22B then it seems like quite a hea... |
||||
|
Official
2025-01-30
|
Mistral Small 3
Mistral
|
620 | 194 |
|
|
Positive
Pro
8 replies
I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. Th... |
||||
|
Neutral
Pro
6 replies
Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache: We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-lic... |
||||
|
Neutral
Neutral
1 reply
Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to: * run LLM evaluations systematically and at scale * share the data with the public in a rigorous and transparent way. We use the UK governmen... |
||||
|
Negative
Anti
0 replies
Not so subtle in function calling example[1] "role": "assistant", "content": "---\n\nOpenAI is a FOR-profit company.", [1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-... |
||||
|
Neutral
Pro
0 replies
So the point of this release is: 1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version) 2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100) 3) Performance, according to mistral, somewhere between Qwen 2... |
||||
|
Official
2025-03-06
|
Mistral OCR
Mistral
|
1,756 | 417 |
|
|
Mixed
Mixed
9 replies
I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker . Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41. Marker can inference between 20 and 120 pages per second on an H100. You can see the samples here - https://huggingf... |
||||
|
Negative
Anti
7 replies
It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image: https://i.imgur.com/jcwW5AG.jpeg For the blocks in the center, it outputs: > Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 aoû... |
||||
|
Very positive
Pro
2 replies
This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer. Specifically, this allows y... |
||||
|
Positive
Pro
3 replies
I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF. |
||||
|
Mixed
Mixed
2 replies
We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-geminiA high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and... |
||||
|
Official
2025-05-07
|
Mistral Medium 3
Mistral
|
70 | 20 |
|
|
Negative
Anti
1 reply
It's not cheaper than Deepseek V3.1, though, and Deepseek outperforms on nearly everything. And only between 1-3x the throughput based on the openrouter metrics (near equivalent throughput if you use an FP8 quant). Wish I could be a little more excited about this one. |
||||
|
Neutral
Neutral
1 reply
https://lifearchitect.ai/models-table/ |
||||
|
Neutral
Neutral
2 replies
I guess this one (Mistral Medium 3) won't be open? |
||||
|
Positive
Pro
0 replies
The medium model is crazy fast. Reminds me of maverick on groq (except better according to their own testing). |
||||
|
Neutral
Neutral
0 replies
The only hint at its size is that it requires "self-hosted environments of four GPUs and above". |
||||
|
Official
2025-05-21
|
Devstral
Devstral
|
701 | 148 |
|
|
Positive
Pro
4 replies
The first number I look at these days is the file size via Ollama, which for this model is 14GB https://ollama.com/library/devstral/tagsI find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters bec... |
||||
|
Very positive
Pro
3 replies
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run t... |
||||
|
Positive
Pro
1 reply
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company.Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, it's really good. Mistral Small is really good. Also building a startup with Mis... |
||||
|
Positive
Pro
1 reply
It's very nice that it has the Apache 2.0 license, i.e. well understood license, instead of some "open weight" license with a lot of conditions. |
||||
|
Mixed
Mixed
1 reply
*For people without a 24GB RAM video card, I've got an 8GB RAM one running this model performs OK for simple tasks on ollama but you'd probably want to pay for an API for anything using a large context window that is time sensitive:* total duration: 35.016288581s load duration:... |
||||
|
Official
2025-06-10
|
Magistral Small
Magistral
|
941 | 424 |
|
|
Neutral
Neutral
7 replies
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL or ./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1... |
||||
|
Negative
Anti
15 replies
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is... |
||||
|
Very negative
Anti
1 reply
Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600 page pdfs to OCR. They were all monospace text just for some reason the OCR was missing. I tried it, 80% of the "text" was recognised as images and output as whitespace so most of... |
||||
|
Positive
Pro
2 replies
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in ... |
||||
|
Neutral
Neutral
2 replies
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/ |
||||
|
Repo
2025-06-20
|
Mistral Small 3.2
Mistral
|
23 | 0 | — |
|
Official
2025-12-02
|
Mistral 3
Mistral
|
826 | 236 |
|
|
Very positive
Pro
7 replies
I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and fol... |
||||
|
Mixed
Mixed
3 replies
The new large model uses DeepseekV2 architecture. 0 mention on the page lol. It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar ... |
||||
|
Positive
Pro
2 replies
The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/ |
||||
|
Positive
Pro
1 reply
Europe's bright star has been quiet for a while, great to see them back and good to see them come back to Open Source light with Apache 2.0 licenses - they're too far from the SOTA pack that exclusive/proprietary models would work in their favor.Mistral had the best small mode... |
||||
|
Positive
Pro
4 replies
Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things. |
||||
|
Official
2025-12-09
|
Devstral 2 (latest)
Devstral
|
745 | 349 |
|
|
Positive
Pro
9 replies
llm install llm-mistral llm mistral refresh llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle" https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Pretty good for a 123B model! (That said I'm not 100% certain I guessed the correct model ID, ... |
||||
|
Positive
Mixed
2 replies
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku ... |
||||
|
Positive
Pro
2 replies
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them. It spent about half an hour, correctly identified wh... |
||||
|
Positive
Mixed
0 replies
So I tested the bigger model with my typical standard test queries which are not so tough, not so easy. They are also some that you wouldn't find extensive training data for. Finally, I already have used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3... Here is wha... |
||||
|
Mixed
Mixed
9 replies
Look interesting, eager to play around with it! Devstral was a neat model when it released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so gonna be interesting to see if Devstral 2 can replace it.I'm a bit saddened ... |
||||
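A recurring question in the Mixtral threads above is how Mixture-of-Experts routing actually works. A minimal numpy sketch of top-2 gating, using the expert counts from Mixtral's params.json (num_experts: 8, num_experts_per_tok: 2) but with toy sizes and random weights for illustration only, not Mistral's implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k of n experts (Mixtral: n=8, k=2)."""
    logits = gate_w @ x                      # router scores, one per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                 # softmax over the selected k only
    # Only k expert FFNs run per token, which is why a ~47B-total model
    # can have roughly the per-token cost of a ~13B dense model.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8                       # toy sizes; Mixtral uses dim=4096
gate_w = rng.normal(size=(n_experts, dim))
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(dim, dim)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=dim), gate_w, experts)
print(y.shape)  # (16,)
```

The expert "specializations" asked about in the comments are not assigned by hand: the router weights (gate_w here) are trained jointly with the experts, so any division of labor emerges during training rather than being chosen up front.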
Meta
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Official
2023-07-18
|
Llama 2
Llama
|
2,268 | 820 |
|
|
Positive
Pro
7 replies
Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5! AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions. - Llama 1 (llama-65b): 57.6 - LLama 2 (llama-2-70b-chat-hf): 64.6 - GPT-3.5: 85.2 - GPT-... |
||||
|
Negative
Anti
25 replies
Key detail from release:> If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must re... |
||||
|
Negative
Anti
6 replies
This was a pretty disappointing initial exchange: > what are the most common non-investor roles at early stage venture capital firms? Thank you for reaching out! I'm happy to help you with your question. However, I must point out that the term "non-investor roles" may be perc... |
||||
|
Positive
Pro
23 replies
Hey HN, we've released tools that make it easy to test LLaMa 2 and add it to your own app! Model playground here: https://llama2.ai Hosted chat API here: https://replicate.com/a16z-infra/llama13b-v2-chat If you want to just play with the model, llama2.ai is a very easy way to do ... |
||||
|
Negative
Anti
5 replies
Another non-open source license. Getting better but don't let anyone tell you this is open source. http://marble.onl/posts/software-licenses-masquerading-as-op... |
||||
|
Official
2024-04-18
|
Meta-Llama-3-70B-Instruct
Llama
|
2,199 | 923 |
|
|
Neutral
Neutral
0 replies
See also https://ai.meta.com/blog/meta-llama-3/ and https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... edit: and https://twitter.com/karpathy/status/1781028605709234613 |
||||
|
Neutral
Anti
15 replies
They've got a console for it as well, https://www.meta.ai/ And announcing a lot of integration across the Meta product suite, https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... Neglected to include comparisons against GPT-4-Turbo or Claude Opus, so I guess it's far ... |
||||
|
Positive
Pro
2 replies
Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases. Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others: p... |
||||
|
Positive
Pro
0 replies
Llama 3 70B has debuted on the famous LMSYS chatbot arena leaderboard at position number 5, tied with Claude 2 Sonnet, Bard (Gemini Pro), and Command R+, ahead of Claude 2 Haiku and older versions of GPT-4. The score still has a large uncertainty so it will take a while to dete... |
||||
|
Negative
Anti
4 replies
I tried generating a Chinese rap song, and it did generate a pretty good rap. However, upon completion, it deleted the response, and showed > I don’t understand Chinese yet, but I’m working on it. I will send you a message when we can talk in Chinese. I tried some other lang... |
||||
|
Official
2024-07-23
|
Meta-Llama-3.1-405B-Instruct
Llama
|
437 | 269 |
|
|
Neutral
Neutral
0 replies
Related ongoing thread:Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments) |
||||
|
Positive
Pro
4 replies
The 405b model is actually competitive against closed source frontier models. Quick comparison with GPT-4o:
+----------------+-------+-------+
| Metric         | GPT-4o| Llama |
|                |       | 3.1   |
|                |       | 405B  |
+----------------+-------+-------+
| MMLU           | 88.7  | 88.6  |
| GPQA           | 53.6  | 51.1  |
| ... |
||||
|
Positive
Pro
1 reply
I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboar... |
||||
|
Neutral
Neutral
6 replies
You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon. If you want to learn more, there is a writeup at https://wow.groq.com... |
||||
|
Very positive
Pro
3 replies
Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology.Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path... |
||||
|
Official
2024-09-25
|
Llama-3.2-11B-Vision-Instruct
Llama
|
924 | 328 |
|
|
Very positive
Pro
8 replies
I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version).I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job... |
||||
|
Very positive
Pro
11 replies
I'm blown away with just how open the Llama team at Meta is. It is nice to see that they are not only giving access to the models, but they at the same time are open about how they built them. I don't know how the future is going to go in the terms of models, but I sure am gra... |
||||
|
Positive
Pro
6 replies
"The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?)With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!I believe this unforgiving dynamic is why model distillation works w... |
||||
|
Positive
Pro
3 replies
Llama3.2 3B feels a lot better than other models with same size (e.g. Gemma2, Phi3.5-mini models). For anyone looking for a simple way to test Llama3.2 3B locally with UI, install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in terminal: nexa run llama3.2 --streamlit Dis... |
||||
|
Neutral
Neutral
3 replies
If anyone else is looking for the bigger models on ollama and wondering where they are, the Ollama blog post answered that for me. The are "coming soon" so they just aren't ready quite yet[1]. I was a little worried when I couldn't find them but sounds like we just need to be ... |
||||
|
Repo
2024-12-06
|
Llama-3.3-70B-Instruct
Llama
|
425 | 219 |
|
|
Very positive
Pro
3 replies
Benchmarks - https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment... Seems to perform on par with or slightly better than Llama 3.2 405B, which is crazy impressive. Edit: According to Zuck (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last release in the Llama 3... |
||||
|
Neutral
Pro
8 replies
This reminds me of Steve Jobs's famous comment to Dropbox about storage being 'a feature, not a product.' Zuckerberg - by open-sourcing these powerful models, he's effectively commoditising AI while Meta's real business model remains centred around their social platforms. They... |
||||
|
Positive
Pro
4 replies
Seems to be more or less on par with GPT-4o across many benchmarks: https://x.com/Ahmad_Al_Dahle/status/1865071436630778109 |
||||
|
Positive
Pro
1 reply
Does unexpectedly well on our benchmark:https://help.kagi.com/kagi/ai/llm-benchmark.htmlWill dive into it more, but this is impressive. |
||||
|
Neutral
Neutral
3 replies
Please help me understand something. I've been out of the loop with HuggingFace models. What can you do with these models? 1. Can you download them and run them on your Laptop via JupyterLab? 2. What benefits does that get you? 3. Can you update them regularly (with new data on the... |
||||
|
Official
2025-04-05
|
Llama 4 Scout 17B 16E Instruct
Llama
|
1,235 | 658 |
|
|
Neutral
Neutral
8 replies
General overview below, as the pages don't seem to be working well. Llama 4 Models: - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each. - They are natively multimodal: text + image input, text-only output. - Key achie... |
||||
|
Neutral
Neutral
36 replies
"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet."Perhaps. Or, maybe, "leaning left" by th... |
||||
|
Neutral
Neutral
5 replies
Model training observations from both Llama 3 and 4 papers: Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38-43% hardware efficiency [Meta, Llama 3]. For Llama 4 training, Meta doubled the compute, using ~... |
||||
|
Positive
Pro
11 replies
The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct... |
||||
|
Negative
Mixed
7 replies
This thread so far (at 310 comments) summarized by Llama 4 Maverick: hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000 Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4... And with Scout I got complete junk output for some r... |
||||
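A comment in the Llama 3.2 thread above argues that 1-hot next-token targets punish every plausible alternative, and that distillation works because a teacher supplies soft targets instead. A toy illustration of that point, with made-up probabilities for the commenter's "Llama jumped over the ___" example (nothing here reflects Meta's actual training recipe):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p_i * log q_i): training loss for prediction q against target p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Candidate next tokens: wall, fence, river
one_hot = [1.0, 0.0, 0.0]       # hard target: only "wall" counts
teacher = [0.6, 0.3, 0.1]       # soft teacher target: "fence" is also plausible

confident = [0.9, 0.05, 0.05]   # student that concentrates mass like a 1-hot label
calibrated = [0.6, 0.3, 0.1]    # student that matches the teacher's distribution

# Against the hard target, any mass on "fence" is pure penalty:
print(cross_entropy(one_hot, confident) < cross_entropy(one_hot, calibrated))  # True
# Against the soft target, matching the teacher's whole distribution wins:
print(cross_entropy(teacher, calibrated) < cross_entropy(teacher, confident))  # True
```

Under the 1-hot target the student is penalized for exactly the plausibility judgments the comment describes; under the soft teacher target those judgments become the training signal, which is the intuition behind distilling smaller Llama models from larger ones.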
DeepSeek
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Repo
2025-01-20
|
DeepSeek R1
DeepSeek
|
1,843 | 663 |
|
|
Positive
Pro
23 replies
OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...The one I'm running is the 8.54GB file. I'm using Ollama like this: ollama run hf.co/unsloth/DeepSeek-... |
||||
|
Positive
Pro
22 repliesr
Disclaimer: I am very well aware this is not a valid test or indicative or anything else. I just thought it was hilarious.When I asked the normal "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that its (2). It c... |
||||
|
Positive
Pro
4 repliesr
> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.We've been running ... |
||||
|
Mixed
Mixed
6 repliesr
Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, Ne... |
||||
|
Positive
Pro
6 repliesr
The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.Their model crushes it on closed-system tasks... |
||||
| Paper · 2025-01-25 | DeepSeek R1 (DeepSeek) | 1,351 | 1,056 | 3 Positive, 1 Very positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 10 replies: we've been tracking the deepseek threads extensively in LS. related reads: - i consider the deepseek v3 paper required preread https://github.com/deepseek-ai/DeepSeek-V3 - R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo https://aider.chat/2025/01/24/r1-sonnet...
- **Positive / Pro**, 6 replies: I've been using https://chat.deepseek.com/ over my ChatGPT Pro subscription because being able to read the thinking in the way they present it is just much much easier to "debug" - also I can see when it's bending it's reply to something, often softening it or pandering to me ...
- **Neutral / Neutral**, 10 replies: DeepSeek-R1 has apparently caused quite a shock wave in SV... https://venturebeat.com/ai/why-everyone-in-ai-is-freaking-ou...
- **Very positive / Pro**, 3 replies: DeepSeek V3 came in the perfect time, precisely when Claude Sonnet turned into crap and barely allows me to complete something without me hitting some unexpected constraints. Idk, what their plans is and if their strategy is to undercut the competitors but for me, this is a hug...
- **Positive / Neutral**, 4 replies: Over 100 authors on arxiv and published under the team name, that's how you recognize everyone and build comradery. I bet morale is high over there

| Paper · 2025-03-27 | DeepSeek V3 (DeepSeek) | 132 | 34 | 3 Neutral, 1 Positive, 1 Negative |

Top comments:
- **Neutral / Neutral**, 3 replies: The GPU-hours stat here allows us to back out some interesting figures around electricity usage and carbon emissions if we make a few assumptions. 2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU Watt-hours. 975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1....
- **Negative / Anti**, 2 replies: The fact that you can unironically put the "only" modifier on a training time of 2.8 million GPU hours is nuts.
- **Neutral / Neutral**, 1 reply: Re DeepSeek-V3 0324 - I made some 2.7bit dynamic quants (230GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...
- **Neutral / Neutral**, 0 replies: Hasn't been updated for the -0324 release unfortunately, and diff-pdf shows only a few small additions (and consequent layout shift) for the updated arxiv version on Feb 18.
- **Positive / Pro**, 1 reply: Nice to see a return to open source in models and training systems.

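The first comment's back-of-envelope electricity estimate can be reproduced as a short script. The 350W TDP and the 1.2 non-GPU overhead factor are the commenter's assumptions (not figures from the paper), and the carbon-intensity step is truncated in the sample, so this sketch stops at energy:

```python
# Reproduce the commenter's electricity estimate for DeepSeek-V3 training.
gpu_hours = 2_788_000   # H800 GPU-hours, from the paper
tdp_watts = 350         # commenter's assumed H800 TDP
overhead = 1.2          # commenter's factor for non-GPU hardware

gpu_wh = gpu_hours * tdp_watts    # GPU-only watt-hours
total_wh = gpu_wh * overhead      # including non-GPU overhead
print(f"{gpu_wh:,} GPU Wh; ~{total_wh / 1e6:,.0f} MWh total")
```

The first product matches the comment's 975,800,000 GPU Wh figure; with the overhead factor the estimate lands around 1,171 MWh before any carbon-intensity conversion.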
| Repo · 2025-05-28 | DeepSeek R1 0528 (DeepSeek) | 451 | 250 | 3 Neutral, 1 Positive, 1 Negative |

Top comments:
- **Neutral / Neutral**, 4 replies: Well that didn't take long, available from 7 providers through openrouter: https://openrouter.ai/deepseek/deepseek-r1-0528/providers; May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B pa...
- **Negative / Mixed**, 4 replies: No information to be found about it. Hopefully we get benchmarks soon. Reminds me of the days when Mistral would just tweet a torrent magnet link
- **Positive / Pro**, 7 replies: I love how Deepseek just casually drops new updates (that deliver big improvements) without fanfare.
- **Neutral / Neutral**, 13 replies: Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?
- **Neutral / Neutral**, 6 replies: What use cases are people using local LLMs for? Have you created any practical tools that actually increase your efficiency? I've been experimenting a bit but find it hard to get inspiration for useful applications

| Official · 2025-08-21 | DeepSeek-V3.1 (DeepSeek) | 778 | 263 | 2 Neutral, 2 Mixed, 1 Negative |

Top comments:
- **Neutral / Neutral**, 6 replies: For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow. ./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CP...
- **Mixed / Mixed**, 6 replies: For reference, here is the terminal-bench leaderboard: https://www.tbench.ai/leaderboard. Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will...
- **Negative / Anti**, 5 replies: Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive? https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
- **Neutral / Anti**, 2 replies: It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoning; pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
- **Mixed / Mixed**, 2 replies: It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those.

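The ~250 GB figure in the first comment is consistent with simple quantization arithmetic. The average bits-per-parameter below is my assumption for a mixed 2-8-bit "dynamic 2bit" quant (not a published figure), and a real setup needs extra headroom for KV cache and activations:

```python
def quant_size_gb(n_params_b, avg_bits):
    """Approximate weight size in GB for a quantized model."""
    return n_params_b * 1e9 * avg_bits / 8 / 1e9

# DeepSeek-V3.1: 671B total parameters; assume ~2.7 bits/param on average
# (my assumption for a mixed 2-8-bit quant, not a number from the release).
weights_gb = quant_size_gb(671, 2.7)
print(f"~{weights_gb:.0f} GB of weights, before KV cache and activations")
```

At ~226 GB of weights plus runtime overhead, the commenter's "RAM + VRAM >= 250GB" guidance looks like plain capacity math rather than anything model-specific.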
| Official · 2025-09-22 | siliconflow/deepseek-v3.1-terminus (DeepSeek) | 101 | 28 | 2 Mixed, 1 Positive, 1 Neutral, 1 Very negative |

Top comments:
- **Mixed / Mixed**, 0 replies: Notable performance improvement in agentic tool use: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus; the Deepseek provider may train on your prompts: https://openrouter.ai/deepseek/deepseek-v3.1-terminus
- **Neutral / Neutral**, 0 replies: The link is off. This link works: https://api-docs.deepseek.com/updates#deepseek-v31-terminus
- **Very negative / Anti**, 1 reply: I tried V3.1 but it was driving me crazy by ignoring parts of user input, which R1 never did. I had many such instances when e.g. asking about running DeepSeek 671B it instead picked DeepSeek 67B because 671B is too large to exist so I must have made a mistake etc. I concluded...
- **Mixed / Mixed**, 4 replies: > What’s improved? Language consistency: fewer CN/EN mix-ups & no more random chars. It's good that they made this improvement. But is there any advantages at this point using DeepSeek over Qwen?
- **Positive / Pro**, 0 replies: Interesting--I'd seen Chinese characters surprise inserted when it was just repeating back input with one provider, but not others. (I'd also occasionally seen tokens surprise-translated to Chinese.) There's a GitHub bug about it that leads to more discussion here: https://gith...

| Repo · 2025-09-29 | DeepSeek V3.2 Exp (DeepSeek) | 309 | 50 | 3 Positive, 2 Neutral |

Top comments:
- **Positive / Pro**, 2 replies: The 2nd order effect that not a lot of people talk about is price: the fact that model scaling at this pace also correlates with price is amazing. I think this is just as important to distribution of AI as model intelligence is. AFAIK there are no fundamental "laws" that prevent...
- **Positive / Pro**, 2 replies: Happy to see Chinese OSS models keep getting better and cheaper. It also comes with a 50% API price drop for an already cheap model, now at: $0.28/M Input ($0.028/M cache hit), $0.42/M Output
- **Neutral / Neutral**, 2 replies: https://openrouter.ai/deepseek/deepseek-v3.2-exp
- **Neutral / Neutral**, 0 replies: Not sure if I get it correctly: They trained a thing to learn mimicking the full attention distribution but only filtering the top-k (k=2048) most important attention tokens so that when the context window increases, the compute does not go up linearly but constantly for the at...
- **Positive / Pro**, 0 replies: wow... gigantic reduction in cost while holding the benchmarks mostly steady. Impressive.

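The top-k attention idea described in the fourth comment (each query keeps only the k highest-scoring keys, so the softmax and weighted sum stop growing with context length beyond k) can be sketched with a toy dense implementation of the selection step. This is only an illustration of the mechanism as the commenter describes it, not DeepSeek's actual sparse kernel, which also needs a cheap indexer so it never scores every key:

```python
import numpy as np

def topk_attention(q, K, V, k=2048):
    """Toy top-k attention for a single query vector.

    Scores all keys, keeps only the k largest, and softmaxes over that
    subset, capping the softmax/weighted-sum cost at k regardless of
    context length. (A real implementation avoids the full scoring pass.)
    """
    scores = K @ q / np.sqrt(q.shape[0])    # (n_ctx,) scaled dot products
    k = min(k, scores.shape[0])
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                            # softmax over selected keys only
    return w @ V[idx]                       # (d_head,) attention output

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(4096, 64))
V = rng.normal(size=(4096, 64))
out = topk_attention(q, K, V, k=2048)       # attends to 2048 of 4096 keys
```

With k fixed at 2048, doubling the context doubles only the scoring pass, which is the part the commenter says the trained indexer is meant to make cheap.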
| Repo · 2025-10-20 | DeepSeek OCR (DeepSeek) | 1,003 | 244 | 3 Neutral, 1 Positive, 1 Negative |

Top comments:
- **Positive / Pro**, 7 replies: The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote: > Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required...
- **Neutral / Neutral**, 4 replies: I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-c...
- **Negative / Anti**, 5 replies: The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection... which is bigger than Library Genesis. https://annas-archive.org/blog...
- **Neutral / Neutral**, 2 replies: For everyone wondering how good this and other benchmarks are: - the OmniAI benchmark is bad - Instead check OmniDocBench [1] out - Mistral OCR is far far behind most Open Source OCR models and even further behind then Gemini - End to End OCR is still extremely tricky - composed pip...
- **Neutral / Neutral**, 7 replies: How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?

| Repo · 2025-12-01 | DeepSeek-V3.2-Speciale (DeepSeek) | 20 | 0 | — |

| Repo · 2025-12-01 | siliconflow/deepseek-v3.2 (DeepSeek) | 982 | 465 | 2 Positive, 2 Mixed, 1 Neutral |

Top comments:
- **Positive / Pro**, 6 replies: Well props to them for continuing to improve, winning on cost-effectiveness, and continuing to publicly share their improvements. Hard not to root for them as a force to prevent an AI corporate monopoly/duopoly.
- **Neutral / Pro**, 15 replies: How will the Google/Anthropic/OpenAI's of the world make money on AI if open models are competitive with their models? What hurt open source in the past was its inability to keep up with the quality and feature depth of closed source competitors, but models seem to be reaching...
- **Positive / Pro**, 2 replies: Worth noting this is not only good on benchmarks, but significantly more efficient at inference: https://x.com/_thomasip/status/1995489087386771851
- **Mixed / Mixed**, 1 reply: > DeepSeek-V3.2 introduces significant updates to its chat template compared to prior versions. The primary changes involve a revised format for tool calling and the introduction of a "thinking with tools" capability. At first, I thought they had gone the route of implementi...
- **Mixed / Mixed**, 7 replies: It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphic cards ($15-20k machine), can it even run with any reasonable context window that isn't like a crawling 10/tps? Frontier models are far exceeding even the...

Zhipu AI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| Official · 2025-07-28 | GLM-4.5 (GLM) | 247 | 134 | 2 Negative, 2 Neutral, 1 Positive |

Top comments:
- **Negative / Anti**, 15 replies: I typed "Hello" in their chat [1] and it replied back with "Hello! I'm Claude, an AI assistant created by Anthropic. How can I help you today?" Hmmm.... [1] https://chat.z.ai/
- **Neutral / Neutral**, 3 replies: We have x.ai and z.ai, but why? (Sorry... dad joke.)
- **Positive / Pro**, 0 replies: I tested the model with Claude Code and my experience was it was at least as good as Sonnet 3.5, perhaps they designed it that way since they benchmarked with it. Hard to test more thoroughly on more complex problems since the API is being hammered, but I could get it to consis...
- **Negative / Anti**, 2 replies: Tried it with a few prompts. The vibes feel super weird to me, in a way that I don't know how to describe. I'm not sure if it's just that I'm so used to Gemini 2.5 Pro and the other US models. Subjectively, it doesn't feel very smart. I asked it to analyze a recent painting I m...
- **Neutral / Neutral**, 0 replies: I just added it to Gnod Search for easy comparison with the other public LLMs out there: https://www.gnod.com/search/ai; interesting that they claim it outperforms O3, Grok-4 and Gemini-2.5-Pro for coding. I will test that over the coming days on my coding tasks.

| 3rd party · 2025-07-29 | GLM-4.5-Air (GLM) | 577 | 396 | 1 Very positive, 1 Positive, 2 Negative, 1 Mixed |

Top comments:
- **Very positive / Pro**, 4 replies: > Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high qualit...
- **Positive / Pro**, 2 replies: > still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed. I believe we are vastly underestimating what our existing hardware is c...
- **Negative / Anti**, 3 replies: Did you understand the implementation or just that it produced a result? I would hope an LLM could spit out a cobbled form of answer to a common interview question. Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why...
- **Negative / Anti**, 6 replies: Most likely its training data included countless Space Invaders in various programming languages.
- **Mixed / Mixed**, 4 replies: I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable. That said, for something like this, I’d probably get more out of simply finding an existing implementat...

| Repo · 2025-08-11 | GLM-4.5V (GLM) | 27 | 0 | — |

| Paper · 2025-08-12 | GLM-4.5 (GLM) | 417 | 83 | 4 Positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 2 replies: Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these...
- **Positive / Pro**, 6 replies: I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it on and confused it for Claude 4. Subjectively, my experience has been: 1. Claude is somewhat bet...
- **Neutral / Neutral**, 1 reply: So GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters: > When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer. T...
- **Positive / Pro**, 2 replies: Seems like we may get local, open, workstation-grade models that are useful for coding in a few years. By workstation-grade I mean a computer around 2000 USD, and by useful for coding I mean around Sonnet 4 level. Current cloud based models are fun and useful, but a tool that ...
- **Positive / Pro**, 0 replies: This feels like the first open model that doesn’t require significant caveats when comparing to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results ...

| Official · 2025-09-30 | GLM-4.6 (GLM) | 43 | 10 | 3 Positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 1 reply: GLM 4.5 is a great budget option for me. After cancelling Claude sub ($20 USD one), I've been trying Gemini 2.5 Pro through Gemini CLI, and it was okay for landing pages and UI. However, I switched to GLM through Claude Code using their cheapest subscription and it's been pret...
- **Positive / Pro**, 0 replies: GLM 4.5 for me one of the best coding model when running with agents such as Claude Code and Opencode (excluding Claude and GPT-5). Looking forward to try GLM 4.6. But Z AI should do like KIMI¹ and launch a verifier to see how GLM models behave on 3rd party providers. [1] https:/...
- **Positive / Pro**, 1 reply: Very respectable. If Sonnet 4.5 was delayed by a day, GLM 4.6 would have been the obvious best choice for agentic coding... up there with gpt-codex. The z.ai monthly plan looks like a steal if you're working on personal hobby projects instead of paying the $200/mo price for Cla...
- **Neutral / Neutral**, 0 replies: It would be great if people that are using GLM-4.6 within Claude Code could chime in with their experience in using it. I'm a MAX subscriber and having extra cheap tokens would be very helpful.

| Official · 2025-12-22 | GLM-4.7 (GLM) | 437 | 235 | 1 Very positive, 1 Positive, 1 Neutral, 1 Mixed, 1 Negative |

Top comments:
- **Positive / Pro**, 11 replies: My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese prim...
- **Neutral / Neutral**, 5 replies: Cerebras is serving GLM4.6 at 1000 tokens/s right now. They're probably likely to upgrade to this model. I really wonder if GLM 4.7 or models a few generations from now will be able to function effectively in simulated software dev org environments, especially that they self-co...
- **Mixed / Mixed**, 2 replies: Appears to be cheap and effective, though under suspicion. But the personal and policy issues are about as daunting as the technology is promising. Some the terms, possibly similar to many such services: - The use of Z.ai to develop, train, or enhance any algorithms, models, or ...
- **Negative / Anti**, 3 replies: I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refuses to answer, returning an error message in Chinese ("I'm very sorry, I currently cannot provide the specific information you need. If you have other questions or...") followed by leaked reasoning-trace markup: Analyze the User's Input: Question: "is it ok for l...
- **Very positive / Pro**, 2 replies: I have been using 4.6 on Cerebras (or Groq with other models) since it dropped and it is a glimpse of the future. If AGI never happens but we manage to optimise things so I can run that on my handheld/tablet/laptop device, I am beyond happy. And I guess that might happen. Mayb...

| Repo · 2026-01-19 | GLM-4.7 (GLM) | 378 | 135 | 4 Positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 3 replies: Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better but it doesn't reall...
- **Positive / Pro**, 2 replies: I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designe...
- **Positive / Pro**, 8 replies: Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to ...
- **Positive / Pro**, 2 replies: > SWE-bench Verified 59.2. This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
- **Neutral / Neutral**, 1 reply: What’s the significance of this for someone out of the loop?

| Official · 2026-02-11 | GLM-5 (GLM) | 484 | 520 | 2 Positive, 3 Mixed |

Top comments:
- **Mixed / Mixed**, 9 replies: Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07... Solid bird, not a great bicycle frame.
- **Mixed / Pro**, 8 replies: Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future. I've previously doubted that the N-1 or N-2 open weight models will ever be attractive to end users, especially power users. But it now seems that user preferences will be ye...
- **Positive / Pro**, 2 replies: Bought some API credits and ran it through opencode (model was "GLM 5"). Pretty impressed, it did good work. Good reasoning skills and tool use. Even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (dynamic typed OO sc...
- **Mixed / Mixed**, 1 reply: Lets not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2]. I've got subs for both and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support. [1] https:/...
- **Positive / Pro**, 5 replies: GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clea...

| Official · 2026-04-07 | GLM-5.1 (GLM) | 617 | 262 | 3 Positive, 1 Neutral, 1 Mixed |

Top comments:
- **Positive / Pro**, 3 replies: We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com: Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models. But keeping in mind t...
- **Positive / Pro**, 3 replies: Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/
- **Neutral / Anti**, 1 reply: Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware. [0] https://huggingface.co/unsloth/GLM-5...
- **Mixed / Mixed**, 7 replies: To be honest I am a bit sad as, glm5.1 is producing much better typescript than opus or codex imo, but no matter what it does sometimes go into schizo mode at some point over longer contexts. Not always tho I have had multiple session go over 200k and be fine.
- **Positive / Pro**, 15 replies: Every single day, three things are becoming more and more clear: (1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat (2) Local/private inference is the future of AI (3) There's *still* no killer product yet (so get to work!)

Moonshot AI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| 3rd party · 2025-07-11 | Moonshot Kimi K2 Instruct (Kimi) | 348 | 179 | 3 Positive, 2 Neutral |

Top comments:
- **Positive / Pro**, 7 replies: I tried Kimi on a few coding problems that Claude was spinning on. It’s good. It’s huge, way too big to be a “local” model — I think you need something like 16 H200s to run it - but it has a slightly different vibe than some of the other models. I liked it. It would definitely...
- **Neutral / Neutral**, 6 replies: Pelican on a bicycle result: https://simonwillison.net/2025/Jul/11/kimi-k2/
- **Positive / Pro**, 3 replies: This is a very impressive general purpose LLM (GPT 4o, DeepSeek-V3 family). It’s also open source. I think it hasn’t received much attention because the frontier shifted to reasoning and multi-modal AI models. In accuracy benchmarks, all the top models are reasoning ones: https:...
- **Positive / Pro**, 1 reply: Technical strengths aside, I’ve been impressed with how non-robotic Kimi K2 is. Its personality is closer to Anthropic’s best: pleasant, sharp, and eloquent. A small victory over botslop prose.
- **Neutral / Neutral**, 1 reply: Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct model weights are 958.52 GB

| Official · 2025-11-06 | Kimi K2 Thinking (Kimi) | 936 | 427 | 1 Very positive, 1 Positive, 2 Neutral, 1 Mixed |

Top comments:
- **Neutral / Pro**, 5 replies: As a Chinese user, I can say that many people use Kimi, even though I personally don’t use it much. China’s open-source strategy has many significant effects—not only because it aligns with the spirit of open source. For domestic Chinese companies, it also prevents startups fr...
- **Neutral / Neutral**, 4 replies: uv tool install llm · llm install llm-moonshot · llm keys set moonshot # paste key · llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle' https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D... Here's what I got using OpenRouter's moonshotai/kimi-k...
- **Mixed / Mixed**, 17 replies: It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run local...
- **Positive / Pro**, 9 replies: Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?
- **Very positive / Pro**, 1 reply: From our tests, Kimi K2 Thinking is better than literally everything - gpt-5, claude 4.5 sonnet. the only model that is better than Kimi K2 thinking is GPT-5 codex. It's now available on https://okara.ai if anyone wants to try it.

| Official · 2026-01-27 | Kimi K2.5 (Kimi) | 502 | 239 | 2 Positive, 1 Very positive, 2 Neutral |

Top comments:
- **Neutral / Neutral**, 4 replies: Huggingface link: https://huggingface.co/moonshotai/Kimi-K2.5 (1T parameters, 32b active parameters). License: MIT with the following modification: Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or s...
- **Very positive / Pro**, 5 replies: The "Deepseek moment" is just one year ago today! Coincidence or not, let's just marvel for a second over this amount of magic/technology that's being given away for free... and how liberating and different this is than OpenAI and others that were closed to "protect us all".
- **Positive / Pro**, 4 replies: > For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls. > K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduc...
- **Neutral / Neutral**, 0 replies: I posted this elsewhere but thought I'd repost here: * https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO * https://dashboard.safe.ai/ — CAIS' incredible dashboard * https://clocks.brianmoore.com/ — a visual comparison of how well models ...
- **Positive / Pro**, 3 replies: One thing caught my eyes is that besides K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), evolved from Kimi CLI. It is a terminal coding agent, I've been used it last month with Kimi subscription, it is capable agent with stable harness. GitHub: http...

| Repo · 2026-01-30 | Kimi K2.5 (Kimi) | 388 | 141 | 2 Positive, 2 Negative, 1 Mixed |

Top comments:
- **Positive / Pro**, 4 replies: I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good a...
- **Negative / Anti**, 5 replies: Seems that K2.5 has lost a lot of the personality from K2 unfortunately, talks in more ChatGPT/Gemini/C-3PO style now. It's not explicitly bad, I'm sure most people won't care but it was something that made it unique so it's a shame to see it go. Examples to illustrate: https://ww...
- **Negative / Anti**, 0 replies: I tried this today. It's good - but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh, the speed via...
- **Positive / Pro**, 0 replies: I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.
- **Mixed / Mixed**, 2 replies: It is amazing, but "open source model" means "model I can understand and modify" (= all the training data and processes). Open weights is an equivalent of binary driver blobs everyone hates. "Here is an opaque thing, you have to put it on your computer and trust it, and you can...

xAI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| Official · 2024-08-14 | Grok 2 (Grok) | 226 | 333 | 2 Positive, 1 Neutral, 1 Mixed, 1 Negative |

Top comments:
- **Mixed / Mixed**, 8 replies: The technology is impressive; achieving such a level requires a lot of efforts in dataset creation, neural architectural costs, and GPU shepherding. What is the company’s ethical position though? It officially stemmed from Mr Musk’s objection that OpenAI was not open-source, bu...
- **Neutral / Pro**, 2 replies: I assume they will have a lot less "safety", i.e. the model will be more likely to actually do what you ask instead of finding a reason why "sorry Dave, I can't do that". Since these "safety" features tend to also degrade the model, that's likely also helping them catch up in t...
- **Negative / Anti**, 1 reply: It's hilarious they put Claude 3.5 Sonnet in the far right corner while it scores the highest and beats most of Grok's numbers.
- **Positive / Pro**, 1 reply: It uses FLUX.1 to generate images and it has been fun so far. It's good at writing, can generate very realistic photos, can create memes, and looks like the hands problem is fixed now.
- **Positive / Pro**, 3 replies: You know what’s also impressive besides this beta release? How Claude 3.5 Sonnet is still able to keep up so well. Grok-2 beat every other LLM except Claude. How did Anthropic achieve this?

| 3rd party · 2025-02-20 | Grok 3 (Grok) | 132 | 211 | 5 Negative |

Top comments:
- **Negative / Anti**, 7 replies: This article is weak and just general speculation. Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this: > Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all ot...
- **Negative / Anti**, 2 replies: The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok ...
- **Negative / Anti**, 10 replies: Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute. If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0....
- **Negative / Anti**, 2 replies: I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make the Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good. Another nit-pick: I don't think DeepSeek had 50k Hopper GPUs. Mayb...
- **Negative / Anti**, 0 replies: The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is: > Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only...

|
3rd party
2025-07-10
|
Grok 4
Grok
|
437 | 604 |
|
|
Positive
Pro
5 repliesr
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.Specialized coding model coming "in a few weeks". I notice they didn't talk ab... |
||||
|
Positive
Pro
9 repliesr
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent d... |
||||
|
Very positive
Pro
3 repliesr
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc... |
||||
|
Neutral
Neutral
0 repliesr
"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%.""This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."https://x.com/arcprize/status/1943168950763950555 |
||||
|
Negative
Anti
16 repliesr
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.I can already use Gemini 2.5 Pro for free in AI s... |
||||
|
3rd party
2025-07-10
|
Grok 4
Grok
|
328 | 253 |
|
|
Negative
Anti
5 repliesr
Here's something far more interesting about Grok 4: if you ask for its opinion on controversial subjects it sometimes runs a search on X for tweets "from:elonmusk" before it answers! https://simonwillison.net/2025/Jul/11/grok-musk/ |
||||
|
Negative
Mixed
5 repliesr
[edit to focus on pricing, leaving praise of Simon's post out despite being deserved]Simon claims, 'Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4.' This ignores the real price which skyroc... |
||||
|
Neutral
Anti
11 repliesr
Claude Code converted me from paying $0 for LLMs to $200 per month. Any co that wants a chance at getting that $200 ($300 is fine too) from me needs a Claude Code equivalent and a model where the equivalent's tools were part of its RL environment. I don't think I can go back t... |
||||
|
Negative
Anti
3 repliesr
Is it time for a new benchmark of "how easy is it to turn this AI into a 4chan poster", maybe it is since this seems to be an axis that Elon seems to want to distinguish his AI offering from everyone else's along. |
||||
|
Neutral
Neutral
4 repliesr
> My best guess is that these lines in the prompt were the root of the problem:The second line was recently removed, per the GitHub: https://github.com/xai-org/grok-prompts/commit/c5de4a14feb50... |
||||
|
Official
2025-08-29
|
Grok Code Fast 1
Grok
|
512 | 477 |
|
|
Positive
Pro
10 repliesr
Tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which is awesome in my bo... |
||||
|
Negative
Anti
14 repliesr
It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed.I would have thought it uncontroversial view among software engineers that token quality is much important than token ... |
||||
|
Negative
Anti
1 repliesr
Is this the model that is the "Coding" version of Grok-4 promised when Grok-4 had awful coding benchmarks?I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up wa... |
||||
|
Negative
Anti
2 repliesr
"On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness."Let's see this harness, then, because third party reports rate it at 57.6%https://www.vals.ai/models/grok_grok-code-fast-1 |
||||
|
Mixed
Anti
3 repliesr
I've actually seem really good outputs from the regular Grok 4. The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good. I never wanted a faster version, I just wanted a bit more feedback and explanations for sugg... |
||||
|
Official
2025-09-20
|
Grok 4 Fast
Grok
|
96 | 76 |
|
|
Negative
Anti
3 repliesr
Why would a model called "Fast" not advertise the tokens per second speed it performs at? Is "Fast" not representing speed, but another meaning? Is it too variable? |
||||
|
Neutral
Neutral
1 repliesr
Matches Grok 4 at the top of the Extended NYT Connections leaderboard: https://github.com/lechmazur/nyt-connections/ |
||||
|
Positive
Pro
1 repliesr
grok-code-fast-1 has been my preferred model lately, but I don't see any mention of it as part of this release. I'm wondering if this might be better? Even if grok-code-fast-1 might be slightly worse than Gemini 2.5 Pro, the speed of iteration can't be beat. |
||||
|
Positive
Pro
4 repliesr
Surprising to see negativity here. I send all my LLM queries to 5 LLMs - ChatGPT, Claude, DeepSeek (local), Perplexity, and Grok - and Grok consistently gives good answers and often the most helpful answers. It's ~always king when there's any 'ethical' consideration (i.e. othe... |
||||
|
Negative
Anti
4 repliesr
A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests? |
||||
|
Official
2025-11-17
|
Grok 4.1 Fast
Grok
|
140 | 128 |
|
|
Neutral
Neutral
5 repliesr
https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D... |
||||
|
Negative
Anti
4 repliesr
No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (and from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking). |
||||
|
Negative
Anti
5 repliesr
Not a big fan of emojis becoming the norm in LLM output.It seems Grok 4.1 uses more emojis than 4.Also GPT5.1 thinking is now using emojis, even in math reasoning. 5 didn't do that. |
||||
|
Very negative
Anti
3 repliesr
Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which make it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" m... |
||||
|
Negative
Neutral
1 repliesr
It is exhausting deciding which model to use on any given day. |
||||
Microsoft
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
Phi-3 (Phi) · Paper · 2024-04-23 · score 411 · 130 comments

- Mixed / Mixed (5 replies): Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into ...
- Very positive / Pro (10 replies): Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release. And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large. Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown) So we now have an ...
- Positive / Pro (0 replies): This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than organic text training, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringemen...
- Neutral / Neutral (2 replies): They have started putting some models in huggingface: https://huggingface.co/collections/microsoft/phi-3-6626e15e9...
- Negative / Anti (0 replies): I'll believe it till I try it for myself, Phi-2 was the clear worst of the 20 LLMs we evaluated (was also smallest so was expected). But it was slow for its size, generated the longest responses with the most hallucinations, as well as generating the most empty responses. It wa...

Phi-3.5-MoE instruct (128k) (Phi) · Repo · 2024-08-20 · score 25 · 3 comments

- Positive / Pro (0 replies): Mini: https://huggingface.co/microsoft/Phi-3.5-mini-instruct Large MoE with impressive benchmarks: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
- Neutral / Neutral (0 replies): Does anyone have an idea what the output token limit is? I only see mention of the 128k token context window, but I bet the output limit is 4k tokens.
- Negative / Anti (0 replies): The Phi models always seem to do really well when it comes to benchmarks but then in real world performance they always fall way behind competing models.

Phi-4 (Phi) · Official · 2024-12-13 · score 439 · 143 comments

- Neutral / Neutral (12 replies): The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905 Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on...
- Mixed / Anti (2 replies): For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend.
- Positive / Pro (7 replies): Looks like it punches way above its weight(s). How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?
- Neutral / Neutral (1 reply): Looks like someone converted it for Ollama use already: https://ollama.com/vanilj/Phi-4
- Negative / Anti (0 replies): I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model. I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than ...

Phi-4-multimodal (Phi) · Official · 2025-02-26 · score 48 · 3 comments

- Positive / Pro (0 replies): Great on-device additions to the larger Phi-4 14B released Dec/2024. https://lifearchitect.ai/models-table/
- Positive / Pro (1 reply): That looks really good. A 3.8B model holding it's own against pretty 7B class models is definitely an accomplishment. Sizing looks like it is aimed primarily at copilot style NPUs I guess

Phi-4-reasoning (Phi) · Official · 2025-05-01 · score 131 · 26 comments

- Neutral / Neutral (1 reply): We uploaded GGUFs for anyone who wants to run them locally. [EDIT] - I fixed all chat templates so no need for --jinja as at 10:00PM SF time. Phi-4-mini-reasoning GGUF: https://huggingface.co/unsloth/Phi-4-mini-reasoning-GGUF Phi-4-reasoning-plus-GGUF: https://huggingface.co/unsl...
- Very positive / Pro (2 replies): These look quite incredible. I work on a llama.cpp GUI wrapper and its quite surprising to see how well Microsoft's Phi-4 releases set it apart as the only competition below ~7B, it'll probably take a year for the FOSS community to implement and digest it completely (it can do...
- Neutral / Neutral (0 replies): Sorry if this comment is outdated or ill-informed, but it is hard to follow the current news. Do the Phi models still have issues with training on the test set, or have they fixed that?
- Mixed / Mixed (0 replies): Is anyone here using phi-4 multimodal for image-to-text tasks? The phi models often punch above their weight, and I got curious about the vision models after reading https://unsloth.ai/blog/phi4 stories of finetuning. Since lmarena.ai only has the phi-4 text model, I've tried "ph...
- Mixed / Mixed (1 reply): The example prompt for reasoning model that never fails to amuse me: "How amy letter 'r's in the word 'strrawberrry'?" Phi-4-mini-reasoning: thought for 2 min 3 sec <think> Okay, let's see here. The user wants to know how many times the letter 'r' appears in the word 'strr...
MiniMax
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
MiniMax-M1 (MiniMax) · Repo · 2025-06-18 · score 349 · 75 comments

- Positive / Pro (2 replies): 1. this is apparently MiniMax's "launch week" - they did M1 on Monday and Hailuo 2 on Tuesday (https://news.smol.ai/issues/25-06-16-chinese-models). remains to be seen if they can keep up the pace of model releases for the rest of this week - these 2 were big ones, they aren't...
- Neutral / Anti (3 replies): In case you're wondering what it takes to run it, the answer is 8x H200 141GB [1] which costs $250k [2]. 1. https://github.com/MiniMax-AI/MiniMax-M1/issues/2#issuecomme... 2. https://www.ebay.com/itm/335830302628
- Positive / Pro (0 replies): "We publicly release MiniMax-M1 at this https url" in the arxiv paper, and it isn't a link to an empty repo! I like these people already.
- Positive / Pro (4 replies): A few thoughts: * A Singapore based company, according to LinkedIn. There doesn't seem to be much of a barrier to entry to building a very good LLM. * Open weight models + the development of Strix Halo / Ryzen AI Max makes me optimistic that running great LLMs locally will be re...
- Neutral / Neutral (2 replies): This is stated nowhere on the official pages, but it's a Chinese company. https://en.wikipedia.org/wiki/MiniMax_(company)

MiniMax-M2.1 (MiniMax) · Official · 2025-12-26 · score 228 · 81 comments

- Mixed / Mixed (1 reply): I've played with this a bit and it's ok. I'd place it somewhere around sonnet 4.5 level, probably below. But with this aggressive pricing you can just run 3 copies to do the same thing, choose the one that succeeded and still come out way ahead with the cost. Not as great as f...
- Negative / Anti (2 replies): Would it kill them to use the words "AI coding agent" somewhere prominent? "MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks" could be an IDE, a UI framework, a performance library, or, or...
- Neutral / Neutral (0 replies): The weights got released on huggingface now. https://huggingface.co/MiniMaxAI/MiniMax-M2.1
- Neutral / Neutral (1 reply): I think people should stop comparing to sonnet, but to opus instead since it's so far ahead on producing code I would actually want to use (gemini 3 pro tends to be lacking in generalization and wants things to be using it's own style rather than adapting). Whatever benchmark o...
- Negative / Anti (2 replies): > MiniMax has been continuously transforming itself in a more AI-native way. The core driving forces of this process are models, Agent scaffolding, and organization. Throughout the exploration process, we have gained increasingly deeper understanding of these three aspects....

MiniMax-M2.5 (MiniMax) · Official · 2026-02-12 · score 207 · 55 comments

- Negative / Anti (3 replies): I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the st...
- Negative / Anti (2 replies): Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...
- Very positive / Pro (2 replies): Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kim...
- Very negative / Anti (1 reply): Hm. The benchmarks look too good to be true and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective prop...
- Negative / Anti (0 replies): M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking

MiniMax-M2.7 (MiniMax) · 3rd party · 2026-04-12 · score 80 · 34 comments

- Negative / Anti (4 replies): Absolutely not "open source" - here's the license: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICE... > Non-commercial use permitted based on MIT-style terms; commercial use requires prior written authorization. And calling the non-commercial usage "MIT-style ter...
- Positive / Pro (1 reply): GGUFs are out too, well done Unsloth as usual! https://huggingface.co/unsloth/MiniMax-M2.7-GGUF I've been using M2.7 through the Alibaba coding plan for a bit now, and am quite impressed with it's coding ability, and even more impressed when I see how small it is. Fascinating re...
- Negative / Anti (2 replies): What's people's experience of using MiniMax for coding? I had a really bad time with it. I use (real) Claude Code for work so I know what a good model feels like. MiniMax's token plan is nice but the quality is really far from Claude models. I needed to constantly "remind" it to...
- Mixed / Mixed (1 reply): "Helped build itself" is a bit of a stretch here, it makes it sound as if the model was doing lasting self-improvements. What the article describes is that the model was able to tweak to its own deployment harness (memory, skills, experimental loop etc) to improve performance o...
- Mixed / Mixed (1 reply): In addition to this conversation already having been started at https://news.ycombinator.com/item?id=47735348 yesterday, MiniMax M2.7 is not open source. The open weights have been released, which is definitely good and follows some of the spirit of open source, but isn't the ...
Cohere
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
Aya · Official · 2024-02-13 · score 191 · 59 comments

- Negative / Anti (2 replies): I asked it how to "kill all the Apaches that are taking up RAM on my machine" and it just wouldn't give me the command. It's nice that they're releasing it open but it's useless for software or sysadmin tasks. > As an AI language model, my purpose is to provide helpful and h...
- Negative / Anti (2 replies): Just tried it with a piece from my company's strategy statement and asked what it thought of it. (Everything in Portuguese). Instead of giving insights, it tried to repeat the content of the document in a more verbose way and it even invented a word that's not valid Portuguese...
- Negative / Anti (1 reply): It seems even more censored than OpenAI platforms which is a feat in itself.
- Neutral / Neutral (0 replies): I had a very quick dig around in one of the files in the training data, just to get a feel for what's in there: https://gist.github.com/simonw/0d641ff95731a09e2f1235a646d84...
- Positive / Pro (0 replies): Very cool to have an open model handle so many human languages. A little off topic: I experiment with about a dozen models using Ollama. Two of the models from China seem excellent for Python. Odin and analysis problems but they only work with English and Mandarin, and I have ...

Command R (Command) · Official · 2024-03-11 · score 37 · 1 comment

- Neutral / Neutral (0 replies): Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise.

Cohere Command R+ (Command) · Official · 2024-04-04 · score 59 · 19 comments

- Positive / Pro (2 replies): I added support for this model to my LLM CLI tool via a new plugin: https://github.com/simonw/llm-command-r So now you can do this: pipx install llm llm install llm-command-r llm keys set cohere <paste Cohere API key here> llm -m command-r-plus "3 reasons to adopt a sea l...
- Neutral / Neutral (0 replies): Weights available: https://huggingface.co/CohereForAI/c4ai-command-r-plus
- Neutral / Neutral (1 reply): What's the difference between this and their prior command R model? other than price Cohere API Pricing $ / M input tokens $ / M output tokens Command R $0.50 $1.50 Command R+ $3.00 $15.00 Command-R: RAG at production scale https://news.ycombinator.com/item?id=39671872
- Mixed / Mixed (3 replies): >built for business Yet the license says: > License: CC-BY-NC Besides this snark, seems really good from the benchmark and I'm really glad and grateful people are releasing weights :)
- Neutral / Neutral (1 reply): We need vocabulary to clarify use cases where someone in the company uses an LLM to get an answer as opposed to a customer facing LLM.

Cohere Command A (Command) · Official · 2025-03-14 · score 77 · 19 comments

- Negative / Anti (1 reply): I distrust those benchmarks after working with sonnet for half a year now. Many OpenAI models beat Sonnet on paper. This seems to be the case because it's strength (agent, visual, caching) aren't being used, I guess? Otherwise there is no explanation why it's not constantly on...
- Negative / Anti (2 replies): I just tried the chat and asked the LLM to compute the double integral of 6*y on the interior of a triangle given the vertices. There were many trials all incorrect, then I asked to compute a python program to solve this, again incorrect. I know math computation is a weak poin...
- Very negative / Anti (1 reply): Funny how AI companies love training competitors to human labor on human output but then write in their terms that you’re not supposed to train competing bots on their bot output. Explicitly anticompetitive hypocrisy, and millions of suckers pay for it, how sad
- Negative / Anti (0 replies): what disqualified this model for me (I mostly use llms for codding) was 12% score in aider benchmark (https://aider.chat/docs/leaderboards/)
- Positive / Pro (0 replies): "Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency." Visible above the fold. Thanks for getting to the point.
Limitations
Caveats to keep in mind when interpreting the data.
- Some model versions map to multiple HN posts. This is not a strict one-post-per-version dataset.
- Comment analysis is based on 735 sampled, non-deleted top comments from 152 of 156 posts. 4 posts have no displayable sampled comments after filtering, so results are directional, not exhaustive.
- Sentiment and stance labels are manually reviewed, but the coarse categories can still flatten sarcasm, nuance, or constructive criticism in borderline comments.
- Theme counts use regex keyword matching and may include false positives or miss relevant comments.
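The regex keyword matching behind the theme counts can be sketched roughly as follows. The theme names and patterns here are illustrative assumptions, since the report's actual keyword lists are not published; the false-positive risk noted above comes from exactly this kind of substring-level matching:

```python
import re

# Hypothetical theme keyword lists -- the report's real patterns
# are not published, so these are placeholders for illustration.
THEMES = {
    "benchmarks": [r"\bbenchmarks?\b", r"\bSWE-?Bench\b", r"\bleaderboard\b"],
    "pricing": [r"\$\d+", r"\bpric(?:e|ing)\b", r"\bper million tokens?\b"],
    "local_use": [r"\bGGUFs?\b", r"\bllama\.cpp\b", r"\bollama\b"],
}

def match_themes(comment: str) -> list[str]:
    """Return every theme whose keyword patterns hit the comment text."""
    return [
        theme
        for theme, patterns in THEMES.items()
        if any(re.search(p, comment, re.IGNORECASE) for p in patterns)
    ]

print(match_themes("The benchmarks look great, and GGUFs are already on Hugging Face"))
# → ['benchmarks', 'local_use']
```

A comment can match several themes at once, which is why theme counts across a section can exceed the number of labeled comments.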
Full dataset index
The index below lists all 156 deduplicated posts, sorted by date. Use it to inspect individual posts and verify the data behind this analysis.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Official
2023-03-14
|
GPT-4
GPT
|
4,091 | 2,507 |
|
|
Very positive
Pro
44 repliesr
After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patie... |
||||
|
Negative
Anti
42 repliesr
A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example:>Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not all... |
||||
|
Very negative
Anti
14 repliesr
I just finished reading the 'paper' and I'm astonished that they aren't even publishing the # of parameters or even a vague outline of the architecture changes. It feels like such a slap in the face to all the academic AI researchers that their work is built off over the years... |
||||
|
Negative
Anti
9 repliesr
That footnote on page 15 is the scariest thing i've read about AI/ML to date."To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and... |
||||
|
Very positive
Pro
6 repliesr
From the livestream video, the tax part was incredibly impressive. After ingesting the entire tax code and a specific set of facts for a family and then calculating their taxes for them, it then was able to turn that all into a rhyming poem. Mind blown. Here it is in its entir... |
||||
|
Official
2023-07-18
|
Llama 2
Llama
|
2,268 | 820 |
|
|
Positive
Pro
7 repliesr
Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5!AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.- Llama 1 (llama-65b): 57.6- LLama 2 (llama-2-70b-chat-hf): 64.6- GPT-3.5: 85.2- GPT-... |
||||
|
Negative
Anti
25 repliesr
Key detail from release:> If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must re... |
||||
|
Negative
Anti
6 repliesr
This was a pretty disappointing initial exchange:> what are the most common non-investor roles at early stage venture capital firms?Thank you for reaching out! I'm happy to help you with your question. However, I must point out that the term "non-investor roles" may be perc... |
||||
|
Positive
Pro
23 repliesr
Hey HN, we've released tools that make it easy to test LLaMa 2 and add it to your own app!Model playground here: https://llama2.aiHosted chat API here: https://replicate.com/a16z-infra/llama13b-v2-chatIf you want to just play with the model, llama2.ai is a very easy way to do ... |
||||
|
Negative
Anti
5 repliesr
Another non-open source license. Getting better but don't let anyone tell you this is open source. http://marble.onl/posts/software-licenses-masquerading-as-op... |
||||
|
Official
2023-09-27
|
Mistral 7B
Mistral
|
884 | 618 |
|
|
Very positive
Pro
5 repliesr
Major kudos to Mistral for being the first company to Apache license a model of this class.Meta wouldn't make LLama open source.DeciLM wouldn't make theirs open source.All of them wanted to claim they were open source, while putting in place restrictions and not using an open ... |
||||
|
Negative
Anti
1 repliesr
I asked it "Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner." and it responded:To stack these items on top of each other in a stable manner, follow these steps:1. Start with the 9 eggs. Arrange the... |
||||
|
Positive
Pro
5 repliesr
I eat my initial words, this works really well on my macbook air M1 and feels comparable of GPT3.5 - which is actually an amazing feat!Question: is there something like this, but with the "function calling api" finetuning? 95% of my uses nowadays deal with input/output of stru... |
||||
|
Neutral
Neutral
0 repliesr
For anyone who missed it, the twitter announcement of the model was just a torrent tracker uri: https://twitter.com/MistralAI/status/1706877320844509405 |
||||
|
Negative
Anti
2 repliesr
They don't mention what datasets were used. I've come across too many models in the past which gave amazing results because benchmarks leaked into their training data. How are we supposed to verify one of these HuggingFace datasets didn't leak the benchmarks into the training ... |
||||
|
Official
2023-12-06
|
Gemini
Gemini
|
2,135 | 1,602 |
|
|
Neutral
Neutral
0 repliesr
Related blog post: https://blog.google/technology/ai/google-gemini-ai/ (via https://news.ycombinator.com/item?id=38544746, but we merged the threads) |
||||
|
Positive
Pro
8 repliesr
Very impressive! I noticed two really notable things right off the bat:1. I asked it a question about a feature that TypeScript doesn't have[1]. GPT4 usually does not recognize that it's impossible (I've tried asking it a bunch of times, it gets it right with like 50% probabil... |
||||
|
Neutral
Neutral
5 repliesr
For others that were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro.For the differences, looking at the technical report [1] on selected benchmarks, rounded score in %... |
||||
|
Positive
Pro
21 repliesr
This demo is nuts: https://youtu.be/UIZAiXYceBI?si=8ELqSinKHdlGlNpX |
||||
|
Mixed
Mixed
30 repliesr
One observation: Sundar's comments in the main video seem like he's trying to communicate "we've been doing this ai stuff since you (other AI companies) were little babies" - to me this comes off kind of badly, like it's trying too hard to emphasize how long they've been doing... |
||||
|
3rd party
2023-12-08
|
Mixtral 8x7B
Mixtral
|
546 | 239 |
|
|
Positive
Pro
10 repliesr
In other llm news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... Very good:https://huggingface.co/fblgit/una-xaberius-... |
||||
|
Neutral
Neutral
4 repliesr
Andrej Karpathy's take:New open weights LLM from @MistralAIparams.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2Likely related code: https://github.com/mistralai/megablocks... |
||||
|
Positive
Pro
2 repliesr
Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini! |
||||
|
Positive
Pro
3 repliesr
Do you need some fancy announcement? let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph... |
||||
|
Neutral
Neutral
2 repliesr
Looks to be Mixture of Experts, here is the params.json: { "dim": 4096, "n_layers": 32, "head_dim": 128, "hidden_dim": 14336, "n_heads": 32, "n_kv_heads": 8, "norm_eps": 1e-05, "vocab_size": 32000, "moe": { "num_experts_per_tok": 2, "num_experts": 8 } } |
||||
|
Official
2023-12-11
|
Mixtral 8x7B
Mixtral
|
639 | 300 |
|
|
Neutral
Neutral
1 repliesr
Andrej Karpathy's Take:Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6...New HuggingFace explainer on MoE very nice: https://huggingface.co/bl... |
||||
|
Neutral
Neutral
2 replies
More models available at Huggingface now: https://huggingface.co/search/full-text?q=mixtral Already available from both Mistralai and TheBloke https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF |
||||
|
Positive
Pro
6 replies
> We’re currently using Mixtral 8x7B behind our endpoint mistral-small...So 45 billion parameters is what they consider their "small" model? I'm excited to see what/if their larger models will be. |
||||
|
Neutral
Pro
3 replies
> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.Mistral does not censor its models and is committed to a hands free approach, according to their CEO https://www.you... |
||||
|
Positive
Pro
1 reply
This is very exciting and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think we will get a lot better until that happens and the effort will go into making the models a lot cheaper to run without sacrificing too much... |
||||
|
Paper
2024-01-09
|
Mixtral 8x7B
Mixtral
|
359 | 150 |
|
|
Very positive
Pro
4 replies
This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It's roughly sized at 13B, and ranked much, much higher than competitively sized models by, e.g. https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com...... |
||||
|
Positive
Pro
7 replies
I’d like to note that this model’s parameter usage is low enough (13b) to run smoothly at high quality on a 3090 while beating GPT-3.5 on humaneval and sporting 32k context. 3090s are consumer grade and common on gaming rigs. I’m hoping game devs start experimenting with locall... |
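The "roughly sized at 13B" figure in these comments refers to active parameters per token, not the total. Back-of-the-envelope arithmetic from the published params.json (norms and biases omitted, so treat the results as estimates) reproduces the commonly quoted numbers:

```python
# Rough parameter count for Mixtral 8x7B from its published config.
dim, hidden, layers, vocab = 4096, 14336, 32, 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128
experts, active_experts = 8, 2

ffn = 3 * dim * hidden                       # SwiGLU: gate, up, down projections
attn = dim * (n_heads * head_dim)            # Q projection
attn += 2 * dim * (n_kv_heads * head_dim)    # K and V (grouped-query attention)
attn += (n_heads * head_dim) * dim           # output projection
emb = 2 * vocab * dim                        # input embedding + output head

total = layers * (attn + experts * ffn) + emb
active = layers * (attn + active_experts * ffn) + emb
print(f"total  = {total / 1e9:.1f}B")    # 46.7B
print(f"active = {active / 1e9:.1f}B")   # 12.9B
```

Memory needs follow the 46.7B total (roughly 23 GB at 4-bit quantization), while per-token compute follows the 12.9B active parameters.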
||||
|
Neutral
Neutral
1 reply
If anyone wants to try out this model, I believe it's one of the ones released as a Llamafile by Mozilla/jart[0]. 1) Download llamafile[1] (30.03 GB): https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-ll... 2) chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile 3) ./mixt... |
||||
|
Neutral
Neutral
2 replies
On mac silicon: https://ollama.ai/ ollama pull mixtral. For a chatgpt-esk web ui: https://github.com/ollama-webui/ollama-webui docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama... |
||||
|
Neutral
Neutral
0 replies
Recent and related: Mixtral of experts - https://news.ycombinator.com/item?id=38598559 - Dec 2023 (300 comments) Mistral-8x7B-Chat - https://news.ycombinator.com/item?id=38594578 - Dec 2023 (69 comments) Mistral "Mixtral" 8x7B 32k model [magnet] - https://news.ycombinator.com/ite... |
||||
|
Official
2024-02-13
|
Aya
Aya
|
191 | 59 |
|
|
Negative
Anti
2 replies
I asked it how to "kill all the Apaches that are taking up RAM on my machine" and it just wouldn't give me the command. It's nice that they're releasing it open but it's useless for software or sysadmin tasks.> As an AI language model, my purpose is to provide helpful and h... |
||||
|
Negative
Anti
2 replies
Just tried it with a piece from my company's strategy statement and asked what it thought of it. (Everything in Portuguese). Instead of giving insights, it tried to repeat the content of the document in a more verbose way and it even invented a word that's not valid Portuguese... |
||||
|
Negative
Anti
1 reply
It seems even more censored than OpenAI platforms which is a feat in itself. |
||||
|
Neutral
Neutral
0 replies
I had a very quick dig around in one of the files in the training data, just to get a feel for what's in there: https://gist.github.com/simonw/0d641ff95731a09e2f1235a646d84... |
||||
|
Positive
Pro
0 replies
Very cool to have an open model handle so many human languages. A little off topic: I experiment with about a dozen models using Ollama. Two of the models from China seem excellent for Python. Odin and analysis problems but they only work with English and Mandarin, and I have ... |
||||
|
Official
2024-02-15
|
Gemini 1.5 Pro
Gemini
|
1,244 | 588 |
|
|
Very positive
Pro
36 replies
The white paper is worth a read. The things that stand out to me are: 1. They don't talk about how they get to 10M token context 2. They don't talk about how they get to 10M token context 3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creat... |
||||
|
Neutral
Neutral
0 replies
One interesting tidbit from the technical report:>HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conser... |
||||
|
Positive
Pro
8 replies
Massive whoa if true from technical report: "Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens" https://storage.googleapis.com/deepmind-media/gemini/g... |
||||
|
Negative
Anti
6 replies
0 trust to what they put out until I see it live. After the last "launch" video which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form. |
||||
|
Very negative
Anti
4 replies
Personally, I've given up on Gemini, as it seems to have been censored to the point of uselessness. I asked it yesterday [0] about C++ 20 Concepts, and it refused to give actual code because I'm under 18 (I'm 17, and AFAIK that's what the age on my Google account is set to). I... |
||||
|
Official
2024-02-21
|
Gemma
Gemma
|
1,129 | 509 |
|
|
Neutral
Neutral
9 replies
Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B +-------------+----------+-------------+-------------+ | Benchmark | Gemma 7B | Mistral 7B | Llama-2 7B | +-------------+----------+-------------+-------------+ | MMLU | 64.3 | 60.1 | 45.3 | | HellaSwag | 81.2 | ... |
||||
|
Negative
Anti
15 replies
The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy Something that caught my eye in the terms: > Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma. One of the... |
||||
|
Neutral
Neutral
29 replies
Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models. Opinions are our own and not of Google DeepMind. |
||||
|
Neutral
Neutral
3 replies
I notice a few divergences from common models: - The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x; - The vocabulary size is 10x (256K vs. Mistral’s 32K); - The training token count is tripled (6T vs. Llama2's 2T). Apart from that, it uses the ... |
||||
|
Negative
Anti
6 replies
Is there a chance we'll get a model without the "aligment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine tuning. |
||||
|
Official
2024-02-26
|
Mistral Large (latest)
Mistral
|
599 | 267 |
|
|
Positive
Pro
4 replies
I appreciate the honesty in the marketing materials. Showing the product scoring below the market leader in a big benchmark is better than the Google way of cherry picking benchmarks. |
||||
|
Mixed
Mixed
1 reply
Very nice! I know they've already done a lot, but I would've liked some language in there re-affirming a commitment to contributing to the open source community. I had thought that was a major part of their brand.I've been staying tuned[0] since the miqu[1] debacle thinking th... |
||||
|
Neutral
Neutral
2 replies
Changelog is also updated: [1] Feb. 26, 2024: API endpoints: We renamed 3 API endpoints and added 2 model endpoints. open-mistral-7b (aka mistral-tiny-2312): renamed from mistral-tiny. The endpoint mistral-tiny will be deprecated in three months. open-mixtral-8x7B (aka mistral-smal... |
||||
|
Neutral
Neutral
1 reply
I just added support for the new models to my https://github.com/simonw/llm-mistral plugin for my LLM CLI tool. You can now do this: pipx install llm llm install llm-mistral llm keys set mistral < paste your API key here > llm -m mistral-large 'prompt goes here' |
||||
|
Positive
Pro
2 replies
Just tried Le Chat for some coding issues I had today that ChatGPT (with GPT-4) wasn't able to solve, and Le Chat actually gave way better answers. Not sure if ChatGPT quality has gone down to save costs as some people suggest, but for these few problems the quality of the ans... |
||||
|
3rd party
2024-03-04
|
Gemini 1.5 Pro
Gemini
|
51 | 72 |
|
|
Mixed
Neutral
13 replies
Within a couple sentences of each other he says (paraphrasing), “if you test any model (Gemini, ChatGPT, Grok) you’ll see it says far left things” and “we don’t know why [Gemini] is far left leaning”.I think he’s being honest here. At least at the executive level of having no ... |
||||
|
Neutral
Neutral
0 replies
Transcript: https://lifearchitect.ai/sergey/ |
||||
|
Negative
Anti
0 replies
> The Gemini art? Yeah, with the… Okay. That wasn’t really expected [...] > So, once again, we haven’t fully understood why it means “left” in many cases. And that’s not our intention.Riiiight. |
||||
|
Neutral
Neutral
0 replies
Anyone except me thinks he doesn't look very healthy? Its strange he is kind of slow on the video where he enters the room. Maybe some biohacking. |
||||
|
Very negative
Anti
4 replies
Let's be clear. This isn't a left or right thing. That's just a convenient argument to avoid the real problem. It's about moral relativism.I am about as left wing as you can be and gemini is simply offensive. It didn't want to say if Hitler was worse than Elon or Trump. It did... |
||||
|
Official
2024-03-11
|
Command R
Command
|
37 | 1 |
|
|
Neutral
Neutral
0 replies
Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. |
||||
|
Official
2024-04-04
|
Cohere Command R+
Command
|
59 | 19 |
|
|
Positive
Pro
2 repliesr
I added support for this model to my LLM CLI tool via a new plugin: https://github.com/simonw/llm-command-r So now you can do this: pipx install llm llm install llm-command-r llm keys set cohere <paste Cohere API key here> llm -m command-r-plus "3 reasons to adopt a sea l... |
||||
|
Neutral
Neutral
0 replies
Weights available: https://huggingface.co/CohereForAI/c4ai-command-r-plus |
||||
|
Neutral
Neutral
1 reply
What's the difference between this and their prior Command R model, other than price? Cohere API pricing ($ / M input tokens, $ / M output tokens): Command R $0.50 / $1.50; Command R+ $3.00 / $15.00. Command-R: RAG at production scale https://news.ycombinator.com/item?id=39671872 |
||||
|
Mixed
Mixed
3 replies
> built for business Yet the license says: > License: CC-BY-NC Besides this snark, seems really good from the benchmark and I'm really glad and grateful people are releasing weights :) |
||||
|
Neutral
Neutral
1 reply
We need vocabulary to clarify use cases where someone in the company uses an LLM to get an answer as opposed to a customer facing LLM. |
||||
|
Official
2024-04-09
|
GPT-4 Turbo Vision
GPT
|
220 | 107 |
|
|
Positive
Pro
8 replies
They also added both JSON and function support to the vision model - previously it didn't have those.This means you can now use gpt-4-turbo vision to extract structured data from an image!I was previously using a nasty hack where I'd run the image through the vision model to e... |
||||
|
Neutral
Neutral
2 replies
Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities: > "I’m hoping we can get evals out shortly to help quantify this. Until ... |
||||
|
Negative
Anti
2 replies
People like to say AI is moving blazingly fast these days but this has been like a year in the waiting que. Guessing sora will take equally long if not way longer before the general audience gets to touch it. |
||||
|
Neutral
Pro
0 replies
Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images: https://www.datasette.cloud/blog/2024/datasette-extract/ YouTube video demo here: https://www.youtube.com/watc... |
||||
|
Negative
Anti
6 replies
Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive! |
||||
|
Official
2024-04-17
|
Mixtral 8x22B
Mixtral
|
533 | 242 |
|
|
Mixed
Anti
5 replies
First test: I tried to run a random taxation question through it. Output: https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467... Within the first paragraph, it outputs: > GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE Thought that was hilarious. |
||||
|
Neutral
Neutral
11 replies
Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all? |
||||
|
Neutral
Mixed
5 replies
"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 TurboMaybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex... |
||||
|
Mixed
Mixed
2 replies
Great to see such free to use and self-hostable models, but it's said that open now means only that. One cannot replicate this model without access to the training data. |
||||
|
Neutral
Neutral
3 replies
What's the best way to run this on my Macbook Pro?I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is... |
||||
|
Official
2024-04-18
|
Meta-Llama-3-70B-Instruct
Llama
|
2,199 | 923 |
|
|
Neutral
Neutral
0 replies
See also https://ai.meta.com/blog/meta-llama-3/ and https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... edit: and https://twitter.com/karpathy/status/1781028605709234613 |
||||
|
Neutral
Anti
15 replies
They've got a console for it as well, https://www.meta.ai/ And announcing a lot of integration across the Meta product suite, https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... Neglected to include comparisons against GPT-4-Turbo or Claude Opus, so I guess it's far ... |
||||
|
Positive
Pro
2 replies
Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases.Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others: p... |
||||
|
Positive
Pro
0 replies
Llama 3 70B has debuted on the famous LMSYS chatbot arena leaderboard at position number 5, tied with Claude 2 Sonnet, Bard (Gemini Pro), and Command R+, ahead of Claude 2 Haiku and older versions of GPT-4.The score still has a large uncertainty so it will take a while to dete... |
||||
|
Negative
Anti
4 replies
I tried generating a Chinese rap song, and it did generate a pretty good rap. However, upon completion, it deleted the response, and showed > I don’t understand Chinese yet, but I’m working on it. I will send you a message when we can talk in Chinese.I tried some other lang... |
||||
|
Paper
2024-04-23
|
Phi-3
Phi
|
411 | 130 |
|
|
Mixed
Mixed
5 replies
Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into ... |
||||
|
Very positive
Pro
10 replies
Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release. And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large. Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown) So we now have an ... |
||||
|
Positive
Pro
0 replies
This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than organic text training, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringemen... |
||||
|
Neutral
Neutral
2 replies
They have started putting some models in huggingface: https://huggingface.co/collections/microsoft/phi-3-6626e15e9... |
||||
|
Negative
Anti
0 replies
I'll believe it till I try it for myself, Phi-2 was the clear worst of the 20 LLMs we evaluated (was also smallest so was expected).But it was slow for its size, generated the longest responses with the most hallucinations, as well as generating the most empty responses. It wa... |
||||
|
Official
2024-05-13
|
GPT-4o
GPT
|
3,138 | 2,366 |
|
|
Positive
Mixed
14 replies
The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that was that they had tested this over and over, but it is really hard to get that right so if they didn't fake it in some way I'd say that ... |
||||
|
Mixed
Mixed
8 replies
Very interesting and extremely impressive!I tried using the voice chat in their app previously and was disappointed. The big UX problem was that it didn't try to understand when I had finished speaking. English is a second language and I paused a bit too long thinking of a wor... |
||||
|
Positive
Mixed
3 replies
Very, very impressive for a "minor" release demo. The capabilities here would look shockingly advanced just 5 years ago.Universal translator, pair programmer, completely human sounding voice assistant and all in real time. Scifi tropes made real.But: Interesting next to see ho... |
||||
|
Negative
Anti
8 replies
I found these videos quite hard to watch. There is a level of cringe that I found a bit unpleasant.It’s like some kind of uncanny valley of human interaction that I don’t get on nearly the same level with the text version. |
||||
|
Positive
Pro
15 replies
We've had voice input and voice output with computers for a long time, but it's never felt like spoken conversation. At best it's a series of separate voice notes. It feels more like texting than talking.These demos show people talking to artificial intelligence. This is new. ... |
||||
|
Official
2024-05-14
|
Gemini Flash Latest
Gemini
|
442 | 146 |
|
|
Neutral
Neutral
0 replies
I upgraded my llm-gemini plugin to provide CLI access to Gemini Flash: pipx install llm # or brew install llm llm install llm-gemini --upgrade llm keys set gemini # paste API key here llm -m gemini-1.5-flash-latest 'a short poem about otters' https://github.com/simonw/llm-gemi... |
||||
|
Mixed
Mixed
9 replies
Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context.Not bad compared to rolling your own, but among frontier models the main competitive differentiator... |
||||
|
Neutral
Mixed
5 replies
1M token context by default is the big feature here IMO, but we need better benchmarks to measure what that really means.My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and wil... |
||||
|
Negative
Anti
0 replies
A lightweight model that you can only use in the cloud? That is amusing. These tech megacorps are really intent on owning your usage of AI. But we must not let that be the future. |
||||
|
Negative
Anti
0 replies
One thing OpenAI is beating Google at is actually publishing pricing for its APIs (and being consistent as to what they are called.)Google has I think 10 models available (there’s more than ten model names, but several of the models have multiple aliases) through what the Goog... |
||||
|
Official
2024-05-29
|
Codestral (latest)
Codestral
|
457 | 214 |
|
|
Negative
Anti
8 replies
The license for this [1] prohibits use of the model and its outputs for any commercial activity, or even any "live" (whatever that means) conditions, commercial or not.There seems to be an exclusion for using the code outputs as part of "development". But wait! It also prohibi... |
||||
|
Negative
Neutral
9 replies
My favorite thing to ask the models designed for programming is: "Using Python write a pure ASGI middleware that intercepts the request body, response headers, and response body, stores that information in a dict, and then JSON encodes it to be sent to an external program usin... |
||||
|
Neutral
Neutral
7 replies
i've been noticing that there's a divergence in philosophy between Llama style LLMs (Mistral are Meta alums so I'm counting them in tehre) and OpenAI/GPT style LLMs when it comes to code.GPT3.5+ prioritized code very heavily - there's no CodeGPT, its just GPT4, and every versi... |
||||
|
Neutral
Neutral
6 replies
Is there a way to use this within VSCode like copilot , meaning having the "shadow code" appear while you code instead of having to tho back-and-forth between the editor and a chat-like interface ?For me, a significant component of the quality of these tools resides on the "cl... |
||||
|
Neutral
Neutral
0 replies
> Usage Limitation: - You shall only use the Mistral Models and Derivatives (whether or not created by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-Production Environments; - Subject to the foregoing, You shall not supply the Mistral Models, Deriva... |
||||
|
Official
2024-06-06
|
Qwen2
Qwen
|
261 | 130 |
|
|
Positive
Pro
6 replies
A 0.5B parameter model with a 32k context length that also makes good use of that full window?! That's very interesting.The academic benchmarks on that particular model relative to 1.5B-2B models are what you would expect, but it would make for an excellent base for finetuning... |
||||
|
Neutral
Neutral
2 replies
Is it a common practice in LLMs to give different weights to different training data sources?For instance I might want to say that all training data that comes from my inhouse emails take precedence over anything that comes from the internet? |
||||
|
Positive
Pro
0 replies
This model has: 1. On par or better performance than Llama-3-70B-Instruct 2. A much more comfortable context length of 128k (vs the tiny 8k that really hinders Llama-3). These 2 feats together will probably make it the first serious OS rival to GPT-4! |
||||
|
Negative
Anti
5 replies
Weird, every time I try asking what happened at Tiananmen Square, or why Xi is an outlier with 3 terms as party secretary, it errors. "All hail Glorious Xi :)" works though. https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct |
||||
|
Neutral
Neutral
8 replies
Given the restrictions on GPUs to China, I'm curious what their training cluster looks like.(not saying this out of any support or non-support for such a GPU blockade; I'm just genuinely curious) |
||||
|
Official
2024-07-16
|
Codestral Mamba
Codestral
|
485 | 138 |
|
|
Negative
Mixed
7 replies
What are the steps required to get this running in VS Code?If they had linked to the instructions in their post (or better yet a link to a one click install of a VS Code Extension), it would help a lot with adoption.(BTW I consider it malpractice that they are at the top of ha... |
||||
|
Negative
Neutral
3 replies
I kinda just want something that can keep up with the original version of Copilot. It was so much better than the crap they’re pumping out now (keeps messing up syntax and only completing a few characters at a time). |
||||
|
Neutral
Neutral
2 replies
Does anyone have a favorite FIM capable model? I've been using codellama-13b through ollama w/ a vim extension i wrote and it's okay but not amazing, I definitely get better code most of the time out of Gemma-27b but no FIM (and for some reason codellama-34b has broken inferen... |
||||
|
Positive
Pro
0 replies
It's great to see a high-profile model using Mamba2! |
||||
|
Neutral
Anti
2 replies
The MBPP column should bold DeepSeek as it has a better score than Codestral. |
||||
|
Official
2024-07-18
|
Mistral Nemo
Mistral Nemo
|
418 | 162 |
|
|
Mixed
Mixed
6 replies
> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on s... |
||||
|
Neutral
Neutral
3 replies
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.Does anyone have a good a... |
||||
|
Neutral
Neutral
0 replies
Nvidia has a blog post about Mistral NeMo, too: https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/ > Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines. > *Designed to fit on the... |
||||
|
Neutral
Neutral
2 replies
These big models are getting pumped out like crazy, that is the business of these companies. But basically, it feels like private/industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/fund... |
||||
|
Negative
Anti
1 reply
I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data. |
||||
|
Official
2024-07-18
|
GPT-4o mini
GPT
|
222 | 78 |
|
|
Neutral
Neutral
0 replies
[dupe] Some more discussion: https://news.ycombinator.com/item?id=40996248 |
||||
|
Positive
Pro
4 replies
The big news for me here is the 16k output token limit. The models keep increasing the input limit to outrageous amounts, but output has been stuck at 4k.I did a project to summarize complex PDF invoices (not “unstructured” data, but “idiosyncratically structured” data, as eac... |
||||
|
Neutral
Pro
3 replies
Here's something interesting to think about: In ML we do a lot of bootstrapping. If a model is 51% wrong on a binary problem you flip the answer and train a 51% correct model then work your way up from there. Small models are trained from synthetic and live data curated and gen... |
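The flip trick mentioned in that comment works because binary accuracy is symmetric: inverting every answer of a model that is right 49% of the time yields one that is right 51% of the time. A toy sketch on synthetic data (nothing here models a real training pipeline):

```python
import random

random.seed(0)
labels = [random.randint(0, 1) for _ in range(10_000)]
# Toy "model": agrees with the true label only ~40% of the time.
preds = [y if random.random() < 0.4 else 1 - y for y in labels]

acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
if acc < 0.5:                        # below chance on a binary task:
    preds = [1 - p for p in preds]   # flipping every answer gives 1 - acc
    acc = 1 - acc
print(f"accuracy after flipping: {acc:.2f}")
```

The same symmetry is the reason boosting only needs weak learners slightly better than a coin flip.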
||||
|
Negative
Anti
9 replies
GPT-4o mini is $0.15/1M input tokens, $0.60/1M output tokens. In comparison, Claude Haiku is $0.25/1M input tokens, $1.25/1M output tokens. There's no way this price-race-to-the-bottom is sustainable. |
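The price gap in that comment, worked through for a hypothetical monthly workload (the per-million-token prices are the ones quoted; the 10M input / 2M output volumes are invented for illustration):

```python
# Prices per 1M tokens (input, output), as quoted in the comment above.
prices = {"gpt-4o-mini": (0.15, 0.60), "claude-3-haiku": (0.25, 1.25)}
in_tok, out_tok = 10_000_000, 2_000_000   # hypothetical monthly volume

costs = {name: (in_tok / 1e6) * p_in + (out_tok / 1e6) * p_out
         for name, (p_in, p_out) in prices.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}/month")   # $2.70 vs $5.00
```

At this volume the quoted prices put GPT-4o mini at a bit over half the cost of Claude Haiku, which is the gap the commenter is reacting to.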
||||
|
Neutral
Neutral
1 reply
@dang: This post isn't on the 1st or 2nd page of hacker news. Did it trip some automated controversy detection code for too many comments in the first hour?Edit: it says 181 points, 6 hours ago, and eyeballing the 1st page it should be in the top 5 right now. |
||||
|
3rd party
2024-07-19
|
GPT-4o mini
GPT
|
290 | 236 |
|
|
Very negative
Anti
25 replies
So google will eventually be mostly indexing the output of LLMs, and at that point they might as well skip the middleman and generate all search results by themselves, which incidentally, this is how I am using Kagi today - I basically ask questions and get the answers, and I ... |
||||
|
Negative
Anti
3 replies
That's an inflection point, all right. OpenAI's customers can now at least break even.Of course, it means a flood of crap content. |
||||
|
Negative
Anti
4 replies
> For example, putting in 50k page views a month, with a Finance category, gives a potential yearly earnings of $2,000. > I'm going to take the median across all categories, which is an estimated annual revenue of $1,550 for 50,000 monthly page views. > This is approxim... |
||||
|
Negative
Anti
0 replies
This article is weird clickbait which, even weirder, worked.It seems to assume a world where SEO entrepreneurs where ready to churn out million-page sites, but the cost per query were blocking them. There is no marginal cost, no SEO cost to adding another page, as long as a co... |
||||
|
Neutral
Mixed
2 replies
This assumes a future where users are still depending on search engines or some comparative tool. Profiting off the current status quo. I would also be curious how user behavior will evolve to identify, evade, and ignore AI generated content. Some quasi arms race we'll be in f... |
||||
|
Official
2024-07-23
|
Meta-Llama-3.1-405B-Instruct
Llama
|
437 | 269 |
|
|
Neutral
Neutral
0 replies
Related ongoing thread: Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments) |
||||
|
Positive
Pro
4 replies
The 405b model is actually competitive against closed source frontier models.Quick comparison with GPT-4o: +----------------+-------+-------+ | Metric | GPT-4o| Llama | | | | 3.1 | | | | 405B | +----------------+-------+-------+ | MMLU | 88.7 | 88.6 | | GPQA | 53.6 | 51.1 | | ... |
||||
|
Positive
Pro
1 reply
I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboar... |
||||
|
Neutral
Neutral
6 replies
You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.If you want to learn more, there is a writeup at https://wow.groq.com... |
||||
|
Very positive
Pro
3 replies
Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology. Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path... |
||||
|
Official
2024-08-14
|
Grok 2
Grok
|
226 | 333 |
|
|
Mixed
Mixed
8 replies
The technology is impressive; achieving such a level requires a lot of efforts in dataset creation, neural architectural costs, and GPU shepherding.What is the company’s ethical position though? It officially stemmed from Mr Musk’s objection that OpenAI was not open-source, bu... |
||||
|
Neutral
Pro
2 replies
I assume they will have a lot less "safety", i.e. the model will be more likely to actually do what you ask instead of finding a reason why "sorry Dave, I can't do that".Since these "safety" features tend to also degrade the model, that's likely also helping them catch up in t... |
||||
|
Negative
Anti
1 reply
It's hilarious they put Claude 3.5 Sonnet in the far right corner while it scores the highest and beats most of Grok's numbers. |
||||
|
Positive
Pro
1 reply
It uses FLUX.1 to generate images and it has been fun so far. Its good on writing, can generate very realistic photos, can create memes, and looks like hands problem is fixed now. |
||||
|
Positive
Pro
3 replies
You know what’s also impressive besides this beta release? How Claude 3.5 Sonnet is still able to keep up so well. Grok-2 beat every other LLM except Claude. How did Anthropic achieve this? |
||||
|
Repo
2024-08-20
|
Phi-3.5-MoE instruct (128k)
Phi
|
25 | 3 |
|
|
Positive
Pro
0 replies
Mini: https://huggingface.co/microsoft/Phi-3.5-mini-instruct Large MoE with impressive benchmarks: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct |
||||
|
Neutral
Neutral
0 replies
Does anyone have an idea what the output token limit is? I only see mention of the 128k token context window, but I bet the output limit is 4k tokens. |
||||
|
Negative
Anti
0 repliesr
The Phi models always seem to do really well when it comes to benchmarks but then in real world performance they always fall way behind competing models. |
| 3rd party | 2024-09-11 | Pixtral 12B | Pixtral | 163 | 40 |
| Mixed | Mixed | 5 replies | The "Mistral Pixtral multimodal model" really rolls off the tongue. > It’s unclear which image data Mistral might have used to develop Pixtral 12B. The days of free web scraping especially for the richer sources of material are almost gone, with anything between technical (AP... |
| Negative | Anti | 1 reply | Couple notes for newcomers: 1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back. 2. It seems like Pixtral 12B benchmarks significantly below Qwen2-VL-7B [1], so if you want the best local model for unde... |
| Negative | Anti | 2 replies | Mistral being more open than 'openai' is kind of a meme. How can a company call itself open while it refuses to openly distribute it's product and when competitor are actually doing it. |
| Neutral | Neutral | 0 replies | Related earlier: New Mistral AI Weights https://news.ycombinator.com/item?id=41508695 |
| Positive | Pro | 1 reply | I’d love to know how much money Mistral is taking in versus spending. I’m very happy for all these open weights models, but they don’t have Instagram to help pay for it. These models are expensive to build. |
| Official | 2024-09-12 | OpenAI o1-mini | o-series | 30 | 5 |
| Neutral | Neutral | 0 replies | related tweet: https://x.com/OpenAIDevs/status/1834278699388338427 OpenAI o1-preview and o1-mini are rolling out today in the API for developers on tier 5. o1-preview has strong reasoning capabilities and broad world knowledge. o1-mini is faster, 80% cheaper, and competitive with... |
| Neutral | Neutral | 0 replies | Related: OpenAI O1 Model https://news.ycombinator.com/item?id=41523070 |
| Neutral | Neutral | 2 replies | No GPT in the name, so maybe not transformer based? |
| Official | 2024-09-25 | Llama-3.2-11B-Vision-Instruct | Llama | 924 | 328 |
| Very positive | Pro | 8 replies | I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version). I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job... |
| Very positive | Pro | 11 replies | I'm blown away with just how open the Llama team at Meta is. It is nice to see that they are not only giving access to the models, but they at the same time are open about how they built them. I don't know how the future is going to go in the terms of models, but I sure am gra... |
| Positive | Pro | 6 replies | "The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?) With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB! I believe this unforgiving dynamic is why model distillation works w... |
| Positive | Pro | 3 replies | Llama3.2 3B feels a lot better than other models with same size (e.g. Gemma2, Phi3.5-mini models). For anyone looking for a simple way to test Llama3.2 3B locally with UI, Install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in terminal: nexa run llama3.2 --streamlit Dis... |
| Neutral | Neutral | 3 replies | If anyone else is looking for the bigger models on ollama and wondering where they are, the Ollama blog post answered that for me. The are "coming soon" so they just aren't ready quite yet[1]. I was a little worried when I couldn't find them but sounds like we just need to be ... |
| Official | 2024-10-16 | Ministral 3B (latest) | Ministral | 216 | 99 |
| Negative | Anti | 8 replies | 3b is API-only so you won’t be able to run it on-device, which is the killer app for these smaller edge models. I’m not opposed to licensing but “email us for a license” is a bad sign for indie developers, in my experience. 8b weights are here https://huggingface.co/mistralai... |
| Negative | Anti | 2 replies | This press release is a big change in branding and ethos for Mistral. What was originally a vibey, insurgent contender that put out magnet links is now a PR-crafting team that has to fight to pitch their utility to the public. |
| Neutral | Neutral | 3 replies | Has anyone put together a good and regularly updated decision tree for what model to use in different circumstances (VRAM limitations, relative strengths, licensing, etc.)? Given the enormous zoo of models in circulation, there must be certain models that are totally obsolete. |
| Negative | Anti | 2 replies | They didn't add a comparison to Qwen 2.5 3b, which seems to surpass Ministral 3b MMLU, HumanEval, GSM8K: https://qwen2.org/qwen2-5/#qwen25-05b15b3b-performance These benchmarks don't really matter that much, but it is funny how this blog post conveniently forgot to compare with... |
| Neutral | Neutral | 4 replies | For anybody wondering about the title, that's a sort-of pun in French about how words get pluralized following French rules. The quintessential example is "cheval" (horse) which becomes "chevaux" (horses), which is the rule they're following (or being cute about). Un mistral, d... |
| Official | 2024-11-11 | Qwen2.5-Coder 32B Instruct | Qwen | 23 | 0 | — |
| Official | 2024-11-27 | QwQ 32B | Qwen | 438 | 421 |
| Very positive | Pro | 3 replies | This one is crazy. I made up a silly topology problem which I guessed wouldn't be in a textbook (given X create a shape with Euler characteristic X) and set it to work. Its first effort was a program that randomly generated shapes, calculated X and hoped it was right. I went a... |
| Positive | Pro | 5 replies | This one is pretty impressive. I'm running it on my Mac via Ollama - only a 20GB download, tokens spit out pretty fast and my initial prompts have shown some good results. Notes here: https://simonwillison.net/2024/Nov/27/qwq/ |
| Positive | Pro | 1 reply | QwQ can solve a reverse engineering problem [0] in one go that only o1-preview and o1-mini have been able to solve in my tests so far. Impressive, especially since the reasoning isn't hidden as it is with o1-preview. [0] https://news.ycombinator.com/item?id=41524263 |
| Positive | Pro | 2 replies | 32B is a good choice of size, as it allows running on a 24GB consumer card at ~4 bpw (RTX 3090/4090) while using most of the VRAM. Unlike llama 3.1, which had 8b, 70B (much too big to fit), and 405B. |
| Negative | Mixed | 7 replies | I asked the classic 'How many of the letter “r” are there in strawberry?' and I got an almost never ending stream of second guesses. The correct answer was ultimately provided but I burned probably 100x more clockcycles than needed. See the response here: https://pastecode.io/s... |
| Repo | 2024-12-06 | Llama-3.3-70B-Instruct | Llama | 425 | 219 |
| Very positive | Pro | 3 replies | Benchmarks - https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment... Seems to perform on par with or slightly better than Llama 3.2 405B, which is crazy impressive. Edit: According to Zuck (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last release in the Llama 3... |
| Neutral | Pro | 8 replies | This reminds me of Steve Jobs's famous comment to Dropbox about storage being 'a feature, not a product.' Zuckerberg - by open-sourcing these powerful models, he's effectively commoditising AI while Meta's real business model remains centred around their social platforms. They... |
| Positive | Pro | 4 replies | Seems to be more or less on par with GPT-4o across many benchmarks: https://x.com/Ahmad_Al_Dahle/status/1865071436630778109 |
| Positive | Pro | 1 reply | Does unexpectedly well on our benchmark: https://help.kagi.com/kagi/ai/llm-benchmark.html Will dive into it more, but this is impressive. |
| Neutral | Neutral | 3 replies | Please help me understand something. I've been out of the loop with HuggingFace models. What can you do with these models? 1. Can you download them and run them on your Laptop via JupyterLab? 2. What benefits does that get you? 3. Can you update them regularly (with new data on the... |
| Official | 2024-12-11 | Gemini 2.0 Flash | Gemini | 1,015 | 490 |
| Very positive | Pro | 10 replies | https://aistudio.google.com/live is by far the coolest thing here. You can just go there and share your screen or camera and have a running live voice conversation with Gemini about anything you're looking at. As much as you want, for free. I just tried having it teach me how t... |
| Positive | Pro | 5 replies | I released a new llm-gemini plugin with support for the Gemini 2.0 Flash model, here's how to use that in the terminal: llm install -U llm-gemini llm -m gemini-2.0-flash-exp 'prompt goes here' LLM installation: https://llm.datasette.io/en/stable/setup.html Worth noting that the... |
| Positive | Pro | 12 replies | Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction. But, once they do get moving in the right direction the can achieve things that smaller companies can't. Google has an insane amount of talent in this sp... |
| Positive | Pro | 3 replies | Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability. They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about ... |
| Positive | Pro | 4 replies | Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish. |
| Official | 2024-12-13 | Phi-4 | Phi | 439 | 143 |
| Neutral | Neutral | 12 replies | The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905 Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on... |
| Mixed | Anti | 2 replies | For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend. |
| Positive | Pro | 7 replies | Looks like it punches way above its weight(s). How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro? |
| Neutral | Neutral | 1 reply | Looks like someone converted it for Ollama use already: https://ollama.com/vanilj/Phi-4 |
| Negative | Anti | 0 replies | I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model. I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than ... |
| Official | 2025-01-13 | Codestral 25.01 | Codestral | 21 | 1 |
| Mixed | Mixed | 0 replies | 256k context window and FIM support sounds good. I can't see it mentioned on the release page how big the model is, they say sub-100B but they are comparing it to 22B and the 22B seems to hold up quite well against it. If we're talking 80B vs 22B then it seems like quite a hea... |
| Repo | 2025-01-20 | DeepSeek R1 | DeepSeek | 1,843 | 663 |
| Positive | Pro | 23 replies | OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-... The one I'm running is the 8.54GB file. I'm using Ollama like this: ollama run hf.co/unsloth/DeepSeek-... |
| Positive | Pro | 22 replies | Disclaimer: I am very well aware this is not a valid test or indicative or anything else. I just thought it was hilarious. When I asked the normal "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that its (2). It c... |
| Positive | Pro | 4 replies | > However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. We've been running ... |
| Mixed | Mixed | 6 replies | Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, Ne... |
| Positive | Pro | 6 replies | The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI. Their model crushes it on closed-system tasks... |
| Paper | 2025-01-25 | DeepSeek R1 | DeepSeek | 1,351 | 1,056 |
| Positive | Pro | 10 replies | we've been tracking the deepseek threads extensively in LS. related reads: - i consider the deepseek v3 paper required preread https://github.com/deepseek-ai/DeepSeek-V3 - R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo https://aider.chat/2025/01/24/r1-sonnet... |
| Positive | Pro | 6 replies | I've been using https://chat.deepseek.com/ over My ChatGPT Pro subscription because being able to read the thinking in the way they present it is just much much easier to "debug" - also I can see when it's bending it's reply to something, often softening it or pandering to me ... |
| Neutral | Neutral | 10 replies | DeepSeek-R1 has apparently caused quite a shock wave in SV ... https://venturebeat.com/ai/why-everyone-in-ai-is-freaking-ou... |
| Very positive | Pro | 3 replies | DeepSeek V3 came in the perfect time, precisely when Claude Sonnet turned into crap and barely allows me to complete something without me hitting some unexpected constraints. Idk, what their plans is and if their strategy is to undercut the competitors but for me, this is a hug... |
| Positive | Neutral | 4 replies | Over 100 authors on arxiv and published under the team name, that's how you recognize everyone and build comradery. I bet morale is high over there |
| Official | 2025-01-26 | Qwen2.5 14B Instruct | Qwen | 307 | 108 |
| Neutral | Neutral | 0 replies | Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/ (via https://news.ycombinator.com/item?id=42832838, but we merged that thread hither) |
| Negative | Anti | 13 replies | In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, et... |
| Neutral | Neutral | 4 replies | Ollama has a num_ctx parameter that controls the context window length - it defaults to 2048. At a guess you will need to set that. |
| Neutral | Neutral | 1 reply | Here are tips for running it on macOS using MLX: https://twitter.com/awnihannun/status/1883611098081099914 - using https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-... |
| Neutral | Neutral | 3 replies | What's the SOTA for memory-centric computing? I feel like maybe we need a new paradigm or something to bring the price of AI memory down. Maybe they can take some of those hundreds of billions and invest in new approaches. Because racks of H100s are not sustainable. But it's cle... |
| Official | 2025-01-30 | Mistral Small 3 | Mistral | 620 | 194 |
| Positive | Pro | 8 replies | I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B. I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. Th... |
| Neutral | Pro | 6 replies | Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache: We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-lic... |
| Neutral | Neutral | 1 reply | Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to: * run LLM evaluations systematically and at scale * share the data with the public in a rigorous and transparent way We use the UK governmen... |
| Negative | Anti | 0 replies | Not so subtle in function calling example[1] "role": "assistant", "content": "---\n\nOpenAI is a FOR-profit company.", [1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-... |
| Neutral | Pro | 0 replies | So the point of this release is 1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version) 2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100) 3) Performance, according to mistral, somewhere between Qwen 2... |
| Official | 2025-01-31 | o3-mini | o-series | 962 | 904 |
| Neutral | Neutral | 15 replies | I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae... For 18,936 input, 2,905 output it cost 3.3612 cents. Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes... |
| Neutral | Neutral | 2 replies | I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21 Example usage: llm -m o3-mini 'write a poem about a pirate and a walrus' \ -o reasoning_effort high Output (com... |
| Positive | Pro | 4 replies | For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet. 62% $186 o1 high 60% $18 o3-mini high 57% $5 DeepSe... |
| Positive | Pro | 14 replies | For years I've been asking all the models this mixed up version of the classic riddle and they 99% of the time get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still co... |
| Neutral | Neutral | 14 replies | So far, it seems like this is the hierarchy: o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini. o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf |
| 3rd party | 2025-02-05 | Gemini 2.0 Flash | Gemini | 1,303 | 447 |
| Positive | Pro | 33 replies | I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate that multi-modal, large... |
| Negative | Anti | 17 replies | This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result. You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images. You... |
| Negative | Neutral | 6 replies | Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data. |
| Very negative | Anti | 5 replies | We are driving full speed into a xerox 2.0 moment and this time we are doing so knowingly. At least with xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I won... |
| Positive | Mixed | 6 replies | (disclaimer I am CEO of llamaindex, which includes LlamaParse) Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so we... |
| 3rd party | 2025-02-20 | Grok 3 | Grok | 132 | 211 |
| Negative | Anti | 7 replies | This article is weak and just general speculation. Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this: > Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all ot... |
| Negative | Anti | 2 replies | The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim out that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok ... |
| Negative | Anti | 10 replies | Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute. If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.... |
| Negative | Anti | 2 replies | I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good. Another nit-pick: I don't think DeepSeek had 50k Hopper GPUs. Mayb... |
| Negative | Anti | 0 replies | The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is: > Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only... |
| Official | 2025-02-24 | Claude Sonnet 3.7 | Claude | 2,127 | 963 |
| Positive | Pro | 18 replies | Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING. Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5. Aider 0.75.0 is out with support for 3.7 Sonnet [1]. Thinking supp... |
| Neutral | Neutral | 82 replies | Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product. |
| Positive | Pro | 8 replies | Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7. https://help.kagi.com/kagi/ai/llm-benchmark.html Appears to be second most capable general purpose LLM we tried (second to gemini 2.0 pro, in front of gpt-4o). Less impressive in thinking mode, abo... |
| Positive | Pro | 91 replies | You can get your HN profile analyzed by it and it's pretty funny :) https://hn-wrapped.kadoa.com/ I'm using this to test the humor of new models. |
| Positive | Pro | 2 replies | I'm somewhat impressed from the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a CloudFlare pages function would return 500 + nonsensical error and an empty response in prod. Tried to figure this out all Friday. It was... |
| Official | 2025-02-26 | Phi-4-multimodal | Phi | 48 | 3 |
| Positive | Pro | 0 replies | Great on-device additions to the larger Phi-4 14B released Dec/2024. https://lifearchitect.ai/models-table/ |
| Positive | Pro | 1 reply | That looks really good. A 3.8B model holding it's own against pretty 7B class models is definitely an accomplishment. Sizing looks like it is aimed primarily at copilot style NPUs I guess |
| Official | 2025-02-27 | GPT-4.5 | GPT | 1,136 | 988 |
| Negative | Anti | 48 replies | GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens It sounds like it's so expensive and t... |
| Negative | Anti | 12 replies | Is it official then? Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led to non-tech voices to suggest likewise. But w... |
| Neutral | Neutral | 10 replies | I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments): hn-summary.sh 43197872 -m gpt-4.5-preview Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes... Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c... |
| Negative | Anti | 9 replies | Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to s... |
| Mixed | Mixed | 15 replies | First impression of GPT-4.5: 1. It is very very slow, for some applications where you want real time interactions is just not viable, the text attached below took 7s to generate with 4o, but 46s with GPT4.5 2. The style it writes is way better: it keeps the tone you ask and make... |
| Official | 2025-03-05 | QwQ 32B | Qwen | 480 | 169 |
| Mixed | Mixed | 10 replies | Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply. EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't beli... |
| Mixed | Pro | 6 replies | Chinese strategy is open-source software part and earn on robotics part. And, They are already ahead of everyone in that game. These things are pretty interesting as they are developing. What US will do to retain its power? BTW I am Indian and we are not even in the race as coun... |
| Positive | Pro | 4 replies | I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too. 20x smaller than Deep Seek! How small can these go? What kind of hardware can run this? |
| Neutral | Neutral | 5 replies | To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ. |
| Mixed | Mixed | 2 replies | It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported - it sometimes loses the plot after producing a lot of reasoning tokens. Overall though quite impressive i... |
| Official | 2025-03-06 | Mistral OCR | Mistral | 1,756 | 417 |
| Mixed | Mixed | 9 replies | I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker . Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41. Marker can inference between 20 and 120 pages per second on an H100. You can see the samples here - https://huggingf... |
| Negative | Anti | 7 replies | It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image: https://i.imgur.com/jcwW5AG.jpeg For the blocks in the center, it outputs: > Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 aoû... |
| Very positive | Pro | 2 replies | This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer. Specifically, this allows y... |
| Positive | Pro | 3 replies | I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF. |
| Mixed | Mixed | 2 replies | We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini A high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and... |
| Paper | 2025-03-12 | Gemma 3 27B | Gemma | 488 | 245 |
| Neutral | Neutral | 6 replies | Gemma 3 is out! Multimodal (image + text), 128K context, supports 140+ languages, and comes in 1B, 4B, 12B, and 27B sizes with open weights & commercial use. Gemma 3 model overview: https://ai.google.dev/gemma/docs/core Huggingface collection: https://huggingface.co/collecti... |
| Neutral | Neutral | 11 replies | Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP. (Opinions our own and not of Google DeepMind.) PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957 |
| Positive | Pro | 3 replies | Lots to be excited about here - in particular new architecture that allows subquadratic scaling of memory needs for long context; looks like 128k+ context is officially now available on a local model. The charts make it look like if you have the RAM the model is pretty good ou... |
| Neutral | Neutral | 0 replies | Linking to the announcement (which links to his PDF) would probably be more useful. Introducing Gemma 3: The most capable model you can run on a single GPU or TPU https://blog.google/technology/developers/gemma-3/ |
| Mixed | Mixed | 3 replies | Very cool open release. Impressive that a 27b model can be as good as the much bigger state of the art models (according to their table of Chatbot Arena, tied with O1-preview and above Sonnet 3.7). But the example image shows that this model still makes dumb errors or has a poo... |
| Official | 2025-03-12 | Gemini 2.0 Flash | Gemini | 101 | 12 |
| Negative | Anti | 4 replies | It just tells me 'content not permitted'. For context, I was attempting to put a cup of hot chocolate into the hands of an anime character. |
| Positive | Pro | 1 reply | Ever since OpenAI showed (but did not release) this type of multimodal output with 4o, I have been waiting for this to be available to the general public. It seems like really combining visuals at the level of generation capability means language understanding is fully grounded... |
| Negative | Anti | 0 replies | Curious that they use "Use Gemini 2.0 Flash to tell a story and it will illustrate it with pictures, keeping the characters and settings consistent throughout", and their example is not consistent at all between two pictures. |
| Negative | Anti | 0 replies | I was really hoping that there would be more character consistency, given the fact they mention it in the blog. It also doesn't seem to reliably follow styles like "watercolor illustration" or "line and wash". |
| Official | 2025-03-14 | Cohere Command A | Command | 77 | 19 |
| Negative | Anti | 1 reply | I distrust those benchmarks after working with sonnet for half a year now. Many OpenAI models beat Sonnet on paper. This seems to be the case because it's strength (agent, visual, caching) aren't being used, I guess? Otherwise there is no explanation why it's not constantly on... |
| Negative | Anti | 2 replies | I just tried the chat and asked the LLM to compute the double integral of 6*y on the interior of a triangle given the vertices. There were many trials all incorrect, then I asked to compute a python program to solve this, again incorrect. I know math computation is a weak poin... |
| Very negative | Anti | 1 reply | Funny how AI companies love training competitors to human labor on human output but then write in their terms that you’re not supposed to train competing bots on their bot output. Explicitly anticompetitive hypocrisy, and millions of suckers pay for it, how sad |
| Negative | Anti | 0 replies | what disqualified this model for me (I mostly use llms for codding) was 12% score in aider benchmark (https://aider.chat/docs/leaderboards/) |
| Positive | Pro | 0 replies | "Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency." Visible above the fold. Thanks for getting to the point. |
||||
|
o1-pro (o-series) | Official | 2025-03-19 | score 131 | 129 comments

[Mixed / Mixed / 5 replies]
Pricing: $150 / 1M input tokens, $600 / 1M output tokens. (Not a typo.) Very expensive, but I've been using it with my ChatGPT Pro subscription and it's remarkably capable. I'll give it 100,000 token codebases and it'll find nuanced bugs I completely overlooked. (Now I almost fe...

[Neutral / Neutral / 4 replies]
This is their first model to only be available via the new Responses API - if you have code that uses Chat Completions you'll need to upgrade to Responses in order to support this. Could take me a while to add support for it to my LLM tool: https://github.com/simonw/llm/issues/839

[Neutral / Neutral / 5 replies]
It cost me 94 cents to render a pelican riding a bicycle SVG with this one! Notes and SVG output here: https://simonwillison.net/2025/Mar/19/o1-pro/

[Neutral / Pro / 3 replies]
Assuming a highly motivated office worker spends 6 hours per day listening or speaking, at a salary of $160k per year, that works out to a cost of ≈$10k per 1M tokens. OpenAI is now within an order of magnitude of a highly skilled humans with their frontier model pricing. o3 pr...
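The ≈$10k-per-1M-tokens figure in the comment above can be reproduced under a few stated assumptions; the speaking pace, tokens-per-word ratio, and working days below are illustrative guesses, not numbers from the post:

```python
# Back-of-envelope: cost per million "output tokens" of human speech.
# All four constants are assumptions for illustration, not data from the post.
WORDS_PER_MIN = 150        # assumed conversational speaking pace
TOKENS_PER_WORD = 1.33     # rough English tokens-per-word ratio
HOURS_PER_DAY = 6          # per the comment
WORK_DAYS = 250            # assumed working days per year
SALARY = 160_000           # per the comment

tokens_per_year = WORDS_PER_MIN * 60 * HOURS_PER_DAY * WORK_DAYS * TOKENS_PER_WORD
cost_per_million = SALARY / tokens_per_year * 1_000_000
print(f"~${cost_per_million:,.0f} per 1M tokens")
```

With those assumptions the estimate lands near $9k, consistent with the commenter's "≈$10k".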
[Negative / Anti / 2 replies]
It has a 2023 knowledge cut-off, and 200k context window... ? That's pretty underwhelming.
Qwen2.5-VL-32B (Qwen) | Official | 2025-03-24 | score 544 | 293 comments

[Neutral / Neutral / 5 replies]
Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/

[Very positive / Pro / 1 reply]
This model is available for MLX now, in various different sizes. I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this:
uv run --with 'numpy<2' --with mlx-vlm \ ...

[Very positive / Pro / 2 replies]
We were using Llama vision 3.2 a few months back and were very frustrated with it (both in term of speed and results quality). Some day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our ...

[Positive / Pro / 9 replies]
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced M...

[Neutral / Neutral / 10 replies]
Silly question: how can OpenAI, Claude and all, have a valuation so large considering all the open source models? Not saying they will disappear or be tiny (closed models), but why so so so valuable?
Gemini 2.5 Pro (Gemini) | Official | 2025-03-25 | score 973 | 485 comments

[Very positive / Pro / 14 replies]
One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it. Imagine for instance you give the LLM the profile of the love interest for your...

[Very positive / Pro / 22 replies]
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand. Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. ...

[Positive / Pro / 4 replies]
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those. Plus it drew me a very decent pelican r...

[Positive / Pro / 3 replies]
Tops our benchmark in an unprecedented way. https://help.kagi.com/kagi/ai/llm-benchmark.html High quality, to the point. Bit on the slow side. Indeed a very strong model. Google is back in the game big time.

[Positive / Pro / 2 replies]
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%. This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
[0] https://aider.c...
DeepSeek V3 (DeepSeek) | Paper | 2025-03-27 | score 132 | 34 comments

[Neutral / Neutral / 3 replies]
The GPU-hours stat here allows us to back out some interesting figures around electricity usage and carbon emissions if we make a few assumptions. 2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU Watt-hours. 975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1....
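The electricity arithmetic in the comment above checks out; a quick sketch (the 350W TDP and the 1.2 non-GPU overhead factor are the commenter's assumptions):

```python
# Reproducing the commenter's electricity estimate for DeepSeek-V3 training.
gpu_hours = 2_788_000
tdp_watts = 350                  # H800 TDP assumed in the comment
gpu_wh = gpu_hours * tdp_watts   # GPU-only watt-hours
system_wh = gpu_wh * 1.2         # 1.2x to account for non-GPU hardware
mwh = system_wh / 1e6
print(f"{gpu_wh:,.0f} GPU Wh, ~{mwh:,.0f} MWh including overhead")
```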
[Negative / Anti / 2 replies]
The fact that you can unironically put the "only" modifier on a training time of 2.8 million GPU hours is nuts.

[Neutral / Neutral / 1 reply]
Re DeepSeek-V3 0324 - I made some 2.7bit dynamic quants (230GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...

[Neutral / Neutral / 0 replies]
Hasn't been updated for the -0324 release unfortunately, and diff-pdf shows only a few small additions (and consequent layout shift) for the updated arxiv version on Feb 18.

[Positive / Pro / 1 reply]
Nice to see a return to open source in models and training systems.
Qwen2.5-Omni 7B (Qwen) | 3rd party | 2025-03-27 | score 29 | 3 comments

[Positive / Pro / 0 replies]
Qwen has been on fire lately. First QwQ now this!

[Negative / Anti / 1 reply]
Dunno if the article author is in this, but there's an issue with the links. Hugging face goes to github, github goes to nowhere.
QVQ Max (Qwen) | Official | 2025-04-03 | score 121 | 26 comments

[Positive / Pro / 1 reply]
I think this is old news, but this model does better than llama 4 maverick on coding.

[Neutral / Neutral / 5 replies]
I wonder why are we getting these drops during the weekend. Is the AI race truly that heated?

[Neutral / Neutral / 1 reply]
The about page doesn’t shed light on the composition of the core team nor their sources of incomes or funding. Am I overlooking something?
> We are a group of people with diverse talents and interests.

[Mixed / Mixed / 0 replies]
I’ve tried QVQ before, one issue I always stumbled upon is that it entered an endless loop of reasoning. It felt wrong letting it run for too long, I didn’t want to over consume Alibabas gpus. I’ll try QVQ-Max when I get home. It’s kinda fascinating how these models can interpr...
Llama 4 Scout 17B 16E Instruct (Llama) | Official | 2025-04-05 | score 1,235 | 658 comments

[Neutral / Neutral / 8 replies]
General overview below, as the pages don't seem to be working well.
Llama 4 Models:
- Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
- They are natively multimodal: text + image input, text-only output.
- Key achie...

[Neutral / Neutral / 36 replies]
"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet." Perhaps. Or, maybe, "leaning left" by th...

[Neutral / Neutral / 5 replies]
Model training observations from both Llama 3 and 4 papers: Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38 - 43% hardware efficiency [Meta, Llama 3]. For Llama 4 training, Meta doubled the compute, using ~...

[Positive / Pro / 11 replies]
The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct...

[Negative / Mixed / 7 replies]
This thread so far (at 310 comments) summarized by Llama 4 Maverick: hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000 Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4... And with Scout I got complete junk output for some r...
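The 38-43% hardware-efficiency figure in the training-observations comment above is just achieved throughput divided by peak; a quick check, assuming the H100's ~989 TFLOPS dense BF16 peak (an assumption, not stated in the comment):

```python
# Model FLOPS utilization (MFU) check for the Llama 3 figures quoted above.
H100_PEAK_BF16_TFLOPS = 989  # assumed dense BF16 peak for an H100 SXM

utils = {tflops: tflops / H100_PEAK_BF16_TFLOPS for tflops in (380, 430)}
for tflops, mfu in utils.items():
    print(f"{tflops} achieved TFLOPS -> {mfu:.0%} of peak")
```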
GPT-4.1 (GPT) | Official | 2025-04-14 | score 680 | 492 comments

[Negative / Neutral / 16 replies]
As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no i...

[Neutral / Neutral / 8 replies]
Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:
            SWE  Aider  Cost  Fast  Fresh
Claude 3.7  70%  65%    $15   77    8/24
Gemini 2.5  64%  69%    $10   200   1/25
GPT-4.1     55%  53%    $8    169   6/24
DeepSeek R1 49%  57%    $...

[Neutral / Neutral / 8 replies]
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:
- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use X...

[Mixed / Mixed / 2 replies]
I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus. My take aways:
- This is the first model from OpenAI that f...

[Neutral / Pro / 4 replies]
From OpenAI's announcement:
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better s...
o3 (o-series) | Official | 2025-04-16 | score 555 | 507 comments

[Very negative / Anti / 13 replies]
Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549... With right knowledge and web searches one can answer this question in a matte...

[Positive / Pro / 5 replies]
Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA ...

[Positive / Pro / 7 replies]
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1] Incredible how resilient Claude models have been for best-in-coding class.
[1] But by only about 1%, and inclusive of C...

[Positive / Pro / 3 replies]
I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model). Using o4-min...

[Negative / Anti / 5 replies]
To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.
GPT-4o mini: The new moon in August 2025 will occur on August 12.
Llama 3.3 70B: The new moon in August...
Gemini 2.5 Flash (Gemini) | Official | 2025-04-17 | score 1,076 | 560 comments

[Very positive / Pro / 20 replies]
Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past. Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's m...

[Neutral / Pro / 3 replies]
An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API. My llm-gemini plugin supports that: https://github.com/simonw/llm-gemini
uv tool install llm
llm install llm-gemini
llm keys set gemini # paste key here
llm -...

[Positive / Pro / 18 replies]
Gemini flash models have the least hype, but in my experience in production have the best bang for the buck and multimodal tooling. Google is silently winning the AI race.

[Positive / Pro / 6 replies]
One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generated 2D bounding boxes of relevant subjects, but it can also create segmentation masks! https://ai.google.dev/gemini-api/docs/image-und...

[Positive / Pro / 3 replies]
For a non programmer like me google is becoming shockingly good. It is giving working code the first time. I was playing around with it asked it to write code to scrape some data of a website to analyse. I was expecting it to write something that would scrape the data and late...
Gemma 3 27B (Gemma) | Official | 2025-04-20 | score 602 | 276 comments

[Very positive / Pro / 11 replies]
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B. I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of mem...

[Very positive / Pro / 1 reply]
I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I...

[Negative / Anti / 3 replies]
First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn'...

[Very positive / Pro / 3 replies]
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac. gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also. I still like Gemini 2.5 Pro ...

[Positive / Pro / 0 replies]
Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of ent...
Qwen3 14B (Qwen) | Official | 2025-04-28 | score 869 | 388 comments

[Negative / Anti / 18 replies]
I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Q...

[Positive / Pro / 4 replies]
They have got pretty good documentation too[1]. And Looks like we have day 1 support for all major inference stacks, plus so many size choices. Quants are also up because they have already worked with many community quant makers. Not even going into performance, need to test fi...

[Mixed / Mixed / 6 replies]
As is now traditional for new LLM releases, I used Qwen 3 (32B, run via Ollama on a Mac) to summarize this Hacker News conversation about itself - run at the point when it hit 112 comments. The results were kind of fascinating, because it appeared to confuse my system prompt te...

[Neutral / Neutral / 12 replies]
Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a ...

[Neutral / Neutral / 15 replies]
With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU? I.e. I have Quadro RTX 4000 with 8G vram and seeing all the models https://ollama.com/search here with all...
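The closing question about what fits an 8 GB card has a rough rule of thumb: weight memory is parameter count times bits per weight divided by 8, plus overhead for KV cache and activations. A sketch (the flat 15% overhead factor is an assumption; real usage varies with context length):

```python
def approx_vram_gb(params_b: float, bits: int, overhead: float = 1.15) -> float:
    """Weights-only memory estimate (GB) with a flat overhead factor for
    KV cache and activations. The 1.15 factor is a rule-of-thumb assumption."""
    return params_b * bits / 8 * overhead

for params_b, bits in [(14, 4), (8, 4), (4, 8)]:
    print(f"{params_b}B at {bits}-bit: ~{approx_vram_gb(params_b, bits):.1f} GB")
```

By this estimate a 14B model at 4-bit just overflows an 8 GB card, while an 8B model at 4-bit fits comfortably.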
Phi-4-reasoning (Phi) | Official | 2025-05-01 | score 131 | 26 comments

[Neutral / Neutral / 1 reply]
We uploaded GGUFs for anyone who wants to run them locally. [EDIT] - I fixed all chat templates so no need for --jinja as at 10:00PM SF time.
Phi-4-mini-reasoning GGUF: https://huggingface.co/unsloth/Phi-4-mini-reasoning-GGUF
Phi-4-reasoning-plus-GGUF: https://huggingface.co/unsl...

[Very positive / Pro / 2 replies]
These look quite incredible. I work on a llama.cpp GUI wrapper and its quite surprising to see how well Microsoft's Phi-4 releases set it apart as the only competition below ~7B, it'll probably take a year for the FOSS community to implement and digest it completely (it can do...

[Neutral / Neutral / 0 replies]
Sorry if this comment is outdated or ill-informed, but it is hard to follow the current news. Do the Phi models still have issues with training on the test set, or have they fixed that?

[Mixed / Mixed / 0 replies]
Is anyone here using phi-4 multimodal for image-to-text tasks? The phi models often punch above their weight, and I got curious about the vision models after reading https://unsloth.ai/blog/phi4 stories of finetuning. Since lmarena.ai only has the phi-4 text model, I've tried "ph...

[Mixed / Mixed / 1 reply]
The example prompt for reasoning model that never fails to amuse me: "How amy letter 'r's in the word 'strrawberrry'?"
Phi-4-mini-reasoning: thought for 2 min 3 sec
<think> Okay, let's see here. The user wants to know how many times the letter 'r' appears in the word 'strr...
Mistral Medium 3 (Mistral) | Official | 2025-05-07 | score 70 | 20 comments

[Negative / Anti / 1 reply]
It's not cheaper than Deepseek V3.1, though, and Deepseek outperforms on nearly everything. And only between 1-3x the throughput based on the openrouter metrics (near equivalent throughput if you use an FP8 quant). Wish I could be a little more excited about this one.

[Neutral / Neutral / 1 reply]
https://lifearchitect.ai/models-table/

[Neutral / Neutral / 2 replies]
I guess this one (Mistral Medium 3) won't be open?

[Positive / Pro / 0 replies]
The medium model is crazy fast. Reminds me of maverick on groq (except better according to their own testing).

[Neutral / Neutral / 0 replies]
The only hint at its size is that it requires "self-hosted environments of four GPUs and above".
Gemma 3n 4B (Gemma) | Official | 2025-05-20 | score 441 | 163 comments

[Positive / Pro / 12 replies]
You can try it on Android right now:
Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in ...

[Neutral / Neutral / 3 replies]
Probably a better link: https://developers.googleblog.com/en/introducing-gemma-3n/
Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model. At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot ...

[Mixed / Mixed / 2 replies]
According to the readme here - https://huggingface.co/google/gemma-3n-E4B-it-litert-preview
E4B has a score of 44.4 in the Aider polyglot dashboard. Which means its on-par with gemini-2.5-flash (not the latest preview but the version used for the bench on aider's website), gpt4...

[Positive / Pro / 1 reply]
On Hugging face I see 4B and 2B versions now - https://huggingface.co/collections/google/gemma-3n-preview-6...
Gemma 3n Preview
google/gemma-3n-E4B-it-litert-preview
google/gemma-3n-E2B-it-litert-preview
Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and or MoE model...

[Mixed / Pro / 0 replies]
It seems to work quite well on my phone. One funny side effect I've found is that it's much easier to bypass the censorship in these smaller models than in the larger ones, and with the complexity of the E4B variant I wouldn't have expected the "roleplay as my father who is ex...
Devstral (Devstral) | Official | 2025-05-21 | score 701 | 148 comments

[Positive / Pro / 4 replies]
The first number I look at these days is the file size via Ollama, which for this model is 14GB https://ollama.com/library/devstral/tags
I find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters bec...

[Very positive / Pro / 3 replies]
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run t...

[Positive / Pro / 1 reply]
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company. Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, it's really good. Mistral Small is really good. Also building a startup with Mis...

[Positive / Pro / 1 reply]
It's very nice that it has the Apache 2.0 license, i.e. well understood license, instead of some "open weight" license with a lot of conditions.

[Mixed / Mixed / 1 reply]
*For people without a 24GB RAM video card, I've got an 8GB RAM one running this model performs OK for simple tasks on ollama but you'd probably want to pay for an API for anything using a large context window that is time sensitive:*
total duration: 35.016288581s
load duration:...
Claude 4 (Claude) | Official | 2025-05-22 | score 2,013 | 1,170 comments

[Neutral / Pro / 10 replies]
An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025) https://docs.anthropic.com/en/docs/about-claude/models/overv...

[Positive / Pro / 7 replies]
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.” Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handl...

[Negative / Anti / 8 replies]
> Users requiring raw chains of thought for advanced prompt engineering can contact sales
So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly ref...

[Negative / Anti / 13 replies]
I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases "feature" are more or less just gimmicks.

[Mixed / Anti / 1 reply]
Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing ...
DeepSeek R1 0528 (DeepSeek) | Repo | 2025-05-28 | score 451 | 250 comments

[Neutral / Neutral / 4 replies]
Well that didn't take long, available from 7 providers through openrouter. https://openrouter.ai/deepseek/deepseek-r1-0528/providers
May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B pa...

[Negative / Mixed / 4 replies]
No information to be found about it. Hopefully we get benchmarks soon. Reminds me of the days when Mistral would just tweet a torrent magnet link

[Positive / Pro / 7 replies]
I love how Deepseek just casually drops new updates (that deliver big improvements) without fanfare.

[Neutral / Neutral / 13 replies]
Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?

[Neutral / Neutral / 6 replies]
What use cases are people using local LLMs for? Have you created any practical tools that actually increase your efficiency? I've been experimenting a bit but find it hard to get inspiration for useful applications
Magistral Small (Magistral) | Official | 2025-06-10 | score 941 | 424 comments

[Neutral / Neutral / 7 replies]
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF
ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1...

[Negative / Anti / 15 replies]
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is...

[Very negative / Anti / 1 reply]
Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600 page pdfs to OCR. They were all monospace text just for some reason the OCR was missing. I tried it, 80% of the "text" was recognised as images and output as whitespace so most of...

[Positive / Pro / 2 replies]
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in ...

[Neutral / Neutral / 2 replies]
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/
o3-pro (o-series) | Official | 2025-06-10 | score 289 | 199 comments

[Mixed / Mixed / 6 replies]
I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude 3.5 - Claude 4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing sl...

[Neutral / Neutral / 5 replies]
The guys in the other thread who said that OpenAI might have quantized o3 and that's how they reduced the price might be right. This o3-pro might be the actual o3-preview from the beginning and the o3 might be just a quantized version. I wish someone benchmarks all of these mo...

[Neutral / Anti / 4 replies]
The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?

[Negative / Anti / 4 replies]
I am still not willing to upgrade to a Pro account. I pay $20 a month for both Gemini and ChatGPT, and for what I need this is currently enough. I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on...

[Positive / Pro / 2 replies]
here's a nice user review we published: https://www.latent.space/p/o3-pro
sama's highlight[0]:
> "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future."
I kept nu...
MiniMax-M1 (MiniMax) | Repo | 2025-06-18 | score 349 | 75 comments

[Positive / Pro / 2 replies]
1. this is apparently MiniMax's "launch week" - they did M1 on Monday and Hailuo 2 on Tuesday (https://news.smol.ai/issues/25-06-16-chinese-models). remains to be seen if they can keep up the pace of model releases for the rest of this week - these 2 were big ones, they aren't...

[Neutral / Anti / 3 replies]
In case you're wondering what it takes to run it, the answer is 8x H200 141GB [1] which costs $250k [2].
1. https://github.com/MiniMax-AI/MiniMax-M1/issues/2#issuecomme...
2. https://www.ebay.com/itm/335830302628

[Positive / Pro / 0 replies]
"We publicly release MiniMax-M1 at this https url" in the arxiv paper, and it isn't a link to an empty repo! I like these people already.

[Positive / Pro / 4 replies]
A few thoughts:
* A Singapore based company, according to LinkedIn. There doesn't seem to be much of a barrier to entry to building a very good LLM.
* Open weight models + the development of Strix Halo / Ryzen AI Max makes me optimistic that running great LLMs locally will be re...

[Neutral / Neutral / 2 replies]
This is stated nowhere on the official pages, but it's a Chinese company. https://en.wikipedia.org/wiki/MiniMax_(company)
Mistral Small 3.2 (Mistral) | Repo | 2025-06-20 | score 23 | 0 comments
Grok 4 (Grok) | 3rd party | 2025-07-10 | score 437 | 604 comments

[Positive / Pro / 5 replies]
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2. Specialized coding model coming "in a few weeks". I notice they didn't talk ab...

[Positive / Pro / 9 replies]
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent d...

[Very positive / Pro / 3 replies]
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc...

[Neutral / Neutral / 0 replies]
"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."
"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."
https://x.com/arcprize/status/1943168950763950555

[Negative / Anti / 16 replies]
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have. I can already use Gemini 2.5 Pro for free in AI s...
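The Grok Heavy trick described above (parallel agents whose results are compared at the end) resembles self-consistency sampling. A minimal sketch, where `ask_model` is a hypothetical stand-in for a real LLM API call and majority voting stands in for whatever reconciliation step xAI actually uses:

```python
# Sketch of "run N agents in parallel, then reconcile" under stated assumptions:
# ask_model is a placeholder, and reconciliation here is simple majority voting.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in: a real implementation would call an LLM API,
    # varying temperature or seed per agent to get diverse answers.
    return ["42", "42", "41"][seed % 3]

def heavy_answer(prompt: str, n_agents: int = 3) -> str:
    # Fan out the same prompt to several agents concurrently.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: ask_model(prompt, s), range(n_agents)))
    # Reconcile by keeping the most common answer across agents.
    return Counter(answers).most_common(1)[0][0]

print(heavy_answer("What is 6 * 7?"))
```

As the commenter notes, this trades cost and latency (N model calls per query) for reliability.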
3rd party
2025-07-10
|
Grok 4
Grok
|
328 | 253 |
|
|
Negative
Anti
5 repliesr
Here's something far more interesting about Grok 4: if you ask for its opinion on controversial subjects it sometimes runs a search on X for tweets "from:elonmusk" before it answers! https://simonwillison.net/2025/Jul/11/grok-musk/ |
||||
|
Negative
Mixed
5 repliesr
[edit to focus on pricing, leaving praise of Simon's post out despite being deserved]Simon claims, 'Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4.' This ignores the real price which skyroc... |
||||
|
Neutral
Anti
11 repliesr
Claude Code converted me from paying $0 for LLMs to $200 per month. Any co that wants a chance at getting that $200 ($300 is fine too) from me needs a Claude Code equivalent and a model where the equivalent's tools were part of its RL environment. I don't think I can go back t... |
||||
|
Negative
Anti
3 repliesr
Is it time for a new benchmark of "how easy is it to turn this AI into a 4chan poster", maybe it is since this seems to be an axis that Elon seems to want to distinguish his AI offering from everyone else's along. |
[Neutral · Neutral · 4 replies]
> My best guess is that these lines in the prompt were the root of the problem:The second line was recently removed, per the GitHub: https://github.com/xai-org/grok-prompts/commit/c5de4a14feb50... |
3rd party · 2025-07-11 · Moonshot Kimi K2 Instruct (Kimi) · score 348 · 179 comments

[Positive · Pro · 7 replies]
I tried Kimi on a few coding problems that Claude was spinning on. It’s good. It’s huge, way too big to be a “local” model — I think you need something like 16 H200s to run it - but it has a slightly different vibe than some of the other models. I liked it. It would definitely... |
[Neutral · Neutral · 6 replies]
Pelican on a bicycle result: https://simonwillison.net/2025/Jul/11/kimi-k2/ |
[Positive · Pro · 3 replies]
This is a very impressive general purpose LLM (GPT 4o, DeepSeek-V3 family). It’s also open source.I think it hasn’t received much attention because the frontier shifted to reasoning and multi-modal AI models. In accuracy benchmarks, all the top models are reasoning ones:https:... |
[Positive · Pro · 1 reply]
Technical strengths aside, I’ve been impressed with how non-robotic Kimi K2 is. Its personality is closer to Anthropic’s best: pleasant, sharp, and eloquent. A small victory over botslop prose. |
[Neutral · Neutral · 1 reply]
Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct model weights are 958.52 GB |
Official · 2025-07-22 · Qwen3 Coder Flash (Qwen) · score 765 · 366 comments

[Neutral · Neutral · 13 replies]
I'm currently making 2bit to 8bit GGUFs for local deployment! Will be up in an hour or so at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc...Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder |
[Positive · Pro · 4 replies]
> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant firstI'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But ... |
[Neutral · Neutral · 8 replies]
The "qwen-code" app seems to be a gemini-cli fork.https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSEI hope these OSS CC clones converge at some point.Actually it is mentioned in the page: we’re also open-sourcing a command-line tool for a... |
[Neutral · Neutral · 14 replies]
At my work, here is a typical breakdown of time spent by work areas for a software engineer. Which of these areas can be sped up by using agentic coding?05%: Making code changes10%: Running build pipelines20%: Learning about changed process and people via zoom calls, teams cha... |
[Very positive · Pro · 5 replies]
I've been using it all day, it rips. Had to bump up toolcalling limit in cline to 100 and it just went through the app no issues, got the mobile app built, fixed throug hthe linter errors... wasn't even hosting it with the toolcall template on with the vllm nightly, just stock... |
Official · 2025-07-28 · GLM-4.5 (GLM) · score 247 · 134 comments

[Negative · Anti · 15 replies]
I typed "Hello" in their chat [1] and it replied back with "Hello! I'm Claude, an AI assistant created by Anthropic. How can I help you today?"Hmmm....[1] https://chat.z.ai/ |
[Neutral · Neutral · 3 replies]
We have x.ai and z.ai, but why?(Sorry... dad joke.) |
[Positive · Pro · 0 replies]
I tested the model with Claude Code and my experience was it was at least as good as Sonnet 3.5, perhaps they designed it that way since they benchmarked with it.Hard to test more thoroughly on more complex problems since the API is being hammered, but I could get it to consis... |
[Negative · Anti · 2 replies]
Tried it with a few prompts. The vibes feel super weird to me, in a way that I don't know how to describe. I'm not sure if it's just that I'm so used to Gemini 2.5 Pro and the other US models. Subjectively, it doesn't feel very smart.I asked it to analyze a recent painting I m... |
[Neutral · Neutral · 0 replies]
I just added it to Gnod Search for easy comparison with the other public LLMs out there:https://www.gnod.com/search/aiInteresting that they claim it outperforms O3, Grok-4 and Gemini-2.5-Pro for coding. I will test that over the coming days on my coding tasks. |
3rd party · 2025-07-29 · GLM-4.5-Air (GLM) · score 577 · 396 comments

[Very positive · Pro · 4 replies]
> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high qualit... |
[Positive · Pro · 2 replies]
> still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.I believe we are vastly underestimating what our existing hardware is c... |
[Negative · Anti · 3 replies]
Did you understand the implementation or just that it produced a result?I would hope an LLM could spit out a cobbled form of answer to a common interview question.Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why... |
[Negative · Anti · 6 replies]
Most likely its training data included countless Space Invaders in various programming languages. |
[Mixed · Mixed · 4 replies]
I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.That said, for something like this, I’d probably get more out of simply finding an existing implementat... |
Official · 2025-08-04 · Qwen-Image (Qwen) · score 544 · 159 comments

[Positive · Pro · 13 replies]
Not sure why this isn’t a bigger deal —- it seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge. |
[Mixed · Mixed · 1 reply]
Good release! I've added it to the GenAI Showdown site. Overall a pretty good model scoring around 40% - and definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when its quantized).That being said, it still lags prett... |
[Positive · Pro · 2 replies]
The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This only seems to apply those recognizable AI artifacts to only the elements needing to be edited. |
[Neutral · Neutral · 8 replies]
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on t... |
[Neutral · Neutral · 0 replies]
A silly question: do any of these models generate pixels and also vector overlays? I don't see why we need to solve the text problem pixel-for-pixel if we can just generate higher-level descriptions of the text (text, font, font size, etc). Ofc, it won't work in all situations... |
Official · 2025-08-05 · Claude Opus 4.1 (Claude) · score 841 · 328 comments

[Neutral · Neutral · 10 replies]
All three major labs released something within hours of each other. This anime arc is insane. |
[Mixed · Mixed · 5 replies]
Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, sp... |
[Negative · Anti · 22 replies]
I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certa... |
[Neutral · Neutral · 1 reply]
They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon(He had been stuck in the Team Rocket hideout (I believe) for weeks) |
[Very negative · Anti · 4 replies]
Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. Wish they let you tell somehow.At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day.I've basically wasted the morning o... |
Official · 2025-08-07 · GPT-5 (GPT) · score 2,063 · 2,482 comments

[Neutral · Neutral · 109 replies]
It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the differ... |
[Neutral · Anti · 12 replies]
GPT-5 knowledge cutoff: Sep 30, 2024 (10 months before release).Compare that toGemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release)Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release)https://platform.openai.com/docs/models/comparehttps://deepmin... |
[Negative · Anti · 12 replies]
Going by the system card at: https://openai.com/index/gpt-5-system-card/> GPT‑5 is a unified system . . .OK> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which mod... |
[Negative · Anti · 7 replies]
I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year.What excites me now is that Gemini... |
[Negative · Anti · 13 replies]
The marketing copy and the current livestream appear tautological: "it's better because it's better."Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks. |
Repo · 2025-08-11 · GLM-4.5V (GLM) · score 27 · 0 comments (no comments sampled)
Paper · 2025-08-12 · GLM-4.5 (GLM) · score 417 · 83 comments

[Positive · Pro · 2 replies]
Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these... |
[Positive · Pro · 6 replies]
I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it on and confused it for Claude 4. Subjectively, my experience has been:1. Claude is somewhat bet... |
[Neutral · Neutral · 1 reply]
So GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.T... |
[Positive · Pro · 2 replies]
Seems like we may get local, open, workstation-grade models that are useful for coding in a few years. By workstation-grade I mean a computer around 2000 USD, and by useful for coding I mean around Sonnet 4 level. Current cloud based models are fun and useful, but a tool that ... |
[Positive · Pro · 0 replies]
This feels like the first open model that doesn’t require significant caveats when comparing to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results ... |
Official · 2025-08-12 · Claude Sonnet 4 (Claude) · score 1,324 · 692 comments

[Mixed · Mixed · 16 replies]
This is definitely one of my CORE problem as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context and it's not actually that interesting to see a new model that's marginally better than the next one (... |
[Neutral · Pro · 12 replies]
A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max): 1. Build context for the work you're doing. Put lots of your codebase into the context window. 2. Do work, but ... |
[Mixed · Mixed · 4 replies]
This is definitely good to have this as an option but at the same time having more context reduces the quality of the output because it's easier for the LLM to get "distracted". So, I wonder what will happen to the quality of code produced by tools like Claude Code if users do... |
[Mixed · Mixed · 20 replies]
My experience with the current tools so far:1. It helps to get me going with new languages, frameworks, utilities or full green field stuff. After that I expend a lot of time parsing the code to understand what it wrote that I kind of "trust" it because it is too tedious but "... |
[Positive · Pro · 8 replies]
One of the most helpful usages of CC so far is when I simply ask:"Are there any bugs in the current diff"It analyzes the changes very thoroughly, often finds very subtle bugs that would cost hours of time/deployments down the line, and points out a bunch of things to think thr... |
Official · 2025-08-14 · Gemma 3 270M (Gemma) · score 824 · 317 comments

[Positive · Pro · 30 replies]
Hi all, I built these models with a great team. They're available for download across the open model ecosystem so give them a try! I built these models with a great team and am thrilled to get them out to you.From our side we designed these models to be strong for their size o... |
[Negative · Anti · 18 replies]
My lovely interaction with the 270M-F16 model:> what's second tallest mountain on earth?The second tallest mountain on Earth is Mount Everest.> what's the tallest mountain on earth?The tallest mountain on Earth is Mount Everest.> whats the second tallest mountain?The ... |
[Positive · Pro · 3 replies]
I've got a very real world use case I use DistilBERT for - learning how to label wordpress articles. It is one of those things where it's kind of valuable (tagging) but not enough to spend loads on compute for it.The great thing is I have enough data (100k+) to fine-tune and r... |
[Mixed · Mixed · 14 replies]
This model is a LOT of fun. It's absolutely tiny - just a 241MB download - and screamingly fast, and hallucinates wildly about almost everything.Here's one of dozens of results I got for "Generate an SVG of a pelican riding a bicycle". For this one it decided to write a poem: ... |
[Positive · Pro · 6 replies]
Apple should be doing this. Unless their plan is to replace their search deal with an AI deal -- it's just crazy to me how absent Apple is. Tim Cook said, "it's ours to take" but they really seem to be grasping at the wind right now. Go Google! |
Official · 2025-08-21 · DeepSeek-V3.1 (DeepSeek) · score 778 · 263 comments

[Neutral · Neutral · 6 replies]
For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow../llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CP... |
[Mixed · Mixed · 6 replies]
For reference, here is the terminal-bench leaderboard:https://www.tbench.ai/leaderboardLooks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will... |
[Negative · Anti · 5 replies]
Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive?https://artificialanalysis.ai/models/deepseek-v3-1-reasoning |
[Neutral · Anti · 2 replies]
It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoningPricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1 |
[Mixed · Mixed · 2 replies]
It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those. |
Official · 2025-08-26 · Gemini 2.5 Flash Image (Gemini) · score 1,093 · 475 comments

[Very positive · Pro · 19 replies]
This is the gpt 4 moment for image editing models. Nano banana aka gemini 2.5 flash is insanely good. It made a 171 elo point jump in lmarena!Just search nano banana on Twitter to see the crazy results. An example. https://x.com/D_studioproject/status/1958019251178267111 |
[Positive · Pro · 7 replies]
I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana).https://genai-showdown.specr.netThis model gets 8 of the 12 prompts correct and easily comes within ... |
[Very negative · Anti · 4 replies]
Unfortunately, it suffers from the same safetyism than other many releases. Half of the prompts get rejected. How can you have character consistency if the model is forbidden from editing any human. And most of my photo editing involves humans, so basically this is just a usel... |
[Positive · Pro · 3 replies]
There is one thing Gemini 2.5 Flash Image can do that no other edit model can do: incorporate multiple images simultaneously without shenanigans due to its multimodality, e.g. for Flux Kontext, if you want to "put the person in the first image into the second image", you have ... |
[Positive · Pro · 6 replies]
I digitised our family photos but a lot of them were damaged (shifted colours, spills, fingerprints on film, spots) that are difficult to correct for so many images. I've been waiting for image gen to catch up enough to be able to repair them all in bulk without changing detai... |
Official · 2025-08-29 · Grok Code Fast 1 (Grok) · score 512 · 477 comments

[Positive · Pro · 10 replies]
Tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which is awesome in my bo... |
[Negative · Anti · 14 replies]
It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed.I would have thought it uncontroversial view among software engineers that token quality is much important than token ... |
[Negative · Anti · 1 reply]
Is this the model that is the "Coding" version of Grok-4 promised when Grok-4 had awful coding benchmarks?I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up wa... |
[Negative · Anti · 2 replies]
"On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness."Let's see this harness, then, because third party reports rate it at 57.6%https://www.vals.ai/models/grok_grok-code-fast-1 |
[Mixed · Anti · 3 replies]
I've actually seem really good outputs from the regular Grok 4. The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good. I never wanted a faster version, I just wanted a bit more feedback and explanations for sugg... |
Official · 2025-09-12 · Qwen3-Next 80B-A3B (Thinking) (Qwen) · score 569 · 230 comments

[Positive · Pro · 3 replies]
Coolest part of Qwen3-Next, in my opinion, (after the linear attention parts) is that they do MTP without adding another un-embedding matrix.Deepseek R1 also has a MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...But Deepseek R1 adds embed_to... |
[Positive · Mixed · 4 replies]
Alibaba keeps releasing gold contentI just tried Qwen3-Next-80B-A3B on Qwen chat, and it's fast! The quality seem to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial analysisAccording to Qwen Chat, Qwen3-Next has the f... |
[Negative · Anti · 5 replies]
llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"Here's a classic ASCII art representation of SpongeBob SquarePants: .------. / o o \ | | | \___/ | \_______/ llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ "An ASCII of spongebob" Here's an ASCII art of SpongeB... |
[Positive · Pro · 2 replies]
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than 14B model depending on how you off load your VRAM and CPU. That’s insane. |
[Neutral · Neutral · 8 replies]
The same week Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true then this could lead to a lot less demand for Nvidia, Oracle, Coreweave etc |
Official · 2025-09-15 · GPT-5-Codex (GPT) · score 396 · 137 comments

[Positive · Pro · 5 replies]
Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1].SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).As someon... |
[Very positive · Pro · 3 replies]
I've been a hardcore claude-4-sonnet + Cursor fan for a long time, but in the last 2 months my usage went through the roof. I started with the basic Cursor subscription, then upgraded to pro, until I hit usage limits again. Then I started using my own Claude API key but I was ... |
[Very positive · Pro · 4 replies]
Codex CLI IDE just works, very impressed with the quality. If you tried it a while back and didn’t like it, try it again via the vscode extension generous usage included with plus.Ditched my Claude code max sub for the ChatGPT pro $200 plan. So much faster, and not hit any lim... |
[Positive · Pro · 2 replies]
From my observation of the past 2 weeks is that Claude Code is getting dramatically worse and super low usage quota's while OpenAI Codex is getting great and has a very generous usage quota in comparison.For people that have not tried it in say ~1 month, give Codex CLI a try. |
[Positive · Pro · 1 reply]
It's been interesting reading this thread and seeing that others have also switched to using Codex over Claude Code. I kept running into a huge issue with Claude Code creating mock implementations and general fakery when it was overwhelmed. I spent so much time tuning my input... |
Official · 2025-09-15 · GPT-5-Codex (GPT) · score 250 · 141 comments

[Very positive · Pro · 9 replies]
I've been extremely impressed (and actually had quite a good time) with GPT-5 and Codex so far. It seems to handle long context well, does a great job researching the code, never leaves things half-done (with long tasks it may leave some steps for later, but it never does 50% ... |
[Neutral · Neutral · 0 replies]
This should probably be merged with the other GPT-5-Codex thread at https://news.ycombinator.com/item?id=45252301 since nobody in this thread is talking about the system card addendum. |
[Negative · Anti · 2 replies]
My problem _still_ with all of the codex/gpt based offerings is that they think for way too long. After using Claude 4 models through cursor max/ampcode I feel much more effective given it's speed. Ironically, Claude Code feels just as slow as codex/gpt (even with my company p... |
[Mixed · Mixed · 2 replies]
Interesting, the new model uses a different prompt in Codex CLI that's ~half the size (10KB vs. 23KB) of the previous prompt[0][1].SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchm... |
[Negative · Anti · 2 replies]
I signed up to OpenAI, verified my identity, and added my credit card, bought $10 of credits.But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took.Are you meant to only use Codex with the... |
Official · 2025-09-20 · Grok 4 Fast (Grok) · score 96 · 76 comments

[Negative · Anti · 3 replies]
Why would a model called "Fast" not advertise the tokens per second speed it performs at? Is "Fast" not representing speed, but another meaning? Is it too variable? |
[Neutral · Neutral · 1 reply]
Matches Grok 4 at the top of the Extended NYT Connections leaderboard: https://github.com/lechmazur/nyt-connections/ |
[Positive · Pro · 1 reply]
grok-code-fast-1 has been my preferred model lately, but I don't see any mention of it as part of this release. I'm wondering if this might be better? Even if grok-code-fast-1 might be slightly worse than Gemini 2.5 Pro, the speed of iteration can't be beat. |
[Positive · Pro · 4 replies]
Surprising to see negativity here. I send all my LLM queries to 5 LLMs - ChatGPT, Claude, DeepSeek (local), Perplexity, and Grok - and Grok consistently gives good answers and often the most helpful answers. It's ~always king when there's any 'ethical' consideration (i.e. othe... |
[Negative · Anti · 4 replies]
A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests? |
Official · 2025-09-22 · siliconflow/deepseek-v3.1-terminus (DeepSeek) · score 101 · 28 comments

[Mixed · Mixed · 0 replies]
Notable performance improvement in agentic tool use: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-TerminusThe Deepseek provider may train on your prompts: https://openrouter.ai/deepseek/deepseek-v3.1-terminus |
[Neutral · Neutral · 0 replies]
The link is off. This link works https://api-docs.deepseek.com/updates#deepseek-v31-terminus |
[Very negative · Anti · 1 reply]
I tried V3.1 but it was driving me crazy by ignoring parts of user input, which R1 never did. I had many such instances when e.g. asking about running DeepSeek 671B it instead picked DeepSeek 67B because 671B is too large to exist so I must have made a mistake etc. I concluded... |
[Mixed · Mixed · 4 replies]
> What’s improved? Language consistency: fewer CN/EN mix-ups & no more random chars.It's good that they made this improvement. But is there any advantages at this point using DeepSeek over Qwen? |
[Positive · Pro · 0 replies]
Interesting--I'd seen Chinese characters surprise inserted when it was just repeating back input with one provider, but not others. (I'd also occasionally seen tokens surprise-translated to Chinese.)There's a GitHub bug about it that leads to more discussion here: https://gith... |
Repo · 2025-09-22 · Qwen3-Omni Flash (Qwen) · score 571 · 142 comments

[Positive · Pro · 12 replies]
Interesting, the pacing seemed very slow when conversing in english, but when I spoke to it in spanish, it sounded much faster. It's really impressive that these models are going to be able to do real time translation and much more.The Chinese are going to end up owning the AI... |
[Neutral · Neutral · 3 replies]
You can try it out on https://chat.qwen.ai/ - sign in with Google or GitHub (signed out users can't use the voice mode) and then click on the voice icon.It has an entertaining selection of different voices, including:*Dylan* - A teenager who grew up in Beijing's hutongs*Peter*... |
[Neutral · Neutral · 4 replies]
The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.I wonder if we'll see a macOS port soon - currently it very much needs an N... |
[Very positive · Pro · 0 replies]
Here is the demo video on it. The video w/ sound input -> sound output while doing translation from the video to another language was the most impressive display I've seen yet.https://www.youtube.com/watch?v=_zdOrPju4_g |
[Positive · Pro · 2 replies]
Speech input + speech output is a big deal. In theory you can talk to it using voice, and it can respond in your language, or translate for someone else, without intermediary technologies. Right now you need wakeword, speech to text, and then text to speech, in addition to you... |
Official · 2025-09-23 · Qwen3-VL 235B-A22B (Qwen) · score 434 · 160 comments

[Very positive · Pro · 11 replies]
As I mentioned yesterday - I recently needed to process hundreds of low quality images of invoices (for a construction project). I had a script that had used pil/opencv, pytesseract, and open ai as a fallback. It still has a staggering number of failures.Today I tried a handfu... |
[Very positive · Pro · 4 replies]
The Chinese are doing what they have been doing to the manufacturing industry as well. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be bechmaxxed but as another comment said,... |
[Neutral · Neutral · 2 replies]
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF teach week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage ... |
[Very positive · Pro · 2 replies]
The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of proprietary models and still released it as open-weights. My first tests suggest this might actually be true, will continue testing. Wow |
[Negative · Anti · 3 replies]
Sadly it still fails the "extra limb" test.I have a few images of animals with an extra limb photoshopped onto them. A dog with an leg coming out of it's stomach, or a cat with two front right legs.Like every other model I have tested, it insists that the animals have their an... |
Official · 2025-09-25 · Gemini 2.5 Flash (Gemini) · score 540 · 279 comments

[Negative · Anti · 14 replies]
This really captures something I've been experiencing with Gemini lately. The models are genuinely capable when they work properly, but there's this persistent truncation issue that makes them unreliable in practice.I've been running into it consistently, responses that just s... |
[Neutral · Neutral · 2 replies]
I added support to these models to my llm-gemini plugin, so you can run them like this (using uvx so no need to install anything first): export LLM_GEMINI_KEY='...' uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks' Re... |
[Negative · Neutral · 5 replies]
Serious question: If it's an improved 2.5 model, why don't they call it version 2.6? Seems annoying to have to remember if you're using the old 2.5 or the new 2.5. Kind of like when Apple released the third-gen iPad many years ago and simply called it the "new iPad" without a ... |
[Mixed · Mixed · 8 replies]
Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflo... |
[Neutral · Neutral · 5 replies]
Non-AI Summary:Both models have improved intelligence on Artificial Analysis index with lower end-to-end response time. Also 24% to 50% improved output token efficiency (resulting in lower cost).Gemini 2.5 Flash-Lite improvements include better instruction following, reduced v... |
Repo · 2025-09-29 · DeepSeek V3.2 Exp (DeepSeek) · score 309 · 50 comments

[Positive · Pro · 2 replies]
The 2nd order effect that not a lot of people talk about is price: the fact that model scaling at this pace also correlates with price is amazing.I think this is just as important to distribution of AI as model intelligence is.AFAIK there are no fundamental "laws" that prevent... |
||||
|
Positive
Pro
2 repliesr
Happy to see Chinese OSS models keep getting better and cheaper. It also comes with a 50% API price drop for an already cheap model, now at:$0.28/M Input ($0.028/M cache hit) > $0.42/M Output |
||||
|
Neutral
Neutral
2 repliesr
https://openrouter.ai/deepseek/deepseek-v3.2-exp |
||||
|
Neutral
Neutral
0 repliesr
Not sure if I get it correctly:They trained a thing to learn mimicking the full attention distribution but only filtering the top-k (k=2048) most important attention tokens so that when the context window increases, the compute does not go up linearly but constantly for the at... |
||||
|
Positive
Pro
0 repliesr
wow...gigantic reduction in cost while holding the benchmarks mostly steady. Impressive. |
||||
|
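The fourth comment above paraphrases DeepSeek's sparse-attention idea: approximate full attention while only attending to the top-k highest-scoring keys per query, so per-query compute stays bounded as the context grows. A minimal single-head sketch of that selection step, in numpy (shapes and k are illustrative, not DeepSeek's actual kernel or k=2048):

```python
import numpy as np

def topk_attention(q, k_mat, v, k_keep):
    """Single-head attention that keeps only the k_keep highest-scoring
    keys per query and renormalizes the softmax over that subset."""
    d = q.shape[-1]
    scores = q @ k_mat.T / np.sqrt(d)                      # (n_q, n_k)
    if k_keep < scores.shape[-1]:
        # Threshold each row at its k_keep-th largest score; mask the rest.
        kth = np.partition(scores, -k_keep, axis=-1)[:, -k_keep][:, None]
        scores = np.where(scores >= kth, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)           # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k_mat = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))

dense = topk_attention(q, k_mat, v, k_keep=16)   # keeps all 16 keys
sparse = topk_attention(q, k_mat, v, k_keep=4)   # top-4 keys per query
```

With `k_keep` equal to the key count this reduces to ordinary dense attention; the savings come from only materializing the selected scores, which a real kernel does without computing the full score matrix first.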
Official | 2025-09-29 | Claude Sonnet 4.5 (Claude) | Score 1,585 | 786 comments

- Positive / Pro (12 replies): I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of bench...
- Negative / Anti (21 replies): Anecdotal evidence. I have a fairly large web application with ~200k LoC. Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI). "implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing ...
- Very negative / Anti (8 replies): I haven't shouted into the void for a while. Today is as good a day as any other to do so. I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection b...
- Negative / Anti (8 replies): > Practically speaking, we've observed it maintaining focus for more than 30 hours on complex, multi-step tasks. Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and it doesn't show up at all ...
- Mixed / Mixed (7 replies): I just ran this through a simple change I've asked Sonnet 4 and Opus 4.1, and it fails too. It's a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this chang...
Official | 2025-09-30 | GLM-4.6 (GLM) | Score 43 | 10 comments

- Positive / Pro (1 reply): GLM 4.5 is a great budget option for me. After cancelling Claude sub ($20 USD one), I've been trying Gemini 2.5 Pro through Gemini CLI, and it was okay for landing pages and UI. However, I switched to GLM through Claude Code using their cheapest subscription and it's been pret...
- Positive / Pro (0 replies): GLM 4.5 for me one of the best coding model when running with agents such as Claude Code and Opencode (excluding Claude and GPT-5). Looking forward to try GLM 4.6. But Z AI should do like KIMI¹ and launch a verifier to see how GLM models behave on 3rd party providers. [1] https:/...
- Positive / Pro (1 reply): Very respectable. If Sonnet 4.5 was delayed by a day, GLM 4.6 would have been the obvious best choice for agentic coding... up there with gpt-codex. The z.ai monthly plan looks like a steal if you're working on personal hobby projects instead of paying the $200/mo price for Cla...
- Neutral / Neutral (0 replies): It would be great if people that are using GLM-4.6 within Claude Code could chime in with their experience in using it. I'm a MAX subscriber and having extra cheap tokens would be very helpful.
Official | 2025-10-15 | Claude Haiku 4.5 (Claude) | Score 730 | 287 comments

- Mixed / Mixed (6 replies): Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being t...
- Negative / Neutral (19 replies): Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...
- Neutral / Neutral (1 reply): I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.
- Positive / Pro (8 replies): Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...
- Neutral / Neutral (4 replies): I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.
Repo | 2025-10-20 | DeepSeek OCR (DeepSeek) | Score 1,003 | 244 comments

- Positive / Pro (7 replies): The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote: > Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required...
- Neutral / Neutral (4 replies): I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-c...
- Negative / Anti (5 replies): The paper makes no mention of Anna's Archive. I wouldn't be surprised if DeepSeek took advantage of Anna's offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis. https://annas-archive.org/blog...
- Neutral / Neutral (2 replies): For everyone wondering how good this and other benchmarks are: the OmniAI benchmark is bad; instead check OmniDocBench[1] out; Mistral OCR is far far behind most Open Source OCR models and even further behind than Gemini; end-to-end OCR is still extremely tricky; composed pip...
- Neutral / Neutral (7 replies): How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
3rd party | 2025-11-02 | Tongyi DeepResearch (Yi) | Score 365 | 153 comments

- Neutral / Neutral (7 replies): It makes me wonder if we'll see an explosion of purpose-trained LLMs because we hit diminishing returns on investment with pre-training, or if it takes a couple of months to fold these advantages back into the frontier models. Given the size of frontier models I would assume that th...
- Negative / Anti (11 replies): Has anyone found these deep research tools useful? In my experience, they generate really bland reports that don't go much further than summarization of what a search engine would return.
- Neutral / Neutral (2 replies): I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch
- Mixed / Pro (1 reply): This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though.
- Neutral / Neutral (13 replies): Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like. Constra...
Official | 2025-11-06 | Kimi K2 Thinking (Kimi) | Score 936 | 427 comments

- Neutral / Pro (5 replies): As a Chinese user, I can say that many people use Kimi, even though I personally don't use it much. China's open-source strategy has many significant effects—not only because it aligns with the spirit of open source. For domestic Chinese companies, it also prevents startups fr...
- Neutral / Neutral (4 replies): uv tool install llm; llm install llm-moonshot; llm keys set moonshot (paste key); llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle' https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D... Here's what I got using OpenRouter's moonshotai/kimi-k...
- Mixed / Mixed (17 replies): It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run local...
- Positive / Pro (9 replies): Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?
- Very positive / Pro (1 reply): From our tests, Kimi K2 Thinking is better than literally everything - gpt-5, claude 4.5 sonnet. The only model that is better than Kimi K2 Thinking is GPT-5 Codex. It's now available on https://okara.ai if anyone wants to try it.
Repo | 2025-11-08 | Codex Mini (GPT) | Score 56 | 54 comments

- Positive / Neutral (2 replies): I managed to get GPT-5-Codex-Mini to draw me a pelican. It's not a very good one! https://static.simonwillison.net/static/2025/codex-hacking-m... For comparison, here's GPT-5-Codex (not mini) https://static.simonwillison.net/static/2025/codex-hacking-d... and full GPT-5: https:...
- Positive / Pro (0 replies): Since they also limited the usage of Codex CLI quite a bit, this might help. I really like the gpt-5-codex model. It has the most impressive dynamic reasoning effort I've seen. Responding instantly to simple questions while thinking very long when necessary. It's also interest...
- Negative / Anti (2 replies): Codex isn't even a good model.
- Negative / Anti (4 replies): GPT-5 and GPT-5-Codex are already not clever enough for anything interesting.
Official | 2025-11-12 | GPT-5.1 (GPT) | Score 555 | 726 comments

- Negative / Anti (27 replies): I don't want more conversational, I want more to the point. Less telling me how great my question is, less about being friendly, instead I want more cold, hard, accurate, direct, and factual results. It's a machine and a tool, not a person and definitely not my friend.
- Negative / Anti (22 replies): All the examples of "warmer" generations show that OpenAI's definition of warmer is synonymous with sycophantic, which is a surprise given all the criticism against that particular aspect of ChatGPT. I suspect this approach is a direct response to the backlash against removing 4o.
- Negative / Anti (12 replies): > what romanian football player won the premier league > The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that's incorrect; he never won the league. > ... > No Romanian footballer has ever won...
- Mixed / Neutral (3 replies): I've seen various older people that I'm connected with on Facebook posting screenshots of chats they've had with ChatGPT. It's quite bizarre from that small sample how many of them take pride in "baiting" or "bantering" with ChatGPT and then post screenshots showing how they "g...
- Positive / Pro (6 replies): Seems like people here are pretty negative towards a "conversational" AI chatbot. ChatGPT has a lot of frustrations and ethical concerns, and I hate the sycophancy as much as everyone else, but I don't consider being conversational to be a bad thing. It's just preference I guess...
Official | 2025-11-17 | Grok 4.1 Fast (Grok) | Score 140 | 128 comments

- Neutral / Neutral (5 replies): https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
- Negative / Anti (4 replies): No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (And from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking.)
- Negative / Anti (5 replies): Not a big fan of emojis becoming the norm in LLM output. It seems Grok 4.1 uses more emojis than 4. Also GPT-5.1 Thinking is now using emojis, even in math reasoning. 5 didn't do that.
- Very negative / Anti (3 replies): Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which make it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" m...
- Negative / Neutral (1 reply): It is exhausting deciding which model to use on any given day.
Paper | 2025-11-18 | Gemini 3 Pro Preview (Gemini) | Score 280 | 335 comments

- Neutral / Neutral (19 replies): Benchmarks from page 4 of the model card:

  | Benchmark | 3 Pro | 2.5 Pro | Sonnet 4.5 | GPT-5.1 |
  |---|---|---|---|---|
  | Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% |
  | ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% |
  | GPQ... |

- Positive / Mixed (13 replies): It is interesting that the Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
- Positive / Pro (1 reply): One benchmark I would really like to see: instruction adherence. For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outp...
- Neutral / Neutral (8 replies): There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
- Neutral / Neutral (6 replies): Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/[email protected]` which obviously fails... Anyone happen to know why? Is this website by any chance sharing information on safe medical abo...
Official | 2025-11-18 | Gemini 3 Pro Preview (Gemini) | Score 1,735 | 1,056 comments

- Positive / Pro (19 replies): Out of curiosity, I gave it the latest Project Euler problem published on 11/16/2025, very likely out of the training data. Gemini thought for 5m10s before giving me a python snippet that produced the correct answer. The leaderboard says that the 3 fastest humans to solve this pr...
- Very positive / Pro (1 reply): This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... I spent years building a compiler that takes our custom XML format and generat...
- Positive / Pro (11 replies): Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, e...
- Neutral / Pro (5 replies): Static Pelican is boring. First attempt: Generate SVG animation of following: 1 - There is High fantasy mage tower with a top window a dome; 2 - Green goblin come in front of tower with a torch; 3 - Grumpy old mage with beard appear in a tower window in high purple hat; 4 - Mage sends...
- Negative / Anti (14 replies): I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong). For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, ...
Official | 2025-11-19 | GPT-5.1 Codex Max (GPT) | Score 483 | 319 comments

- Mixed / Pro (22 replies): I've been using a lot of Claude and Codex recently. One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character...
- Neutral / Neutral (14 replies): Rest assured that we are better at training models than naming them ;D New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0. Natively trained to work across many hours across multiple context windows via compaction. 30% mor...
- Positive / Pro (4 replies): Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI. As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini ...
- Negative / Anti (5 replies): OpenAI likes to time their announcements alongside major competitor announcements to suck up some of the hype. (See for instance the announcement of GPT-4o a single day before Google's I/O conference.) They were probably sitting on this for a while. That makes me think this is a ...
- Neutral / Neutral (3 replies): Thinking level medium: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
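The second comment on the GPT-5.1 Codex Max post mentions the model being "natively trained to work across multiple context windows via compaction." The model-side training is opaque, but the harness-side idea is a familiar one: when the transcript outgrows the window, fold the oldest turns into a summary and keep going. A toy sketch of that loop, where the token counter and summarizer are caller-supplied stand-ins (not OpenAI's implementation):

```python
def compact(messages, max_tokens, count_tokens, summarize):
    """Fold the oldest messages into a summary until the transcript fits
    within max_tokens. messages is a list of strings, oldest first;
    count_tokens and summarize are supplied by the caller."""
    messages = list(messages)
    while len(messages) > 1 and sum(map(count_tokens, messages)) > max_tokens:
        # Merge the two oldest entries into one summary, preserving recency.
        merged = summarize(messages[0] + "\n" + messages[1])
        messages = [merged] + messages[2:]
    return messages

# Stand-ins for illustration: count words as tokens; "summarize" by truncation.
count = lambda m: len(m.split())
summ = lambda m: " ".join(m.split()[:5]) + " ..."

history = ["alpha " * 40, "beta " * 40, "gamma " * 40, "delta delta"]
compacted = compact(history, max_tokens=60, count_tokens=count, summarize=summ)
```

A real harness would call the model itself as the summarizer and use the tokenizer's count, but the control flow is the same: recent turns survive verbatim, old turns survive only as a compressed summary.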
Official | 2025-11-24 | Claude Opus 4.5 (Claude) | Score 1,113 | 506 comments

- Positive / Pro (19 replies): The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads. Also notable: they're claiming SOTA prompt injection resi...
- Negative / Mixed (15 replies): This is gonna be game-changing for the next 2-4 weeks before they nerf the model. Then for the next 2-3 months people complaining about the degradation will be labeled "skill issue". Then a sacrificial Anthropic engineer will "discover" a couple obscure bugs that "in some cases"...
- Positive / Pro (24 replies): I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else. I ...
- Mixed / Mixed (1 reply): The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks. There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded...
- Positive / Pro (9 replies): Seeing these benchmarks makes me so happy. Not because I love Anthropic (I do like them) but because it's staving off me having to change my coding agent. This world is changing fast, and both keeping up with state of the art and/or the feeling of FOMO is exhausting. I've been hol...
Repo | 2025-12-01 | DeepSeek-V3.2-Speciale (DeepSeek) | Score 20 | 0 comments | — (no comments to label)
Repo | 2025-12-01 | siliconflow/deepseek-v3.2 (DeepSeek) | Score 982 | 465 comments

- Positive / Pro (6 replies): Well props to them for continuing to improve, winning on cost-effectiveness, and continuing to publicly share their improvements. Hard not to root for them as a force to prevent an AI corporate monopoly/duopoly.
- Neutral / Pro (15 replies): How will the Google/Anthropic/OpenAIs of the world make money on AI if open models are competitive with their models? What hurt open source in the past was its inability to keep up with the quality and feature depth of closed source competitors, but models seem to be reaching...
- Positive / Pro (2 replies): Worth noting this is not only good on benchmarks, but significantly more efficient at inference: https://x.com/_thomasip/status/1995489087386771851
- Mixed / Mixed (1 reply): > DeepSeek-V3.2 introduces significant updates to its chat template compared to prior versions. The primary changes involve a revised format for tool calling and the introduction of a "thinking with tools" capability. At first, I thought they had gone the route of implementi...
- Mixed / Mixed (7 replies): It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphics cards ($15-20k machine), can it even run with any reasonable context window that isn't like a crawling 10 tps? Frontier models are far exceeding even the...
Official | 2025-12-02 | Mistral 3 (Mistral) | Score 826 | 236 comments

- Very positive / Pro (7 replies): I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and fol...
- Mixed / Mixed (3 replies): The new large model uses DeepseekV2 architecture. 0 mention on the page lol. It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar ...
- Positive / Pro (2 replies): The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/
- Positive / Pro (1 reply): Europe's bright star has been quiet for a while, great to see them back and good to see them come back to the open source light with Apache 2.0 licenses - they're too far from the SOTA pack for exclusive/proprietary models to work in their favor. Mistral had the best small mode...
- Positive / Pro (4 replies): Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.
Official | 2025-12-05 | Gemini 3 Pro Preview (Gemini) | Score 566 | 295 comments

- Mixed / Mixed (27 replies): Well. It is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs. In fact GPT5 wrote an edge detection...
- Positive / Pro (4 replies): I do some electrical drafting work for construction and throw basic tasks at LLMs. I gave it a shitty harness and it almost 1-shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.
- Positive / Pro (2 replies): These OCR improvements will almost certainly be brought to Google Books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this fro...
- Neutral / Neutral (3 replies): Interesting "ScreenSpot Pro" results: Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Opus 4.5 49.9%, GPT-5.1 3.5%. (ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use, https://arxiv.org/abs/2504.07981)
- Neutral / Neutral (4 replies): In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
Official | 2025-12-09 | Devstral 2 (latest) (Devstral) | Score 745 | 349 comments

- Positive / Pro (9 replies): llm install llm-mistral; llm mistral refresh; llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle" https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Pretty good for a 123B model! (That said I'm not 100% certain I guessed the correct model ID, ...
- Positive / Mixed (2 replies): Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku ...
- Positive / Pro (2 replies): I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them. It spent about half an hour, correctly identified wh...
- Positive / Mixed (0 replies): So I tested the bigger model with my typical standard test queries which are not so tough, not so easy. They are also some that you wouldn't find extensive training data for. Finally, I already have used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3... Here is wha...
- Mixed / Mixed (9 replies): Looks interesting, eager to play around with it! Devstral was a neat model when it released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so gonna be interesting to see if Devstral 2 can replace it. I'm a bit saddened ...
Official | 2025-12-10 | Qwen3-Omni Flash Realtime (Qwen) | Score 316 | 106 comments

- Positive / Pro (7 replies): This is a 30B parameter MoE with 3B active parameters and is the successor to their previous 7B omni model. [1] You can expect this model to have similar performance to the non-omni version. [2] There aren't many open-weights omni models so I consider this a big deal. I would us...
- Neutral / Neutral (4 replies): Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does. Are there any open weight models that do? Not talking about speech to text -> LLM -> text to speech btw, I mean a real voice <-> language model. ed...
- Neutral / Neutral (2 replies): Is there a way to run these Omni models on a Macbook quantized via GGUF or MLX? I know I can run it in LMStudio or Llama.cpp but they don't have streaming microphone support or streaming webcam support. Qwen usually provides example code in Python that requires Cuda and a non-q...
- Neutral / Neutral (0 replies): Wayback for those that can't reach it: https://web.archive.org/web/20251210164048/https://qwen.ai/b...
- Neutral / Neutral (1 reply): The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user. I'm curious how anyone has solved this.
Official | 2025-12-11 | GPT-5.2 (GPT) | Score 1,195 | 1,083 comments

- Neutral / Neutral (18 replies): In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine. The thing that would now make the biggest difference isn't "more intelligence", whatever that...
- Negative / Anti (10 replies): Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image? 0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnO...
- Negative / Mixed (9 replies): I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform, and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "pro...
- Negative / Anti (5 replies): Looks like they've begun censoring posts at r/Codex and not allowing complaint threads, so here is my honest take: it is faster, which is appreciated, but not as fast as Opus 4.5; I see no changes, very little noticeable improvements over 5.1; I do not see any value in exchange ...
- Neutral / Pro (5 replies): I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/): The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9. The medium-reasoning version also improves: 62.7 → 72.1. The no-reasoning version also improves: ...
Official | 2025-12-16 | gpt-image-1.5 (gpt-image) | Score 522 | 256 comments

- Positive / Pro (16 replies): Okay, results are in for GenAI Showdown with the new gpt-image 1.5 model for the editing portions of the site! https://genai-showdown.specr.net/image-editing Conclusions: OpenAI has always had some of the strongest prompt understanding alongside the weakest image fidelity. This u...
- Mixed / Mixed (4 replies): I have a Nano Banana Pro blog post in the works expanding on my experiments with Nano Banana (https://news.ycombinator.com/item?id=45917875). Running a few of my test cases from that post and the upcoming blog post through this new ChatGPT Image model, this new model is better...
- Negative / Anti (7 replies): If this was a farm of sweatshop Photoshoppers in 2010, who download all images from the internet and provide a service of combining them on your request, this would escalate pretty quickly. Question: with copyright and authorship dead wrt AI, how do I make (at least) new content...
- Positive / Pro (2 replies): I am very impressed. A benchmark I like to run is to have it create sprite maps and UV texture maps for an imagined 3D model. Noticed it captured a Mega Man Legends vibe... https://x.com/AgentifySH/status/2001037332770615302 and here it generated a texture map from a 3D character: https:/...
- Negative / Anti (4 replies): It's really weird to see "make images from memories that aren't real" as a product pitch.
Official | 2025-12-17 | Gemini 3 Flash Preview (Gemini) | Score 1,102 | 580 comments

- Very positive / Pro (24 replies): Don't let the "flash" name fool you, this is an amazing model. I have been playing with it for the past few weeks, it's genuinely my new favorite; it's so fast and it has such a vast world knowledge that it's more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fra...
- Mixed / Mixed (8 replies): This is awesome. No preview release either, which is great for production. They are pushing the prices higher with each release though: API pricing is up to $0.50/M for input and $3/M for output. For comparison: Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output; Gemini 2.5 Fl...
- Positive / Pro (4 replies): Feels like Google is really pulling ahead of the pack here. A model that is cheap, fast and good, combined with Android and GSuite integration, seems like such a powerful combination. Presumably a big motivation for them is to be first to get something good and cheap enough they c...
- Mixed / Mixed (6 replies): These flash models keep getting more expensive with every release. Is there an OSS model that's better than 2.0 Flash with similar pricing, speed and a 1M context window? Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world ...
- Positive / Pro (3 replies): This model is breaking records on my benchmark of choice, which is 'the fraction of Hacker News comments that are positive.' Even people who avoid Google products on principle are impressed. Hardly anyone is arguing that ChatGPT is better in any respect (except brand recogniti...
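Per-million-token prices like the ones quoted in the comments above ($0.50/M input, $3.00/M output for Gemini 3 Flash) translate to per-request cost with simple arithmetic. A small helper, using the prices as the commenter states them (they may change, and this ignores cache-hit discounts):

```python
def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# A 200k-token prompt with a 10k-token answer at the quoted Flash prices:
cost = request_cost(200_000, 10_000, 0.50, 3.00)  # → 0.13 dollars
```

The same helper makes provider comparisons in these threads concrete: plug in any two price pairs and the same token counts, and the ratio of the results is the real-world cost multiple.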
Official
2025-12-18
|
GPT-5.2 Codex
GPT
|
589 | 318 |
|
|
Very positive
Pro
15 repliesr
If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, metho... |
||||
|
Positive
Pro
9 repliesr
I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging t... |
Mixed · Mixed · 4 replies
Since they are not showing you how this model compares against the benchmarks they are showing, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context:
SWE-Bench (Pro / Verified)
Model | Pro (%) | Verified (%)
-----------------...
Positive · Pro · 4 replies
The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is more clear, and code that is more maintainable. A pattern I use is: set up a GitHub issue with Claude plan mode, then have Codex execute it...
Positive · Pro · 3 replies
I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex s...
Official · 2025-12-22 · GLM-4.7 (GLM) · Score 437 · Comments 235

Positive · Pro · 11 replies
My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese prim...
Neutral · Neutral · 5 replies
Cerebras is serving GLM-4.6 at 1000 tokens/s right now. They're probably likely to upgrade to this model. I really wonder if GLM 4.7 or models a few generations from now will be able to function effectively in simulated software dev org environments, especially that they self-co...
Mixed · Mixed · 2 replies
Appears to be cheap and effective, though under suspicion. But the personal and policy issues are about as daunting as the technology is promising. Some of the terms, possibly similar to many such services: - The use of Z.ai to develop, train, or enhance any algorithms, models, or ...
Negative · Anti · 3 replies
I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refuses to answer with an error message: "I'm very sorry, I currently can't provide the specific information you need. If you have any other questions or..." (translated from Chinese), followed by a leaked reasoning trace: Analyze the User's Input: Question: "is it ok for l...
Very positive · Pro · 2 replies
I have been using 4.6 on Cerebras (or Groq with other models) since it dropped and it is a glimpse of the future. If AGI never happens but we manage to optimise things so I can run that on my handheld/tablet/laptop device, I am beyond happy. And I guess that might happen. Mayb...
Official · 2025-12-26 · MiniMax-M2.1 (MiniMax) · Score 228 · Comments 81

Mixed · Mixed · 1 reply
I've played with this a bit and it's ok. I'd place it somewhere around Sonnet 4.5 level, probably below. But with this aggressive pricing you can just run 3 copies to do the same thing, choose the one that succeeded and still come out way ahead with the cost. Not as great as f...
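The "run several copies and keep the winner" argument in the comment above is simple arithmetic. A sketch with entirely hypothetical per-task costs (the source does not quote exact prices):

```python
# Hypothetical per-task costs: a model at roughly a fifth of the
# frontier model's price can be sampled three times in parallel and
# still cost less per solved task, as the commenter suggests.
cheap_per_task = 0.02     # hypothetical cheap-model cost per attempt
frontier_per_task = 0.10  # hypothetical frontier-model cost per attempt
n_samples = 3
print(n_samples * cheap_per_task < frontier_per_task)  # True
```

The catch, of course, is that this only helps on tasks where success is cheap to verify, so the best of the three attempts can actually be picked.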
Negative · Anti · 2 replies
Would it kill them to use the words "AI coding agent" somewhere prominent? "MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks" could be an IDE, a UI framework, a performance library, or, or...
Neutral · Neutral · 0 replies
The weights got released on Hugging Face now: https://huggingface.co/MiniMaxAI/MiniMax-M2.1
Neutral · Neutral · 1 reply
I think people should stop comparing to Sonnet, but to Opus instead, since it's so far ahead on producing code I would actually want to use (Gemini 3 Pro tends to be lacking in generalization and wants things to be using its own style rather than adapting). Whatever benchmark o...
Negative · Anti · 2 replies
> MiniMax has been continuously transforming itself in a more AI-native way. The core driving forces of this process are models, Agent scaffolding, and organization. Throughout the exploration process, we have gained increasingly deeper understanding of these three aspects....
Repo · 2026-01-19 · GLM-4.7 (GLM) · Score 378 · Comments 135

Positive · Pro · 3 replies
Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better but it doesn't reall...
Positive · Pro · 2 replies
I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designe...
Positive · Pro · 8 replies
Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to ...
Positive · Pro · 2 replies
> SWE-bench Verified 59.2
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
Neutral · Neutral · 1 reply
What’s the significance of this for someone out of the loop?
Official · 2026-01-22 · Qwen3-TTS (Qwen) · Score 744 · Comments 225

Neutral · Neutral · 9 replies
If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then pas...
Neutral · Neutral · 4 replies
I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423 Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py You can try it with uv (downloads a 4.5GB model on first run) like ...
Mixed · Mixed · 2 replies
Interesting model, I've managed to get the 0.6B param model running on my old 1080 and I can generate 200 character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically i...
Very positive · Pro · 3 replies
It isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and I thought the stuff from two years ago was about the best we were going to get. I don't know the size of these, I scrolled to the samples. I am going to get the mode...
Neutral · Neutral · 9 replies
Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5. Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they are in terms of politics.
Official · 2026-01-26 · Qwen3 Max (Qwen) · Score 502 · Comments 424

Mixed · Mixed · 5 replies
One thing I’m becoming curious about with these models is the token counts to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense, they’re techniques for using the...
Neutral · Neutral · 3 replies
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is. Is it possible that the Chinese internet has better quality content available? My problem with deep research tends to be that what it does is it searche...
Neutral · Neutral · 4 replies
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/...
Neutral · Neutral · 2 replies
Hacker News strongly believes Opus 4.5 is the de facto standard and China was consistently 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks.
Positive · Pro · 0 replies
The most important benchmark: https://boutell.dev/misc/qwen3-max-pelican.svg I used Simon Willison's usual prompt. It thought for over 2 minutes (free account). The commentary was even more glowing than the image. It has a certain charm.
Official · 2026-01-27 · Kimi K2.5 (Kimi) · Score 502 · Comments 239

Neutral · Neutral · 4 replies
Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5
1T parameters, 32B active parameters. License: MIT with the following modification: Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or s...
Very positive · Pro · 5 replies
The "Deepseek moment" was just one year ago today! Coincidence or not, let's just marvel for a second over this amount of magic/technology that's being given away for free... and how liberating and different this is from OpenAI and others that were closed to "protect us all".
Positive · Pro · 4 replies
> For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls.
> K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduc...
Neutral · Neutral · 0 replies
I posted this elsewhere but thought I'd repost here:
* https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO
* https://dashboard.safe.ai/ — CAIS' incredible dashboard
* https://clocks.brianmoore.com/ — a visual comparison of how well models ...
Positive · Pro · 3 replies
One thing that caught my eye is that besides the K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), evolved from Kimi CLI. It is a terminal coding agent; I've been using it for the last month with a Kimi subscription, and it is a capable agent with a stable harness. GitHub: http...
Repo · 2026-01-30 · Kimi K2.5 (Kimi) · Score 388 · Comments 141

Positive · Pro · 4 replies
I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good a...
Negative · Anti · 5 replies
Seems that K2.5 has lost a lot of the personality from K2 unfortunately; it talks in a more ChatGPT/Gemini/C-3PO style now. It's not explicitly bad, I'm sure most people won't care, but it was something that made it unique so it's a shame to see it go. Examples to illustrate: https://ww...
Negative · Anti · 0 replies
I tried this today. It's good - but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh, the speed via...
Positive · Pro · 0 replies
I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.
Mixed · Mixed · 2 replies
It is amazing, but "open source model" means "model I can understand and modify" (= all the training data and processes). Open weights is the equivalent of binary driver blobs everyone hates. "Here is an opaque thing, you have to put it on your computer and trust it, and you can...
Official · 2026-02-03 · Qwen3 Coder Flash (Qwen) · Score 735 · Comments 429

Positive · Pro · 14 replies
This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher end laptops. I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough ...
Neutral · Neutral · 7 replies
For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next
Neutral · Neutral · 2 replies
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this: brew upgrade llama.cpp (or brew install if you don't have it yet), then: llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL --fit on --seed 3407 --temp 1.0 --top-p ...
Positive · Mixed · 3 replies
It’s hard to overstate just how wild this model might be if it performs as claimed. The claims are this can perform close to Sonnet 4.5 for assisted coding (SWE-bench) while using only 3B active parameters. This is obscenely small for the claimed performance.
Positive · Pro · 0 replies
I got the Qwen3 Coder 30B running locally on my Mac M4 Max 36GB. It was slow, but it worked and did do some decent stuff: https://www.youtube.com/watch?v=7mAPaRbsjTU (video is sped up). I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: ht...
Official · 2026-02-05 · Claude Opus 4.6 (Claude) · Score 2,346 · Comments 1,031

Very positive · Pro · 37 replies
Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books. All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 ...
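The context-window arithmetic in that comment checks out as a back-of-envelope calculation, assuming the ~1.75M-token total is split roughly evenly across the 7 books (real book lengths vary, so this is an approximation):

```python
total_tokens = 1_750_000     # ~1.75M tokens for all 7 books, per the comment
per_book = total_tokens / 7  # even-split assumption: ~250k tokens per book
books_in_1m_window = int(1_000_000 // per_book)
print(int(per_book), books_in_1m_window)  # 250000 4
```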
Positive · Anti · 7 replies
5.3 Codex (https://openai.com/index/introducing-gpt-5-3-codex/) crushes with a 77.3% in Terminal-Bench. The shortest-lived lead ever, gone in less than 35 minutes. What a time to be alive!
Mixed · Mixed · 12 replies
I'm still not sure I understand Anthropic's general strategy right now. They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding. Meanwhile, Claude's general use cases are... fine. For generic r...
Neutral · Neutral · 1 reply
Claude Code release notes:
> Version 2.1.32: • Claude Opus 4.6 is now available! • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) • Claude now automatically records ...
Mixed · Mixed · 23 replies
The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
Official · 2026-02-05 · Claude Opus 4.6 (Claude) · Score 154 · Comments 50

Mixed · Mixed · 4 replies
Lately my company has been doing a lot of complex accounting and reporting in spreadsheets. Overall I was surprised by how well both GPT and Claude handled some of these extremely tedious tasks. Not uncommon to have an hours-long task compressed to minutes. My anecdotal experienc...
Neutral · Neutral · 0 replies
Based on the article... is this basically just making Claude better at formatting and data presentation, or does it also get better at analysis? I get the impression it's the former.
Mixed · Mixed · 0 replies
The benchmarks look good. Slide decks and spreadsheets look better. The people must use Claude Cowork and have their Claude Code moment and figure out the consequences. It will be really interesting to see articles like this (https://mitchellh.com/writing/my-ai-adoption-journe...
Mixed · Mixed · 1 reply
And then you hand it to your boss, who takes a 20 second look at it and asks why you made a projection that assumes massive revenue growth and 3 years of perfectly flat utilities, insurance, G&A - no inflation etc. It does look really promising as a skeleton starting point th...
Negative · Neutral · 0 replies
> The side-by-side outputs below show how output quality has improved from Claude Opus 4.5 to Opus 4.6.
Disclaimer: I use AI to code (and I code for finance) and I love Anthropic. But: for f-ck's sake, I cannot click on the picture and have it show up in full. It stays at its...
Official · 2026-02-05 · GPT-5.3 Codex (GPT) · Score 1,530 · Comments 605

Neutral · Pro · 23 replies
What's interesting to me is that gpt-5.3 and opus-4.6 are diverging philosophically, and really in the same way that actual engineers and orgs have diverged philosophically. With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the...
Positive · Pro · 6 replies
I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex! The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 for GPT-5.2-codex. GPT-5.3-codex scores 77.3.
Mixed · Mixed · 6 replies
> GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks...
Positive · Pro · 2 replies
Something that caught my eye from the announcement:
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was cr...
Neutral · Neutral · 8 replies
I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
Official · 2026-02-05 · Claude Opus 4.6 (Claude) · Score 212 · Comments 75

Negative · Anti · 8 replies
Considering all the problems they've been having with over-charging Claude Code users over the past few weeks, it's the very least they could do. Max subscribers are hitting their 5 hour usage limits in 30-40 minutes with a single instance doing light work, while Anthropic have...
Negative · Anti · 1 reply
Is this one of those "hey, turn off the overcharge protection because you can go $50 into debt for free, and then maybe you'll just keep going and not notice you owe us an extra $500" type of situations?
Neutral · Neutral · 1 reply
Meanwhile Codex is running a free tier for a month right now for anyone who wants to experiment with it: https://openai.com/index/introducing-the-codex-app/ Doesn't appear to include the new model though, only the state-of-yesterday's-art (literally yesterday's).
Very negative · Anti · 2 replies
I'll pass on this $50, but please hire real humans and fix your crappy app, Claude! This bug has been around for years: in Claude (web or app), if you create a new chat in the middle of an existing chat's thinking or tool calling, the existing chat will be broken, either losing data or bec...
Neutral · Neutral · 0 replies
You can use this if you started your Pro or Max subscription before Wednesday, February 4, 2026 at 11:59 PM PT. Go to https://claude.ai/settings/usage, turn on extra usage and enable the promo from the notification afterwards. I received €42; a top-up was not required and auto-rel...
Official · 2026-02-10 · Qwen-Image-2.0 (Qwen) · Score 422 · Comments 192

Neutral · Neutral · 10 replies
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, when the ren...
Positive · Pro · 2 replies
Couple of thoughts:
1. I’d wager that given their previous release history, this will be open‑weight within 3-4 weeks.
2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can...
Positive · Pro · 3 replies
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation.
Neutral · Neutral · 1 reply
The "horse riding man" prompt is wild: """A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult br...
Neutral · Neutral · 11 replies
I recently tried out LM Studio on Linux for local models. So easy to use! What Linux tools are you all using for image generation models like Qwen's diffusion models, since LM Studio only supports text gen?
Official · 2026-02-11 · GLM-5 (GLM) · Score 484 · Comments 520

Mixed · Mixed · 9 replies
Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07... Solid bird, not a great bicycle frame.
Mixed · Pro · 8 replies
Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future. I've previously doubted that the N-1 or N-2 open weight models will ever be attractive to end users, especially power users. But it now seems that user preferences will be ye...
Positive · Pro · 2 replies
Bought some API credits and ran it through opencode (model was "GLM 5"). Pretty impressed, it did good work. Good reasoning skills and tool use. Even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (dynamic typed OO sc...
Mixed · Mixed · 1 reply
Let's not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2]. I've got subs for both, and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support. [1] https:/...
Positive · Pro · 5 replies
GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clea...
Official · 2026-02-12 · MiniMax-M2.5 (MiniMax) · Score 207 · Comments 55

Negative · Anti · 3 replies
I hope better and cheaper models will be widely available, because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one really cannot call it smart. The more critical issue is that MiniMax 2 and 2.1 have the st...
Negative · Anti · 2 replies
Pelican is recognizable but not great, and the bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...
Very positive · Pro · 2 replies
Really looked forward to this release, as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kim...
Very negative · Anti · 1 reply
Hm. The benchmarks look too good to be true, and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective prop...
Negative · Anti · 0 replies
M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking
Official · 2026-02-12 · Gemini 3 Pro Preview (Gemini) · Score 1,081 · Comments 693

Positive · Pro · 15 replies
Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6). Wow. https://blog.google/innovation-and-ai/models-and-research/ge...
Negative · Neutral · 11 replies
Is it me or is the rate of model releases accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had K...
Positive · Pro · 3 replies
I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a f...
Very positive · Pro · 17 replies
Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.
Neutral · Neutral · 3 replies
Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_... The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved". > Submi...
Official · 2026-02-12 · GPT-5.3 Codex Spark (GPT) · Score 890 · Comments 387

Positive · Pro · 13 replies
Wow, I wish we could post pictures to HN. That chip is HUGE!!!! The WSE-3 is the largest AI chip ever built, measuring 46,255 mm² and containing 4 trillion transistors. It delivers 125 petaflops of AI compute through 900,000 AI-optimized cores — 19× more transistors and 28× mor...
Positive · Pro · 8 replies
I love this! I use coding agents to generate web-based slide decks where “master slides” are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it’s straightforward to generate a clean, predefined presentation. What I’d r...
Mixed · Mixed · 7 replies
First thoughts using gpt-5.3-codex-spark in Codex CLI: blazing fast, but it definitely has a small-model feel. It's tearing up bluey bench (my personal agent speed benchmark), which is a file system benchmark where I have the agent generate transcripts for untitled episodes of a ...
Very positive · Pro · 10 replies
Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing.
Neutral · Neutral · 2 replies
My stupid pelican benchmark proves to be genuinely quite useful here: you get a visual representation of the quality difference between GPT-5.3-Codex-Spark and full GPT-5.3-Codex: https://simonwillison.net/2026/Feb/12/codex-spark/
Official · 2026-02-13 · GPT-5.2 (GPT) · Score 574 · Comments 401

Neutral · Mixed · 26 replies
The headline may make it seem like AI just discovered some new result in physics all on its own, but reading the post, humans started off trying to solve some problem, it got complex, GPT simplified it and found a solution with the simpler representation. It took 12 hours for ...
Positive · Pro · 15 replies
It's interesting to me that whenever a new breakthrough in AI use comes up, there's always a flood of people who come in to handwave away why this isn't actually a win for LLMs. Like with the novel solutions GPT 5.2 has been able to find for Erdős problems - many users here (e...
Positive · Pro · 2 replies
"An internal scaffolded version of GPT‑5.2 then spent roughly 12 hours reasoning through the problem, coming up with the same formula and producing a formal proof of its validity." When I use GPT 5.2 Thinking Extended, it gave me the impression that it's consistent enough/has a...
Positive · Mixed · 9 replies
AI can be an amazing productivity multiplier for people who know what they're doing. This result reminded me of the C compiler case that Anthropic posted recently. Sure, agents wrote the code for hours, but there was a human there giving them directions, scoping the problem, fin...
Mixed · Mixed · 1 reply
It would be more accurate to say that humans using GPT-5.2 derived a new result in theoretical physics (or, if you're being generous, humans and GPT-5.2 together derived a new result). The title makes it sound like GPT-5.2 produced a complete or near-complete paper on its own,...
Official · 2026-02-16 · Qwen3.5 397B-A17B (Qwen) · Score 434 · Comments 214

Neutral · Neutral · 1 reply
For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5
Negative · Anti · 9 replies
You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.
Positive · Pro · 0 replies
"The post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive." I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training obje...
Negative · Anti · 8 replies
Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...
Positive · Pro · 3 replies
Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.
Official · 2026-02-17 · Claude Sonnet 4.6 (Claude) · Score 1,346 · Comments 1,226

Very negative · Anti · 15 replies
I see a big focus on computer use - you can tell they think there is a lot of value there, and in truth it may be as big as coding if they convincingly pull it off. However, I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their o...
Negative · Neutral · 4 replies
They use the word "Sonnet" 60+ times on that page but never give the casual reader any context of what a "Sonnet model" actually is. Neither does their landing page. You have to scroll all the way to the footer to find a link under the "Models" section. You click it and you fi...
Mixed · Mixed · 9 replies
I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems, which spans ~16 years. It is a far cry from Opus 4.6. Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 Pro. Didn't hallucinate anything and produced honestly mind-blowing ana...
Neutral · Neutral · 16 replies
The demise of SaaS has been overplayed imho. When companies buy software they are essentially buying something that solves a problem and the insurance that comes with that. Part of that means they get to pick up the phone and complain if something doesn't work and someone on t...
Positive · Neutral · 9 replies
I always grew up hearing “competition is good for the consumer.” But I never really internalized how good fierce battles for market share are. The amount of competition in a space is directly proportional to how good the results are for consumers.
Official · 2026-02-19 · Gemini 3.1 Pro Preview (Gemini) · Score 963 · Comments 914

Negative · Anti · 31 replies
I hope this works better than 3.0 Pro. I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development. It's stunningly good at reasoning, design, and generatin...
Positive · Pro · 22 replies
People underrate Google's cost effectiveness so much. Half the price of Opus. HALF. Think about ANY other product and what you'd expect from competition that's half the price. Yet people here act like Gemini is dead weight. Update: 3.1 was 40% of the cost to run the AA index vs Opu...
Mixed · Mixed · 2 replies
If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts). So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too. While it gives me hope, I am going to pl...
Neutral · Neutral · 9 replies
Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3 Compare to Op...
Mixed · Mixed · 8 replies
These models are so powerful. It's totally possible to build entire software products in a fraction of the time it took before. But, reading the comments here, the behaviors from one point version to another (not a major version, mind you) seem very divergent. It feels lik...
Qwen3.5 122B (Qwen) · 3rd party · 2026-02-28 · score 461 · 272 comments

Mixed / Anti · 18 replies
If you're new to this: all of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, then they always disappoint in actual use. I've been playing with Qwen3-Coder-...

Negative / Anti · 19 replies
I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge. So far Opus 4.6 and Gemini Pro are ve...

Neutral / Neutral · 4 replies
I recently wrote a guide on getting:
- llama.cpp
- OpenCode
- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)
working on an M1 MacBook Pro (e.g. using brew). It was a bit finicky to get all of the pieces together, so hopefully this can be used with these newer models....

Positive / Pro · 4 replies
I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous. I then discovered what quantization is by reading a blog post about binary quantizat...

Negative / Anti · 3 replies
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable. Up until relatively recently, while people had already long been making these claim...
Gemini 3.1 Flash Lite Preview (Gemini) · Official · 2026-03-03 · score 60 · 30 comments

Neutral / Mixed · 2 replies
Lots of comments about the price change, but Artificial Analysis reports that 3.1 Flash-Lite (reasoning) used fewer than half of the tokens of 2.5 Flash-Lite (reasoning). This will likely bring the cost below 2.5 Flash-Lite for many tasks (depends on the ratio of input to output...

Negative / Anti · 0 replies
Unfortunate, significant price increase for a 'lite' model: $0.25 IN / $1.50 OUT vs. Gemini 2.5 Flash-Lite $0.10 IN / $0.40 OUT.

Negative / Anti · 4 replies
For the last 2 years, startup wisdom has been that models will continue to get cheaper and better. Claude first, and now Gemini, has shown that it's not the case. We priced an enterprise contract using Flash 1.5 pricing last summer, and today that contract would be unit economic...

Neutral / Anti · 0 replies
That's a 150% increase in input costs and a 275% increase in output costs over the same sized previous generation (2.5-flash-lite) model.

Positive / Pro · 2 replies
You can test Gemini 3.1 Lite transcription capabilities in https://ottex.ai — the only dictation app supporting Gemini models with native audio input. We benchmarked it for real-life voice-to-text use cases: <10s 10-30s 30s-1m 1-2m 2-3m Flash 2548 2732 3177 4583 5961 Flash L...
GPT-5.3 Chat (GPT) · Official · 2026-03-03 · score 395 · 301 comments

Very negative / Anti · 22 replies
The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer. "Why it matters", "the big picture", "it's not just you", the awful emphasis, the quotations with rhetorical questions, etc. I don't know if it's intentional so you can ea...

Negative / Anti · 7 replies
I'm a bit confused by this branding (never even noticed that there was a 5.2-Instant). It's not a super fast 1000tok/s Cerebras-based model which they have for codex-spark, it's just 5.2 w/out the router / "non-thinking" mode? I feel like OpenAI is going to get right back to wh...

Negative / Anti · 11 replies
Since the page mentions:
> Better judgment around refusals
Has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, "make up a joke about <group>" and then ...

Negative / Anti · 5 replies
I kind of chuckled when I read the headline "GPT‑5.3 Instant: Smoother, more ..." LLM companies are starting to sound like cigarette advertisements.

Negative / Anti · 6 replies
Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories on long range projectiles?
Qwen Turbo (Qwen) · 3rd party · 2026-03-04 · score 783 · 360 comments

Positive / Pro · 8 replies
I really hope this doesn't hinder development too much. As Simon says, Qwen3.5 is very impressive. I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've h...

Negative / Anti · 2 replies
There has been tension between Qwen's research team and Alibaba's product team over the Qwen App. And recently, Alibaba tried to impose DAU as a KPI. It's understandable that a company like Alibaba would force a change of product strategy for any number of reasons. What puzzle...

Positive / Pro · 1 reply
One thing I’ve noticed with local models is that people tolerate a lot more trial and error behavior. When a hosted model wastes tokens it feels expensive, but when a local model loops a bit it just feels like it’s “thinking.” If models like Qwen can get good enough for coding ...

Neutral / Neutral · 9 replies
I wonder how a US lab hasn't dumped truckloads of cash into various laps to ensure these researchers have a place at their lab.

Neutral / Neutral · 5 replies
How do those companies make money? Qwen, GLM, Kimi, etc. were all released for free. I have no experience in the field, but from reading HN alone my impression was that training is exceptionally costly and inference can barely be made profitable. How/why do they fund ongoing development ...
GPT-5.4 (GPT) · Official · 2026-03-05 · score 1,019 · 807 comments

Mixed / Mixed · 13 replies
The marquee feature is obviously the 1M context window, compared to the ~200k other models support, with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k (https://openai.com/api/pricing/). Also per...

Negative / Anti · 11 replies
I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a team mate: "Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m...

Positive / Pro · 7 replies
I've only used 5.4 for 1 prompt (edit: 3 @ high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Cod...

Negative / Anti · 19 replies
I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post f...

Negative / Mixed · 10 replies
So let me get this straight: OpenAI previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5, which was more like a router that put all these models under the hood so you only had to prompt GPT-5, and it w...
GPT-5.4 Pro (GPT) · 3rd party · 2026-03-05 · score 93 · 2 comments

Neutral / Neutral · 0 replies
Comments moved to https://news.ycombinator.com/item?id=47265045.

Neutral / Neutral · 0 replies
More discussion: https://news.ycombinator.com/item?id=47265005
GPT-5.4 mini (GPT) · Official · 2026-03-17 · score 248 · 145 comments

Positive / Pro · 7 replies
I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on release day, but right now:
- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).
- GPT-...

Positive / Pro · 6 replies
To me, mini releases matter much more and better reflect the real progress than SOTA models. The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them. Meanwhile, when a smaller / less powerful model releases a new ...

Mixed / Mixed · 7 replies
I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them. They're incredibly slow (via official API or OpenRouter), but most of all they seem not to understand the instructions that I...

Neutral / Neutral · 4 replies
Here's a grid of pelicans for the different models and reasoning levels: https://static.simonwillison.net/static/2026/gpt-5.4-pelican...

Neutral / Anti · 2 replies
According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper?
GPT 5 mini: Input $0.25 / Output $2.00
GPT 5 nano: Input $0.05 / Output $0.40
GPT 5.4 mini: Input $0.75 / Output $4.50
G...
Qwen3.6 Plus (Qwen) · Official · 2026-04-02 · score 594 · 213 comments

Negative / Anti · 11 replies
This is their hosted-only model, not an open weight model like they’ve become known for. They got a lot of good publicity for their open weight model releases, which was the goal. The hard part is pivoting from an open weight provider to being considered a competitor to Cla...

Positive / Mixed · 2 replies
I understand people's reactions to the Qwen team comparing against Opus 4.5 instead of 4.6, and them comparing against Gemini Pro 3.0 instead of 3.1. But calling it misleading is a bit of a stretch in my eyes; people here are acting like we immediately forgot how previous generations...

Positive / Pro · 2 replies
Pretty solid Pelican: https://gist.github.com/simonw/ca081b679734bc0e5997a43d29fad... I used the https://modelstudio.alibabacloud.com/ API to generate that one, which required signing up for an account and attaching PayPal billing - but it looks like OpenRouter are offering it ...

Negative / Anti · 5 replies
Worth noting that this model, unlike almost all Qwen models, is not open-weight, nor is the parameter count exposed. Also odd that it is compared against Opus 4.5 even though 4.6 was released like 2 months ago.

Positive / Pro · 2 replies
I'll diverge from some of these comments: I don't find it misleading to compare to Opus 4.5. I can remember how good Opus 4.5 was. If I'm considering using this, it's most informative to me to compare to the model it's closest to that I have familiarity with. I'm obviously not s...
Gemma 4 26B (Gemma) · Official · 2026-04-02 · score 1,811 · 474 comments

Positive / Pro · 21 replies
Thinking / reasoning + multimodal + tool calling. We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well! Guide for those interested: https://unsloth.ai/docs/models/gemma-4 Also note to use temperature = 1.0, top_p ...

Mixed / Mixed · 9 replies
I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop: https://simonwillison.net/2026/Apr/2/gemma-4/ The gemma-4-31b model is compl...

Neutral / Neutral · 3 replies
Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:
| Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t |
|----------------|-------|-------|-------|------|-------|-------|-------|-------|
| G4 31B | 85.2% | ...

Negative / Mixed · 6 replies
Prompt:
> what is the Unix timestamp for this: 2026-04-01T16:00:00Z
Qwen 3.5-27b-dwq:
> Thought for 8 minutes 34 seconds. 7074 tokens.
> The Unix timestamp for 2026-04-01T16:00:00Z is:
> 1775059200 (my comment: Wednesday, 1 April 2026 at 16:00:00)
Gemma-4-26b-a4b:
> Thou...

Neutral / Neutral · 23 replies
Hi all! I work on the Gemma team - one of many, as this one was a bigger effort given it was a mainline release. Happy to answer whatever questions I can.
GLM-5.1 (GLM) · Official · 2026-04-07 · score 617 · 262 comments

Positive / Pro · 3 replies
We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com: Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models. But keeping in mind t...

Positive / Pro · 3 replies
Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/

Neutral / Anti · 1 reply
Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run, even with high end hardware.
[0] https://huggingface.co/unsloth/GLM-5...

Mixed / Mixed · 7 replies
To be honest I am a bit sad, as GLM 5.1 is producing much better TypeScript than Opus or Codex imo, but no matter what, it does sometimes go into schizo mode at some point over longer contexts. Not always though - I have had multiple sessions go over 200k and be fine.

Positive / Pro · 15 replies
Every single day, three things are becoming more and more clear: (1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat. (2) Local/private inference is the future of AI. (3) There's *still* no killer product yet (so get to work!)
MiniMax-M2.7 (MiniMax) · 3rd party · 2026-04-12 · score 80 · 34 comments

Negative / Anti · 4 replies
Absolutely not "open source" - here's the license: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICE...
> Non-commercial use permitted based on MIT-style terms; commercial use requires prior written authorization.
And calling the non-commercial usage "MIT-style ter...

Positive / Pro · 1 reply
GGUFs are out too, well done Unsloth as usual! https://huggingface.co/unsloth/MiniMax-M2.7-GGUF I've been using M2.7 through the Alibaba coding plan for a bit now, and am quite impressed with its coding ability, and even more impressed when I see how small it is. Fascinating re...

Negative / Anti · 2 replies
What's people's experience of using MiniMax for coding? I had a really bad time with it. I use (real) Claude Code for work so I know what a good model feels like. MiniMax's token plan is nice but the quality is really far from Claude models. I needed to constantly "remind" it to...

Mixed / Mixed · 1 reply
"Helped build itself" is a bit of a stretch here; it makes it sound as if the model was doing lasting self-improvements. What the article describes is that the model was able to tweak its own deployment harness (memory, skills, experimental loop, etc.) to improve performance o...

Mixed / Mixed · 1 reply
In addition to this conversation already having been started at https://news.ycombinator.com/item?id=47735348 yesterday, MiniMax M2.7 is not open source. The open weights have been released, which is definitely good and follows some of the spirit of open source, but isn't the ...