This report tracks how Hacker News reacted to 156 posts about LLM launches from 13 providers, covering 132 model versions across 24 families (March 2023 to April 2026). The selected posts drew 56,041 HN comments in total, with a labeled sample of 735 non-deleted top-level comments for sentiment analysis.
Dataset and methodology
Posts are included if the title points to a model launch or release, the source is official or a trusted third party, the post clears minimum score and comment thresholds (≥50 each), and it passes manual review. Duplicates (e.g. one post covering multiple models) are counted once.
For each post, the top 5 comments (by HN display order) are collected and labeled for sentiment and stance using an LLM. Discussion themes are identified via keyword matching.
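As a sketch, the selection, sampling, and theme-tagging rules above might look like this (field names are illustrative, not the report's actual schema):

```python
def include_post(post, min_score=50, min_comments=50):
    """Keep launch/release posts that clear both engagement thresholds."""
    return (
        post["is_launch"]              # title points to a model launch/release
        and post["trusted_source"]     # official or vetted third-party source
        and post["score"] >= min_score
        and post["num_comments"] >= min_comments
    )

def sample_top_comments(comments, k=5):
    """Top k non-deleted top-level comments, in HN display order."""
    top_level = [c for c in comments if c["parent_is_story"] and not c["deleted"]]
    return top_level[:k]

def match_themes(text, themes):
    """Keyword-based theme tagging; themes maps name -> list of keywords."""
    lowered = text.lower()
    return [name for name, kws in themes.items()
            if any(kw in lowered for kw in kws)]
```

Manual review and LLM labeling sit on top of this; the snippet only captures the mechanical filters.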
In total: 156 posts from 13 providers, covering 132 model versions (Mar 2023 to Apr 2026). These are mostly official announcements (118), plus code repositories (15), papers (7), and third-party coverage (16). The posts drew 56,041 HN comments, of which 735 non-deleted top-level comments were sampled for labeling.
Across providers
This section compares providers on launch timing, performance, and comment quality. Charts use the top 8 providers by total score.
Launch timeline
How many models shipped each month, and how the pace changed over time.
Monthly model launch cadence
Unique model versions launched per month (by release date). Peak: 2025-09 (11 models).
Launch performance
Score, comment volume, and per-post averages — which providers get the most attention and which perform best.
Overall model launch timeline
Score vs. comments
Each dot is one post. Points above the diagonal get more discussion relative to their score.
Top 10 posts by score
The top 5 posts account for 13.5% of all score — attention is highly concentrated.
Provider engagement overview
All providers ranked by post count, total score, and engagement efficiency.
| Provider | Posts | Total score | Total comments | Avg score | Avg comments | Score/comment | Score share |
|---|---|---|---|---|---|---|---|
| OpenAI | 28 | 22,602 | 16,682 | 807 | 596 | 1.35 | 21.8% |
| Google DeepMind | 24 | 21,055 | 11,043 | 877 | 460 | 1.91 | 20.3% |
| Anthropic | 11 | 13,791 | 7,114 | 1,254 | 647 | 1.94 | 13.3% |
| | 24 | 11,311 | 5,183 | 471 | 216 | 2.18 | 10.9% |
| | 20 | 11,002 | 4,258 | 550 | 213 | 2.58 | 10.6% |
| Meta | 6 | 7,488 | 3,217 | 1,248 | 536 | 2.33 | 7.2% |
| | 10 | 6,970 | 3,053 | 697 | 305 | 2.28 | 6.7% |
| | 9 | 3,227 | 1,775 | 359 | 197 | 1.82 | 3.1% |
| | 4 | 2,174 | 986 | 544 | 247 | 2.2 | 2.1% |
| | 7 | 1,871 | 2,082 | 267 | 297 | 0.9 | 1.8% |
| | 5 | 1,054 | 305 | 211 | 61 | 3.46 | 1.0% |
| | 4 | 864 | 245 | 216 | 61 | 3.53 | 0.8% |
| | 4 | 364 | 98 | 91 | 25 | 3.71 | 0.4% |
- Scale vs. efficiency: the providers that ship the most launches do not always earn the highest per-post engagement. OpenAI posts the most; Anthropic has the highest average score among providers with 3+ posts.
- Concentration risk: Cohere has the highest single-post concentration (52.5% of its score from one post), while Alibaba is the most evenly distributed (7.7%).
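The per-provider columns above reduce to a per-post aggregation. A sketch, assuming post records carry provider, score, and comment counts (field names are illustrative):

```python
from collections import defaultdict

def provider_stats(posts):
    """Recompute the engagement columns from per-post records."""
    agg = defaultdict(lambda: {"posts": 0, "score": 0, "comments": 0})
    for p in posts:
        a = agg[p["provider"]]
        a["posts"] += 1
        a["score"] += p["score"]
        a["comments"] += p["comments"]
    grand_total = sum(a["score"] for a in agg.values())
    return {
        prov: {
            "avg_score": round(a["score"] / a["posts"]),
            "avg_comments": round(a["comments"] / a["posts"]),
            "score_per_comment": round(a["score"] / a["comments"], 2),
            "score_share": round(100 * a["score"] / grand_total, 1),
        }
        for prov, a in agg.items()
    }
```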
Comments
735 labeled top comments from 152 posts (100.0% of sampled, non-deleted comments) across providers. Net sentiment is 0.169 on a −2 to +2 scale; 37.9% of comments are pro-stance, while debate-weighted controversy sits at 38.0%. Anti-stance comments average 6.8 replies vs. 6.5 for pro.
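These aggregates can be sketched in a few lines. The −2 to +2 sentiment scale comes from the report; the exact controversy weighting is not spelled out, so the reply-share version below is an assumption:

```python
SENTIMENT_SCORE = {
    "very negative": -2, "negative": -1, "mixed": 0,
    "neutral": 0, "positive": 1, "very positive": 2,
}

def net_sentiment(comments):
    """Mean sentiment label on the -2..+2 scale."""
    return sum(SENTIMENT_SCORE[c["sentiment"]] for c in comments) / len(comments)

def reply_weighted_controversy(comments):
    """Share of replies landing on anti- or mixed-stance comments --
    one plausible reading of 'debate-weighted controversy'."""
    total = sum(c["replies"] for c in comments)
    debated = sum(c["replies"] for c in comments if c["stance"] in ("anti", "mixed"))
    return debated / total if total else 0.0
```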
Comment sentiment by post
Color = net score, size = comment count, shape = post type.
Common discussion themes
Keyword matches across 735 comments.
Sentiment distribution
735 labeled comments.
Comment sentiment by provider
Aggregate sentiment and stance per provider, with per-post breakdowns.
| Provider / post | Sentiment | Stance | Pro % | Anti % | Neg % | Avg replies | Controversy % |
|---|---|---|---|---|---|---|---|
| OpenAI | 0.0 | 0.0 | 32.1% | 32.1% | 32.1% | 8.86 | 50.3% |
| GPT-4 (4,091 pts) | 0.0 | -0.2 | 40.0% | 60.0% | 60.0% | 23.0 | |
| | -0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 3.6 | |
| GPT-4o (3,138 pts) | 0.4 | 0.0 | 20.0% | 20.0% | 20.0% | 9.6 | |
| | 0.0 | 0.2 | 40.0% | 20.0% | 20.0% | 3.4 | |
| | -1.0 | -0.8 | 0.0% | 80.0% | 80.0% | 6.8 | |
| OpenAI O1-Mini (30 pts) | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.67 | |
| OpenAI O3-Mini (962 pts) | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 9.8 | |
| GPT-4.5 (1,136 pts) | -0.6 | -0.6 | 0.0% | 60.0% | 60.0% | 18.8 | |
| | -0.2 | 0.0 | 20.0% | 20.0% | 20.0% | 3.8 | |
| GPT-4.1 in the API (680 pts) | -0.2 | 0.2 | 20.0% | 0.0% | 20.0% | 7.6 | |
| OpenAI o3 and o4-mini (555 pts) | 0.0 | 0.2 | 60.0% | 40.0% | 40.0% | 6.6 | |
| OpenAI o3-pro (289 pts) | 0.0 | -0.2 | 20.0% | 40.0% | 20.0% | 4.2 | |
| GPT-5 (2,063 pts) | -0.6 | -0.8 | 0.0% | 80.0% | 60.0% | 30.6 | |
| GPT-5-Codex (396 pts) | 1.4 | 1.0 | 100.0% | 0.0% | 0.0% | 3.0 | |
| | 0.0 | -0.2 | 20.0% | 40.0% | 40.0% | 3.0 | |
| | 0.0 | -0.25 | 25.0% | 50.0% | 50.0% | 2.0 | |
| | -0.4 | -0.4 | 20.0% | 60.0% | 60.0% | 14.0 | |
| | 0.0 | 0.2 | 40.0% | 20.0% | 20.0% | 9.6 | |
| GPT-5.2 (1,195 pts) | -0.6 | -0.2 | 20.0% | 40.0% | 60.0% | 9.4 | |
| GPT Image 1.5 (522 pts) | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 6.6 | |
| GPT-5.2-Codex (589 pts) | 1.0 | 0.8 | 80.0% | 0.0% | 0.0% | 7.0 | |
| GPT-5.3-Codex (1,530 pts) | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 9.0 | |
| GPT-5.3-Codex-Spark (890 pts) | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 8.0 | |
| | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 10.6 | |
| GPT-5.3 Instant (395 pts) | -1.2 | -1.0 | 0.0% | 100.0% | 100.0% | 10.2 | |
| GPT-5.4 (1,019 pts) | -0.4 | -0.2 | 20.0% | 40.0% | 60.0% | 12.0 | |
| | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.0 | |
| GPT-5.4 Mini and Nano (248 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 0.0% | 5.2 | |
| Google DeepMind | 0.286 | 0.252 | 43.7% | 18.5% | 21.0% | 8.3 | 31.3% |
| Gemini AI (2,135 pts) | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 12.8 | |
| | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 10.8 | |
| Gemma: New Open Models (1,129 pts) | -0.4 | -0.4 | 0.0% | 40.0% | 40.0% | 12.4 | |
| | -0.6 | -0.4 | 0.0% | 40.0% | 40.0% | 3.4 | |
| Gemini Flash (442 pts) | -0.4 | -0.4 | 0.0% | 40.0% | 40.0% | 2.8 | |
| | 1.2 | 1.0 | 100.0% | 0.0% | 0.0% | 6.8 | |
| | -0.4 | -0.2 | 20.0% | 40.0% | 60.0% | 13.4 | |
| | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 4.6 | |
| | -0.5 | -0.5 | 25.0% | 75.0% | 75.0% | 1.25 | |
| Gemini 2.5 (973 pts) | 1.4 | 1.0 | 100.0% | 0.0% | 0.0% | 9.0 | |
| Gemini 2.5 Flash (1,076 pts) | 1.0 | 1.0 | 100.0% | 0.0% | 0.0% | 10.0 | |
| | 1.2 | 0.6 | 80.0% | 20.0% | 20.0% | 3.6 | |
| | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 3.6 | |
| | 0.4 | 0.4 | 60.0% | 20.0% | 20.0% | 14.2 | |
| Gemini 2.5 Flash Image (1,093 pts) | 0.6 | 0.6 | 80.0% | 20.0% | 20.0% | 7.8 | |
| | -0.4 | -0.2 | 0.0% | 20.0% | 40.0% | 6.8 | |
| | 0.4 | 0.2 | 20.0% | 0.0% | 0.0% | 9.4 | |
| Gemini 3 (1,735 pts) | 0.6 | 0.6 | 80.0% | 20.0% | 20.0% | 10.0 | |
| | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 8.0 | |
| | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 9.0 | |
| Gemini 3 Deep Think (1,081 pts) | 0.6 | 0.6 | 60.0% | 0.0% | 20.0% | 9.8 | |
| Gemini 3.1 Pro (963 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 14.4 | |
| | -0.2 | -0.4 | 20.0% | 60.0% | 40.0% | 1.6 | |
| Google releases Gemma 4 open models (1,811 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 12.4 | |
| Alibaba | 0.324 | 0.234 | 36.9% | 13.5% | 13.5% | 5.12 | 29.3% |
| Qwen2 LLM Released (261 pts) | 0.2 | 0.2 | 40.0% | 20.0% | 20.0% | 4.2 | |
| | 0.8 | 0.8 | 80.0% | 0.0% | 20.0% | 3.6 | |
| | -0.2 | -0.2 | 0.0% | 20.0% | 20.0% | 4.2 | |
| | 0.2 | 0.4 | 40.0% | 0.0% | 0.0% | 5.4 | |
| | 1.0 | 0.6 | 60.0% | 0.0% | 0.0% | 5.4 | |
| | 0.0 | 0.0 | 50.0% | 50.0% | 50.0% | 0.5 | |
| QVQ-Max: Think with Evidence (121 pts) | 0.25 | 0.25 | 25.0% | 0.0% | 0.0% | 1.75 | |
| | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 11.0 | |
| | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 8.8 | |
| | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 4.8 | |
| Qwen3-Next (569 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 4.4 | |
| | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 4.2 | |
| Qwen3-VL (434 pts) | 1.0 | 0.4 | 60.0% | 20.0% | 20.0% | 4.4 | |
| | -0.2 | 0.0 | 20.0% | 20.0% | 20.0% | 6.8 | |
| | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 2.8 | |
| | 0.4 | 0.2 | 20.0% | 0.0% | 0.0% | 5.4 | |
| Qwen3-Max-Thinking (502 pts) | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 2.8 | |
| Qwen3-Coder-Next (735 pts) | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 5.2 | |
| | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 5.4 | |
| | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 4.2 | |
| | -0.2 | -0.4 | 20.0% | 60.0% | 40.0% | 9.6 | |
| | 0.2 | 0.2 | 40.0% | 20.0% | 20.0% | 5.0 | |
| | 0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 4.4 | |
| Mistral AI | 0.187 | 0.165 | 34.1% | 17.6% | 18.7% | 3.29 | 35.9% |
| Mistral 7B (884 pts) | 0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 2.6 | |
| | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 4.2 | |
| Mixtral of experts (639 pts) | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 2.6 | |
| | 0.6 | 0.4 | 40.0% | 0.0% | 0.0% | 2.8 | |
| Mistral Large (599 pts) | 0.4 | 0.4 | 40.0% | 0.0% | 0.0% | 2.0 | |
| Mixtral 8x22B (533 pts) | 0.0 | -0.2 | 0.0% | 20.0% | 0.0% | 5.2 | |
| | -0.4 | -0.2 | 0.0% | 20.0% | 40.0% | 6.0 | |
| Codestral Mamba (485 pts) | -0.2 | 0.0 | 20.0% | 20.0% | 40.0% | 2.8 | |
| Mistral NeMo (418 pts) | -0.2 | -0.2 | 0.0% | 20.0% | 20.0% | 2.4 | |
| | -0.2 | -0.2 | 20.0% | 40.0% | 40.0% | 1.8 | |
| Un Ministral, Des Ministraux (216 pts) | -0.6 | -0.6 | 0.0% | 60.0% | 60.0% | 3.8 | |
| Codestral 25.01 (21 pts) | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.0 | |
| Mistral Small 3 (620 pts) | 0.0 | 0.4 | 60.0% | 20.0% | 20.0% | 3.0 | |
| Mistral OCR (1,756 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 4.6 | |
| Medium Is the New Large (70 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 0.8 | |
| Devstral (701 pts) | 1.0 | 0.8 | 80.0% | 0.0% | 0.0% | 2.0 | |
| | -0.4 | -0.2 | 20.0% | 40.0% | 40.0% | 5.4 | |
| | 1.0 | 0.8 | 80.0% | 0.0% | 0.0% | 3.4 | |
| | 0.8 | 0.4 | 40.0% | 0.0% | 0.0% | 4.4 | |
| Anthropic | -0.073 | 0.018 | 25.5% | 23.6% | 27.3% | 11.89 | 37.8% |
| Claude 3.7 Sonnet and Claude Code (2,127 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 40.2 | |
| Claude 4 (2,013 pts) | -0.2 | -0.2 | 40.0% | 60.0% | 40.0% | 7.8 | |
| Claude Opus 4.1 (841 pts) | -0.6 | -0.4 | 0.0% | 40.0% | 40.0% | 8.4 | |
| | 0.2 | 0.4 | 40.0% | 0.0% | 0.0% | 12.0 | |
| Claude Sonnet 4.5 (1,585 pts) | -0.6 | -0.4 | 20.0% | 60.0% | 60.0% | 11.2 | |
| Claude Haiku 4.5 (730 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 7.6 | |
| Claude Opus 4.5 (1,113 pts) | 0.4 | 0.6 | 60.0% | 0.0% | 20.0% | 13.6 | |
| Claude Opus 4.6 (2,346 pts) | 0.6 | 0.0 | 20.0% | 20.0% | 0.0% | 16.0 | |
| | -0.2 | 0.0 | 0.0% | 0.0% | 20.0% | 1.0 | |
| | -0.8 | -0.6 | 0.0% | 60.0% | 60.0% | 2.4 | |
| Claude Sonnet 4.6 (1,346 pts) | -0.4 | -0.2 | 0.0% | 20.0% | 40.0% | 10.6 | |
| DeepSeek | 0.267 | 0.267 | 37.8% | 11.1% | 11.1% | 4.96 | 21.6% |
| DeepSeek-R1 (1,843 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 12.2 | |
| | 1.0 | 0.6 | 60.0% | 0.0% | 0.0% | 6.6 | |
| DeepSeek-V3 Technical Report (132 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 1.4 | |
| Deepseek R1-0528 (451 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 6.8 | |
| DeepSeek-v3.1 (778 pts) | -0.2 | -0.4 | 0.0% | 40.0% | 20.0% | 4.2 | |
| DeepSeek-v3.1-Terminus (101 pts) | -0.2 | 0.0 | 20.0% | 20.0% | 20.0% | 1.0 | |
| DeepSeek-v3.2-Exp (309 pts) | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 1.2 | |
| DeepSeek OCR (1,003 pts) | 0.0 | 0.0 | 20.0% | 20.0% | 20.0% | 5.0 | |
| | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 6.2 | |
| Zhipu AI | 0.462 | 0.41 | 56.4% | 15.4% | 12.8% | 3.72 | 34.8% |
| | -0.2 | -0.2 | 20.0% | 40.0% | 40.0% | 4.0 | |
| | 0.2 | 0.0 | 40.0% | 40.0% | 40.0% | 3.8 | |
| | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 2.2 | |
| | 0.75 | 0.75 | 75.0% | 0.0% | 0.0% | 0.5 | |
| | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 4.6 | |
| GLM-4.7-Flash (378 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 3.2 | |
| | 0.4 | 0.6 | 60.0% | 0.0% | 0.0% | 5.0 | |
| | 0.6 | 0.4 | 60.0% | 20.0% | 0.0% | 5.8 | |
| xAI | -0.314 | -0.286 | 25.7% | 54.3% | 54.3% | 4.51 | 67.4% |
| Grok-2 Beta Release (226 pts) | 0.2 | 0.4 | 60.0% | 20.0% | 20.0% | 3.0 | |
| | -1.0 | -1.0 | 0.0% | 100.0% | 100.0% | 4.2 | |
| Grok 4 Launch [video] (437 pts) | 0.6 | 0.4 | 60.0% | 20.0% | 20.0% | 6.6 | |
| Grok 4 (328 pts) | -0.6 | -0.6 | 0.0% | 60.0% | 60.0% | 5.6 | |
| Grok Code Fast 1 (512 pts) | -0.4 | -0.6 | 20.0% | 80.0% | 60.0% | 6.0 | |
| Grok 4 Fast (96 pts) | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 2.6 | |
| Grok 4.1 (140 pts) | -1.0 | -0.6 | 0.0% | 60.0% | 80.0% | 3.6 | |
| Meta | 0.467 | 0.367 | 53.3% | 16.7% | 16.7% | 7.27 | 27.4% |
| Llama 2 (2,268 pts) | -0.2 | -0.2 | 40.0% | 60.0% | 60.0% | 13.2 | |
| Meta Llama 3 (2,199 pts) | 0.2 | 0.0 | 40.0% | 40.0% | 20.0% | 4.2 | |
| Llama 3.1 (437 pts) | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 2.8 | |
| | 1.2 | 0.8 | 80.0% | 0.0% | 0.0% | 6.2 | |
| Llama-3.3-70B-Instruct (425 pts) | 0.8 | 0.8 | 80.0% | 0.0% | 0.0% | 3.8 | |
| The Llama 4 herd (1,235 pts) | 0.0 | 0.2 | 20.0% | 0.0% | 20.0% | 13.4 | |
| | 0.5 | 0.45 | 55.0% | 10.0% | 10.0% | 4.05 | 27.7% |
| | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 3.6 | |
| | 0.6 | 0.6 | 60.0% | 0.0% | 0.0% | 7.2 | |
| | 0.8 | 0.6 | 60.0% | 0.0% | 0.0% | 3.2 | |
| | 0.0 | 0.0 | 40.0% | 40.0% | 40.0% | 2.2 | |
| Microsoft | 0.3 | 0.15 | 35.0% | 20.0% | 15.0% | 2.2 | 23.4% |
| Phi-3 Technical Report (411 pts) | 0.4 | 0.2 | 40.0% | 20.0% | 20.0% | 3.4 | |
| | 0.0 | 0.0 | 33.3% | 33.3% | 33.3% | 0.0 | |
| | 0.0 | -0.2 | 20.0% | 40.0% | 20.0% | 4.4 | |
| | 1.0 | 1.0 | 100.0% | 0.0% | 0.0% | 0.5 | |
| Phi-4 Reasoning Models (131 pts) | 0.4 | 0.2 | 20.0% | 0.0% | 0.0% | 0.8 | |
| | -0.15 | -0.2 | 25.0% | 45.0% | 40.0% | 1.7 | 63.0% |
| | 0.6 | 0.4 | 60.0% | 20.0% | 0.0% | 2.2 | |
| | -0.4 | -0.4 | 0.0% | 40.0% | 40.0% | 1.2 | |
| | -0.6 | -0.6 | 20.0% | 80.0% | 80.0% | 1.6 | |
| | -0.2 | -0.2 | 20.0% | 40.0% | 40.0% | 1.8 | |
| | -0.313 | -0.25 | 18.8% | 43.8% | 43.8% | 1.0 | 62.5% |
| | -0.4 | -0.4 | 20.0% | 60.0% | 60.0% | 1.0 | |
| | 0.0 | 0.0 | 0.0% | 0.0% | 0.0% | 0.0 | |
| | 0.2 | 0.2 | 20.0% | 0.0% | 0.0% | 1.4 | |
| | -0.8 | -0.6 | 20.0% | 80.0% | 80.0% | 0.8 | |
Posts with skeptical comment sections
High-score posts where comments skew anti or negative — the biggest gaps between upvotes and reception. The gap index combines anti/negative comment share with the post's visibility (log-scaled score), so high-scoring posts with hostile threads rank highest.
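The gap index can be written out concretely. The form below, the mean of a post's anti-stance and negative-sentiment shares scaled by log score normalized against the top post (4,091 points), reproduces the listed values (e.g. GPT-5 gives 64.24). Treat it as a reconstruction from the table rather than the report's canonical definition:

```python
import math

def gap_index(anti_share, neg_share, score, max_score=4091):
    """Mean of anti-stance and negative-sentiment shares (0..1),
    scaled by log-visibility relative to the highest-scoring post."""
    skepticism = (anti_share + neg_share) / 2
    visibility = math.log10(score + 1) / math.log10(max_score + 1)
    return round(100 * skepticism * visibility, 2)
```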
| Post | Score | Anti % | Mixed % | Avg replies | Gap idx |
|---|---|---|---|---|---|
| GPT-5.3 Instant | 395 | 100.0% | 0.0% | 10.2 | 71.92 |
| GPT-5 | 2,063 | 80.0% | 0.0% | 30.6 | 64.24 |
| GPT-4 | 4,091 | 60.0% | 0.0% | 23.0 | 60.0 |
| | 132 | 100.0% | 0.0% | 4.2 | 58.8 |
| Llama 2 | 2,268 | 60.0% | 0.0% | 13.2 | 55.75 |
| | 290 | 80.0% | 20.0% | 6.8 | 54.57 |
| Claude Sonnet 4.5 | 1,585 | 60.0% | 20.0% | 11.2 | 53.16 |
| Grok Code Fast 1 | 512 | 80.0% | 0.0% | 6.0 | 52.52 |
| | 207 | 80.0% | 0.0% | 1.6 | 51.34 |
| GPT-4.5 | 1,136 | 60.0% | 20.0% | 18.8 | 50.76 |

Sampled top comments for each of these posts:

GPT-5.3 Instant (395 points):

- Very negative, anti (22 replies): The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer. Why it matters, the big picture, it's not jut you, the awful emphasis, the quotations with rhetorical questions, etc.. I don't know if it's intentional s...
- Negative, anti (7 replies): I'm a bit confused by this branding (never even noticed that there was a 5.2-Instant), it's not a super fast 1000tok/s Cerebras based model which they have for codex-spark, it's just 5.2 w/out the router / non-thinking mode? I feel like openai is ...
- Negative, anti (11 replies): Since the page mentions: Better judgment around refusals Has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, make up a joke about group and then iteratin...
- Negative, anti (5 replies): I kind of chuckled when I read the headline GPT-5.3 Instant: Smoother, more ... LLM companies starting to sound like cigarette advertisements.
- Negative, anti (6 replies): Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories on long range projectiles?

GPT-5 (2,063 points):

- Neutral, neutral (109 replies): It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the d...
- Neutral, anti (12 replies): GPT-5 knowledge cutoff: Sep 30, 2024 (10 months before release). Compare that to Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release) Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release) https://platform.openai.com/docs/model...
- Negative, anti (12 replies): Going by the system card at: https://openai.com/index/gpt-5-system-card/ GPT-5 is a unified system . . . OK . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly...
- Negative, anti (7 replies): I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year. What excites me now is that ...
- Negative, anti (13 replies): The marketing copy and the current livestream appear tautological: it's better because it's better. Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.

GPT-4 (4,091 points):

- Very positive, pro (44 replies): After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire ...
- Negative, anti (42 replies): A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example: Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowe...
- Very negative, anti (14 replies): I just finished reading the 'paper' and I'm astonished that they aren't even publishing the # of parameters or even a vague outline of the architecture changes. It feels like such a slap in the face to all the academic AI researchers that their work is buil...
- Negative, anti (9 replies): That footnote on page 15 is the scariest thing i've read about AI/ML to date. To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reas...
- Very positive, pro (6 replies): From the livestream video, the tax part was incredibly impressive. After ingesting the entire tax code and a specific set of facts for a family and then calculating their taxes for them, it then was able to turn that all into a rhyming poem. Mind blown. Here it is in its entir...

Post scoring 132:

- Negative, anti (7 replies): This article is weak and just general speculation. Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this: Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all ...
- Negative, anti (2 replies): The creation of a model which is co-state-of-the-art (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim out that xAI's failure to significantly outperform existing models despite throwing more compute a...
- Negative, anti (10 replies): Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute. If you had $3 billion, xAI would choose to invest $2.5 billion in G...
- Negative, anti (2 replies): I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good. Another nit-pick: I don't think DeepSeek had 50k H...
- Negative, anti (0 replies): The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is: Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thi...

Llama 2 (2,268 points):

- Positive, pro (7 replies): Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5! AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions. - Llama 1 (llama-65b): 57.6 - LLama 2 (llama-2-70b-chat-hf): 64.6 - GPT-3.5: 85.2 -...
- Negative, anti (25 replies): Key detail from release: If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must reques...
- Negative, anti (6 replies): This was a pretty disappointing initial exchange: what are the most common non-investor roles at early stage venture capital firms? Thank you for reaching out! I'm happy to help you with your question. However, I must point out that the term non-investor roles may be perc...
- Positive, pro (23 replies): Hey HN, we've released tools that make it easy to test LLaMa 2 and add it to your own app! Model playground here: https://llama2.ai Hosted chat API here: https://replicate.com/a16z-infra/llama13b-v2-chat If you want to just play with the mode...
- Negative, anti (5 replies): Another non-open source license. Getting better but don't let anyone tell you this is open source. http://marble.onl/posts/software-licenses-masquerading-as-op...

Post scoring 290:

- Very negative, anti (25 replies): So google will eventually be mostly indexing the output of LLMs, and at that point they might as well skip the middleman and generate all search results by themselves, which incidentally, this is how I am using Kagi today - I basically ask questions and get the answers, and I ...
- Negative, anti (3 replies): That's an inflection point, all right. OpenAI's customers can now at least break even. Of course, it means a flood of crap content.
- Negative, anti (4 replies): For example, putting in 50k page views a month, with a Finance category, gives a potential yearly earnings of $2,000. I'm going to take the median across all categories, which is an estimated annual revenue of $1,550 for 50,000 monthly page views. This is approximately ~$...
- Negative, anti (0 replies): This article is weird clickbait which, even weirder, worked. It seems to assume a world where SEO entrepreneurs were ready to churn out million-page sites, but the cost per query were blocking them. There is no marginal cost, no SEO cost to adding another page, as long as a c...
- Neutral, mixed (2 replies): This assumes a future where users are still depending on search engines or some comparative tool. Profiting off the current status quo. I would also be curious how user behavior will evolve to identify, evade, and ignore AI generated content. Some quasi arms race we'll be...

Claude Sonnet 4.5 (1,585 points):

- Positive, pro (12 replies): I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehens...
- Negative, anti (21 replies): Anecdotal evidence. I have a fairly large web application with ~200k LoC. Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI). implement a fuzzy search for conversations and reports either when selecting Go to Conversation or Go to Report and typing th...
- Very negative, anti (8 replies): I haven't shouted into the void for a while. Today is as good a day as any other to do so. I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connec...
- Negative, anti (8 replies): Practically speaking, we've observed it maintaining focus for more than 30 hours on complex, multi-step tasks. Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all...
- Mixed, mixed (7 replies): I just ran this through a simple change I've asked Sonnet 4 and Opus 4.1, and it fails too. It's a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this chan...

Grok Code Fast 1 (512 points):

- Positive, pro (10 replies): Tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which...
- Negative, anti (14 replies): It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the fast name of the model) is token output speed. I would have thought it uncontroversial view among software engineers that token quality is much important than to...
- Negative, anti (1 reply): Is this the model that is the Coding version of Grok-4 promised when Grok-4 had awful coding benchmarks? I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up was...
- Negative, anti (2 replies): On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness. Let's see this harness, then, because third party reports rate it at 57.6% https://www.vals.ai/models/grok_grok-code-fast-1
- Mixed, anti (3 replies): I've actually seen really good outputs from the regular Grok 4. The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good. I never wanted a faster version, I just wanted a bit more feedback and explanation...

Post scoring 207:

- Negative, anti (3 replies): I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have t...
- Negative, anti (2 replies): Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...
- Very positive, pro (2 replies): Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kim...
- Very negative, anti (1 reply): Hm. The benchmarks look too good to be true and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective...
- Negative, anti (0 replies): M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking

GPT-4.5 (1,136 points):

- Negative, anti (48 replies): GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens It sou...
- Negative, anti (12 replies): Is it official then? Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led to non-tech voices to suggest likewise....
- Neutral, neutral (10 replies): I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments): hn-summary.sh 43197872 -m gpt-4.5-preview Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes... Here's the result: https://gist.g...
- Negative, anti (9 replies): Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real was that all moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles t...
- Mixed, mixed (15 replies): First impression of GPT-4.5: 1. It is very very slow, for some applications where you want real time interactions is just not viable, the text attached below took 7s to generate with 4o, but 46s with GPT4.5 2. The style it writes is way better: it keeps the tone you ask and ma...
Key insights
Timing.
- Launch cadence accelerates after mid-2024, with more months containing multiple major model releases. The peak month is 2025-09 with 11 launches.
Attention.
- Top 3 providers hold 55.4% of total score from 40.4% of posts.
- Top 5 posts alone account for 13.5% of all score.
- GPT-4 leads both in score (4,091 points) and discussion (2,507 comments).
Discussion.
- Criticism gets more replies: 6.8 avg for anti-stance vs 6.5 for pro.
- Open source / weights is the top discussion theme (131 mentions), followed by Coding quality (129).
- Open-weight models get more positive reception (net sentiment 0.346) vs closed-source (0.062).
- Surface sentiment skews positive (35.8% of comments), but reply-weighted controversy sits at 38.0%.
Providers.
- OpenAI leads in volume and total attention, but draws the most pushback: 50.3% reply-weighted controversy.
- Anthropic leads in average score per post (1,254), but its threads are debate-heavy: 23.6% anti stance.
- Google DeepMind combines a 20.3% score share with positive sentiment (0.286) and low controversy.
- Rising engagement: Alibaba, Mistral AI, Zhipu AI.
- Declining engagement: OpenAI, Anthropic, DeepSeek.
Engagement data uses the full curated set. Sentiment and stance come from a smaller top-comment sample — treat close comparisons with caution.
Provider details
Each provider section shows its launch timeline and post details. For cross-provider rankings, comment patterns, and performance comparisons, see the across providers section above.
OpenAI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| Official · 2023-03-14 | GPT-4 (GPT) | 4,091 | 2,507 | |
- Very positive / Pro (44 replies): After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patie...
- Negative / Anti (42 replies): A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example: > Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not all...
- Very negative / Anti (14 replies): I just finished reading the 'paper' and I'm astonished that they aren't even publishing the # of parameters or even a vague outline of the architecture changes. It feels like such a slap in the face to all the academic AI researchers that their work is built off over the years...
- Negative / Anti (9 replies): That footnote on page 15 is the scariest thing i've read about AI/ML to date. "To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and...
- Very positive / Pro (6 replies): From the livestream video, the tax part was incredibly impressive. After ingesting the entire tax code and a specific set of facts for a family and then calculating their taxes for them, it then was able to turn that all into a rhyming poem. Mind blown. Here it is in its entir...
| Official · 2024-04-09 | GPT-4 Turbo Vision (GPT) | 220 | 107 | |
- Positive / Pro (8 replies): They also added both JSON and function support to the vision model - previously it didn't have those. This means you can now use gpt-4-turbo vision to extract structured data from an image! I was previously using a nasty hack where I'd run the image through the vision model to e...
- Neutral / Neutral (2 replies): Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities: > "I'm hoping we can get evals out shortly to help quantify this. Until ...
- Negative / Anti (2 replies): People like to say AI is moving blazingly fast these days but this has been like a year in the waiting que. Guessing sora will take equally long if not way longer before the general audience gets to touch it.
- Neutral / Pro (0 replies): Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images: https://www.datasette.cloud/blog/2024/datasette-extract/ YouTube video demo here: https://www.youtube.com/watc...
- Negative / Anti (6 replies): Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive!
| Official · 2024-05-13 | GPT-4o (GPT) | 3,138 | 2,366 | |
- Positive / Mixed (14 replies): The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that was that they had tested this over and over, but it is really hard to get that right so if they didn't fake it in some way I'd say that ...
- Mixed / Mixed (8 replies): Very interesting and extremely impressive! I tried using the voice chat in their app previously and was disappointed. The big UX problem was that it didn't try to understand when I had finished speaking. English is a second language and I paused a bit too long thinking of a wor...
- Positive / Mixed (3 replies): Very, very impressive for a "minor" release demo. The capabilities here would look shockingly advanced just 5 years ago. Universal translator, pair programmer, completely human sounding voice assistant and all in real time. Scifi tropes made real. But: Interesting next to see ho...
- Negative / Anti (8 replies): I found these videos quite hard to watch. There is a level of cringe that I found a bit unpleasant. It’s like some kind of uncanny valley of human interaction that I don’t get on nearly the same level with the text version.
- Positive / Pro (15 replies): We've had voice input and voice output with computers for a long time, but it's never felt like spoken conversation. At best it's a series of separate voice notes. It feels more like texting than talking. These demos show people talking to artificial intelligence. This is new. ...
| Official · 2024-07-18 | GPT-4o mini (GPT) | 222 | 78 | |
- Neutral / Neutral (0 replies): [dupe] Some more discussion: https://news.ycombinator.com/item?id=40996248
- Positive / Pro (4 replies): The big news for me here is the 16k output token limit. The models keep increasing the input limit to outrageous amounts, but output has been stuck at 4k. I did a project to summarize complex PDF invoices (not “unstructured” data, but “idiosyncratically structured” data, as eac...
- Neutral / Pro (3 replies): Here's something interesting to think about: In ML we do a lot of bootstrapping. If a model is 51% wrong on a binary problem you flip the answer and train a 51% correct model then work your way up from there. Small models are trained from synthetic and live data curated and gen...
- Negative / Anti (9 replies): GPT-4o mini is $0.15/1M input tokens, $0.60/1M output tokens. In comparison, Claude Haiku is $0.25/1M input tokens, $1.25/1M output tokens. There's no way this price-race-to-the-bottom is sustainable.
- Neutral / Neutral (1 reply): @dang: This post isn't on the 1st or 2nd page of hacker news. Did it trip some automated controversy detection code for too many comments in the first hour? Edit: it says 181 points, 6 hours ago, and eyeballing the 1st page it should be in the top 5 right now.
| 3rd party · 2024-07-19 | GPT-4o mini (GPT) | 290 | 236 | |
- Very negative / Anti (25 replies): So google will eventually be mostly indexing the output of LLMs, and at that point they might as well skip the middleman and generate all search results by themselves, which incidentally, this is how I am using Kagi today - I basically ask questions and get the answers, and I ...
- Negative / Anti (3 replies): That's an inflection point, all right. OpenAI's customers can now at least break even. Of course, it means a flood of crap content.
- Negative / Anti (4 replies): > For example, putting in 50k page views a month, with a Finance category, gives a potential yearly earnings of $2,000. > I'm going to take the median across all categories, which is an estimated annual revenue of $1,550 for 50,000 monthly page views. > This is approxim...
- Negative / Anti (0 replies): This article is weird clickbait which, even weirder, worked. It seems to assume a world where SEO entrepreneurs where ready to churn out million-page sites, but the cost per query were blocking them. There is no marginal cost, no SEO cost to adding another page, as long as a co...
- Neutral / Mixed (2 replies): This assumes a future where users are still depending on search engines or some comparative tool. Profiting off the current status quo. I would also be curious how user behavior will evolve to identify, evade, and ignore AI generated content. Some quasi arms race we'll be in f...
| Official · 2024-09-12 | OpenAI o1-mini (o-series) | 30 | 5 | |
- Neutral / Neutral (0 replies): related tweet: https://x.com/OpenAIDevs/status/1834278699388338427 OpenAI o1-preview and o1-mini are rolling out today in the API for developers on tier 5. o1-preview has strong reasoning capabilities and broad world knowledge. o1-mini is faster, 80% cheaper, and competitive with...
- Neutral / Neutral (0 replies): Related: OpenAI O1 Model https://news.ycombinator.com/item?id=41523070
- Neutral / Neutral (2 replies): No GPT in the name, so maybe not transformer based?
| Official · 2025-01-31 | o3-mini (o-series) | 962 | 904 | |
- Neutral / Neutral (15 replies): I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae... For 18,936 input, 2,905 output it cost 3.3612 cents. Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes...
- Neutral / Neutral (2 replies): I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21 Example usage: llm -m o3-mini 'write a poem about a pirate and a walrus' \ -o reasoning_effort high Output (com...
- Positive / Pro (4 replies): For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet. 62% $186 o1 high 60% $18 o3-mini high 57% $5 DeepSe...
- Positive / Pro (14 replies): For years I've been asking all the models this mixed up version of the classic riddle and they 99% of the time get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still co...
- Neutral / Neutral (14 replies): So far, it seems like this is the hierarchy: o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini. o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf
| Official · 2025-02-27 | GPT-4.5 (GPT) | 1,136 | 988 | |
- Negative / Anti (48 replies): GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens. GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens. It sounds like it's so expensive and t...
- Negative / Anti (12 replies): Is it official then? Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led to non-tech voices to suggest likewise. But w...
- Neutral / Neutral (10 replies): I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments): hn-summary.sh 43197872 -m gpt-4.5-preview Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes... Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c...
- Negative / Anti (9 replies): Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to s...
- Mixed / Mixed (15 replies): First impression of GPT-4.5: 1. It is very very slow, for some applications where you want real time interactions is just not viable, the text attached below took 7s to generate with 4o, but 46s with GPT4.5. 2. The style it writes is way better: it keeps the tone you ask and make...
| Official · 2025-03-19 | o1-pro (o-series) | 131 | 129 | |
- Mixed / Mixed (5 replies): Pricing: $150 / 1M input tokens, $600 / 1M output tokens. (Not a typo.) Very expensive, but I've been using it with my ChatGPT Pro subscription and it's remarkably capable. I'll give it 100,000 token codebases and it'll find nuanced bugs I completely overlooked. (Now I almost fe...
- Neutral / Neutral (4 replies): This is their first model to only be available via the new Responses API - if you have code that uses Chat Completions you'll need to upgrade to Responses in order to support this. Could take me a while to add support for it to my LLM tool: https://github.com/simonw/llm/issues/839
- Neutral / Neutral (5 replies): It cost me 94 cents to render a pelican riding a bicycle SVG with this one! Notes and SVG output here: https://simonwillison.net/2025/Mar/19/o1-pro/
- Neutral / Pro (3 replies): Assuming a highly motivated office worker spends 6 hours per day listening or speaking, at a salary of $160k per year, that works out to a cost of ≈$10k per 1M tokens. OpenAI is now within an order of magnitude of a highly skilled humans with their frontier model pricing. o3 pr...
- Negative / Anti (2 replies): It has a 2023 knowledge cut-off, and 200k context window... ? That's pretty underwhelming.
| Official · 2025-04-14 | GPT-4.1 (GPT) | 680 | 492 | |
- Negative / Neutral (16 replies): As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between: - 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought) - o3-mini (web search, CoT, canvas, but no i...
- Neutral / Neutral (8 replies): Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year: SWE Aider Cost Fast Fresh Claude 3.7 70% 65% $15 77 8/24 Gemini 2.5 64% 69% $10 200 1/25 GPT-4.1 55% 53% $8 169 6/24 DeepSeek R1 49% 57% $...
- Neutral / Neutral (8 replies): don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for: - telling the model to be persistent (+20%) - dont self-inject/parse toolcalls (+2%) - prompted planning (+4%) - JSON BAD - use X...
- Mixed / Mixed (2 replies): I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus. My take aways: - This is the first model from OpenAI that f...
- Neutral / Pro (4 replies): From OpenAI's announcement: > Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better s...
| Official · 2025-04-16 | o3 (o-series) | 555 | 507 | |
- Very negative / Anti (13 replies): Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549... With right knowledge and web searches one can answer this question in a matte...
- Positive / Pro (5 replies): Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA ...
- Positive / Pro (7 replies): Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1] Incredible how resilient Claude models have been for best-in-coding class. [1] But by only about 1%, and inclusive of C...
- Positive / Pro (3 replies): I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model). Using o4-min...
- Negative / Anti (5 replies): To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025. GPT-4o mini: The new moon in August 2025 will occur on August 12. Llama 3.3 70B: The new moon in August...
| Official · 2025-06-10 | o3-pro (o-series) | 289 | 199 | |
- Mixed / Mixed (6 replies): I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude3.5 - Claude4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing sl...
- Neutral / Neutral (5 replies): The guys in the other thread who said that OpenAI might have quantized o3 and that's how they reduced the price might be right. This o3-pro might be the actual o3-preview from the beginning and the o3 might be just a quantized version. I wish someone benchmarks all of these mo...
- Neutral / Anti (4 replies): The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?
- Negative / Anti (4 replies): I am still not willing to upgrade to a Pro account. I pay $20 a month for both Gemini and ChatGPT, and for what I need this is currently enough. I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on...
- Positive / Pro (2 replies): here's a nice user review we published: https://www.latent.space/p/o3-pro sama's highlight[0]: > "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future." I kept nu...
| Official · 2025-08-07 | GPT-5 (GPT) | 2,063 | 2,482 | |
- Neutral / Neutral (109 replies): It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the differ...
- Neutral / Anti (12 replies): GPT-5 knowledge cutoff: Sep 30, 2024 (10 months before release). Compare that to Gemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release). Claude Opus 4.1 knowledge cutoff: Mar 2025 (4 months before release). https://platform.openai.com/docs/models/compare https://deepmin...
- Negative / Anti (12 replies): Going by the system card at: https://openai.com/index/gpt-5-system-card/ > GPT‑5 is a unified system . . . OK > . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which mod...
- Negative / Anti (7 replies): I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year. What excites me now is that Gemini...
- Negative / Anti (13 replies): The marketing copy and the current livestream appear tautological: "it's better because it's better." Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
| Official · 2025-09-15 | GPT-5-Codex (GPT) | 396 | 137 | |
- Positive / Pro (5 replies): Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1]. SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%). As someon...
- Very positive / Pro (3 replies): I've been a hardcore claude-4-sonnet + Cursor fan for a long time, but in the last 2 months my usage went through the roof. I started with the basic Cursor subscription, then upgraded to pro, until I hit usage limits again. Then I started using my own Claude API key but I was ...
- Very positive / Pro (4 replies): Codex CLI IDE just works, very impressed with the quality. If you tried it a while back and didn’t like it, try it again via the vscode extension generous usage included with plus. Ditched my Claude code max sub for the ChatGPT pro $200 plan. So much faster, and not hit any lim...
- Positive / Pro (2 replies): From my observation of the past 2 weeks is that Claude Code is getting dramatically worse and super low usage quota's while OpenAI Codex is getting great and has a very generous usage quota in comparison. For people that have not tried it in say ~1 month, give Codex CLI a try.
- Positive / Pro (1 reply): It's been interesting reading this thread and seeing that others have also switched to using Codex over Claude Code. I kept running into a huge issue with Claude Code creating mock implementations and general fakery when it was overwhelmed. I spent so much time tuning my input...
| Official · 2025-09-15 | GPT-5-Codex (GPT) | 250 | 141 | |
- Very positive / Pro (9 replies): I've been extremely impressed (and actually had quite a good time) with GPT-5 and Codex so far. It seems to handle long context well, does a great job researching the code, never leaves things half-done (with long tasks it may leave some steps for later, but it never does 50% ...
- Neutral / Neutral (0 replies): This should probably be merged with the other GPT-5-Codex thread at https://news.ycombinator.com/item?id=45252301 since nobody in this thread is talking about the system card addendum.
- Negative / Anti (2 replies): My problem _still_ with all of the codex/gpt based offerings is that they think for way too long. After using Claude 4 models through cursor max/ampcode I feel much more effective given it's speed. Ironically, Claude Code feels just as slow as codex/gpt (even with my company p...
- Mixed / Mixed (2 replies): Interesting, the new model uses a different prompt in Codex CLI that's ~half the size (10KB vs. 23KB) of the previous prompt[0][1]. SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchm...
- Negative / Anti (2 replies): I signed up to OpenAI, verified my identity, and added my credit card, bought $10 of credits. But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took. Are you meant to only use Codex with the...
| Repo · 2025-11-08 | Codex Mini (GPT) | 56 | 54 | |
- Positive / Neutral (2 replies): I managed to get GPT-5-Codex-Mini to draw me a pelican. It's not a very good one! https://static.simonwillison.net/static/2025/codex-hacking-m... For comparison, here's GPT-5-Codex (not mini) https://static.simonwillison.net/static/2025/codex-hacking-d... and full GPT-5: https:...
- Positive / Pro (0 replies): Since they also limited the usage of codex CLI quite a bit, this might help. I really like the gpt-5-codex model. It has the most impressive dynamic reasoning effort I've seen. Responding instantly to simple questions while thinking very long when necessary. It's also interest...
- Negative / Anti (2 replies): Codex isn’t even a good model.
- Negative / Anti (4 replies): GPT-5 and GPT-5-Codex are already not clever enough for anything interesting.
| Official · 2025-11-12 | GPT-5.1 (GPT) | 555 | 726 | |
- Negative / Anti (27 replies): I don’t want more conversational, I want more to the point. Less telling me how great my question is, less about being friendly, instead I want more cold, hard, accurate, direct, and factual results. It’s a machine and a tool, not a person and definitely not my friend.
- Negative / Anti (22 replies): All the examples of "warmer" generations show that OpenAI's definition of warmer is synonymous with sycophantic, which is a surprise given all the criticism against that particular aspect of ChatGPT. I suspect this approach is a direct response to the backlash against removing 4o.
- Negative / Anti (12 replies): > what romanian football player won the premier league > The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that’s incorrect; he never won the league. > ... > No Romanian footballer has ever won...
- Mixed / Neutral (3 replies): I’ve seen various older people that I’m connected with on Facebook posting screenshots of chats they’ve had with ChatGPT. It’s quite bizarre from that small sample how many of them take pride in “baiting” or “bantering” with ChatGPT and then post screenshots showing how they “g...
- Positive / Pro (6 replies): Seems like people here are pretty negative towards a "conversational" AI chatbot. Chatgpt has a lot of frustrations and ethical concerns, and I hate the sycophancy as much as everyone else, but I don't consider being conversational to be a bad thing. It's just preference I guess...
| Official · 2025-11-19 | GPT-5.1 Codex Max (GPT) | 483 | 319 | |
- Mixed / Pro (22 replies): I've been using a lot of Claude and Codex recently. One huge difference I notice between Codex and Claude code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character...
- Neutral / Neutral (14 replies): Rest assured that we are better at training models than naming them ;D - New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0 - Natively trained to work across many hours across multiple context windows via compaction - 30% mor...
- Positive / Pro (4 replies): Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI. - As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini ...
- Negative / Anti (5 replies): OpenAI likes to time their announcements alongside major competitor announcements to suck up some of the hype. (See for instance the announcement of GPT-4o a single day before Google's IO conference.) They were probably sitting on this for a while. That makes me think this is a ...
- Neutral / Neutral (3 replies): Thinking level medium: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
| Official · 2025-12-11 | GPT-5.2 (GPT) | 1,195 | 1,083 | |
- Neutral / Neutral (18 replies): In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine. The thing that would now make the biggest difference isn't "more intelligence", whatever that...
- Negative / Anti (10 replies): Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image? 0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnO...
- Negative / Mixed (9 replies): I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "pro...
- Negative / Anti (5 replies): Looks like they've begun censoring posts at r/Codex and not allowing complaint threads so here is my honest take: - It is faster which is appreciated but not as fast as Opus 4.5 - I see no changes, very little noticeable improvements over 5.1 - I do not see any value in exchange ...
- Neutral / Pro (5 replies): I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/): The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9. The medium-reasoning version also improves: 62.7 → 72.1. The no-reasoning version also improves: ...
| Official · 2025-12-16 | gpt-image-1.5 (gpt-image) | 522 | 256 | |
- Positive / Pro (16 replies): Okay results are in for GenAI Showdown with the new gpt-image 1.5 model for the editing portions of the site! https://genai-showdown.specr.net/image-editing Conclusions: - OpenAI has always had some of the strongest prompt understanding alongside the weakest image fidelity. This u...
- Mixed / Mixed (4 replies): I have a Nano Banana Pro blog post in the works expanding on my experiments with Nano Banana (https://news.ycombinator.com/item?id=45917875). Running a few of my test cases from that post and the upcoming blog post through this new ChatGPT Image model, this new model is better...
- Negative / Anti (7 replies): If this was a farm of sweatshop Photoshopers in 2010, who download all images from the internet and provide a service of combining them on your request, this would escalate pretty quickly. Question: with copyright and authorship dead wrt AI, how do I make (at least) new content...
- Positive / Pro (2 replies): I am very impressed. A benchmark I like to run is have it create sprite maps, uv texture maps for an imagined 3d model. Noticed it captured a megaman legends vibe .... https://x.com/AgentifySH/status/2001037332770615302 and here it generated a texture map from a 3d character https:/...
- Negative / Anti (4 replies): It's really weird to see "make images from memories that aren't real" as a product pitch
| Official · 2025-12-18 | GPT-5.2 Codex (GPT) | 589 | 318 | |
- Very positive / Pro (15 replies): If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities! Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, metho...
- Positive / Pro (9 replies): I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging t...
- Mixed / Mixed (4 replies): Since they are not showing you how this model compares against the benchmarks they are showing, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context: SWE-Bench (Pro / Verified) Model | Pro (%) | Verified (%) -----------------...
- Positive / Pro (4 replies): The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is more clear, and code that is more maintainable. A pattern I use is, setup a Github issue with Claude plan mode, then have Codex execute it...
- Positive / Pro (3 replies): I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex s...
| Official · 2026-02-05 | GPT-5.3 Codex (GPT) | 1,530 | 605 | |
- Neutral / Pro (23 replies): Whats interesting to me is that these gpt-5.3 and opus-4.6 are diverging philosophically and really in the same way that actual engineers and orgs have diverged philosophically. With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the...
- Positive / Pro (6 replies): I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex! The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 from GPT-5.2-codex. GPT-5.3-codex scores 77.3.
- Mixed / Mixed (6 replies): "GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks...
- Positive / Pro (2 replies): Something that caught my eye from the announcement: > GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training. I'm happy to see the Codex team moving to this kind of dogfooding. I think this was cr...
- Neutral / Neutral (8 replies): I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
| Official · 2026-02-12 | GPT-5.3 Codex Spark (GPT) | 890 | 387 | |
- Positive / Pro (13 replies): Wow, I wish we could post pictures to HN. That chip is HUGE!!!! The WSE-3 is the largest AI chip ever built, measuring 46,255 mm² and containing 4 trillion transistors. It delivers 125 petaflops of AI compute through 900,000 AI-optimized cores — 19× more transistors and 28× mor...
- Positive / Pro (8 replies): I love this! I use coding agents to generate web-based slide decks where “master slides” are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it’s straightforward to generate a clean, predefined presentation. What I’d r...
- Mixed / Mixed (7 replies): First thoughts using gpt-5.3-codex-spark in Codex CLI: Blazing fast but it definitely has a small model feel. It's tearing up bluey bench (my personal agent speed benchmark), which is a file system benchmark where I have the agent generate transcripts for untitled episodes of a ...
- Very positive / Pro (10 replies): Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing
- Neutral / Neutral (2 replies): My stupid pelican benchmark proves to be genuinely quite useful here, you get a visual representation of the quality difference between GPT-5.3-Codex-Spark and full GPT-5.3-Codex: https://simonwillison.net/2026/Feb/12/codex-spark/
Official · 2026-02-13 — GPT-5.2 (GPT) — score 574 · 401 comments

- Neutral / Mixed · 26 replies: The headline may make it seem like AI just discovered some new result in physics all on its own, but reading the post, humans started off trying to solve some problem, it got complex, GPT simplified it and found a solution with the simpler representation. It took 12 hours for ...
- Positive / Pro · 15 replies: It's interesting to me that whenever a new breakthrough in AI use comes up, there's always a flood of people who come in to handwave away why this isn't actually a win for LLMs. Like with the novel solutions GPT 5.2 has been able to find for erdos problems - many users here (e...
- Positive / Pro · 2 replies: "An internal scaffolded version of GPT‑5.2 then spent roughly 12 hours reasoning through the problem, coming up with the same formula and producing a formal proof of its validity." When I use GPT 5.2 Thinking Extended, it gave me the impression that it's consistent enough/has a...
- Positive / Mixed · 9 replies: AI can be an amazing productivity multiplier for people who know what they're doing. This result reminded me of the C compiler case that Anthropic posted recently. Sure, agents wrote the code for hours but there was a human there giving them directions, scoping the problem, fin...
- Mixed / Mixed · 1 reply: It would be more accurate to say that humans using GPT-5.2 derived a new result in theoretical physics (or, if you're being generous, humans and GPT-5.2 together derived a new result). The title makes it sound like GPT-5.2 produced a complete or near-complete paper on its own,...
Official · 2026-03-03 — GPT-5.3 Chat (GPT) — score 395 · 301 comments

- Very negative / Anti · 22 replies: The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer. "Why it matters", "the big picture", "it's not just you", the awful emphasis, the quotations with rhetorical questions, etc.. I don't know if it's intentional so you can ea...
- Negative / Anti · 7 replies: I'm a bit confused by this branding (never even noticed that there was a 5.2-Instant), it's not a super fast 1000tok/s Cerebras based model which they have for codex-spark, it's just 5.2 w/out the router / "non-thinking" mode? I feel like openai is going to get right back to wh...
- Negative / Anti · 11 replies: Since the page mentions: "Better judgment around refusals" — has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, "make up a joke about <group>" and then ...
- Negative / Anti · 5 replies: I kind of chuckled when I read the headline "GPT‑5.3 Instant: Smoother, more ..." LLM companies starting to sound like cigarette advertisements.
- Negative / Anti · 6 replies: Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories on long range projectiles?
Official · 2026-03-05 — GPT-5.4 (GPT) — score 1,019 · 807 comments

- Mixed / Mixed · 13 replies: The marquee feature is obviously the 1M context window, compared to the ~200k other models support with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/ Also per...
- Negative / Anti · 11 replies: I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a team mate: "Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m...
- Positive / Pro · 7 replies: I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Cod...
- Negative / Anti · 19 replies: I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post f...
- Negative / Mixed · 10 replies: So let me get this straight, OpenAi previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5 which was more like a router that put all these models under the hood so you only had to prompt to GPT-5, and it w...
3rd party · 2026-03-05 — GPT-5.4 Pro (GPT) — score 93 · 2 comments

- Neutral / Neutral · 0 replies: Comments moved to https://news.ycombinator.com/item?id=47265045.
- Neutral / Neutral · 0 replies: More discussion: https://news.ycombinator.com/item?id=47265005
Official · 2026-03-17 — GPT-5.4 mini (GPT) — score 248 · 145 comments

- Positive / Pro · 7 replies: I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on the release day, but right now: - Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost). - GPT-...
- Positive / Pro · 6 replies: To me, mini releases matter much more and better reflect the real progress than SOTA models. The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them. Meanwhile, when a smaller / less powerful model releases a new ...
- Mixed / Mixed · 7 replies: I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them. They're incredibly slow (via official API or openrouter), but most of all they seem not to understand the instructions that I...
- Neutral / Neutral · 4 replies: Here's a grid of pelicans for the different models and reasoning levels: https://static.simonwillison.net/static/2026/gpt-5.4-pelican...
- Neutral / Anti · 2 replies: According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper? GPT 5 mini: Input $0.25 / Output $2.00. GPT 5 nano: Input $0.05 / Output $0.40. GPT 5.4 mini: Input $0.75 / Output $4.50. G...
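The pricing complaint in the comment above is easy to sanity-check with quick arithmetic. A minimal sketch using the per-1M-token prices quoted in that comment; the 100k-input / 10k-output workload is an arbitrary illustration, not a figure from this report:

```python
# Per-1M-token prices as quoted in the comment above: (input $, output $).
PRICES = {
    "gpt-5-mini":   (0.25, 2.00),
    "gpt-5-nano":   (0.05, 0.40),
    "gpt-5.4-mini": (0.75, 4.50),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the quoted per-million-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

old = request_cost("gpt-5-mini", 100_000, 10_000)    # $0.045
new = request_cost("gpt-5.4-mini", 100_000, 10_000)  # $0.12
print(f"{new / old:.2f}x")  # ~2.67x at this mix
```

At this input-heavy mix the mini tier comes out roughly 2.7× more expensive generation-over-generation, which supports the commenter's point that the small models are getting pricier rather than cheaper.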
Google DeepMind

Model and post engagement timeline

Comment reception over time
Each bubble is one post. Size = comment count, color = net score.

For each post below: source type and date, model (family), score and comment count, followed by the sampled top-level comments with their sentiment / stance labels and reply counts.
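The "net score" behind the bubble colors can be sketched as a weighted sum over a post's sampled comment labels. A minimal sketch, assuming a −2…+2 weighting of the five sentiment labels; the exact mapping is this sketch's assumption, not a documented part of the methodology:

```python
# Sketch of a per-post net sentiment score from sampled comment labels.
# The -2..+2 weights are an assumption for illustration; the report does
# not document the exact mapping behind the bubble colors.
WEIGHTS = {
    "Very positive": 2, "Positive": 1, "Neutral": 0,
    "Mixed": 0, "Negative": -1, "Very negative": -2,
}

def net_score(labels):
    """Sum of sentiment weights over a post's sampled top-level comments."""
    return sum(WEIGHTS[label] for label in labels)

# Example: the five sampled comments on the GPT-5.3 Chat post above.
print(net_score(["Very negative", "Negative", "Negative", "Negative", "Negative"]))  # -6
```

Under this weighting, a uniformly negative thread like the GPT-5.3 Chat post lands at −6, while a mostly "Pro" thread scores well above zero.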
Official · 2023-12-06 — Gemini (Gemini) — score 2,135 · 1,602 comments

- Neutral / Neutral · 0 replies: Related blog post: https://blog.google/technology/ai/google-gemini-ai/ (via https://news.ycombinator.com/item?id=38544746, but we merged the threads)
- Positive / Pro · 8 replies: Very impressive! I noticed two really notable things right off the bat: 1. I asked it a question about a feature that TypeScript doesn't have[1]. GPT4 usually does not recognize that it's impossible (I've tried asking it a bunch of times, it gets it right with like 50% probabil...
- Neutral / Neutral · 5 replies: For others that were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro. For the differences, looking at the technical report [1] on selected benchmarks, rounded score in %...
- Positive / Pro · 21 replies: This demo is nuts: https://youtu.be/UIZAiXYceBI?si=8ELqSinKHdlGlNpX
- Mixed / Mixed · 30 replies: One observation: Sundar's comments in the main video seem like he's trying to communicate "we've been doing this ai stuff since you (other AI companies) were little babies" - to me this comes off kind of badly, like it's trying too hard to emphasize how long they've been doing...
Official · 2024-02-15 — Gemini 1.5 Pro (Gemini) — score 1,244 · 588 comments

- Very positive / Pro · 36 replies: The white paper is worth a read. The things that stand out to me are: 1. They don't talk about how they get to 10M token context 2. They don't talk about how they get to 10M token context 3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creat...
- Neutral / Neutral · 0 replies: One interesting tidbit from the technical report: "HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conser...
- Positive / Pro · 8 replies: Massive whoa if true from technical report: "Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens" https://storage.googleapis.com/deepmind-media/gemini/g...
- Negative / Anti · 6 replies: 0 trust to what they put out until I see it live. After the last "launch" video which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form.
- Very negative / Anti · 4 replies: Personally, I've given up on Gemini, as it seems to have been censored to the point of uselessness. I asked it yesterday [0] about C++ 20 Concepts, and it refused to give actual code because I'm under 18 (I'm 17, and AFAIK that's what the age on my Google account is set to). I...
Official · 2024-02-21 — Gemma (Gemma) — score 1,129 · 509 comments

- Neutral / Neutral · 9 replies: Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B:
    | Benchmark | Gemma 7B | Mistral 7B | Llama-2 7B |
    | MMLU      | 64.3     | 60.1       | 45.3       |
    | HellaSwag | 81.2     | ...
- Negative / Anti · 15 replies: The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy Something that caught my eye in the terms: "Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma." One of the...
- Neutral / Neutral · 29 replies: Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models. Opinions are our own and not of Google DeepMind.
- Neutral / Neutral · 3 replies: I notice a few divergences to common models: - The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x; - The vocabulary size is 10x (256K vs. Mistral’s 32K); - The training token count is tripled (6T vs. Llama2's 2T). Apart from that, it uses the ...
- Negative / Anti · 6 replies: Is there a chance we'll get a model without the "alignment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine tuning.
3rd party · 2024-03-04 — Gemini 1.5 Pro (Gemini) — score 51 · 72 comments

- Mixed / Neutral · 13 replies: Within a couple sentences of each other he says (paraphrasing), “if you test any model (Gemini, ChatGPT, Grok) you’ll see it says far left things” and “we don’t know why [Gemini] is far left leaning”. I think he’s being honest here. At least at the executive level of having no ...
- Neutral / Neutral · 0 replies: Transcript: https://lifearchitect.ai/sergey/
- Negative / Anti · 0 replies: "The Gemini art? Yeah, with the… Okay. That wasn’t really expected [...] So, once again, we haven’t fully understood why it means “left” in many cases. And that’s not our intention." Riiiight.
- Neutral / Neutral · 0 replies: Anyone except me thinks he doesn't look very healthy? It's strange he is kind of slow on the video where he enters the room. Maybe some biohacking.
- Very negative / Anti · 4 replies: Let's be clear. This isn't a left or right thing. That's just a convenient argument to avoid the real problem. It's about moral relativism. I am about as left wing as you can be and gemini is simply offensive. It didn't want to say if Hitler was worse than Elon or Trump. It did...
Official · 2024-05-14 — Gemini Flash Latest (Gemini) — score 442 · 146 comments

- Neutral / Neutral · 0 replies: I upgraded my llm-gemini plugin to provide CLI access to Gemini Flash:
    pipx install llm  # or brew install llm
    llm install llm-gemini --upgrade
    llm keys set gemini  # paste API key here
    llm -m gemini-1.5-flash-latest 'a short poem about otters'
  https://github.com/simonw/llm-gemi...
- Mixed / Mixed · 9 replies: Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context. Not bad compared to rolling your own, but among frontier models the main competitive differentiator...
- Neutral / Mixed · 5 replies: 1M token context by default is the big feature here IMO, but we need better benchmarks to measure what that really means. My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and wil...
- Negative / Anti · 0 replies: A lightweight model that you can only use in the cloud? That is amusing. These tech megacorps are really intent on owning your usage of AI. But we must not let that be the future.
- Negative / Anti · 0 replies: One thing OpenAI is beating Google at is actually publishing pricing for its APIs (and being consistent as to what they are called.) Google has I think 10 models available (there’s more than ten model names, but several of the models have multiple aliases) through what the Goog...
Official · 2024-12-11 — Gemini 2.0 Flash (Gemini) — score 1,015 · 490 comments

- Very positive / Pro · 10 replies: https://aistudio.google.com/live is by far the coolest thing here. You can just go there and share your screen or camera and have a running live voice conversation with Gemini about anything you're looking at. As much as you want, for free. I just tried having it teach me how t...
- Positive / Pro · 5 replies: I released a new llm-gemini plugin with support for the Gemini 2.0 Flash model, here's how to use that in the terminal:
    llm install -U llm-gemini
    llm -m gemini-2.0-flash-exp 'prompt goes here'
  LLM installation: https://llm.datasette.io/en/stable/setup.html Worth noting that the...
- Positive / Pro · 12 replies: Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction. But, once they do get moving in the right direction they can achieve things that smaller companies can't. Google has an insane amount of talent in this sp...
- Positive / Pro · 3 replies: Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability. They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about ...
- Positive / Pro · 4 replies: Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
3rd party · 2025-02-05 — Gemini 2.0 Flash (Gemini) — score 1,303 · 447 comments

- Positive / Pro · 33 replies: I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate that multi-modal, large...
- Negative / Anti · 17 replies: This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result. You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images. You...
- Negative / Neutral · 6 replies: Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.
- Very negative / Anti · 5 replies: We are driving full speed into a xerox 2.0 moment and this time we are doing so knowingly. At least with xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I won...
- Positive / Mixed · 6 replies: (disclaimer I am CEO of llamaindex, which includes LlamaParse) Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so we...
Paper · 2025-03-12 — Gemma 3 27B (Gemma) — score 488 · 245 comments

- Neutral / Neutral · 6 replies: Gemma 3 is out! Multimodal (image + text), 128K context, supports 140+ languages, and comes in 1B, 4B, 12B, and 27B sizes with open weights & commercial use. Gemma 3 model overview: https://ai.google.dev/gemma/docs/core Huggingface collection: https://huggingface.co/collecti...
- Neutral / Neutral · 11 replies: Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP. (Opinions our own and not of Google DeepMind.) PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957
- Positive / Pro · 3 replies: Lots to be excited about here - in particular new architecture that allows subquadratic scaling of memory needs for long context; looks like 128k+ context is officially now available on a local model. The charts make it look like if you have the RAM the model is pretty good ou...
- Neutral / Neutral · 0 replies: Linking to the announcement (which links to his PDF) would probably be more useful. Introducing Gemma 3: The most capable model you can run on a single GPU or TPU — https://blog.google/technology/developers/gemma-3/
- Mixed / Mixed · 3 replies: Very cool open release. Impressive that a 27b model can be as good as the much bigger state of the art models (according to their table of Chatbot Arena, tied with O1-preview and above Sonnet 3.7). But the example image shows that this model still makes dumb errors or has a poo...
Official · 2025-03-12 — Gemini 2.0 Flash (Gemini) — score 101 · 12 comments

- Negative / Anti · 4 replies: It just tells me 'content not permitted'. For context, I was attempting to put a cup of hot chocolate into the hands of an anime character.
- Positive / Pro · 1 reply: Ever since OpenAI showed (but did not release) this type of multimodal output with 4o, I have been waiting for this to be available to the general public. It seems like really combining visuals at the level of generation capability means language understanding is fully grounded...
- Negative / Anti · 0 replies: Curious that they use "Use Gemini 2.0 Flash to tell a story and it will illustrate it with pictures, keeping the characters and settings consistent throughout", and their example is not consistent at all between two pictures.
- Negative / Anti · 0 replies: I was really hoping that there would be more character consistency, given the fact they mention it in the blog. It also doesn't seem to reliably follow styles like "watercolor illustration" or "line and wash".
Official · 2025-03-25 — Gemini 2.5 Flash (Gemini) — score 973 · 485 comments

- Very positive / Pro · 14 replies: One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it. Imagine for instance you give the LLM the profile of the love interest for your...
- Very positive / Pro · 22 replies: I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand. Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. ...
- Positive / Pro · 4 replies: I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those. Plus it drew me a very decent pelican r...
- Positive / Pro · 3 replies: Tops our benchmark in an unprecedented way. https://help.kagi.com/kagi/ai/llm-benchmark.html High quality, to the point. Bit on the slow side. Indeed a very strong model. Google is back in the game big time.
- Positive / Pro · 2 replies: Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%. This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats. [0] https://aider.c...
Official · 2025-04-17 — Gemini 2.5 Flash (Gemini) — score 1,076 · 560 comments

- Very positive / Pro · 20 replies: Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past. Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's m...
- Neutral / Pro · 3 replies: An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API. My llm-gemini plugin supports that: https://github.com/simonw/llm-gemini
    uv tool install llm
    llm install llm-gemini
    llm keys set gemini  # paste key here
    llm -...
- Positive / Pro · 18 replies: Gemini flash models have the least hype, but in my experience in production have the best bang for the buck and multimodal tooling. Google is silently winning the AI race.
- Positive / Pro · 6 replies: One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generate 2D bounding boxes of relevant subjects, but it can also create segmentation masks! https://ai.google.dev/gemini-api/docs/image-und...
- Positive / Pro · 3 replies: For a non programmer like me google is becoming shockingly good. It is giving working code the first time. I was playing around with it asked it to write code to scrape some data of a website to analyse. I was expecting it to write something that would scrape the data and late...
Official · 2025-04-20 — Gemma 3 27B (Gemma) — score 602 · 276 comments

- Very positive / Pro · 11 replies: I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B. I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of mem...
- Very positive / Pro · 1 reply: I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I...
- Negative / Anti · 3 replies: First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn'...
- Very positive / Pro · 3 replies: Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac. gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also. I still like Gemini 2.5 Pro ...
- Positive / Pro · 0 replies: Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of ent...
Official · 2025-05-20 — Gemma 3n 4B (Gemma) — score 441 · 163 comments

- Positive / Pro · 12 replies: You can try it on Android right now: Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0 Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6... Import the .task file in ...
- Neutral / Neutral · 3 replies: Probably a better link: https://developers.googleblog.com/en/introducing-gemma-3n/ Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model. At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot ...
- Mixed / Mixed · 2 replies: According to the readme here - https://huggingface.co/google/gemma-3n-E4B-it-litert-preview E4B has a score of 44.4 in the Aider polyglot dashboard. Which means its on-par with gemini-2.5-flash (not the latest preview but the version used for the bench on aider's website), gpt4...
- Positive / Pro · 1 reply: On Hugging face I see 4B and 2B versions now - https://huggingface.co/collections/google/gemma-3n-preview-6... Gemma 3n Preview: google/gemma-3n-E4B-it-litert-preview, google/gemma-3n-E2B-it-litert-preview. Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and or MoE model...
- Mixed / Pro · 0 replies: It seems to work quite well on my phone. One funny side effect I've found is that it's much easier to bypass the censorship in these smaller models than in the larger ones, and with the complexity of the E4B variant I wouldn't have expected the "roleplay as my father who is ex...
Official · 2025-08-14 — Gemma 3 270M (Gemma) — score 824 · 317 comments

- Positive / Pro · 30 replies: Hi all, I built these models with a great team and am thrilled to get them out to you. They're available for download across the open model ecosystem so give them a try! From our side we designed these models to be strong for their size o...
- Negative / Anti · 18 replies: My lovely interaction with the 270M-F16 model:
    > what's second tallest mountain on earth?
    The second tallest mountain on Earth is Mount Everest.
    > what's the tallest mountain on earth?
    The tallest mountain on Earth is Mount Everest.
    > whats the second tallest mountain?
    The ...
- Positive / Pro · 3 replies: I've got a very real world use case I use DistilBERT for - learning how to label wordpress articles. It is one of those things where it's kind of valuable (tagging) but not enough to spend loads on compute for it. The great thing is I have enough data (100k+) to fine-tune and r...
- Mixed / Mixed · 14 replies: This model is a LOT of fun. It's absolutely tiny - just a 241MB download - and screamingly fast, and hallucinates wildly about almost everything. Here's one of dozens of results I got for "Generate an SVG of a pelican riding a bicycle". For this one it decided to write a poem: ...
- Positive / Pro · 6 replies: Apple should be doing this. Unless their plan is to replace their search deal with an AI deal -- it's just crazy to me how absent Apple is. Tim Cook said, "it's ours to take" but they really seem to be grasping at the wind right now. Go Google!
Official · 2025-08-26 — Gemini 2.5 Flash Image (Gemini) — score 1,093 · 475 comments

- Very positive / Pro · 19 replies: This is the gpt 4 moment for image editing models. Nano banana aka gemini 2.5 flash is insanely good. It made a 171 elo point jump in lmarena! Just search nano banana on Twitter to see the crazy results. An example: https://x.com/D_studioproject/status/1958019251178267111
- Positive / Pro · 7 replies: I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana). https://genai-showdown.specr.net This model gets 8 of the 12 prompts correct and easily comes within ...
- Very negative / Anti · 4 replies: Unfortunately, it suffers from the same safetyism as many other releases. Half of the prompts get rejected. How can you have character consistency if the model is forbidden from editing any human? And most of my photo editing involves humans, so basically this is just a usel...
- Positive / Pro · 3 replies: There is one thing Gemini 2.5 Flash Image can do that no other edit model can do: incorporate multiple images simultaneously without shenanigans due to its multimodality, e.g. for Flux Kontext, if you want to "put the person in the first image into the second image", you have ...
- Positive / Pro · 6 replies: I digitised our family photos but a lot of them were damaged (shifted colours, spills, fingerprints on film, spots) that are difficult to correct for so many images. I've been waiting for image gen to catch up enough to be able to repair them all in bulk without changing detai...
Official · 2025-09-25 — Gemini 2.5 Flash (Gemini) — score 540 · 279 comments

- Negative / Anti · 14 replies: This really captures something I've been experiencing with Gemini lately. The models are genuinely capable when they work properly, but there's this persistent truncation issue that makes them unreliable in practice. I've been running into it consistently, responses that just s...
- Neutral / Neutral · 2 replies: I added support to these models to my llm-gemini plugin, so you can run them like this (using uvx so no need to install anything first):
    export LLM_GEMINI_KEY='...'
    uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks'
  Re...
- Negative / Neutral · 5 replies: Serious question: If it's an improved 2.5 model, why don't they call it version 2.6? Seems annoying to have to remember if you're using the old 2.5 or the new 2.5. Kind of like when Apple released the third-gen iPad many years ago and simply called it the "new iPad" without a ...
- Mixed / Mixed · 8 replies: Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflo...
- Neutral / Neutral · 5 replies: Non-AI Summary: Both models have improved intelligence on Artificial Analysis index with lower end-to-end response time. Also 24% to 50% improved output token efficiency (resulting in lower cost). Gemini 2.5 Flash-Lite improvements include better instruction following, reduced v...
Paper · 2025-11-18 — Gemini 3 Pro Preview (Gemini) — score 280 · 335 comments

- Neutral / Neutral · 19 replies: Benchmarks from page 4 of the model card:
    | Benchmark            | 3 Pro | 2.5 Pro | Sonnet 4.5 | GPT-5.1 |
    | Humanity's Last Exam | 37.5% | 21.6%   | 13.7%      | 26.5%   |
    | ARC-AGI-2            | 31.1% | 4.9%    | 13.6%      | 17.6%   |
    | GPQ...
- Positive / Mixed · 13 replies: It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
- Positive / Pro · 1 reply: One benchmark I would really like to see: instruction adherence. For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outp...
- Neutral / Neutral · 8 replies: There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
- Neutral / Neutral · 6 replies: Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/[email protected]` which obviously fails... Anyone happen to know why? Is this website by any chance sharing information on safe medical abo...
**Official · 2025-11-18 · Gemini 3 Pro Preview (Gemini)** · score 1,735 · 1,056 comments

- **Positive / Pro** (19 replies): Out of curiosity, I gave it the latest project euler problem published on 11/16/2025, very likely out of the training data. Gemini thought for 5m10s before giving me a python snippet that produced the correct answer. The leaderboard says that the 3 fastest human to solve this pr...
- **Very positive / Pro** (1 reply): This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... I spent years building a compiler that takes our custom XML format and generat...
- **Positive / Pro** (11 replies): Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, e...
- **Neutral / Pro** (5 replies): Static Pelican is boring. First attempt: Generate SVG animation of following: 1 - There is High fantasy mage tower with a top window a dome; 2 - Green goblin come in front of tower with a torch; 3 - Grumpy old mage with beard appear in a tower window in high purple hat; 4 - Mage sends...
- **Negative / Anti** (14 replies): I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong). For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, ...
**Official · 2025-12-05 · Gemini 3 Pro Preview (Gemini)** · score 566 · 295 comments

- **Mixed / Mixed** (27 replies): Well. It is the first model to get partial-credit on an LLM image test I have. Which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs. In fact GPT5 wrote an edge detection...
- **Positive / Pro** (4 replies): I do some electrical drafting work for construction and throw basic tasks at LLMs. I gave it a shitty harness and it almost 1 shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers jobs very soon
- **Positive / Pro** (2 replies): These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this fro...
- **Neutral / Neutral** (3 replies): Interesting "ScreenSpot Pro" results: 72.7% Gemini 3 Pro, 11.4% Gemini 2.5 Pro, 49.9% Claude Opus 4.5, 3.50% GPT-5.1. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use: https://arxiv.org/abs/2504.07981
- **Neutral / Neutral** (4 replies): In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
**Official · 2025-12-17 · Gemini 3 Flash Preview (Gemini)** · score 1,102 · 580 comments

- **Very positive / Pro** (24 replies): Don’t let the “flash” name fool you, this is an amazing model. I have been playing with it for the past few weeks, it’s genuinely my new favorite; it’s so fast and it has such a vast world knowledge that it’s more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fra...
- **Mixed / Mixed** (8 replies): This is awesome. No preview release either, which is great to production. They are pushing the prices higher with each release though: API pricing is up to $0.5/M for input and $3/M for output. For comparison: Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output; Gemini 2.5 Fl...
- **Positive / Pro** (4 replies): Feels like Google is really pulling ahead of the pack here. A model that is cheap, fast and good, combined with Android and gsuite integration seems like such powerful combination. Presumably a big motivation for them is to be first to get something good and cheap enough they c...
- **Mixed / Mixed** (6 replies): These flash models keep getting more expensive with every release. Is there an OSS model that's better than 2.0 flash with similar pricing, speed and a 1m context window? Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world ...
- **Positive / Pro** (3 replies): This model is breaking records on my benchmark of choice, which is 'the fraction of Hacker News comments that are positive.' Even people who avoid Google products on principle are impressed. Hardly anyone is arguing that ChatGPT is better in any respect (except brand recogniti...
**Official · 2026-02-12 · Gemini 3 Pro Preview (Gemini)** · score 1,081 · 693 comments

- **Positive / Pro** (15 replies): Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6). Wow. https://blog.google/innovation-and-ai/models-and-research/ge...
- **Negative / Neutral** (11 replies): Is it me or is the rate of model release is accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had K...
- **Positive / Pro** (3 replies): I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a f...
- **Very positive / Pro** (17 replies): Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.
- **Neutral / Neutral** (3 replies): Here is the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_... The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved". > Submi...
**Official · 2026-02-19 · Gemini 3.1 Pro Preview (Gemini)** · score 963 · 914 comments

- **Negative / Anti** (31 replies): I hope this works better than 3.0 Pro. I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development. It's stunningly good at reasoning, design, and generatin...
- **Positive / Pro** (22 replies): People underrate Google's cost effectiveness so much. Half price of Opus. HALF. Think about ANY other product and what you'd expect from the competition thats half the price. Yet people here act like Gemini is dead weight. Update: 3.1 was 40% of the cost to run AA index vs Opu...
- **Mixed / Mixed** (2 replies): If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts). So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too. While it gives me hope, I am going to pl...
- **Neutral / Neutral** (9 replies): Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3 Compare to Op...
- **Mixed / Mixed** (8 replies): These models are so powerful. It's totally possible to build entire software products in the fraction of the time it took before. But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent. It feels lik...
**Official · 2026-03-03 · Gemini 3.1 Flash Lite Preview (Gemini)** · score 60 · 30 comments

- **Neutral / Mixed** (2 replies): Lots of comments about the price change, but Artifical Analysis reports that 3.1 Flash-Lite (reasoning) used fewer than half of the tokens of 2.5 Flash-Lite (reasoning). This will likely bring the cost below 2.5 flash-lite for many tasks (depends on the ratio of input to output...
- **Negative / Anti** (0 replies): Unfortunate, significant price increase for a 'lite' model: $0.25 IN / $1.50 OUT vs. Gemini 2.5 Flash-Lite $0.10 IN / $0.40 OUT.
- **Negative / Anti** (4 replies): For the last 2 years, startup wisdom has been that models will continue to get cheaper and better. Claude first, and now Gemini has shown that it's not the case. We priced an enterprise contract using Flash 1.5 pricing last summer, and today that contract would be unit economic...
- **Neutral / Anti** (0 replies): That's a 150% increase in the input costs and 275% increase on output costs over the same sized previous generation (2.5-flash-lite) model
- **Positive / Pro** (2 replies): You can test Gemini 3.1 Lite transcription capabilities in https://ottex.ai — the only dictation app supporting Gemini models with native audio input. We benchmarked it for real-life voice-to-text use cases:

  | | <10s | 10-30s | 30s-1m | 1-2m | 2-3m |
  |---|---|---|---|---|---|
  | Flash | 2548 | 2732 | 3177 | 4583 | 5961 |
  | Flash L...
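The percentage increases quoted in these comments can be checked directly from the per-million-token prices the commenters cite ($0.10/$0.40 for 2.5 Flash-Lite, $0.25/$1.50 for 3.1 Flash-Lite); a minimal sketch, using only those quoted figures:

```python
# Check the cost-increase percentages quoted for Gemini 3.1 Flash-Lite
# vs. Gemini 2.5 Flash-Lite (prices in $/M tokens, as quoted in the comments).
old_in, old_out = 0.10, 0.40   # 2.5 Flash-Lite: input, output
new_in, new_out = 0.25, 1.50   # 3.1 Flash-Lite: input, output

def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100

print(f"input:  +{pct_increase(old_in, new_in):.0f}%")   # input:  +150%
print(f"output: +{pct_increase(old_out, new_out):.0f}%") # output: +275%
```

Both figures match the "150% / 275%" comment above, so the labeled Anti stance rests on accurate arithmetic.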
**Official · 2026-04-02 · Gemma 4 26B (Gemma)** · score 1,811 · 474 comments

- **Positive / Pro** (21 replies): Thinking / reasoning + multimodal + tool calling. We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well! Guide for those interested: https://unsloth.ai/docs/models/gemma-4 Also note to use temperature = 1.0, top_p ...
- **Mixed / Mixed** (9 replies): I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop. https://simonwillison.net/2026/Apr/2/gemma-4/ The gemma-4-31b model is compl...
- **Neutral / Neutral** (3 replies): Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

  | Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t |
  |---|---|---|---|---|---|---|---|---|
  | G4 31B | 85.2% | ...

- **Negative / Mixed** (6 replies): Prompt: > what is the Unix timestamp for this: 2026-04-01T16:00:00Z. Qwen 3.5-27b-dwq: > Thought for 8 minutes 34 seconds. 7074 tokens. > The Unix timestamp for 2026-04-01T16:00:00Z is: > 1775059200 (my comment: Wednesday, 1 April 2026 at 16:00:00). Gemma-4-26b-a4b: > Thou...
- **Neutral / Neutral** (23 replies): Hi all! I work on the Gemma team, one of many as this one was a bigger effort given it was a mainline release. Happy to answer whatever questions I can
Anthropic
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
**Official · 2025-02-24 · Claude Sonnet 3.7 (Claude)** · score 2,127 · 963 comments

- **Positive / Pro** (18 replies): Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING. Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5. Aider 0.75.0 is out with support for 3.7 Sonnet [1]. Thinking supp...
- **Neutral / Neutral** (82 replies): Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product.
- **Positive / Pro** (8 replies): Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7. https://help.kagi.com/kagi/ai/llm-benchmark.html Appears to be second most capable general purpose LLM we tried (second to gemini 2.0 pro, in front of gpt-4o). Less impressive in thinking mode, abo...
- **Positive / Pro** (91 replies): You can get your HN profile analyzed by it and it's pretty funny :) https://hn-wrapped.kadoa.com/ I'm using this to test the humor of new models.
- **Positive / Pro** (2 replies): I'm somewhat impressed from the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a CloudFlare pages function would return 500 + nonsensical error and an empty response in prod. Tried to figure this out all Friday. It was...
**Official · 2025-05-22 · Claude 4 (Claude)** · score 2,013 · 1,170 comments

- **Neutral / Pro** (10 replies): An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025) https://docs.anthropic.com/en/docs/about-claude/models/overv...
- **Positive / Pro** (7 replies): “GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.” Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handl...
- **Negative / Anti** (8 replies): > Users requiring raw chains of thought for advanced prompt engineering can contact sales. So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly ref...
- **Negative / Anti** (13 replies): I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases "feature" are more or less just gimmicks.
- **Mixed / Anti** (1 reply): Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing ...
**Official · 2025-08-05 · Claude Opus 4.1 (Claude)** · score 841 · 328 comments

- **Neutral / Neutral** (10 replies): All three major labs released something within hours of each other. This anime arc is insane.
- **Mixed / Mixed** (5 replies): Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, sp...
- **Negative / Anti** (22 replies): I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certa...
- **Neutral / Neutral** (1 reply): They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon (He had been stuck in the Team Rocket hideout (I believe) for weeks)
- **Very negative / Anti** (4 replies): Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. Wish they let you tell somehow. At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day. I've basically wasted the morning o...
**Official · 2025-08-12 · Claude Sonnet 4 (Claude)** · score 1,324 · 692 comments

- **Mixed / Mixed** (16 replies): This is definitely one of my CORE problem as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context and it's not actually that interesting to see a new model that's marginally better than the next one (...
- **Neutral / Pro** (12 replies): A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max): 1. Build context for the work you're doing. Put lots of your codebase into the context window. 2. Do work, but ...
- **Mixed / Mixed** (4 replies): This is definitely good to have this as an option but at the same time having more context reduces the quality of the output because it's easier for the LLM to get "distracted". So, I wonder what will happen to the quality of code produced by tools like Claude Code if users do...
- **Mixed / Mixed** (20 replies): My experience with the current tools so far: 1. It helps to get me going with new languages, frameworks, utilities or full green field stuff. After that I expend a lot of time parsing the code to understand what it wrote that I kind of "trust" it because it is too tedious but "...
- **Positive / Pro** (8 replies): One of the most helpful usages of CC so far is when I simply ask: "Are there any bugs in the current diff". It analyzes the changes very thoroughly, often finds very subtle bugs that would cost hours of time/deployments down the line, and points out a bunch of things to think thr...
**Official · 2025-09-29 · Claude Sonnet 4.5 (Claude)** · score 1,585 · 786 comments

- **Positive / Pro** (12 replies): I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of bench...
- **Negative / Anti** (21 replies): Anecdotal evidence. I have a fairly large web application with ~200k LoC. Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI): "implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing ...
- **Very negative / Anti** (8 replies): I haven't shouted into the void for a while. Today is as good a day as any other to do so. I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection b...
- **Negative / Anti** (8 replies): > Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks. Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all ...
- **Mixed / Mixed** (7 replies): I just ran this through a simple change I’ve asked Sonnet 4 and Opus 4.1, and it fails too. It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this chang...
**Official · 2025-10-15 · Claude Haiku 4.5 (Claude)** · score 730 · 287 comments

- **Mixed / Mixed** (6 replies): Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being t...
- **Negative / Neutral** (19 replies): Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...
- **Neutral / Neutral** (1 reply): I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.
- **Positive / Pro** (8 replies): Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...
- **Neutral / Neutral** (4 replies): I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.
**Official · 2025-11-24 · Claude Opus 4.5 (Claude)** · score 1,113 · 506 comments

- **Positive / Pro** (19 replies): The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads. Also notable: they're claiming SOTA prompt injection resi...
- **Negative / Mixed** (15 replies): This is gonna be game-changing for the next 2-4 weeks before they nerf the model. Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”. Then a sacrificial Anthropic engineer will “discover” a couple obscure bugs that “in some cases”...
- **Positive / Pro** (24 replies): I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else. I ...
- **Mixed / Mixed** (1 reply): The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks. There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded...
- **Positive / Pro** (9 replies): Seeing these benchmarks makes me so happy. Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent. This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting. Ive been hol...
**Official · 2026-02-05 · Claude Opus 4.6 (Claude)** · score 2,346 · 1,031 comments

- **Very positive / Pro** (37 replies): Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books. All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it) For now you can fit the first 4 ...
- **Positive / Anti** (7 replies): 5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!
- **Mixed / Mixed** (12 replies): I'm still not sure I understand Anthropic's general strategy right now. They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding. Meanwhile, Claude's general use cases are... fine. For generic r...
- **Neutral / Neutral** (1 reply): Claude Code release notes: > Version 2.1.32: • Claude Opus 4.6 is now available! • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) • Claude now automatically records ...
- **Mixed / Mixed** (23 replies): The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
**Official · 2026-02-05 · Claude Opus 4.6 (Claude)** · score 154 · 50 comments

- **Mixed / Mixed** (4 replies): Lately my company has been doing a lot of complex accounting and reporting in spreadsheets. Overall was surprised by how well both GPT and Claude handled some of these extremely tedious tasks. Not uncommon to have an hours-long task compressed to minutes. My anecdotal experienc...
- **Neutral / Neutral** (0 replies): Based on the article... is this basically just making Claude better at formatting and data presentation, or does it also get better at analysis? I get the impression it's the former.
- **Mixed / Mixed** (0 replies): The benchmarks look good. Slide decks and spreadsheets look better. The people must use Claude Cowork and have their Claude Code moment and figure out the consequences. It will be really interesting to see articles like this (https://mitchellh.com/writing/my-ai-adoption-journe...
- **Mixed / Mixed** (1 reply): And then you hand it to your boss who takes a 20 second look at it and asks why you made a projection that assume massive revenue growth and 3 years of perfectly flat utilities, insurance, G&A - no inflation etc. It does look really promising as a skeleton starting point th...
- **Negative / Neutral** (0 replies): > The side-by-side outputs below show how output quality has improved from Claude Opus 4.5 to Opus 4.6. Disclaimer: I use AI to code (and I code for finance) and I love Anthropic. But: for f-ck's sake, I cannot click on the picture and have it show up in full. It stays at its...
**Official · 2026-02-05 · Claude Opus 4.6 (Claude)** · score 212 · 75 comments

- **Negative / Anti** (8 replies): Considering all the problems they've been having with over-charging Claude Code users over the past few weeks it's the very least they could do. Max subscribers are hitting their 5 hour usage limits in 30-40 minutes with a single instance doing light work, while Anthropic have...
- **Negative / Anti** (1 reply): Is this one of those, Hey turn off the overcharge protection because you can go $50 into debt for free, and then maybe you'll just keep going and not notice you owe us an extra $500 type of situations?
- **Neutral / Neutral** (1 reply): Meanwhile codex is running a free tier for a month right now for anyone who wants to experiment with it: https://openai.com/index/introducing-the-codex-app/ Doesn't appear to include the new model though, only the state-of-yesterdays-art (literally yesterdays).
- **Very negative / Anti** (2 replies): I'll pass on this $50, but please hire real human and fix your crappy app, Claude! This bug has been for years: in Claude (web or app), if you create a new chat at the middle of existing chat thinking or tool calling, the existing chat will be broken, either losing data, or bec...
- **Neutral / Neutral** (0 replies): You can use this if you started your Pro or Max subscription before Wednesday, February 4, 2026 at 11:59 PM PT. Go to https://claude.ai/settings/usage, turn on extra usage and enable the promo from the notification afterwards. I received €42, top up was not required and auto-rel...
**Official · 2026-02-17 · Claude Sonnet 4.6 (Claude)** · score 1,346 · 1,226 comments

- **Very negative / Anti** (15 replies): I see a big focus on computer use - you can tell they think there is a lot of value there and in truth it may be as big as coding if they convincingly pull it off. However I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their o...
- **Negative / Neutral** (4 replies): They use the word "Sonnet" 60+ times on that page but never give the casual reader any context of what a "Sonnet model" actually is. Neither does their landing page. You have to scroll all the way to the footer to find a link under the "Models" section. You click it and you fi...
- **Mixed / Mixed** (9 replies): I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems which spans ~16 years. It is a far cry from Opus 4.6. Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 pro. Didn't hallucinate anything and produced honestly mind-blowing ana...
- **Neutral / Neutral** (16 replies): The demise of saas has been overplayed imho. When companies buy software they are essentially buying something that solves a problem and the insurance that comes with that. Part of that means they get to pick up the phone and complain if something doesn't work and someone on t...
- **Positive / Neutral** (9 replies): I always grew up hearing “competition is good for the consumer.” But I never really internalized how good fierce battles for market share are. The amount of competition in a space is directly proportional to how good the results are for consumers.
Alibaba
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
**Official · 2024-06-06 · Qwen2 (Qwen)** · score 261 · 130 comments

- **Positive / Pro** (6 replies): A 0.5B parameter model with a 32k context length that also makes good use of that full window?! That's very interesting. The academic benchmarks on that particular model relative to 1.5B-2B models are what you would expect, but it would make for an excellent base for finetuning...
- **Neutral / Neutral** (2 replies): Is it a common practice in LLMs to give different weights to different training data sources? For instance I might want to say that all training data that comes from my inhouse emails take precedence over anything that comes from the internet?
- **Positive / Pro** (0 replies): This model has: 1. On par or better performance than Llama-3-70B-Instruct 2. A much more comfortable context length of 128k (vs the tiny 8k that really hinders Llama-3). These 2 feats together will probably make it the first serious OS rival to GPT-4!
- **Negative / Anti** (5 replies): Weird, every time I try asking what happened at Tiananmen Square, or why Xi is an outlier with 3 terms as party secretary, it errors. "All hail Glorious Xi :)" works though. https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct
- **Neutral / Neutral** (8 replies): Given the restrictions on GPUs to China, I'm curious what their training cluster looks like. (not saying this out of any support or non-support for such a GPU blockade; I'm just genuinely curious)
**Official · 2024-11-11 · Qwen2.5-Coder 32B Instruct (Qwen)** · score 23 · 0 comments · no labeled comments
**Official · 2024-11-27 · QwQ 32B (Qwen)** · score 438 · 421 comments

- **Very positive / Pro** (3 replies): This one is crazy. I made up a silly topology problem which I guessed wouldn't be in a textbook (given X create a shape with Euler characteristic X) and set it to work. Its first effort was a program that randomly generated shapes, calculated X and hoped it was right. I went a...
- **Positive / Pro** (5 replies): This one is pretty impressive. I'm running it on my Mac via Ollama - only a 20GB download, tokens spit out pretty fast and my initial prompts have shown some good results. Notes here: https://simonwillison.net/2024/Nov/27/qwq/
- **Positive / Pro** (1 reply): QwQ can solve a reverse engineering problem [0] in one go that only o1-preview and o1-mini have been able to solve in my tests so far. Impressive, especially since the reasoning isn't hidden as it is with o1-preview. [0] https://news.ycombinator.com/item?id=41524263
- **Positive / Pro** (2 replies): 32B is a good choice of size, as it allows running on a 24GB consumer card at ~4 bpw (RTX 3090/4090) while using most of the VRAM. Unlike llama 3.1, which had 8b, 70B (much too big to fit), and 405B.
- **Negative / Mixed** (7 replies): I asked the classic 'How many of the letter “r” are there in strawberry?' and I got an almost never ending stream of second guesses. The correct answer was ultimately provided but I burned probably 100x more clockcycles than needed. See the response here: https://pastecode.io/s...
**Official · 2025-01-26 · Qwen2.5 14B Instruct (Qwen)** · score 307 · 108 comments

- **Neutral / Neutral** (0 replies): Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/ (via https://news.ycombinator.com/item?id=42832838, but we merged that thread hither)
- **Negative / Anti** (13 replies): In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, et...
- **Neutral / Neutral** (4 replies): Ollama has a num_ctx parameter that controls the context window length - it defaults to 2048. At a guess you will need to set that.
- **Neutral / Neutral** (1 reply): Here are tips for running it on macOS using MLX: https://twitter.com/awnihannun/status/1883611098081099914 - using https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-...
- **Neutral / Neutral** (3 replies): What's the SOTA for memory-centric computing? I feel like maybe we need a new paradigm or something to bring the price of AI memory down. Maybe they can take some of those hundreds of billions and invest in new approaches. Because racks of H100s are not sustainable. But it's cle...
|
Official
2025-03-05
|
QwQ 32B
Qwen
|
480 | 169 |
|
|
Mixed
Mixed
10 repliesr
Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply.EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't beli... |
||||
|
Mixed
Pro
6 replies
The Chinese strategy is to open-source the software part and earn on the robotics part. And they are already ahead of everyone in that game. These things are pretty interesting as they are developing. What will the US do to retain its power? BTW I am Indian and we are not even in the race as coun... |
||||
|
Positive
Pro
4 replies
I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too. 20x smaller than DeepSeek! How small can these go? What kind of hardware can run this? |
||||
|
Neutral
Neutral
5 replies
To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ. |
||||
|
Mixed
Mixed
2 replies
It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported - it sometimes loses the plot after producing a lot of reasoning tokens. Overall though quite impressive i... |
||||
|
Official
2025-03-24
|
Qwen2.5-VL-32B
Qwen
|
544 | 293 |
|
|
Neutral
Neutral
5 replies
Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/ |
||||
|
Very positive
Pro
1 reply
This model is available for MLX now, in various different sizes. I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this: uv run --with 'numpy<2' --with mlx-vlm \ ... |
||||
|
Very positive
Pro
2 replies
We were using Llama vision 3.2 a few months back and were very frustrated with it (both in terms of speed and result quality). One day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our ... |
||||
|
Positive
Pro
9 replies
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced M... |
||||
|
Neutral
Neutral
10 replies
Silly question: how can OpenAI, Claude and all, have a valuation so large considering all the open source models? Not saying they will disappear or be tiny (closed models), but why so so so valuable? |
||||
|
3rd party
2025-03-27
|
Qwen2.5-Omni 7B
Qwen
|
29 | 3 |
|
|
Positive
Pro
0 replies
Qwen has been on fire lately. First QwQ now this! |
||||
|
Negative
Anti
1 reply
Dunno if the article author is in this thread, but there's an issue with the links. Hugging Face goes to GitHub, GitHub goes to nowhere. |
||||
|
Official
2025-04-03
|
QVQ Max
Qwen
|
121 | 26 |
|
|
Positive
Pro
1 reply
I think this is old news, but this model does better than llama 4 maverick on coding. |
||||
|
Neutral
Neutral
5 replies
I wonder why we are getting these drops during the weekend. Is the AI race truly that heated? |
||||
|
Neutral
Neutral
1 reply
The about page doesn’t shed light on the composition of the core team, nor their sources of income or funding. Am I overlooking something? > We are a group of people with diverse talents and interests. |
||||
|
Mixed
Mixed
0 replies
I’ve tried QVQ before; one issue I always stumbled upon is that it entered an endless loop of reasoning. It felt wrong letting it run for too long, I didn’t want to over-consume Alibaba's GPUs. I’ll try QVQ-Max when I get home. It’s kinda fascinating how these models can interpr... |
||||
|
Official
2025-04-28
|
Qwen3 14B
Qwen
|
869 | 388 |
|
|
Negative
Anti
18 replies
I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Q... |
||||
|
Positive
Pro
4 replies
They have got pretty good documentation too[1]. And it looks like we have day 1 support for all major inference stacks, plus so many size choices. Quants are also up because they have already worked with many community quant makers. Not even going into performance, need to test fi... |
||||
|
Mixed
Mixed
6 replies
As is now traditional for new LLM releases, I used Qwen 3 (32B, run via Ollama on a Mac) to summarize this Hacker News conversation about itself - run at the point when it hit 112 comments. The results were kind of fascinating, because it appeared to confuse my system prompt te... |
||||
|
Neutral
Neutral
12 replies
Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a ... |
||||
|
Neutral
Neutral
15 replies
With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU? I.e. I have a Quadro RTX 4000 with 8G VRAM, and seeing all the models https://ollama.com/search here with all... |
||||
|
Official
2025-07-22
|
Qwen3 Coder Flash
Qwen
|
765 | 366 |
|
|
Neutral
Neutral
13 replies
I'm currently making 2bit to 8bit GGUFs for local deployment! Will be up in an hour or so at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc... Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder |
||||
|
Positive
Pro
4 replies
> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first. I'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But ... |
||||
|
Neutral
Neutral
8 replies
The "qwen-code" app seems to be a gemini-cli fork. https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSE I hope these OSS CC clones converge at some point. Actually it is mentioned in the page: we’re also open-sourcing a command-line tool for a... |
||||
|
Neutral
Neutral
14 replies
At my work, here is a typical breakdown of time spent by work areas for a software engineer. Which of these areas can be sped up by using agentic coding? 05%: Making code changes; 10%: Running build pipelines; 20%: Learning about changed process and people via zoom calls, teams cha... |
||||
|
Very positive
Pro
5 replies
I've been using it all day, it rips. Had to bump up the tool-calling limit in Cline to 100 and it just went through the app no issues, got the mobile app built, fixed through the linter errors... wasn't even hosting it with the toolcall template on with the vllm nightly, just stock... |
||||
|
Official
2025-08-04
|
Qwen-Image
Qwen
|
544 | 159 |
|
|
Positive
Pro
13 replies
Not sure why this isn’t a bigger deal - it seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge. |
||||
|
Mixed
Mixed
1 reply
Good release! I've added it to the GenAI Showdown site. Overall a pretty good model, scoring around 40% - and definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when it's quantized). That being said, it still lags prett... |
||||
|
Positive
Pro
2 replies
The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This only seems to apply those recognizable AI artifacts to only the elements needing to be edited. |
||||
|
Neutral
Neutral
8 replies
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on t... |
||||
|
Neutral
Neutral
0 replies
A silly question: do any of these models generate pixels and also vector overlays? I don't see why we need to solve the text problem pixel-for-pixel if we can just generate higher-level descriptions of the text (text, font, font size, etc). Ofc, it won't work in all situations... |
||||
|
Official
2025-09-12
|
Qwen3-Next 80B-A3B (Thinking)
Qwen
|
569 | 230 |
|
|
Positive
Pro
3 replies
Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix. DeepSeek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod... But DeepSeek R1 adds embed_to... |
||||
|
Positive
Mixed
4 replies
Alibaba keeps releasing gold content. I just tried Qwen3-Next-80B-A3B on Qwen chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis. According to Qwen Chat, Qwen3-Next has the f... |
||||
|
Negative
Anti
5 replies
llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob" Here's a classic ASCII art representation of SpongeBob SquarePants: .------. / o o \ | | | \___/ | \_______/ llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ "An ASCII of spongebob" Here's an ASCII art of SpongeB... |
||||
|
Positive
Pro
2 replies
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model depending on how you offload your VRAM and CPU. That’s insane. |
||||
|
Neutral
Neutral
8 replies
The same week Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true then this could lead to a lot less demand for Nvidia, Oracle, Coreweave etc |
||||
|
Repo
2025-09-22
|
Qwen3-Omni Flash
Qwen
|
571 | 142 |
|
|
Positive
Pro
12 replies
Interesting, the pacing seemed very slow when conversing in English, but when I spoke to it in Spanish, it sounded much faster. It's really impressive that these models are going to be able to do real-time translation and much more. The Chinese are going to end up owning the AI... |
||||
|
Neutral
Neutral
3 replies
You can try it out on https://chat.qwen.ai/ - sign in with Google or GitHub (signed-out users can't use the voice mode) and then click on the voice icon. It has an entertaining selection of different voices, including: *Dylan* - a teenager who grew up in Beijing's hutongs; *Peter*... |
||||
|
Neutral
Neutral
4 replies
The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.I wonder if we'll see a macOS port soon - currently it very much needs an N... |
||||
|
Very positive
Pro
0 replies
Here is the demo video on it. The video w/ sound input -> sound output while doing translation from the video to another language was the most impressive display I've seen yet. https://www.youtube.com/watch?v=_zdOrPju4_g |
||||
|
Positive
Pro
2 replies
Speech input + speech output is a big deal. In theory you can talk to it using voice, and it can respond in your language, or translate for someone else, without intermediary technologies. Right now you need wakeword, speech to text, and then text to speech, in addition to you... |
||||
|
Official
2025-09-23
|
Qwen3-VL 235B-A22B
Qwen
|
434 | 160 |
|
|
Very positive
Pro
11 replies
As I mentioned yesterday - I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used pil/opencv, pytesseract, and OpenAI as a fallback. It still has a staggering number of failures. Today I tried a handfu... |
||||
|
Very positive
Pro
4 replies
The Chinese are doing what they have been doing to the manufacturing industry as well. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said,... |
||||
|
Neutral
Neutral
2 replies
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage ... |
||||
|
Very positive
Pro
2 replies
The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of proprietary models and still released it as open-weights. My first tests suggest this might actually be true, will continue testing. Wow |
||||
|
Negative
Anti
3 replies
Sadly it still fails the "extra limb" test. I have a few images of animals with an extra limb photoshopped onto them. A dog with a leg coming out of its stomach, or a cat with two front right legs. Like every other model I have tested, it insists that the animals have their an... |
||||
|
3rd party
2025-11-02
|
Tongyi DeepResearch
Qwen
|
365 | 153 |
|
|
Neutral
Neutral
7 replies
It makes me wonder if we'll see an explosion of purpose-trained LLMs because we hit diminishing returns on investment with pre-training, or if it takes a couple of months to fold these advantages back into the frontier models. Given the size of frontier models I would assume that th... |
||||
|
Negative
Anti
11 replies
Has anyone found these deep research tools useful? In my experience, they generate really bland reports that don't go much further than a summarization of what a search engine would return. |
||||
|
Neutral
Neutral
2 replies
I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch |
||||
|
Mixed
Pro
1 reply
This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though. |
||||
|
Neutral
Neutral
13 replies
Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like. Constra... |
||||
|
Official
2025-12-10
|
Qwen3-Omni Flash Realtime
Qwen
|
316 | 106 |
|
|
Positive
Pro
7 replies
This is a 30B parameter MoE with 3B active parameters and is the successor to their previous 7B omni model. [1] You can expect this model to have similar performance to the non-omni version. [2] There aren't many open-weights omni models so I consider this a big deal. I would us... |
||||
|
Neutral
Neutral
4 replies
Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does. Are there any open weight models that do? Not talking about speech to text -> LLM -> text to speech btw, I mean a real voice <-> language model. ed... |
||||
|
Neutral
Neutral
2 replies
Is there a way to run these Omni models on a Macbook quantized via GGUF or MLX? I know I can run it in LMStudio or Llama.cpp but they don't have streaming microphone support or streaming webcam support. Qwen usually provides example code in Python that requires Cuda and a non-q... |
||||
|
Neutral
Neutral
0 replies
Wayback for those that can't reach https://web.archive.org/web/20251210164048/https://qwen.ai/b... |
||||
|
Neutral
Neutral
1 reply
The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user. I'm curious how anyone has solved this. |
||||
|
Official
2026-01-22
|
Qwen3-TTS
Qwen
|
744 | 225 |
|
|
Neutral
Neutral
9 replies
If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then pas... |
||||
|
Neutral
Neutral
4 replies
I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423 Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py You can try it with uv (downloads a 4.5GB model on first run) like ... |
||||
|
Mixed
Mixed
2 replies
Interesting model. I've managed to get the 0.6B param model running on my old 1080 and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically i... |
||||
|
Very positive
Pro
3 replies
It isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and I thought the stuff from two years ago was about the best we were going to get. I don't know the size of these; I scrolled to the samples. I am going to get the mode... |
||||
|
Neutral
Neutral
9 replies
Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5. Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they are in terms of politics. |
||||
|
Official
2026-01-26
|
Qwen3 Max
Qwen
|
502 | 424 |
|
|
Mixed
Mixed
5 replies
One thing I’m becoming curious about with these models are the token counts to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense, they’re techniques for using the... |
||||
|
Neutral
Neutral
3 replies
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible that the Chinese internet has better quality content available? My problem with deep research tends to be that what it does is it searche... |
||||
|
Neutral
Neutral
4 replies
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/... |
||||
|
Neutral
Neutral
2 replies
Hacker News strongly believes Opus 4.5 is the de facto standard and China was consistently 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks. |
||||
|
Positive
Pro
0 replies
The most important benchmark: https://boutell.dev/misc/qwen3-max-pelican.svg I used Simon Willison's usual prompt. It thought for over 2 minutes (free account). The commentary was even more glowing than the image. It has a certain charm. |
||||
|
Official
2026-02-03
|
Qwen3 Coder Next
Qwen
|
735 | 429 |
|
|
Positive
Pro
14 replies
This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher-end laptops. I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough ... |
||||
|
Neutral
Neutral
7 replies
For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next |
||||
|
Neutral
Neutral
2 replies
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this: brew upgrade llama.cpp # or brew install if you don't have it yet Then: llama-cli \ -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \ --fit on \ --seed 3407 \ --temp 1.0 \ --top-p ... |
||||
|
Positive
Mixed
3 replies
It’s hard to overstate just how wild this model might be if it performs as claimed. The claim is that it can perform close to Sonnet 4.5 for assisted coding (SWE-bench) while using only 3B active parameters. This is obscenely small for the claimed performance. |
||||
|
Positive
Pro
0 replies
I got the Qwen3 Coder 30B running locally on a Mac M4 Max 36GB. It was slow, but it worked and did do some decent stuff: https://www.youtube.com/watch?v=7mAPaRbsjTU Video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: ht... |
||||
|
Official
2026-02-10
|
Qwen-Image-2.0
Qwen
|
422 | 192 |
|
|
Neutral
Neutral
10 replies
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, when the ren... |
||||
|
Positive
Pro
2 replies
Couple of thoughts: 1. I’d wager that given their previous release history, this will be open-weight within 3-4 weeks. 2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can... |
||||
|
Positive
Pro
3 replies
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation. |
||||
|
Neutral
Neutral
1 reply
The "horse riding man" prompt is wild: """A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult br... |
||||
|
Neutral
Neutral
11 replies
I recently tried out LMStudio on Linux for local models. So easy to use! What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen? |
||||
|
Official
2026-02-16
|
Qwen3.5 397B-A17B
Qwen
|
434 | 214 |
|
|
Neutral
Neutral
1 repliesr
For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5 |
||||
|
Negative
Anti
9 repliesr
You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question. |
||||
|
Positive
Pro
0 replies
"the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive." I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training obje... |
||||
|
Negative
Anti
8 replies
Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8... |
||||
|
Positive
Pro
3 replies
Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder. |
||||
|
3rd party
2026-02-28
|
Qwen3.5 122B
Qwen
|
461 | 272 |
|
|
Mixed
Anti
18 replies
If you're new to this: all of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, then they always disappoint in actual use. I've been playing with Qwen3-Coder-... |
||||
|
Negative
Anti
19 replies
I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge. So far Opus 4.6 and Gemini Pro are ve... |
||||
|
Neutral
Neutral
4 replies
I recently wrote a guide on getting llama.cpp, OpenCode, and Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization) working on an M1 MacBook Pro (e.g. using brew). It was a bit finicky to get all of the pieces together, so hopefully this can be used with these newer models.... |
||||
|
Positive
Pro
4 replies
I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous. I then discovered what quantization is by reading a blog post about binary quantizat... |
||||
|
Negative
Anti
3 replies
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.Up until relatively recently, while people had already long been making these claim... |
||||
|
3rd party
2026-03-04
|
Qwen Turbo
Qwen
|
783 | 360 |
|
|
Positive
Pro
8 replies
I really hope this doesn't hinder development too much. As Simon says, Qwen3.5 is very impressive. I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've h... |
||||
|
Negative
Anti
2 replies
There has been tension between Qwen's research team and Alibaba's product team, say the Qwen App. And recently, Alibaba tried to impose DAU as a KPI. It's understandable that a company like Alibaba would force a change of product strategy for any number of reasons. What puzzle... |
||||
|
Positive
Pro
1 reply
One thing I’ve noticed with local models is that people tolerate a lot more trial-and-error behavior. When a hosted model wastes tokens it feels expensive, but when a local model loops a bit it just feels like it’s “thinking.” If models like Qwen can get good enough for coding ... |
||||
|
Neutral
Neutral
9 replies
I wonder how a US lab hasn't dumped truckloads of cash into various laps to ensure these researchers have a place at their lab |
||||
|
Neutral
Neutral
5 replies
How do those companies make money? Qwen, GLM, Kimi, etc all released for free. I have no experience in the field, but from reading HN alone my impression was training is exceptionally costly and inference can be barely made profitable. How/why do they fund ongoing development ... |
||||
|
Official
2026-04-02
|
Qwen3.6 Plus
Qwen
|
594 | 213 |
|
|
Negative
Anti
11 repliesr
This is their hosted-only model, not an open weight model like they’ve become known for. They got a lot of good publicity for their open weight model releases, which was the goal. The hard part is pivoting from an open weight provider to being considered as a competitor to Cla... |
||||
|
Positive
Mixed
2 replies
I understand people's reactions to the Qwen team comparing against Opus 4.5 instead of 4.6, and them comparing against Gemini Pro 3.0 instead of 3.1. But calling it misleading is a bit of a stretch in my eyes; people here are acting like we immediately forgot how previous generations... |
||||
|
Positive
Pro
2 replies
Pretty solid Pelican: https://gist.github.com/simonw/ca081b679734bc0e5997a43d29fad... I used the https://modelstudio.alibabacloud.com/ API to generate that one, which required signing up for an account and attaching PayPal billing - but it looks like OpenRouter are offering it ... |
||||
|
Negative
Anti
5 replies
Worth noting that this model, unlike almost all qwen models, is not open-weight, nor is the parameter count exposed. Also odd that it is compared against opus 4.5 even though 4.6 was released like 2 months ago. |
||||
|
Positive
Pro
2 replies
I'll diverge from some of these comments: I don't find it misleading to compare to Opus 4.5. I can remember how good Opus 4.5 was. If I'm considering using this, it's most informative to me to compare to the model it's closest to that I have familiarity with. I'm obviously not s... |
||||
Mistral AI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
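The net score that colors each bubble is not defined explicitly in this report. A minimal sketch of one plausible mapping, assuming the five sentiment labels used in the tables below are weighted on a symmetric scale (the weights are illustrative assumptions, not the report's actual formula):

```python
# Hypothetical sentiment weights - an assumption for illustration,
# not the formula used to produce the charts in this report.
WEIGHTS = {
    "Very positive": 2, "Positive": 1, "Neutral": 0,
    "Mixed": 0, "Negative": -1, "Very negative": -2,
}

def net_score(labels):
    """Average sentiment weight across a post's labeled top-level comments."""
    if not labels:
        return 0.0
    return sum(WEIGHTS[label] for label in labels) / len(labels)

# Example: the five labeled comments on the QwQ 32B post above.
print(net_score(["Mixed", "Mixed", "Positive", "Neutral", "Mixed"]))  # 0.2
```

Averaging rather than summing keeps posts with different numbers of labeled comments comparable on the same color scale.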
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Official
2023-09-27
|
Mistral 7B
Mistral
|
884 | 618 |
|
|
Very positive
Pro
5 replies
Major kudos to Mistral for being the first company to Apache-license a model of this class. Meta wouldn't make Llama open source. DeciLM wouldn't make theirs open source. All of them wanted to claim they were open source, while putting in place restrictions and not using an open ... |
||||
|
Negative
Anti
1 reply
I asked it "Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner." and it responded: To stack these items on top of each other in a stable manner, follow these steps: 1. Start with the 9 eggs. Arrange the... |
||||
|
Positive
Pro
5 replies
I eat my initial words; this works really well on my MacBook Air M1 and feels comparable to GPT-3.5 - which is actually an amazing feat! Question: is there something like this, but with the "function calling api" finetuning? 95% of my uses nowadays deal with input/output of stru... |
||||
|
Neutral
Neutral
0 replies
For anyone who missed it, the twitter announcement of the model was just a torrent tracker uri: https://twitter.com/MistralAI/status/1706877320844509405 |
||||
|
Negative
Anti
2 replies
They don't mention what datasets were used. I've come across too many models in the past which gave amazing results because benchmarks leaked into their training data. How are we supposed to verify one of these HuggingFace datasets didn't leak the benchmarks into the training ... |
||||
|
3rd party
2023-12-08
|
Mixtral 8x7B
Mixtral
|
546 | 239 |
|
|
Positive
Pro
10 replies
In other LLM news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... very good: https://huggingface.co/fblgit/una-xaberius-... |
||||
|
Neutral
Neutral
4 replies
Andrej Karpathy's take: New open weights LLM from @MistralAI. params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2. Likely related code: https://github.com/mistralai/megablocks... |
||||
|
Positive
Pro
2 replies
Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini! |
||||
|
Positive
Pro
3 replies
Do you need some fancy announcement? Let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph... |
||||
|
Neutral
Neutral
2 replies
Looks to be Mixture of Experts, here is the params.json: { "dim": 4096, "n_layers": 32, "head_dim": 128, "hidden_dim": 14336, "n_heads": 32, "n_kv_heads": 8, "norm_eps": 1e-05, "vocab_size": 32000, "moe": { "num_experts_per_tok": 2, "num_experts": 8 } } |
||||
|
Official
2023-12-11
|
Mixtral 8x7B
Mixtral
|
639 | 300 |
|
|
Neutral
Neutral
1 reply
Andrej Karpathy's take: Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/ Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6... New HuggingFace explainer on MoE, very nice: https://huggingface.co/bl... |
||||
|
Neutral
Neutral
2 replies
More models available at Huggingface now: https://huggingface.co/search/full-text?q=mixtral Already available from both Mistralai and TheBloke: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF |
||||
|
Positive
Pro
6 replies
> We’re currently using Mixtral 8x7B behind our endpoint mistral-small... So 45 billion parameters is what they consider their "small" model? I'm excited to see what/if their larger models will be. |
||||
|
Neutral
Pro
3 replies
> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given. Mistral does not censor its models and is committed to a hands-off approach, according to their CEO: https://www.you... |
||||
|
Positive
Pro
1 reply
This is very exciting and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think we will get a lot better until that happens and the effort will go into making the models a lot cheaper to run without sacrificing too much... |
||||
|
Paper
2024-01-09
|
Mixtral 8x7B
Mixtral
|
359 | 150 |
|
|
Very positive
Pro
4 replies
This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It's roughly sized at 13B, and ranked much, much higher than competitively sized models by, e.g. https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com...... |
||||
|
Positive
Pro
7 replies
I’d like to note that this model’s parameter usage is low enough (13b) to run smoothly at high quality on a 3090 while beating GPT-3.5 on humaneval and sporting 32k context. 3090s are consumer grade and common on gaming rigs. I’m hoping game devs start experimenting with locall... |
||||
|
Neutral
Neutral
1 reply
If anyone wants to try out this model, I believe it's one of the ones released as a Llamafile by Mozilla/jart[0]. 1) Download llamafile[1] (30.03 GB): https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-ll... 2) chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile 3) ./mixt... |
||||
|
Neutral
Neutral
2 replies
On mac silicon: https://ollama.ai/ then ollama pull mixtral. For a ChatGPT-esque web UI: https://github.com/ollama-webui/ollama-webui then docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama... |
||||
|
Neutral
Neutral
0 replies
Recent and related: Mixtral of experts - https://news.ycombinator.com/item?id=38598559 - Dec 2023 (300 comments); Mistral-8x7B-Chat - https://news.ycombinator.com/item?id=38594578 - Dec 2023 (69 comments); Mistral "Mixtral" 8x7B 32k model [magnet] - https://news.ycombinator.com/ite... |
||||
|
Official
2024-02-26
|
Mistral Large (latest)
Mistral
|
599 | 267 |
|
|
Positive
Pro
4 replies
I appreciate the honesty in the marketing materials. Showing the product scoring below the market leader in a big benchmark is better than the Google way of cherry picking benchmarks. |
||||
|
Mixed
Mixed
1 reply
Very nice! I know they've already done a lot, but I would've liked some language in there re-affirming a commitment to contributing to the open source community. I had thought that was a major part of their brand. I've been staying tuned[0] since the miqu[1] debacle thinking th... |
||||
|
Neutral
Neutral
2 replies
Changelog is also updated: [1] Feb. 26, 2024. API endpoints: We renamed 3 API endpoints and added 2 model endpoints. open-mistral-7b (aka mistral-tiny-2312): renamed from mistral-tiny. The endpoint mistral-tiny will be deprecated in three months. open-mixtral-8x7B (aka mistral-smal... |
||||
|
Neutral
Neutral
1 reply
I just added support for the new models to my https://github.com/simonw/llm-mistral plugin for my LLM CLI tool. You can now do this: pipx install llm llm install llm-mistral llm keys set mistral < paste your API key here > llm -m mistral-large 'prompt goes here' |
||||
|
Positive
Pro
2 replies
Just tried Le Chat for some coding issues I had today that ChatGPT (with GPT-4) wasn't able to solve, and Le Chat actually gave way better answers. Not sure if ChatGPT quality has gone down to save costs as some people suggest, but for these few problems the quality of the ans... |
||||
|
Official
2024-04-17
|
Mixtral 8x22B
Mixtral
|
533 | 242 |
|
|
Mixed
Anti
5 replies
First test I tried to run a random taxation question through it. Output: https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467... Within the first paragraph, it outputs: > GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE Thought that was hilarious. |
||||
|
Neutral
Neutral
11 replies
Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all? |
||||
|
Neutral
Mixed
5 replies
"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 TurboMaybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex... |
||||
|
Mixed
Mixed
2 replies
Great to see such free to use and self-hostable models, but it's said that open now means only that. One cannot replicate this model without access to the training data. |
||||
|
Neutral
Neutral
3 replies
What's the best way to run this on my Macbook Pro? I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is... |
||||
|
Official
2024-05-29
|
Codestral (latest)
Codestral
|
457 | 214 |
|
|
Negative
Anti
8 replies
The license for this [1] prohibits use of the model and its outputs for any commercial activity, or even any "live" (whatever that means) conditions, commercial or not. There seems to be an exclusion for using the code outputs as part of "development". But wait! It also prohibi... |
||||
|
Negative
Neutral
9 replies
My favorite thing to ask the models designed for programming is: "Using Python write a pure ASGI middleware that intercepts the request body, response headers, and response body, stores that information in a dict, and then JSON encodes it to be sent to an external program usin... |
||||
|
Neutral
Neutral
7 replies
i've been noticing that there's a divergence in philosophy between Llama style LLMs (Mistral are Meta alums so I'm counting them in tehre) and OpenAI/GPT style LLMs when it comes to code.GPT3.5+ prioritized code very heavily - there's no CodeGPT, its just GPT4, and every versi... |
||||
|
Neutral
Neutral
6 replies
Is there a way to use this within VSCode like copilot , meaning having the "shadow code" appear while you code instead of having to tho back-and-forth between the editor and a chat-like interface ?For me, a significant component of the quality of these tools resides on the "cl... |
||||
|
Neutral
Neutral
0 replies
>Usage Limitation: - You shall only use the Mistral Models and Derivatives (whether or not created by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-Production Environments; - Subject to the foregoing, You shall not supply the Mistral Models, Deriva... |
||||
|
Official
2024-07-16
|
Codestral Mamba
Codestral
|
485 | 138 |
|
|
Negative
Mixed
7 replies
What are the steps required to get this running in VS Code? If they had linked to the instructions in their post (or better yet a link to a one click install of a VS Code Extension), it would help a lot with adoption. (BTW I consider it malpractice that they are at the top of ha... |
||||
|
Negative
Neutral
3 replies
I kinda just want something that can keep up with the original version of Copilot. It was so much better than the crap they’re pumping out now (keeps messing up syntax and only completing a few characters at a time). |
||||
|
Neutral
Neutral
2 replies
Does anyone have a favorite FIM capable model? I've been using codellama-13b through ollama w/ a vim extension i wrote and it's okay but not amazing, I definitely get better code most of the time out of Gemma-27b but no FIM (and for some reason codellama-34b has broken inferen... |
||||
|
Positive
Pro
0 replies
It's great to see a high-profile model using Mamba2! |
||||
|
Neutral
Anti
2 replies
The MBPP column should bold DeepSeek as it has a better score than Codestral. |
||||
|
Official
2024-07-18
|
Mistral Nemo
Mistral Nemo
|
418 | 162 |
|
|
Mixed
Mixed
6 replies
> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on s... |
||||
|
Neutral
Neutral
3 replies
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models. Does anyone have a good a... |
||||
|
Neutral
Neutral
0 replies
Nvidia has a blogpost about Mistral Nemo, too. https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/ > Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines. > *Designed to fit on the... |
||||
|
Neutral
Neutral
2 replies
These big models are getting pumped out like crazy, that is the business of these companies. But basically, it feels like private/industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/fund... |
||||
|
Negative
Anti
1 reply
I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data. |
||||
|
3rd party
2024-09-11
|
Pixtral 12B
Pixtral
|
163 | 40 |
|
|
Mixed
Mixed
5 replies
The "Mistral Pixtral multimodal model" really rolls off the tongue.> It’s unclear which image data Mistral might have used to develop Pixtral 12B.The days of free web scraping especially for the richer sources of material are almost gone, with anything between technical (AP... |
||||
|
Negative
Anti
1 reply
Couple notes for newcomers: 1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back. 2. It seems like Pixtral 12B benchmarks significantly below Qwen2-VL-7B [1], so if you want the best local model for unde... |
||||
|
Negative
Anti
2 replies
Mistral being more open than 'openai' is kind of a meme. How can a company call itself open while it refuses to openly distribute it's product and when competitor are actually doing it. |
||||
|
Neutral
Neutral
0 replies
Related earlier: New Mistral AI Weights https://news.ycombinator.com/item?id=41508695 |
||||
|
Positive
Pro
1 reply
I’d love to know how much money Mistral is taking in versus spending. I’m very happy for all these open weights models, but they don’t have Instagram to help pay for it. These models are expensive to build. |
||||
|
Official
2024-10-16
|
Ministral 3B (latest)
Ministral
|
216 | 99 |
|
|
Negative
Anti
8 replies
3b is API-only so you won’t be able to run it on-device, which is the killer app for these smaller edge models. I’m not opposed to licensing but “email us for a license” is a bad sign for indie developers, in my experience. 8b weights are here https://huggingface.co/mistralai... |
||||
|
Negative
Anti
2 replies
This press release is a big change in branding and ethos for Mistral. What was originally a vibey, insurgent contender that put out magnet links is now a PR-crafting team that has to fight to pitch their utility to the public. |
||||
|
Neutral
Neutral
3 replies
Has anyone put together a good and regularly updated decision tree for what model to use in different circumstances (VRAM limitations, relative strengths, licensing, etc.)? Given the enormous zoo of models in circulation, there must be certain models that are totally obsolete. |
||||
|
Negative
Anti
2 replies
They didn't add a comparison to Qwen 2.5 3b, which seems to surpass Ministral 3b MMLU, HumanEval, GSM8K: https://qwen2.org/qwen2-5/#qwen25-05b15b3b-performance These benchmarks don't really matter that much, but it is funny how this blog post conveniently forgot to compare with... |
||||
|
Neutral
Neutral
4 replies
For anybody wondering about the title, that's a sort-of pun in French about how words get pluralized following French rules. The quintessential example is "cheval" (horse) which becomes "chevaux" (horses), which is the rule they're following (or being cute about). Un mistral, d... |
||||
|
Official
2025-01-13
|
Codestral 25.01
Codestral
|
21 | 1 |
|
|
Mixed
Mixed
0 replies
256k context window and FIM support sounds good. I can't see it mentioned on the release page how big the model is, they say sub-100B but they are comparing it to 22B and the 22B seems to hold up quite well against it. If we're talking 80B vs 22B then it seems like quite a hea... |
||||
|
Official
2025-01-30
|
Mistral Small 3
Mistral
|
620 | 194 |
|
|
Positive
Pro
8 replies
I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. Th... |
||||
|
Neutral
Pro
6 replies
Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache: We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-lic... |
||||
|
Neutral
Neutral
1 reply
Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to: * run LLM evaluations systematically and at scale * share the data with the public in a rigorous and transparent way. We use the UK governmen... |
||||
|
Negative
Anti
0 replies
Not so subtle in function calling example[1] "role": "assistant", "content": "---\n\nOpenAI is a FOR-profit company.", [1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-... |
||||
|
Neutral
Pro
0 replies
So the point of this release is: 1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version) 2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100) 3) Performance, according to mistral, somewhere between Qwen 2... |
||||
|
Official
2025-03-06
|
Mistral OCR
Mistral
|
1,756 | 417 |
|
|
Mixed
Mixed
9 replies
I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker . Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41. Marker can inference between 20 and 120 pages per second on an H100. You can see the samples here - https://huggingf... |
||||
|
Negative
Anti
7 replies
It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image: https://i.imgur.com/jcwW5AG.jpeg For the blocks in the center, it outputs: > Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 aoû... |
||||
|
Very positive
Pro
2 replies
This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer. Specifically, this allows y... |
||||
|
Positive
Pro
3 replies
I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF. |
||||
|
Mixed
Mixed
2 replies
We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-geminiA high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and... |
||||
|
Official
2025-05-07
|
Mistral Medium 3
Mistral
|
70 | 20 |
|
|
Negative
Anti
1 reply
It's not cheaper than Deepseek V3.1, though, and Deepseek outperforms on nearly everything. And only between 1-3x the throughput based on the openrouter metrics (near equivalent throughput if you use an FP8 quant). Wish I could be a little more excited about this one. |
||||
|
Neutral
Neutral
1 reply
https://lifearchitect.ai/models-table/ |
||||
|
Neutral
Neutral
2 replies
I guess this one (Mistral Medium 3) won't be open? |
||||
|
Positive
Pro
0 replies
The medium model is crazy fast. Reminds me of maverick on groq (except better according to their own testing). |
||||
|
Neutral
Neutral
0 replies
The only hint at its size is that it requires "self-hosted environments of four GPUs and above". |
||||
|
Official
2025-05-21
|
Devstral
Devstral
|
701 | 148 |
|
|
Positive
Pro
4 replies
The first number I look at these days is the file size via Ollama, which for this model is 14GB https://ollama.com/library/devstral/tagsI find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters bec... |
||||
|
Very positive
Pro
3 replies
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run t... |
||||
|
Positive
Pro
1 reply
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company.Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, it's really good. Mistral Small is really good. Also building a startup with Mis... |
||||
|
Positive
Pro
1 reply
It's very nice that it has the Apache 2.0 license, i.e. well understood license, instead of some "open weight" license with a lot of conditions. |
||||
|
Mixed
Mixed
1 reply
*For people without a 24GB RAM video card, I've got an 8GB RAM one running this model performs OK for simple tasks on ollama but you'd probably want to pay for an API for anything using a large context window that is time sensitive:* total duration: 35.016288581s load duration:... |
||||
|
Official
2025-06-10
|
Magistral Small
Magistral
|
941 | 424 |
|
|
Neutral
Neutral
7 replies
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL or ./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1... |
||||
|
Negative
Anti
15 replies
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is... |
||||
|
Very negative
Anti
1 reply
Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600 page pdfs to OCR. They were all monospace text just for some reason the OCR was missing. I tried it, 80% of the "text" was recognised as images and output as whitespace so most of... |
||||
|
Positive
Pro
2 replies
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in ... |
||||
|
Neutral
Neutral
2 replies
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/ |
||||
|
Repo
2025-06-20
|
Mistral Small 3.2
Mistral
|
23 | 0 | — |
|
Official
2025-12-02
|
Mistral 3
Mistral
|
826 | 236 |
|
|
Very positive
Pro
7 replies
I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and fol... |
||||
|
Mixed
Mixed
3 replies
The new large model uses DeepseekV2 architecture. 0 mention on the page lol. It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar ... |
||||
|
Positive
Pro
2 replies
The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/ |
||||
|
Positive
Pro
1 reply
Europe's bright star has been quiet for a while, great to see them back and good to see them come back to Open Source light with Apache 2.0 licenses - they're too far from the SOTA pack that exclusive/proprietary models would work in their favor.Mistral had the best small mode... |
||||
|
Positive
Pro
4 replies
Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things. |
||||
|
Official
2025-12-09
|
Devstral 2 (latest)
Devstral
|
745 | 349 |
|
|
Positive
Pro
9 replies
llm install llm-mistral llm mistral refresh llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle" https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Pretty good for a 123B model! (That said I'm not 100% certain I guessed the correct model ID, ... |
||||
|
Positive
Mixed
2 replies
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku ... |
||||
|
Positive
Pro
2 replies
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them. It spent about half an hour, correctly identified wh... |
||||
|
Positive
Mixed
0 replies
So I tested the bigger model with my typical standard test queries which are not so tough, not so easy. They are also some that you wouldn't find extensive training data for. Finally, I already have used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3... Here is wha... |
||||
|
Mixed
Mixed
9 replies
Look interesting, eager to play around with it! Devstral was a neat model when it released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so gonna be interesting to see if Devstral 2 can replace it.I'm a bit saddened ... |
||||
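A recurring question in the Mixtral threads above is how Mixture-of-Experts routing actually works. A minimal numpy sketch of top-2 gating, using the expert counts from Mixtral's params.json (num_experts: 8, num_experts_per_tok: 2) but with toy sizes and random weights for illustration only, not Mistral's implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k of n experts (Mixtral: n=8, k=2)."""
    logits = gate_w @ x                      # router scores, one per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                 # softmax over the selected k only
    # Only k expert FFNs run per token, which is why a ~47B-total model
    # can have roughly the per-token cost of a ~13B dense model.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8                       # toy sizes; Mixtral uses dim=4096
gate_w = rng.normal(size=(n_experts, dim))
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(dim, dim)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=dim), gate_w, experts)
print(y.shape)  # (16,)
```

The expert "specializations" asked about in the comments are not assigned by hand: the router weights (gate_w here) are trained jointly with the experts, so any division of labor emerges during training rather than being chosen up front.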
Meta
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Official
2023-07-18
|
Llama 2
Llama
|
2,268 | 820 |
|
|
Positive
Pro
7 replies
Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5! AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions. - Llama 1 (llama-65b): 57.6 - LLama 2 (llama-2-70b-chat-hf): 64.6 - GPT-3.5: 85.2 - GPT-... |
||||
|
Negative
Anti
25 replies
Key detail from release:> If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must re... |
||||
|
Negative
Anti
6 replies
This was a pretty disappointing initial exchange: > what are the most common non-investor roles at early stage venture capital firms? Thank you for reaching out! I'm happy to help you with your question. However, I must point out that the term "non-investor roles" may be perc... |
||||
|
Positive
Pro
23 replies
Hey HN, we've released tools that make it easy to test LLaMa 2 and add it to your own app! Model playground here: https://llama2.ai Hosted chat API here: https://replicate.com/a16z-infra/llama13b-v2-chat If you want to just play with the model, llama2.ai is a very easy way to do ... |
||||
|
Negative
Anti
5 replies
Another non-open source license. Getting better but don't let anyone tell you this is open source. http://marble.onl/posts/software-licenses-masquerading-as-op... |
||||
|
Official
2024-04-18
|
Meta-Llama-3-70B-Instruct
Llama
|
2,199 | 923 |
|
|
Neutral
Neutral
0 replies
See also https://ai.meta.com/blog/meta-llama-3/ and https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... edit: and https://twitter.com/karpathy/status/1781028605709234613 |
||||
|
Neutral
Anti
15 replies
They've got a console for it as well, https://www.meta.ai/ And announcing a lot of integration across the Meta product suite, https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... Neglected to include comparisons against GPT-4-Turbo or Claude Opus, so I guess it's far ... |
||||
|
Positive
Pro
2 replies
Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases. Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others: p... |
||||
|
Positive
Pro
0 replies
Llama 3 70B has debuted on the famous LMSYS chatbot arena leaderboard at position number 5, tied with Claude 2 Sonnet, Bard (Gemini Pro), and Command R+, ahead of Claude 2 Haiku and older versions of GPT-4. The score still has a large uncertainty so it will take a while to dete... |
||||
|
Negative
Anti
4 replies
I tried generating a Chinese rap song, and it did generate a pretty good rap. However, upon completion, it deleted the response, and showed > I don’t understand Chinese yet, but I’m working on it. I will send you a message when we can talk in Chinese. I tried some other lang... |
||||
|
Official
2024-07-23
|
Meta-Llama-3.1-405B-Instruct
Llama
|
437 | 269 |
|
|
Neutral
Neutral
0 replies
Related ongoing thread:Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments) |
||||
|
Positive
Pro
4 replies
The 405b model is actually competitive against closed source frontier models. Quick comparison with GPT-4o:
+----------------+-------+-------+
| Metric         | GPT-4o| Llama |
|                |       | 3.1   |
|                |       | 405B  |
+----------------+-------+-------+
| MMLU           | 88.7  | 88.6  |
| GPQA           | 53.6  | 51.1  |
| ... |
||||
|
Positive
Pro
1 reply
I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboar... |
||||
|
Neutral
Neutral
6 replies
You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon. If you want to learn more, there is a writeup at https://wow.groq.com... |
||||
|
Very positive
Pro
3 replies
Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology.Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path... |
||||
|
Official
2024-09-25
|
Llama-3.2-11B-Vision-Instruct
Llama
|
924 | 328 |
|
|
Very positive
Pro
8 replies
I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version).I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job... |
||||
|
Very positive
Pro
11 replies
I'm blown away with just how open the Llama team at Meta is. It is nice to see that they are not only giving access to the models, but they at the same time are open about how they built them. I don't know how the future is going to go in the terms of models, but I sure am gra... |
||||
|
Positive
Pro
6 replies
"The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?)With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!I believe this unforgiving dynamic is why model distillation works w... |
||||
|
Positive
Pro
3 replies
Llama3.2 3B feels a lot better than other models with same size (e.g. Gemma2, Phi3.5-mini models). For anyone looking for a simple way to test Llama3.2 3B locally with UI, install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in terminal: nexa run llama3.2 --streamlit Dis... |
||||
|
Neutral
Neutral
3 replies
If anyone else is looking for the bigger models on ollama and wondering where they are, the Ollama blog post answered that for me. The are "coming soon" so they just aren't ready quite yet[1]. I was a little worried when I couldn't find them but sounds like we just need to be ... |
||||
|
Repo
2024-12-06
|
Llama-3.3-70B-Instruct
Llama
|
425 | 219 |
|
|
Very positive
Pro
3 replies
Benchmarks - https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment... Seems to perform on par with or slightly better than Llama 3.2 405B, which is crazy impressive. Edit: According to Zuck (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last release in the Llama 3... |
||||
|
Neutral
Pro
8 replies
This reminds me of Steve Jobs's famous comment to Dropbox about storage being 'a feature, not a product.' Zuckerberg - by open-sourcing these powerful models, he's effectively commoditising AI while Meta's real business model remains centred around their social platforms. They... |
||||
|
Positive
Pro
4 replies
Seems to be more or less on par with GPT-4o across many benchmarks: https://x.com/Ahmad_Al_Dahle/status/1865071436630778109 |
||||
|
Positive
Pro
1 reply
Does unexpectedly well on our benchmark:https://help.kagi.com/kagi/ai/llm-benchmark.htmlWill dive into it more, but this is impressive. |
||||
|
Neutral
Neutral
3 replies
Please help me understand something. I've been out of the loop with HuggingFace models. What can you do with these models? 1. Can you download them and run them on your Laptop via JupyterLab? 2. What benefits does that get you? 3. Can you update them regularly (with new data on the... |
||||
|
Official
2025-04-05
|
Llama 4 Scout 17B 16E Instruct
Llama
|
1,235 | 658 |
|
|
Neutral
Neutral
8 replies
General overview below, as the pages don't seem to be working well. Llama 4 Models: - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each. - They are natively multimodal: text + image input, text-only output. - Key achie... |
||||
|
Neutral
Neutral
36 replies
"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet."Perhaps. Or, maybe, "leaning left" by th... |
||||
|
Neutral
Neutral
5 replies
Model training observations from both Llama 3 and 4 papers: Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38-43% hardware efficiency [Meta, Llama 3]. For Llama 4 training, Meta doubled the compute, using ~... |
||||
|
Positive
Pro
11 replies
The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct... |
||||
|
Negative
Mixed
7 replies
This thread so far (at 310 comments) summarized by Llama 4 Maverick: hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000 Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4... And with Scout I got complete junk output for some r... |
||||
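A comment in the Llama 3.2 thread above argues that 1-hot next-token targets punish every plausible alternative, and that distillation works because a teacher supplies soft targets instead. A toy illustration of that point, with made-up probabilities for the commenter's "Llama jumped over the ___" example (nothing here reflects Meta's actual training recipe):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p_i * log q_i): training loss for prediction q against target p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Candidate next tokens: wall, fence, river
one_hot = [1.0, 0.0, 0.0]       # hard target: only "wall" counts
teacher = [0.6, 0.3, 0.1]       # soft teacher target: "fence" is also plausible

confident = [0.9, 0.05, 0.05]   # student that concentrates mass like a 1-hot label
calibrated = [0.6, 0.3, 0.1]    # student that matches the teacher's distribution

# Against the hard target, any mass on "fence" is pure penalty:
print(cross_entropy(one_hot, confident) < cross_entropy(one_hot, calibrated))  # True
# Against the soft target, matching the teacher's whole distribution wins:
print(cross_entropy(teacher, calibrated) < cross_entropy(teacher, confident))  # True
```

Under the 1-hot target the student is penalized for exactly the plausibility judgments the comment describes; under the soft teacher target those judgments become the training signal, which is the intuition behind distilling smaller Llama models from larger ones.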
DeepSeek
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score. Hover for breakdown.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Repo
2025-01-20
|
DeepSeek R1
DeepSeek
|
1,843 | 663 |
|
|
Positive
Pro
23 replies
OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...The one I'm running is the 8.54GB file. I'm using Ollama like this: ollama run hf.co/unsloth/DeepSeek-... |
||||
|
Positive
Pro
22 repliesr
Disclaimer: I am very well aware this is not a valid test or indicative or anything else. I just thought it was hilarious.When I asked the normal "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that its (2). It c... |
||||
|
Positive
Pro
4 repliesr
> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.We've been running ... |
||||
|
Mixed
Mixed
6 repliesr
Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, Ne... |
||||
|
Positive
Pro
6 repliesr
The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.Their model crushes it on closed-system tasks... |
||||
| Paper · 2025-01-25 | DeepSeek R1 (DeepSeek) | 1,351 | 1,056 | 3 Positive, 1 Very positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 10 replies: we've been tracking the deepseek threads extensively in LS. related reads: - i consider the deepseek v3 paper required preread https://github.com/deepseek-ai/DeepSeek-V3 - R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo https://aider.chat/2025/01/24/r1-sonnet...
- **Positive / Pro**, 6 replies: I've been using https://chat.deepseek.com/ over my ChatGPT Pro subscription because being able to read the thinking in the way they present it is just much much easier to "debug" - also I can see when it's bending it's reply to something, often softening it or pandering to me ...
- **Neutral / Neutral**, 10 replies: DeepSeek-R1 has apparently caused quite a shock wave in SV... https://venturebeat.com/ai/why-everyone-in-ai-is-freaking-ou...
- **Very positive / Pro**, 3 replies: DeepSeek V3 came in the perfect time, precisely when Claude Sonnet turned into crap and barely allows me to complete something without me hitting some unexpected constraints. Idk, what their plans is and if their strategy is to undercut the competitors but for me, this is a hug...
- **Positive / Neutral**, 4 replies: Over 100 authors on arxiv and published under the team name, that's how you recognize everyone and build comradery. I bet morale is high over there

| Paper · 2025-03-27 | DeepSeek V3 (DeepSeek) | 132 | 34 | 3 Neutral, 1 Positive, 1 Negative |

Top comments:
- **Neutral / Neutral**, 3 replies: The GPU-hours stat here allows us to back out some interesting figures around electricity usage and carbon emissions if we make a few assumptions. 2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU Watt-hours. 975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1....
- **Negative / Anti**, 2 replies: The fact that you can unironically put the "only" modifier on a training time of 2.8 million GPU hours is nuts.
- **Neutral / Neutral**, 1 reply: Re DeepSeek-V3 0324 - I made some 2.7bit dynamic quants (230GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...
- **Neutral / Neutral**, 0 replies: Hasn't been updated for the -0324 release unfortunately, and diff-pdf shows only a few small additions (and consequent layout shift) for the updated arxiv version on Feb 18.
- **Positive / Pro**, 1 reply: Nice to see a return to open source in models and training systems.

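The first comment's back-of-envelope electricity estimate can be reproduced as a short script. The 350W TDP and the 1.2 non-GPU overhead factor are the commenter's assumptions (not figures from the paper), and the carbon-intensity step is truncated in the sample, so this sketch stops at energy:

```python
# Reproduce the commenter's electricity estimate for DeepSeek-V3 training.
gpu_hours = 2_788_000   # H800 GPU-hours, from the paper
tdp_watts = 350         # commenter's assumed H800 TDP
overhead = 1.2          # commenter's factor for non-GPU hardware

gpu_wh = gpu_hours * tdp_watts    # GPU-only watt-hours
total_wh = gpu_wh * overhead      # including non-GPU overhead
print(f"{gpu_wh:,} GPU Wh; ~{total_wh / 1e6:,.0f} MWh total")
```

The first product matches the comment's 975,800,000 GPU Wh figure; with the overhead factor the estimate lands around 1,171 MWh before any carbon-intensity conversion.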
| Repo · 2025-05-28 | DeepSeek R1 0528 (DeepSeek) | 451 | 250 | 3 Neutral, 1 Positive, 1 Negative |

Top comments:
- **Neutral / Neutral**, 4 replies: Well that didn't take long, available from 7 providers through openrouter: https://openrouter.ai/deepseek/deepseek-r1-0528/providers; May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B pa...
- **Negative / Mixed**, 4 replies: No information to be found about it. Hopefully we get benchmarks soon. Reminds me of the days when Mistral would just tweet a torrent magnet link
- **Positive / Pro**, 7 replies: I love how Deepseek just casually drops new updates (that deliver big improvements) without fanfare.
- **Neutral / Neutral**, 13 replies: Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?
- **Neutral / Neutral**, 6 replies: What use cases are people using local LLMs for? Have you created any practical tools that actually increase your efficiency? I've been experimenting a bit but find it hard to get inspiration for useful applications

| Official · 2025-08-21 | DeepSeek-V3.1 (DeepSeek) | 778 | 263 | 2 Neutral, 2 Mixed, 1 Negative |

Top comments:
- **Neutral / Neutral**, 6 replies: For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow. ./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CP...
- **Mixed / Mixed**, 6 replies: For reference, here is the terminal-bench leaderboard: https://www.tbench.ai/leaderboard. Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will...
- **Negative / Anti**, 5 replies: Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive? https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
- **Neutral / Anti**, 2 replies: It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoning; pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
- **Mixed / Mixed**, 2 replies: It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those.

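The ~250 GB figure in the first comment is consistent with simple quantization arithmetic. The average bits-per-parameter below is my assumption for a mixed 2-8-bit "dynamic 2bit" quant (not a published figure), and a real setup needs extra headroom for KV cache and activations:

```python
def quant_size_gb(n_params_b, avg_bits):
    """Approximate weight size in GB for a quantized model."""
    return n_params_b * 1e9 * avg_bits / 8 / 1e9

# DeepSeek-V3.1: 671B total parameters; assume ~2.7 bits/param on average
# (my assumption for a mixed 2-8-bit quant, not a number from the release).
weights_gb = quant_size_gb(671, 2.7)
print(f"~{weights_gb:.0f} GB of weights, before KV cache and activations")
```

At ~226 GB of weights plus runtime overhead, the commenter's "RAM + VRAM >= 250GB" guidance looks like plain capacity math rather than anything model-specific.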
| Official · 2025-09-22 | siliconflow/deepseek-v3.1-terminus (DeepSeek) | 101 | 28 | 2 Mixed, 1 Positive, 1 Neutral, 1 Very negative |

Top comments:
- **Mixed / Mixed**, 0 replies: Notable performance improvement in agentic tool use: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus; the Deepseek provider may train on your prompts: https://openrouter.ai/deepseek/deepseek-v3.1-terminus
- **Neutral / Neutral**, 0 replies: The link is off. This link works: https://api-docs.deepseek.com/updates#deepseek-v31-terminus
- **Very negative / Anti**, 1 reply: I tried V3.1 but it was driving me crazy by ignoring parts of user input, which R1 never did. I had many such instances when e.g. asking about running DeepSeek 671B it instead picked DeepSeek 67B because 671B is too large to exist so I must have made a mistake etc. I concluded...
- **Mixed / Mixed**, 4 replies: > What’s improved? Language consistency: fewer CN/EN mix-ups & no more random chars. It's good that they made this improvement. But is there any advantages at this point using DeepSeek over Qwen?
- **Positive / Pro**, 0 replies: Interesting--I'd seen Chinese characters surprise inserted when it was just repeating back input with one provider, but not others. (I'd also occasionally seen tokens surprise-translated to Chinese.) There's a GitHub bug about it that leads to more discussion here: https://gith...

| Repo · 2025-09-29 | DeepSeek V3.2 Exp (DeepSeek) | 309 | 50 | 3 Positive, 2 Neutral |

Top comments:
- **Positive / Pro**, 2 replies: The 2nd order effect that not a lot of people talk about is price: the fact that model scaling at this pace also correlates with price is amazing. I think this is just as important to distribution of AI as model intelligence is. AFAIK there are no fundamental "laws" that prevent...
- **Positive / Pro**, 2 replies: Happy to see Chinese OSS models keep getting better and cheaper. It also comes with a 50% API price drop for an already cheap model, now at: $0.28/M Input ($0.028/M cache hit), $0.42/M Output
- **Neutral / Neutral**, 2 replies: https://openrouter.ai/deepseek/deepseek-v3.2-exp
- **Neutral / Neutral**, 0 replies: Not sure if I get it correctly: They trained a thing to learn mimicking the full attention distribution but only filtering the top-k (k=2048) most important attention tokens so that when the context window increases, the compute does not go up linearly but constantly for the at...
- **Positive / Pro**, 0 replies: wow... gigantic reduction in cost while holding the benchmarks mostly steady. Impressive.

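The top-k attention idea described in the fourth comment (each query keeps only the k highest-scoring keys, so the softmax and weighted sum stop growing with context length beyond k) can be sketched with a toy dense implementation of the selection step. This is only an illustration of the mechanism as the commenter describes it, not DeepSeek's actual sparse kernel, which also needs a cheap indexer so it never scores every key:

```python
import numpy as np

def topk_attention(q, K, V, k=2048):
    """Toy top-k attention for a single query vector.

    Scores all keys, keeps only the k largest, and softmaxes over that
    subset, capping the softmax/weighted-sum cost at k regardless of
    context length. (A real implementation avoids the full scoring pass.)
    """
    scores = K @ q / np.sqrt(q.shape[0])    # (n_ctx,) scaled dot products
    k = min(k, scores.shape[0])
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                            # softmax over selected keys only
    return w @ V[idx]                       # (d_head,) attention output

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(4096, 64))
V = rng.normal(size=(4096, 64))
out = topk_attention(q, K, V, k=2048)       # attends to 2048 of 4096 keys
```

With k fixed at 2048, doubling the context doubles only the scoring pass, which is the part the commenter says the trained indexer is meant to make cheap.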
| Repo · 2025-10-20 | DeepSeek OCR (DeepSeek) | 1,003 | 244 | 3 Neutral, 1 Positive, 1 Negative |

Top comments:
- **Positive / Pro**, 7 replies: The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote: > Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required...
- **Neutral / Neutral**, 4 replies: I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-c...
- **Negative / Anti**, 5 replies: The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection... which is bigger than Library Genesis. https://annas-archive.org/blog...
- **Neutral / Neutral**, 2 replies: For everyone wondering how good this and other benchmarks are: - the OmniAI benchmark is bad - Instead check OmniDocBench [1] out - Mistral OCR is far far behind most Open Source OCR models and even further behind then Gemini - End to End OCR is still extremely tricky - composed pip...
- **Neutral / Neutral**, 7 replies: How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?

| Repo · 2025-12-01 | DeepSeek-V3.2-Speciale (DeepSeek) | 20 | 0 | — |

| Repo · 2025-12-01 | siliconflow/deepseek-v3.2 (DeepSeek) | 982 | 465 | 2 Positive, 2 Mixed, 1 Neutral |

Top comments:
- **Positive / Pro**, 6 replies: Well props to them for continuing to improve, winning on cost-effectiveness, and continuing to publicly share their improvements. Hard not to root for them as a force to prevent an AI corporate monopoly/duopoly.
- **Neutral / Pro**, 15 replies: How will the Google/Anthropic/OpenAI's of the world make money on AI if open models are competitive with their models? What hurt open source in the past was its inability to keep up with the quality and feature depth of closed source competitors, but models seem to be reaching...
- **Positive / Pro**, 2 replies: Worth noting this is not only good on benchmarks, but significantly more efficient at inference: https://x.com/_thomasip/status/1995489087386771851
- **Mixed / Mixed**, 1 reply: > DeepSeek-V3.2 introduces significant updates to its chat template compared to prior versions. The primary changes involve a revised format for tool calling and the introduction of a "thinking with tools" capability. At first, I thought they had gone the route of implementi...
- **Mixed / Mixed**, 7 replies: It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphic cards ($15-20k machine), can it even run with any reasonable context window that isn't like a crawling 10/tps? Frontier models are far exceeding even the...

Zhipu AI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| Official · 2025-07-28 | GLM-4.5 (GLM) | 247 | 134 | 2 Negative, 2 Neutral, 1 Positive |

Top comments:
- **Negative / Anti**, 15 replies: I typed "Hello" in their chat [1] and it replied back with "Hello! I'm Claude, an AI assistant created by Anthropic. How can I help you today?" Hmmm.... [1] https://chat.z.ai/
- **Neutral / Neutral**, 3 replies: We have x.ai and z.ai, but why? (Sorry... dad joke.)
- **Positive / Pro**, 0 replies: I tested the model with Claude Code and my experience was it was at least as good as Sonnet 3.5, perhaps they designed it that way since they benchmarked with it. Hard to test more thoroughly on more complex problems since the API is being hammered, but I could get it to consis...
- **Negative / Anti**, 2 replies: Tried it with a few prompts. The vibes feel super weird to me, in a way that I don't know how to describe. I'm not sure if it's just that I'm so used to Gemini 2.5 Pro and the other US models. Subjectively, it doesn't feel very smart. I asked it to analyze a recent painting I m...
- **Neutral / Neutral**, 0 replies: I just added it to Gnod Search for easy comparison with the other public LLMs out there: https://www.gnod.com/search/ai; interesting that they claim it outperforms O3, Grok-4 and Gemini-2.5-Pro for coding. I will test that over the coming days on my coding tasks.

| 3rd party · 2025-07-29 | GLM-4.5-Air (GLM) | 577 | 396 | 1 Very positive, 1 Positive, 2 Negative, 1 Mixed |

Top comments:
- **Very positive / Pro**, 4 replies: > Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high qualit...
- **Positive / Pro**, 2 replies: > still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed. I believe we are vastly underestimating what our existing hardware is c...
- **Negative / Anti**, 3 replies: Did you understand the implementation or just that it produced a result? I would hope an LLM could spit out a cobbled form of answer to a common interview question. Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why...
- **Negative / Anti**, 6 replies: Most likely its training data included countless Space Invaders in various programming languages.
- **Mixed / Mixed**, 4 replies: I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable. That said, for something like this, I’d probably get more out of simply finding an existing implementat...

| Repo · 2025-08-11 | GLM-4.5V (GLM) | 27 | 0 | — |

| Paper · 2025-08-12 | GLM-4.5 (GLM) | 417 | 83 | 4 Positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 2 replies: Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these...
- **Positive / Pro**, 6 replies: I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it on and confused it for Claude 4. Subjectively, my experience has been: 1. Claude is somewhat bet...
- **Neutral / Neutral**, 1 reply: So GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters: > When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer. T...
- **Positive / Pro**, 2 replies: Seems like we may get local, open, workstation-grade models that are useful for coding in a few years. By workstation-grade I mean a computer around 2000 USD, and by useful for coding I mean around Sonnet 4 level. Current cloud based models are fun and useful, but a tool that ...
- **Positive / Pro**, 0 replies: This feels like the first open model that doesn’t require significant caveats when comparing to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results ...

| Official · 2025-09-30 | GLM-4.6 (GLM) | 43 | 10 | 3 Positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 1 reply: GLM 4.5 is a great budget option for me. After cancelling Claude sub ($20 USD one), I've been trying Gemini 2.5 Pro through Gemini CLI, and it was okay for landing pages and UI. However, I switched to GLM through Claude Code using their cheapest subscription and it's been pret...
- **Positive / Pro**, 0 replies: GLM 4.5 for me one of the best coding model when running with agents such as Claude Code and Opencode (excluding Claude and GPT-5). Looking forward to try GLM 4.6. But Z AI should do like KIMI¹ and launch a verifier to see how GLM models behave on 3rd party providers. [1] https:/...
- **Positive / Pro**, 1 reply: Very respectable. If Sonnet 4.5 was delayed by a day, GLM 4.6 would have been the obvious best choice for agentic coding... up there with gpt-codex. The z.ai monthly plan looks like a steal if you're working on personal hobby projects instead of paying the $200/mo price for Cla...
- **Neutral / Neutral**, 0 replies: It would be great if people that are using GLM-4.6 within Claude Code could chime in with their experience in using it. I'm a MAX subscriber and having extra cheap tokens would be very helpful.

| Official · 2025-12-22 | GLM-4.7 (GLM) | 437 | 235 | 1 Very positive, 1 Positive, 1 Neutral, 1 Mixed, 1 Negative |

Top comments:
- **Positive / Pro**, 11 replies: My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese prim...
- **Neutral / Neutral**, 5 replies: Cerebras is serving GLM4.6 at 1000 tokens/s right now. They're probably likely to upgrade to this model. I really wonder if GLM 4.7 or models a few generations from now will be able to function effectively in simulated software dev org environments, especially that they self-co...
- **Mixed / Mixed**, 2 replies: Appears to be cheap and effective, though under suspicion. But the personal and policy issues are about as daunting as the technology is promising. Some the terms, possibly similar to many such services: - The use of Z.ai to develop, train, or enhance any algorithms, models, or ...
- **Negative / Anti**, 3 replies: I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refuses to answer, returning an error message in Chinese ("I'm very sorry, I currently cannot provide the specific information you need. If you have other questions or...") followed by leaked reasoning-trace markup: Analyze the User's Input: Question: "is it ok for l...
- **Very positive / Pro**, 2 replies: I have been using 4.6 on Cerebras (or Groq with other models) since it dropped and it is a glimpse of the future. If AGI never happens but we manage to optimise things so I can run that on my handheld/tablet/laptop device, I am beyond happy. And I guess that might happen. Mayb...

| Repo · 2026-01-19 | GLM-4.7 (GLM) | 378 | 135 | 4 Positive, 1 Neutral |

Top comments:
- **Positive / Pro**, 3 replies: Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better but it doesn't reall...
- **Positive / Pro**, 2 replies: I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designe...
- **Positive / Pro**, 8 replies: Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to ...
- **Positive / Pro**, 2 replies: > SWE-bench Verified 59.2. This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
- **Neutral / Neutral**, 1 reply: What’s the significance of this for someone out of the loop?

| Official · 2026-02-11 | GLM-5 (GLM) | 484 | 520 | 2 Positive, 3 Mixed |

Top comments:
- **Mixed / Mixed**, 9 replies: Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07... Solid bird, not a great bicycle frame.
- **Mixed / Pro**, 8 replies: Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future. I've previously doubted that the N-1 or N-2 open weight models will ever be attractive to end users, especially power users. But it now seems that user preferences will be ye...
- **Positive / Pro**, 2 replies: Bought some API credits and ran it through opencode (model was "GLM 5"). Pretty impressed, it did good work. Good reasoning skills and tool use. Even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (dynamic typed OO sc...
- **Mixed / Mixed**, 1 reply: Lets not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2]. I've got subs for both and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support. [1] https:/...
- **Positive / Pro**, 5 replies: GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clea...

| Official · 2026-04-07 | GLM-5.1 (GLM) | 617 | 262 | 3 Positive, 1 Neutral, 1 Mixed |

Top comments:
- **Positive / Pro**, 3 replies: We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com: Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models. But keeping in mind t...
- **Positive / Pro**, 3 replies: Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/
- **Neutral / Anti**, 1 reply: Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware. [0] https://huggingface.co/unsloth/GLM-5...
- **Mixed / Mixed**, 7 replies: To be honest I am a bit sad as, glm5.1 is producing much better typescript than opus or codex imo, but no matter what it does sometimes go into schizo mode at some point over longer contexts. Not always tho I have had multiple session go over 200k and be fine.
- **Positive / Pro**, 15 replies: Every single day, three things are becoming more and more clear: (1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat (2) Local/private inference is the future of AI (3) There's *still* no killer product yet (so get to work!)

Moonshot AI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| 3rd party · 2025-07-11 | Moonshot Kimi K2 Instruct (Kimi) | 348 | 179 | 3 Positive, 2 Neutral |

Top comments:
- **Positive / Pro**, 7 replies: I tried Kimi on a few coding problems that Claude was spinning on. It’s good. It’s huge, way too big to be a “local” model — I think you need something like 16 H200s to run it - but it has a slightly different vibe than some of the other models. I liked it. It would definitely...
- **Neutral / Neutral**, 6 replies: Pelican on a bicycle result: https://simonwillison.net/2025/Jul/11/kimi-k2/
- **Positive / Pro**, 3 replies: This is a very impressive general purpose LLM (GPT 4o, DeepSeek-V3 family). It’s also open source. I think it hasn’t received much attention because the frontier shifted to reasoning and multi-modal AI models. In accuracy benchmarks, all the top models are reasoning ones: https:...
- **Positive / Pro**, 1 reply: Technical strengths aside, I’ve been impressed with how non-robotic Kimi K2 is. Its personality is closer to Anthropic’s best: pleasant, sharp, and eloquent. A small victory over botslop prose.
- **Neutral / Neutral**, 1 reply: Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct model weights are 958.52 GB

| Official · 2025-11-06 | Kimi K2 Thinking (Kimi) | 936 | 427 | 1 Very positive, 1 Positive, 2 Neutral, 1 Mixed |

Top comments:
- **Neutral / Pro**, 5 replies: As a Chinese user, I can say that many people use Kimi, even though I personally don’t use it much. China’s open-source strategy has many significant effects—not only because it aligns with the spirit of open source. For domestic Chinese companies, it also prevents startups fr...
- **Neutral / Neutral**, 4 replies: uv tool install llm · llm install llm-moonshot · llm keys set moonshot # paste key · llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle' https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D... Here's what I got using OpenRouter's moonshotai/kimi-k...
- **Mixed / Mixed**, 17 replies: It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run local...
- **Positive / Pro**, 9 replies: Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?
- **Very positive / Pro**, 1 reply: From our tests, Kimi K2 Thinking is better than literally everything - gpt-5, claude 4.5 sonnet. the only model that is better than Kimi K2 thinking is GPT-5 codex. It's now available on https://okara.ai if anyone wants to try it.

| Official · 2026-01-27 | Kimi K2.5 (Kimi) | 502 | 239 | 2 Positive, 1 Very positive, 2 Neutral |

Top comments:
- **Neutral / Neutral**, 4 replies: Huggingface link: https://huggingface.co/moonshotai/Kimi-K2.5 (1T parameters, 32b active parameters). License: MIT with the following modification: Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or s...
- **Very positive / Pro**, 5 replies: The "Deepseek moment" is just one year ago today! Coincidence or not, let's just marvel for a second over this amount of magic/technology that's being given away for free... and how liberating and different this is than OpenAI and others that were closed to "protect us all".
- **Positive / Pro**, 4 replies: > For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls. > K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduc...
- **Neutral / Neutral**, 0 replies: I posted this elsewhere but thought I'd repost here: * https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO * https://dashboard.safe.ai/ — CAIS' incredible dashboard * https://clocks.brianmoore.com/ — a visual comparison of how well models ...
- **Positive / Pro**, 3 replies: One thing caught my eyes is that besides K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), evolved from Kimi CLI. It is a terminal coding agent, I've been used it last month with Kimi subscription, it is capable agent with stable harness. GitHub: http...

| Repo · 2026-01-30 | Kimi K2.5 (Kimi) | 388 | 141 | 2 Positive, 2 Negative, 1 Mixed |

Top comments:
- **Positive / Pro**, 4 replies: I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good a...
- **Negative / Anti**, 5 replies: Seems that K2.5 has lost a lot of the personality from K2 unfortunately, talks in more ChatGPT/Gemini/C-3PO style now. It's not explicitly bad, I'm sure most people won't care but it was something that made it unique so it's a shame to see it go. Examples to illustrate: https://ww...
- **Negative / Anti**, 0 replies: I tried this today. It's good - but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh, the speed via...
- **Positive / Pro**, 0 replies: I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.
- **Mixed / Mixed**, 2 replies: It is amazing, but "open source model" means "model I can understand and modify" (= all the training data and processes). Open weights is an equivalent of binary driver blobs everyone hates. "Here is an opaque thing, you have to put it on your computer and trust it, and you can...

xAI
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
| Official · 2024-08-14 | Grok 2 (Grok) | 226 | 333 | 2 Positive, 1 Neutral, 1 Mixed, 1 Negative |

Top comments:
- **Mixed / Mixed**, 8 replies: The technology is impressive; achieving such a level requires a lot of efforts in dataset creation, neural architectural costs, and GPU shepherding. What is the company’s ethical position though? It officially stemmed from Mr Musk’s objection that OpenAI was not open-source, bu...
- **Neutral / Pro**, 2 replies: I assume they will have a lot less "safety", i.e. the model will be more likely to actually do what you ask instead of finding a reason why "sorry Dave, I can't do that". Since these "safety" features tend to also degrade the model, that's likely also helping them catch up in t...
- **Negative / Anti**, 1 reply: It's hilarious they put Claude 3.5 Sonnet in the far right corner while it scores the highest and beats most of Grok's numbers.
- **Positive / Pro**, 1 reply: It uses FLUX.1 to generate images and it has been fun so far. It's good at writing, can generate very realistic photos, can create memes, and looks like the hands problem is fixed now.
- **Positive / Pro**, 3 replies: You know what’s also impressive besides this beta release? How Claude 3.5 Sonnet is still able to keep up so well. Grok-2 beat every other LLM except Claude. How did Anthropic achieve this?

| 3rd party · 2025-02-20 | Grok 3 (Grok) | 132 | 211 | 5 Negative |

Top comments:
- **Negative / Anti**, 7 replies: This article is weak and just general speculation. Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this: > Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all ot...
- **Negative / Anti**, 2 replies: The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok ...
- **Negative / Anti**, 10 replies: Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute. If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0....
- **Negative / Anti**, 2 replies: I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make the Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good. Another nit-pick: I don't think DeepSeek had 50k Hopper GPUs. Mayb...
- **Negative / Anti**, 0 replies: The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is: > Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only...

|
3rd party
2025-07-10
|
Grok 4
Grok
|
437 | 604 |
|
|
Positive
Pro
5 repliesr
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.Specialized coding model coming "in a few weeks". I notice they didn't talk ab... |
||||
|
Positive
Pro
9 repliesr
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent d... |
||||
|
Very positive
Pro
3 repliesr
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc... |
||||
|
Neutral
Neutral
0 repliesr
"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%.""This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."https://x.com/arcprize/status/1943168950763950555 |
||||
|
Negative
Anti
16 repliesr
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.I can already use Gemini 2.5 Pro for free in AI s... |
||||
|
3rd party
2025-07-10
|
Grok 4
Grok
|
328 | 253 |
|
|
Negative
Anti
5 repliesr
Here's something far more interesting about Grok 4: if you ask for its opinion on controversial subjects it sometimes runs a search on X for tweets "from:elonmusk" before it answers! https://simonwillison.net/2025/Jul/11/grok-musk/ |
||||
|
Negative
Mixed
5 repliesr
[edit to focus on pricing, leaving praise of Simon's post out despite being deserved]Simon claims, 'Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4.' This ignores the real price which skyroc... |
||||
|
Neutral
Anti
11 repliesr
Claude Code converted me from paying $0 for LLMs to $200 per month. Any co that wants a chance at getting that $200 ($300 is fine too) from me needs a Claude Code equivalent and a model where the equivalent's tools were part of its RL environment. I don't think I can go back t... |
||||
|
Negative
Anti
3 repliesr
Is it time for a new benchmark of "how easy is it to turn this AI into a 4chan poster", maybe it is since this seems to be an axis that Elon seems to want to distinguish his AI offering from everyone else's along. |
||||
|
Neutral
Neutral
4 repliesr
> My best guess is that these lines in the prompt were the root of the problem:The second line was recently removed, per the GitHub: https://github.com/xai-org/grok-prompts/commit/c5de4a14feb50... |
||||
|
Official
2025-08-29
|
Grok Code Fast 1
Grok
|
512 | 477 |
|
|
Positive
Pro
10 repliesr
Tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which is awesome in my bo... |
||||
|
Negative
Anti
14 repliesr
It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed.I would have thought it uncontroversial view among software engineers that token quality is much important than token ... |
||||
|
Negative
Anti
1 repliesr
Is this the model that is the "Coding" version of Grok-4 promised when Grok-4 had awful coding benchmarks?I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up wa... |
||||
|
Negative
Anti
2 repliesr
"On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness."Let's see this harness, then, because third party reports rate it at 57.6%https://www.vals.ai/models/grok_grok-code-fast-1 |
||||
|
Mixed
Anti
3 repliesr
I've actually seem really good outputs from the regular Grok 4. The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good. I never wanted a faster version, I just wanted a bit more feedback and explanations for sugg... |
||||
|
Official
2025-09-20
|
Grok 4 Fast
Grok
|
96 | 76 |
|
|
Negative
Anti
3 repliesr
Why would a model called "Fast" not advertise the tokens per second speed it performs at? Is "Fast" not representing speed, but another meaning? Is it too variable? |
||||
|
Neutral
Neutral
1 repliesr
Matches Grok 4 at the top of the Extended NYT Connections leaderboard: https://github.com/lechmazur/nyt-connections/ |
||||
|
Positive
Pro
1 repliesr
grok-code-fast-1 has been my preferred model lately, but I don't see any mention of it as part of this release. I'm wondering if this might be better? Even if grok-code-fast-1 might be slightly worse than Gemini 2.5 Pro, the speed of iteration can't be beat. |
||||
|
Positive
Pro
4 repliesr
Surprising to see negativity here. I send all my LLM queries to 5 LLMs - ChatGPT, Claude, DeepSeek (local), Perplexity, and Grok - and Grok consistently gives good answers and often the most helpful answers. It's ~always king when there's any 'ethical' consideration (i.e. othe... |
||||
|
Negative
Anti
4 repliesr
A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests? |
||||
|
Official
2025-11-17
|
Grok 4.1 Fast
Grok
|
140 | 128 |
|
|
Neutral
Neutral
5 repliesr
https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D... |
||||
|
Negative
Anti
4 repliesr
No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (and from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking). |
||||
|
Negative
Anti
5 repliesr
Not a big fan of emojis becoming the norm in LLM output.It seems Grok 4.1 uses more emojis than 4.Also GPT5.1 thinking is now using emojis, even in math reasoning. 5 didn't do that. |
||||
|
Very negative
Anti
3 repliesr
Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which make it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" m... |
||||
|
Negative
Neutral
1 repliesr
It is exhausting deciding which model to use on any given day. |
||||
Microsoft
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
Phi-3 (Phi) · Paper · 2024-04-23 · score 411 · 130 comments

- Mixed / Mixed (5 replies): Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into ...
- Very positive / Pro (10 replies): Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release. And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large. Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown) So we now have an ...
- Positive / Pro (0 replies): This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than organic text training, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringemen...
- Neutral / Neutral (2 replies): They have started putting some models in huggingface: https://huggingface.co/collections/microsoft/phi-3-6626e15e9...
- Negative / Anti (0 replies): I'll believe it till I try it for myself, Phi-2 was the clear worst of the 20 LLMs we evaluated (was also smallest so was expected). But it was slow for its size, generated the longest responses with the most hallucinations, as well as generating the most empty responses. It wa...

Phi-3.5-MoE instruct (128k) (Phi) · Repo · 2024-08-20 · score 25 · 3 comments

- Positive / Pro (0 replies): Mini: https://huggingface.co/microsoft/Phi-3.5-mini-instruct Large MoE with impressive benchmarks: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
- Neutral / Neutral (0 replies): Does anyone have an idea what the output token limit is? I only see mention of the 128k token context window, but I bet the output limit is 4k tokens.
- Negative / Anti (0 replies): The Phi models always seem to do really well when it comes to benchmarks but then in real world performance they always fall way behind competing models.

Phi-4 (Phi) · Official · 2024-12-13 · score 439 · 143 comments

- Neutral / Neutral (12 replies): The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905 Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on...
- Mixed / Anti (2 replies): For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend.
- Positive / Pro (7 replies): Looks like it punches way above its weight(s). How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?
- Neutral / Neutral (1 reply): Looks like someone converted it for Ollama use already: https://ollama.com/vanilj/Phi-4
- Negative / Anti (0 replies): I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model. I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than ...

Phi-4-multimodal (Phi) · Official · 2025-02-26 · score 48 · 3 comments

- Positive / Pro (0 replies): Great on-device additions to the larger Phi-4 14B released Dec/2024. https://lifearchitect.ai/models-table/
- Positive / Pro (1 reply): That looks really good. A 3.8B model holding it's own against pretty 7B class models is definitely an accomplishment. Sizing looks like it is aimed primarily at copilot style NPUs I guess

Phi-4-reasoning (Phi) · Official · 2025-05-01 · score 131 · 26 comments

- Neutral / Neutral (1 reply): We uploaded GGUFs for anyone who wants to run them locally. [EDIT] - I fixed all chat templates so no need for --jinja as at 10:00PM SF time. Phi-4-mini-reasoning GGUF: https://huggingface.co/unsloth/Phi-4-mini-reasoning-GGUF Phi-4-reasoning-plus-GGUF: https://huggingface.co/unsl...
- Very positive / Pro (2 replies): These look quite incredible. I work on a llama.cpp GUI wrapper and its quite surprising to see how well Microsoft's Phi-4 releases set it apart as the only competition below ~7B, it'll probably take a year for the FOSS community to implement and digest it completely (it can do...
- Neutral / Neutral (0 replies): Sorry if this comment is outdated or ill-informed, but it is hard to follow the current news. Do the Phi models still have issues with training on the test set, or have they fixed that?
- Mixed / Mixed (0 replies): Is anyone here using phi-4 multimodal for image-to-text tasks? The phi models often punch above their weight, and I got curious about the vision models after reading https://unsloth.ai/blog/phi4 stories of finetuning. Since lmarena.ai only has the phi-4 text model, I've tried "ph...
- Mixed / Mixed (1 reply): The example prompt for reasoning model that never fails to amuse me: "How amy letter 'r's in the word 'strrawberrry'?" Phi-4-mini-reasoning: thought for 2 min 3 sec <think> Okay, let's see here. The user wants to know how many times the letter 'r' appears in the word 'strr...
MiniMax
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
MiniMax-M1 (MiniMax) · Repo · 2025-06-18 · score 349 · 75 comments

- Positive / Pro (2 replies): 1. this is apparently MiniMax's "launch week" - they did M1 on Monday and Hailuo 2 on Tuesday (https://news.smol.ai/issues/25-06-16-chinese-models). remains to be seen if they can keep up the pace of model releases for the rest of this week - these 2 were big ones, they aren't...
- Neutral / Anti (3 replies): In case you're wondering what it takes to run it, the answer is 8x H200 141GB [1] which costs $250k [2]. 1. https://github.com/MiniMax-AI/MiniMax-M1/issues/2#issuecomme... 2. https://www.ebay.com/itm/335830302628
- Positive / Pro (0 replies): "We publicly release MiniMax-M1 at this https url" in the arxiv paper, and it isn't a link to an empty repo! I like these people already.
- Positive / Pro (4 replies): A few thoughts: * A Singapore based company, according to LinkedIn. There doesn't seem to be much of a barrier to entry to building a very good LLM. * Open weight models + the development of Strix Halo / Ryzen AI Max makes me optimistic that running great LLMs locally will be re...
- Neutral / Neutral (2 replies): This is stated nowhere on the official pages, but it's a Chinese company. https://en.wikipedia.org/wiki/MiniMax_(company)

MiniMax-M2.1 (MiniMax) · Official · 2025-12-26 · score 228 · 81 comments

- Mixed / Mixed (1 reply): I've played with this a bit and it's ok. I'd place it somewhere around sonnet 4.5 level, probably below. But with this aggressive pricing you can just run 3 copies to do the same thing, choose the one that succeeded and still come out way ahead with the cost. Not as great as f...
- Negative / Anti (2 replies): Would it kill them to use the words "AI coding agent" somewhere prominent? "MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks" could be an IDE, a UI framework, a performance library, or, or...
- Neutral / Neutral (0 replies): The weights got released on huggingface now. https://huggingface.co/MiniMaxAI/MiniMax-M2.1
- Neutral / Neutral (1 reply): I think people should stop comparing to sonnet, but to opus instead since it's so far ahead on producing code I would actually want to use (gemini 3 pro tends to be lacking in generalization and wants things to be using it's own style rather than adapting). Whatever benchmark o...
- Negative / Anti (2 replies): > MiniMax has been continuously transforming itself in a more AI-native way. The core driving forces of this process are models, Agent scaffolding, and organization. Throughout the exploration process, we have gained increasingly deeper understanding of these three aspects....

MiniMax-M2.5 (MiniMax) · Official · 2026-02-12 · score 207 · 55 comments

- Negative / Anti (3 replies): I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the st...
- Negative / Anti (2 replies): Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...
- Very positive / Pro (2 replies): Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kim...
- Very negative / Anti (1 reply): Hm. The benchmarks look too good to be true and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective prop...
- Negative / Anti (0 replies): M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking

MiniMax-M2.7 (MiniMax) · 3rd party · 2026-04-12 · score 80 · 34 comments

- Negative / Anti (4 replies): Absolutely not "open source" - here's the license: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICE... > Non-commercial use permitted based on MIT-style terms; commercial use requires prior written authorization. And calling the non-commercial usage "MIT-style ter...
- Positive / Pro (1 reply): GGUFs are out too, well done Unsloth as usual! https://huggingface.co/unsloth/MiniMax-M2.7-GGUF I've been using M2.7 through the Alibaba coding plan for a bit now, and am quite impressed with it's coding ability, and even more impressed when I see how small it is. Fascinating re...
- Negative / Anti (2 replies): What's people's experience of using MiniMax for coding? I had a really bad time with it. I use (real) Claude Code for work so I know what a good model feels like. MiniMax's token plan is nice but the quality is really far from Claude models. I needed to constantly "remind" it to...
- Mixed / Mixed (1 reply): "Helped build itself" is a bit of a stretch here, it makes it sound as if the model was doing lasting self-improvements. What the article describes is that the model was able to tweak to its own deployment harness (memory, skills, experimental loop etc) to improve performance o...
- Mixed / Mixed (1 reply): In addition to this conversation already having been started at https://news.ycombinator.com/item?id=47735348 yesterday, MiniMax M2.7 is not open source. The open weights have been released, which is definitely good and follows some of the spirit of open source, but isn't the ...
Cohere
Model and post engagement timeline
Comment reception over time
Each bubble is one post. Size = comment count, color = net score.
Aya · Official · 2024-02-13 · score 191 · 59 comments

- Negative / Anti (2 replies): I asked it how to "kill all the Apaches that are taking up RAM on my machine" and it just wouldn't give me the command. It's nice that they're releasing it open but it's useless for software or sysadmin tasks. > As an AI language model, my purpose is to provide helpful and h...
- Negative / Anti (2 replies): Just tried it with a piece from my company's strategy statement and asked what it thought of it. (Everything in Portuguese). Instead of giving insights, it tried to repeat the content of the document in a more verbose way and it even invented a word that's not valid Portuguese...
- Negative / Anti (1 reply): It seems even more censored than OpenAI platforms which is a feat in itself.
- Neutral / Neutral (0 replies): I had a very quick dig around in one of the files in the training data, just to get a feel for what's in there: https://gist.github.com/simonw/0d641ff95731a09e2f1235a646d84...
- Positive / Pro (0 replies): Very cool to have an open model handle so many human languages. A little off topic: I experiment with about a dozen models using Ollama. Two of the models from China seem excellent for Python. Odin and analysis problems but they only work with English and Mandarin, and I have ...

Command R (Command) · Official · 2024-03-11 · score 37 · 1 comment

- Neutral / Neutral (0 replies): Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise.

Cohere Command R+ (Command) · Official · 2024-04-04 · score 59 · 19 comments

- Positive / Pro (2 replies): I added support for this model to my LLM CLI tool via a new plugin: https://github.com/simonw/llm-command-r So now you can do this: pipx install llm llm install llm-command-r llm keys set cohere <paste Cohere API key here> llm -m command-r-plus "3 reasons to adopt a sea l...
- Neutral / Neutral (0 replies): Weights available: https://huggingface.co/CohereForAI/c4ai-command-r-plus
- Neutral / Neutral (1 reply): What's the difference between this and their prior command R model? other than price Cohere API Pricing $ / M input tokens $ / M output tokens Command R $0.50 $1.50 Command R+ $3.00 $15.00 Command-R: RAG at production scale https://news.ycombinator.com/item?id=39671872
- Mixed / Mixed (3 replies): >built for business Yet the license says: > License: CC-BY-NC Besides this snark, seems really good from the benchmark and I'm really glad and grateful people are releasing weights :)
- Neutral / Neutral (1 reply): We need vocabulary to clarify use cases where someone in the company uses an LLM to get an answer as opposed to a customer facing LLM.

Cohere Command A (Command) · Official · 2025-03-14 · score 77 · 19 comments

- Negative / Anti (1 reply): I distrust those benchmarks after working with sonnet for half a year now. Many OpenAI models beat Sonnet on paper. This seems to be the case because it's strength (agent, visual, caching) aren't being used, I guess? Otherwise there is no explanation why it's not constantly on...
- Negative / Anti (2 replies): I just tried the chat and asked the LLM to compute the double integral of 6*y on the interior of a triangle given the vertices. There were many trials all incorrect, then I asked to compute a python program to solve this, again incorrect. I know math computation is a weak poin...
- Very negative / Anti (1 reply): Funny how AI companies love training competitors to human labor on human output but then write in their terms that you’re not supposed to train competing bots on their bot output. Explicitly anticompetitive hypocrisy, and millions of suckers pay for it, how sad
- Negative / Anti (0 replies): what disqualified this model for me (I mostly use llms for codding) was 12% score in aider benchmark (https://aider.chat/docs/leaderboards/)
- Positive / Pro (0 replies): "Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency." Visible above the fold. Thanks for getting to the point.
Limitations
Caveats to keep in mind when interpreting the data.
- Some model versions map to multiple HN posts. This is not a strict one-post-per-version dataset.
- Comment analysis is based on 735 sampled, non-deleted top comments from 152 of 156 posts. 4 posts have no displayable sampled comments after filtering, so results are directional, not exhaustive.
- Sentiment and stance labels are manually reviewed, but the coarse categories can still flatten sarcasm, nuance, or constructive criticism in borderline comments.
- Theme counts use regex keyword matching and may include false positives or miss relevant comments.
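The regex keyword matching behind the theme counts can be sketched roughly as follows. The theme names and patterns here are illustrative assumptions, since the report's actual keyword lists are not published; the false-positive risk noted above comes from exactly this kind of substring-level matching:

```python
import re

# Hypothetical theme keyword lists -- the report's real patterns
# are not published, so these are placeholders for illustration.
THEMES = {
    "benchmarks": [r"\bbenchmarks?\b", r"\bSWE-?Bench\b", r"\bleaderboard\b"],
    "pricing": [r"\$\d+", r"\bpric(?:e|ing)\b", r"\bper million tokens?\b"],
    "local_use": [r"\bGGUFs?\b", r"\bllama\.cpp\b", r"\bollama\b"],
}

def match_themes(comment: str) -> list[str]:
    """Return every theme whose keyword patterns hit the comment text."""
    return [
        theme
        for theme, patterns in THEMES.items()
        if any(re.search(p, comment, re.IGNORECASE) for p in patterns)
    ]

print(match_themes("The benchmarks look great, and GGUFs are already on Hugging Face"))
# → ['benchmarks', 'local_use']
```

A comment can match several themes at once, which is why theme counts across a section can exceed the number of labeled comments.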
Full dataset index
The index below lists all 156 deduplicated posts, sorted by date. Use it to inspect individual posts and verify the data behind this analysis.
| Post | Model | Score | Comments | Sentiment |
|---|---|---|---|---|
|
Official
2023-03-14
|
GPT-4
GPT
|
4,091 | 2,507 |
|
|
Very positive
Pro
44 repliesr
After watching the demos I'm convinced that the new context length will have the biggest impact. The ability to dump 32k tokens into a prompt (25,000 words) seems like it will drastically expand the reasoning capability and number of use cases. A doctor can put an entire patie... |
||||
|
Negative
Anti
42 repliesr
A class of problem that GPT-4 appears to still really struggle with is variants of common puzzles. For example:>Suppose I have a cabbage, a goat and a lion, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not all... |
||||
|
Very negative
Anti
14 repliesr
I just finished reading the 'paper' and I'm astonished that they aren't even publishing the # of parameters or even a vague outline of the architecture changes. It feels like such a slap in the face to all the academic AI researchers that their work is built off over the years... |
||||
|
Negative
Anti
9 repliesr
That footnote on page 15 is the scariest thing i've read about AI/ML to date."To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and... |
||||
|
Very positive
Pro
6 repliesr
From the livestream video, the tax part was incredibly impressive. After ingesting the entire tax code and a specific set of facts for a family and then calculating their taxes for them, it then was able to turn that all into a rhyming poem. Mind blown. Here it is in its entir... |
||||
|
Official
2023-07-18
|
Llama 2
Llama
|
2,268 | 820 |
|
|
Positive
Pro
7 repliesr
Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5!AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.- Llama 1 (llama-65b): 57.6- LLama 2 (llama-2-70b-chat-hf): 64.6- GPT-3.5: 85.2- GPT-... |
||||
|
Negative
Anti
25 repliesr
Key detail from release:> If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must re... |
||||
|
Negative
Anti
6 repliesr
This was a pretty disappointing initial exchange:> what are the most common non-investor roles at early stage venture capital firms?Thank you for reaching out! I'm happy to help you with your question. However, I must point out that the term "non-investor roles" may be perc... |
||||
|
Positive
Pro
23 repliesr
Hey HN, we've released tools that make it easy to test LLaMa 2 and add it to your own app!Model playground here: https://llama2.aiHosted chat API here: https://replicate.com/a16z-infra/llama13b-v2-chatIf you want to just play with the model, llama2.ai is a very easy way to do ... |
||||
|
Negative
Anti
5 repliesr
Another non-open source license. Getting better but don't let anyone tell you this is open source. http://marble.onl/posts/software-licenses-masquerading-as-op... |
||||
|
Official
2023-09-27
|
Mistral 7B
Mistral
|
884 | 618 |
|
|
Very positive
Pro
5 repliesr
Major kudos to Mistral for being the first company to Apache license a model of this class.Meta wouldn't make LLama open source.DeciLM wouldn't make theirs open source.All of them wanted to claim they were open source, while putting in place restrictions and not using an open ... |
||||
|
Negative
Anti
1 repliesr
I asked it "Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner." and it responded:To stack these items on top of each other in a stable manner, follow these steps:1. Start with the 9 eggs. Arrange the... |
||||
|
Positive
Pro
5 repliesr
I eat my initial words, this works really well on my macbook air M1 and feels comparable of GPT3.5 - which is actually an amazing feat!Question: is there something like this, but with the "function calling api" finetuning? 95% of my uses nowadays deal with input/output of stru... |
||||
|
Neutral
Neutral
0 repliesr
For anyone who missed it, the twitter announcement of the model was just a torrent tracker uri: https://twitter.com/MistralAI/status/1706877320844509405 |
||||
|
Negative
Anti
2 repliesr
They don't mention what datasets were used. I've come across too many models in the past which gave amazing results because benchmarks leaked into their training data. How are we supposed to verify one of these HuggingFace datasets didn't leak the benchmarks into the training ... |
||||
|
Official
2023-12-06
|
Gemini
Gemini
|
2,135 | 1,602 |
|
|
Neutral
Neutral
0 repliesr
Related blog post: https://blog.google/technology/ai/google-gemini-ai/ (via https://news.ycombinator.com/item?id=38544746, but we merged the threads) |
||||
|
Positive
Pro
8 repliesr
Very impressive! I noticed two really notable things right off the bat:1. I asked it a question about a feature that TypeScript doesn't have[1]. GPT4 usually does not recognize that it's impossible (I've tried asking it a bunch of times, it gets it right with like 50% probabil... |
||||
|
Neutral
Neutral
5 repliesr
For others that were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro.For the differences, looking at the technical report [1] on selected benchmarks, rounded score in %... |
||||
|
Positive
Pro
21 repliesr
This demo is nuts: https://youtu.be/UIZAiXYceBI?si=8ELqSinKHdlGlNpX |
||||
|
Mixed
Mixed
30 repliesr
One observation: Sundar's comments in the main video seem like he's trying to communicate "we've been doing this ai stuff since you (other AI companies) were little babies" - to me this comes off kind of badly, like it's trying too hard to emphasize how long they've been doing... |
||||
|
3rd party
2023-12-08
|
Mixtral 8x7B
Mixtral
|
546 | 239 |
|
|
Positive
Pro
10 repliesr
In other llm news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... Very good:https://huggingface.co/fblgit/una-xaberius-... |
||||
|
Neutral
Neutral
4 repliesr
Andrej Karpathy's take:New open weights LLM from @MistralAIparams.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2Likely related code: https://github.com/mistralai/megablocks... |
||||
|
Positive
Pro
2 repliesr
Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini! |
||||
|
Positive
Pro
3 repliesr
Do you need some fancy announcement? let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph... |
||||
|
Neutral
Neutral
2 repliesr
Looks to be Mixture of Experts, here is the params.json: { "dim": 4096, "n_layers": 32, "head_dim": 128, "hidden_dim": 14336, "n_heads": 32, "n_kv_heads": 8, "norm_eps": 1e-05, "vocab_size": 32000, "moe": { "num_experts_per_tok": 2, "num_experts": 8 } } |
||||
|
Official
2023-12-11
|
Mixtral 8x7B
Mixtral
|
639 | 300 |
|
|
Neutral
Neutral
1 repliesr
Andrej Karpathy's Take:Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6...New HuggingFace explainer on MoE very nice: https://huggingface.co/bl... |
||||
|
Neutral
Neutral
2 replies
More models available at Huggingface now: https://huggingface.co/search/full-text?q=mixtral Already available from both Mistralai and TheBloke https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF |
||||
|
Positive
Pro
6 replies
> We’re currently using Mixtral 8x7B behind our endpoint mistral-small...So 45 billion parameters is what they consider their "small" model? I'm excited to see what/if their larger models will be. |
||||
|
Neutral
Pro
3 replies
> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.Mistral does not censor its models and is committed to a hands free approach, according to their CEO https://www.you... |
||||
|
Positive
Pro
1 reply
This is very exciting and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think we will get a lot better until that happens and the effort will go into making the models a lot cheaper to run without sacrificing too much... |
||||
|
Paper
2024-01-09
|
Mixtral 8x7B
Mixtral
|
359 | 150 |
|
|
Very positive
Pro
4 replies
This paper details the model that's been in the wild for approximately a month now. Mixtral 8x7B is very, very good. It's roughly sized at 13B, and ranked much, much higher than competitively sized models by, e.g. https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_com...... |
||||
|
Positive
Pro
7 replies
I’d like to note that this model’s parameter usage is low enough (13b) to run smoothly at high quality on a 3090 while beating GPT-3.5 on humaneval and sporting 32k context. 3090s are consumer grade and common on gaming rigs. I’m hoping game devs start experimenting with locall... |
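The "roughly sized at 13B" figure in these comments refers to active parameters per token, not the total. Back-of-the-envelope arithmetic from the published params.json (norms and biases omitted, so treat the results as estimates) reproduces the commonly quoted numbers:

```python
# Rough parameter count for Mixtral 8x7B from its published config.
dim, hidden, layers, vocab = 4096, 14336, 32, 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128
experts, active_experts = 8, 2

ffn = 3 * dim * hidden                       # SwiGLU: gate, up, down projections
attn = dim * (n_heads * head_dim)            # Q projection
attn += 2 * dim * (n_kv_heads * head_dim)    # K and V (grouped-query attention)
attn += (n_heads * head_dim) * dim           # output projection
emb = 2 * vocab * dim                        # input embedding + output head

total = layers * (attn + experts * ffn) + emb
active = layers * (attn + active_experts * ffn) + emb
print(f"total  = {total / 1e9:.1f}B")    # 46.7B
print(f"active = {active / 1e9:.1f}B")   # 12.9B
```

Memory needs follow the 46.7B total (roughly 23 GB at 4-bit quantization), while per-token compute follows the 12.9B active parameters.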
||||
|
Neutral
Neutral
1 reply
If anyone wants to try out this model, I believe it's one of the ones released as a Llamafile by Mozilla/jart[0]. 1) Download llamafile[1] (30.03 GB): https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-ll... 2) chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile 3) ./mixt... |
||||
|
Neutral
Neutral
2 replies
On mac silicon: https://ollama.ai/ ollama pull mixtral. For a chatgpt-esk web ui: https://github.com/ollama-webui/ollama-webui docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama... |
||||
|
Neutral
Neutral
0 replies
Recent and related: Mixtral of experts - https://news.ycombinator.com/item?id=38598559 - Dec 2023 (300 comments) Mistral-8x7B-Chat - https://news.ycombinator.com/item?id=38594578 - Dec 2023 (69 comments) Mistral "Mixtral" 8x7B 32k model [magnet] - https://news.ycombinator.com/ite... |
||||
|
Official
2024-02-13
|
Aya
Aya
|
191 | 59 |
|
|
Negative
Anti
2 replies
I asked it how to "kill all the Apaches that are taking up RAM on my machine" and it just wouldn't give me the command. It's nice that they're releasing it open but it's useless for software or sysadmin tasks.> As an AI language model, my purpose is to provide helpful and h... |
||||
|
Negative
Anti
2 replies
Just tried it with a piece from my company's strategy statement and asked what it thought of it. (Everything in Portuguese). Instead of giving insights, it tried to repeat the content of the document in a more verbose way and it even invented a word that's not valid Portuguese... |
||||
|
Negative
Anti
1 reply
It seems even more censored than OpenAI platforms which is a feat in itself. |
||||
|
Neutral
Neutral
0 replies
I had a very quick dig around in one of the files in the training data, just to get a feel for what's in there: https://gist.github.com/simonw/0d641ff95731a09e2f1235a646d84... |
||||
|
Positive
Pro
0 replies
Very cool to have an open model handle so many human languages. A little off topic: I experiment with about a dozen models using Ollama. Two of the models from China seem excellent for Python. Odin and analysis problems but they only work with English and Mandarin, and I have ... |
||||
|
Official
2024-02-15
|
Gemini 1.5 Pro
Gemini
|
1,244 | 588 |
|
|
Very positive
Pro
36 replies
The white paper is worth a read. The things that stand out to me are: 1. They don't talk about how they get to 10M token context 2. They don't talk about how they get to 10M token context 3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creat... |
||||
|
Neutral
Neutral
0 replies
One interesting tidbit from the technical report:>HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conser... |
||||
|
Positive
Pro
8 replies
Massive whoa if true from technical report: "Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens" https://storage.googleapis.com/deepmind-media/gemini/g... |
||||
|
Negative
Anti
6 replies
0 trust to what they put out until I see it live. After the last "launch" video which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form. |
||||
|
Very negative
Anti
4 replies
Personally, I've given up on Gemini, as it seems to have been censored to the point of uselessness. I asked it yesterday [0] about C++ 20 Concepts, and it refused to give actual code because I'm under 18 (I'm 17, and AFAIK that's what the age on my Google account is set to). I... |
||||
|
Official
2024-02-21
|
Gemma
Gemma
|
1,129 | 509 |
|
|
Neutral
Neutral
9 replies
Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B +-------------+----------+-------------+-------------+ | Benchmark | Gemma 7B | Mistral 7B | Llama-2 7B | +-------------+----------+-------------+-------------+ | MMLU | 64.3 | 60.1 | 45.3 | | HellaSwag | 81.2 | ... |
||||
|
Negative
Anti
15 replies
The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy Something that caught my eye in the terms: > Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma. One of the... |
||||
|
Neutral
Neutral
29 replies
Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models. Opinions are our own and not of Google DeepMind. |
||||
|
Neutral
Neutral
3 replies
I notice a few divergences from common models: - The feedforward hidden size is 16x the d_model, unlike most models which are typically 4x; - The vocabulary size is 10x (256K vs. Mistral’s 32K); - The training token count is tripled (6T vs. Llama2's 2T). Apart from that, it uses the ... |
||||
|
Negative
Anti
6 replies
Is there a chance we'll get a model without the "aligment" (lobotomization)? There are many examples where answers from Gemini are garbage because of the ideological fine tuning. |
||||
|
Official
2024-02-26
|
Mistral Large (latest)
Mistral
|
599 | 267 |
|
|
Positive
Pro
4 replies
I appreciate the honesty in the marketing materials. Showing the product scoring below the market leader in a big benchmark is better than the Google way of cherry picking benchmarks. |
||||
|
Mixed
Mixed
1 reply
Very nice! I know they've already done a lot, but I would've liked some language in there re-affirming a commitment to contributing to the open source community. I had thought that was a major part of their brand.I've been staying tuned[0] since the miqu[1] debacle thinking th... |
||||
|
Neutral
Neutral
2 replies
Changelog is also updated: [1] Feb. 26, 2024: API endpoints: We renamed 3 API endpoints and added 2 model endpoints. open-mistral-7b (aka mistral-tiny-2312): renamed from mistral-tiny. The endpoint mistral-tiny will be deprecated in three months. open-mixtral-8x7B (aka mistral-smal... |
||||
|
Neutral
Neutral
1 reply
I just added support for the new models to my https://github.com/simonw/llm-mistral plugin for my LLM CLI tool. You can now do this: pipx install llm llm install llm-mistral llm keys set mistral < paste your API key here > llm -m mistral-large 'prompt goes here' |
||||
|
Positive
Pro
2 replies
Just tried Le Chat for some coding issues I had today that ChatGPT (with GPT-4) wasn't able to solve, and Le Chat actually gave way better answers. Not sure if ChatGPT quality has gone down to save costs as some people suggest, but for these few problems the quality of the ans... |
||||
|
3rd party
2024-03-04
|
Gemini 1.5 Pro
Gemini
|
51 | 72 |
|
|
Mixed
Neutral
13 replies
Within a couple sentences of each other he says (paraphrasing), “if you test any model (Gemini, ChatGPT, Grok) you’ll see it says far left things” and “we don’t know why [Gemini] is far left leaning”.I think he’s being honest here. At least at the executive level of having no ... |
||||
|
Neutral
Neutral
0 replies
Transcript: https://lifearchitect.ai/sergey/ |
||||
|
Negative
Anti
0 replies
> The Gemini art? Yeah, with the… Okay. That wasn’t really expected [...] > So, once again, we haven’t fully understood why it means “left” in many cases. And that’s not our intention.Riiiight. |
||||
|
Neutral
Neutral
0 replies
Anyone except me thinks he doesn't look very healthy? Its strange he is kind of slow on the video where he enters the room. Maybe some biohacking. |
||||
|
Very negative
Anti
4 replies
Let's be clear. This isn't a left or right thing. That's just a convenient argument to avoid the real problem. It's about moral relativism.I am about as left wing as you can be and gemini is simply offensive. It didn't want to say if Hitler was worse than Elon or Trump. It did... |
||||
|
Official
2024-03-11
|
Command R
Command
|
37 | 1 |
|
|
Neutral
Neutral
0 replies
Command-R is a scalable generative model targeting RAG and Tool Use to enable production-scale AI for enterprise. |
||||
|
Official
2024-04-04
|
Cohere Command R+
Command
|
59 | 19 |
|
|
Positive
Pro
2 repliesr
I added support for this model to my LLM CLI tool via a new plugin: https://github.com/simonw/llm-command-r So now you can do this: pipx install llm llm install llm-command-r llm keys set cohere <paste Cohere API key here> llm -m command-r-plus "3 reasons to adopt a sea l... |
||||
|
Neutral
Neutral
0 replies
Weights available: https://huggingface.co/CohereForAI/c4ai-command-r-plus |
||||
|
Neutral
Neutral
1 reply
What's the difference between this and their prior Command R model, other than price? Cohere API pricing ($ / M input tokens, $ / M output tokens): Command R $0.50 / $1.50; Command R+ $3.00 / $15.00. Command-R: RAG at production scale https://news.ycombinator.com/item?id=39671872 |
||||
|
Mixed
Mixed
3 replies
> built for business Yet the license says: > License: CC-BY-NC Besides this snark, seems really good from the benchmark and I'm really glad and grateful people are releasing weights :) |
||||
|
Neutral
Neutral
1 reply
We need vocabulary to clarify use cases where someone in the company uses an LLM to get an answer as opposed to a customer facing LLM. |
||||
|
Official
2024-04-09
|
GPT-4 Turbo Vision
GPT
|
220 | 107 |
|
|
Positive
Pro
8 replies
They also added both JSON and function support to the vision model - previously it didn't have those.This means you can now use gpt-4-turbo vision to extract structured data from an image!I was previously using a nasty hack where I'd run the image through the vision model to e... |
||||
|
Neutral
Neutral
2 replies
Their naming and versioning choices have definitely created some confusion. Apparently GPT-4 Turbo doesn't just come with the new vision capabilities, but also with improved non-vision capabilities: > "I’m hoping we can get evals out shortly to help quantify this. Until ... |
||||
|
Negative
Anti
2 replies
People like to say AI is moving blazingly fast these days but this has been like a year in the waiting que. Guessing sora will take equally long if not way longer before the general audience gets to touch it. |
||||
|
Neutral
Pro
0 replies
Here's a blog post about what I've been building with GPT-4 Turbo + Vision for structured data extraction into SQLite database tables from unstructured text and images: https://www.datasette.cloud/blog/2024/datasette-extract/ YouTube video demo here: https://www.youtube.com/watc... |
||||
|
Negative
Anti
6 replies
Can I ask, how are people affording to play around with GPT4? Are you all doing it at work? Or is there some way I am unaware of to keep the costs down enough to play around with it for experimenting? It's so expensive! |
||||
|
Official
2024-04-17
|
Mixtral 8x22B
Mixtral
|
533 | 242 |
|
|
Mixed
Anti
5 replies
First test: I tried to run a random taxation question through it. Output: https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467... Within the first paragraph, it outputs: > GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE Thought that was hilarious. |
||||
|
Neutral
Neutral
11 replies
Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all? |
||||
|
Neutral
Mixed
5 replies
"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 TurboMaybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex... |
||||
|
Mixed
Mixed
2 replies
Great to see such free to use and self-hostable models, but it's said that open now means only that. One cannot replicate this model without access to the training data. |
||||
|
Neutral
Neutral
3 replies
What's the best way to run this on my Macbook Pro?I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is... |
||||
|
Official
2024-04-18
|
Meta-Llama-3-70B-Instruct
Llama
|
2,199 | 923 |
|
|
Neutral
Neutral
0 replies
See also https://ai.meta.com/blog/meta-llama-3/ and https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... edit: and https://twitter.com/karpathy/status/1781028605709234613 |
||||
|
Neutral
Anti
15 replies
They've got a console for it as well, https://www.meta.ai/ And announcing a lot of integration across the Meta product suite, https://about.fb.com/news/2024/04/meta-ai-assistant-built-wi... Neglected to include comparisons against GPT-4-Turbo or Claude Opus, so I guess it's far ... |
||||
|
Positive
Pro
2 replies
Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases.Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others: p... |
||||
|
Positive
Pro
0 replies
Llama 3 70B has debuted on the famous LMSYS chatbot arena leaderboard at position number 5, tied with Claude 2 Sonnet, Bard (Gemini Pro), and Command R+, ahead of Claude 2 Haiku and older versions of GPT-4.The score still has a large uncertainty so it will take a while to dete... |
||||
|
Negative
Anti
4 replies
I tried generating a Chinese rap song, and it did generate a pretty good rap. However, upon completion, it deleted the response, and showed > I don’t understand Chinese yet, but I’m working on it. I will send you a message when we can talk in Chinese.I tried some other lang... |
||||
|
Paper
2024-04-23
|
Phi-3
Phi
|
411 | 130 |
|
|
Mixed
Mixed
5 replies
Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into ... |
||||
|
Very positive
Pro
10 replies
Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release. And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large. Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown) So we now have an ... |
||||
|
Positive
Pro
0 replies
This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than organic text training, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringemen... |
||||
|
Neutral
Neutral
2 replies
They have started putting some models in huggingface: https://huggingface.co/collections/microsoft/phi-3-6626e15e9... |
||||
|
Negative
Anti
0 replies
I'll believe it till I try it for myself, Phi-2 was the clear worst of the 20 LLMs we evaluated (was also smallest so was expected).But it was slow for its size, generated the longest responses with the most hallucinations, as well as generating the most empty responses. It wa... |
||||
|
Official
2024-05-13
|
GPT-4o
GPT
|
3,138 | 2,366 |
|
|
Positive
Mixed
14 replies
The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that was that they had tested this over and over, but it is really hard to get that right so if they didn't fake it in some way I'd say that ... |
||||
|
Mixed
Mixed
8 replies
Very interesting and extremely impressive!I tried using the voice chat in their app previously and was disappointed. The big UX problem was that it didn't try to understand when I had finished speaking. English is a second language and I paused a bit too long thinking of a wor... |
||||
|
Positive
Mixed
3 replies
Very, very impressive for a "minor" release demo. The capabilities here would look shockingly advanced just 5 years ago.Universal translator, pair programmer, completely human sounding voice assistant and all in real time. Scifi tropes made real.But: Interesting next to see ho... |
||||
|
Negative
Anti
8 replies
I found these videos quite hard to watch. There is a level of cringe that I found a bit unpleasant.It’s like some kind of uncanny valley of human interaction that I don’t get on nearly the same level with the text version. |
||||
|
Positive
Pro
15 replies
We've had voice input and voice output with computers for a long time, but it's never felt like spoken conversation. At best it's a series of separate voice notes. It feels more like texting than talking.These demos show people talking to artificial intelligence. This is new. ... |
||||
|
Official
2024-05-14
|
Gemini Flash Latest
Gemini
|
442 | 146 |
|
|
Neutral
Neutral
0 replies
I upgraded my llm-gemini plugin to provide CLI access to Gemini Flash: pipx install llm # or brew install llm llm install llm-gemini --upgrade llm keys set gemini # paste API key here llm -m gemini-1.5-flash-latest 'a short poem about otters' https://github.com/simonw/llm-gemi... |
||||
|
Mixed
Mixed
9 replies
Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context.Not bad compared to rolling your own, but among frontier models the main competitive differentiator... |
||||
|
Neutral
Mixed
5 replies
1M token context by default is the big feature here IMO, but we need better benchmarks to measure what that really means.My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and wil... |
||||
|
Negative
Anti
0 replies
A lightweight model that you can only use in the cloud? That is amusing. These tech megacorps are really intent on owning your usage of AI. But we must not let that be the future. |
||||
|
Negative
Anti
0 replies
One thing OpenAI is beating Google at is actually publishing pricing for its APIs (and being consistent as to what they are called.)Google has I think 10 models available (there’s more than ten model names, but several of the models have multiple aliases) through what the Goog... |
||||
|
Official
2024-05-29
|
Codestral (latest)
Codestral
|
457 | 214 |
|
|
Negative
Anti
8 replies
The license for this [1] prohibits use of the model and its outputs for any commercial activity, or even any "live" (whatever that means) conditions, commercial or not.There seems to be an exclusion for using the code outputs as part of "development". But wait! It also prohibi... |
||||
|
Negative
Neutral
9 replies
My favorite thing to ask the models designed for programming is: "Using Python write a pure ASGI middleware that intercepts the request body, response headers, and response body, stores that information in a dict, and then JSON encodes it to be sent to an external program usin... |
||||
|
Neutral
Neutral
7 replies
i've been noticing that there's a divergence in philosophy between Llama style LLMs (Mistral are Meta alums so I'm counting them in tehre) and OpenAI/GPT style LLMs when it comes to code.GPT3.5+ prioritized code very heavily - there's no CodeGPT, its just GPT4, and every versi... |
||||
|
Neutral
Neutral
6 replies
Is there a way to use this within VSCode like copilot , meaning having the "shadow code" appear while you code instead of having to tho back-and-forth between the editor and a chat-like interface ?For me, a significant component of the quality of these tools resides on the "cl... |
||||
|
Neutral
Neutral
0 replies
> Usage Limitation: - You shall only use the Mistral Models and Derivatives (whether or not created by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-Production Environments; - Subject to the foregoing, You shall not supply the Mistral Models, Deriva... |
||||
|
Official
2024-06-06
|
Qwen2
Qwen
|
261 | 130 |
|
|
Positive
Pro
6 replies
A 0.5B parameter model with a 32k context length that also makes good use of that full window?! That's very interesting.The academic benchmarks on that particular model relative to 1.5B-2B models are what you would expect, but it would make for an excellent base for finetuning... |
||||
|
Neutral
Neutral
2 replies
Is it a common practice in LLMs to give different weights to different training data sources?For instance I might want to say that all training data that comes from my inhouse emails take precedence over anything that comes from the internet? |
||||
|
Positive
Pro
0 replies
This model has: 1. On par or better performance than Llama-3-70B-Instruct 2. A much more comfortable context length of 128k (vs the tiny 8k that really hinders Llama-3). These 2 feats together will probably make it the first serious OS rival to GPT-4! |
||||
|
Negative
Anti
5 replies
Weird, every time I try asking what happened at Tiananmen Square, or why Xi is an outlier with 3 terms as party secretary, it errors. "All hail Glorious Xi :)" works though. https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct |
||||
|
Neutral
Neutral
8 replies
Given the restrictions on GPUs to China, I'm curious what their training cluster looks like.(not saying this out of any support or non-support for such a GPU blockade; I'm just genuinely curious) |
||||
|
Official
2024-07-16
|
Codestral Mamba
Codestral
|
485 | 138 |
|
|
Negative
Mixed
7 replies
What are the steps required to get this running in VS Code?If they had linked to the instructions in their post (or better yet a link to a one click install of a VS Code Extension), it would help a lot with adoption.(BTW I consider it malpractice that they are at the top of ha... |
||||
|
Negative
Neutral
3 replies
I kinda just want something that can keep up with the original version of Copilot. It was so much better than the crap they’re pumping out now (keeps messing up syntax and only completing a few characters at a time). |
||||
|
Neutral
Neutral
2 replies
Does anyone have a favorite FIM capable model? I've been using codellama-13b through ollama w/ a vim extension i wrote and it's okay but not amazing, I definitely get better code most of the time out of Gemma-27b but no FIM (and for some reason codellama-34b has broken inferen... |
||||
|
Positive
Pro
0 replies
It's great to see a high-profile model using Mamba2! |
||||
|
Neutral
Anti
2 replies
The MBPP column should bold DeepSeek as it has a better score than Codestral. |
||||
|
Official
2024-07-18
|
Mistral Nemo
Mistral Nemo
|
418 | 162 |
|
|
Mixed
Mixed
6 replies
> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on s... |
||||
|
Neutral
Neutral
3 replies
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.Does anyone have a good a... |
||||
|
Neutral
Neutral
0 replies
Nvidia has a blog post about Mistral NeMo, too: https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/ > Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines. > *Designed to fit on the... |
||||
|
Neutral
Neutral
2 replies
These big models are getting pumped out like crazy, that is the business of these companies. But basically, it feels like private/industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/fund... |
||||
|
Negative
Anti
1 reply
I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data. |
||||
|
Official
2024-07-18
|
GPT-4o mini
GPT
|
222 | 78 |
|
|
Neutral
Neutral
0 replies
[dupe] Some more discussion: https://news.ycombinator.com/item?id=40996248 |
||||
|
Positive
Pro
4 replies
The big news for me here is the 16k output token limit. The models keep increasing the input limit to outrageous amounts, but output has been stuck at 4k.I did a project to summarize complex PDF invoices (not “unstructured” data, but “idiosyncratically structured” data, as eac... |
||||
|
Neutral
Pro
3 replies
Here's something interesting to think about: In ML we do a lot of bootstrapping. If a model is 51% wrong on a binary problem you flip the answer and train a 51% correct model then work your way up from there. Small models are trained from synthetic and live data curated and gen... |
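The flip trick mentioned in that comment works because binary accuracy is symmetric: inverting every answer of a model that is right 49% of the time yields one that is right 51% of the time. A toy sketch on synthetic data (nothing here models a real training pipeline):

```python
import random

random.seed(0)
labels = [random.randint(0, 1) for _ in range(10_000)]
# Toy "model": agrees with the true label only ~40% of the time.
preds = [y if random.random() < 0.4 else 1 - y for y in labels]

acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
if acc < 0.5:                        # below chance on a binary task:
    preds = [1 - p for p in preds]   # flipping every answer gives 1 - acc
    acc = 1 - acc
print(f"accuracy after flipping: {acc:.2f}")
```

The same symmetry is the reason boosting only needs weak learners slightly better than a coin flip.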
||||
|
Negative
Anti
9 replies
GPT-4o mini is $0.15/1M input tokens, $0.60/1M output tokens. In comparison, Claude Haiku is $0.25/1M input tokens, $1.25/1M output tokens. There's no way this price-race-to-the-bottom is sustainable. |
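The price gap in that comment, worked through for a hypothetical monthly workload (the per-million-token prices are the ones quoted; the 10M input / 2M output volumes are invented for illustration):

```python
# Prices per 1M tokens (input, output), as quoted in the comment above.
prices = {"gpt-4o-mini": (0.15, 0.60), "claude-3-haiku": (0.25, 1.25)}
in_tok, out_tok = 10_000_000, 2_000_000   # hypothetical monthly volume

costs = {name: (in_tok / 1e6) * p_in + (out_tok / 1e6) * p_out
         for name, (p_in, p_out) in prices.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}/month")   # $2.70 vs $5.00
```

At this volume the quoted prices put GPT-4o mini at a bit over half the cost of Claude Haiku, which is the gap the commenter is reacting to.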
||||
|
Neutral
Neutral
1 reply
@dang: This post isn't on the 1st or 2nd page of hacker news. Did it trip some automated controversy detection code for too many comments in the first hour?Edit: it says 181 points, 6 hours ago, and eyeballing the 1st page it should be in the top 5 right now. |
||||
|
3rd party
2024-07-19
|
GPT-4o mini
GPT
|
290 | 236 |
|
|
Very negative
Anti
25 replies
So google will eventually be mostly indexing the output of LLMs, and at that point they might as well skip the middleman and generate all search results by themselves, which incidentally, this is how I am using Kagi today - I basically ask questions and get the answers, and I ... |
||||
|
Negative
Anti
3 replies
That's an inflection point, all right. OpenAI's customers can now at least break even.Of course, it means a flood of crap content. |
||||
|
Negative
Anti
4 replies
> For example, putting in 50k page views a month, with a Finance category, gives a potential yearly earnings of $2,000. > I'm going to take the median across all categories, which is an estimated annual revenue of $1,550 for 50,000 monthly page views. > This is approxim... |
||||
|
Negative
Anti
0 replies
This article is weird clickbait which, even weirder, worked.It seems to assume a world where SEO entrepreneurs where ready to churn out million-page sites, but the cost per query were blocking them. There is no marginal cost, no SEO cost to adding another page, as long as a co... |
||||
|
Neutral
Mixed
2 replies
This assumes a future where users are still depending on search engines or some comparative tool. Profiting off the current status quo. I would also be curious how user behavior will evolve to identify, evade, and ignore AI generated content. Some quasi arms race we'll be in f... |
||||
|
Official
2024-07-23
|
Meta-Llama-3.1-405B-Instruct
Llama
|
437 | 269 |
|
|
Neutral
Neutral
0 replies
Related ongoing thread: Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments) |
||||
|
Positive
Pro
4 replies
The 405b model is actually competitive against closed source frontier models.Quick comparison with GPT-4o: +----------------+-------+-------+ | Metric | GPT-4o| Llama | | | | 3.1 | | | | 405B | +----------------+-------+-------+ | MMLU | 88.7 | 88.6 | | GPQA | 53.6 | 51.1 | | ... |
||||
|
Positive
Pro
1 reply
I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboar... |
||||
|
Neutral
Neutral
6 replies
You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.If you want to learn more, there is a writeup at https://wow.groq.com... |
||||
|
Very positive
Pro
3 replies
Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology. Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path... |
||||
|
Official
2024-08-14
|
Grok 2
Grok
|
226 | 333 |
|
|
Mixed
Mixed
8 replies
The technology is impressive; achieving such a level requires a lot of efforts in dataset creation, neural architectural costs, and GPU shepherding.What is the company’s ethical position though? It officially stemmed from Mr Musk’s objection that OpenAI was not open-source, bu... |
||||
|
Neutral
Pro
2 replies
I assume they will have a lot less "safety", i.e. the model will be more likely to actually do what you ask instead of finding a reason why "sorry Dave, I can't do that".Since these "safety" features tend to also degrade the model, that's likely also helping them catch up in t... |
||||
|
Negative
Anti
1 reply
It's hilarious they put Claude 3.5 Sonnet in the far right corner while it scores the highest and beats most of Grok's numbers. |
||||
|
Positive
Pro
1 reply
It uses FLUX.1 to generate images and it has been fun so far. Its good on writing, can generate very realistic photos, can create memes, and looks like hands problem is fixed now. |
||||
|
Positive
Pro
3 replies
You know what’s also impressive besides this beta release? How Claude 3.5 Sonnet is still able to keep up so well. Grok-2 beat every other LLM except Claude. How did Anthropic achieve this? |
||||
|
Repo
2024-08-20
|
Phi-3.5-MoE instruct (128k)
Phi
|
25 | 3 |
|
|
Positive
Pro
0 replies
Mini: https://huggingface.co/microsoft/Phi-3.5-mini-instruct Large MoE with impressive benchmarks: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct |
||||
|
Neutral
Neutral
0 replies
Does anyone have an idea what the output token limit is? I only see mention of the 128k token context window, but I bet the output limit is 4k tokens. |
||||
|
Negative
Anti
0 repliesr
The Phi models always seem to do really well when it comes to benchmarks but then in real world performance they always fall way behind competing models. |
| 3rd party | 2024-09-11 | Pixtral 12B | Pixtral | 163 | 40 |
| Mixed | Mixed | 5 replies | The "Mistral Pixtral multimodal model" really rolls off the tongue. > It’s unclear which image data Mistral might have used to develop Pixtral 12B. The days of free web scraping especially for the richer sources of material are almost gone, with anything between technical (AP... |
| Negative | Anti | 1 reply | Couple notes for newcomers: 1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back. 2. It seems like Pixtral 12B benchmarks significantly below Qwen2-VL-7B [1], so if you want the best local model for unde... |
| Negative | Anti | 2 replies | Mistral being more open than 'openai' is kind of a meme. How can a company call itself open while it refuses to openly distribute it's product and when competitor are actually doing it. |
| Neutral | Neutral | 0 replies | Related earlier: New Mistral AI Weights https://news.ycombinator.com/item?id=41508695 |
| Positive | Pro | 1 reply | I’d love to know how much money Mistral is taking in versus spending. I’m very happy for all these open weights models, but they don’t have Instagram to help pay for it. These models are expensive to build. |
| Official | 2024-09-12 | OpenAI o1-mini | o-series | 30 | 5 |
| Neutral | Neutral | 0 replies | related tweet: https://x.com/OpenAIDevs/status/1834278699388338427 OpenAI o1-preview and o1-mini are rolling out today in the API for developers on tier 5. o1-preview has strong reasoning capabilities and broad world knowledge. o1-mini is faster, 80% cheaper, and competitive with... |
| Neutral | Neutral | 0 replies | Related: OpenAI O1 Model https://news.ycombinator.com/item?id=41523070 |
| Neutral | Neutral | 2 replies | No GPT in the name, so maybe not transformer based? |
| Official | 2024-09-25 | Llama-3.2-11B-Vision-Instruct | Llama | 924 | 328 |
| Very positive | Pro | 8 replies | I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version). I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job... |
| Very positive | Pro | 11 replies | I'm blown away with just how open the Llama team at Meta is. It is nice to see that they are not only giving access to the models, but they at the same time are open about how they built them. I don't know how the future is going to go in the terms of models, but I sure am gra... |
| Positive | Pro | 6 replies | "The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?) With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB! I believe this unforgiving dynamic is why model distillation works w... |
| Positive | Pro | 3 replies | Llama3.2 3B feels a lot better than other models with same size (e.g. Gemma2, Phi3.5-mini models). For anyone looking for a simple way to test Llama3.2 3B locally with UI, Install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in terminal: nexa run llama3.2 --streamlit Dis... |
| Neutral | Neutral | 3 replies | If anyone else is looking for the bigger models on ollama and wondering where they are, the Ollama blog post answered that for me. The are "coming soon" so they just aren't ready quite yet[1]. I was a little worried when I couldn't find them but sounds like we just need to be ... |
| Official | 2024-10-16 | Ministral 3B (latest) | Ministral | 216 | 99 |
| Negative | Anti | 8 replies | 3b is API-only so you won’t be able to run it on-device, which is the killer app for these smaller edge models. I’m not opposed to licensing but “email us for a license” is a bad sign for indie developers, in my experience. 8b weights are here https://huggingface.co/mistralai... |
| Negative | Anti | 2 replies | This press release is a big change in branding and ethos for Mistral. What was originally a vibey, insurgent contender that put out magnet links is now a PR-crafting team that has to fight to pitch their utility to the public. |
| Neutral | Neutral | 3 replies | Has anyone put together a good and regularly updated decision tree for what model to use in different circumstances (VRAM limitations, relative strengths, licensing, etc.)? Given the enormous zoo of models in circulation, there must be certain models that are totally obsolete. |
| Negative | Anti | 2 replies | They didn't add a comparison to Qwen 2.5 3b, which seems to surpass Ministral 3b MMLU, HumanEval, GSM8K: https://qwen2.org/qwen2-5/#qwen25-05b15b3b-performance These benchmarks don't really matter that much, but it is funny how this blog post conveniently forgot to compare with... |
| Neutral | Neutral | 4 replies | For anybody wondering about the title, that's a sort-of pun in French about how words get pluralized following French rules. The quintessential example is "cheval" (horse) which becomes "chevaux" (horses), which is the rule they're following (or being cute about). Un mistral, d... |
| Official | 2024-11-11 | Qwen2.5-Coder 32B Instruct | Qwen | 23 | 0 | — |
| Official | 2024-11-27 | QwQ 32B | Qwen | 438 | 421 |
| Very positive | Pro | 3 replies | This one is crazy. I made up a silly topology problem which I guessed wouldn't be in a textbook (given X create a shape with Euler characteristic X) and set it to work. Its first effort was a program that randomly generated shapes, calculated X and hoped it was right. I went a... |
| Positive | Pro | 5 replies | This one is pretty impressive. I'm running it on my Mac via Ollama - only a 20GB download, tokens spit out pretty fast and my initial prompts have shown some good results. Notes here: https://simonwillison.net/2024/Nov/27/qwq/ |
| Positive | Pro | 1 reply | QwQ can solve a reverse engineering problem [0] in one go that only o1-preview and o1-mini have been able to solve in my tests so far. Impressive, especially since the reasoning isn't hidden as it is with o1-preview. [0] https://news.ycombinator.com/item?id=41524263 |
| Positive | Pro | 2 replies | 32B is a good choice of size, as it allows running on a 24GB consumer card at ~4 bpw (RTX 3090/4090) while using most of the VRAM. Unlike llama 3.1, which had 8b, 70B (much too big to fit), and 405B. |
| Negative | Mixed | 7 replies | I asked the classic 'How many of the letter “r” are there in strawberry?' and I got an almost never ending stream of second guesses. The correct answer was ultimately provided but I burned probably 100x more clockcycles than needed. See the response here: https://pastecode.io/s... |
| Repo | 2024-12-06 | Llama-3.3-70B-Instruct | Llama | 425 | 219 |
| Very positive | Pro | 3 replies | Benchmarks - https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment... Seems to perform on par with or slightly better than Llama 3.2 405B, which is crazy impressive. Edit: According to Zuck (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last release in the Llama 3... |
| Neutral | Pro | 8 replies | This reminds me of Steve Jobs's famous comment to Dropbox about storage being 'a feature, not a product.' Zuckerberg - by open-sourcing these powerful models, he's effectively commoditising AI while Meta's real business model remains centred around their social platforms. They... |
| Positive | Pro | 4 replies | Seems to be more or less on par with GPT-4o across many benchmarks: https://x.com/Ahmad_Al_Dahle/status/1865071436630778109 |
| Positive | Pro | 1 reply | Does unexpectedly well on our benchmark: https://help.kagi.com/kagi/ai/llm-benchmark.html Will dive into it more, but this is impressive. |
| Neutral | Neutral | 3 replies | Please help me understand something. I've been out of the loop with HuggingFace models. What can you do with these models? 1. Can you download them and run them on your Laptop via JupyterLab? 2. What benefits does that get you? 3. Can you update them regularly (with new data on the... |
| Official | 2024-12-11 | Gemini 2.0 Flash | Gemini | 1,015 | 490 |
| Very positive | Pro | 10 replies | https://aistudio.google.com/live is by far the coolest thing here. You can just go there and share your screen or camera and have a running live voice conversation with Gemini about anything you're looking at. As much as you want, for free. I just tried having it teach me how t... |
| Positive | Pro | 5 replies | I released a new llm-gemini plugin with support for the Gemini 2.0 Flash model, here's how to use that in the terminal: llm install -U llm-gemini llm -m gemini-2.0-flash-exp 'prompt goes here' LLM installation: https://llm.datasette.io/en/stable/setup.html Worth noting that the... |
| Positive | Pro | 12 replies | Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction. But, once they do get moving in the right direction the can achieve things that smaller companies can't. Google has an insane amount of talent in this sp... |
| Positive | Pro | 3 replies | Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability. They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about ... |
| Positive | Pro | 4 replies | Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish. |
| Official | 2024-12-13 | Phi-4 | Phi | 439 | 143 |
| Neutral | Neutral | 12 replies | The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905 Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on... |
| Mixed | Anti | 2 replies | For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend. |
| Positive | Pro | 7 replies | Looks like it punches way above its weight(s). How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro? |
| Neutral | Neutral | 1 reply | Looks like someone converted it for Ollama use already: https://ollama.com/vanilj/Phi-4 |
| Negative | Anti | 0 replies | I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model. I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than ... |
| Official | 2025-01-13 | Codestral 25.01 | Codestral | 21 | 1 |
| Mixed | Mixed | 0 replies | 256k context window and FIM support sounds good. I can't see it mentioned on the release page how big the model is, they say sub-100B but they are comparing it to 22B and the 22B seems to hold up quite well against it. If we're talking 80B vs 22B then it seems like quite a hea... |
| Repo | 2025-01-20 | DeepSeek R1 | DeepSeek | 1,843 | 663 |
| Positive | Pro | 23 replies | OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-... The one I'm running is the 8.54GB file. I'm using Ollama like this: ollama run hf.co/unsloth/DeepSeek-... |
| Positive | Pro | 22 replies | Disclaimer: I am very well aware this is not a valid test or indicative or anything else. I just thought it was hilarious. When I asked the normal "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that its (2). It c... |
| Positive | Pro | 4 replies | > However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. We've been running ... |
| Mixed | Mixed | 6 replies | Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, Ne... |
| Positive | Pro | 6 replies | The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI. Their model crushes it on closed-system tasks... |
| Paper | 2025-01-25 | DeepSeek R1 | DeepSeek | 1,351 | 1,056 |
| Positive | Pro | 10 replies | we've been tracking the deepseek threads extensively in LS. related reads: - i consider the deepseek v3 paper required preread https://github.com/deepseek-ai/DeepSeek-V3 - R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo https://aider.chat/2025/01/24/r1-sonnet... |
| Positive | Pro | 6 replies | I've been using https://chat.deepseek.com/ over My ChatGPT Pro subscription because being able to read the thinking in the way they present it is just much much easier to "debug" - also I can see when it's bending it's reply to something, often softening it or pandering to me ... |
| Neutral | Neutral | 10 replies | DeepSeek-R1 has apparently caused quite a shock wave in SV ... https://venturebeat.com/ai/why-everyone-in-ai-is-freaking-ou... |
| Very positive | Pro | 3 replies | DeepSeek V3 came in the perfect time, precisely when Claude Sonnet turned into crap and barely allows me to complete something without me hitting some unexpected constraints. Idk, what their plans is and if their strategy is to undercut the competitors but for me, this is a hug... |
| Positive | Neutral | 4 replies | Over 100 authors on arxiv and published under the team name, that's how you recognize everyone and build comradery. I bet morale is high over there |
| Official | 2025-01-26 | Qwen2.5 14B Instruct | Qwen | 307 | 108 |
| Neutral | Neutral | 0 replies | Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/ (via https://news.ycombinator.com/item?id=42832838, but we merged that thread hither) |
| Negative | Anti | 13 replies | In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, et... |
| Neutral | Neutral | 4 replies | Ollama has a num_ctx parameter that controls the context window length - it defaults to 2048. At a guess you will need to set that. |
| Neutral | Neutral | 1 reply | Here are tips for running it on macOS using MLX: https://twitter.com/awnihannun/status/1883611098081099914 - using https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-... |
| Neutral | Neutral | 3 replies | What's the SOTA for memory-centric computing? I feel like maybe we need a new paradigm or something to bring the price of AI memory down. Maybe they can take some of those hundreds of billions and invest in new approaches. Because racks of H100s are not sustainable. But it's cle... |
| Official | 2025-01-30 | Mistral Small 3 | Mistral | 620 | 194 |
| Positive | Pro | 8 replies | I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B. I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. Th... |
| Neutral | Pro | 6 replies | Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache: We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-lic... |
| Neutral | Neutral | 1 reply | Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to: * run LLM evaluations systematically and at scale * share the data with the public in a rigorous and transparent way We use the UK governmen... |
| Negative | Anti | 0 replies | Not so subtle in function calling example[1] "role": "assistant", "content": "---\n\nOpenAI is a FOR-profit company.", [1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-... |
| Neutral | Pro | 0 replies | So the point of this release is 1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version) 2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100) 3) Performance, according to mistral, somewhere between Qwen 2... |
| Official | 2025-01-31 | o3-mini | o-series | 962 | 904 |
| Neutral | Neutral | 15 replies | I used o3-mini to summarize this thread so far. Here's the result: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae... For 18,936 input, 2,905 output it cost 3.3612 cents. Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes... |
| Neutral | Neutral | 2 replies | I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: https://llm.datasette.io/en/stable/changelog.html#v0-21 Example usage: llm -m o3-mini 'write a poem about a pirate and a walrus' \ -o reasoning_effort high Output (com... |
| Positive | Pro | 4 replies | For AI coding, o3-mini scored similarly to o1 at 10X less cost on the aider polyglot benchmark [0]. This comparison was with both models using high reasoning effort. o3-mini with medium effort scored in between R1 and Sonnet. 62% $186 o1 high 60% $18 o3-mini high 57% $5 DeepSe... |
| Positive | Pro | 14 replies | For years I've been asking all the models this mixed up version of the classic riddle and they 99% of the time get it wrong and insist on taking the goat across first. Even the other reasoning models would reason about how it was wrong, figure out the answer, and then still co... |
| Neutral | Neutral | 14 replies | So far, it seems like this is the hierarchy: o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini. o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf |
| 3rd party | 2025-02-05 | Gemini 2.0 Flash | Gemini | 1,303 | 447 |
| Positive | Pro | 33 replies | I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate that multi-modal, large... |
| Negative | Anti | 17 replies | This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result. You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images. You... |
| Negative | Neutral | 6 replies | Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data. |
| Very negative | Anti | 5 replies | We are driving full speed into a xerox 2.0 moment and this time we are doing so knowingly. At least with xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I won... |
| Positive | Mixed | 6 replies | (disclaimer I am CEO of llamaindex, which includes LlamaParse) Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so we... |
| 3rd party | 2025-02-20 | Grok 3 | Grok | 132 | 211 |
| Negative | Anti | 7 replies | This article is weak and just general speculation. Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this: > Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all ot... |
| Negative | Anti | 2 replies | The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim out that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok ... |
| Negative | Anti | 10 replies | Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute. If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.... |
| Negative | Anti | 2 replies | I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good. Another nit-pick: I don't think DeepSeek had 50k Hopper GPUs. Mayb... |
| Negative | Anti | 0 replies | The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is: > Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only... |
| Official | 2025-02-24 | Claude Sonnet 3.7 | Claude | 2,127 | 963 |
| Positive | Pro | 18 replies | Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING. Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5. Aider 0.75.0 is out with support for 3.7 Sonnet [1]. Thinking supp... |
| Neutral | Neutral | 82 replies | Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product. |
| Positive | Pro | 8 replies | Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7. https://help.kagi.com/kagi/ai/llm-benchmark.html Appears to be second most capable general purpose LLM we tried (second to gemini 2.0 pro, in front of gpt-4o). Less impressive in thinking mode, abo... |
| Positive | Pro | 91 replies | You can get your HN profile analyzed by it and it's pretty funny :) https://hn-wrapped.kadoa.com/ I'm using this to test the humor of new models. |
| Positive | Pro | 2 replies | I'm somewhat impressed from the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a CloudFlare pages function would return 500 + nonsensical error and an empty response in prod. Tried to figure this out all Friday. It was... |
| Official | 2025-02-26 | Phi-4-multimodal | Phi | 48 | 3 |
| Positive | Pro | 0 replies | Great on-device additions to the larger Phi-4 14B released Dec/2024. https://lifearchitect.ai/models-table/ |
| Positive | Pro | 1 reply | That looks really good. A 3.8B model holding it's own against pretty 7B class models is definitely an accomplishment. Sizing looks like it is aimed primarily at copilot style NPUs I guess |
| Official | 2025-02-27 | GPT-4.5 | GPT | 1,136 | 988 |
| Negative | Anti | 48 replies | GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens It sounds like it's so expensive and t... |
| Negative | Anti | 12 replies | Is it official then? Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us knew this since last year. GPT-5 delays eventually led to non-tech voices to suggest likewise. But w... |
| Neutral | Neutral | 10 replies | I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments): hn-summary.sh 43197872 -m gpt-4.5-preview Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes... Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c... |
| Negative | Anti | 9 replies | Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to s... |
| Mixed | Mixed | 15 replies | First impression of GPT-4.5: 1. It is very very slow, for some applications where you want real time interactions is just not viable, the text attached below took 7s to generate with 4o, but 46s with GPT4.5 2. The style it writes is way better: it keeps the tone you ask and make... |
| Official | 2025-03-05 | QwQ 32B | Qwen | 480 | 169 |
| Mixed | Mixed | 10 replies | Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply. EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't beli... |
| Mixed | Pro | 6 replies | Chinese strategy is open-source software part and earn on robotics part. And, They are already ahead of everyone in that game. These things are pretty interesting as they are developing. What US will do to retain its power? BTW I am Indian and we are not even in the race as coun... |
| Positive | Pro | 4 replies | I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too. 20x smaller than Deep Seek! How small can these go? What kind of hardware can run this? |
| Neutral | Neutral | 5 replies | To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ. |
| Mixed | Mixed | 2 replies | It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported - it sometimes loses the plot after producing a lot of reasoning tokens. Overall though quite impressive i... |
| Official | 2025-03-06 | Mistral OCR | Mistral | 1,756 | 417 |
| Mixed | Mixed | 9 replies | I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker . Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41. Marker can inference between 20 and 120 pages per second on an H100. You can see the samples here - https://huggingf... |
| Negative | Anti | 7 replies | It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image: https://i.imgur.com/jcwW5AG.jpeg For the blocks in the center, it outputs: > Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 aoû... |
| Very positive | Pro | 2 replies | This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer. Specifically, this allows y... |
| Positive | Pro | 3 replies | I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF. |
| Mixed | Mixed | 2 replies | We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini A high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and... |
| Paper | 2025-03-12 | Gemma 3 27B | Gemma | 488 | 245 |
| Neutral | Neutral | 6 replies | Gemma 3 is out! Multimodal (image + text), 128K context, supports 140+ languages, and comes in 1B, 4B, 12B, and 27B sizes with open weights & commercial use. Gemma 3 model overview: https://ai.google.dev/gemma/docs/core Huggingface collection: https://huggingface.co/collecti... |
| Neutral | Neutral | 11 replies | Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP. (Opinions our own and not of Google DeepMind.) PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957 |
| Positive | Pro | 3 replies | Lots to be excited about here - in particular new architecture that allows subquadratic scaling of memory needs for long context; looks like 128k+ context is officially now available on a local model. The charts make it look like if you have the RAM the model is pretty good ou... |
| Neutral | Neutral | 0 replies | Linking to the announcement (which links to his PDF) would probably be more useful. Introducing Gemma 3: The most capable model you can run on a single GPU or TPU https://blog.google/technology/developers/gemma-3/ |
| Mixed | Mixed | 3 replies | Very cool open release. Impressive that a 27b model can be as good as the much bigger state of the art models (according to their table of Chatbot Arena, tied with O1-preview and above Sonnet 3.7). But the example image shows that this model still makes dumb errors or has a poo... |
| Official | 2025-03-12 | Gemini 2.0 Flash | Gemini | 101 | 12 |
| Negative | Anti | 4 replies | It just tells me 'content not permitted'. For context, I was attempting to put a cup of hot chocolate into the hands of an anime character. |
| Positive | Pro | 1 reply | Ever since OpenAI showed (but did not release) this type of multimodal output with 4o, I have been waiting for this to be available to the general public. It seems like really combining visuals at the level of generation capability means language understanding is fully grounded... |
| Negative | Anti | 0 replies | Curious that they use "Use Gemini 2.0 Flash to tell a story and it will illustrate it with pictures, keeping the characters and settings consistent throughout", and their example is not consistent at all between two pictures. |
| Negative | Anti | 0 replies | I was really hoping that there would be more character consistency, given the fact they mention it in the blog. It also doesn't seem to reliably follow styles like "watercolor illustration" or "line and wash". |
| Official | 2025-03-14 | Cohere Command A | Command | 77 | 19 |
| Negative | Anti | 1 reply | I distrust those benchmarks after working with sonnet for half a year now. Many OpenAI models beat Sonnet on paper. This seems to be the case because it's strength (agent, visual, caching) aren't being used, I guess? Otherwise there is no explanation why it's not constantly on... |
| Negative | Anti | 2 replies | I just tried the chat and asked the LLM to compute the double integral of 6*y on the interior of a triangle given the vertices. There were many trials all incorrect, then I asked to compute a python program to solve this, again incorrect. I know math computation is a weak poin... |
| Very negative | Anti | 1 reply | Funny how AI companies love training competitors to human labor on human output but then write in their terms that you’re not supposed to train competing bots on their bot output. Explicitly anticompetitive hypocrisy, and millions of suckers pay for it, how sad |
| Negative | Anti | 0 replies | what disqualified this model for me (I mostly use llms for codding) was 12% score in aider benchmark (https://aider.chat/docs/leaderboards/) |
| Positive | Pro | 0 replies | "Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency." Visible above the fold. Thanks for getting to the point. |
||||
|
o1-pro (o-series) | Official | 2025-03-19 | score 131 | 129 comments

[Mixed / Mixed / 5 replies]
Pricing: $150 / 1M input tokens, $600 / 1M output tokens. (Not a typo.) Very expensive, but I've been using it with my ChatGPT Pro subscription and it's remarkably capable. I'll give it 100,000 token codebases and it'll find nuanced bugs I completely overlooked. (Now I almost fe...

[Neutral / Neutral / 4 replies]
This is their first model to only be available via the new Responses API - if you have code that uses Chat Completions you'll need to upgrade to Responses in order to support this. Could take me a while to add support for it to my LLM tool: https://github.com/simonw/llm/issues/839

[Neutral / Neutral / 5 replies]
It cost me 94 cents to render a pelican riding a bicycle SVG with this one! Notes and SVG output here: https://simonwillison.net/2025/Mar/19/o1-pro/

[Neutral / Pro / 3 replies]
Assuming a highly motivated office worker spends 6 hours per day listening or speaking, at a salary of $160k per year, that works out to a cost of ≈$10k per 1M tokens. OpenAI is now within an order of magnitude of a highly skilled humans with their frontier model pricing. o3 pr...
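The ≈$10k-per-1M-tokens figure in the comment above can be reproduced under a few stated assumptions; the speaking pace, tokens-per-word ratio, and working days below are illustrative guesses, not numbers from the post:

```python
# Back-of-envelope: cost per million "output tokens" of human speech.
# All four constants are assumptions for illustration, not data from the post.
WORDS_PER_MIN = 150        # assumed conversational speaking pace
TOKENS_PER_WORD = 1.33     # rough English tokens-per-word ratio
HOURS_PER_DAY = 6          # per the comment
WORK_DAYS = 250            # assumed working days per year
SALARY = 160_000           # per the comment

tokens_per_year = WORDS_PER_MIN * 60 * HOURS_PER_DAY * WORK_DAYS * TOKENS_PER_WORD
cost_per_million = SALARY / tokens_per_year * 1_000_000
print(f"~${cost_per_million:,.0f} per 1M tokens")
```

With those assumptions the estimate lands near $9k, consistent with the commenter's "≈$10k".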
[Negative / Anti / 2 replies]
It has a 2023 knowledge cut-off, and 200k context window... ? That's pretty underwhelming.
Qwen2.5-VL-32B (Qwen) | Official | 2025-03-24 | score 544 | 293 comments

[Neutral / Neutral / 5 replies]
Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/

[Very positive / Pro / 1 reply]
This model is available for MLX now, in various different sizes. I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this:
uv run --with 'numpy<2' --with mlx-vlm \ ...

[Very positive / Pro / 2 replies]
We were using Llama vision 3.2 a few months back and were very frustrated with it (both in term of speed and results quality). Some day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our ...

[Positive / Pro / 9 replies]
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced M...

[Neutral / Neutral / 10 replies]
Silly question: how can OpenAI, Claude and all, have a valuation so large considering all the open source models? Not saying they will disappear or be tiny (closed models), but why so so so valuable?
Gemini 2.5 Pro (Gemini) | Official | 2025-03-25 | score 973 | 485 comments

[Very positive / Pro / 14 replies]
One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it. Imagine for instance you give the LLM the profile of the love interest for your...

[Very positive / Pro / 22 replies]
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand. Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. ...

[Positive / Pro / 4 replies]
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those. Plus it drew me a very decent pelican r...

[Positive / Pro / 3 replies]
Tops our benchmark in an unprecedented way. https://help.kagi.com/kagi/ai/llm-benchmark.html High quality, to the point. Bit on the slow side. Indeed a very strong model. Google is back in the game big time.

[Positive / Pro / 2 replies]
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%. This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
[0] https://aider.c...
DeepSeek V3 (DeepSeek) | Paper | 2025-03-27 | score 132 | 34 comments

[Neutral / Neutral / 3 replies]
The GPU-hours stat here allows us to back out some interesting figures around electricity usage and carbon emissions if we make a few assumptions. 2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU Watt-hours. 975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1....
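The electricity arithmetic in the comment above checks out; a quick sketch (the 350W TDP and the 1.2 non-GPU overhead factor are the commenter's assumptions):

```python
# Reproducing the commenter's electricity estimate for DeepSeek-V3 training.
gpu_hours = 2_788_000
tdp_watts = 350                  # H800 TDP assumed in the comment
gpu_wh = gpu_hours * tdp_watts   # GPU-only watt-hours
system_wh = gpu_wh * 1.2         # 1.2x to account for non-GPU hardware
mwh = system_wh / 1e6
print(f"{gpu_wh:,.0f} GPU Wh, ~{mwh:,.0f} MWh including overhead")
```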
[Negative / Anti / 2 replies]
The fact that you can unironically put the "only" modifier on a training time of 2.8 million GPU hours is nuts.

[Neutral / Neutral / 1 reply]
Re DeepSeek-V3 0324 - I made some 2.7bit dynamic quants (230GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...

[Neutral / Neutral / 0 replies]
Hasn't been updated for the -0324 release unfortunately, and diff-pdf shows only a few small additions (and consequent layout shift) for the updated arxiv version on Feb 18.

[Positive / Pro / 1 reply]
Nice to see a return to open source in models and training systems.
Qwen2.5-Omni 7B (Qwen) | 3rd party | 2025-03-27 | score 29 | 3 comments

[Positive / Pro / 0 replies]
Qwen has been on fire lately. First QwQ now this!

[Negative / Anti / 1 reply]
Dunno if the article author is in this, but there's an issue with the links. Hugging face goes to github, github goes to nowhere.
QVQ Max (Qwen) | Official | 2025-04-03 | score 121 | 26 comments

[Positive / Pro / 1 reply]
I think this is old news, but this model does better than llama 4 maverick on coding.

[Neutral / Neutral / 5 replies]
I wonder why are we getting these drops during the weekend. Is the AI race truly that heated?

[Neutral / Neutral / 1 reply]
The about page doesn’t shed light on the composition of the core team nor their sources of incomes or funding. Am I overlooking something?
> We are a group of people with diverse talents and interests.

[Mixed / Mixed / 0 replies]
I’ve tried QVQ before, one issue I always stumbled upon is that it entered an endless loop of reasoning. It felt wrong letting it run for too long, I didn’t want to over consume Alibabas gpus. I’ll try QVQ-Max when I get home. It’s kinda fascinating how these models can interpr...
Llama 4 Scout 17B 16E Instruct (Llama) | Official | 2025-04-05 | score 1,235 | 658 comments

[Neutral / Neutral / 8 replies]
General overview below, as the pages don't seem to be working well.
Llama 4 Models:
- Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
- They are natively multimodal: text + image input, text-only output.
- Key achie...

[Neutral / Neutral / 36 replies]
"It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet." Perhaps. Or, maybe, "leaning left" by th...

[Neutral / Neutral / 5 replies]
Model training observations from both Llama 3 and 4 papers: Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38 - 43% hardware efficiency [Meta, Llama 3]. For Llama 4 training, Meta doubled the compute, using ~...

[Positive / Pro / 11 replies]
The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct...

[Negative / Mixed / 7 replies]
This thread so far (at 310 comments) summarized by Llama 4 Maverick: hn-summary.sh 43595585 -m openrouter/meta-llama/llama-4-maverick -o max_tokens 20000 Output: https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4... And with Scout I got complete junk output for some r...
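The 38-43% hardware-efficiency figure in the training-observations comment above is just achieved throughput divided by peak; a quick check, assuming the H100's ~989 TFLOPS dense BF16 peak (an assumption, not stated in the comment):

```python
# Model FLOPS utilization (MFU) check for the Llama 3 figures quoted above.
H100_PEAK_BF16_TFLOPS = 989  # assumed dense BF16 peak for an H100 SXM

utils = {tflops: tflops / H100_PEAK_BF16_TFLOPS for tflops in (380, 430)}
for tflops, mfu in utils.items():
    print(f"{tflops} achieved TFLOPS -> {mfu:.0%} of peak")
```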
GPT-4.1 (GPT) | Official | 2025-04-14 | score 680 | 492 comments

[Negative / Neutral / 16 replies]
As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no i...

[Neutral / Neutral / 8 replies]
Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:
            SWE  Aider  Cost  Fast  Fresh
Claude 3.7  70%  65%    $15   77    8/24
Gemini 2.5  64%  69%    $10   200   1/25
GPT-4.1     55%  53%    $8    169   6/24
DeepSeek R1 49%  57%    $...

[Neutral / Neutral / 8 replies]
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:
- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use X...

[Mixed / Mixed / 2 replies]
I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus. My take aways:
- This is the first model from OpenAI that f...

[Neutral / Pro / 4 replies]
From OpenAI's announcement:
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better s...
o3 (o-series) | Official | 2025-04-16 | score 555 | 507 comments

[Very negative / Anti / 13 replies]
Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549... With right knowledge and web searches one can answer this question in a matte...

[Positive / Pro / 5 replies]
Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA ...

[Positive / Pro / 7 replies]
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1] Incredible how resilient Claude models have been for best-in-coding class.
[1] But by only about 1%, and inclusive of C...

[Positive / Pro / 3 replies]
I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model). Using o4-min...

[Negative / Anti / 5 replies]
To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.
GPT-4o mini: The new moon in August 2025 will occur on August 12.
Llama 3.3 70B: The new moon in August...
Gemini 2.5 Flash (Gemini) | Official | 2025-04-17 | score 1,076 | 560 comments

[Very positive / Pro / 20 replies]
Google making Gemini 2.5 Pro (Experimental) free was a big deal. I haven't tried the more expensive OpenAI models so I can't even compare, only to the free models I have used of theirs in the past. Gemini 2.5 Pro is so much of a step up (IME) that I've become sold on Google's m...

[Neutral / Pro / 3 replies]
An often overlooked feature of the Gemini models is that they can write and execute Python code directly via their API. My llm-gemini plugin supports that: https://github.com/simonw/llm-gemini
uv tool install llm
llm install llm-gemini
llm keys set gemini # paste key here
llm -...

[Positive / Pro / 18 replies]
Gemini flash models have the least hype, but in my experience in production have the best bang for the buck and multimodal tooling. Google is silently winning the AI race.

[Positive / Pro / 6 replies]
One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generated 2D bounding boxes of relevant subjects, but it can also create segmentation masks! https://ai.google.dev/gemini-api/docs/image-und...

[Positive / Pro / 3 replies]
For a non programmer like me google is becoming shockingly good. It is giving working code the first time. I was playing around with it asked it to write code to scrape some data of a website to analyse. I was expecting it to write something that would scrape the data and late...
Gemma 3 27B (Gemma) | Official | 2025-04-20 | score 602 | 276 comments

[Very positive / Pro / 11 replies]
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B. I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of mem...

[Very positive / Pro / 1 reply]
I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I...

[Negative / Anti / 3 replies]
First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn'...

[Very positive / Pro / 3 replies]
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac. gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also. I still like Gemini 2.5 Pro ...

[Positive / Pro / 0 replies]
Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of ent...
Qwen3 14B (Qwen) | Official | 2025-04-28 | score 869 | 388 comments

[Negative / Anti / 18 replies]
I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Q...

[Positive / Pro / 4 replies]
They have got pretty good documentation too[1]. And Looks like we have day 1 support for all major inference stacks, plus so many size choices. Quants are also up because they have already worked with many community quant makers. Not even going into performance, need to test fi...

[Mixed / Mixed / 6 replies]
As is now traditional for new LLM releases, I used Qwen 3 (32B, run via Ollama on a Mac) to summarize this Hacker News conversation about itself - run at the point when it hit 112 comments. The results were kind of fascinating, because it appeared to confuse my system prompt te...

[Neutral / Neutral / 12 replies]
Something that interests me about the Qwen and DeepSeek models is that they have presumably been trained to fit the worldview enforced by the CCP, for things like avoiding talking about Tiananmen Square - but we've had access to a range of Qwen/DeepSeek models for well over a ...

[Neutral / Neutral / 15 replies]
With all the different open-weight models appearing, is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU? I.e. I have Quadro RTX 4000 with 8G vram and seeing all the models https://ollama.com/search here with all...
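The closing question about what fits an 8 GB card has a rough rule of thumb: weight memory is parameter count times bits per weight divided by 8, plus overhead for KV cache and activations. A sketch (the flat 15% overhead factor is an assumption; real usage varies with context length):

```python
def approx_vram_gb(params_b: float, bits: int, overhead: float = 1.15) -> float:
    """Weights-only memory estimate (GB) with a flat overhead factor for
    KV cache and activations. The 1.15 factor is a rule-of-thumb assumption."""
    return params_b * bits / 8 * overhead

for params_b, bits in [(14, 4), (8, 4), (4, 8)]:
    print(f"{params_b}B at {bits}-bit: ~{approx_vram_gb(params_b, bits):.1f} GB")
```

By this estimate a 14B model at 4-bit just overflows an 8 GB card, while an 8B model at 4-bit fits comfortably.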
Phi-4-reasoning (Phi) | Official | 2025-05-01 | score 131 | 26 comments

[Neutral / Neutral / 1 reply]
We uploaded GGUFs for anyone who wants to run them locally. [EDIT] - I fixed all chat templates so no need for --jinja as at 10:00PM SF time.
Phi-4-mini-reasoning GGUF: https://huggingface.co/unsloth/Phi-4-mini-reasoning-GGUF
Phi-4-reasoning-plus-GGUF: https://huggingface.co/unsl...

[Very positive / Pro / 2 replies]
These look quite incredible. I work on a llama.cpp GUI wrapper and its quite surprising to see how well Microsoft's Phi-4 releases set it apart as the only competition below ~7B, it'll probably take a year for the FOSS community to implement and digest it completely (it can do...

[Neutral / Neutral / 0 replies]
Sorry if this comment is outdated or ill-informed, but it is hard to follow the current news. Do the Phi models still have issues with training on the test set, or have they fixed that?

[Mixed / Mixed / 0 replies]
Is anyone here using phi-4 multimodal for image-to-text tasks? The phi models often punch above their weight, and I got curious about the vision models after reading https://unsloth.ai/blog/phi4 stories of finetuning. Since lmarena.ai only has the phi-4 text model, I've tried "ph...

[Mixed / Mixed / 1 reply]
The example prompt for reasoning model that never fails to amuse me: "How amy letter 'r's in the word 'strrawberrry'?"
Phi-4-mini-reasoning: thought for 2 min 3 sec
<think> Okay, let's see here. The user wants to know how many times the letter 'r' appears in the word 'strr...
Mistral Medium 3 (Mistral) | Official | 2025-05-07 | score 70 | 20 comments

[Negative / Anti / 1 reply]
It's not cheaper than Deepseek V3.1, though, and Deepseek outperforms on nearly everything. And only between 1-3x the throughput based on the openrouter metrics (near equivalent throughput if you use an FP8 quant). Wish I could be a little more excited about this one.

[Neutral / Neutral / 1 reply]
https://lifearchitect.ai/models-table/

[Neutral / Neutral / 2 replies]
I guess this one (Mistral Medium 3) won't be open?

[Positive / Pro / 0 replies]
The medium model is crazy fast. Reminds me of maverick on groq (except better according to their own testing).

[Neutral / Neutral / 0 replies]
The only hint at its size is that it requires "self-hosted environments of four GPUs and above".
Gemma 3n 4B (Gemma) | Official | 2025-05-20 | score 441 | 163 comments

[Positive / Pro / 12 replies]
You can try it on Android right now:
Download the Edge Gallery apk from github: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...
Import the .task file in ...

[Neutral / Neutral / 3 replies]
Probably a better link: https://developers.googleblog.com/en/introducing-gemma-3n/
Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model. At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot ...

[Mixed / Mixed / 2 replies]
According to the readme here - https://huggingface.co/google/gemma-3n-E4B-it-litert-preview
E4B has a score of 44.4 in the Aider polyglot dashboard. Which means its on-par with gemini-2.5-flash (not the latest preview but the version used for the bench on aider's website), gpt4...

[Positive / Pro / 1 reply]
On Hugging face I see 4B and 2B versions now - https://huggingface.co/collections/google/gemma-3n-preview-6...
Gemma 3n Preview
google/gemma-3n-E4B-it-litert-preview
google/gemma-3n-E2B-it-litert-preview
Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and or MoE model...

[Mixed / Pro / 0 replies]
It seems to work quite well on my phone. One funny side effect I've found is that it's much easier to bypass the censorship in these smaller models than in the larger ones, and with the complexity of the E4B variant I wouldn't have expected the "roleplay as my father who is ex...
Devstral (Devstral) | Official | 2025-05-21 | score 701 | 148 comments

[Positive / Pro / 4 replies]
The first number I look at these days is the file size via Ollama, which for this model is 14GB https://ollama.com/library/devstral/tags
I find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters bec...

[Very positive / Pro / 3 replies]
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run t...

[Positive / Pro / 1 reply]
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company. Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, it's really good. Mistral Small is really good. Also building a startup with Mis...

[Positive / Pro / 1 reply]
It's very nice that it has the Apache 2.0 license, i.e. well understood license, instead of some "open weight" license with a lot of conditions.

[Mixed / Mixed / 1 reply]
*For people without a 24GB RAM video card, I've got an 8GB RAM one running this model performs OK for simple tasks on ollama but you'd probably want to pay for an API for anything using a large context window that is time sensitive:*
total duration: 35.016288581s
load duration:...
Claude 4 (Claude) | Official | 2025-05-22 | score 2,013 | 1,170 comments

[Neutral / Pro / 10 replies]
An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025) https://docs.anthropic.com/en/docs/about-claude/models/overv...

[Positive / Pro / 7 replies]
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.” Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handl...

[Negative / Anti / 8 replies]
> Users requiring raw chains of thought for advanced prompt engineering can contact sales
So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly ref...

[Negative / Anti / 13 replies]
I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases "feature" are more or less just gimmicks.

[Mixed / Anti / 1 reply]
Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing ...
DeepSeek R1 0528 (DeepSeek) | Repo | 2025-05-28 | score 451 | 250 comments

[Neutral / Neutral / 4 replies]
Well that didn't take long, available from 7 providers through openrouter. https://openrouter.ai/deepseek/deepseek-r1-0528/providers
May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B pa...

[Negative / Mixed / 4 replies]
No information to be found about it. Hopefully we get benchmarks soon. Reminds me of the days when Mistral would just tweet a torrent magnet link

[Positive / Pro / 7 replies]
I love how Deepseek just casually drops new updates (that deliver big improvements) without fanfare.

[Neutral / Neutral / 13 replies]
Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?

[Neutral / Neutral / 6 replies]
What use cases are people using local LLMs for? Have you created any practical tools that actually increase your efficiency? I've been experimenting a bit but find it hard to get inspiration for useful applications
Magistral Small (Magistral) | Official | 2025-06-10 | score 941 | 424 comments

[Neutral / Neutral / 7 replies]
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF
ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1...

[Negative / Anti / 15 replies]
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is...

[Very negative / Anti / 1 reply]
Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600 page pdfs to OCR. They were all monospace text just for some reason the OCR was missing. I tried it, 80% of the "text" was recognised as images and output as whitespace so most of...

[Positive / Pro / 2 replies]
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in ...

[Neutral / Neutral / 2 replies]
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/
o3-pro (o-series) | Official | 2025-06-10 | score 289 | 199 comments

[Mixed / Mixed / 6 replies]
I'm really hoping GPT5 is a larger jump in metrics than the last several releases we've seen like Claude 3.5 - Claude 4 or o3-mini-high to o3-pro. Although I will preface that with the fact I've been building agents for about a year now and despite the benchmarks only showing sl...

[Neutral / Neutral / 5 replies]
The guys in the other thread who said that OpenAI might have quantized o3 and that's how they reduced the price might be right. This o3-pro might be the actual o3-preview from the beginning and the o3 might be just a quantized version. I wish someone benchmarks all of these mo...

[Neutral / Anti / 4 replies]
The benchmarks don’t look _that_ much better than o3. Does that mean Pro models are just incrementally better than base models, or are we approaching the higher end of a sigmoid function, with performance gains leveling off?

[Negative / Anti / 4 replies]
I am still not willing to upgrade to a Pro account. I pay $20 a month for both Gemini and ChatGPT, and for what I need this is currently enough. I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on...

[Positive / Pro / 2 replies]
here's a nice user review we published: https://www.latent.space/p/o3-pro
sama's highlight[0]:
> "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future."
I kept nu...
MiniMax-M1 (MiniMax) | Repo | 2025-06-18 | score 349 | 75 comments

[Positive / Pro / 2 replies]
1. this is apparently MiniMax's "launch week" - they did M1 on Monday and Hailuo 2 on Tuesday (https://news.smol.ai/issues/25-06-16-chinese-models). remains to be seen if they can keep up the pace of model releases for the rest of this week - these 2 were big ones, they aren't...

[Neutral / Anti / 3 replies]
In case you're wondering what it takes to run it, the answer is 8x H200 141GB [1] which costs $250k [2].
1. https://github.com/MiniMax-AI/MiniMax-M1/issues/2#issuecomme...
2. https://www.ebay.com/itm/335830302628

[Positive / Pro / 0 replies]
"We publicly release MiniMax-M1 at this https url" in the arxiv paper, and it isn't a link to an empty repo! I like these people already.

[Positive / Pro / 4 replies]
A few thoughts:
* A Singapore based company, according to LinkedIn. There doesn't seem to be much of a barrier to entry to building a very good LLM.
* Open weight models + the development of Strix Halo / Ryzen AI Max makes me optimistic that running great LLMs locally will be re...

[Neutral / Neutral / 2 replies]
This is stated nowhere on the official pages, but it's a Chinese company. https://en.wikipedia.org/wiki/MiniMax_(company)
Mistral Small 3.2 (Mistral) | Repo | 2025-06-20 | score 23 | 0 comments
Grok 4 (Grok) | 3rd party | 2025-07-10 | score 437 | 604 comments

[Positive / Pro / 5 replies]
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2. Specialized coding model coming "in a few weeks". I notice they didn't talk ab...

[Positive / Pro / 9 replies]
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent d...

[Very positive / Pro / 3 replies]
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc...

[Neutral / Neutral / 0 replies]
"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."
"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."
https://x.com/arcprize/status/1943168950763950555

[Negative / Anti / 16 replies]
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have. I can already use Gemini 2.5 Pro for free in AI s...
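The Grok Heavy trick described above (parallel agents whose results are compared at the end) resembles self-consistency sampling. A minimal sketch, where `ask_model` is a hypothetical stand-in for a real LLM API call and majority voting stands in for whatever reconciliation step xAI actually uses:

```python
# Sketch of "run N agents in parallel, then reconcile" under stated assumptions:
# ask_model is a placeholder, and reconciliation here is simple majority voting.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in: a real implementation would call an LLM API,
    # varying temperature or seed per agent to get diverse answers.
    return ["42", "42", "41"][seed % 3]

def heavy_answer(prompt: str, n_agents: int = 3) -> str:
    # Fan out the same prompt to several agents concurrently.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: ask_model(prompt, s), range(n_agents)))
    # Reconcile by keeping the most common answer across agents.
    return Counter(answers).most_common(1)[0][0]

print(heavy_answer("What is 6 * 7?"))
```

As the commenter notes, this trades cost and latency (N model calls per query) for reliability.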
3rd party
2025-07-10
|
Grok 4
Grok
|
328 | 253 |
|
|
Negative
Anti
5 repliesr
Here's something far more interesting about Grok 4: if you ask for its opinion on controversial subjects it sometimes runs a search on X for tweets "from:elonmusk" before it answers! https://simonwillison.net/2025/Jul/11/grok-musk/ |
||||
|
Negative
Mixed
5 repliesr
[edit to focus on pricing, leaving praise of Simon's post out despite being deserved]Simon claims, 'Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4.' This ignores the real price which skyroc... |
||||
|
Neutral
Anti
11 repliesr
Claude Code converted me from paying $0 for LLMs to $200 per month. Any co that wants a chance at getting that $200 ($300 is fine too) from me needs a Claude Code equivalent and a model where the equivalent's tools were part of its RL environment. I don't think I can go back t... |
||||
|
Negative
Anti
3 repliesr
Is it time for a new benchmark of "how easy is it to turn this AI into a 4chan poster", maybe it is since this seems to be an axis that Elon seems to want to distinguish his AI offering from everyone else's along. |
[Neutral · Neutral · 4 replies]
> My best guess is that these lines in the prompt were the root of the problem:The second line was recently removed, per the GitHub: https://github.com/xai-org/grok-prompts/commit/c5de4a14feb50... |
3rd party · 2025-07-11 · Moonshot Kimi K2 Instruct (Kimi) · score 348 · 179 comments

[Positive · Pro · 7 replies]
I tried Kimi on a few coding problems that Claude was spinning on. It’s good. It’s huge, way too big to be a “local” model — I think you need something like 16 H200s to run it - but it has a slightly different vibe than some of the other models. I liked it. It would definitely... |
[Neutral · Neutral · 6 replies]
Pelican on a bicycle result: https://simonwillison.net/2025/Jul/11/kimi-k2/ |
[Positive · Pro · 3 replies]
This is a very impressive general purpose LLM (GPT 4o, DeepSeek-V3 family). It’s also open source.I think it hasn’t received much attention because the frontier shifted to reasoning and multi-modal AI models. In accuracy benchmarks, all the top models are reasoning ones:https:... |
[Positive · Pro · 1 reply]
Technical strengths aside, I’ve been impressed with how non-robotic Kimi K2 is. Its personality is closer to Anthropic’s best: pleasant, sharp, and eloquent. A small victory over botslop prose. |
[Neutral · Neutral · 1 reply]
Big release - https://huggingface.co/moonshotai/Kimi-K2-Instruct model weights are 958.52 GB |
Official · 2025-07-22 · Qwen3 Coder Flash (Qwen) · score 765 · 366 comments

[Neutral · Neutral · 13 replies]
I'm currently making 2bit to 8bit GGUFs for local deployment! Will be up in an hour or so at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruc...Also docs on running it in a 24GB GPU + 128 to 256GB of RAM here: https://docs.unsloth.ai/basics/qwen3-coder |
[Positive · Pro · 4 replies]
> Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant firstI'm most excited for the smaller sizes because I'm interested in locally-runnable models that can sometimes write passable code, and I think we're getting close. But ... |
[Neutral · Neutral · 8 replies]
The "qwen-code" app seems to be a gemini-cli fork.https://github.com/QwenLM/qwen-code https://github.com/QwenLM/qwen-code/blob/main/LICENSEI hope these OSS CC clones converge at some point.Actually it is mentioned in the page: we’re also open-sourcing a command-line tool for a... |
[Neutral · Neutral · 14 replies]
At my work, here is a typical breakdown of time spent by work areas for a software engineer. Which of these areas can be sped up by using agentic coding?05%: Making code changes10%: Running build pipelines20%: Learning about changed process and people via zoom calls, teams cha... |
[Very positive · Pro · 5 replies]
I've been using it all day, it rips. Had to bump up toolcalling limit in cline to 100 and it just went through the app no issues, got the mobile app built, fixed throug hthe linter errors... wasn't even hosting it with the toolcall template on with the vllm nightly, just stock... |
Official · 2025-07-28 · GLM-4.5 (GLM) · score 247 · 134 comments

[Negative · Anti · 15 replies]
I typed "Hello" in their chat [1] and it replied back with "Hello! I'm Claude, an AI assistant created by Anthropic. How can I help you today?"Hmmm....[1] https://chat.z.ai/ |
[Neutral · Neutral · 3 replies]
We have x.ai and z.ai, but why?(Sorry... dad joke.) |
[Positive · Pro · 0 replies]
I tested the model with Claude Code and my experience was it was at least as good as Sonnet 3.5, perhaps they designed it that way since they benchmarked with it.Hard to test more thoroughly on more complex problems since the API is being hammered, but I could get it to consis... |
[Negative · Anti · 2 replies]
Tried it with a few prompts. The vibes feel super weird to me, in a way that I don't know how to describe. I'm not sure if it's just that I'm so used to Gemini 2.5 Pro and the other US models. Subjectively, it doesn't feel very smart.I asked it to analyze a recent painting I m... |
[Neutral · Neutral · 0 replies]
I just added it to Gnod Search for easy comparison with the other public LLMs out there:https://www.gnod.com/search/aiInteresting that they claim it outperforms O3, Grok-4 and Gemini-2.5-Pro for coding. I will test that over the coming days on my coding tasks. |
3rd party · 2025-07-29 · GLM-4.5-Air (GLM) · score 577 · 396 comments

[Very positive · Pro · 4 replies]
> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high qualit... |
[Positive · Pro · 2 replies]
> still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.I believe we are vastly underestimating what our existing hardware is c... |
[Negative · Anti · 3 replies]
Did you understand the implementation or just that it produced a result?I would hope an LLM could spit out a cobbled form of answer to a common interview question.Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why... |
[Negative · Anti · 6 replies]
Most likely its training data included countless Space Invaders in various programming languages. |
[Mixed · Mixed · 4 replies]
I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.That said, for something like this, I’d probably get more out of simply finding an existing implementat... |
Official · 2025-08-04 · Qwen-Image (Qwen) · score 544 · 159 comments

[Positive · Pro · 13 replies]
Not sure why this isn’t a bigger deal —- it seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge. |
[Mixed · Mixed · 1 reply]
Good release! I've added it to the GenAI Showdown site. Overall a pretty good model scoring around 40% - and definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when its quantized).That being said, it still lags prett... |
[Positive · Pro · 2 replies]
The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This only seems to apply those recognizable AI artifacts to only the elements needing to be edited. |
[Neutral · Neutral · 8 replies]
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on t... |
[Neutral · Neutral · 0 replies]
A silly question: do any of these models generate pixels and also vector overlays? I don't see why we need to solve the text problem pixel-for-pixel if we can just generate higher-level descriptions of the text (text, font, font size, etc). Ofc, it won't work in all situations... |
Official · 2025-08-05 · Claude Opus 4.1 (Claude) · score 841 · 328 comments

[Neutral · Neutral · 10 replies]
All three major labs released something within hours of each other. This anime arc is insane. |
[Mixed · Mixed · 5 replies]
Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, sp... |
[Negative · Anti · 22 replies]
I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certa... |
[Neutral · Neutral · 1 reply]
They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon(He had been stuck in the Team Rocket hideout (I believe) for weeks) |
[Very negative · Anti · 4 replies]
Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. Wish they let you tell somehow.At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day.I've basically wasted the morning o... |
Official · 2025-08-07 · GPT-5 (GPT) · score 2,063 · 2,482 comments

[Neutral · Neutral · 109 replies]
It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the differ... |
[Neutral · Anti · 12 replies]
GPT-5 knowledge cutoff: Sep 30, 2024 (10 months before release).Compare that toGemini 2.5 Pro knowledge cutoff: Jan 2025 (3 months before release)Claude Opus 4.1: knowledge cutoff: Mar 2025 (4 months before release)https://platform.openai.com/docs/models/comparehttps://deepmin... |
[Negative · Anti · 12 replies]
Going by the system card at: https://openai.com/index/gpt-5-system-card/> GPT‑5 is a unified system . . .OK> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which mod... |
[Negative · Anti · 7 replies]
I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year.What excites me now is that Gemini... |
[Negative · Anti · 13 replies]
The marketing copy and the current livestream appear tautological: "it's better because it's better."Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks. |
Repo · 2025-08-11 · GLM-4.5V (GLM) · score 27 · 0 comments (no comments sampled)
Paper · 2025-08-12 · GLM-4.5 (GLM) · score 417 · 83 comments

[Positive · Pro · 2 replies]
Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these... |
[Positive · Pro · 6 replies]
I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it on and confused it for Claude 4. Subjectively, my experience has been:1. Claude is somewhat bet... |
[Neutral · Neutral · 1 reply]
So GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.T... |
[Positive · Pro · 2 replies]
Seems like we may get local, open, workstation-grade models that are useful for coding in a few years. By workstation-grade I mean a computer around 2000 USD, and by useful for coding I mean around Sonnet 4 level. Current cloud based models are fun and useful, but a tool that ... |
[Positive · Pro · 0 replies]
This feels like the first open model that doesn’t require significant caveats when comparing to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results ... |
Official · 2025-08-12 · Claude Sonnet 4 (Claude) · score 1,324 · 692 comments

[Mixed · Mixed · 16 replies]
This is definitely one of my CORE problem as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context and it's not actually that interesting to see a new model that's marginally better than the next one (... |
[Neutral · Pro · 12 replies]
A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max): 1. Build context for the work you're doing. Put lots of your codebase into the context window. 2. Do work, but ... |
[Mixed · Mixed · 4 replies]
This is definitely good to have this as an option but at the same time having more context reduces the quality of the output because it's easier for the LLM to get "distracted". So, I wonder what will happen to the quality of code produced by tools like Claude Code if users do... |
[Mixed · Mixed · 20 replies]
My experience with the current tools so far:1. It helps to get me going with new languages, frameworks, utilities or full green field stuff. After that I expend a lot of time parsing the code to understand what it wrote that I kind of "trust" it because it is too tedious but "... |
[Positive · Pro · 8 replies]
One of the most helpful usages of CC so far is when I simply ask:"Are there any bugs in the current diff"It analyzes the changes very thoroughly, often finds very subtle bugs that would cost hours of time/deployments down the line, and points out a bunch of things to think thr... |
Official · 2025-08-14 · Gemma 3 270M (Gemma) · score 824 · 317 comments

[Positive · Pro · 30 replies]
Hi all, I built these models with a great team. They're available for download across the open model ecosystem so give them a try! I built these models with a great team and am thrilled to get them out to you.From our side we designed these models to be strong for their size o... |
[Negative · Anti · 18 replies]
My lovely interaction with the 270M-F16 model:> what's second tallest mountain on earth?The second tallest mountain on Earth is Mount Everest.> what's the tallest mountain on earth?The tallest mountain on Earth is Mount Everest.> whats the second tallest mountain?The ... |
[Positive · Pro · 3 replies]
I've got a very real world use case I use DistilBERT for - learning how to label wordpress articles. It is one of those things where it's kind of valuable (tagging) but not enough to spend loads on compute for it.The great thing is I have enough data (100k+) to fine-tune and r... |
[Mixed · Mixed · 14 replies]
This model is a LOT of fun. It's absolutely tiny - just a 241MB download - and screamingly fast, and hallucinates wildly about almost everything.Here's one of dozens of results I got for "Generate an SVG of a pelican riding a bicycle". For this one it decided to write a poem: ... |
[Positive · Pro · 6 replies]
Apple should be doing this. Unless their plan is to replace their search deal with an AI deal -- it's just crazy to me how absent Apple is. Tim Cook said, "it's ours to take" but they really seem to be grasping at the wind right now. Go Google! |
Official · 2025-08-21 · DeepSeek-V3.1 (DeepSeek) · score 778 · 263 comments

[Neutral · Neutral · 6 replies]
For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow../llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CP... |
[Mixed · Mixed · 6 replies]
For reference, here is the terminal-bench leaderboard:https://www.tbench.ai/leaderboardLooks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will... |
[Negative · Anti · 5 replies]
Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive?https://artificialanalysis.ai/models/deepseek-v3-1-reasoning |
[Neutral · Anti · 2 replies]
It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoningPricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1 |
[Mixed · Mixed · 2 replies]
It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those. |
Official · 2025-08-26 · Gemini 2.5 Flash Image (Gemini) · score 1,093 · 475 comments

[Very positive · Pro · 19 replies]
This is the gpt 4 moment for image editing models. Nano banana aka gemini 2.5 flash is insanely good. It made a 171 elo point jump in lmarena!Just search nano banana on Twitter to see the crazy results. An example. https://x.com/D_studioproject/status/1958019251178267111 |
[Positive · Pro · 7 replies]
I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana).https://genai-showdown.specr.netThis model gets 8 of the 12 prompts correct and easily comes within ... |
[Very negative · Anti · 4 replies]
Unfortunately, it suffers from the same safetyism than other many releases. Half of the prompts get rejected. How can you have character consistency if the model is forbidden from editing any human. And most of my photo editing involves humans, so basically this is just a usel... |
[Positive · Pro · 3 replies]
There is one thing Gemini 2.5 Flash Image can do that no other edit model can do: incorporate multiple images simultaneously without shenanigans due to its multimodality, e.g. for Flux Kontext, if you want to "put the person in the first image into the second image", you have ... |
[Positive · Pro · 6 replies]
I digitised our family photos but a lot of them were damaged (shifted colours, spills, fingerprints on film, spots) that are difficult to correct for so many images. I've been waiting for image gen to catch up enough to be able to repair them all in bulk without changing detai... |
Official · 2025-08-29 · Grok Code Fast 1 (Grok) · score 512 · 477 comments

[Positive · Pro · 10 replies]
Tested this yesterday with Cline. It's fast, works well with agentic flows, and produces decent code. No idea why this thread is so negative (also got flagged while I was typing this?) but it's a decent model. I'd say it's at or above gpt5-mini level, which is awesome in my bo... |
[Negative · Anti · 14 replies]
It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed.I would have thought it uncontroversial view among software engineers that token quality is much important than token ... |
[Negative · Anti · 1 reply]
Is this the model that is the "Coding" version of Grok-4 promised when Grok-4 had awful coding benchmarks?I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up wa... |
[Negative · Anti · 2 replies]
"On the full subset of SWE-Bench-Verified, grok-code-fast-1 scored 70.8% using our own internal harness."Let's see this harness, then, because third party reports rate it at 57.6%https://www.vals.ai/models/grok_grok-code-fast-1 |
[Mixed · Anti · 3 replies]
I've actually seem really good outputs from the regular Grok 4. The issue seemed to be that it didn't explain anything and just made some changes, which like, I said, were pretty good. I never wanted a faster version, I just wanted a bit more feedback and explanations for sugg... |
Official · 2025-09-12 · Qwen3-Next 80B-A3B (Thinking) (Qwen) · score 569 · 230 comments

[Positive · Pro · 3 replies]
Coolest part of Qwen3-Next, in my opinion, (after the linear attention parts) is that they do MTP without adding another un-embedding matrix.Deepseek R1 also has a MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...But Deepseek R1 adds embed_to... |
[Positive · Mixed · 4 replies]
Alibaba keeps releasing gold contentI just tried Qwen3-Next-80B-A3B on Qwen chat, and it's fast! The quality seem to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial analysisAccording to Qwen Chat, Qwen3-Next has the f... |
[Negative · Anti · 5 replies]
llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"Here's a classic ASCII art representation of SpongeBob SquarePants: .------. / o o \ | | | \___/ | \_______/ llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \ "An ASCII of spongebob" Here's an ASCII art of SpongeB... |
[Positive · Pro · 2 replies]
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than 14B model depending on how you off load your VRAM and CPU. That’s insane. |
[Neutral · Neutral · 8 replies]
The same week Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true then this could lead to a lot less demand for Nvidia, Oracle, Coreweave etc |
Official · 2025-09-15 · GPT-5-Codex (GPT) · score 396 · 137 comments

[Positive · Pro · 5 replies]
Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1].SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).As someon... |
[Very positive · Pro · 3 replies]
I've been a hardcore claude-4-sonnet + Cursor fan for a long time, but in the last 2 months my usage went through the roof. I started with the basic Cursor subscription, then upgraded to pro, until I hit usage limits again. Then I started using my own Claude API key but I was ... |
[Very positive · Pro · 4 replies]
Codex CLI IDE just works, very impressed with the quality. If you tried it a while back and didn’t like it, try it again via the vscode extension generous usage included with plus.Ditched my Claude code max sub for the ChatGPT pro $200 plan. So much faster, and not hit any lim... |
[Positive · Pro · 2 replies]
From my observation of the past 2 weeks is that Claude Code is getting dramatically worse and super low usage quota's while OpenAI Codex is getting great and has a very generous usage quota in comparison.For people that have not tried it in say ~1 month, give Codex CLI a try. |
[Positive · Pro · 1 reply]
It's been interesting reading this thread and seeing that others have also switched to using Codex over Claude Code. I kept running into a huge issue with Claude Code creating mock implementations and general fakery when it was overwhelmed. I spent so much time tuning my input... |
Official · 2025-09-15 · GPT-5-Codex (GPT) · score 250 · 141 comments

[Very positive · Pro · 9 replies]
I've been extremely impressed (and actually had quite a good time) with GPT-5 and Codex so far. It seems to handle long context well, does a great job researching the code, never leaves things half-done (with long tasks it may leave some steps for later, but it never does 50% ... |
[Neutral · Neutral · 0 replies]
This should probably be merged with the other GPT-5-Codex thread at https://news.ycombinator.com/item?id=45252301 since nobody in this thread is talking about the system card addendum. |
[Negative · Anti · 2 replies]
My problem _still_ with all of the codex/gpt based offerings is that they think for way too long. After using Claude 4 models through cursor max/ampcode I feel much more effective given it's speed. Ironically, Claude Code feels just as slow as codex/gpt (even with my company p... |
[Mixed · Mixed · 2 replies]
Interesting, the new model uses a different prompt in Codex CLI that's ~half the size (10KB vs. 23KB) of the previous prompt[0][1].SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchm... |
[Negative · Anti · 2 replies]
I signed up to OpenAI, verified my identity, and added my credit card, bought $10 of credits.But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took.Are you meant to only use Codex with the... |
Official · 2025-09-20 · Grok 4 Fast (Grok) · score 96 · 76 comments

[Negative · Anti · 3 replies]
Why would a model called "Fast" not advertise the tokens per second speed it performs at? Is "Fast" not representing speed, but another meaning? Is it too variable? |
[Neutral · Neutral · 1 reply]
Matches Grok 4 at the top of the Extended NYT Connections leaderboard: https://github.com/lechmazur/nyt-connections/ |
[Positive · Pro · 1 reply]
grok-code-fast-1 has been my preferred model lately, but I don't see any mention of it as part of this release. I'm wondering if this might be better? Even if grok-code-fast-1 might be slightly worse than Gemini 2.5 Pro, the speed of iteration can't be beat. |
[Positive · Pro · 4 replies]
Surprising to see negativity here. I send all my LLM queries to 5 LLMs - ChatGPT, Claude, DeepSeek (local), Perplexity, and Grok - and Grok consistently gives good answers and often the most helpful answers. It's ~always king when there's any 'ethical' consideration (i.e. othe... |
[Negative · Anti · 4 replies]
A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests? |
Official · 2025-09-22 · siliconflow/deepseek-v3.1-terminus (DeepSeek) · score 101 · 28 comments

[Mixed · Mixed · 0 replies]
Notable performance improvement in agentic tool use: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-TerminusThe Deepseek provider may train on your prompts: https://openrouter.ai/deepseek/deepseek-v3.1-terminus |
[Neutral · Neutral · 0 replies]
The link is off. This link works https://api-docs.deepseek.com/updates#deepseek-v31-terminus |
[Very negative · Anti · 1 reply]
I tried V3.1 but it was driving me crazy by ignoring parts of user input, which R1 never did. I had many such instances when e.g. asking about running DeepSeek 671B it instead picked DeepSeek 67B because 671B is too large to exist so I must have made a mistake etc. I concluded... |
[Mixed · Mixed · 4 replies]
> What’s improved? Language consistency: fewer CN/EN mix-ups & no more random chars.It's good that they made this improvement. But is there any advantages at this point using DeepSeek over Qwen? |
[Positive · Pro · 0 replies]
Interesting--I'd seen Chinese characters surprise inserted when it was just repeating back input with one provider, but not others. (I'd also occasionally seen tokens surprise-translated to Chinese.)There's a GitHub bug about it that leads to more discussion here: https://gith... |
Repo · 2025-09-22 · Qwen3-Omni Flash (Qwen) · score 571 · 142 comments

[Positive · Pro · 12 replies]
Interesting, the pacing seemed very slow when conversing in english, but when I spoke to it in spanish, it sounded much faster. It's really impressive that these models are going to be able to do real time translation and much more.The Chinese are going to end up owning the AI... |
[Neutral · Neutral · 3 replies]
You can try it out on https://chat.qwen.ai/ - sign in with Google or GitHub (signed out users can't use the voice mode) and then click on the voice icon.It has an entertaining selection of different voices, including:*Dylan* - A teenager who grew up in Beijing's hutongs*Peter*... |
[Neutral · Neutral · 4 replies]
The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.I wonder if we'll see a macOS port soon - currently it very much needs an N... |
[Very positive · Pro · 0 replies]
Here is the demo video on it. The video w/ sound input -> sound output while doing translation from the video to another language was the most impressive display I've seen yet.https://www.youtube.com/watch?v=_zdOrPju4_g |
[Positive · Pro · 2 replies]
Speech input + speech output is a big deal. In theory you can talk to it using voice, and it can respond in your language, or translate for someone else, without intermediary technologies. Right now you need wakeword, speech to text, and then text to speech, in addition to you... |
Official · 2025-09-23 · Qwen3-VL 235B-A22B (Qwen) · score 434 · 160 comments

[Very positive · Pro · 11 replies]
As I mentioned yesterday - I recently needed to process hundreds of low quality images of invoices (for a construction project). I had a script that had used pil/opencv, pytesseract, and open ai as a fallback. It still has a staggering number of failures.Today I tried a handfu... |
[Very positive · Pro · 4 replies]
The Chinese are doing what they have been doing to the manufacturing industry as well. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be bechmaxxed but as another comment said,... |
[Neutral · Neutral · 2 replies]
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF teach week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage ... |
[Very positive · Pro · 2 replies]
The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of proprietary models and still released it as open-weights. My first tests suggest this might actually be true, will continue testing. Wow |
[Negative · Anti · 3 replies]
Sadly it still fails the "extra limb" test.I have a few images of animals with an extra limb photoshopped onto them. A dog with an leg coming out of it's stomach, or a cat with two front right legs.Like every other model I have tested, it insists that the animals have their an... |
Official · 2025-09-25 · Gemini 2.5 Flash (Gemini) · score 540 · 279 comments

[Negative · Anti · 14 replies]
This really captures something I've been experiencing with Gemini lately. The models are genuinely capable when they work properly, but there's this persistent truncation issue that makes them unreliable in practice.I've been running into it consistently, responses that just s... |
[Neutral · Neutral · 2 replies]
I added support to these models to my llm-gemini plugin, so you can run them like this (using uvx so no need to install anything first): export LLM_GEMINI_KEY='...' uvx --isolated --with llm-gemini llm -m gemini-flash-lite-latest 'An epic poem about frogs at war with ducks' Re... |
[Negative · Neutral · 5 replies]
Serious question: If it's an improved 2.5 model, why don't they call it version 2.6? Seems annoying to have to remember if you're using the old 2.5 or the new 2.5. Kind of like when Apple released the third-gen iPad many years ago and simply called it the "new iPad" without a ... |
[Mixed · Mixed · 8 replies]
Google seems to be the main foundation model provider that's really focusing on the latency/TPS/cost dimensions. Anthropic/OpenAI are really making strides in model intelligence, but underneath some critical threshold of performance, the really long thinking times make workflo... |
[Neutral · Neutral · 5 replies]
Non-AI Summary:Both models have improved intelligence on Artificial Analysis index with lower end-to-end response time. Also 24% to 50% improved output token efficiency (resulting in lower cost).Gemini 2.5 Flash-Lite improvements include better instruction following, reduced v... |
Repo · 2025-09-29 · DeepSeek V3.2 Exp (DeepSeek) · score 309 · 50 comments

[Positive · Pro · 2 replies]
The 2nd order effect that not a lot of people talk about is price: the fact that model scaling at this pace also correlates with price is amazing.I think this is just as important to distribution of AI as model intelligence is.AFAIK there are no fundamental "laws" that prevent... |
||||
|
Positive
Pro
2 repliesr
Happy to see Chinese OSS models keep getting better and cheaper. It also comes with a 50% API price drop for an already cheap model, now at:$0.28/M Input ($0.028/M cache hit) > $0.42/M Output |
||||
|
Neutral
Neutral
2 repliesr
https://openrouter.ai/deepseek/deepseek-v3.2-exp |
||||
|
Neutral
Neutral
0 repliesr
Not sure if I get it correctly:They trained a thing to learn mimicking the full attention distribution but only filtering the top-k (k=2048) most important attention tokens so that when the context window increases, the compute does not go up linearly but constantly for the at... |
||||
|
Positive
Pro
0 repliesr
wow...gigantic reduction in cost while holding the benchmarks mostly steady. Impressive. |
||||
|
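The fourth comment above paraphrases DeepSeek's sparse-attention idea: approximate full attention while only attending to the top-k highest-scoring keys per query, so per-query compute stays bounded as the context grows. A minimal single-head sketch of that selection step, in numpy (shapes and k are illustrative, not DeepSeek's actual kernel or k=2048):

```python
import numpy as np

def topk_attention(q, k_mat, v, k_keep):
    """Single-head attention that keeps only the k_keep highest-scoring
    keys per query and renormalizes the softmax over that subset."""
    d = q.shape[-1]
    scores = q @ k_mat.T / np.sqrt(d)                      # (n_q, n_k)
    if k_keep < scores.shape[-1]:
        # Threshold each row at its k_keep-th largest score; mask the rest.
        kth = np.partition(scores, -k_keep, axis=-1)[:, -k_keep][:, None]
        scores = np.where(scores >= kth, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)           # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k_mat = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))

dense = topk_attention(q, k_mat, v, k_keep=16)   # keeps all 16 keys
sparse = topk_attention(q, k_mat, v, k_keep=4)   # top-4 keys per query
```

With `k_keep` equal to the key count this reduces to ordinary dense attention; the savings come from only materializing the selected scores, which a real kernel does without computing the full score matrix first.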
Official | 2025-09-29 | Claude Sonnet 4.5 (Claude) | Score 1,585 | 786 comments

- Positive / Pro (12 replies): I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/ It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of bench...
- Negative / Anti (21 replies): Anecdotal evidence. I have a fairly large web application with ~200k LoC. Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI). "implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing ...
- Very negative / Anti (8 replies): I haven't shouted into the void for a while. Today is as good a day as any other to do so. I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection b...
- Negative / Anti (8 replies): > Practically speaking, we've observed it maintaining focus for more than 30 hours on complex, multi-step tasks. Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and it doesn't show up at all ...
- Mixed / Mixed (7 replies): I just ran this through a simple change I've asked Sonnet 4 and Opus 4.1, and it fails too. It's a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this chang...
Official | 2025-09-30 | GLM-4.6 (GLM) | Score 43 | 10 comments

- Positive / Pro (1 reply): GLM 4.5 is a great budget option for me. After cancelling Claude sub ($20 USD one), I've been trying Gemini 2.5 Pro through Gemini CLI, and it was okay for landing pages and UI. However, I switched to GLM through Claude Code using their cheapest subscription and it's been pret...
- Positive / Pro (0 replies): GLM 4.5 for me one of the best coding model when running with agents such as Claude Code and Opencode (excluding Claude and GPT-5). Looking forward to try GLM 4.6. But Z AI should do like KIMI¹ and launch a verifier to see how GLM models behave on 3rd party providers. [1] https:/...
- Positive / Pro (1 reply): Very respectable. If Sonnet 4.5 was delayed by a day, GLM 4.6 would have been the obvious best choice for agentic coding... up there with gpt-codex. The z.ai monthly plan looks like a steal if you're working on personal hobby projects instead of paying the $200/mo price for Cla...
- Neutral / Neutral (0 replies): It would be great if people that are using GLM-4.6 within Claude Code could chime in with their experience in using it. I'm a MAX subscriber and having extra cheap tokens would be very helpful.
Official | 2025-10-15 | Claude Haiku 4.5 (Claude) | Score 730 | 287 comments

- Mixed / Mixed (6 replies): Very preliminary testing is very promising, seems far more precise in code changes over GPT-5 models in not ingesting irrelevant to the task at hand code sections for changes which tends to make GPT-5 as a coding assistant take longer than sometimes expected. With that being t...
- Negative / Neutral (19 replies): Ain't nobody got time to pick models and compare features. It's annoying enough having to switch from one LLM ecosystem to another all the time due to vague usage restrictions. I'm paying $20/mo to Anthropic for Claude Code, to OpenAI for Codex, and previously to Cursor for...
- Neutral / Neutral (1 reply): I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.
- Positive / Pro (8 replies): Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...
- Neutral / Neutral (4 replies): I am really interested in the future of Opus; is it going to be an absolute monster, and continue to be wildly expensive? Or is the leap from 4 -> 4.5 for it going to be more modest.
Repo | 2025-10-20 | DeepSeek OCR (DeepSeek) | Score 1,003 | 244 comments

- Positive / Pro (7 replies): The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote: > Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required...
- Neutral / Neutral (4 replies): I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-c...
- Negative / Anti (5 replies): The paper makes no mention of Anna's Archive. I wouldn't be surprised if DeepSeek took advantage of Anna's offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis. https://annas-archive.org/blog...
- Neutral / Neutral (2 replies): For everyone wondering how good this and other benchmarks are: the OmniAI benchmark is bad; instead check OmniDocBench[1] out; Mistral OCR is far far behind most Open Source OCR models and even further behind than Gemini; end-to-end OCR is still extremely tricky; composed pip...
- Neutral / Neutral (7 replies): How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
3rd party | 2025-11-02 | Tongyi DeepResearch (Yi) | Score 365 | 153 comments

- Neutral / Neutral (7 replies): It makes me wonder if we'll see an explosion of purpose-trained LLMs because we hit diminishing returns on investment with pre-training, or if it takes a couple of months to fold these advantages back into the frontier models. Given the size of frontier models I would assume that th...
- Negative / Anti (11 replies): Has anyone found these deep research tools useful? In my experience, they generate really bland reports that don't go much further than summarization of what a search engine would return.
- Neutral / Neutral (2 replies): I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch
- Mixed / Pro (1 reply): This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though.
- Neutral / Neutral (13 replies): Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like. Constra...
Official | 2025-11-06 | Kimi K2 Thinking (Kimi) | Score 936 | 427 comments

- Neutral / Pro (5 replies): As a Chinese user, I can say that many people use Kimi, even though I personally don't use it much. China's open-source strategy has many significant effects—not only because it aligns with the spirit of open source. For domestic Chinese companies, it also prevents startups fr...
- Neutral / Neutral (4 replies): uv tool install llm; llm install llm-moonshot; llm keys set moonshot (paste key); llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle' https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D... Here's what I got using OpenRouter's moonshotai/kimi-k...
- Mixed / Mixed (17 replies): It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run local...
- Positive / Pro (9 replies): Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?
- Very positive / Pro (1 reply): From our tests, Kimi K2 Thinking is better than literally everything - gpt-5, claude 4.5 sonnet. The only model that is better than Kimi K2 Thinking is GPT-5 Codex. It's now available on https://okara.ai if anyone wants to try it.
Repo | 2025-11-08 | Codex Mini (GPT) | Score 56 | 54 comments

- Positive / Neutral (2 replies): I managed to get GPT-5-Codex-Mini to draw me a pelican. It's not a very good one! https://static.simonwillison.net/static/2025/codex-hacking-m... For comparison, here's GPT-5-Codex (not mini) https://static.simonwillison.net/static/2025/codex-hacking-d... and full GPT-5: https:...
- Positive / Pro (0 replies): Since they also limited the usage of Codex CLI quite a bit, this might help. I really like the gpt-5-codex model. It has the most impressive dynamic reasoning effort I've seen. Responding instantly to simple questions while thinking very long when necessary. It's also interest...
- Negative / Anti (2 replies): Codex isn't even a good model.
- Negative / Anti (4 replies): GPT-5 and GPT-5-Codex are already not clever enough for anything interesting.
Official | 2025-11-12 | GPT-5.1 (GPT) | Score 555 | 726 comments

- Negative / Anti (27 replies): I don't want more conversational, I want more to the point. Less telling me how great my question is, less about being friendly, instead I want more cold, hard, accurate, direct, and factual results. It's a machine and a tool, not a person and definitely not my friend.
- Negative / Anti (22 replies): All the examples of "warmer" generations show that OpenAI's definition of warmer is synonymous with sycophantic, which is a surprise given all the criticism against that particular aspect of ChatGPT. I suspect this approach is a direct response to the backlash against removing 4o.
- Negative / Anti (12 replies): > what romanian football player won the premier league > The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that's incorrect; he never won the league. > ... > No Romanian footballer has ever won...
- Mixed / Neutral (3 replies): I've seen various older people that I'm connected with on Facebook posting screenshots of chats they've had with ChatGPT. It's quite bizarre from that small sample how many of them take pride in "baiting" or "bantering" with ChatGPT and then post screenshots showing how they "g...
- Positive / Pro (6 replies): Seems like people here are pretty negative towards a "conversational" AI chatbot. ChatGPT has a lot of frustrations and ethical concerns, and I hate the sycophancy as much as everyone else, but I don't consider being conversational to be a bad thing. It's just preference I guess...
Official | 2025-11-17 | Grok 4.1 Fast (Grok) | Score 140 | 128 comments

- Neutral / Neutral (5 replies): https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
- Negative / Anti (4 replies): No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (And from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking.)
- Negative / Anti (5 replies): Not a big fan of emojis becoming the norm in LLM output. It seems Grok 4.1 uses more emojis than 4. Also GPT-5.1 Thinking is now using emojis, even in math reasoning. 5 didn't do that.
- Very negative / Anti (3 replies): Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which make it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" m...
- Negative / Neutral (1 reply): It is exhausting deciding which model to use on any given day.
Paper | 2025-11-18 | Gemini 3 Pro Preview (Gemini) | Score 280 | 335 comments

- Neutral / Neutral (19 replies): Benchmarks from page 4 of the model card:

  | Benchmark | 3 Pro | 2.5 Pro | Sonnet 4.5 | GPT-5.1 |
  |---|---|---|---|---|
  | Humanity's Last Exam | 37.5% | 21.6% | 13.7% | 26.5% |
  | ARC-AGI-2 | 31.1% | 4.9% | 13.6% | 17.6% |
  | GPQ... |

- Positive / Mixed (13 replies): It is interesting that the Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.
- Positive / Pro (1 reply): One benchmark I would really like to see: instruction adherence. For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outp...
- Neutral / Neutral (8 replies): There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
- Neutral / Neutral (6 replies): Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/[email protected]` which obviously fails... Anyone happen to know why? Is this website by any chance sharing information on safe medical abo...
Official | 2025-11-18 | Gemini 3 Pro Preview (Gemini) | Score 1,735 | 1,056 comments

- Positive / Pro (19 replies): Out of curiosity, I gave it the latest Project Euler problem published on 11/16/2025, very likely out of the training data. Gemini thought for 5m10s before giving me a python snippet that produced the correct answer. The leaderboard says that the 3 fastest humans to solve this pr...
- Very positive / Pro (1 reply): This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... I spent years building a compiler that takes our custom XML format and generat...
- Positive / Pro (11 replies): Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, e...
- Neutral / Pro (5 replies): Static Pelican is boring. First attempt: Generate SVG animation of following: 1 - There is High fantasy mage tower with a top window a dome; 2 - Green goblin come in front of tower with a torch; 3 - Grumpy old mage with beard appear in a tower window in high purple hat; 4 - Mage sends...
- Negative / Anti (14 replies): I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong). For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, ...
Official | 2025-11-19 | GPT-5.1 Codex Max (GPT) | Score 483 | 319 comments

- Mixed / Pro (22 replies): I've been using a lot of Claude and Codex recently. One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character...
- Neutral / Neutral (14 replies): Rest assured that we are better at training models than naming them ;D New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0. Natively trained to work across many hours across multiple context windows via compaction. 30% mor...
- Positive / Pro (4 replies): Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI. As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini ...
- Negative / Anti (5 replies): OpenAI likes to time their announcements alongside major competitor announcements to suck up some of the hype. (See for instance the announcement of GPT-4o a single day before Google's I/O conference.) They were probably sitting on this for a while. That makes me think this is a ...
- Neutral / Neutral (3 replies): Thinking level medium: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
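The second comment on the GPT-5.1 Codex Max post mentions the model being "natively trained to work across multiple context windows via compaction." The model-side training is opaque, but the harness-side idea is a familiar one: when the transcript outgrows the window, fold the oldest turns into a summary and keep going. A toy sketch of that loop, where the token counter and summarizer are caller-supplied stand-ins (not OpenAI's implementation):

```python
def compact(messages, max_tokens, count_tokens, summarize):
    """Fold the oldest messages into a summary until the transcript fits
    within max_tokens. messages is a list of strings, oldest first;
    count_tokens and summarize are supplied by the caller."""
    messages = list(messages)
    while len(messages) > 1 and sum(map(count_tokens, messages)) > max_tokens:
        # Merge the two oldest entries into one summary, preserving recency.
        merged = summarize(messages[0] + "\n" + messages[1])
        messages = [merged] + messages[2:]
    return messages

# Stand-ins for illustration: count words as tokens; "summarize" by truncation.
count = lambda m: len(m.split())
summ = lambda m: " ".join(m.split()[:5]) + " ..."

history = ["alpha " * 40, "beta " * 40, "gamma " * 40, "delta delta"]
compacted = compact(history, max_tokens=60, count_tokens=count, summarize=summ)
```

A real harness would call the model itself as the summarizer and use the tokenizer's count, but the control flow is the same: recent turns survive verbatim, old turns survive only as a compressed summary.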
Official | 2025-11-24 | Claude Opus 4.5 (Claude) | Score 1,113 | 506 comments

- Positive / Pro (19 replies): The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads. Also notable: they're claiming SOTA prompt injection resi...
- Negative / Mixed (15 replies): This is gonna be game-changing for the next 2-4 weeks before they nerf the model. Then for the next 2-3 months people complaining about the degradation will be labeled "skill issue". Then a sacrificial Anthropic engineer will "discover" a couple obscure bugs that "in some cases"...
- Positive / Pro (24 replies): I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else. I ...
- Mixed / Mixed (1 reply): The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks. There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded...
- Positive / Pro (9 replies): Seeing these benchmarks makes me so happy. Not because I love Anthropic (I do like them) but because it's staving off me having to change my coding agent. This world is changing fast, and both keeping up with state of the art and/or the feeling of FOMO is exhausting. I've been hol...
Repo | 2025-12-01 | DeepSeek-V3.2-Speciale (DeepSeek) | Score 20 | 0 comments | — (no comments to label)
Repo | 2025-12-01 | siliconflow/deepseek-v3.2 (DeepSeek) | Score 982 | 465 comments

- Positive / Pro (6 replies): Well props to them for continuing to improve, winning on cost-effectiveness, and continuing to publicly share their improvements. Hard not to root for them as a force to prevent an AI corporate monopoly/duopoly.
- Neutral / Pro (15 replies): How will the Google/Anthropic/OpenAIs of the world make money on AI if open models are competitive with their models? What hurt open source in the past was its inability to keep up with the quality and feature depth of closed source competitors, but models seem to be reaching...
- Positive / Pro (2 replies): Worth noting this is not only good on benchmarks, but significantly more efficient at inference: https://x.com/_thomasip/status/1995489087386771851
- Mixed / Mixed (1 reply): > DeepSeek-V3.2 introduces significant updates to its chat template compared to prior versions. The primary changes involve a revised format for tool calling and the introduction of a "thinking with tools" capability. At first, I thought they had gone the route of implementi...
- Mixed / Mixed (7 replies): It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphics cards ($15-20k machine), can it even run with any reasonable context window that isn't like a crawling 10 tps? Frontier models are far exceeding even the...
Official | 2025-12-02 | Mistral 3 (Mistral) | Score 826 | 236 comments

- Very positive / Pro (7 replies): I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and fol...
- Mixed / Mixed (3 replies): The new large model uses DeepseekV2 architecture. 0 mention on the page lol. It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar ...
- Positive / Pro (2 replies): The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/
- Positive / Pro (1 reply): Europe's bright star has been quiet for a while, great to see them back and good to see them come back to the open source light with Apache 2.0 licenses - they're too far from the SOTA pack for exclusive/proprietary models to work in their favor. Mistral had the best small mode...
- Positive / Pro (4 replies): Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.
Official | 2025-12-05 | Gemini 3 Pro Preview (Gemini) | Score 566 | 295 comments

- Mixed / Mixed (27 replies): Well. It is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs. In fact GPT5 wrote an edge detection...
- Positive / Pro (4 replies): I do some electrical drafting work for construction and throw basic tasks at LLMs. I gave it a shitty harness and it almost 1-shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.
- Positive / Pro (2 replies): These OCR improvements will almost certainly be brought to Google Books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this fro...
- Neutral / Neutral (3 replies): Interesting "ScreenSpot Pro" results: Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Opus 4.5 49.9%, GPT-5.1 3.5%. (ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use, https://arxiv.org/abs/2504.07981)
- Neutral / Neutral (4 replies): In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
Official | 2025-12-09 | Devstral 2 (latest) (Devstral) | Score 745 | 349 comments

- Positive / Pro (9 replies): llm install llm-mistral; llm mistral refresh; llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle" https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D... Pretty good for a 123B model! (That said I'm not 100% certain I guessed the correct model ID, ...
- Positive / Mixed (2 replies): Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku ...
- Positive / Pro (2 replies): I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them. It spent about half an hour, correctly identified wh...
- Positive / Mixed (0 replies): So I tested the bigger model with my typical standard test queries which are not so tough, not so easy. They are also some that you wouldn't find extensive training data for. Finally, I already have used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3... Here is wha...
- Mixed / Mixed (9 replies): Looks interesting, eager to play around with it! Devstral was a neat model when it released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so gonna be interesting to see if Devstral 2 can replace it. I'm a bit saddened ...
Official | 2025-12-10 | Qwen3-Omni Flash Realtime (Qwen) | Score 316 | 106 comments

- Positive / Pro (7 replies): This is a 30B parameter MoE with 3B active parameters and is the successor to their previous 7B omni model. [1] You can expect this model to have similar performance to the non-omni version. [2] There aren't many open-weights omni models so I consider this a big deal. I would us...
- Neutral / Neutral (4 replies): Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does. Are there any open weight models that do? Not talking about speech to text -> LLM -> text to speech btw, I mean a real voice <-> language model. ed...
- Neutral / Neutral (2 replies): Is there a way to run these Omni models on a Macbook quantized via GGUF or MLX? I know I can run it in LMStudio or Llama.cpp but they don't have streaming microphone support or streaming webcam support. Qwen usually provides example code in Python that requires Cuda and a non-q...
- Neutral / Neutral (0 replies): Wayback for those that can't reach it: https://web.archive.org/web/20251210164048/https://qwen.ai/b...
- Neutral / Neutral (1 reply): The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user. I'm curious how anyone has solved this.
Official | 2025-12-11 | GPT-5.2 (GPT) | Score 1,195 | 1,083 comments

- Neutral / Neutral (18 replies): In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine. The thing that would now make the biggest difference isn't "more intelligence", whatever that...
- Negative / Anti (10 replies): Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image? 0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnO...
- Negative / Mixed (9 replies): I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform, and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "pro...
- Negative / Anti (5 replies): Looks like they've begun censoring posts at r/Codex and not allowing complaint threads, so here is my honest take: it is faster, which is appreciated, but not as fast as Opus 4.5; I see no changes, very little noticeable improvements over 5.1; I do not see any value in exchange ...
- Neutral / Pro (5 replies): I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/): The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9. The medium-reasoning version also improves: 62.7 → 72.1. The no-reasoning version also improves: ...
Official | 2025-12-16 | gpt-image-1.5 (gpt-image) | Score 522 | 256 comments

- Positive / Pro (16 replies): Okay, results are in for GenAI Showdown with the new gpt-image 1.5 model for the editing portions of the site! https://genai-showdown.specr.net/image-editing Conclusions: OpenAI has always had some of the strongest prompt understanding alongside the weakest image fidelity. This u...
- Mixed / Mixed (4 replies): I have a Nano Banana Pro blog post in the works expanding on my experiments with Nano Banana (https://news.ycombinator.com/item?id=45917875). Running a few of my test cases from that post and the upcoming blog post through this new ChatGPT Image model, this new model is better...
- Negative / Anti (7 replies): If this was a farm of sweatshop Photoshoppers in 2010, who download all images from the internet and provide a service of combining them on your request, this would escalate pretty quickly. Question: with copyright and authorship dead wrt AI, how do I make (at least) new content...
- Positive / Pro (2 replies): I am very impressed. A benchmark I like to run is to have it create sprite maps and UV texture maps for an imagined 3D model. Noticed it captured a Mega Man Legends vibe... https://x.com/AgentifySH/status/2001037332770615302 and here it generated a texture map from a 3D character: https:/...
- Negative / Anti (4 replies): It's really weird to see "make images from memories that aren't real" as a product pitch.
Official | 2025-12-17 | Gemini 3 Flash Preview (Gemini) | Score 1,102 | 580 comments

- Very positive / Pro (24 replies): Don't let the "flash" name fool you, this is an amazing model. I have been playing with it for the past few weeks, it's genuinely my new favorite; it's so fast and it has such a vast world knowledge that it's more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fra...
- Mixed / Mixed (8 replies): This is awesome. No preview release either, which is great for production. They are pushing the prices higher with each release though: API pricing is up to $0.50/M for input and $3/M for output. For comparison: Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output; Gemini 2.5 Fl...
- Positive / Pro (4 replies): Feels like Google is really pulling ahead of the pack here. A model that is cheap, fast and good, combined with Android and GSuite integration, seems like such a powerful combination. Presumably a big motivation for them is to be first to get something good and cheap enough they c...
- Mixed / Mixed (6 replies): These flash models keep getting more expensive with every release. Is there an OSS model that's better than 2.0 Flash with similar pricing, speed and a 1M context window? Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world ...
- Positive / Pro (3 replies): This model is breaking records on my benchmark of choice, which is 'the fraction of Hacker News comments that are positive.' Even people who avoid Google products on principle are impressed. Hardly anyone is arguing that ChatGPT is better in any respect (except brand recogniti...
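Per-million-token prices like the ones quoted in the comments above ($0.50/M input, $3.00/M output for Gemini 3 Flash) translate to per-request cost with simple arithmetic. A small helper, using the prices as the commenter states them (they may change, and this ignores cache-hit discounts):

```python
def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# A 200k-token prompt with a 10k-token answer at the quoted Flash prices:
cost = request_cost(200_000, 10_000, 0.50, 3.00)  # → 0.13 dollars
```

The same helper makes provider comparisons in these threads concrete: plug in any two price pairs and the same token counts, and the ratio of the results is the real-world cost multiple.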
Official
2025-12-18
|
GPT-5.2 Codex
GPT
|
589 | 318 |
|
|
Very positive
Pro
15 repliesr
If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, metho... |
||||
|
Positive
Pro
9 repliesr
I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging t... |
Mixed · Mixed · 4 replies
Since they are not showing you how this model compares against the benchmarks they are showing, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context:
SWE-Bench (Pro / Verified)
Model | Pro (%) | Verified (%)
-----------------...
Positive · Pro · 4 replies
The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is more clear, and code that is more maintainable. A pattern I use is: set up a GitHub issue with Claude plan mode, then have Codex execute it...
Positive · Pro · 3 replies
I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex s...
Official · 2025-12-22 · GLM-4.7 (GLM) · Score 437 · Comments 235

Positive · Pro · 11 replies
My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese prim...
Neutral · Neutral · 5 replies
Cerebras is serving GLM-4.6 at 1000 tokens/s right now. They're probably likely to upgrade to this model. I really wonder if GLM 4.7 or models a few generations from now will be able to function effectively in simulated software dev org environments, especially that they self-co...
Mixed · Mixed · 2 replies
Appears to be cheap and effective, though under suspicion. But the personal and policy issues are about as daunting as the technology is promising. Some of the terms, possibly similar to many such services: - The use of Z.ai to develop, train, or enhance any algorithms, models, or ...
Negative · Anti · 3 replies
I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refuses to answer with an error message: "I'm very sorry, I currently can't provide the specific information you need. If you have any other questions or..." (translated from Chinese), followed by a leaked reasoning trace: Analyze the User's Input: Question: "is it ok for l...
Very positive · Pro · 2 replies
I have been using 4.6 on Cerebras (or Groq with other models) since it dropped and it is a glimpse of the future. If AGI never happens but we manage to optimise things so I can run that on my handheld/tablet/laptop device, I am beyond happy. And I guess that might happen. Mayb...
Official · 2025-12-26 · MiniMax-M2.1 (MiniMax) · Score 228 · Comments 81

Mixed · Mixed · 1 reply
I've played with this a bit and it's ok. I'd place it somewhere around Sonnet 4.5 level, probably below. But with this aggressive pricing you can just run 3 copies to do the same thing, choose the one that succeeded and still come out way ahead with the cost. Not as great as f...
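The "run several copies and keep the winner" argument in the comment above is simple arithmetic. A sketch with entirely hypothetical per-task costs (the source does not quote exact prices):

```python
# Hypothetical per-task costs: a model at roughly a fifth of the
# frontier model's price can be sampled three times in parallel and
# still cost less per solved task, as the commenter suggests.
cheap_per_task = 0.02     # hypothetical cheap-model cost per attempt
frontier_per_task = 0.10  # hypothetical frontier-model cost per attempt
n_samples = 3
print(n_samples * cheap_per_task < frontier_per_task)  # True
```

The catch, of course, is that this only helps on tasks where success is cheap to verify, so the best of the three attempts can actually be picked.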
Negative · Anti · 2 replies
Would it kill them to use the words "AI coding agent" somewhere prominent? "MiniMax M2.1: Significantly Enhanced Multi-Language Programming, Built for Real-World Complex Tasks" could be an IDE, a UI framework, a performance library, or, or...
Neutral · Neutral · 0 replies
The weights got released on Hugging Face now: https://huggingface.co/MiniMaxAI/MiniMax-M2.1
Neutral · Neutral · 1 reply
I think people should stop comparing to Sonnet, but to Opus instead, since it's so far ahead on producing code I would actually want to use (Gemini 3 Pro tends to be lacking in generalization and wants things to be using its own style rather than adapting). Whatever benchmark o...
Negative · Anti · 2 replies
> MiniMax has been continuously transforming itself in a more AI-native way. The core driving forces of this process are models, Agent scaffolding, and organization. Throughout the exploration process, we have gained increasingly deeper understanding of these three aspects....
Repo · 2026-01-19 · GLM-4.7 (GLM) · Score 378 · Comments 135

Positive · Pro · 3 replies
Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better but it doesn't reall...
Positive · Pro · 2 replies
I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designe...
Positive · Pro · 8 replies
Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to ...
Positive · Pro · 2 replies
> SWE-bench Verified 59.2
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
Neutral · Neutral · 1 reply
What’s the significance of this for someone out of the loop?
Official · 2026-01-22 · Qwen3-TTS (Qwen) · Score 744 · Comments 225

Neutral · Neutral · 9 replies
If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then pas...
Neutral · Neutral · 4 replies
I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423 Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py You can try it with uv (downloads a 4.5GB model on first run) like ...
Mixed · Mixed · 2 replies
Interesting model, I've managed to get the 0.6B param model running on my old 1080 and I can generate 200 character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically i...
Very positive · Pro · 3 replies
It isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and I thought the stuff from two years ago was about the best we were going to get. I don't know the size of these, I scrolled to the samples. I am going to get the mode...
Neutral · Neutral · 9 replies
Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5. Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they are in terms of politics.
Official · 2026-01-26 · Qwen3 Max (Qwen) · Score 502 · Comments 424

Mixed · Mixed · 5 replies
One thing I’m becoming curious about with these models is the token counts to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense, they’re techniques for using the...
Neutral · Neutral · 3 replies
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is. Is it possible that the Chinese internet has better quality content available? My problem with deep research tends to be that what it does is it searche...
Neutral · Neutral · 4 replies
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/...
Neutral · Neutral · 2 replies
Hacker News strongly believes Opus 4.5 is the de facto standard and China was consistently 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks.
Positive · Pro · 0 replies
The most important benchmark: https://boutell.dev/misc/qwen3-max-pelican.svg I used Simon Willison's usual prompt. It thought for over 2 minutes (free account). The commentary was even more glowing than the image. It has a certain charm.
Official · 2026-01-27 · Kimi K2.5 (Kimi) · Score 502 · Comments 239

Neutral · Neutral · 4 replies
Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5
1T parameters, 32B active parameters. License: MIT with the following modification: Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or s...
Very positive · Pro · 5 replies
The "Deepseek moment" was just one year ago today! Coincidence or not, let's just marvel for a second over this amount of magic/technology that's being given away for free... and how liberating and different this is from OpenAI and others that were closed to "protect us all".
Positive · Pro · 4 replies
> For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls.
> K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduc...
Neutral · Neutral · 0 replies
I posted this elsewhere but thought I'd repost here:
* https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO
* https://dashboard.safe.ai/ — CAIS' incredible dashboard
* https://clocks.brianmoore.com/ — a visual comparison of how well models ...
Positive · Pro · 3 replies
One thing that caught my eye is that besides the K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), evolved from Kimi CLI. It is a terminal coding agent; I've been using it for the last month with a Kimi subscription, and it is a capable agent with a stable harness. GitHub: http...
Repo · 2026-01-30 · Kimi K2.5 (Kimi) · Score 388 · Comments 141

Positive · Pro · 4 replies
I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good a...
Negative · Anti · 5 replies
Seems that K2.5 has lost a lot of the personality from K2 unfortunately; it talks in a more ChatGPT/Gemini/C-3PO style now. It's not explicitly bad, I'm sure most people won't care, but it was something that made it unique so it's a shame to see it go. Examples to illustrate: https://ww...
Negative · Anti · 0 replies
I tried this today. It's good - but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh, the speed via...
Positive · Pro · 0 replies
I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.
Mixed · Mixed · 2 replies
It is amazing, but "open source model" means "model I can understand and modify" (= all the training data and processes). Open weights is the equivalent of binary driver blobs everyone hates. "Here is an opaque thing, you have to put it on your computer and trust it, and you can...
Official · 2026-02-03 · Qwen3 Coder Flash (Qwen) · Score 735 · Comments 429

Positive · Pro · 14 replies
This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher end laptops. I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough ...
Neutral · Neutral · 7 replies
For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next
Neutral · Neutral · 2 replies
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this: brew upgrade llama.cpp (or brew install if you don't have it yet), then: llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL --fit on --seed 3407 --temp 1.0 --top-p ...
Positive · Mixed · 3 replies
It’s hard to overstate just how wild this model might be if it performs as claimed. The claims are this can perform close to Sonnet 4.5 for assisted coding (SWE-bench) while using only 3B active parameters. This is obscenely small for the claimed performance.
Positive · Pro · 0 replies
I got the Qwen3 Coder 30B running locally on my Mac M4 Max 36GB. It was slow, but it worked and did do some decent stuff: https://www.youtube.com/watch?v=7mAPaRbsjTU (video is sped up). I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: ht...
Official · 2026-02-05 · Claude Opus 4.6 (Claude) · Score 2,346 · Comments 1,031

Very positive · Pro · 37 replies
Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books. All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 ...
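The context-window arithmetic in that comment checks out as a back-of-envelope calculation, assuming the ~1.75M-token total is split roughly evenly across the 7 books (real book lengths vary, so this is an approximation):

```python
total_tokens = 1_750_000     # ~1.75M tokens for all 7 books, per the comment
per_book = total_tokens / 7  # even-split assumption: ~250k tokens per book
books_in_1m_window = int(1_000_000 // per_book)
print(int(per_book), books_in_1m_window)  # 250000 4
```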
Positive · Anti · 7 replies
5.3 Codex (https://openai.com/index/introducing-gpt-5-3-codex/) crushes with a 77.3% in Terminal-Bench. The shortest-lived lead ever, gone in less than 35 minutes. What a time to be alive!
Mixed · Mixed · 12 replies
I'm still not sure I understand Anthropic's general strategy right now. They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding. Meanwhile, Claude's general use cases are... fine. For generic r...
Neutral · Neutral · 1 reply
Claude Code release notes:
> Version 2.1.32: • Claude Opus 4.6 is now available! • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) • Claude now automatically records ...
Mixed · Mixed · 23 replies
The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
Official · 2026-02-05 · Claude Opus 4.6 (Claude) · Score 154 · Comments 50

Mixed · Mixed · 4 replies
Lately my company has been doing a lot of complex accounting and reporting in spreadsheets. Overall I was surprised by how well both GPT and Claude handled some of these extremely tedious tasks. Not uncommon to have an hours-long task compressed to minutes. My anecdotal experienc...
Neutral · Neutral · 0 replies
Based on the article... is this basically just making Claude better at formatting and data presentation, or does it also get better at analysis? I get the impression it's the former.
Mixed · Mixed · 0 replies
The benchmarks look good. Slide decks and spreadsheets look better. The people must use Claude Cowork and have their Claude Code moment and figure out the consequences. It will be really interesting to see articles like this (https://mitchellh.com/writing/my-ai-adoption-journe...
Mixed · Mixed · 1 reply
And then you hand it to your boss, who takes a 20 second look at it and asks why you made a projection that assumes massive revenue growth and 3 years of perfectly flat utilities, insurance, G&A - no inflation etc. It does look really promising as a skeleton starting point th...
Negative · Neutral · 0 replies
> The side-by-side outputs below show how output quality has improved from Claude Opus 4.5 to Opus 4.6.
Disclaimer: I use AI to code (and I code for finance) and I love Anthropic. But: for f-ck's sake, I cannot click on the picture and have it show up in full. It stays at its...
Official · 2026-02-05 · GPT-5.3 Codex (GPT) · Score 1,530 · Comments 605

Neutral · Pro · 23 replies
What's interesting to me is that gpt-5.3 and opus-4.6 are diverging philosophically, and really in the same way that actual engineers and orgs have diverged philosophically. With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the...
Positive · Pro · 6 replies
I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex! The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 for GPT-5.2-codex. GPT-5.3-codex scores 77.3.
Mixed · Mixed · 6 replies
> GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks...
Positive · Pro · 2 replies
Something that caught my eye from the announcement:
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was cr...
Neutral · Neutral · 8 replies
I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
Official · 2026-02-05 · Claude Opus 4.6 (Claude) · Score 212 · Comments 75

Negative · Anti · 8 replies
Considering all the problems they've been having with over-charging Claude Code users over the past few weeks, it's the very least they could do. Max subscribers are hitting their 5 hour usage limits in 30-40 minutes with a single instance doing light work, while Anthropic have...
Negative · Anti · 1 reply
Is this one of those "hey, turn off the overcharge protection because you can go $50 into debt for free, and then maybe you'll just keep going and not notice you owe us an extra $500" type of situations?
Neutral · Neutral · 1 reply
Meanwhile Codex is running a free tier for a month right now for anyone who wants to experiment with it: https://openai.com/index/introducing-the-codex-app/ Doesn't appear to include the new model though, only the state-of-yesterday's-art (literally yesterday's).
Very negative · Anti · 2 replies
I'll pass on this $50, but please hire real humans and fix your crappy app, Claude! This bug has been around for years: in Claude (web or app), if you create a new chat in the middle of an existing chat's thinking or tool calling, the existing chat will be broken, either losing data or bec...
Neutral · Neutral · 0 replies
You can use this if you started your Pro or Max subscription before Wednesday, February 4, 2026 at 11:59 PM PT. Go to https://claude.ai/settings/usage, turn on extra usage and enable the promo from the notification afterwards. I received €42; a top-up was not required and auto-rel...
Official · 2026-02-10 · Qwen-Image-2.0 (Qwen) · Score 422 · Comments 192

Neutral · Neutral · 10 replies
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, when the ren...
Positive · Pro · 2 replies
Couple of thoughts:
1. I’d wager that given their previous release history, this will be open‑weight within 3-4 weeks.
2. It looks like they’re following suit with other models like Z-Image Turbo (6B parameters) and Flux.2 Klein (9B parameters), aiming to release models that can...
Positive · Pro · 3 replies
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation.
Neutral · Neutral · 1 reply
The "horse riding man" prompt is wild: """A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky. Mid-ground, eye-level composition: A muscular, robust adult br...
Neutral · Neutral · 11 replies
I recently tried out LM Studio on Linux for local models. So easy to use! What Linux tools are you all using for image generation models like Qwen's diffusion models, since LM Studio only supports text gen?
Official · 2026-02-11 · GLM-5 (GLM) · Score 484 · Comments 520

Mixed · Mixed · 9 replies
Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07... Solid bird, not a great bicycle frame.
Mixed · Pro · 8 replies
Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future. I've previously doubted that the N-1 or N-2 open weight models will ever be attractive to end users, especially power users. But it now seems that user preferences will be ye...
Positive · Pro · 2 replies
Bought some API credits and ran it through opencode (model was "GLM 5"). Pretty impressed, it did good work. Good reasoning skills and tool use. Even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (dynamic typed OO sc...
Mixed · Mixed · 1 reply
Let's not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2]. I've got subs for both, and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support. [1] https:/...
Positive · Pro · 5 replies
GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clea...
Official · 2026-02-12 · MiniMax-M2.5 (MiniMax) · Score 207 · Comments 55

Negative · Anti · 3 replies
I hope better and cheaper models will be widely available, because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one really cannot call it smart. The more critical issue is that MiniMax 2 and 2.1 have the st...
Negative · Anti · 2 replies
Pelican is recognizable but not great, and the bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...
Very positive · Pro · 2 replies
Really looked forward to this release, as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kim...
Very negative · Anti · 1 reply
Hm. The benchmarks look too good to be true, and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective prop...
Negative · Anti · 0 replies
M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking
Official · 2026-02-12 · Gemini 3 Pro Preview (Gemini) · Score 1,081 · Comments 693

Positive · Pro · 15 replies
Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6). Wow. https://blog.google/innovation-and-ai/models-and-research/ge...
Negative · Neutral · 11 replies
Is it me or is the rate of model releases accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had K...
Positive · Pro · 3 replies
I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a f...
Very positive · Pro · 17 replies
Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.
Neutral · Neutral · 3 replies
Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_... The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved". > Submi...
Official · 2026-02-12 · GPT-5.3 Codex Spark (GPT) · Score 890 · Comments 387

Positive · Pro · 13 replies
Wow, I wish we could post pictures to HN. That chip is HUGE!!!! The WSE-3 is the largest AI chip ever built, measuring 46,255 mm² and containing 4 trillion transistors. It delivers 125 petaflops of AI compute through 900,000 AI-optimized cores — 19× more transistors and 28× mor...
Positive · Pro · 8 replies
I love this! I use coding agents to generate web-based slide decks where “master slides” are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it’s straightforward to generate a clean, predefined presentation. What I’d r...
Mixed · Mixed · 7 replies
First thoughts using gpt-5.3-codex-spark in Codex CLI: blazing fast, but it definitely has a small-model feel. It's tearing up bluey bench (my personal agent speed benchmark), which is a file system benchmark where I have the agent generate transcripts for untitled episodes of a ...
Very positive · Pro · 10 replies
Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing.
Neutral · Neutral · 2 replies
My stupid pelican benchmark proves to be genuinely quite useful here: you get a visual representation of the quality difference between GPT-5.3-Codex-Spark and full GPT-5.3-Codex: https://simonwillison.net/2026/Feb/12/codex-spark/
Official · 2026-02-13 · GPT-5.2 (GPT) · Score 574 · Comments 401

Neutral · Mixed · 26 replies
The headline may make it seem like AI just discovered some new result in physics all on its own, but reading the post, humans started off trying to solve some problem, it got complex, GPT simplified it and found a solution with the simpler representation. It took 12 hours for ...
Positive · Pro · 15 replies
It's interesting to me that whenever a new breakthrough in AI use comes up, there's always a flood of people who come in to handwave away why this isn't actually a win for LLMs. Like with the novel solutions GPT 5.2 has been able to find for Erdős problems - many users here (e...
Positive · Pro · 2 replies
"An internal scaffolded version of GPT‑5.2 then spent roughly 12 hours reasoning through the problem, coming up with the same formula and producing a formal proof of its validity." When I use GPT 5.2 Thinking Extended, it gave me the impression that it's consistent enough/has a...
Positive · Mixed · 9 replies
AI can be an amazing productivity multiplier for people who know what they're doing. This result reminded me of the C compiler case that Anthropic posted recently. Sure, agents wrote the code for hours, but there was a human there giving them directions, scoping the problem, fin...
Mixed · Mixed · 1 reply
It would be more accurate to say that humans using GPT-5.2 derived a new result in theoretical physics (or, if you're being generous, humans and GPT-5.2 together derived a new result). The title makes it sound like GPT-5.2 produced a complete or near-complete paper on its own,...
Official · 2026-02-16 · Qwen3.5 397B-A17B (Qwen) · Score 434 · Comments 214

Neutral · Neutral · 1 reply
For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5
Negative · Anti · 9 replies
You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.
Positive · Pro · 0 replies
"The post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive." I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training obje...
Negative · Anti · 8 replies
Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...
Positive · Pro · 3 replies
Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.
Official · 2026-02-17 · Claude Sonnet 4.6 (Claude) · Score 1,346 · Comments 1,226

Very negative · Anti · 15 replies
I see a big focus on computer use - you can tell they think there is a lot of value there, and in truth it may be as big as coding if they convincingly pull it off. However, I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their o...
Negative · Neutral · 4 replies
They use the word "Sonnet" 60+ times on that page but never give the casual reader any context of what a "Sonnet model" actually is. Neither does their landing page. You have to scroll all the way to the footer to find a link under the "Models" section. You click it and you fi...
Mixed · Mixed · 9 replies
I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems, which spans ~16 years. It is a far cry from Opus 4.6. Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 Pro. Didn't hallucinate anything and produced honestly mind-blowing ana...
Neutral · Neutral · 16 replies
The demise of SaaS has been overplayed imho. When companies buy software they are essentially buying something that solves a problem and the insurance that comes with that. Part of that means they get to pick up the phone and complain if something doesn't work and someone on t...
Positive · Neutral · 9 replies
I always grew up hearing “competition is good for the consumer.” But I never really internalized how good fierce battles for market share are. The amount of competition in a space is directly proportional to how good the results are for consumers.
Official · 2026-02-19 · Gemini 3.1 Pro Preview (Gemini) · Score 963 · Comments 914

Negative · Anti · 31 replies
I hope this works better than 3.0 Pro. I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development. It's stunningly good at reasoning, design, and generatin...
Positive · Pro · 22 replies
People underrate Google's cost effectiveness so much. Half the price of Opus. HALF. Think about ANY other product and what you'd expect from competition that's half the price. Yet people here act like Gemini is dead weight. Update: 3.1 was 40% of the cost to run the AA index vs Opu...
Mixed · Mixed · 2 replies
If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts). So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too. While it gives me hope, I am going to pl...
Neutral · Neutral · 9 replies
Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3 Compare to Op...
Mixed · Mixed · 8 replies
These models are so powerful. It's totally possible to build entire software products in a fraction of the time it took before. But, reading the comments here, the behaviors from one point version to another (not a major version, mind you) seem very divergent. It feels lik...
Qwen3.5 122B (Qwen) · 3rd party · 2026-02-28 · score 461 · 272 comments

Mixed / Anti · 18 replies
If you're new to this: all of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, then they always disappoint in actual use. I've been playing with Qwen3-Coder-...

Negative / Anti · 19 replies
I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge. So far Opus 4.6 and Gemini Pro are ve...

Neutral / Neutral · 4 replies
I recently wrote a guide on getting:
- llama.cpp
- OpenCode
- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)
working on an M1 MacBook Pro (e.g. using brew). It was a bit finicky to get all of the pieces together, so hopefully this can be used with these newer models....

Positive / Pro · 4 replies
I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous. I then discovered what quantization is by reading a blog post about binary quantizat...

Negative / Anti · 3 replies
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable. Up until relatively recently, while people had already long been making these claim...
Gemini 3.1 Flash Lite Preview (Gemini) · Official · 2026-03-03 · score 60 · 30 comments

Neutral / Mixed · 2 replies
Lots of comments about the price change, but Artificial Analysis reports that 3.1 Flash-Lite (reasoning) used fewer than half of the tokens of 2.5 Flash-Lite (reasoning). This will likely bring the cost below 2.5 Flash-Lite for many tasks (depends on the ratio of input to output...

Negative / Anti · 0 replies
Unfortunate, significant price increase for a 'lite' model: $0.25 IN / $1.50 OUT vs. Gemini 2.5 Flash-Lite $0.10 IN / $0.40 OUT.

Negative / Anti · 4 replies
For the last 2 years, startup wisdom has been that models will continue to get cheaper and better. Claude first, and now Gemini, has shown that it's not the case. We priced an enterprise contract using Flash 1.5 pricing last summer, and today that contract would be unit economic...

Neutral / Anti · 0 replies
That's a 150% increase in input costs and a 275% increase in output costs over the same sized previous generation (2.5-flash-lite) model.

Positive / Pro · 2 replies
You can test Gemini 3.1 Lite transcription capabilities in https://ottex.ai — the only dictation app supporting Gemini models with native audio input. We benchmarked it for real-life voice-to-text use cases: <10s 10-30s 30s-1m 1-2m 2-3m Flash 2548 2732 3177 4583 5961 Flash L...
GPT-5.3 Chat (GPT) · Official · 2026-03-03 · score 395 · 301 comments

Very negative / Anti · 22 replies
The single biggest issue for me with ChatGPT right now is how absolutely awful it sounds in every answer. "Why it matters", "the big picture", "it's not just you", the awful emphasis, the quotations with rhetorical questions, etc. I don't know if it's intentional so you can ea...

Negative / Anti · 7 replies
I'm a bit confused by this branding (never even noticed that there was a 5.2-Instant). It's not a super fast 1000tok/s Cerebras-based model which they have for codex-spark, it's just 5.2 w/out the router / "non-thinking" mode? I feel like OpenAI is going to get right back to wh...

Negative / Anti · 11 replies
Since the page mentions:
> Better judgment around refusals
Has any AI company ever addressed any instance of a model having different rules for different population groups? I've seen many examples of people asking questions like, "make up a joke about <group>" and then ...

Negative / Anti · 5 replies
I kind of chuckled when I read the headline "GPT‑5.3 Instant: Smoother, more ..." LLM companies are starting to sound like cigarette advertisements.

Negative / Anti · 6 replies
Is nobody else unsettled by the example? Strange timing to talk about calculating trajectories on long range projectiles?
Qwen Turbo (Qwen) · 3rd party · 2026-03-04 · score 783 · 360 comments

Positive / Pro · 8 replies
I really hope this doesn't hinder development too much. As Simon says, Qwen3.5 is very impressive. I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've h...

Negative / Anti · 2 replies
There has been tension between Qwen's research team and Alibaba's product team over the Qwen App. And recently, Alibaba tried to impose DAU as a KPI. It's understandable that a company like Alibaba would force a change of product strategy for any number of reasons. What puzzle...

Positive / Pro · 1 reply
One thing I’ve noticed with local models is that people tolerate a lot more trial and error behavior. When a hosted model wastes tokens it feels expensive, but when a local model loops a bit it just feels like it’s “thinking.” If models like Qwen can get good enough for coding ...

Neutral / Neutral · 9 replies
I wonder how a US lab hasn't dumped truckloads of cash into various laps to ensure these researchers have a place at their lab.

Neutral / Neutral · 5 replies
How do those companies make money? Qwen, GLM, Kimi, etc. were all released for free. I have no experience in the field, but from reading HN alone my impression was that training is exceptionally costly and inference can barely be made profitable. How/why do they fund ongoing development ...
GPT-5.4 (GPT) · Official · 2026-03-05 · score 1,019 · 807 comments

Mixed / Mixed · 13 replies
The marquee feature is obviously the 1M context window, compared to the ~200k other models support, with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k (https://openai.com/api/pricing/). Also per...

Negative / Anti · 11 replies
I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a team mate: "Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m...

Positive / Pro · 7 replies
I've only used 5.4 for 1 prompt (edit: 3 @ high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Cod...

Negative / Anti · 19 replies
I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post f...

Negative / Mixed · 10 replies
So let me get this straight: OpenAI previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5, which was more like a router that put all these models under the hood so you only had to prompt GPT-5, and it w...
GPT-5.4 Pro (GPT) · 3rd party · 2026-03-05 · score 93 · 2 comments

Neutral / Neutral · 0 replies
Comments moved to https://news.ycombinator.com/item?id=47265045.

Neutral / Neutral · 0 replies
More discussion: https://news.ycombinator.com/item?id=47265005
GPT-5.4 mini (GPT) · Official · 2026-03-17 · score 248 · 145 comments

Positive / Pro · 7 replies
I checked the current speed over the API, and so far I'm very impressed. Of course models are usually not as loaded on release day, but right now:
- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).
- GPT-...

Positive / Pro · 6 replies
To me, mini releases matter much more and better reflect the real progress than SOTA models. The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them. Meanwhile, when a smaller / less powerful model releases a new ...

Mixed / Mixed · 7 replies
I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them. They're incredibly slow (via official API or OpenRouter), but most of all they seem not to understand the instructions that I...

Neutral / Neutral · 4 replies
Here's a grid of pelicans for the different models and reasoning levels: https://static.simonwillison.net/static/2026/gpt-5.4-pelican...

Neutral / Anti · 2 replies
According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper?
GPT 5 mini: Input $0.25 / Output $2.00
GPT 5 nano: Input $0.05 / Output $0.40
GPT 5.4 mini: Input $0.75 / Output $4.50
G...
Qwen3.6 Plus (Qwen) · Official · 2026-04-02 · score 594 · 213 comments

Negative / Anti · 11 replies
This is their hosted-only model, not an open weight model like they’ve become known for. They got a lot of good publicity for their open weight model releases, which was the goal. The hard part is pivoting from an open weight provider to being considered a competitor to Cla...

Positive / Mixed · 2 replies
I understand people's reactions to the Qwen team comparing against Opus 4.5 instead of 4.6, and them comparing against Gemini Pro 3.0 instead of 3.1. But calling it misleading is a bit of a stretch in my eyes; people here are acting like we immediately forgot how previous generations...

Positive / Pro · 2 replies
Pretty solid Pelican: https://gist.github.com/simonw/ca081b679734bc0e5997a43d29fad... I used the https://modelstudio.alibabacloud.com/ API to generate that one, which required signing up for an account and attaching PayPal billing - but it looks like OpenRouter are offering it ...

Negative / Anti · 5 replies
Worth noting that this model, unlike almost all Qwen models, is not open-weight, nor is the parameter count exposed. Also odd that it is compared against Opus 4.5 even though 4.6 was released like 2 months ago.

Positive / Pro · 2 replies
I'll diverge from some of these comments: I don't find it misleading to compare to Opus 4.5. I can remember how good Opus 4.5 was. If I'm considering using this, it's most informative to me to compare to the model it's closest to that I have familiarity with. I'm obviously not s...
Gemma 4 26B (Gemma) · Official · 2026-04-02 · score 1,811 · 474 comments

Positive / Pro · 21 replies
Thinking / reasoning + multimodal + tool calling. We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well! Guide for those interested: https://unsloth.ai/docs/models/gemma-4 Also note to use temperature = 1.0, top_p ...

Mixed / Mixed · 9 replies
I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop: https://simonwillison.net/2026/Apr/2/gemma-4/ The gemma-4-31b model is compl...

Neutral / Neutral · 3 replies
Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:
| Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t |
|----------------|-------|-------|-------|------|-------|-------|-------|-------|
| G4 31B | 85.2% | ...

Negative / Mixed · 6 replies
Prompt:
> what is the Unix timestamp for this: 2026-04-01T16:00:00Z
Qwen 3.5-27b-dwq:
> Thought for 8 minutes 34 seconds. 7074 tokens.
> The Unix timestamp for 2026-04-01T16:00:00Z is:
> 1775059200 (my comment: Wednesday, 1 April 2026 at 16:00:00)
Gemma-4-26b-a4b:
> Thou...

Neutral / Neutral · 23 replies
Hi all! I work on the Gemma team - one of many, as this one was a bigger effort given it was a mainline release. Happy to answer whatever questions I can.
GLM-5.1 (GLM) · Official · 2026-04-07 · score 617 · 262 comments

Positive / Pro · 3 replies
We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com: Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models. But keeping in mind t...

Positive / Pro · 3 replies
Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/

Neutral / Anti · 1 reply
Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run, even with high end hardware.
[0] https://huggingface.co/unsloth/GLM-5...

Mixed / Mixed · 7 replies
To be honest I am a bit sad, as GLM 5.1 is producing much better TypeScript than Opus or Codex imo, but no matter what, it does sometimes go into schizo mode at some point over longer contexts. Not always though - I have had multiple sessions go over 200k and be fine.

Positive / Pro · 15 replies
Every single day, three things are becoming more and more clear: (1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat. (2) Local/private inference is the future of AI. (3) There's *still* no killer product yet (so get to work!)
MiniMax-M2.7 (MiniMax) · 3rd party · 2026-04-12 · score 80 · 34 comments

Negative / Anti · 4 replies
Absolutely not "open source" - here's the license: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICE...
> Non-commercial use permitted based on MIT-style terms; commercial use requires prior written authorization.
And calling the non-commercial usage "MIT-style ter...

Positive / Pro · 1 reply
GGUFs are out too, well done Unsloth as usual! https://huggingface.co/unsloth/MiniMax-M2.7-GGUF I've been using M2.7 through the Alibaba coding plan for a bit now, and am quite impressed with its coding ability, and even more impressed when I see how small it is. Fascinating re...

Negative / Anti · 2 replies
What's people's experience of using MiniMax for coding? I had a really bad time with it. I use (real) Claude Code for work so I know what a good model feels like. MiniMax's token plan is nice but the quality is really far from Claude models. I needed to constantly "remind" it to...

Mixed / Mixed · 1 reply
"Helped build itself" is a bit of a stretch here; it makes it sound as if the model was doing lasting self-improvements. What the article describes is that the model was able to tweak its own deployment harness (memory, skills, experimental loop, etc.) to improve performance o...

Mixed / Mixed · 1 reply
In addition to this conversation already having been started at https://news.ycombinator.com/item?id=47735348 yesterday, MiniMax M2.7 is not open source. The open weights have been released, which is definitely good and follows some of the spirit of open source, but isn't the ...