
Best LLMs for Writing in 2026

Aggregated benchmark data across EQ-Bench Creative Writing, LMArena Text, and Artificial Analysis — covering 25 models, updated weekly.

Last updated: Mar 8, 2026  ·  25 models tracked  ·  3 tiers: Premium · Mid-Range · Budget

How we rank models for writing

This ranking combines three independent data sources (EQ-Bench, LMArena, and Artificial Analysis) into four signals. No single benchmark captures writing quality across frontier LLMs on its own, so we aggregate:

  • EQ Creative (EQ-Bench Creative Writing): specialist benchmark using trained raters to assess narrative quality, emotional depth, prose style, and character voice. Elo scale ~1400–1940. The most relevant signal for marketing copy, long-form content, and creative work.
  • Arena Text (LMArena Text): crowd-sourced human preference leaderboard. Broad signal across all text tasks: a model that consistently wins votes is generally pleasant, clear, and useful to read. Elo scale ~1460–1510.
  • EQ General (EQ-Bench General): measures emotional intelligence in roleplay scenarios. A proxy for character voice quality and tonal control, useful for brand voice work. Note: high EQ General does not automatically mean strong creative writing; interpret alongside EQ Creative.
  • Speed (Artificial Analysis): median output tokens per second across providers. Matters for iterative draft workflows where waiting costs time. ~75 tokens ≈ 55 words.

Prices are per 1M tokens (input / output) and reflect standard API pricing. A dash (—) means the model has not yet appeared on that leaderboard — never an estimated or interpolated value.
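As a rough illustration of how per-1M-token pricing translates into the cost of a single request, the sketch below uses the Claude Sonnet 4.6 prices from the table ($3.00 input / $15.00 output). The function name and the token counts are hypothetical, chosen only for the example.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical request: a 1,500-token brief that produces a 3,000-token draft.
cost = request_cost(1_500, 3_000, input_price_per_m=3.00, output_price_per_m=15.00)
print(f"${cost:.4f}")  # prints $0.0495
```

Output tokens dominate the bill for writing tasks, since drafts are usually longer than the prompts that produce them.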

EVY uses this data automatically

Instead of picking one model and hoping it fits every task, EVY routes each writing request — brand copy, long-form content, quick social posts — to the model best suited for that specific job. You get top-tier output without managing a single API key.


The best AI models for copywriting

Updated weekly  ·  Mar 8, 2026
| Model | Tier | Provider | Source | EQ Creative | Arena Text | EQ General | Speed | Input $ / 1M | Output $ / 1M |
|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Premium | Anthropic | 🧠 EQ-Bench | 1,936.2 | — | 1,890.9 | 50 t/s | $3.00 | $15.00 |
| Claude Opus 4.6 | Premium | Anthropic | 📋 Consensus | 1,931.7 | 1,504 | 1,874.8 | 45 t/s | $5.00 | $25.00 |
| *gpt-5.3-chat | Mid-Range | OpenAI | 🧠 EQ-Bench | 1,816.8 | — | 1,402.6 | — | — | — |
| claude-sonnet-4.5 | Premium | Anthropic | 🧠 EQ-Bench | 1,744.5 | — | 1,529.4 | — | $3.00 | $15.00 |
| claude-opus-4-5-20251101 | Premium | Anthropic | 🧠 EQ-Bench | 1,733.7 | — | 1,619.6 | — | $5.00 | $25.00 |
| O3 | Mid-Range | OpenAI | 🧠 EQ-Bench | 1,730.9 | — | 1,500 | — | $2.00 | $8.00 |
| Kimi K2 | Mid-Range | Moonshot AI | 🧠 EQ-Bench | 1,661.4 | — | 1,601.8 | 44 t/s | $0.55 | $2.20 |
| openrouter/horizon-alpha | Mid-Range | Unknown | 🧠 EQ-Bench | 1,633.6 | — | 1,545.2 | — | — | — |
| GLM-5 | Budget | Zhipu AI | 🧠 EQ-Bench | 1,626.1 | — | 1,650.2 | 80 t/s | $0.80 | $2.50 |
| claude-opus-4 | Premium | Anthropic | 🧠 EQ-Bench | 1,619.7 | — | 1,421.9 | — | $15.00 | $75.00 |
| GPT-5.2 | Premium | OpenAI | 📋 Consensus | 1,593.6 | 1,480 | 1,607.4 | — | $1.25 | $10.00 |
| Kimi K2.5 | Mid-Range | Moonshot AI | 🧠 EQ-Bench | 1,542.3 | — | 1,563.1 | — | — | — |
| *moonshotai/Kimi-K2.5 | Mid-Range | Moonshot AI | 🧠 EQ-Bench | 1,542.3 | — | 1,563.1 | — | — | — |
| DeepSeek V3.2 | Budget | DeepSeek | 🧠 EQ-Bench | 1,495.8 | — | — | — | $0.28 | $0.42 |
| Gemini 3 Pro | Mid-Range | Google | 📋 Consensus | 1,474.5 | 1,486 | 1,559.2 | 80 t/s | $2.00 | $12.00 |
| Qwen3-235B | Budget | Alibaba | 🧠 EQ-Bench | 1,459 | — | 1,233.8 | — | $0.18 | $0.54 |
| Gemini 3.1 Pro | Mid-Range | Google | 🧠 EQ-Bench | 1,447.9 | — | 1,546 | — | $2.11 | $12.66 |
| Mistral Medium 3 | Budget | Mistral AI | 🧠 EQ-Bench | 1,445.3 | — | — | — | $0.40 | $2.00 |
| GPT-4o | Premium | OpenAI | 🧠 EQ-Bench | 1,443 | — | 1,393 | 185 t/s | $2.50 | $10.00 |
| GLM-4.7 | Budget | Zhipu AI | 🧠 EQ-Bench | 1,363.4 | — | 1,442.4 | — | $0.38 | $1.70 |
| MiniMax M2.5 | Budget | MiniMax | 🧠 EQ-Bench | 1,295.2 | — | — | 395 t/s | $0.30 | $1.20 |
| GPT-5.4 | Premium | OpenAI | ⚠️ No Data | — | — | — | — | — | — |
| Gemini 3 Flash | Mid-Range | Google | 🏟️ Arena | — | 1,473 | — | 250 t/s | $0.50 | $3.00 |
| Gemini 3.1 Flash-Lite | Budget | Google | ⚠️ No Data | — | — | — | — | $0.25 | $1.50 |
| Grok 4.1 | Mid-Range | xAI | 🏟️ Arena | — | 1,473 | — | 163 t/s | $0.20 | $0.50 |


EQ Creative & EQ General: EQ-Bench  ·  Arena Text: LMArena  ·  Speed: Artificial Analysis  ·  Prices per 1M tokens  ·  — = not yet on leaderboard

The right model depends on the task

Benchmark leaderboards rank models globally — but the best model for a 2,000-word thought leadership article is not necessarily the best model for a 15-word social media headline. Here's how the leading models split across common writing tasks:

Narrative & long-form

Thought leadership, case studies, email newsletters, ghostwriting. Requires emotional depth, tonal consistency, and the ability to sustain voice across thousands of words.

Best picks: Claude Sonnet 4.6 · Claude Opus 4.6

Structured commercial copy

Product descriptions, landing pages, ad copy, LinkedIn posts. Requires clarity, persuasion structure, and format adherence more than creative flair.

Best picks: GPT-5.2 · Claude Sonnet 4.6

High-volume / fast drafts

Social media scheduling, meta descriptions, bulk content variation. Speed and cost matter more than peak quality; fast iteration wins here.

Best picks: Gemini 3 Flash · Grok 4.1 · Kimi K2

Brand voice & consistency

Any content where staying on-brand is non-negotiable. Requires strong instruction-following, tonal control, and memory of brand guidelines.

Best picks: Claude Sonnet 4.6 · Gemini 3.1 Pro
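The task split above can be pictured as a simple lookup from task type to preferred models. The sketch below is illustrative only: the task keys and the `pick_model` helper are hypothetical names, and it does not describe how EVY actually routes requests.

```python
from typing import Optional

# Best picks per task type, copied from the four sections above.
# The dictionary keys are hypothetical labels invented for this example.
BEST_PICKS = {
    "narrative_long_form": ["Claude Sonnet 4.6", "Claude Opus 4.6"],
    "structured_commercial": ["GPT-5.2", "Claude Sonnet 4.6"],
    "high_volume_fast": ["Gemini 3 Flash", "Grok 4.1", "Kimi K2"],
    "brand_voice": ["Claude Sonnet 4.6", "Gemini 3.1 Pro"],
}


def pick_model(task: str, available: set) -> Optional[str]:
    """Return the first listed pick for `task` that is in `available`, else None."""
    for model in BEST_PICKS.get(task, []):
        if model in available:
            return model
    return None


# Only two of the three fast-draft picks are available in this example.
print(pick_model("high_volume_fast", {"Grok 4.1", "Kimi K2"}))  # prints Grok 4.1
```

Even this toy version needs a fallback order per task and a record of which providers are reachable, which is part of the overhead of routing by hand.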

Managing this complexity manually — four API keys, four pricing tiers, a decision tree for every task type — is exactly the overhead that kills creative momentum. EVY eliminates the routing problem entirely.

EVY picks the right model. Every time, for every task.

EVY is an AI co-creator that runs inside any app on your Mac. Press the EVY-key, speak your idea, and EVY routes it to the ideal model — then writes, edits, or transforms it into finished content in your brand voice.

Frequently asked questions

Which LLM is best for creative writing in 2026?

Claude Sonnet 4.6 (Anthropic) leads the EQ-Bench Creative Writing leaderboard with an Elo score of 1936 as of March 2026, followed closely by Claude Opus 4.6 at 1932. Both excel at narrative quality, emotional depth, and character voice — the core skills that separate great writing from generic AI output.

What is EQ-Bench and why does it matter for writing?

EQ-Bench is an independent benchmark that evaluates large language models on emotional intelligence and narrative quality, using a panel of human raters. Its Creative Writing sub-leaderboard specifically measures story quality, emotional resonance, and prose style — making it the most relevant benchmark for marketing copy, long-form content, and creative work. Scores are on an Elo scale where higher is better, typically ranging from ~1400 to ~1940.

What is LMArena Text and how is it different from EQ-Bench?

LMArena Text (formerly LMSYS Chatbot Arena) measures human preference through head-to-head votes: two anonymous models answer the same prompt, and users pick the better response. It's a broad preference signal across all text tasks, not just writing. EQ-Bench Creative Writing is narrower and more specialist — it specifically evaluates narrative and emotional writing quality with trained raters rather than crowd votes.
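To give a feel for what a gap in Elo points means, the sketch below applies the conventional Elo expected-score formula (base 10, scale 400). This is an assumption: each leaderboard may fit its ratings with a different variant, so the result is a rough guide and not a figure published by either benchmark. The two ratings are taken from the EQ Creative column of the table.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    Assumes the conventional logistic curve with base 10 and a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# EQ Creative ratings from the table above.
sonnet_46 = 1936.2
kimi_k2 = 1661.4

print(f"{expected_win_rate(sonnet_46, kimi_k2):.2f}")
```

Under this assumption, two models with equal ratings are each preferred half the time, and a gap of a few hundred points corresponds to a clear but not total preference for the higher-rated model.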

Which LLM is the best value for writing tasks?

Kimi K2 by Moonshot AI offers the best performance-per-dollar for writing: an EQ-Bench Creative score of 1661 and an EQ-General score of 1602 at just $0.55 input / $2.20 output per 1M tokens. That is roughly 5× cheaper than Claude Sonnet 4.6 on input (about 7× on output) with ~86% of its EQ Creative score. GLM-5 (Zhipu AI) is another strong value option at $0.80/$2.50 with scores of 1626 EQ Creative and 1650 EQ General.
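The value comparison can be reproduced from the table. The sketch below uses the table's figures for Kimi K2 (EQ Creative 1,661.4; $0.55 / $2.20) and Claude Sonnet 4.6 (EQ Creative 1,936.2; $3.00 / $15.00). The helper name is hypothetical, and a ratio of Elo scores is only a crude proxy for relative quality.

```python
def relative_value(score: float, input_price: float, output_price: float,
                   ref_score: float, ref_input: float, ref_output: float) -> dict:
    """Compare a model against a reference model on score and on price."""
    return {
        "score_ratio": score / ref_score,                 # share of the reference score
        "input_times_cheaper": ref_input / input_price,   # how many times cheaper on input
        "output_times_cheaper": ref_output / output_price,
    }


# Kimi K2 versus Claude Sonnet 4.6, using values from the table.
result = relative_value(1661.4, 0.55, 2.20, 1936.2, 3.00, 15.00)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the score ratio comes out near 0.86, and the price advantage is about 5.5× on input and about 6.8× on output.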

How often is this ranking updated?

Scores are updated weekly via an automated scraper that fetches the latest data from EQ-Bench and LMArena. Prices are reviewed manually and updated when providers announce changes. The 'Updated weekly' badge in the table header shows the date of the last successful update.

What does 'tokens per second' mean for writing?

Tokens per second (t/s) measures how fast a model outputs text; roughly, 75 tokens equals about 55 words. For writing workflows, speed matters when you need rapid iteration on drafts or real-time dictation-to-copy conversion. MiniMax M2.5 is the fastest tracked model at 395 t/s and, at $0.30 / $1.20 per 1M tokens, is also cheaper than the next fastest, Gemini 3 Flash (250 t/s, $0.50 / $3.00).
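The conversion above (about 75 tokens per 55 words) gives a quick way to estimate drafting time. The sketch below assumes a steady output rate and ignores time to first token and network latency; the function name is hypothetical, and the speeds are taken from the table.

```python
TOKENS_PER_WORD = 75 / 55  # ratio stated in this article


def seconds_to_draft(words: int, tokens_per_second: float) -> float:
    """Estimate generation time for a draft of `words` words at a steady output rate."""
    return words * TOKENS_PER_WORD / tokens_per_second


# Speeds (t/s) from the table above.
for name, tps in [("Claude Sonnet 4.6", 50), ("Gemini 3 Flash", 250), ("MiniMax M2.5", 395)]:
    print(f"{name}: {seconds_to_draft(2000, tps):.0f} s for 2,000 words")
```

At these rates a 2,000-word draft takes close to a minute on the slowest of the three models and well under fifteen seconds on the other two, which is the difference that matters when iterating on many drafts.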

Does the best LLM for writing change depending on the task?

Yes — significantly. Claude Sonnet 4.6 and Claude Opus 4.6 lead on narrative and emotional writing. GPT-5.2 performs better on structured commercial copy where format consistency matters. Faster models like Gemini 3 Flash or Grok 4.1 suit high-volume, lower-stakes content. EVY handles this complexity automatically: it routes each writing request to the most suitable model based on task type, length, and brand requirements.

Can I use multiple LLMs for writing without switching between tools?

Yes — EVY runs on all major LLMs and automatically selects the best model for each task. You speak or type your request once; EVY decides whether the task calls for Claude's narrative depth, GPT-5.2's structured output, or a faster budget model for quick drafts. No API keys, no model-switching — EVY handles the routing silently in the background.