If you don't understand RLVR, you're two paradigm shifts behind
You can't understand the current moment in AI if you don't understand RLVR (reinforcement learning with verifiable rewards). If your mental model is still "really smart autocomplete," you're two major paradigm shifts behind the current generation of models.
We entered the modern AI era thanks to the transformer architecture, which made it possible to train language models at massive scale. Labs discovered that bigger models with more data produced qualitatively different behavior. GPT-3 was the first clear demonstration of this: 175 billion parameters, trained on hundreds of billions of tokens. Latent capabilities nobody designed simply emerged.
But there's a fundamental tension baked into how these models work. Every output is a series of probabilistic next-token predictions, and the further you go without a corrective signal, the more those probabilities can drift. Scale made the models powerful, but it didn't make them reliable. They could generate impressive text, but they couldn't reliably follow instructions, hold a conversation, or avoid saying something unhinged. Powerful, but hard to actually work with.
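To see why drift compounds, here's a minimal sketch of the autoregressive loop every one of these models runs. `model.next_token_probs` is a hypothetical stand-in for a real forward pass; the point is that each sampled token becomes part of the conditioning for every token after it:

```python
import random

def generate(model, prompt_tokens, max_new_tokens=256):
    # `model.next_token_probs` is a hypothetical stand-in for an LM's
    # forward pass: it returns {token: probability} conditioned on the
    # whole sequence so far.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)
        next_token = random.choices(
            list(probs.keys()),
            weights=list(probs.values()),
        )[0]
        # One unlucky sample here shifts every distribution after it;
        # nothing in the loop ever corrects course.
        tokens.append(next_token)
    return tokens
```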
RLHF (reinforcement learning from human feedback) was the first major paradigm shift in how we trained these models. Instead of just training on next-token prediction, you add a human feedback loop: humans rate model outputs on quality, honesty, and tone, and the model is optimized toward what those raters prefer. This is what turned a text completion engine into something that felt conversational, and it's what enabled the ChatGPT moment. "Smart autocomplete" becomes "autocomplete with an etiquette coach." The autocomplete metaphor gets stretched but still somewhat holds.
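In practice, the human ratings usually train a separate reward model, which then scores the language model's outputs during reinforcement learning. Here's a minimal sketch of the standard pairwise (Bradley-Terry) loss for that reward model; the names are mine, and real implementations operate on batches of tensors rather than single floats:

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    # A human labeler preferred one output over another. The loss,
    # -log(sigmoid(score_chosen - score_rejected)), is small when the
    # reward model already ranks the preferred output higher and large
    # when it disagrees with the human.
    diff = score_chosen - score_rejected
    return math.log1p(math.exp(-diff))  # == -log(sigmoid(diff))
```

Every bias in the labeler pool flows straight through this loss into the reward model, and from there into the policy.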
But RLHF introduced its own problems. You're now exposed to the biases of human labelers, and still limited by what can realistically get labeled. I'm pretty sure RLHF is why every LLM writes with a thousand em dashes: human graders probably labeled em-dash-heavy writing as "better." The training signal is only as good as the humans providing it.
The next major paradigm shift is RLVR. The human feedback gets replaced by provable outcomes. Instead of asking "did a person think this was good?" you ask "did the model get this verifiable question right?" Did the math check out? Did the code compile and pass the tests? No more human labeling bias: the reward signal is objective.
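To make that concrete, here's a deliberately toy verifiable reward for code generation, with pytest as the verifier. The file names are illustrative and a real pipeline would sandbox execution, but the core idea really is this small: the reward is a fact about the world, not an opinion.

```python
import subprocess

def verifiable_reward(candidate_code: str, test_file: str) -> float:
    # Write the model's attempt to disk and run the test suite against
    # it. Reward is binary: the tests either pass or they don't.
    with open("candidate.py", "w") as f:
        f.write(candidate_code)
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-q"],
        capture_output=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```

Sample many attempts, reward the ones that pass, and reinforce whatever reasoning produced them. No human ever looks at the output.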
Models trained with RLVR develop reasoning strategies that weren't explicitly programmed. They learn to decompose problems, check their own work, backtrack when stuck. Not because anyone told them to, but because those strategies lead to verifiably correct answers.
This is where "smart autocomplete" completely breaks down. These models are exploring solution spaces and developing problem-solving strategies through reinforcement, something fundamentally different from what existed even 18 months ago.
But RLVR comes with its own tradeoff, and I think it's the most important one to understand. By removing human bias, you concentrate all progress around what's verifiable. The gains we've been seeing are as much about reliability as capability, and RLVR is built to solve exactly that, but in doing so it narrows where the gains land.
[@karpathy] writes about this really well in his 2025 year in review. Because RLVR trains against verifiable rewards, models "spike" in capability around verifiable domains: math, code, formal logic, and science with testable outcomes. The signal is clean, so the improvement is dramatic.
Through RLVR, Ethan Mollick's jagged frontier becomes Karpathy's jagged intelligence. The models are simultaneously a genius polymath and a confused grade schooler. The jaggedness isn't random; it maps directly to what is and isn't verifiable. The shape of the training determines the shape of the capability.
Writing that resonates emotionally, whether a product will be intuitive for new users, navigating ambiguous situations... there's no clean reward signal for any of that. RLHF helps some, but it's inherently noisy. You can't unit test whether a piece of writing is moving. Anything where the answer is uncertain or inherently a matter of taste sits in the troughs between the peaks.
Instead of asking "is AI good or bad at X," I think the useful reframe is "is X verifiable?" If it is, the models will get good at it if they aren't already. If it isn't, they're probably worse than they appear. They may still feign proficiency, because they're very good at pattern-matching the form of good answers, but they'll lack the substance.
This isn't just about what the models are good at today; it's predictive. If RLVR capability tracks verifiability, then the work that gets automated first will be the most repetitive and most verifiable. This is already playing out: coding was the first major use case for a reason.
That pattern will extend well beyond code. Anything with clear inputs, deterministic outputs, and tight feedback loops is on the near-term automation path. The further you get from verifiability (judgment, taste, ambiguity, context that can't be formalized), the harder it is for RLVR to reach.
A few days ago I wrote about what's still hard even with AI: figuring out what to build, good product taste, distribution, getting AI to work reliably for your specific use case. I think RLVR is the underlying explanation for why those things are still hard. None of them are verifiable. There's no clean reward signal for "did you build the right thing."
When people talk about job displacement by AI, these spaces of ambiguity are the rocks across the river. If you're asking how to succeed in the AI era, the answer is to lean into the gaps where there is no "right" answer. RLVR is the key that simultaneously explains the incredible leap forward in capability and reveals how you can start to dance with AI instead of being replaced by it. Inherently, it needs a partner; it only knows half the moves.