April 29, 2026 · 14 min read

AI Engineering, Decoded.

Fifteen frames that separate people building real AI products from people still tweaking prompts in a chat window.

Shoh GMan @buildingscaling

AI Engineering: Building Applications with Foundation Models — title infographic comparing traditional ML to AI Engineering.

Most people in 2026 still think AI work means writing better prompts. That is the smallest, least interesting layer of the entire stack. The real job, the one nobody at the dinner party can describe, is engineering a probabilistic system to behave deterministically enough to ship to a paying customer.

This letter walks through the fifteen frames I use to think about that job. They cover the whole map: scaling laws, transformer mechanics, post-training, sampling, structured output, hallucinations, and the evaluation crisis that quietly eats most AI startups alive. If you read every section and look at every diagram, you will understand AI engineering at the level of someone shipping it for a living. That is the promise.

1. The paradigm shift.

Traditional ML versus AI Engineering. Traditional ML: neural networks and statistical models trained from scratch. AI Engineering: foundation models plus scalable systems, with stochastic outputs, scale, prompting, and probabilistic behavior as new dimensions. — **Frame 01**Traditional ML vs. AI Engineering

Traditional ML was a craft of training small bespoke models from scratch on tabular data. AI engineering is the opposite shape. You start with a foundation model that already cost someone else nine figures to train, and your entire job is adapting, constraining, and evaluating it inside a real product.

The discipline pivots on three new axes. Scale: you are renting compute, not buying it. Prompting: the instruction is now a first-class artifact. Probabilistic: the same input produces different outputs, and your system has to survive that. If your team still talks about "training the model" before talking about evals, they are solving last decade's problem.

2. The core delta.

Side by side comparison: Traditional ML trains from scratch on tabular structured data with deterministic outputs and feature engineering as the core challenge. AI Engineering adapts existing foundation models on unstructured text, code, and images, with open-ended probabilistic sequences, and context construction, prompting, and evaluation as the core challenge. — **Frame 02**Adapt, don't train

Read this one twice. The phrase "AI Engineering is less about developing models and more about adapting, constraining, and evaluating them" is the entire job description. If you can internalize that one sentence, you will stop hiring the wrong people and stop building the wrong tools.

Feature engineering used to be where the value was created. Today the value lives in context construction, prompting, and evaluation. Those three are the new feature engineering, and most teams are catastrophically underinvested in all three.

3. The stack.

The AI Engineering stack: three layers. Application Development on top includes Prompts and Context (RAG), AI Interfaces, and Evaluation. Model Development middle layer includes Supervised Finetuning, Alignment, Inference Optimization, and Datasets. Infrastructure base includes Compute Clusters, Model Serving, and Monitoring. — **Frame 03**Application, model, infrastructure

Three layers. Infrastructure at the bottom: compute, serving, monitoring. Model development in the middle: SFT, alignment, inference optimization, datasets. Application development on top: prompts and context (RAG), AI interfaces, and evaluation.

Notice which boxes the diagram draws thickest at the top. Evaluation and AI Interfaces. That is not a stylistic choice. That is the observation that for a startup building on top of foundation models, almost all the leverage is in those two boxes. If you are spending most of your time below the application layer, you are doing somebody else's job.

4. The physics of AI.

Scaling laws diagram with three gauges labeled Parameters (the model's learning capacity), Training Tokens (the volume of data fed in), and FLOPs (raw compute budget). Center reads: Optimal Training Tokens approximately 20 times Parameter Count. Two captions: Doubling the model size requires doubling the training data to remain compute-optimal, and Bigger isn't always better if you lack the data to feed it. — **Frame 04**The Chinchilla rule

The single most important number in modern AI: roughly 20 tokens of data per parameter for a compute-optimal training run. This is the Chinchilla finding restated, and it is why the "just make the model bigger" era ended around 2023.

The practical takeaway for an operator: bigger models without bigger and cleaner data are a waste of compute. When you read about a frontier model release, look at tokens trained, not parameter count. The lab that 5x'd the data on the same parameters will eat the lab that 5x'd the parameters on the same data.

5. The representation gap.

Training Data Representation Gap. English makes up 45.88% of training data, Russian 5.97%, under-represented languages like Burmese, Telugu, and Punjabi less than 0.1%. Tokenization comparison: Hello World is 2 tokens in English, 20 tokens in Burmese. Synthesis insight: Under-represented languages aren't just lower quality, they cost up to 10x more in API fees and compute latency due to inefficient tokenization. — **Frame 05**The latency and cost tax

If your product serves a non-English market, this slide is your unit economics. English gets 2 tokens for "Hello World." Burmese gets 20. The same user query costs you 10x more in API fees and runs 10x slower, before you have written a single line of code.

I see Western founders ignore this constantly. Then they expand to Indonesia or Vietnam and their margins quietly evaporate. The fix is not "switch model." The fix is to budget for the tokenization tax up front and design caching, summarization, and language-specific prompts to neutralize it.

6. The engine.

Transformer inference has two phases. Step 1 Prefill is parallel: fast processing that creates the initial key-value state. Step 2 Decode is sequential: an autoregressive bottleneck where tokens are generated strictly one by one. Caption: Inference latency is strictly bounded by the sequential nature of decoding. The model cannot parallelize the creation of a single thought. — **Frame 06**Prefill is fast, decode is the bottleneck

Two phases inside every transformer call: prefill (parallel, fast, eats your input) and decode (sequential, slow, generates one token at a time). The decode step is autoregressive by definition. The model cannot parallelize the creation of a single thought.

This is why streaming feels fast even when total latency is high. It is also why short outputs are dramatically cheaper than long ones, and why "generate a 5,000 token plan" is a bad pattern for any latency-sensitive product. Architect around the decode bottleneck or get punished by it.

7. The post-training pipeline.

Post-training pipeline as a factory line. Raw unstructured data goes through Pre-training (self-supervision via next-token prediction, consumes about 98% of compute, produces a capable but untamed autocomplete engine). Then Supervised Finetuning (behavior cloning via high-quality QA pairs, around 13,000 labeler pairs, produces a conversational assistant). Then Preference Finetuning RLHF/DPO (alignment via reward models and human rankings, produces a safe, human-aligned, customer-appropriate response model). — **Frame 07**Pre-train, SFT, RLHF/DPO

The model that costs $100M to train is the autocomplete engine. The model you actually use as a chat assistant is that engine plus two relatively cheap layers on top: supervised finetuning on roughly 13,000 high-quality QA pairs, then preference finetuning via human rankings (RLHF or DPO).

This matters because almost all the personality, safety, and helpfulness of a model lives in those last two stages. Open-weight bases are usable but feral. The "magic" of GPT, Claude, and Gemini is mostly the alignment pass. When you finetune, you are doing a smaller version of the same trick.

8. The mechanics of generation.

Magnifying glass over a logit vector of raw numbers (1.2, -0.5, 3.8). Arrow to a Softmax funnel that squashes unbounded logits into a clean 0-100% scale. Arrow to a Probabilities (logprobs) bar chart showing blue 50%, red 30%, car 1%. Caption: AI does not output text. It outputs a probability distribution over a vocabulary. — **Frame 08**The model outputs probabilities, not text

The most important mental model to internalize. The LLM does not output words. It outputs a probability distribution over its vocabulary. A vector of raw logits goes through a softmax and you get back a list of every possible next token with a percentage attached.

Once you see this clearly, every other piece of the puzzle clicks. Sampling, hallucination, structured output, evaluation, all of them are downstream consequences of one fact: the model is a probability machine, and we are constantly wrapping it in heuristics to make it look like a writer.

9. Sampling strategies.

Three sampling strategies as probability distribution charts. Temperature (the curve flattener): low temperature is robotic and deterministic, high temperature is creative and chaotic. Top-K (the vertical slice): discards all but the top X highest probability tokens. Top-P / Nucleus (the shaded area): dynamically discards the tail once cumulative probability reaches P, for example 0.9. — **Frame 09**Temperature, Top-K, Top-P

Three knobs that decide how the model picks from that probability distribution. Temperature reshapes the curve: low is sharp and deterministic, high is flat and creative. Top-K draws a hard line at the K most likely tokens. Top-P draws a soft line at whatever tokens cover the top P% of probability mass.

Practical settings: temp 0 for extraction and classification, temp 0.3 to 0.7 for assistant-style writing, top-p 0.9 with temp 0.7 for generative content. Temp 1.5 for brainstorming when you need surprise. Anyone who tells you "lower temperature equals more accurate" without context is bluffing.

10. Test-time compute.

Test-time compute, Best-of-N and Beam Search. A single prompt fans out into multiple parallel responses, all evaluated by a reward model verifier that selects the best one. Data point: sampling multiple responses and selecting the highest average logprob. Synthesis insight: Allocating compute to generate multiple responses at inference time can simulate the performance of a model 30x larger. — **Frame 10**Spend compute at inference, not just training

This is the trick behind most of the 2024-2026 reasoning model jump. Instead of relying on the model to nail the answer in one pass, you generate N answers, score them with a verifier, and ship the best one. It is shockingly effective.

The headline number: a Best-of-N or beam search loop at inference time can simulate the performance of a model 30x larger. For a startup, this is enormous leverage. You do not need a frontier model for hard reasoning if you are willing to spend the inference budget. Most teams have not even started thinking this way.

11. The structuring problem.

The structuring problem funnel. Garbage JSON output drops through four filters. Prompting (the nudge): requesting specific formats in system prompts, high failure rate. Post-processing (the defensive parser): for example LinkedIn's defensive YAML parser raising success from 90% to 99.99%. Constrained sampling (the grammar filter): physically masking or deleting invalid logits like a rogue brace before they pass through softmax. Finetuning (the foundation): training the model exclusively on structured data examples. Final output: clean JSON with structured true, format JSON, success 100%, model finetuned. — **Frame 11**Get JSON to actually parse

Every team that ships an AI product hits this wall: the model gives you slightly broken JSON 1-5% of the time, and your pipeline collapses on that 1-5%. There is a clean ladder of fixes, and you climb it in this order: prompting, defensive parsing, constrained sampling, finetuning.

Constrained sampling is the underrated weapon. You physically mask invalid tokens at the logit layer so the model literally cannot produce a malformed brace. LinkedIn famously took success from 90% to 99.99% with a defensive YAML parser alone. If your product depends on structured output, you should know the entire ladder cold.

12. The probabilistic tax.

Hallucination as a three-step snowball. 1. The Slip: model generates one slightly incorrect token based on tail-end probabilities. 2. The Feedback Loop: that generated token is instantly fed back into the context as absolute reality. 3. The Snowball: model autoregressively justifies the error, generating subsequent tokens that align with the hallucinated reality. Caption: Models hallucinate because they cannot mathematically differentiate between the factual prompt you provided and the fictitious token they just generated. — **Frame 12**Why hallucinations snowball

Hallucination is not a bug, it is a structural property of autoregressive generation. The model picks a slightly wrong token from the tail of the distribution. That token is instantly fed back into the context as if it were ground truth. Every subsequent token is now conditioned on a fact the model invented one step ago. Error compounds.

This is why short, well-grounded prompts hallucinate less than long open-ended ones, and why retrieval augmentation works. You are not curing hallucination, you are starving it. The defensive moves are: shorter generations, citations forced into context, verifier passes on output, and refusing to ask the model open-ended factual questions in the first place.

13. The evaluation crisis.

Close-ended ML has a single ground truth checkmark. Open-ended AI has a dense network of competing valid outputs. Evaluation crisis factors: 1. No definitive ground truth for open-ended text generations. 2. Rapid benchmark saturation, for example MMLU is quickly solved. 3. Expanding general-purpose capabilities require domain-expert evaluators (PhD-level vs first-grade validation). — **Frame 13**The hardest problem in the field

This is the one that quietly kills startups. With open-ended generation there is no single right answer. Benchmarks saturate within a year. And as models get smarter, evaluating their output requires people smarter than the median user, sometimes PhDs.

The implication for builders is uncomfortable: your eval system is probably your most underbuilt asset. If you are not running a regression suite of golden examples on every prompt change, you are flying blind. The teams that will dominate the next two years are not the ones with the best prompt. They are the ones with the best evals.

14. Entropy and perplexity.

Two grids of dots. Low entropy / low perplexity: highly predictable text, the model has learned it well and exhibits low surprise. High entropy / high perplexity: unpredictable text, high information and high model uncertainty. Definitions: Entropy is how much information or surprise a single token carries. Perplexity is the model's degree of uncertainty in predicting the next word. Synthesis insight: Perplexity is the ultimate lie detector. Exceptionally low perplexity on a benchmark dataset indicates the model memorized it during training (data contamination). — **Frame 14**The lie detector

Entropy measures the surprise carried by a single token. Perplexity measures the model's average uncertainty when predicting the next one. They are two views of the same underlying number.

The reason this matters in 2026: perplexity is the ultimate lie detector for benchmark contamination. If a model has suspiciously low perplexity on a public benchmark, it almost certainly saw that benchmark during training. Half the model rankings on Twitter are noise produced by leaderboards measuring memorization. Once you know to look for it, you cannot unsee it.

15. Exact vs. similarity evaluation.

Four evaluation methods compared. Functional Correctness (execution/code, pass@k, best for logic, binary outcome). Exact Match (trivia/math, best for facts, fails on complex phrasing). Lexical Similarity (ROUGE/BLEU, measures overlapping n-grams, rigid to synonyms). Semantic Similarity (embeddings, AI-as-a-Judge, best for open-ended text, captures meaning but subjective). Caption: As system complexity scales, evaluation must fundamentally shift from Exact Match heuristics to Semantic and AI-as-a-Judge methodologies. — **Frame 15**Pick the right ruler

Four rulers, four jobs. Functional correctness for code and logic: did the test pass. Exact match for facts and math: is the string identical. Lexical similarity (ROUGE, BLEU) for translation-style tasks: how much n-gram overlap. Semantic similarity for open-ended generation: does the meaning match.

The trap is that exact-match evals are easy to build, so most teams stop there. As your system gets more open-ended, exact-match becomes a worse and worse proxy for what users care about. You have to graduate to semantic similarity and AI-as-a-judge, and you have to do it before your eval scores start lying to you.

What this actually means for builders.

If you scrolled this far, here is the compression of the entire field into five non-negotiable beliefs.

The model is a probability machine. Every other concept (sampling, hallucination, structured output) is a wrapper around that one fact.
The leverage is in the application layer. Context, prompts, and evals do more for your product than swapping model providers.
Decode is the bottleneck, tokens are the tax. Architect for short outputs and budget for non-English token bloat from day one.
Test-time compute is the cheat code. Best-of-N and verifier loops simulate a model 30x larger for the cost of a few extra calls.
Your evals are your moat. Whoever has the best regression suite of golden examples wins the next two years.

If you remember one thing. AI engineering is not prompt writing. It is the discipline of taking a probabilistic system and making it behave deterministically enough to ship. The work is in context, constraints, and evaluation. The prompt is the tip of the iceberg.

I am building three products on top of this stack right now (Replaceability Index, an internal sales agent, and a fitness coach). Every section above is something I learned the expensive way. The frames are free. The reps are not.

If this letter was useful, the next ones will be too. They go out twice a week and cost you nothing.

— Shoh.