What's Actually Inside the Box
Demystifying LLMs Without the PhD Requirements
This article is part of a series
The Machinery Behind the Magic
Everyone's talking about AI, but most conversations oscillate between breathless hype and dismissive skepticism. This series takes a different approach: treat these tools as what they are—sophisticated machinery that rewards understanding. Like any powerful tool, they have strengths, limitations, quirks, and best practices. You don't need to be an engineer to benefit from knowing how your tools work. These articles will take you from "what even is this?" to "I know exactly when and how to use this."
You’ve probably heard the metaphor: ChatGPT is “autocomplete on steroids.” It’s the kind of explanation that gets tossed around in meetings, accompanied by knowing nods from people who may or may not actually know what they’re nodding about.
The metaphor isn’t wrong, exactly. It’s just insufficient—like calling a car “a horse carriage with an engine.” Technically accurate, misses the point, doesn’t help you drive.
So let’s open the hood. Not to turn you into a machine learning engineer, but because understanding what you’re working with changes how you work with it. A spreadsheet works the same whether or not you understand cell references. An LLM behaves differently based on how you approach it.
How We Got Here (The Short Version)
For decades, if you wanted a computer to understand language, you wrote rules about language. Subject-verb-object. Parse trees. Graduate students spent years encoding grammar, and the systems remained frustratingly brittle. Language is messier than rules can capture. Sarcasm alone probably broke a thousand PhD projects.
Neural networks changed everything. Instead of writing rules, you showed the system millions of examples and let it figure out the patterns itself. The transformer architecture, introduced in 2017, was the key breakthrough: it let models “pay attention” to relevant words no matter how far back in the text they appeared.
Then came the scaling era. GPT-1 in 2018 was interesting. GPT-3 in 2020, with 175 billion parameters, produced text genuinely hard to distinguish from human writing. Each generation got bigger and meaningfully smarter in ways that surprised even the researchers.
ChatGPT launched in November 2022, and the general public noticed. Within two months: a hundred million users. Now there’s OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, Meta’s LLaMA, and an explosion of open-source alternatives.
The Core Mechanism: Next Token Prediction
Here’s the fundamental thing: an LLM is a text prediction machine. Given some text, predict what comes next. That’s the entire training objective.
But “text” here doesn’t mean what you think.
LLMs Don’t See Letters—They See Tokens
The model doesn’t read characters or even whole words. It sees tokens—chunks it learned to treat as units. Common words like “the” are one token. Longer words get split:
"unbelievable" → ["un", "believ", "able"]
"strawberry" → ["str", "aw", "berry"]
This explains a famous failure. Ask most LLMs:
“How many Rs are in strawberry?”
They often get it wrong. Not because they’re stupid—because they literally can’t see individual letters. The model sees ["str", "aw", "berry"] and has to reason about character counts from those chunks. It’s like counting Rs through frosted glass that blurs letters into groups.
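You can see the splitting for yourself with the open-source tiktoken library, which implements the tokenizer OpenAI’s models use. A minimal sketch in Python, assuming tiktoken is installed; the exact splits vary from tokenizer to tokenizer:

# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by GPT-4-era OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "unbelievable", "strawberry"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces} ({len(token_ids)} token(s))")

# The model receives token IDs, not letters, which is why
# "count the Rs" is a harder question than it looks.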
The Context Window: This Is the Whole Game
If you understand one thing about LLMs, understand this. Everything else follows from it.
The context window is the text the model can see when generating a response. Think of it as a text file with a maximum size. For Claude, that’s currently around 200,000 tokens (roughly 150,000 words). For ChatGPT, it varies by model—anywhere from 8,000 to 128,000 tokens.
Here’s what actually happens when you send a message:
[System instructions]
[Your first message]
[Model's first response]
[Your second message]
[Model's second response]
[Your current message]
← Model generates next token here
Everything above that arrow gets fed into the model. The model looks at ALL of it, generates exactly ONE token, appends that token, then looks at everything again to generate the next token. One at a time. Every single token in the response requires re-reading the entire context.
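In rough pseudocode, the generation loop looks something like this. It is a sketch of the idea, not any vendor’s actual implementation; predict_next_token stands in for the model itself:

def generate_response(context_tokens, predict_next_token, max_new_tokens=500):
    # predict_next_token() is a stand-in for the model: it reads the whole
    # sequence and returns the single most likely next token.
    response = []
    for _ in range(max_new_tokens):
        next_token = predict_next_token(context_tokens + response)  # re-reads everything
        if next_token == "<end_of_response>":  # the model signals it has finished
            break
        response.append(next_token)  # exactly one token per pass
    return response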
The Model Has No Memory
This is crucial: the model itself stores nothing between messages.
When you send a message, the application you’re using (ChatGPT, Claude, whatever) assembles that text file—your conversation history plus system instructions—and feeds it to the model. The model generates a response. Then the model is done. It doesn’t “remember” anything.
The next time you send a message, the application assembles the text file again (now including the previous exchange) and feeds it to the model fresh. From the model’s perspective, every single message is the first time it’s seeing the conversation. It’s reading the whole transcript from the beginning, every time.
This is why the “context window” metaphor is so apt. It’s literally a window—a view into a document. The model can only see what’s in that window right now.
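You can see that statelessness in how chat applications are built: on every turn, the app rebuilds the entire transcript and sends it again. A minimal sketch, where send_to_model stands in for whatever API call the application actually makes:

conversation = []  # lives in the application, not in the model

def chat(user_message, send_to_model, system_prompt="You are a helpful assistant."):
    conversation.append({"role": "user", "content": user_message})
    # The full history goes out with every single request.
    full_context = [{"role": "system", "content": system_prompt}] + conversation
    reply = send_to_model(full_context)
    conversation.append({"role": "assistant", "content": reply})
    return reply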
What Happens When You Hit the Limit?
That text file has a maximum size. When your conversation exceeds it, something has to go. Usually, the oldest messages get removed to make room for new ones.
[System instructions]
[Message 15] ← Messages 1-14 are gone
[Response 15]
[Message 16]
[Response 16]
...
The model doesn’t “forget” those early messages in any active sense. They’re just… not in the document anymore. The model never sees them because they’re not being fed in.
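The simplest version of that trimming logic looks roughly like the sketch below. Real applications are often smarter about what they drop, and count_tokens is a stand-in:

def trim_to_fit(system_prompt, messages, count_tokens, max_tokens=200_000):
    # Drop the oldest messages until the transcript fits the window.
    while messages and count_tokens([system_prompt] + messages) > max_tokens:
        messages.pop(0)  # message 1 goes first, then message 2, and so on
    return [system_prompt] + messages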
“But ChatGPT Remembers Things About Me”
You might have noticed that ChatGPT or Claude sometimes seems to remember things from weeks ago—your name, your job, preferences you’ve mentioned. This isn’t the model remembering. It’s a separate system.
These applications run a background process that extracts key facts from your conversations and stores them in a database. When you start a new conversation, those facts get injected into the context window:
[System instructions]
[Memory: User's name is Sarah. User is a marketing manager.
User prefers concise responses.] ← Injected from database
[Your new message]
The model sees this exactly the same way it sees any other text in the context. There’s no special “memory” capability—it’s literally just text that got prepended to your conversation. The application is doing the remembering; the model is just reading what it’s given.
This is also why memory is selective and sometimes wrong. The system that extracts facts is itself imperfect. It might miss things, misinterpret things, or store outdated information that contradicts what you’ve said since.
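Stripped to its essentials, the whole memory feature is string concatenation. A sketch, with lookup_user_facts standing in for the extraction-and-storage system the application runs behind the scenes:

def build_context(system_prompt, new_message, lookup_user_facts, user_id):
    # lookup_user_facts() is a stand-in for the app's fact database,
    # e.g. ["Name: Sarah", "Role: marketing manager", "Prefers concise replies"]
    facts = lookup_user_facts(user_id)
    memory_block = "Memory: " + " ".join(facts) if facts else ""
    # To the model, this is just more text at the top of the window.
    parts = [system_prompt, memory_block, new_message]
    return "\n".join(part for part in parts if part)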
Why Conversations Drift (And What To Do About It)
Here’s a scenario. You ask the model to write something. The response isn’t quite right.
You: Write me a product description for...
Model: [Generates description A]
You: No, make it more casual
Model: [Generates description B]
You: Better, but don't mention the price
Model: [Generates description C]
You: Can you make the opening punchier?
Model: [Generates description D]
By the time you get to description D, here’s what’s in the context window:
[System instructions]
[Your original request]
[Description A - which you rejected]
[Your feedback: "more casual"]
[Description B - partially rejected]
[Your feedback: "don't mention price"]
[Description C - partially rejected]
[Your feedback: "punchier opening"]
[Description D]
The model is now generating while looking at THREE rejected or partially-rejected descriptions, plus your scattered feedback, plus the original request. It’s trying to synthesize all of this into what you want.
What did you actually want? You wanted something casual, without the price, with a punchy opening. But the model doesn’t know which parts of descriptions A, B, and C were good versus bad. Maybe the ending of A was perfect. Maybe the middle of B was great. From the model’s view, it just sees attempts and corrections, and it has to guess which elements to keep.
This is why conversations drift. The context gets polluted with wrong turns, partial attempts, and ambiguous feedback. Every failed attempt makes the next attempt harder.
Experience and research on multi-turn conversations both point the same way: the first response, given proper context, is usually the best you’ll get. The longer a conversation goes, the more noise accumulates.
The fix is often counterintuitive: start over. Take what you learned from the failed attempts, write a single clear prompt that incorporates all your requirements, and start a fresh conversation. You’ll often get a better result on the first try than you got after five rounds of refinement.
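For the product-description example above, the consolidated restart might look something like this (illustrative; the details are whatever you learned from the failed rounds):

You (in a fresh conversation): Write a product description for [product].
Keep the tone casual, don't mention the price, and open with a punchy
one-liner. For reference, here's an earlier draft whose ending I liked: [paste]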
Hallucinations: Confidently Wrong
“Hallucination” has become the catch-all term for when LLMs produce false information with complete confidence. But understanding why this happens makes it avoidable.
The model generates text that fits the pattern. It’s trying to produce what would logically come next given what’s in the context window. If you ask for something that isn’t in the context, the model doesn’t say “I don’t have that information.” It generates what would be there if it existed.
Classic example:
You: “Give me three academic citations supporting the claim that…”
Model: [Generates three official-looking citations with authors, journals, page numbers]
If the model doesn’t have real citations in its context (from training data, from a web search, from a document you provided), it will generate plausible-looking fake ones. Why? Because you asked for citations, and the pattern-completing thing to do is provide them. The format is predictable. Author names follow certain patterns. Journal names follow patterns. The model fills in the template with generated content.
This isn’t a bug to be fixed. It’s how the system works. The model optimizes for “what text should come next,” not “is this true.”
The Fix Is Straightforward
Don’t ask for things that aren’t in the context window.
If you need citations, give the model a document and ask it to cite from that document. If you need current information, make sure the model searches the web first (and verify those results exist). If you need facts, provide the facts and ask the model to work with them.
The hallucination problem largely disappears when you stop treating the model as an oracle and start treating it as a text processor that works with what you give it.
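A grounded version of the earlier citation request might look like this (illustrative):

You: Here is the full text of [report or paper]: [pasted text]
Using only this document, find three passages that support the claim
that…, quote them directly, and tell me where each one appears.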
There are more sophisticated prompting techniques we’ll cover in a follow-up article, but this single principle—don’t ask for information that isn’t in the context—eliminates most hallucinations.
Training vs. Using: Two Different Things
Training is when the model learns. It reads billions of text examples and adjusts its internal parameters to predict better. This happens once, takes months, costs millions of dollars, and produces a frozen model.
Think of it like this: if you read a thousand detective novels and then tried to write one, you’d have absorbed patterns about how these stories work. The detective finds a clue in chapter two. There’s a red herring around the midpoint. You didn’t memorize any specific novel, but you learned the genre’s grammar. LLMs do this across virtually all written human knowledge.
Inference is when you use it. The model applies patterns it already learned. It’s not learning from your conversation—it’s the same pristine model every time. The model that responds to your tenth message is identical to the one that responded to your first. Only the context window changed.
Some models get fine-tuned: additional training on specific data to make them better at conversations or particular tasks. But once any training is done, the model is fixed.
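In cartoon form, the difference looks like this. A sketch only; adjust_weights and predict_next_tokens are stand-ins, and real training is vastly more involved:

def train(model, corpus):
    # Happens once, over billions of examples. Parameters change here.
    for text in corpus:
        model.adjust_weights(text)  # stand-in for the actual learning step
    return model  # frozen from here on

def respond(model, context):
    # Happens every time you send a message. The parameters never change;
    # only the context differs from call to call.
    return model.predict_next_tokens(context)  # stand-in for generation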
What LLMs Are Good At
Pattern recognition and generation. Anything involving patterns in text: summarizing, translating, reformatting, style-matching. The model was built for this.
Following instructions. The model has seen millions of examples of different writing styles and registers. Business emails, technical docs, casual explanations—ask for a style, and it can usually match it.
Synthesis. Give the model information and ask it to combine, compare, extract, or restructure. This is where the context window becomes a superpower. You can paste in multiple documents and ask the model to find connections you might have missed.
Ideation. Need twenty approaches to a problem? The model generates options faster than you can evaluate them. Many of them will be mediocre, but buried in there are often angles you wouldn’t have considered. Generate many, curate ruthlessly, build on what’s useful.
Code. Surprisingly strong, because code is highly structured and massively documented online. The model has seen the same patterns thousands of times.
What LLMs Are Bad At
Anything requiring information they don’t have. This sounds obvious but it’s the root of most problems. Current events (unless they search). Niche facts not in training data. Your company’s internal processes. Anything that isn’t either in their training or in the context window you’ve provided.
Web search (sort of). Modern LLMs can search, but they’re using search engines the same way you would—typing queries, reading results. They’re not magically connected to the internet. A human with good search skills often finds specific information faster. The model’s advantage is synthesizing what it finds, not the searching itself.
Complex math and logic. The model pattern-matches on what correct math looks like rather than computing. Simple problems work. Complex multi-step reasoning accumulates errors.
Character-level tasks. Counting letters, reversing strings, anagrams—hard because of tokenization. The model doesn’t see individual characters.
Common Misconceptions
“It’s just Google with extra steps.”
Search retrieves existing documents. LLMs generate new text. When you search Google, you get links to things humans wrote. When you use an LLM, you get something that didn’t exist before. Yes, LLMs can search—but they’re generators that use search, not search engines.
“It’s going to take my job.”
The pattern so far: humans using these tools effectively outperform both the tools alone and humans without tools. “Learn to use the tools” beats “panic about the tools.”
The Mental Model That Matters
Here’s the framework that makes everything click:
The model only knows what’s in the context window. Training data gave it patterns and general knowledge. But for any specific task, it’s working with exactly what you’ve given it in that conversation—plus whatever system instructions and memory got injected.
The model generates what would logically come next. Not what’s true. Not what’s best. What fits the pattern given the context.
Every token is generated fresh. The model re-reads everything, generates one token, repeats. There’s no persistent state, no ongoing “thought process” between your messages.
Once you internalize this, you stop being surprised by LLM behavior. Hallucinations happen when you ask for things outside the context. Conversations drift because failed attempts pollute the context. Fresh starts work because you get a clean context.
The model is a sophisticated text-in, text-out machine. The text you put in determines the text you get out. Everything else is application scaffolding built around this core.