LLMs for Engineers — Part 2
In the last post, we uncovered something surprising:
ChatGPT is just predicting the next word.
No thinking. No understanding.
Just prediction.
But that leads to a bigger question:
👉 How does it get so good at this?
Because predicting random words is easy.
Predicting useful, meaningful answers is not.
So where does that ability come from?
The short answer
The internet.
But not in the way most people think.
ChatGPT is NOT connected to Google.
It doesn’t “search” the internet when you ask something.
Instead:
👉 It was trained on a massive dataset built from the internet
Let’s go one level deeper
Imagine this…
We are trying to build a model that can talk like the internet.
That means:
👉 We need to feed it a massive amount of data
articles
Wikipedia
code
forums
basically… anything and everything
This first phase is called:
Pre-training
But where does this data come from?
AI companies don’t just randomly scrape websites.
They use structured, large-scale datasets.
One of the most popular ones is:
👉 Common Crawl
You can check it here:
https://commoncrawl.org/
What is Common Crawl?
Think of it like a bot that:
Starts from a few websites
Follows links
Keeps crawling the internet
And collects:
HTML
scripts
styles
everything on the page
Basically… a raw dump of the internet.
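The crawl loop above is basically a breadth-first walk over links. Here's a minimal sketch — using a tiny in-memory "web" (a dict mapping URLs to their outgoing links) instead of real HTTP requests, so the names and pages are made up for illustration:

```python
from collections import deque

# Toy "internet": each URL maps to the links found on that page.
# A real crawler like Common Crawl fetches pages over HTTP instead.
WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],
}

def crawl(seeds):
    """Start from seed URLs, follow links, collect every page seen."""
    seen, queue = set(), deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # already visited this page
        seen.add(url)
        queue.extend(WEB.get(url, []))  # follow the page's links
    return seen

print(crawl(["a.com"]))  # reaches all four pages from one seed
```

Start with one seed, end up with everything reachable from it. That's the whole idea — scaled up to billions of pages.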
But we can’t feed this directly to a model
Because raw web pages are messy:
ads
popups
broken code
navigation junk
👉 Feeding this directly would be useless.
So a LOT of filtering happens.
Let’s break it down step by step
1. URL Filtering
First, remove junk sources:
spam sites
malware
adult content
low-quality domains
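In code, this step can be as simple as a domain blocklist check. The domains below are placeholders — real pipelines use large curated lists of spam, malware, and adult domains:

```python
from urllib.parse import urlparse

# Hypothetical blocklist for illustration only.
BLOCKED_DOMAINS = {"spam-site.example", "malware.example"}

def keep_url(url):
    """Drop URLs whose domain is on the blocklist."""
    domain = urlparse(url).netloc
    return domain not in BLOCKED_DOMAINS

urls = ["https://en.wikipedia.org/wiki/AI", "https://spam-site.example/win"]
print([u for u in urls if keep_url(u)])  # only the Wikipedia URL survives
```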
2. Text Extraction
Now extract only useful content.
Remove HTML, CSS, menus
Keep only main text
👉 Think of it like:
Dropping packet headers and keeping the payload
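A toy version of "keep the payload" using Python's built-in HTML parser — a sketch, not what production pipelines actually run (they use heavier tools), but the idea is the same:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep visible text, skip <script>, <style>, and <nav> blocks."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts, self.skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1  # entering a junk block

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1  # leaving a junk block

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())  # keep only real text

page = "<html><style>p{color:red}</style><p>Hello world</p><script>x=1</script></html>"
p = TextExtractor()
p.feed(page)
print(" ".join(p.parts))  # -> Hello world
```

The CSS and JavaScript vanish. Only the text a human would read survives.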
3. Language Detection
Most models focus on English.
So they check:
👉 Is this page English, with a classifier confidence of roughly 65% or more?
If not → removed
This directly impacts what languages the model becomes good at.
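Real pipelines use trained language classifiers (fastText is a common choice). To make the idea concrete, here's a toy heuristic that just measures the share of very common English words — the word list and threshold are illustrative, not what any real pipeline uses:

```python
# Toy stand-in for a real language classifier.
COMMON_ENGLISH = {"the", "a", "is", "of", "and", "to", "in", "it", "that", "on"}

def english_score(text):
    """Fraction of words that are very common English words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in COMMON_ENGLISH for w in words) / len(words)

def keep_page(text, threshold=0.2):  # threshold chosen for illustration
    return english_score(text) >= threshold

print(keep_page("the cat is on the mat"))        # True  -> kept
print(keep_page("el gato está en la alfombra"))  # False -> removed
```

Crude, but it shows the mechanism: score each page, keep it only if the score clears a threshold.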
4. Deduplication
Same content repeated = waste.
👉 Like analyzing the same packet capture 100 times
So duplicate pages and mirror sites are removed.
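The simplest form of this is exact deduplication: hash each page and keep only the first copy. (Real pipelines also do fuzzy dedup for near-duplicates; this sketch covers exact matches only.)

```python
import hashlib

def dedupe(pages):
    """Keep only the first copy of each exact page text."""
    seen, unique = set(), []
    for text in pages:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if h not in seen:  # first time we see this content
            seen.add(h)
            unique.append(text)
    return unique

pages = ["same article", "same article", "different article"]
print(dedupe(pages))  # -> ['same article', 'different article']
```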
5. PII Removal
Finally, sensitive data is removed:
phone numbers
addresses
credit card info
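A minimal sketch of the idea with regex: find a pattern, replace it with a placeholder. These two patterns are deliberately simple — production PII scrubbing covers far more formats than this:

```python
import re

# Illustrative patterns only: one US-style phone format, one card format.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "card":  re.compile(r"\b(?:\d{4}[-\s]){3}\d{4}\b"),
}

def scrub(text):
    """Replace each PII match with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(scrub("Call 555-123-4567 or pay with 4111-1111-1111-1111"))
# -> Call <phone> or pay with <card>
```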
So what’s left?
After all this…
👉 We get clean, high-quality text
millions of web pages
only meaningful content
no junk
And even after heavy filtering…
👉 It’s still 40+ TB of data
The key idea
Each piece of this text becomes a building block
Just like:
👉 packets flowing through a massive network
And this is what the model learns from.
Not the internet directly…
👉 but a cleaned, compressed version of it
Why this matters (this is big)
Now you understand:
Why ChatGPT feels so knowledgeable
Why it sometimes gives wrong answers
Why rare topics confuse it
Because:
It only knows what it has seen enough times
Engineer’s mental model ⚙️
Think of it like:
Training data = logs
Model = pattern learner
Output = best possible guess
One simple takeaway
ChatGPT is trained on a filtered version of the internet — not the internet itself
What’s coming next
So far we’ve seen:
What ChatGPT does → predict next word
Where it learns from → internet data
But here’s the real twist:
👉 Computers don’t understand text like we do.
So how does text even go inside a model?
Next in this series
👉 How AI reads text (Tokenization)
This will completely change how you think about language and machines.
If this clicked for you, you’re already ahead of most people using AI today.
Let’s go deeper 🚀
Smiles :)
Anurudh


