LLMs for Engineers — Part 3
Welcome back to our LLM journey. If you're new here, check out the previous posts before diving into this one.
So far, we’ve seen:
ChatGPT predicts the next word
It learns from internet-scale data
Let's talk about 👉 How does text even get inside a machine?
Because computers don’t understand:
words
sentences
meaning
Step 0: What you see vs what AI sees
You read:
“Hello world”
You see meaning instantly.
But a computer?
👉 It first converts this into binary (0s and 1s)
At the lowest level, everything becomes:
👉 01001000 01100101 …
Step 1: From text → bytes
Those binary bits are grouped into bytes (8 bits).
For plain English text, each byte represents one character.
So:
H → some byte
e → another byte
l → another byte
Now your sentence becomes:
👉 A long sequence of numbers
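The steps above are easy to see for yourself with a few lines of Python (standard library only):

```python
text = "Hello world"

# UTF-8 turns each character into one or more bytes (numbers 0-255)
byte_values = list(text.encode("utf-8"))
print(byte_values)   # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

# At the bit level, the sequence is 8x longer again
bits = "".join(f"{b:08b}" for b in byte_values)
print(bits[:16])     # '0100100001100101' -> 'H' then 'e'
print(len(bits))     # 88 bits for an 11-character sentence
```

That `01001000 01100101` is exactly the binary we saw in Step 0 — and the sentence is already 88 symbols long.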
Problem ⚠️
This representation has a big issue:
👉 The sequence becomes VERY long
And in AI:
Sequence length = cost
Longer sequence → more compute → slower training
Step 2: The key tradeoff
We have two options:
❌ Few symbols (0/1) → very long sequence
✅ More symbols → shorter sequence
So we choose:
👉 Increase vocabulary size to reduce sequence length
Step 3: Move to tokens
Instead of working with:
bits
or characters
We create:
Tokens = smarter chunks of text
Think of tokens like this
Not just words…
Tokens can be:
full words → “Hello”
partial words → “ing”
combinations → “ world”
even spaces
Real example
“Hello world” becomes:
👉 [15339, 1917]
These numbers are just:
👉 IDs of tokens
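Here's a toy sketch of that idea — a made-up four-entry vocabulary (the IDs are just the ones from the example above, not a real tokenizer's table, and real vocabularies have ~100,000 entries):

```python
# Hypothetical mini-vocabulary for illustration only
vocab = {"Hello": 15339, " world": 1917, "ing": 278, " ": 220}

def encode(text, vocab):
    """Greedily match the longest known chunk at each position."""
    ids, i = [], 0
    while i < len(text):
        for chunk in sorted(vocab, key=len, reverse=True):
            if text.startswith(chunk, i):
                ids.append(vocab[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return ids

print(encode("Hello world", vocab))  # [15339, 1917]
```

Notice that " world" (with the leading space) is a single token — spaces are part of tokens, not separators.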
Important insight ⚠️
Tokens are NOT words.
They are:
compressed pieces of text
optimized for efficiency
based on patterns in data
Step 4: How tokens are created
This is done using:
Byte Pair Encoding (BPE)
Simple intuition (no jargon)
Imagine scanning text and finding:
👉 which patterns appear again and again
Example:
“th” appears a lot
“ing” appears a lot
“the” appears a lot
So we merge them into single tokens
What happens then?
Frequent patterns:
👉 become single tokens
Rare patterns:
👉 break into smaller pieces
Why this is powerful
Because:
reduces sequence length
keeps flexibility
adapts to any language
👉 Best of both worlds
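The merge loop above can be sketched in plain Python. This is a minimal BPE demo, assuming we start from raw UTF-8 bytes and invent new token IDs from 256 upward (real tokenizers run thousands of these merges over huge corpora):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from raw bytes, merge the most frequent pair, repeat
text = "the thin thing"
tokens = list(text.encode("utf-8"))   # 15 bytes
for new_id in range(256, 259):        # 3 merges for the demo
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair, new_id)
print(tokens)  # shorter than the original 15-byte sequence
```

The very first merge here picks "th" — exactly the kind of frequent pattern described above — and the sequence shrinks with every merge.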
Vocabulary size
Modern models use:
👉 ~100,000 tokens
Each token = unique ID
Step 5: Everything becomes a sequence
Now your input is no longer text.
It becomes:
👉 A 1D sequence of token IDs
Example:
👉 [1543, 9281, 77, 201…]
Let’s make this practical 🔍
So far, this might feel a bit abstract.
Let’s actually see tokens in action.
Try this yourself
Go to:
👉 https://tiktokenizer.vercel.app/?model=cl100k_base
(This is a tokenizer used in models like GPT-4)
What you’ll see
Type any sentence on the left.
Example:
“Hello world”
You’ll notice:
It splits into tokens
Each token has a unique ID
Now experiment (this is where it gets interesting)
Try these one by one:
1. Change spacing
“Hello world”
“Hello  world” (extra space)
👉 Different tokens
2. Change case
“Hello”
“hello”
👉 Different tokens again
3. Add punctuation
“Hello world”
“Hello world!”
👉 Tokenization changes
Key realization 💡
Even tiny changes in text:
👉 Completely change the token sequence
Why this matters (real-world impact)
This is exactly why:
Prompts behave differently
Formatting matters
Small tweaks change outputs
Because:
The model never sees your text
It only sees the tokens
One powerful way to think about it
When you write a prompt…
You’re not writing for a human.
👉 You’re designing a token sequence for a machine
Why 1D sequence matters
Because neural networks expect:
👉 sequence of symbols
Not paragraphs. Not documents.
Just:
👉 ordered list of tokens
Subtle but powerful behavior ⚠️
Small changes in input = different tokens
Example:
“Hello world”
“Hello  world” (extra space)
“hello world” (lowercase)
👉 All produce different token sequences
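You can see the same effect with nothing but the standard library — even at the byte level, tiny edits shift the whole numeric sequence (tokenizers only amplify this):

```python
# Three near-identical inputs, three different numeric sequences
variants = ["Hello world", "Hello  world", "hello world"]
sequences = [list(v.encode("utf-8")) for v in variants]

for v, seq in zip(variants, sequences):
    print(repr(v), "->", seq[:4], "...")

# All three sequences are distinct
print(len(set(map(tuple, sequences))))  # 3
```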
This explains a LOT
Now you understand:
Why prompts are sensitive
Why formatting matters
Why spacing changes output
Why small tweaks give different results
Because:
You are changing the token sequence fed to the model
Engineer’s mental model ⚙️
Think of it like:
Text → encoding
Encoding → tokens
Tokens → IDs
IDs → model input
👉 Like packet encoding before transmission
The big picture
We started with:
👉 messy internet text
Now we have:
👉 clean sequence of token IDs
This is what the model actually sees.
One simple takeaway
AI does not read text.
It processes tokens (numbers).
What’s coming next
Now we have:
Data → tokens
Tokens → numbers
Next question:
👉 How does the model learn patterns from these tokens?
Next in this series
👉 The core engine: predicting the next token
(This is where the real magic — or math — begins)
Smiles :)
Anurudh