Understanding embeddings: representing language in vector space.
Computers process numeric data. Words are discrete symbols with no intrinsic numerical representation. Converting “cat” to a number that encodes its meaning is non-trivial.
A first attempt (one-hot encoding): give each word its own dimension: cat = [1, 0, 0, 0], dog = [0, 1, 0, 0], pizza = [0, 0, 1, 0], . . .
The problem: every pair of distinct one-hot vectors is orthogonal, so “cat” is exactly as far from “dog” as from “pizza”. We need a representation where similar words get similar numbers.
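A minimal sketch (plain Python, toy vocabulary) of why one-hot vectors cannot encode similarity: every pair of distinct words has dot product 0, so their cosine similarity is 0 across the board.

```python
import math

# Toy one-hot vocabulary: each word gets its own dimension.
vocab = ["cat", "dog", "pizza", "run"]
one_hot = {
    w: [1.0 if j == i else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "cat" is exactly as (dis)similar to "dog" as to "pizza": cosine 0.
print(cosine(one_hot["cat"], one_hot["dog"]))    # 0.0
print(cosine(one_hot["cat"], one_hot["pizza"]))  # 0.0
```

Whatever vocabulary you pick, the geometry carries no information about meaning; that is the gap dense embeddings close.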
Map every word to a short, dense vector (e.g. 300 numbers) where geometry reflects meaning.
But how do we learn these vectors? 
“Words that appear in similar contexts tend to have similar meanings.”
“Cat” and “dog” keep showing up near the same words, so we push their vectors closer together.
This is the theoretical foundation behind Word2Vec, GloVe, and all distributional embeddings.
Word2Vec comes in two variants, CBOW and Skip-gram. Both use the same simple neural network; they just reverse the prediction direction.
| CBOW | Skip-gram |
|---|---|
| Context words $\to$ predict center word | Center word $\to$ predict context words |
| “The cat ______ on the” $\Rightarrow$ predict “sat” | “sat” $\Rightarrow$ predict “The”, “cat”, “on”, “the” |
| Average the context embeddings, then predict | One word in, predict each neighbor independently |
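The training pairs behind both objectives can be sketched in a few lines (toy sentence and window size chosen for illustration). Skip-gram uses the pairs as (input, target); CBOW groups them the other way around.

```python
# Generate (center, context) training pairs with a window of 2.
# Skip-gram trains center -> context; CBOW trains context -> center.
tokens = "the cat sat on the mat".split()
window = 2

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:2])   # [('the', 'cat'), ('the', 'sat')]
print(len(pairs))  # 18
```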
Key insight: No labels are needed. The text itself is the training signal — this is self-supervised learning.
The embeddings are the weight matrix $W$ of the network. They start random and get refined over millions of training examples.
Over time, words that keep appearing in similar contexts get pushed toward similar vectors. Words that never co-occur drift apart.
The loss function is just $-\log P(\text{correct word})$. Early on, the model is barely better than random. After billions of examples, it becomes very confident about which words belong near which other words.
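To make the loss concrete (toy probabilities, for illustration only): if the model assigns probability $P$ to the word that actually appeared, the loss is $-\log P$. A near-random guess over a large vocabulary is heavily penalized; a confident correct prediction costs almost nothing.

```python
import math

def nll(p_correct):
    """Negative log-likelihood of the word that actually appeared."""
    return -math.log(p_correct)

# Random guessing over a 10,000-word vocabulary: p = 1/10000.
print(round(nll(1 / 10_000), 2))  # 9.21
# A well-trained model that puts 90% mass on the right word:
print(round(nll(0.9), 2))         # 0.11
```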
GloVe: Build a word pair frequency matrix across the entire corpus. The ratios of co-occurrence counts encode semantic relationships.
Count how many times two words appear within 10 words of each other, across the entire corpus:
The ratio $\frac{P(\text{solid} \mid \text{ice})}{P(\text{solid} \mid \text{steam})} = \frac{265}{30} \approx 8.8$ suggests that “solid” is strongly associated with “ice” but not with “steam”.
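A sketch of the counting step (toy corpus and a window of 3 instead of 10, purely illustrative, not real GloVe data): the co-occurrence matrix is just a table of pair counts within the window, and ratios like the one above fall out of its entries.

```python
from collections import Counter

def cooccurrence(tokens, window=10):
    """Count how often each ordered word pair appears within `window` words."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

# Tiny illustrative corpus.
corpus = "ice is a solid and steam is a gas but water is a liquid".split()
counts = cooccurrence(corpus, window=3)
print(counts[("ice", "solid")])  # 1
print(counts[("ice", "gas")])    # 0 (never within the window)
```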
Word2Vec and GloVe assign one vector per word. If a word wasn’t in the training data (typos, rare words, new terms), it gets nothing.
FastText’s idea: Break words into character n-grams and sum their embeddings.
“playing” $\to$ <pl, pla, play, lay, ayi, yin, ing, ng>, ... \(\mathbf{v}_{\text{playing}} = \sum \mathbf{v}_{\text{n-gram}}\)
Trade-off: Much larger model ($\sim$2M n-grams vs. 400K words $\Rightarrow$ 5$\times$ more parameters).
Word2Vec/GloVe approach: “playing” $\to$ a single atomic vector lookup; if “playing” is not in the vocabulary, it gets no vector at all. FastText’s innovation: build the vector from character n-grams, so even unseen words get a representation.
(Exam question: list all n-gram tokens, or, given $n$, say how many tokens there are. An n-gram is a contiguous sequence of $n$ characters from the word. Always include the boundary markers < and >.)
Word: “unhappy” (add boundary markers: <unhappy>)
3-grams (7): <un, unh, nha, hap, app, ppy, py>
4-grams (6): <unh, unha, nhap, happ, appy, ppy>
5-grams (5): <unha, unhap, nhapp, happy, appy>
6-grams (4): <unhap, unhapp, nhappy, happy>
</gr-replace>
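A small helper (hypothetical, plain Python) that generates the character n-grams above and confirms the counts: a word of length $L$ with the two boundary markers yields $L + 2 - n + 1$ n-grams of size $n$.

```python
def char_ngrams(word, n):
    """All character n-grams of `word`, including boundary markers < and >."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("unhappy", 3))
# ['<un', 'unh', 'nha', 'hap', 'app', 'ppy', 'py>']
for n in (3, 4, 5, 6):
    print(n, len(char_ngrams("unhappy", n)))  # 7, 6, 5, 4 tokens
```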
Everything we’ve seen so far — Word2Vec, GloVe, FastText — gives each word one fixed vector regardless of context. These are static embeddings.
The problem: “The river bank was steep” and “I went to the bank to deposit money” use completely different meanings of “bank,” but get the same vector.
Contextual embeddings (BERT, GPT) generate a different vector for each occurrence based on surrounding words.
“Bank” near “river” gets a water-related vector.
“Bank” near “deposit” gets a financial vector.
Key takeaway: Static embeddings are a compromise across all senses. This is a fundamental limitation, not a bug fixable with more data.
Similarity between embeddings is computed using cosine similarity (the angle between vectors), not Euclidean distance.
\[\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{u_1 v_1 + u_2 v_2}{\sqrt{u_1^2 + u_2^2} \sqrt{v_1^2 + v_2^2}} \text{ (written out in 2-D)}\]Measures the angle between two vectors. Higher = more similar. Range: $[-1, 1]$.
Example: $(1, 4)$ vs. $(2, 3)$: $\cos = \frac{2+12}{\sqrt{17} \cdot \sqrt{13}} = \frac{14}{\sqrt{221}} \approx 0.941$
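The worked example can be checked numerically (a minimal sketch in plain Python):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine((1, 4), (2, 3)), 4))  # 0.9417
```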
Example: Two documents about “machine learning” could have different word frequencies. Document A uses 100 words total, Document B uses 1000 words, but they discuss the same topic. The Euclidean distance would say they’re very different (a large distance) due to the document length. Cosine similarity ignores length and measures whether they talk about the same topics in the same proportions.
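To see the length effect concretely (toy 2-D "document vectors", illustrative only): scale a vector by 10, as if the same topic mix were written at ten times the length, and Euclidean distance explodes while cosine similarity stays at 1.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

doc_a = (3.0, 4.0)    # short document
doc_b = (30.0, 40.0)  # same topic proportions, 10x the length
print(euclidean(doc_a, doc_b))  # 45.0 -- looks very different
print(cosine(doc_a, doc_b))     # 1.0  -- identical direction
```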
Word analogies via vector arithmetic: $\mathbf{v}_A - \mathbf{v}_B + \mathbf{v}_C$ reads as $A : B :: \ ? : C$.
Example: $\text{Paris}(1, 4) - \text{France}(2, 4) + \text{Italy}(2, 3.1) = (1, 3.1)$
Nearest by cosine: $\text{Rome} (1, 3)$. $\quad \text{Paris} : \text{France} :: \text{Rome} : \text{Italy}$.
Remember: This only works when the two pairs share a consistent relationship. The math always gives an answer—it’s your job to judge whether it’s meaningful!
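The analogy above, verified with the toy 2-D vectors from the example (Berlin’s vector is borrowed from the exercise table below as a distractor; nearest neighbor is taken by cosine, excluding the query words):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = {
    "Paris": (1.0, 4.0), "France": (2.0, 4.0),
    "Italy": (2.0, 3.1), "Rome": (1.0, 3.0),
    "Berlin": (2.0, 5.0),  # distractor
}

# Paris - France + Italy
query = tuple(p - f + i for p, f, i in
              zip(vectors["Paris"], vectors["France"], vectors["Italy"]))
print(query)  # (1.0, 3.1)

candidates = {w: v for w, v in vectors.items()
              if w not in ("Paris", "France", "Italy")}
answer = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(answer)  # Rome
```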
So far we have talked about word embeddings. What about sentence embeddings? Simplest approach: average all word vectors in the sentence: $\mathbf{\bar{v}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_i$
Stopword removal: Words like “the,” “of,” and “and” add noise to the average because they appear everywhere and carry little meaning. Removing them lets content words dominate.
TF-IDF weighting: instead of equal weights, give lower weight to words that appear in many documents (or in both sentences): $\mathbf{\bar{v}} = \frac{\sum_i w_i \mathbf{v}_i}{\sum_i w_i}$. This downweights ubiquitous words like “the” and “of”.
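A sketch of both variants (hypothetical vectors and weights; any TF-IDF scheme would supply real $w_i$):

```python
def sentence_embedding(words, vectors, weights=None):
    """Weighted average of word vectors; equal weights if none are given."""
    if weights is None:
        weights = {w: 1.0 for w in words}
    total_w = sum(weights[w] for w in words)
    dim = len(next(iter(vectors.values())))
    return tuple(sum(weights[w] * vectors[w][d] for w in words) / total_w
                 for d in range(dim))

vectors = {"the": (1.0, 1.0), "cat": (2.0, 5.0), "sat": (3.0, 1.0)}
# Plain average:
print(sentence_embedding(["the", "cat", "sat"], vectors))
# Downweight the stopword "the":
print(sentence_embedding(["the", "cat", "sat"], vectors,
                         weights={"the": 0.1, "cat": 1.0, "sat": 1.0}))
```

With the stopword downweighted, the content words “cat” and “sat” dominate the average, which is exactly the intent of TF-IDF weighting.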
Given a target word and window size = 2:
| Token | Vector | Token | Vector |
|---|---|---|---|
| berlin | (2, 5) | trade | (3, 1) |
| germany | (3, 5) | law | (2, 3) |
| madrid | (2, 4) | court | (3, 0) |
| barcelona | (3.2, 4) | stream | (0, 3) |
| spain | (3, 4.1) | current | (0, 3) |
| the | (1, 1) | | |
| and | (1, 1) | | |
Q1. Compute the cosine similarity for Berlin–Madrid and for Berlin–Barcelona. Which pair is more similar?
Berlin–Madrid: $\frac{4+20}{\sqrt{29} \cdot \sqrt{20}} = \frac{24}{\sqrt{580}} \approx \mathbf{0.997}$
Berlin–Barcelona: $\frac{6.4+20}{\sqrt{29} \cdot \sqrt{26.24}} = \frac{26.4}{27.585} \approx \mathbf{0.957}$
Berlin–Madrid is more similar.
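Checking Q1 numerically (a minimal sketch using the table’s vectors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

berlin, madrid, barcelona = (2, 5), (2, 4), (3.2, 4)
print(round(cosine(berlin, madrid), 3))     # 0.997
print(round(cosine(berlin, barcelona), 3))  # 0.957
```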
Q2. Compute the average embeddings for S1 and S2, then their cosine similarity.
S1 = [trade, law, and, the, court, berlin, germany]. Window = 2.
(a) Find the 4 context words. Average their vectors. Is the result more similar to “court” or to “law”? (b) “Current” appears equally often near water words and time words in training. How does that affect its embedding?