NLP in Excel - An Introduction to Word Embeddings
Using Word2Vec and some basic math.
A Soft Introduction to Word Embeddings
Up front, I’d like to state that this post is really an introduction to basic concepts surrounding Large Language Models (LLMs) and Natural Language Processing (NLP), with a useful application of those concepts in Excel. If you’re already familiar with the basics or you just want to skip right on to the Excel part, you can find it here.
The LLMs that are saturating the market today are obviously incredibly useful tools. They can be used to solve a huge variety of difficult problems that require tough-to-implement logic, such as grammar correction or sentence autocompletion. However, prior to the widespread adoption of LLMs and AI models, plenty of other methods for natural language processing existed.
That’s why, when I was tasked with matching hundreds of courses (~300) to a list of CIP Codes (~30) based solely on course descriptions, I decided against using AI to complete the process. The idea that I would use a large language model, which has as many as 100,000 tokens in its vocabulary and 175 billion parameters, for something that I can run pretty easily on my laptop didn’t sit well with me; it just felt like using a sledgehammer to drive a nail. In fact, the world of computer science actually has several pre-AI methods for natural language processing, which is what I decided to explore when I was tasked with this little categorizing conundrum.
This post aims to be part tutorial, part educational adventure into the world of math and word embeddings. We’ll start with some simple background about my decisions behind this process, then move on to a basic vocabulary for some terms in the Natural Language Processing (NLP) space, then dive into the math and finally, the implementation of this process in Microsoft Excel. Feel free to jump around and just gather what you want.
Why Excel?
Of course, Excel isn’t my first choice for a process like this, but I’m limited by some constraints in my organization. Firstly, I’d like to build a process that lives beyond me, and most folks sitting at a desk can’t be expected to handle something written in Python. Secondly, it’s actually pretty educational to see the sort of visual aid that Excel naturally provides for the high dimensional representation of a word. But, more to come on that.
Word Embeddings and Vectors - A General Overview
Review: what is a vector?
In high school and some college math classes, it’s pretty common to represent a vector as something that has a direction in space and a value representing distance (called magnitude). That’s totally accurate. But for the sake of what we’re looking at today, a vector is literally just a list of numbers. Those numbers might happen to represent something like an arrow in space with a direction and magnitude, at lower dimensions (2D and 3D), but ultimately they are still just a list of numbers. That’s really it.
A “word embedding” is just a vector that has been associated with a word; in our case, the numbers that make up that vector are based on its meaning.
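If it helps to see that idea as data rather than notation, here is a minimal Python sketch. The numbers below are completely made up, just to show the shape of the thing:

```python
# A word "embedding" is literally just a list of numbers attached to a word.
# These numbers are invented for illustration; in practice a model like
# Word2Vec learns them from text so that they capture meaning.
embeddings = {
    "king":  [0.9, 0.7, 0.4],
    "queen": [0.9, 0.7, 0.8],
}

print(embeddings["king"])       # just a list of numbers
print(len(embeddings["king"]))  # the number of "dimensions" (3 here)
```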
Here’s a deceptively simple example of what it means to represent a word as a vector:
Let’s suppose for a second that we want to make word embeddings for three words. A classic pair used to showcase the power of word embeddings is “King” and “Queen”. In the scenarios below we’re going to increase the dimensions of these vectors step by step: from 1-D to 2-D to 3-D, and finally to the general N-D case.
My aim here is to provide both a simple mathematical understanding of what a “word embedding” really is, and a sort of physical intuition behind representing something as a vector.
The words we’ll be using are “King” (K), “Queen” (Q), and “Person” (P).
What similarity looks like in n-dimensions
1-Dimensional Embedding
If I asked you to tell me how “royal” or “regal” the word King was on a scale from -1 to 1, what would you say? You would probably give a pretty high number, like 0.9, or 1. Similarly, Queen would be a high number. What about the word Person? In the absence of any other data, it seems reasonable to shove the word Person directly in the center, with a 0, since it doesn’t really have a connotation about royalty or regality.

Some extra intuition-boosting notes.
Now we have a simple number line, and we can already get some physical intuition about what it means to add and subtract these words. Since Queen and King are both “royal” to our number line, subtracting the two should put us somewhere near Person, and indeed it does:

But that’s beside the point for now; it’s just a note that makes the embeddings easier to understand.
The important part is that we can now determine the similarity between these words (King, Person, and Queen) along this dimension. Suppose I drop a new word into the mix, and we want to know which existing word it is most like. Some good examples might be “palace” (we’ll assign it the variable \(P_a\)) or “dirt” (D). A palace is very regal (r = 0.8), but dirt isn’t (r = -0.5), which means that our number line representation pretty clearly shows us that, at least in terms of “royalty” or “regality”, “palace” belongs near King and Queen, while dirt is more similar to Person:

In this case, finding the distance between any two words is an obvious operation: look at how far apart they sit on the number line. Mathematically, that’s just the absolute value of their difference:
\[ d_{\text{king} \to \text{dirt}} = \vert K - D \vert = \vert 1 - (-0.5) \vert = 1.5 \]
\[ d_{\text{king} \to \text{palace}} = \vert K - P_a \vert = \vert 1 - 0.8 \vert = 0.2 \]
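If you’d rather see that check as code, here is the same arithmetic as a tiny Python sketch using the royalty scores from above:

```python
# 1-D "embeddings": a single royalty/regality score per word
royalty = {"king": 1.0, "queen": 1.0, "person": 0.0, "palace": 0.8, "dirt": -0.5}

# In one dimension, distance is just the absolute difference between the scores
print(round(abs(royalty["king"] - royalty["dirt"]), 2))    # 1.5
print(round(abs(royalty["king"] - royalty["palace"]), 2))  # 0.2 -> palace sits closer to king
```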
2-Dimensional Embedding
Sweet. Now let’s add another dimension into the mix. Remember that for our purposes, word embeddings (the vectors describing these words) are based around concrete concepts. Each number in a vector “embeds” some meaning, like royalty or (as in our next example) wealth.
The words King and Queen obviously convey wealth, while Person is once again neutral on the subject. We’ll actually throw another permanent word into the mix here to make things interesting: Serf (\(\vec{S}\)). Serfs were (are?) laborers who worked under land-owning lords in the feudal system. Although not as biting as “peasant,” the word still denotes poverty to some extent, so we’ll give it a value of -0.8 in the “wealth” dimension; serfs are also obviously the opposite of royal, so we’ll assign them r = -1 in the royalty dimension.
For illustrative purposes I’ll put these values into a dead simple table so that anyone who isn’t familiar with vector notation can see how it works.
|         | K   | Q   | P | S    |
|---------|-----|-----|---|------|
| Royalty | 1   | 1   | 0 | -1   |
| Wealth  | 0.8 | 0.8 | 0 | -0.8 |
With the more compact vector notation (we’ll be using this going forward), these look like:
\[ \vec{K} = \begin{bmatrix} 1 \\ 0.8 \end{bmatrix} \hspace{30px} \vec{Q} = \begin{bmatrix} 1 \\ 0.8 \end{bmatrix} \hspace{30px} \vec{P} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \hspace{30px} \vec{S} = \begin{bmatrix} -1 \\ -0.8 \end{bmatrix} \]
For anyone who’s a little less comfortable with this notation: these are really just coordinates, except instead of the dimensions (x, y) we have the dimensions (r, w). That representation might be more familiar; \(\vec{S}\), for example, is just the point (-1, -0.8).
And as you might expect, we can throw these on a 2-dimensional graph and start to see how these relationships work:

And once again we can use the distance between words to determine likeness along these dimensions. This time, however, we need to take a decidedly more quantitative approach, using the distance formula. You may remember that the formula for the distance between two points \( (x_1, y_1) \) and \( (x_2, y_2) \) on a graph like this is:
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
Or, in our case:
$$d = \sqrt{(r_2 - r_1)^2 + (w_2 - w_1)^2}$$
So if we want to determine the distance between, for example, Serf (-1, -0.8) and Person (0, 0), we calculate:
\[ d = \sqrt{(-1 - 0)^2 + (-0.8 - 0)^2} = \sqrt{1.64} \approx 1.28 \]
And if we want to determine the distance between Serf (-1, -0.8) and Queen (1, 0.8), we calculate:
\[ d = \sqrt{(-1 - 1)^2 + (-0.8 - 0.8)^2} = \sqrt{4 + 2.56} = \sqrt{6.56} \approx 2.56 \]
This tells us that Serf is more similar to Person than it is to Queen, which makes sense.
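For anyone who wants to double-check the arithmetic, here is the same 2-D calculation as a short Python sketch, with the vector values taken from the table above:

```python
import math

# (royalty, wealth) pairs from the table above
words = {
    "king":   (1.0, 0.8),
    "queen":  (1.0, 0.8),
    "person": (0.0, 0.0),
    "serf":   (-1.0, -0.8),
}

def distance_2d(a, b):
    """Euclidean distance between two (royalty, wealth) pairs."""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

print(round(distance_2d(words["serf"], words["person"]), 2))  # 1.28
print(round(distance_2d(words["serf"], words["queen"]), 2))   # 2.56
```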
At this point it’s also useful to begin thinking about how we would find the nearest word to any given word in our mix. One convenient (but not the most efficient) way to do this is to calculate every word’s distance from every other word and find the lowest value:
|   | K | Q | S | P |
|---|---|---|---|---|
| K | 0 | \(d_{\vec{Q} \to \vec{K}}\) | \(d_{\vec{S} \to \vec{K}}\) | \(d_{\vec{P} \to \vec{K}}\) |
| Q | \(d_{\vec{K} \to \vec{Q}}\) | 0 | \(d_{\vec{S} \to \vec{Q}}\) | \(d_{\vec{P} \to \vec{Q}}\) |
| S | \(d_{\vec{K} \to \vec{S}}\) | \(d_{\vec{Q} \to \vec{S}}\) | 0 | \(d_{\vec{P} \to \vec{S}}\) |
| P | \(d_{\vec{K} \to \vec{P}}\) | \(d_{\vec{Q} \to \vec{P}}\) | \(d_{\vec{S} \to \vec{P}}\) | 0 |
Or when calculated:

A quick note: this table is probably better called a “matrix.”
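Before moving on, here’s a rough sketch of how that brute-force distance matrix could be computed in Python. It leans on Python’s built-in math.dist (available in Python 3.8+), and the variable names are just illustrative:

```python
import math

# (royalty, wealth) vectors from the 2-D table above
words = {
    "K": (1.0, 0.8),
    "Q": (1.0, 0.8),
    "S": (-1.0, -0.8),
    "P": (0.0, 0.0),
}

# Every word's distance to every other word -- the "matrix" above
distance_matrix = {
    a: {b: round(math.dist(va, vb), 2) for b, vb in words.items()}
    for a, va in words.items()
}

# Note: K and Q have identical vectors in these two dimensions,
# so their distance comes out to 0.
for word, row in distance_matrix.items():
    print(word, row)
```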
3-Dimensional Embedding
Not much changes here that we weren’t already aware of, so I’ll just throw the “Famous” (f) dimension into the mix and once again visualize the results. This time we’ll also include the words “Actress” (A) and “Mailman” (M).
$$\vec{V} = \begin{bmatrix} r \\ w \\ f \end{bmatrix}$$
\[ \vec{K} = \begin{bmatrix} 1 \\ 0.8 \\ 0.6 \end{bmatrix} \hspace{30px} \vec{Q} = \begin{bmatrix} 1 \\ 0.8 \\ 0.9 \end{bmatrix} \]
\[ \vec{P} = \begin{bmatrix} 0 \\ 0 \\ -0.2 \end{bmatrix} \hspace{30px} \vec{S} = \begin{bmatrix} -1 \\ -0.8 \\ -0.7 \end{bmatrix} \]
\[ \vec{A} = \begin{bmatrix} 0.3 \\ 0.4 \\ 0.9 \end{bmatrix} \hspace{30px} \vec{M} = \begin{bmatrix} -0.5 \\ -0.4 \\ -0.3 \end{bmatrix} \]
So how quantitatively different are these words? We’ll use the distance formula for this determination as well. The distance formula in three dimensions is a simple extension of the 2-D formula:
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
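As a quick worked check using the vectors above, the distance between Queen and Actress comes out to roughly 0.81:

$$d_{\vec{Q} \to \vec{A}} = \sqrt{(0.3 - 1)^2 + (0.4 - 0.8)^2 + (0.9 - 0.9)^2} = \sqrt{0.65} \approx 0.81$$

Compare that with the distance from Queen to Mailman, which works out to about 2.26; along these three dimensions, Queen looks far more like Actress than like Mailman.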
[graph/animation will be here eventually]
N-Dimensional Embedding
In case you’re wondering what “n-dimensional” means here, it just means that we are looking at the general case of word embeddings with “n” dimensions. Our words might be represented with 10 dimensions, or 50, or 1,000; it really doesn’t matter. Everything I’m about to say applies to any number “n” that you can plug in.
This is the important part, but it’s also where our intuition begins to fail us and we just need to trust the math. You can assign any number of traits to the words we’re throwing into the mix, and each one adds a dimension. So how do we measure the distance between words in an n-dimensional space? The distance formula actually applies to any number of dimensions, so we can still use it even when we have a huge number of them. Expressed generally, the formula for the distance between two points is: \(d(P_1, P_2) = \sqrt{(x_{2,1} - x_{1,1})^2 + (x_{2,2} - x_{1,2})^2 + \ldots + (x_{2,n} - x_{1,n})^2}\)
This can be more compactly written using summation notation as:
$$d(P_1, P_2) = \sqrt{\sum_{i=1}^{n} (x_{2,i} - x_{1,i})^2}$$
Nearest Neighbor Search
We actually performed a nearest neighbor search in the 2D example above, but we didn’t call it that. There are plenty of fancy algorithms that make nearest-neighbor searches much more efficient, but since we intend to work with a relatively small dataset, brute force works just fine.
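Here is roughly what that brute-force search could look like in Python. This is just a sketch using the toy 3-D vectors from earlier, and the function name nearest_neighbor is my own invention, not something from a library:

```python
import math

def nearest_neighbor(query_vector, embeddings):
    """Brute force: measure the distance to every word and keep the closest one."""
    best_word, best_distance = None, float("inf")
    for word, vector in embeddings.items():
        d = math.dist(query_vector, vector)  # works in any number of dimensions
        if d < best_distance:
            best_word, best_distance = word, d
    return best_word, best_distance

# 3-D (royalty, wealth, fame) vectors from the section above
embeddings = {
    "king":    (1.0, 0.8, 0.6),
    "queen":   (1.0, 0.8, 0.9),
    "person":  (0.0, 0.0, -0.2),
    "serf":    (-1.0, -0.8, -0.7),
    "actress": (0.3, 0.4, 0.9),
}

# Querying with Mailman's vector (-0.5, -0.4, -0.3) against the other words
print(nearest_neighbor((-0.5, -0.4, -0.3), embeddings))  # ('person', 0.648...)
```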
There is actually an equally important way to compare two words in a vector space, called cosine similarity, which measures the angle between the two vectors rather than the distance between them. For our use case it would serve the same purpose despite relying on a different calculation; I’ve opted to use only the distance formula here, since algebra comes more easily than trigonometry to most people.
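For reference, the cosine similarity between two vectors \(\vec{A}\) and \(\vec{B}\) is just their dot product divided by the product of their lengths; a value near 1 means the vectors point in nearly the same direction:

$$\text{cosine similarity} = \cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{\lVert \vec{A} \rVert \, \lVert \vec{B} \rVert}$$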
Adapting for sentences
At this point we have some level of intuition about what it means to “add and subtract” words that are represented as vectors. If we add the word “King” to the word “Serf,” we would get something close to the word “Person” since the negative components of Serf (negative wealth, negative royalty, negative fame) would cancel out aspects of King (which has positive wealth, royalty, and fame).
To represent a whole sentence as a vector, we can add all of the word vectors in the sentence together and then divide by the number of words in the sentence; essentially, we just take the average of the words in the sentence. We’ll call this a “sentence embedding” going forward, and we can actually use it to match sentences with similar meanings to one another.
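In code, that averaging step only takes a few lines. This is a bare-bones sketch; a real pipeline would also deal with punctuation, casing, and words that aren’t in the vocabulary:

```python
def sentence_embedding(sentence, embeddings):
    """Average the vectors of the words we actually have embeddings for."""
    vectors = [embeddings[word] for word in sentence.lower().split() if word in embeddings]
    dims = len(vectors[0])
    # Add the vectors component-by-component, then divide by the number of words
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dims)]

# Toy 2-D vectors, invented for illustration
toy = {"royal": [1.0, 0.8], "person": [0.0, 0.0]}
print(sentence_embedding("a royal person", toy))  # [0.5, 0.4]
```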
Using Nearest-Neighbor to Categorize
Finally, we have all the tools we theoretically need to run a query that categorizes something. Suppose we have word embeddings for the terms “Happiness” and “Sadness”, and we want to know if a given sentence is happy or sad: “I’m absolutely thrilled with the results!” Given the word embeddings for “absolutely,” “thrilled,” and “results,” we can add those vectors together and divide by 3 (in this case I’m removing unimportant words; not all implementations do this). Then, performing a nearest-neighbor search, we will find that our sentence is a happy one.
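Here is what that end-to-end categorization might look like as one small Python sketch. Every vector below is invented purely for illustration (real ones would come from a trained model like Word2Vec), and the two “dimensions” are hypothetical:

```python
import math

# Made-up 2-D vectors, e.g. (positivity, energy) -- purely illustrative
embeddings = {
    "happiness":  (0.9, 0.7),
    "sadness":    (-0.9, -0.6),
    "absolutely": (0.3, 0.5),
    "thrilled":   (0.9, 0.9),
    "results":    (0.1, 0.2),
}
categories = ["happiness", "sadness"]

# Sentence embedding: average of the (important) word vectors
words = ["absolutely", "thrilled", "results"]
sentence_vec = [sum(embeddings[w][i] for w in words) / len(words) for i in range(2)]

# Nearest-neighbor search against the category embeddings
best = min(categories, key=lambda c: math.dist(sentence_vec, embeddings[c]))
print(best)  # "happiness"
```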
Implementing in Excel
I’m not finished :)