Hello there! Today we are going to talk about positional embeddings.

You have probably heard of them, because wherever you see a transformer neural network, you also see positional embeddings attached to it.

And because transformers are now literally everywhere in machine learning, irrespective of the data type, you have likely come across positional embeddings already.

But what are positional embeddings, why do we need them in transformers in the first place, and how do they work? Well, today we are going to lift the mystery surrounding positional embeddings.

Transformers are becoming more and more important in machine learning and have proven to work very well on any kind of data, especially when there is a lot of it to pretrain on through self-supervision.

A big part of the data that we are interested in comes in a specific order.

If we change this order, the meaning of the input might also change.

The transformer does not process the input in order, sequentially, but in parallel.

For each element, it combines information from the other elements through self-attention, but each element does this aggregation on its own, independently of what the other elements do or have done so far.

Because the transformer architecture per se does not model the order of the input anywhere, we must encode the order of the input explicitly, so that the transformer knows that one piece comes after the other and not in any other permutation.

This is where positional embeddings come in: they are a sort of identifier, an obvious hint for the transformer, encoding the whereabouts of a piece of the input within the sequence.

These embeddings are then added to the initial vector representation of the input.
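In code, this addition is nothing more than an element-wise sum of two matrices of the same shape. Here is a minimal NumPy sketch; the embedding values are random placeholders, not those of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8  # 4 tokens, 8-dimensional embeddings

# Hypothetical token embeddings, e.g. for "the queen said hello".
token_embeddings = rng.normal(size=(seq_len, d_model))

# One positional embedding per position; how these are actually
# constructed is the topic of the rest of this video.
positional_embeddings = rng.normal(size=(seq_len, d_model))

# The transformer input is simply the element-wise sum.
transformer_input = token_embeddings + positional_embeddings

print(transformer_input.shape)  # (4, 8)
```

The same token now gets a different final vector depending on where in the sequence it appears, which is exactly the hint we want to give the transformer.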

This addition can be intuitively understood as: Look! The initial vector representation of the word “queen” is here in the multidimensional space.

It is close to “king” for distributional reasons: “king” and “queen” often occur in the same contexts.

But then, we add this positional embedding identifying the order of the word.

By this we say: “queen” should move a bit along this dimension, towards this specific part of the space, to cluster with all the other first tokens of any other sequence.

This space and dimension we do not really have in this 2D drawing, but you get the idea: by adding these vectors, we move the initial representations in the direction of the other first tokens, or the second tokens, and so on.

At least, this is how Ms. Coffee Bean pictures it, and you might ask: but how can this little multi-dimensional shifting help the transformer? Well, remember that we are working with a multi-dimensional, spurious-correlation-identifying beast that is a neural network, so why should these systematic shifts not be enough of a hint for it? In short, positional embeddings are order or position identifiers added to the input vectors so that the transformer knows the order of the sequence.

These positional embeddings must fulfil some requirements. First, every position should have the same identifier, irrespective of the sequence length or of what exactly the input is.

So while the sequence might change, the positional embedding stays the same.

Secondly, because the embeddings push the original vectors a bit into, let’s say… the “first token club”, they should not be too large; otherwise they push the vectors into very, very distinct subspaces, where positional similarity or dissimilarity overshadows semantic similarity.

Okay, then let’s see the solution to this problem as presented in the “Attention Is All You Need” paper, which introduced the transformer.

Their choice of positional embeddings is not the easiest one and is best understood if you have some knowledge of Fourier analysis.

But let’s forget about that and explain it simply: You could think of the sequence as words coming in one after the other.

So the first idea is to take a non-periodic function and number your sequence from 1 to the total number of tokens you have.

But with this recipe, we violate the second requirement of positional embeddings, where we said that shifts should be small, in other words: bounded.

Then let’s choose a function that is bounded in the values it can take, like the sine or cosine, which periodically return to values between -1 and 1.

Sines and cosines have the upside that they are defined up to infinity, so even for enormous sequence lengths, we would still get values between -1 and 1 as elements of our positional embeddings.

Compared to a sigmoid, which is also bounded but saturates, sines and cosines keep a lot of variability even for big numbers.

Okay, so let’s choose a sine function to model this.

But we see that with this periodic function, the same value would repeat for different positions.

Let’s suppose for a moment that we do not want that, and give our sine function such a low frequency that even for our biggest sequence length, the values would not repeat.

With this, we almost have the same situation as with our linear function, because we have numbered our sequence in order, but this time with unique values between 0 and 1.
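A tiny experiment makes this concrete. Assuming, for illustration, a maximum length of 100 tokens, we can pick a frequency so low that the sine never leaves its first quarter-period: every position gets a unique, bounded value, but consecutive positions barely differ.

```python
import numpy as np

max_len = 100                 # largest sequence length we expect
positions = np.arange(max_len)

# A frequency so low that one quarter-period covers the whole sequence:
# the sine rises monotonically, so no position's value repeats.
freq = (np.pi / 2) / max_len
values = np.sin(freq * positions)

# Unique and bounded between 0 and 1...
assert len(np.unique(values)) == max_len

# ...but neighbouring positions are almost indistinguishable:
print(values[1] - values[0])  # roughly 0.016
```

This is exactly the problem described next: the positional signal exists, but it is tiny compared to the semantic content of the vectors.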

This is still not optimal because we remember, our positional embeddings push the input a bit, to create “positional clubs”.

But we also do not want to push by too much, because otherwise distances between points become dominated by the position and not the semantics.

But we also do not want the differences between, say, the first and the second token to be too small, because then the semantic signal would overshadow the positional signal, whose values now differ very little between consecutive tokens.

So what to do? If one value in one dimension is too small of a signal, let’s make our intent more obvious by using all the other dimensions too! For the next dimension, we could use a cosine, which would look like this.

Yet we still have the problem that the embedding values of one token and the next do not differ by much.

But if we increase the frequency of the cosine, the values differ more, because the slope has increased! Okay, so for this second dimension, let’s use a cosine with increased frequency to create more differentiated values.

The first and second dimensions together uniquely differentiate the tokens from one another, and now also with values that separate consecutive tokens more.

And if we do this for every dimension, meaning that we alternate between sines and cosines of increasing frequency, we give enough information to ensure that the transformer cannot miss the order of the sequence.
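Putting this recipe together yields the sinusoidal embeddings from the paper: dimension pair i uses the frequency 1 / 10000^(2i / d_model), so frequency decreases along the dimensions. Here is a NumPy sketch; the formula and the base of 10000 are from the paper, while the function name and the example shapes are my own:

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional embeddings ("Attention Is All You Need"):
        PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Early dimensions have high frequency (they change a lot between
    neighbouring positions); late dimensions have low frequency.
    """
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sines
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosines
    return pe

pe = sinusoidal_positional_embeddings(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Every entry stays between -1 and 1, no matter how long the sequence gets, so the boundedness requirement from earlier is satisfied by construction.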

So, to recapitulate what we did: the last dimension, the low-frequency one, tells us broadly where the token is, like whether it is more at the beginning or more toward the end of the sequence.

But because its values change so slowly, we do not have enough resolution to say whether it is exactly the first or the second token.

Then the second to last dimension has higher frequency, thus higher resolution.

Because we already know that we are at the beginning of the sequence, we can pin down with more accuracy where we are exactly.

The next dimensions pin down with even more acuity where we are in the sequence.
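We can check this multi-resolution intuition numerically: under the sinusoidal recipe, two consecutive positions end up much further apart in the high-frequency dimensions than in the low-frequency ones. The base of 10000 follows the paper; the positions chosen here are arbitrary examples.

```python
import numpy as np

d_model, base = 16, 10000.0
pos_a, pos_b = 3, 4                 # two consecutive positions

# One frequency per sine/cosine pair, decreasing along the dimensions.
dims = np.arange(0, d_model, 2)
freqs = 1.0 / base ** (dims / d_model)

# How far apart the two positions are, per frequency.
diff = np.abs(np.sin(pos_b * freqs) - np.sin(pos_a * freqs))

# The high-frequency (fine-resolution) dimension separates the
# neighbours far more than the low-frequency (coarse) one.
print(diff[0] > diff[-1])  # True
```

So the coarse dimensions locate the token roughly, and the fine ones disambiguate between immediate neighbours, just as described above.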

Well, this was the sine and cosine way of encoding position.

But this is only one way to do it.

This sine and cosine embedding seems to work well on text, but what if you have other types of data? Like graphs.

Or images! Let us know, would you be interested in Relative Position Representations and / or learned positional embeddings? If yes, write us in the comments and I could convince Ms. Coffee Bean to cover this in one of our very next videos! Or let us know in the comments about any other exotic or not-so exotic positional embeddings you would like to know about.

Ms. Coffee Bean, why are you poking me? Aaa, thanks for reminding me! We now have a Patreon page and Ko-Fi page! So if you want and have the means to support us, now you can! We would greatly appreciate it! But if you can’t, it’s fine, it does not change anything.

Our videos will stay free to watch for everyone! Okay bye!