Transformer Explainer: Pocket Edition

(Starting Note: This is an abridged/‘cheat sheet’ version of a much, much longer transformer explainer. If you want to read the long version, you can find it here!)


GPT-4 predicts which token (roughly, a word or word fragment) will follow a sequence of text. How does a GPT model work?
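Before getting to the architecture, here is a minimal sketch of what "predicting what comes next" means at the very end of the model. It assumes the model has already produced one raw score (a logit) per vocabulary entry. The four-word vocabulary, the scores, and the function names are all made up for illustration; a real GPT model has a vocabulary of tens of thousands of tokens and computes the scores with its learned parameters.

```python
import math

# Hypothetical toy vocabulary; a real GPT vocabulary is far larger.
VOCAB = ["mat", "dog", "moon", "sofa"]

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(logits, vocab=VOCAB):
    """Greedy decoding: return the most probable entry and its probability."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Made-up scores standing in for the model's output on "The cat sat on the"
toy_logits = [3.2, 0.1, -1.0, 2.5]
token, prob = predict_next(toy_logits)
print(token, round(prob, 3))
```

Picking the single most probable entry (greedy decoding) is only one option; in practice, text is often generated by sampling from the probabilities instead.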

If you search for “GPT model”, you will likely find a graphic that looks like this:

Classic Transformer Graphic

This graphic gives a nice bird’s-eye view of a GPT model. But while it shows what steps happen in a GPT model, it doesn’t show how each step works.

To clarify what is actually going on at each step of the GPT graphic, I made an annotated version, pictured below. I hope it can be of help to anyone confused about the ‘specifics’ of the GPT model architecture (e.g. my past self!)

(I don’t provide much detail on the notation, parameters, and operations I refer to in the graphic below, so don’t worry if things seem confusing at first! For more explanation, please see the much longer version of this explainer. :)

Grace’s Annotated Transformer Graphic

I hope this graphic was helpful! As mentioned previously, you can also find a more detailed version of this explainer here.