This is what is different between word2vec/skip-gram models and transformers: static vs. dynamic (contextual) embeddings. The former generates one embedding per word, regardless of the context it appears in. The latter generates a dynamic embedding for a word, since attention over the surrounding tokens is part of the encoder that produces the embeddings.
In more detail:
In word2vec, there is only one vector for 'bank': effectively a weighted average of 'bank' the financial institution and 'bank' the thing next to a river.
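For a concrete picture, here is a minimal sketch using the gensim library (assuming a gensim 4.x `Word2Vec` API; the toy corpus and hyperparameters are purely illustrative). The trained model is just a lookup table, so asking for 'bank' returns the same vector no matter which sentence it came from.

```python
from gensim.models import Word2Vec

# Two toy sentences where "bank" means different things.
sentences = [
    ["i", "deposited", "cash", "at", "the", "bank"],
    ["we", "sat", "on", "the", "river", "bank"],
]

# Train a tiny skip-gram model (sg=1) on the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Static embedding: a plain lookup, independent of the sentence.
vec = model.wv["bank"]
print(vec.shape)  # (50,) -- the one and only vector for "bank"
```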
Q/A
Q: The input embedding matrix of a transformer holds static embeddings, just like word2vec. How can I then get the contextual embedding of a specific word in a specific sentence?
A: Static embeddings use context for training and a lookup table for inference. Contextual embeddings use context in both training and inference.
The initial embedding layer still uses static embeddings, but the self-attention mechanism then produces a context-aware embedding for the word by looking at all the other words in the sentence. So when we have a different sentence using the same word, the dynamic embedding changes, since the attention values for that word will be different in different sentences. A sketch of extracting such an embedding follows below.
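Here is a minimal sketch of that idea, assuming the Hugging Face transformers library and a BERT-style encoder (the model name and the helper function are illustrative choices, not the only way to do this). We run "bank" through the encoder in two different sentences and read its row of the last hidden state; the two vectors differ because attention mixed in different contexts.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Tokenize and run the encoder; last_hidden_state holds one
    # context-aware vector per token in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Find the position of the token "bank" in this sentence.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    return outputs.last_hidden_state[0, idx]

v_money = bank_vector("I deposited cash at the bank.")
v_river = bank_vector("We had a picnic on the river bank.")

# Not identical: the contextual embedding depends on the sentence.
cos = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(cos.item())
```

With a static lookup table the two calls would return the exact same vector; here the cosine similarity is well below 1 because each occurrence of "bank" attends to different surrounding words.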