In encoder-decoder transformers, the TTS alignment is learned in certain cross-attention heads of the decoder, while in decoder-only models, the alignment is learned in the self-attention layers.
Good point of difference between encoder-decoder and decoder-only models.
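To make the difference concrete, here is a minimal sketch of where one would look for the alignment in each case. Everything in it is an assumption for illustration: the function names are hypothetical, the attention matrices are toy data rather than model outputs, the decoder-only input is assumed to be laid out as text tokens followed by speech tokens, and taking a per-row argmax is a simplification of how an alignment would actually be read off a chosen head.

```python
import numpy as np

def alignment_from_cross_attention(cross_attn):
    """cross_attn: (n_speech, n_text) weights from one cross-attention head.
    Returns, for each speech step, the index of the most attended text token."""
    return cross_attn.argmax(axis=-1)

def alignment_from_self_attention(self_attn, n_text):
    """self_attn: (n_text + n_speech, n_text + n_speech) causal self-attention
    weights, assuming the sequence is [text tokens, speech tokens].
    The alignment sits in the block of speech rows and text columns."""
    speech_to_text = self_attn[n_text:, :n_text]
    return speech_to_text.argmax(axis=-1)

# Toy example: 3 text tokens, 6 speech steps, 2 speech steps per text token.
n_text, n_speech = 3, 6
true_alignment = np.array([0, 0, 1, 1, 2, 2])

# Encoder-decoder case: each row is a distribution over the text tokens.
cross = np.full((n_speech, n_text), 0.1)
cross[np.arange(n_speech), true_alignment] = 0.8

# Decoder-only case: one causal matrix over the concatenated sequence.
total = n_text + n_speech
self_attn = np.zeros((total, total))
# Text rows attend uniformly over earlier text tokens (causal).
self_attn[:n_text, :n_text] = np.tril(np.ones((n_text, n_text))) / np.arange(1, n_text + 1)[:, None]
# Speech rows put half their mass on text (the alignment block) and half on themselves.
self_attn[n_text:, :n_text] = 0.5 * cross
self_attn[np.arange(n_text, total), np.arange(n_text, total)] = 0.5

cross_alignment = alignment_from_cross_attention(cross)
self_alignment = alignment_from_self_attention(self_attn, n_text)
print(cross_alignment.tolist())
print(self_alignment.tolist())
```

In the encoder-decoder case the whole matrix is the alignment candidate, whereas in the decoder-only case it has to be sliced out of a larger self-attention matrix that also carries text-to-text and speech-to-speech attention.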