I often want it in a different format than the one it was originally published in (audio → text, text → audio, PDF → ebook). Automated conversion works but is cumbersome. Listening to text articles requires sending them to a special app, and converting articles to ebooks is annoying and loses a lot of formatting and navigation.
This is a fascinating use case, and one I run into myself. As someone with a kinesthetic learning style, I often struggle to absorb text without audio, and vice versa.
The fragmentation of existing systems is a complex issue, owing to business and technical considerations briefly discussed here. (Meta comment: why is a reference not a first-class semantic construct within hypothes.is?)
Nonetheless, annotation of a fused audio/text stream seems like a really interesting product offering.