To check I understand the requirements here, you want to be able to index a conversation thread (annotation + all replies) as one ES document, and then in response to a query, return a data structure which contains the IDs of matching conversations plus the IDs of matching items (annotation or original reply) within those conversations?
So this is essentially the same problem as, say, finding out which page matched if you were indexing multi-page documents?
Presumably ES can store position information with indexed terms. In that case, here is one possible approach: take all of the original items in the thread and serialize them into a single string, which is indexed with positional information. Separately, the offset of each item within that string is stored in a non-indexed field.
e.g.:
"content" field: annotation content | first reply | second reply
"offsets" field: <first reply ID>:<offset of first reply>,<second reply ID>:<offset of second reply>
When a search query is received, an ES query is performed to find the matching documents (threads) and get the offsets of the matches within the "content" field. Each match offset is then looked up in the "offsets" field to find which item it falls inside, which gives the ID of the matching annotation or reply within the thread.
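To make the serialization and lookup steps concrete, here is a rough Python sketch. It only covers building the "content" and "offsets" values and mapping a match offset back to an item ID; how the match offsets are actually obtained from ES is left out. The names `serialize_thread`, `item_at` and `SEPARATOR` are made up for illustration, and unlike the example above it also records the annotation itself at offset 0 so that every match maps to some item.

```python
from bisect import bisect_right

# Hypothetical separator between items; any string works as long as
# its length is accounted for when computing offsets.
SEPARATOR = " | "


def serialize_thread(items):
    """Join a thread into one string and record where each item starts.

    `items` is a list of (item_id, text) pairs, annotation first,
    followed by its replies in order.
    Returns (content, offsets), where offsets is a list of
    (item_id, start_offset) pairs sorted by start_offset.
    """
    offsets = []
    pos = 0
    for index, (item_id, text) in enumerate(items):
        if index > 0:
            pos += len(SEPARATOR)
        offsets.append((item_id, pos))
        pos += len(text)
    content = SEPARATOR.join(text for _, text in items)
    return content, offsets


def item_at(offsets, match_offset):
    """Return the ID of the item whose span contains `match_offset`."""
    if match_offset < 0:
        raise ValueError("match_offset must be non-negative")
    starts = [start for _, start in offsets]
    # Last item whose start is <= match_offset.
    index = bisect_right(starts, match_offset) - 1
    return offsets[index][0]


thread = [
    ("annotation-1", "annotation content"),
    ("reply-1", "first reply"),
    ("reply-2", "second reply"),
]
content, offsets = serialize_thread(thread)
matched_id = item_at(offsets, content.index("second"))
```

Here `content.index("second")` stands in for a match offset reported by the search backend. Keeping the offsets sorted means each lookup is a binary search, so a thread with many replies stays cheap to resolve.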