I found this study very interesting, and despite my limited knowledge of pLMs and conformal statistics, I have a few comments on the results in section 3.1. Perhaps they can serve as a data point on how non-experts engage with the paper. Please feel free to take or leave any of my suggestions/remarks.
I really like the approach of establishing conformal guarantees for all the reasons stated in the introduction. I especially liked the generality with which the application of conformal statistics to this problem is presented, and that the authors made clear that an explicit "non-goal" of the study was to demo a new machine learning model for enzyme classification.
While reading, I kept thinking about the fact that members of a Pfam domain do not necessarily share the same biochemical function. After all, less than 0.1% of protein functional annotations are linked to experimental evidence (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9374478/), and the rest, the overwhelming majority, are annotated transitively based on some kind of similarity score.
With that in mind, I think the authors could be more explicit that the ground truth upon which their terms FP, TP, and FDR are defined is itself a proxy for shared function. I don't believe this detracts from the results of the paper at all, but pointing out these assumptions would increase the trust of readers who question what you mean by terms like conformal "guarantees" and "true" positives. My apologies if you already explained this somewhere and I missed it.
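To make that concrete, my (possibly incorrect) reading is that the quantity being controlled is roughly

$$\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{\mathrm{FP}}{\max(\mathrm{FP}+\mathrm{TP},\,1)}\right],$$

where a retrieved hit only counts toward TP if it carries the same Pfam-derived label as the query. If so, the guarantee is with respect to the Pfam annotation rather than experimentally verified function, which is exactly the assumption I'd suggest stating up front.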
Since JCVI Syn3.0 was published in 2016, it would be interesting to see whether the traditional search methods (BLAST & HMMSearch) still yield 20% unknown function, or whether our annotations have since improved.
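On the off chance it's useful, here is a rough sketch of how one might check the HMM side of this; the file names are placeholders, and it assumes a local HMMER3 `hmmsearch` binary plus a current Pfam-A.hmm library (it does not cover the BLAST portion of the original annotation pipeline):

```python
import subprocess

# Placeholder inputs: a FASTA of JCVI Syn3.0 protein sequences and a local Pfam-A HMM library.
PROTEINS = "syn3_proteins.fasta"
PFAM_HMMS = "Pfam-A.hmm"
TBLOUT = "syn3_vs_pfam.tbl"

# Search all Pfam profiles against the Syn3.0 proteins, using Pfam's gathering thresholds.
subprocess.run(
    ["hmmsearch", "--cut_ga", "--tblout", TBLOUT, PFAM_HMMS, PROTEINS],
    check=True,
)

# Collect proteins with at least one Pfam hit (column 1 of --tblout is the
# target sequence name; lines starting with '#' are comments).
hit_proteins = set()
with open(TBLOUT) as fh:
    for line in fh:
        if not line.startswith("#"):
            hit_proteins.add(line.split()[0])

# Count all query proteins from the FASTA headers.
all_proteins = set()
with open(PROTEINS) as fh:
    for line in fh:
        if line.startswith(">"):
            all_proteins.add(line[1:].split()[0])

unannotated = all_proteins - hit_proteins
print(f"{len(unannotated)}/{len(all_proteins)} proteins "
      f"({100 * len(unannotated) / len(all_proteins):.1f}%) have no Pfam hit")
```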
It would also be interesting to see whether the Protein-Vec hits in the Syn3.0 case study that don't exceed lambda are systematically "worse" than the true positives, for example as measured by TM-score.
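Concretely, one could compare the TM-score distributions on either side of the threshold. The sketch below assumes TM-scores between query and hit structures have already been computed (e.g. with TM-align) and collected into a table; the file name, column names, and lambda value are all made up for illustration:

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical table of Protein-Vec hits for the Syn3.0 queries: one row per
# (query, hit) pair with its Protein-Vec score and a precomputed TM-score.
hits = pd.read_csv("syn3_protein_vec_hits.csv")  # columns: query, hit, score, tm_score

LAMBDA = 0.85  # placeholder; use the calibrated lambda from the paper

above = hits.loc[hits["score"] >= LAMBDA, "tm_score"]
below = hits.loc[hits["score"] < LAMBDA, "tm_score"]

# One-sided Mann-Whitney U test: are TM-scores of sub-threshold hits
# systematically lower than those of hits passing lambda?
stat, pval = mannwhitneyu(below, above, alternative="less")

print(f"median TM-score above lambda: {above.median():.2f}")
print(f"median TM-score below lambda: {below.median():.2f}")
print(f"Mann-Whitney U (below < above): U={stat:.0f}, p={pval:.2g}")
```

Plotting the two distributions would probably be even more informative than a single test, but either would address the question.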
Thanks again for putting out this interesting study.