Across network architectures and complexity levels, scenario classification is only moderately accurate and strongly asymmetric as revealed by the confusion matrix.
To some extent, this reminds me of the problem highlighted by Louca and Pennell (2020): for any given phylogeny of extant species only, there are large "congruence classes" of disparate diversification scenarios that could have produced the observed tree. Could it be that neural networks still face this problem, just as the more classical likelihood methods do? Perhaps GNNs and the architectures you outline here could excel at inferring more localized (e.g. branch-specific or tree-heterogeneous) diversification dynamics, rather than at estimating tree-wide parameters under a global model?