We found that using MINE directly gave identical performance when the task was nontrivial, but became very unstable if the target was easy to predict from the context (e.g., when predicting a single step in the future and the target overlaps with the context).
All content that points to important caveats and gotchas I should consider when leaning too heavily on the results of this paper.
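To make the caveat above concrete, here is a minimal sketch comparing the two estimators on a toy critic. It is an illustration only, not the paper's experiment or its explanation of the instability. The assumptions: the learned critic is replaced by a fixed score matrix `scores[i, j]` (the score of context `i` paired with target `j`, positives on the diagonal), MINE is represented by its Donsker-Varadhan bound, and the function names `infonce_bound` and `dv_bound` are hypothetical. The toy case "positives score `s`, negatives score 0" stands in for a target that is easy to predict from the context.

```python
import numpy as np


def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis) if axis is not None else out.item()


def infonce_bound(scores):
    """InfoNCE estimate: log N + mean_i [ f_ii - logsumexp_j f_ij ]."""
    n = scores.shape[0]
    log_softmax_diag = np.diag(scores) - logsumexp(scores, axis=1)
    return np.log(n) + log_softmax_diag.mean()


def dv_bound(scores):
    """Donsker-Varadhan estimate (the bound MINE optimizes):
    mean of positive scores minus log-mean-exp of negative scores."""
    n = scores.shape[0]
    positives = np.diag(scores).mean()
    negatives = scores[~np.eye(n, dtype=bool)]  # off-diagonal pairs
    log_mean_exp_neg = logsumexp(negatives) - np.log(negatives.size)
    return positives - log_mean_exp_neg


if __name__ == "__main__":
    n = 8
    for s in (1.0, 10.0, 100.0):
        # Toy "easy" task: every positive pair scores s, every negative 0.
        scores = s * np.eye(n)
        print(f"s={s:6.1f}  InfoNCE={infonce_bound(scores):.4f}  "
              f"DV={dv_bound(scores):.4f}  log N={np.log(n):.4f}")
```

In this toy setting the InfoNCE value saturates at `log N` no matter how large `s` becomes, while the Donsker-Varadhan value equals `s` and keeps growing. That unboundedness is one plausible ingredient of the instability reported for easy targets, but the quoted sentence does not itself give a cause, so treat this as a reading aid rather than a finding.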