In recent studies on GUIs, I collected data from over 100 crowdworkers. A recurring issue was that some of the data appeared to come from participants who were not performing the tasks diligently.
Using humans to test GUIs can cause issues because not all participants will perform the tasks at the same level. It's important to understand what sample size is needed to collect a sufficiently accurate data set. It will be interesting to see how AI can make this kind of testing not only more accurate but also faster.
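
On the sample-size point, here is a rough sketch of one way to estimate how many participants to recruit. It assumes the measure of interest is a simple proportion (for example, a task success rate) and uses the standard normal-approximation formula n = z^2 * p * (1 - p) / e^2, then inflates the result to allow for responses that get discarded as non-diligent. The function name `required_sample_size`, its parameters, and the 20% discard rate in the usage lines are illustrative assumptions, not figures from the studies mentioned above.

```python
import math
from statistics import NormalDist


def required_sample_size(margin_of_error, confidence=0.95,
                         expected_rate=0.5, discard_rate=0.0):
    """Estimate how many participants to recruit for a proportion estimate.

    margin_of_error: acceptable half-width of the confidence interval (e.g. 0.1)
    confidence:      two-sided confidence level (e.g. 0.95)
    expected_rate:   guessed proportion; 0.5 is the most conservative choice
    discard_rate:    assumed fraction of responses dropped as non-diligent
    """
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    if not 0 <= discard_rate < 1:
        raise ValueError("discard_rate must be in [0, 1)")

    # z-score for the requested two-sided confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # Valid responses needed under the normal approximation
    n_valid = (z ** 2) * expected_rate * (1 - expected_rate) / margin_of_error ** 2

    # Recruit extra participants to offset the expected discards
    return math.ceil(n_valid / (1 - discard_rate))


print(required_sample_size(0.1))                    # 97
print(required_sample_size(0.1, discard_rate=0.2))  # 121
```

Under these assumptions, a +/-10% margin at 95% confidence needs about 97 usable responses, or about 121 recruits if roughly one in five responses ends up being discarded. Other measures, such as task completion time, would call for a different calculation.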