This check is required for cases where the number of elements in an array is not evenly divisible by the thread block size, and as a result, the number of threads launched by the kernel is larger than the array size.
This is confusing: how the hell can the number of THREADS be greater than the number of elements?
I mean, unless there is at least one thread per element, this would make no sense. And even then, the setup itself seems absurd: why would we need one thread PER ELEMENT at all?