I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X.
np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that doesn't work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
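To make the failure mode concrete, here is a minimal session showing why membership testing falls over (NumPy's `in` is implemented via elementwise equality, and NaN compares unequal to itself):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# Membership testing relies on ==, and NaN is never equal to itself:
print(np.nan == np.nan)   # False
print(np.nan in x)        # False, even though x contains a NaN

# np.isnan finds it, but allocates a boolean mask the same shape as x:
print(np.isnan(x).any())  # True
```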
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:

In : %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In : %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop

Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
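The trick works because NaN propagates through floating-point addition: if any element is NaN, the total is NaN, so one scalar isnan check on the reduction suffices and no boolean mask is ever allocated. A minimal sketch (`contains_nan` is my name for it, not a NumPy function):

```python
import numpy as np

def contains_nan(x):
    """Detect NaN in constant memory: NaN propagates through np.sum,
    so the array holds a NaN iff the total is NaN."""
    return np.isnan(np.sum(x))

x = np.random.rand(1000)
print(contains_nan(x))   # False
x[500] = np.nan
print(contains_nan(x))   # True
```

Note the same reasoning applies to np.min, since min also propagates NaN; the two differ only in speed.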
edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:

In : x = np.random.rand(100000)
In : %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In : %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In : x[50000] = np.nan
In : %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In : %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In : x[0] = np.nan
In : %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In : %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop