Tuesday, November 3, 2015

Time your assumptions

This short post is sort of a reminder to myself not to get too lazy about assuming stuff.

Suppose you have a dictionary in Python. Millions of them actually, coming as a stream. You need to filter out dictionaries that have one or more of their values equal to None. So you start with the obvious

# dicts is a list of dictionaries
for d in dicts:
    if None in d.values():
        continue
    process(d)

Then you realize that d.values() returns list (or list-like in Python 3), and membership test in Python list costs you O(n) complexity, and that thought additionally backed up by the fact, that majority of the dictionaries in the stream do not have None in them, i.e. most of the time the code will scan and compare the whole list of values of every dictionary.

The next obvious step is to use the powers of set:

# dicts is a list of dictionaries
for d in dicts:
    if None in set(d.values()):
        continue
    process(d)

But then, some quiet bird in the back of your head keeps singing "Is it really worth it"? And then you slap your lob forehead - of course not! To build set one needs to scan the list anyway and that's already O(n); on top of that you need to build the set data structure itself and invoke membership testing, etc... Anyway, given a list of unknown values one needs to read each of them at least once for a membership testing.

In fact, micro-benchmarking shows exactly this

$ python -mtimeit "None in [1,2,3,4,5,6,7,8,9,10]"
1000000 loops, best of 3: 0.239 usec per loop
$ python -mtimeit "None in set([1,2,3,4,5,6,7,8,9,10])"
1000000 loops, best of 3: 0.535 usec per loop
I.e. the use of set, slows us down two times!

Python 3

It turns out that membership testing for a list in Python 3.4 is about 25-30% faster:
$ python3 -mtimeit "None in [1,2,3,4,5,6,7,8,9,10]"
1000000 loops, best of 3: 0.181 usec per loop