Pandas speed and larger test data

Great pandas course [Can you tell I discovered a WordPress feature…dropcap?]. This isn’t a thought I just came up with, but I remembered it while watching the intro video: at around 1:30 it says that pandas can operate on tens of millions of rows in milliseconds. Anyhoo, I have a one-million-record customer database. I created a dataframe with a SQL select of only 50,000 of those rows, and on an older 4GB computer it seemed to grind to a halt while doing a simple sum. Was it hung? At first I assumed it was mainly memory. But then again I wasn’t sure, because the 4GB didn’t fill up or start paging. So probably not.
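In case it helps anyone follow along, here’s roughly the shape of that first step. This is just a minimal sketch assuming an SQLite file and a table called 'customers'; my real database, table, and column names are different, so treat the names as placeholders:

    import sqlite3
    import pandas as pd

    # Assumed names for the sketch: an SQLite file and a 'customers' table.
    conn = sqlite3.connect("customers.db")

    # Pull only a 50,000-row slice of the million-row table into a dataframe.
    df = pd.read_sql_query("SELECT * FROM customers LIMIT 50000", conn)

    print(df.shape)   # e.g. (50000, 16)
    print(df.dtypes)  # worth checking which fields came in as numbers vs. strings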

But on my newer/faster computer with 16GB, a simple select doing nothing else took maybe half a second. Simply adding a non-specific sum [df.sum()], it seemed to figure out that out of 16 fields only 'Amount' could be summed, but it took about 15 seconds. Whereas if I was specific about summing the 'Amount' field [df['Amount'].sum()], it was only slightly slower than the select by itself…still about half a second.

So, now that I didn’t believe the problem was a memory issue or a hang, I reran the program on the older computer. The non-specific sum [df.sum()] took about 1 minute 37 seconds, whereas the specific sum on the 'Amount' field [df['Amount'].sum()] took only 1.6 seconds.
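The timing itself was nothing fancy; something along these lines (reusing the df from the snippet above, with 'Amount' being my one numeric field):

    import time

    # Whole-frame sum: pandas has to consider all 16 fields, including the
    # text ones, before it works out what it can actually add up.
    start = time.perf_counter()
    df.sum()
    print("df.sum() took", round(time.perf_counter() - start, 1), "seconds")

    # Naming the numeric field up front skips all of that.
    start = time.perf_counter()
    df["Amount"].sum()
    print("df['Amount'].sum() took", round(time.perf_counter() - start, 3), "seconds")

(If you’re on a recent pandas, df.sum(numeric_only=True) restricts the whole-frame sum to numeric columns, which is another way to avoid the slow path. I only timed the plain call.)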

So obviously you can get good performance from pandas. But, like anything, you need to be careful how you code it.

Also, I think my tests show the value of testing with a large amount of data. The datasets provided with the course are CSV files with maybe a few thousand records. That’s totally understandable for a course; however, my one-million-record customer data file with 16 fields is only around 50MB compressed. With only a few thousand records, some issues can stay hidden.
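If you don’t have a real dataset that size handy, it’s easy enough to manufacture one just for testing. A minimal sketch (the column names and value ranges here are made up for illustration):

    import numpy as np
    import pandas as pd

    # Fabricate a million-row customer-ish file purely for testing.
    # Column names and ranges are invented for the sketch.
    n = 1_000_000
    rng = np.random.default_rng(0)

    big = pd.DataFrame({
        "CustomerId": np.arange(n),
        "Region": rng.choice(["North", "South", "East", "West"], n),
        "Amount": rng.uniform(1, 500, n).round(2),
    })

    # Compressed CSV keeps the test file small enough to keep around.
    big.to_csv("test_customers.csv.gz", index=False, compression="gzip")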