Pandas and a large file

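I had to satisfy my curiosity and try Pandas on something substantial, so I ran read_csv on a million-record file. Well, IMHO, it's fast, and that's on a decent computer that is not the fastest by any stretch. df.info returned the following in the blink of an eye. (I typed it without the parentheses, so what came back is the DataFrame preview wrapped in a "bound method" repr rather than the column summary that df.info() prints.) Likewise, a simple sum was instantaneous.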
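For anyone who wants to try the same load-and-sum step without a million-record file, here is a minimal sketch. The in-memory CSV is a stand-in I made up from a few values in the output further down, with the leading zeros on Zip restored as an assumption. Passing dtype={"Zip": str} is an optional tweak, not something the original run did; it keeps ZIP codes such as 05656 intact instead of letting them be parsed as integers.

```python
import io

import pandas as pd

# Tiny stand-in for the million-record file. Column names follow the output
# in this post; the leading zeros on Zip are an assumption for illustration.
csv_text = """Account,First,Last,Amount,State,Zip
4864130159876517,Cameron,Garza,76.56,MT,59545
4029852595634794,Kamilah,Perkins,61.56,VT,05656
4689177385753112,Odis,Shepherd,37.69,NH,03765
"""

# read_csv accepts a path or any file-like object. Reading Zip as text is an
# optional choice that preserves leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Zip": str})

total = df["Amount"].sum()  # vectorised sum over the whole column
print(len(df), "rows, total Amount:", round(total, 2))
print(df["Zip"].tolist())
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.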

I also did something much more taxing: a sort_values on last name, then first name (the Last and First columns). It was noticeably slower (~1 sec) but, IMHO, still impressive considering this wasn't an indexed SQL table. See the end for my first use of the %timeit magic command, used as it was described.
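A small sketch of what that two-key sort does, on a hypothetical four-row frame (only the Last and First column names come from my file; the rows are picked for illustration). The ignore_index=True argument is my own addition here and was not part of the timed call.

```python
import pandas as pd

# Hypothetical mini-frame; only the column names match the real file.
people = pd.DataFrame(
    {
        "First": ["Max", "Celine", "Antwan", "Raina"],
        "Last": ["Kennedy", "Fry", "Kennedy", "Perkins"],
    }
)

# Sort by Last, breaking ties with First. sort_values returns a new frame and
# leaves the original order untouched; ignore_index=True renumbers rows 0..n-1.
by_name = people.sort_values(["Last", "First"], ignore_index=True)
print(by_name)
```

The two Kennedy rows end up adjacent, with Antwan ahead of Max, which is the tie-breaking behaviour the second key provides.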

df.info

<bound method DataFrame.info of                  Account  Code1  Code2  Gender    Prefix     First  \
0       4864130159876517      2      C       M        Mr   Cameron
1       4029852595634794      1      B       F             Kamilah
2       4689177385753112      1      F       M        Mr      Odis
3       4304237478464178      5      F       M             Stephan
4       4821479510829505      3      G       F               Angle
...                  ...    ...    ...     ...       ...       ...
999995  4193458599551172      5      F       M            Mitchell
999996  4716923127249654      5      C       M        Mr  Kendrick
999997  4818979260696413      3      F       M            Bernardo
999998  4118908054242008      1      B       F  Cardinal    Celine
999999  4838239144084666      3      E       M        Mr     Hyman

         Middle      Last  Suffix       Birth  \
0       Milford     Garza          1971-08-16
1         Raina   Perkins          1983-12-16
2         Elias  Shepherd          1969-02-06
3                   Hayes     DPM  1977-03-21
4        Cleora   Huffman          1955-08-15
...         ...       ...     ...         ...
999995             Barron      OD  1957-01-28
999996   Antwan   Hickman          1968-11-28
999997    Kraig    Newton          1996-02-07
999998  Shaunte       Fry          1973-10-26
999999      Max   Kennedy          1981-05-21

            Enroll  Amount                   Address        City  \
0       1997-01-12   76.56         73 Piper Townline    Whitlash
1       2002-02-28   61.56            56 Jean Avenue     Johnson
2       2020-04-24   37.69          746 Spruce Alley   Haverhill
3       2006-05-11   26.50        1108 Graham Bypass       Cayey
4       1994-08-24   52.44        1210 Howth Parkway  Locust Gap
...            ...     ...                       ...         ...
999995  1976-11-22   61.76            623 Merrie Row      Saugus
999996  2000-10-30   65.00        300 Vicksburg Nene      Oxford
999997  2018-12-17   74.85         747 Chabot Circle      Palmer
999998  1991-05-20   68.98   765 Bernal Heights Nene     Manlius
999999  2008-10-25   90.13  149 Incinerator Turnpike  Morristown

        State    Zip
0          MT  59545
1          VT   5656
2          NH   3765
3          PR    633
4          PA  17840
...       ...    ...
999995     MA   1906
999996     MI  48370
999997     NE  68864
999998     IL  61338
999999     TN  37816

[1000000 rows x 16 columns]>
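The "bound method" wrapper in that output is what Python prints when a method is referenced but not called. A minimal sketch of the difference, on a made-up two-row frame (the column names are borrowed from my file, the frame itself is hypothetical):

```python
import io

import pandas as pd

df = pd.DataFrame({"Amount": [76.56, 61.56], "State": ["MT", "VT"]})

# Without parentheses nothing is called: this is just the method object,
# and its repr embeds the DataFrame preview.
method_repr = repr(df.info)

# Calling it writes a column summary (dtypes, non-null counts, memory usage).
# buf= redirects that text into a string instead of printing it directly.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)
```

In a notebook, a plain df.info() is enough; the buf argument is only used here so the summary can be captured as a string.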

df["Amount"].sum()
50502302.91999999

%timeit

%timeit df.sort_values(["Last", "First"])
967 ms ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
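%timeit only works inside IPython or Jupyter. In a plain Python script, something roughly equivalent can be sketched with the standard timeit module. The data below is synthetic (100,000 random two-letter "names", all invented), so the numbers it prints will not match the 967 ms above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in data: 100,000 random two-letter "names".
# The column names match the post; the contents are invented.
rng = np.random.default_rng(0)
letters = np.array(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
n = 100_000
df = pd.DataFrame(
    {
        "Last": ["".join(p) for p in rng.choice(letters, size=(n, 2))],
        "First": ["".join(p) for p in rng.choice(letters, size=(n, 2))],
    }
)

# Roughly what %timeit reported above: 7 runs of 1 loop each,
# summarised as mean and standard deviation.
times = timeit.repeat(lambda: df.sort_values(["Last", "First"]), repeat=7, number=1)
print(f"mean {np.mean(times) * 1e3:.1f} ms ± {np.std(times) * 1e3:.1f} ms over {len(times)} runs")
```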