Pandas and a large file

I had to satisfy my curiosity and try Pandas on something substantial, so I ran read_csv on a million-record file. IMHO, it's fast, and that's on a decent computer that is not the fastest by any stretch. The DataFrame info returned the following in the blink of an eye. Likewise, a simple sum was instantaneous.
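For anyone who wants to try the same steps, here is a minimal sketch of the load-and-sum part. The three-row CSV is a hypothetical stand-in for the million-record file (the column names and sample values are borrowed from the output further down), and the `dtype` argument is an assumption on my part: the Zip values in that output have lost their leading zeros (5656 for a Vermont address), which is what happens when the column is parsed as integers.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the million-record file.
csv_text = """Account,First,Last,Amount,State,Zip
4864130159876517,Cameron,Garza,76.56,MT,59545
4029852595634794,Kamilah,Perkins,61.56,VT,05656
4689177385753112,Odis,Shepherd,37.69,NH,03765
"""

# Assumption: the real file stores ZIP codes with leading zeros.
# Reading the column as str keeps them; the default would infer integers.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Zip": str})

total = df["Amount"].sum()
print(df.shape)
print(df["Zip"].tolist())
print(total)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.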

I also did something much more taxing: a sort_values on Last Name, then First Name. It was noticeably slower (~1 sec) but, IMHO, impressive considering it wasn't an indexed SQL table. See the end for my first use of the %timeit magic command.
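A two-key sort like that one can be sketched on a tiny frame. The rows here are hypothetical and chosen so that two people share a last name, which makes the tie-breaking by first name visible.

```python
import pandas as pd

# Hypothetical sample rows; two share the last name "Garza".
people = pd.DataFrame(
    {
        "First": ["Odis", "Cameron", "Kamilah", "Celine"],
        "Last": ["Shepherd", "Garza", "Perkins", "Garza"],
    }
)

# Sort by Last first, then by First within each Last.
# sort_values returns a new DataFrame and leaves `people` untouched.
by_name = people.sort_values(["Last", "First"])
print(by_name)
```

Because the result is a new DataFrame, the original row order is still available afterwards unless `inplace=True` is passed.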

df.info

```
<bound method DataFrame.info of Account Code1 Code2 Gender Prefix First \
0 4864130159876517 2 C M Mr Cameron
1 4029852595634794 1 B F Kamilah
2 4689177385753112 1 F M Mr Odis
3 4304237478464178 5 F M Stephan
4 4821479510829505 3 G F Angle
... ... ... ... ... ... ...
999995 4193458599551172 5 F M Mitchell
999996 4716923127249654 5 C M Mr Kendrick
999997 4818979260696413 3 F M Bernardo
999998 4118908054242008 1 B F Cardinal Celine
999999 4838239144084666 3 E M Mr Hyman

Middle Last Suffix Birth \
0 Milford Garza 1971-08-16
1 Raina Perkins 1983-12-16
2 Elias Shepherd 1969-02-06
3 Hayes DPM 1977-03-21
4 Cleora Huffman 1955-08-15
... ... ... ... ...
999995 Barron OD 1957-01-28
999996 Antwan Hickman 1968-11-28
999997 Kraig Newton 1996-02-07
999998 Shaunte Fry 1973-10-26
999999 Max Kennedy 1981-05-21

Enroll Amount Address City \
0 1997-01-12 76.56 73 Piper Townline Whitlash
1 2002-02-28 61.56 56 Jean Avenue Johnson
2 2020-04-24 37.69 746 Spruce Alley Haverhill
3 2006-05-11 26.50 1108 Graham Bypass Cayey
4 1994-08-24 52.44 1210 Howth Parkway Locust Gap
... ... ... ... ...
999995 1976-11-22 61.76 623 Merrie Row Saugus
999996 2000-10-30 65.00 300 Vicksburg Nene Oxford
999997 2018-12-17 74.85 747 Chabot Circle Palmer
999998 1991-05-20 68.98 765 Bernal Heights Nene Manlius
999999 2008-10-25 90.13 149 Incinerator Turnpike Morristown

State Zip
0 MT 59545
1 VT 5656
2 NH 3765
3 PR 633
4 PA 17840
... ... ...
999995 MA 1906
999996 MI 48370
999997 NE 68864
999998 IL 61338
999999 TN 37816

[1000000 rows x 16 columns]>
```
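A note on that output: the `<bound method DataFrame.info of ...` prefix means `df.info` was evaluated without parentheses, so what got displayed is the method object's repr (which happens to embed the frame's repr) rather than the result of calling it. A small sketch of the difference, using a hypothetical two-row frame:

```python
import io

import pandas as pd

# Hypothetical two-row frame, just to show the two forms.
df = pd.DataFrame({"Amount": [76.56, 61.56], "State": ["MT", "VT"]})

# Without parentheses, df.info is the method object, not a summary.
print(callable(df.info))

# With parentheses, info() writes a summary of the index, columns,
# dtypes, and memory use. buf= captures it here instead of printing.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

In a notebook, plain `df.info()` prints the same summary directly.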

df["Amount"].sum()

50502302.91999999
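The trailing `...91999999` is most likely ordinary binary floating-point rounding accumulated over the additions, not a problem with the data. A minimal sketch of the effect and one way to present such a total, using hypothetical values rather than the real Amount column:

```python
import pandas as pd

# Hypothetical values; none of them is exactly representable in binary.
amounts = pd.Series([0.1, 0.2, 0.3])

total = amounts.sum()
print(repr(total))      # may show a long tail of digits
print(round(total, 2))  # rounded to cents for display
```

Rounding only at display time keeps the full-precision value available for any further arithmetic.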

%timeit df.sort_values(["Last", "First"])

967 ms ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
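`%timeit` only works inside IPython or Jupyter. Outside of them, a roughly similar measurement can be sketched with the standard library's `timeit` module. The frame below is a hypothetical, much smaller stand-in generated at random, so its timings are not comparable to the figure above.

```python
import statistics
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 50,000 rows of randomly chosen names.
rng = np.random.default_rng(0)
n = 50_000
names = np.array(["Garza", "Perkins", "Shepherd", "Huffman", "Kennedy"])
df = pd.DataFrame(
    {
        "Last": rng.choice(names, size=n),
        "First": rng.choice(names, size=n),
    }
)

# repeat=7, number=1 mirrors the "7 runs, 1 loop each" that %timeit reported.
times = timeit.repeat(
    lambda: df.sort_values(["Last", "First"]), repeat=7, number=1
)

print(f"mean of {len(times)} runs: {statistics.mean(times) * 1000:.1f} ms")
print(f"fastest run: {min(times) * 1000:.1f} ms")
```

Each entry in `times` is the wall-clock duration in seconds of one full sort.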

2020-07-01 23:05:49