Create random customer change

I replaced a few key parts (mainly name generation) of my create random customer program with the Python “mimesis” package. To say it’s faster is an understatement: going from ~30 minutes to 1 min 14 sec to generate 1,000,000 records is pretty impressive. They claim they can create 10K full names in 0.254 seconds, and that sounds impressive; however, I think they’re being modest, because in addition to 10K names, my program creates 15 additional random fields and writes them all to disk in 0m0.842s. And that’s just my 1st go at it. Code went from almost 340 lines to 275. What can I expect if/when I learn its other capabilities?

For example, there should be a way to generate a random CSZ (city, state, ZIP) where the locations are valid. No Las Vegas, Alaska. Zip Codes should match city, state. Also, can I get a valid mod 10 check digit number? I’m doing that myself with an external check digit routine. Obviously this library helps immensely.
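For anyone wondering what a mod 10 check digit routine looks like, here is a minimal sketch. It assumes “mod 10” means the Luhn algorithm (the one used on credit card numbers); it is not my actual routine, and the function names are made up for illustration.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Return the Luhn (mod 10) check digit for a string of digits."""
    total = 0
    # Walk the payload right to left, doubling every other digit,
    # starting with the rightmost payload digit.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def is_valid_luhn(number: str) -> bool:
    """Check that the last digit of number is the correct check digit."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def random_luhn_number(length: int, rng: random.Random) -> str:
    """Build a random number of the given total length, check digit included."""
    payload = "".join(str(rng.randrange(10)) for _ in range(length - 1))
    return payload + str(luhn_check_digit(payload))


if __name__ == "__main__":
    print(random_luhn_number(16, random.Random()))
```

Calling `random_luhn_number(16, random.Random())` gives a 16-digit string that passes `is_valid_luhn`, which is about all a fake account number needs.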

I think this proves that the most significant bottleneck in my program was the generate random name routine. It used an SQLite database which it accessed at least 3 times (first, middle, last).

If I can create one million records in a little over one minute…do I even try very hard to improve it? Probably not! Thirty minutes…yes…1 min 14 sec…no. Thirty minutes for the old program means 2 hours for 4 million records, as opposed to ~5 minutes for the new program. The speed improvement from 1hr 15min in Python 2 to 20min in Julia was worth the rewrite. Then a slight decrease in speed from Julia to Python 3 was acceptable because, unlike with Julia, everything continues to work from one Python version update to the next. Each time I would think…what can I do to make this run faster? But now I’m not thinking, if only I could create one million records even faster than a minute!

At my 1st real computer job, a one million record database was large. But in this day of the internet, that’s nothing. Facebook has 2.6 billion monthly active users. Hence the term Big Data, which, according to some data scientists I heard on a podcast the other day, is not being used as much these days!

Below is a small sample of the random customer record. It’s in CSV format, for all intents and purposes…actually BSV. I made it an image because I’m tired and I couldn’t find an obvious way to insert fixed width text.
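If you want to produce that kind of file yourself, here is a minimal sketch. It assumes BSV means bar-separated values, with “|” as the delimiter; the field values are made up for illustration and are not from my program.

```python
import csv
import io


def write_bsv(rows, out):
    """Write rows to a file-like object, using '|' as the field delimiter."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical sample record: id, first name, last name, city, state.
buffer = io.StringIO()
write_bsv([["1001", "Ann", "Lee", "Reno", "NV"]], buffer)
print(buffer.getvalue(), end="")
```

The standard `csv` module handles any delimiter, so nothing special is needed beyond swapping the comma for a bar.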
