Fake Customer Generator Program Modification

A few months ago (or maybe several years ago) I generated some fake customer data using a Python program I wrote. Then I wrote a program (actually a GnuCOBOL control break program) to print totals by state and realized the number of customers per state was about the same. So Wyoming had approximately the same number of fake “customers” as Florida. I assumed that was because I basically generate a random number between 1 and 50, then grab a state from that number. It’s something I never thought of at the time, but it became obvious when I created some reports. That’s crazy. So I need to do something about that.

Population.

Not so fast!

I’m smarter than I thought.

I need to re-run some tests, because I don’t generate random states from a number between 1 and 50, as I assumed. I actually select city/state/zip at the same time by selecting a random zip code from my zip code database, which has 42,741 zip codes. California has more zip codes than Wyoming, so there should be a more realistic spread by state. This is an old table, but still… the customer records generated should be more realistically distributed than I thought they were.
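Here is a minimal sketch of that zip-based selection, not the actual generator code. The `ZipCodes` table name comes from my database, but the column names (`zip`, `city`, `state`), the `random_location` helper, and the four sample rows are assumptions made up for illustration. Because every zip code is equally likely to be picked, states with more zip codes come up more often.

```python
import random
import sqlite3

# Build a tiny in-memory stand-in for the zip code table.
# Column names and sample rows are assumptions; the real table has 42,741 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZipCodes (zip TEXT PRIMARY KEY, city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO ZipCodes (zip, city, state) VALUES (?, ?, ?)",
    [
        ("90001", "Los Angeles", "CA"),
        ("94102", "San Francisco", "CA"),
        ("95814", "Sacramento", "CA"),
        ("82001", "Cheyenne", "WY"),
    ],
)

def random_location(connection, rng=random):
    """Return one (city, state, zip) tuple, with each zip code equally likely."""
    rows = connection.execute("SELECT city, state, zip FROM ZipCodes").fetchall()
    return rng.choice(rows)

city, state, zip_code = random_location(conn)
print(city, state, zip_code)
```

With three CA rows and one WY row in this toy table, roughly three out of four picks land in California, which is the same effect the full table produces on a larger scale.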

I ran a few SQL SELECTs against an old SQLite database I created from this program, such as:

 SELECT count(*) FROM ZipCodes WHERE state='CA';

Here are the results:

CA 2678
NY 2233
FL 1470
WY  197

So the odds are higher that I would select a CA zip code than a WY zip code. Maybe not perfect (today FL has a larger population than NY)… but quite a bit more realistic. And acceptable to me.
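To put numbers on those odds, here is a quick back-of-the-envelope calculation. It uses the four zip code counts from the query results and the 42,741 total mentioned earlier, and it assumes every zip code is equally likely to be picked. The variable names are just for illustration.

```python
# Zip code counts per state, taken from the query results above.
zip_counts = {"CA": 2678, "NY": 2233, "FL": 1470, "WY": 197}
TOTAL_ZIPS = 42_741  # total zip codes in the table

# Chance that a single generated customer lands in each state.
shares = {state: count / TOTAL_ZIPS for state, count in zip_counts.items()}
for state, share in shares.items():
    print(f"{state}: {share:.2%} chance per customer")

# How many times more likely a CA pick is than a WY pick.
ca_vs_wy = zip_counts["CA"] / zip_counts["WY"]
print(f"CA is about {ca_vs_wy:.1f}x as likely as WY")
```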

I am smarter than I thought.

In the past, I actually created a one-million-record SQLite customer database generated from my fake customer Python program. And doing a few queries proves I’m doing something right:

CA 62983
NY 52620
FL 34977
WY  4672

So out of one million “customers” there are 62,983 from CA and only 4,672 from WY!
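As a sanity check, the customer counts can be compared with what the zip code counts would predict. This is a rough sketch that assumes each customer got a uniformly random zip code from the 42,741-row table. The customer database may have been built from a slightly different version of that table, so I would only expect the numbers to be close, not exact.

```python
TOTAL_ZIPS = 42_741          # zip codes in the table
TOTAL_CUSTOMERS = 1_000_000  # fake customers generated

zip_counts = {"CA": 2678, "NY": 2233, "FL": 1470, "WY": 197}
observed = {"CA": 62_983, "NY": 52_620, "FL": 34_977, "WY": 4_672}

# Percent difference between observed and expected customers, per state.
diff_pct = {}
for state, zips in zip_counts.items():
    expected = TOTAL_CUSTOMERS * zips / TOTAL_ZIPS
    diff_pct[state] = (observed[state] - expected) / expected * 100
    print(
        f"{state}: expected {expected:,.0f}, "
        f"observed {observed[state]:,} ({diff_pct[state]:+.1f}%)"
    )
```

All four states come out within a couple of percent of the prediction, which fits the idea that the zip code table is what drives the spread.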

I don’t know why I was under the impression that the number of customers per state was roughly the same. But by selecting random zip codes for my fake customer records from a large zip code database, I solved a problem without even realizing it could be a problem. This makes my fake customer database look even more real. So I’m able to create SQL statements like this (select count(*) from customers where state='TX') and get more realistic-looking results.
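Instead of running one query per state, a single GROUP BY can list every state at once. The sketch below runs it from Python against a throwaway in-memory database. The `customers` table and `state` column match the query above, but the `id` column and the handful of rows are made up so the example runs on its own.

```python
import sqlite3

# In-memory stand-in; point connect() at the real database file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT)")

# A handful of made-up rows so the query has something to count.
conn.executemany(
    "INSERT INTO customers (state) VALUES (?)",
    [("CA",), ("CA",), ("CA",), ("TX",), ("TX",), ("WY",)],
)

# One query returns the count for every state, largest first.
rows = conn.execute(
    "SELECT state, COUNT(*) AS n FROM customers GROUP BY state ORDER BY n DESC"
).fetchall()

for state, n in rows:
    print(f"{state} {n:>7}")
```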