Bye Bye PyCharm

It’s back to VS Code. Somehow a Python program I was working on with PyCharm in one of my Python subdirectories got saved to my home directory. That’s strange. That never happened in all the time I used VS Code. Oh well, I’ll just move it back to where I keep almost all…if not all…of my Python programs. However now, after opening it in PyCharm and attempting to run it, it basically tells me it can’t find it in my home directory. Well, there’s a good reason for that. It’s not in my home directory! It’s in the Python subdirectory I opened it from with PyCharm. I don’t know what dumb ass explanation there is for why that happened, one that probably makes sense if you graduated from PyCharm U, and quite frankly…I don’t care. Evidently all the basic file handling rules I’ve learned through the years go out the window if you use the super advanced PyCharm. That’s insane. I want to spend my time right now concentrating on learning pandas, not PyCharm. I’m certainly not Charmed!

Udemy Data Analysis with Pandas and Python course

Well, predictably, my progress in my Udemy Data Analysis with Pandas and Python course is slower than I’d like. Mostly because, as I previously said, the video doesn’t work properly. And I’m at this location ~11 hrs a day, 5 days a week. It works fine on Linux Mint and Manjaro on my home computer, so I assume it’s a MATE problem. Never fear, there are many good Data Science videos on YouTube, and I’m not having a video playback problem there. It also allows me a little time to play with PyCharm, although I may switch to Atom on the older computer because PyCharm is noticeably sluggish on it.

Pandas and a large file

I had to satisfy my curiosity and use pandas on something substantial. So I did a read_csv on a million record file. Well IMHO…it’s fast, and that’s on my decent computer…not the fastest, by any stretch. The dataframe info returned the following in the blink of an eye. Likewise a simple sum was instantaneous.

I also did something much more taxing…a sort_values on Last Name, First Name and it was noticeably slower (~1 sec) but IMHO, impressive considering it wasn’t an indexed SQL file…see the end for my 1st use of the timeit magic command as it was described.

df.info

<bound method DataFrame.info of Account Code1 Code2 Gender Prefix First \
0 4864130159876517 2 C M Mr Cameron
1 4029852595634794 1 B F Kamilah
2 4689177385753112 1 F M Mr Odis
3 4304237478464178 5 F M Stephan
4 4821479510829505 3 G F Angle
… … … … … … …
999995 4193458599551172 5 F M Mitchell
999996 4716923127249654 5 C M Mr Kendrick
999997 4818979260696413 3 F M Bernardo
999998 4118908054242008 1 B F Cardinal Celine
999999 4838239144084666 3 E M Mr Hyman

             Middle                  Last      Suffix       Birth  \

0 Milford Garza 1971-08-16
1 Raina Perkins 1983-12-16
2 Elias Shepherd 1969-02-06
3 Hayes DPM 1977-03-21
4 Cleora Huffman 1955-08-15
… … … … …
999995 Barron OD 1957-01-28
999996 Antwan Hickman 1968-11-28
999997 Kraig Newton 1996-02-07
999998 Shaunte Fry 1973-10-26
999999 Max Kennedy 1981-05-21

        Enroll  Amount                    Address                  City  \

0 1997-01-12 76.56 73 Piper Townline Whitlash
1 2002-02-28 61.56 56 Jean Avenue Johnson
2 2020-04-24 37.69 746 Spruce Alley Haverhill
3 2006-05-11 26.50 1108 Graham Bypass Cayey
4 1994-08-24 52.44 1210 Howth Parkway Locust Gap
… … … … …
999995 1976-11-22 61.76 623 Merrie Row Saugus
999996 2000-10-30 65.00 300 Vicksburg Nene Oxford
999997 2018-12-17 74.85 747 Chabot Circle Palmer
999998 1991-05-20 68.98 765 Bernal Heights Nene Manlius
999999 2008-10-25 90.13 149 Incinerator Turnpike Morristown

   State    Zip  

0 MT 59545
1 VT 5656
2 NH 3765
3 PR 633
4 PA 17840
… … …
999995 MA 1906
999996 MI 48370
999997 NE 68864
999998 IL 61338
999999 TN 37816

[1000000 rows x 16 columns]>

df["Amount"].sum()
50502302.91999999

%timeit

%timeit df.sort_values(["Last", "First"])
967 ms ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
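
For anyone who wants to try the same three steps without a million record file, here is a minimal sketch. The tiny inline CSV is made up for illustration (only the column names are borrowed from the output above), and the timeit module stands in for the %timeit magic, which only exists inside IPython/Jupyter. Note that info needs its parentheses; a bare df.info just prints the bound method, as shown earlier.

```python
import io
import timeit

import pandas as pd

# Tiny stand-in for the million record file; the rows are illustrative only.
CSV_TEXT = """Account,First,Last,Amount
4864130159876517,Cameron,Garza,76.56
4029852595634794,Kamilah,Perkins,61.56
4689177385753112,Odis,Shepherd,37.69
4304237478464178,Stephan,Garza,26.50
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Call info() with parentheses to get the column/dtype summary.
df.info()

# Simple column sum, then a two-key sort like the one timed above.
total = df["Amount"].sum()
by_name = df.sort_values(["Last", "First"])

# Outside IPython, timeit.timeit plays the role of the %timeit magic.
seconds = timeit.timeit(lambda: df.sort_values(["Last", "First"]), number=10)
print(f"sum={total:.2f}, 10 sorts took {seconds:.4f} s")
```

On four rows the timing means nothing, of course; swap in a real file path to read_csv to see numbers like the ones above.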

PyCharm

Another diversion. I’ve been using MS VS Code for a while and found it a decent IDE. However, after a long break, I’m going to try PyCharm again. One thing I didn’t like about it before was that it felt sluggish, especially on first use. That was back on my older system with 4GB of memory, although at the time 4GB wasn’t terrible. It no longer feels sluggish on my Ryzen 16GB system. I was reminded about PyCharm while listening to the latest Destination Linux podcast.

VS Code Pros:

  • Works with many languages.
  • Including Julia which I was using at the time, but not currently.
  • If I remember correctly, there was an Atom enhancement for Julia too, in case I choose to give Julia another try, and I probably will!

VS Code Cons:

  • It’s Microsoft. I don’t like simply saying “because it’s Microsoft”…that’s too easy. I’ve talked about MS before and I don’t like to jump on the let’s-all-dislike-MS boat. I’ve been using Linux a long time and stated my reasons elsewhere. Today it’s more about not trusting a large company with a history of trying, unsuccessfully, to kill Linux, an OS that has helped me continue to learn about computer technology…free of charge.
  • I hate that it wants to be my everything editor. For regular text files I prefer a simple text editor! I’m guessing this is probably an easy fix, but I briefly tried a few things without success. I hated VS Code opening my text files so that a long line would just continue off the screen and I would have to use the scroll bar to see the whole line.

Learning distractions

The post below shows how easily I can become distracted from my primary task…learning Python pandas! Because of this course I wanted more data to play with than the course provides. Sure, pandas works well with a few thousand records, but how would it fare with a larger, more realistic dataset? So I decided to use a big data set generated by my create random customer Python program, which I wrote a while ago to get a feel for how SQLite would handle bigger data and to practice SQL selects. However, my data didn’t have any numeric fields to practice pandas math routines on. I had added an amount field to my Julia program, but it no longer works. So I added it to my recent Python 2 to Python 3 conversion of the program. Well, somewhere in the middle of the changes for Python 3, I learned about the Python “mimesis” package. I don’t remember where I first heard about it. Maybe a Python podcast; it was talked about in at least two. Anyway, it all goes towards learning, but my pandas progress slowed down a bit. Still, unless I answered my question about how pandas worked with real data, it would be hard for me to maintain my enthusiasm to learn.

I mean really what was I thinking? A popular package used every day by data scientists around the world…and I’m wondering if it’s been tested. Still, sometimes you have to do something, just for your own satisfaction…

I figured it out!

The following is what I assume any professional Python developer is already well familiar with. However I’m usually familiar enough with the few packages I import for it not to be a problem. So it’s not something I’ve used in years.

I wanted to replace my account number create routine with the fake credit card number available in mimesis. By googling, I saw examples of it on the internet. However all the ways I tried failed. Supposedly credit_card_number(CardType…) was available as…

from mimesis import Person
person=Person()
ccn=person.credit_card_number(CardType…)

or…

from mimesis import Personel
personel=Personel()
ccn=personel.credit_card_number(CardType…)

Next I started guessing from what I knew…

from mimesis import Business
bus=Business()
ccn=bus.credit_card_number(CardType…)

I also tried Numbers like I did above with Business

All of these failed! I was just about to send an email asking for help, which I’d rather not do if possible, when I remembered that Python has a way to expose an object’s attributes and methods using "__dict__". I had to google it…but I remembered!

import mimesis
for ls in mimesis.__dict__: print(ls)

I spotted Payment from that little bit of code…that’s probably it I thought! So from there I tried Payment…

from mimesis import Payment
for ls in Payment.__dict__: print(ls)

and I found it! So the solution (as of today) is…

from mimesis import Payment
from mimesis.enums import CardType  # CardType must be imported too
pay = Payment()
ccn = pay.credit_card_number(CardType.VISA)
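
The same hunt can be automated a little. Below is a rough sketch of a helper that lists which classes in a module offer a given method. The helper name classes_with_method is my own invention, and I use the standard library json module as a stand-in so the snippet runs without mimesis installed; pointing it at mimesis with "credit_card_number" is the obvious next step, but I make no promise about what it prints there.

```python
import inspect
import json  # stand-in module; the hunt above was done on mimesis


def classes_with_method(module, method_name):
    """Return sorted names of classes in `module` that have a callable `method_name`."""
    found = []
    # getmembers + isclass keeps only the classes visible in the module namespace.
    for name, obj in inspect.getmembers(module, inspect.isclass):
        if callable(getattr(obj, method_name, None)):
            found.append(name)
    return sorted(found)


# Which classes in json know how to "encode"?
print(classes_with_method(json, "encode"))
```

Unlike looping over __dict__, getattr also sees methods inherited from a parent class, which can matter when a package builds its classes on a common base.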

Udemy exercise 5

Coding exercise 5 after lecture 49

Udemy Exercise

The actual problem is above. My answers are below the last 3 comments.
I was sure my answer was correct however it was flagged as wrong!
Another student named Sridivya had a similar problem. The instructor replied…


Boris Paskhaver (Instructor), 2 months ago:
The code is looking for you to use the square bracket syntax instead.


Really? Where does it say that?
As you can see my answer actually used both methods to show I was paying attention.
My last answer used the square bracket syntax, which I used to show I was aware of both methods.
Nowhere does it say to use the bracket method!!!!
Also despite what they said my answer was NOT wrong.
Because you can easily type it into a Jupyter notebook and test it…and I did…and it works…and most importantly it was just taught!
He said his preferred way was to use brackets. But the Coding exercise didn’t say do it the way the instructor prefers.
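
For the record, here is a small sketch of the two column access styles side by side. The DataFrame and its column names are made up for illustration; this is not the actual exercise data.

```python
import pandas as pd

# Hypothetical data, not the course's exercise.
df = pd.DataFrame({"Amount": [76.56, 61.56], "Last Name": ["Garza", "Perkins"]})

bracket = df["Amount"]  # square bracket syntax
dot = df.Amount         # attribute (dot) syntax

# Both return the same Series.
same = bracket.equals(dot)
print(same)

# Brackets are the only option when the column name has a space in it.
last = df["Last Name"]
print(list(last))
```

So both answers are equivalent for a simple column name. Brackets are just the more general form, which may be why the instructor prefers them.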

In the grand scheme of things it’s a small complaint. My first after almost 50 steps. The course has been very good!

Create random customer change

I replaced a few key parts (mainly name) of my create random customer program with the Python “mimesis” package. To say it’s faster is an understatement: going from ~30 minutes to 1 min 14 sec to generate 1,000,000 records is pretty impressive. They claim they can create 10K full names in 0.254 seconds, and that sounds impressive, but I think they’re being modest, because in addition to 10K names my program creates 15 additional random fields and writes these to disk in 0m0.842s. And that’s just my 1st go at it. The code went from almost 340 lines to 275. What can I expect if/when I learn its other capabilities?

For example, there should be a way to generate random CSZ where the locations are valid. No Las Vegas, Alaska. Zip Codes should match city and state. Also, can I get a number with a valid mod 10 check digit? I’m doing that myself with an external check digit routine. Obviously this library helps immensely.
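
Since I mentioned the mod 10 check digit, here is a minimal sketch of the standard Luhn calculation. It is not my actual external routine, just the textbook algorithm, and the function names are made up for this example.

```python
def luhn_check_digit(partial: str) -> int:
    """Return the digit that makes `partial` + digit pass the mod 10 (Luhn) check."""
    total = 0
    # Walk right to left; every other digit, starting with the rightmost, is doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """True if the last digit of `number` is the correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


print(luhn_check_digit("7992739871"))
print(luhn_is_valid("4111111111111111"))
```

The first call uses the usual textbook example number, and the second uses a well known Visa test number.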

I think this proves that the most significant bottleneck in my program was the generate random name routine. It used an SQLite database which it accessed at least 3 times (first, middle, last).

If I can create one million records in a little over one minute…do I even try very hard to improve it? Probably not! Thirty minutes…yes…1 min 14 sec…no. Thirty minutes per million for the old program means 2 hours for 4 million records, as opposed to ~5 minutes for the new program. The speed improvement from 1hr 15min in Python 2 to 20min in Julia was worth the rewrite. Then a slight decrease in speed from Julia to Python 3 was acceptable because, unlike Julia, everything continues to work from one Python version update to the next. Each time I would think…what can I do to make this run faster? But now I’m not thinking, if only I could create one million records even faster than a minute!
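
The back-of-the-envelope numbers above can be checked in a few lines. This sketch assumes run time scales linearly with the record count, which is my assumption and not something I measured; the function name is made up.

```python
def projected_minutes(seconds_per_million: float, records: int) -> float:
    """Estimate run time in minutes, assuming time grows linearly with record count."""
    return seconds_per_million * records / 1_000_000 / 60


old = projected_minutes(30 * 60, 4_000_000)  # old program: ~30 min per million
new = projected_minutes(74, 4_000_000)       # new program: 1 min 14 sec per million
speedup = (30 * 60) / 74

print(f"old: {old:.0f} min, new: {new:.1f} min, speedup: {speedup:.1f}x")
```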

At my 1st real computer job a one million record database was large. But in this day of the internet, that’s nothing. Facebook has 2.6 billion monthly active users. Hence the term Big Data which, according to some data scientists I heard on a podcast the other day, is not being used as much these days!

Below is a small sample of the random customer record. It’s in CSV format, for all intents and purposes…actually BSV. I made it an image because I’m tired and I couldn’t find an obvious way to insert fixed width text.


mimesis

I looked into mimesis, an interesting Python package for generating fake data. Per their instructions, I installed it into a Python virtual environment.

I may use this in my create customer program to do some of the work, in an effort to speed it up. They say it’s fast, and it seems fast. They claim it’s much faster than a similar Python library called Faker. I can take a look at Faker also; nothing says I can’t use both. I can let it generate names. The addresses look more varied than mine. I don’t know its full capabilities, but perhaps it can do the account number also. Dates are another possibility. With this library I could add occupation. I may have to do CSZ myself because I don’t know if it can create valid combinations of the three. CSZ as I do it shouldn’t be too slow because all 3 are in the same record. It can do phone numbers, but are they valid for the location?