Learning From Experiments
Intuitions in computing from simple data collection and analysis
Recently, I hacked together a development cluster to help with some data science and application development. My budget laptop couldn't handle multiple browser tabs, the occasional video conference, Slack, and training models all at once. Fast forward six cores and twelve threads later and, in only days, I've hit a wall.
As a hobbyist PC builder responsible for custom laptops, workstations, and rigs for almost two decades, the recent explosion in High End Desktop (HEDT) processors has had me salivating. Unfortunately, I'm a value nut. Until recently, my friends, family, and coworkers didn't have workloads demanding enough for workstation CPUs that powerful. Given their price, it also didn't make sense to adopt those platforms rather than staying on mainstream top-end SKUs and upgrading every few years. Skylake-X and Threadripper changed all of that.
It started with a fight. I said JSON is a waste of time, and so is dealing with its associated libraries in Python. I bet that pickling dictionaries or, better yet, simply importing .py files as dictionaries, would be faster and the code cleaner.
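For the record, the three contenders look roughly like this. This is a minimal sketch of the idea, not the benchmark itself; the file names, the `record` payload, and the temp-directory setup are illustrative assumptions.

```python
import importlib
import json
import os
import pickle
import pprint
import sys
import tempfile

record = {"user": {"name": "Ada", "id": 1}}

# Write everything to a scratch directory and make it importable
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)

# JSON: text on disk, parsed back into a dict by the json library
with open(os.path.join(workdir, "records.json"), "w") as f:
    json.dump(record, f)
with open(os.path.join(workdir, "records.json")) as f:
    from_json = json.load(f)

# Pickle: binary serialization of the Python object itself
with open(os.path.join(workdir, "records.pickle"), "wb") as f:
    pickle.dump(record, f)
with open(os.path.join(workdir, "records.pickle"), "rb") as f:
    from_pickle = pickle.load(f)

# .py file: the "data" is literally a Python module defining `data = {...}`
with open(os.path.join(workdir, "records.py"), "w") as f:
    f.write("data = " + pprint.pformat(record))
from_py = importlib.import_module("records").data
```

All three round-trip to the same dictionary; the question is which one gets there fastest at scale.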
Data generation was accomplished via the faker library and a basic nested dictionary structure. Separate pseudo-random 'tags' were selected for each level of the dictionary, with a value at the leaf point. Data types vary from integer indices to randomly generated sentences to help eliminate cast bias in any of the load functions.
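The shape of one generated record can be sketched as below. The original run used faker's providers; this sketch substitutes a small stdlib stand-in for faker's sentence generator so it has no dependencies, and the `make_record` helper, word list, and depth/fanout parameters are all assumptions.

```python
import random

WORDS = ["alpha", "beta", "gamma", "delta", "epsilon"]

def fake_sentence(rng, n=6):
    # Stand-in for faker's fake.sentence()
    return " ".join(rng.choice(WORDS) for _ in range(n))

def make_record(rng, depth=3, fanout=2):
    """Nested dict: pseudo-random 'tags' at each level, a value at the leaf.

    Leaf values mix integer indices and generated sentences so no single
    loader benefits from a uniform data type.
    """
    if depth == 0:
        return rng.choice([rng.randrange(10**6), fake_sentence(rng)])
    return {
        f"tag_{rng.randrange(1000)}": make_record(rng, depth - 1, fanout)
        for _ in range(fanout)
    }

rng = random.Random(42)
record = make_record(rng)
```

Scaling the record count from 2 up to 2 million is then just a loop over `make_record`.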
Data timing was completed with the time library by wrapping load functions in a start/stop wrapper. Each data type (.pickle, .py, .json) was loaded, in as little code and with as little fuss as I could manage, into both a pandas dataframe and a dictionary separately, with each result timed on its own.
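The start/stop wrapper amounts to a small decorator like the one below. The `timed` and `load_json_dict` names are hypothetical, and only the json-to-dict path is shown; the same wrapper would go around `pickle.load`, the module import, and the pandas constructors for the dataframe runs.

```python
import json
import os
import tempfile
import time

def timed(fn):
    # Start/stop wrapper: returns (result, elapsed seconds)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper

@timed
def load_json_dict(path):
    with open(path) as f:
        return json.load(f)

# Tiny sample file so the sketch is runnable end to end
path = os.path.join(tempfile.gettempdir(), "sample.json")
with open(path, "w") as f:
    json.dump({"a": 1}, f)

data, elapsed = load_json_dict(path)
```

`time.perf_counter()` is the usual choice here since it's a monotonic, high-resolution clock, unlike `time.time()`.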
The Case for Threadripper
I tried to be conservative, generating data with large jumps in size. According to Linux's top command, with 4 cores and 8 threads pinned at 100%, collecting a few thousand datapoints from files containing between 2 and 2 million records would take over three days. I have other work to do. There was no way I could afford to run this investigation - a glorified library stress test - in the cloud. There is no monetary value in knowing which data storage type is theoretically faster in Python frameworks. I can't pay Amazon $12/day or more to run a programmer flight-of-fancy. But it got me thinking. I'm testing libraries and functions. What about a whole program?
Unit testing and more robust methods could easily swamp my hack-a-server with data generation, not to mention running the development server that needs to process the data as it's generated. How much does it cost to run production-level CI/CD with engineers pushing changes on a daily basis?
Suddenly, Threadripper doesn’t sound like such a bad idea.
For now, a minor, minor sampling of the search space.