Command-line CSV processing with Kilgore Trout

Some csv cleanup scripts grew into a little utility named kilgore.

It’s basically a wrapper for some pandas functions. I probably should have just done this with Jupyter notebook, but I’ve been doing so much shell scripting for the Unix class I’m taking this semester, it just seemed quicker and more universal. Or maybe it was just more fun.

Here’s the repo: https://github.com/jakekara/kilgoretrout

Cleanup module registry

kilgore lets you drop columns or select specific ones, prettify column headers (all lowercase, alpha-numeric and underscores), and limit the output to a number of rows. I also added a registry module where you can write scripts specific to your cleanup tasks, register them, and then call them with the “–load” argument.

Other features

kilgore can also output JSON with the –json – an array of rows, each represented by a dictionary of column name and cell value pairs.

Finally, kilgore can be forced to load the data as strings, overriding pandas’ column type inference.

Example scripts

The demo folder in the repo contains raw data, clean data, and scripts to get from raw to clean.

Here’s a slightly modified snippet from absenteeism.sh, for cleaning up CT schools asentee rate data.

    $ kilgore --skiprows 4 -i $FILENAME --pretty\
           --load clean_code_columns > clean/csv/$(basename $FILENAME).csv