First published
CSVKit is great. It’s more than a dozen command-line tools to streamline data work with CSVs, and it’s a worthy addition to any journalist’s toolbox. Note that many tasks that can be done in CSVKit can also be done in XSV, which is not maintained anymore but can also come in handy.
So much of building a career is building a toolbox of “it’s useful when it’s useful”-type tools.
Here are several examples of ways I solved journalistic problems with CSVKit.
Retrieve a list of all the unique values in a column
Sometimes we just need to know the distinct values in a column. Maybe it’s a list of incidents by state and we need to know how many states are represented in the dataset – in that case, we’d do something like this:
csvcut -c state example.csv | uniq | tail -n +2 | wc -l
What this command does:
csvcut
gets the column we’re interested in.uniq
pulls out the unique values, *including the header row*tail
slices off the top row for us...- ...so that
wc -l
gives us an accurate count of the number of lines, i.e. unique states
To double-check our work we'd run something like
csvcut -c state example.csv | uniq | tail -n +2 | less
Pare down the massive federal hurricane dataset
The NCEI’s IBTrACS hurricane dataset is lovely. It’s also huge, and getting it to a place where you can start working with it can take some doing.
This csvcut command gets you many key fields for later analysis.
csvcut -c "SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,WMO_WIND,WMO_PRES,WMO_AGENCY,TRACK_TYPE,DIST2LAND,LANDFALL,USA_AGENCY,USA_ATCF_ID,USA_LAT,USA_LON,USA_RECORD,USA_STATUS,USA_WIND,USA_SSHS" ibtracs.NA.list.v04r01.csv > hurricanes.csv
That data was later used in a python script that was used to answer the question "In each year, what was the earliest Category 4 hurricane?"
Download a JSON file and convert it to CSV
This one-liner will download the JSON file of per-county power outages in Texas and turn it into a CSV.
curl 'https://poweroutage.us/api/web/counties?key=9818916638&countryid=us&statename=Texas' | in2csv -f json -k WebCountyRecord > tx.csv
But say you wanted Texas and Louisiana in one CSV. Here's one way to accomplish that:
curl 'https://poweroutage.us/api/web/counties?key=9818916638&countryid=us&statename=Texas' | in2csv -f json -k WebCountyRecord > tx.csv && curl 'https://poweroutage.us/api/web/counties?key=9818916638&countryid=us&statename=Louisiana' | in2csv -f json -k WebCountyRecord | csvstack tx.csv > combined.csv
csvcut -c 2,6 cat-bills.csv | sort -u | grep Enacted
csvcut -c 2,6 cat-bills-latest.csv | sort -u | grep Enacted | sed -e 's/,Enacted//g' | wc -l
csvcut -t -c createdAtMillis,classification,tweetId notes-00000.tsv > notes-created-class-tweetid.csv
csvgrep -c "declarationDate" -r "^2024" IndividualsAndHouseholdsProgramValidRegistrations.csv.2 > fema-2024.csv
curl https://api.doge.gov/savings/contracts\?sort_by\=savings\&sort_order\=desc\&page\=1\&per_page\=500 | in2csv -f json -k result | less
You can email me at joe.murphy@gmail.com, or find another way to contact me here.