Examining CSV Data Columns From a Shell

It’s very handy to be able to pop open a shell and peek in your csv files. awk is a command that will do just that — it divides each line into fields based either on a whitespace separator or by a separator specified by -F. Here’s a script that prints the 8th data column, sorts it, and prints out the unique values. It runs in a couple seconds on a 10MM row csv file.

awk -F, '{print $8}' aad.uploadnewb.csv | sort | uniq

2009-04-22
2009-04-26
2009-05-01
2009-05-02
2009-05-03
2009-05-04
2009-05-05

If you prefer uniq to spit out the counts, it will do so with the -c command:

#  awk -F, '{print $8}' aad.uploadnewb.csv | sort | uniq -c
  341770 2009-04-22
  445955 2009-04-26
  242915 2009-05-01
  135366 2009-05-02
  943205 2009-05-03
  545772 2009-05-04
  745088 2009-05-05

For the record, the csv file in question looks like this:

# head aad.uploadnewb.csv
0,78,1,1.79487,0.0128205,0.14,0.14,2009-04-22,2009-04-22,2009-07-08 02:55:52
1,168,3,1.25,0.0178571,0.07,0.21,2009-04-22,2009-04-22,2009-07-08 02:55:52
2,168,31,10.4167,0.184524,0.0564516,1.75,2009-04-22,2009-04-22,2009-07-08 02:55:52
4,297,38,20.0673,0.127946,0.156842,5.96,2009-04-22,2009-04-22,2009-07-08 02:55:52
6,235,23,4.51064,0.0978723,0.046087,1.06,2009-04-22,2009-04-22,2009-07-08 02:55:52
12,52,4,1.34615,0.0769231,0.0175,0.07,2009-04-22,2009-04-22,2009-07-08 02:55:52
19,75,4,2.13333,0.0533333,0.04,0.16,2009-04-22,2009-04-22,2009-07-08 02:55:53

Equivalent code in R would be something like

> d <- read.csv(file='./aad.uploadnewb.csv', header=F, sep=',')
>
> # just the unique values
> unique(d$X8)
>
> # include counts
> table(d$X8)

This entry was posted in Data Munging, R, R Tip and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>