Cleaning Data in R: csv files

When you read csv files, you regularly encounter Excel encoded csv files which include extraneous characters such as commas, dollar signs, and quotes. Such a file might look like

col1, col2, col3
"1,233", "$12.79", "$1,333,233.17"
"470", "$1,113.22", "$0.12"


and call it dirty.csv.

Using read.csv will leave you with:

> data <- read.csv(file='~/stuff/blog/dirty.csv', header=T, sep=',')
> data
   col1       col2           col3
1 1,233     $12.79  $1,333,233.17
2   470  $1,113.22          $0.12
> str(data)
'data.frame':	2 obs. of  3 variables:
 $ col1: Factor w/ 2 levels "1,233","470": 1 2
 $ col2: Factor w/ 2 levels " $1,113.22"," $12.79": 2 1
 $ col3: Factor w/ 2 levels " $0.12"," $1,333,233.17": 2 1
>


which is not ideal. Unfortunately, converting the factors to numbers will fail, as they aren’t in a valid numeric format:

> data$col1
[1] 1,233 470
Levels: 1,233 470
> as.numeric(data$col1)
[1] 1 2
>

The easiest thing to do is to clean the columns of our data frame with a regular expression, as such:

> gsub( ',', '', data$col1)
[1] "1233" "470"
>

A better regular expression would clean all sorts of extraneous characters out of the columns:

> example <- '\'"$99.37$'
> example
[1] "'\"$99.37$"
> gsub('[^a-zA-Z0-9.]', '', example)
[1] "99.37"
> # -- also remove various money signs
> gsub('[^a-zA-Z0-9.]', '', paste(' €£', example) )
[1] "99.37"
>

And we can apply this to our columns as such:

> data2 <- data
> data$col1 <- gsub('[^a-zA-Z0-9.]', '', data$col1)
> data$col1 <- as.numeric(data$col1)
> str(data)
'data.frame':	2 obs. of  3 variables:
 $ col1: num  1233 470
 $ col2: Factor w/ 2 levels " $1,113.22"," $12.79": 2 1
 $ col3: Factor w/ 2 levels " $0.12"," $1,333,233.17": 2 1
>

But this is annoying to have to type for each column — what if we had many? — so let’s do this programmatically:

> clean <- function(ttt){
+   as.numeric( gsub('[^a-zA-Z0-9.]', '', ttt))
+ }
> data <- data2
> clean(data$col1)
[1] 1233  470
>
> # all columns at once
> data[] <- sapply(data, clean)
> str(data)
'data.frame':	2 obs. of  3 variables:
 $ col1: num  1233 470
 $ col2: num  12.8 1113.2
 $ col3: num  1.33e+06 1.20e-01
>

Thus we can collapse all our work to this:

> data <- read.csv(file='~/stuff/blog/dirty.csv', header=T, sep=',')
> clean <- function(ttt){
+   as.numeric( gsub('[^a-zA-Z0-9.]', '', ttt))
+ }
> data[] <- sapply(data, clean)
> str(data)
'data.frame':	2 obs. of  3 variables:
 $ col1: num  1233 470
 $ col2: num  12.8 1113.2
 $ col3: num  1.33e+06 1.20e-01
>

This entry was posted in Data Munging, R. Bookmark the permalink.

One Response to Cleaning Data in R: csv files

  1. Mark Knecht says:

    Nice touch to use gsub to keep what you want to keep instead of trying to weed out the things you can’t allow.

    What about header rows? The header row it isn’t numerical and two header values such as DRAWDOWN ($) and DRAWDOWN(%) would both resolve to DRAWDOWN if the above worked on strings.

    Anyway I like the basic idea for the data portions of a csv file. Thanks!

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>