When you read csv files, you regularly encounter Excel encoded csv files which include extraneous characters such as commas, dollar signs, and quotes. Such a file might look like
col1, col2, col3
"1,233", "$12.79", "$1,333,233.17"
"470", "$1,113.22", "$0.12"
and call it dirty.csv.
Using read.csv will leave you with:
> data <- read.csv(file='~/stuff/blog/dirty.csv', header=T, sep=',')
> data
col1 col2 col3
1 1,233 $12.79 $1,333,233.17
2 470 $1,113.22 $0.12
> str(data)
'data.frame': 2 obs. of 3 variables:
$ col1: Factor w/ 2 levels "1,233","470": 1 2
$ col2: Factor w/ 2 levels " $1,113.22"," $12.79": 2 1
$ col3: Factor w/ 2 levels " $0.12"," $1,333,233.17": 2 1
>
which is not ideal. Unfortunately, converting the factors to numbers will fail, as they aren’t in a valid numeric format:
> data$col1
[1] 1,233 470
Levels: 1,233 470
> as.numeric(data$col1)
[1] 1 2
>
The easiest thing to do is to clean the columns of our data frame with a regular expression, as such:
> gsub( ',', '', data$col1)
[1] "1233" "470"
>
A better regular expression would clean all sorts of extraneous characters out of the columns:
> example <- '\'"$99.37$'
> example
[1] "'\"$99.37$"
> gsub('[^a-zA-Z0-9.]', '', example)
[1] "99.37"
> # -- also remove various money signs
> gsub('[^a-zA-Z0-9.]', '', paste(' €£', example) )
[1] "99.37"
>
And we can apply this to our columns as such:
> data2 <- data
> data$col1 <- gsub('[^a-zA-Z0-9.]', '', data$col1)
> data$col1 <- as.numeric(data$col1)
> str(data)
'data.frame': 2 obs. of 3 variables:
$ col1: num 1233 470
$ col2: Factor w/ 2 levels " $1,113.22"," $12.79": 2 1
$ col3: Factor w/ 2 levels " $0.12"," $1,333,233.17": 2 1
>
But this is annoying to have to type for each column — what if we had many? — so let’s do this programmatically:
> clean <- function(ttt){
+ as.numeric( gsub('[^a-zA-Z0-9.]', '', ttt))
+ }
> data <- data2
> clean(data$col1)
[1] 1233 470
>
> # all columns at once
> data[] <- sapply(data, clean)
> str(data)
'data.frame': 2 obs. of 3 variables:
$ col1: num 1233 470
$ col2: num 12.8 1113.2
$ col3: num 1.33e+06 1.20e-01
>
Thus we can collapse all our work to this:
> data <- read.csv(file='~/stuff/blog/dirty.csv', header=T, sep=',')
> clean <- function(ttt){
+ as.numeric( gsub('[^a-zA-Z0-9.]', '', ttt))
+ }
> data[] <- sapply(data, clean)
> str(data)
'data.frame': 2 obs. of 3 variables:
$ col1: num 1233 470
$ col2: num 12.8 1113.2
$ col3: num 1.33e+06 1.20e-01
>
Nice touch to use gsub to keep what you want to keep instead of trying to weed out the things you can’t allow.
What about header rows? The header row it isn’t numerical and two header values such as DRAWDOWN ($) and DRAWDOWN(%) would both resolve to DRAWDOWN if the above worked on strings.
Anyway I like the basic idea for the data portions of a csv file. Thanks!