There is grandeur in this view of life

martins bioblogg

Using R: reading tables that need a little cleaning

with one comment

Sometimes one needs to read tables that are a bit messy, so that read.table doesn’t immediately recognize the content as numerical. Maybe some weird characters are sprinkled in the table (ever been given a table with significance stars in otherwise numerical columns?). Some search and replace is needed. You can do this by hand, and I know this is a task for which many people would turn to a one-liner of perl or awk, but I like to do my scripting in R.

Again, this is in the ”trivial of you only think it out” category of problems. But it gives a starting point for more subtle changes to text files that one might have to do from time to time.

Assume the data look something like this (I made this up):

id;col1;col2
id1; 0,2;0.55*
id2; -,1;,23

Thankfully, R’s conversion from numbers to text is pretty good. It will understand, for instance, that both .5 and 0.5 are numbers. If the problem was just an odd decimal separator, like the Swedish decimal comma, we could just change the dec parameter to read.table. In the above, we have mixed decimal separators. Let’s start out by reading the table as character.

character.data <- read.table("some_data.csv", sep=";",
                   colClasses="character", header=T)
row.names(character.data) <- character.data[,1]
character.data <- character.data[,-1]

This also stores away the sample ids in row names, and removes the first column with the ids. This is optional, but for this example I assume that all columns should be numeric. If that is not the case, the code below can of course be restricted to the numeric columns, and the character (or factor) columns can be added later.

A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors. After removing characters that aren’t numbers, decimal separators, or the minus sign, we change the decimal separator, and convert the vector to numeric. Finally, we make the list a data.frame, and propagate the row names.

library(stringr)
library(plyr)

data <- data.frame(llply(character.data, function(x) {
  x <- str_replace_all(x, pattern="[^0-9\\.\\,-]", replacement="")
  x <- str_replace(x, pattern="\\,", replacement=".")
  return(as.numeric(x))
}))
row.names(data) <- row.names(character.data)
About these ads

Written by mrtnj

24 mars, 2013 den 23:00

Publicerat i computer stuff

Taggad med ,

Ett svar

Prenumerera på kommentarer via RSS.

  1. […] together you end up with inconsistent field separators. I’ve written a post about a way to deal with that in some cases, but you can also search and replace the file with a text editor, or use a command-line tool like […]


Kommentera

Fyll i dina uppgifter nedan eller klicka på en ikon för att logga in:

WordPress.com Logo

Du kommenterar med ditt WordPress.com-konto. Logga ut / Ändra )

Twitter-bild

Du kommenterar med ditt Twitter-konto. Logga ut / Ändra )

Facebook-foto

Du kommenterar med ditt Facebook-konto. Logga ut / Ändra )

Google+ photo

Du kommenterar med ditt Google+-konto. Logga ut / Ändra )

Ansluter till %s

Följ

Få meddelanden om nya inlägg via e-post.

Gör sällskap med 1 118 andra följare

%d bloggers like this: