R performance issues using gsub and sapply
I have a data frame consisting of 10+ million records (all_postcodes). [Edit] Here are a few records:
pcode    area  east    north   area2      area3      area4      area5
ab101aa  10    394251  806376  s92000003  s08000006  s12000033  s13002483
ab101ab  10    394232  806470  s92000003  s08000006  s12000033  s13002483
ab101af  10    394181  806429  s92000003  s08000006  s12000033  s13002483
ab101ag  10    394251  806376  s92000003  s08000006  s12000033  s13002483
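(For convenience, the sample above can be recreated as a data frame like this; the column classes are just a guess:)

all_postcodes <- read.table(text = "
pcode area east north area2 area3 area4 area5
ab101aa 10 394251 806376 s92000003 s08000006 s12000033 s13002483
ab101ab 10 394232 806470 s92000003 s08000006 s12000033 s13002483
ab101af 10 394181 806429 s92000003 s08000006 s12000033 s13002483
ab101ag 10 394251 806376 s92000003 s08000006 s12000033 s13002483
", header = TRUE, stringsAsFactors = FALSE)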
I want to create a new column containing a normalised version of one of the columns, using the following function:
pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}
I tried to execute it as follows:
all_postcodes$npcode <- sapply(all_postcodes$pcode, pcode_normalize)
but it takes too long. Any suggestions on how to improve the performance?
All of the functions used in pcode_normalize are vectorized, so there's no need to loop over the elements with sapply. It also looks like you're using strsplit to check for single spaces; grepl will be faster. Using fixed=TRUE in the calls to gsub and grepl will also speed things up, since you're not using regular expressions.
pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x, fixed = TRUE)
  sp <- grepl(" ", x, fixed = TRUE)
  x[!sp] <- paste(substr(x[!sp], 1, 4), substr(x[!sp], 5, 7))
  x
}
all_postcodes$npcode <- pcode_normalize(all_postcodes$pcode)
I couldn't test this, since you didn't provide example data, but it should put you on the right path.
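If you want to verify the speedup yourself, here is a rough timing sketch on synthetic postcodes (the data is made up, and pcode_normalize_orig is just your original function renamed so the two versions can coexist, so the absolute timings are illustrative only):

# synthetic 7-character codes in the same shape as the real postcodes
set.seed(1)
n <- 1e5
codes <- paste0("ab", sprintf("%02d", sample(0:99, n, replace = TRUE)), "1",
                sample(letters, n, replace = TRUE), sample(letters, n, replace = TRUE))

# the original element-wise function, renamed for the comparison
pcode_normalize_orig <- function (x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}

system.time(old <- sapply(codes, pcode_normalize_orig))  # loops over every element
system.time(new <- pcode_normalize(codes))               # one vectorized pass
identical(unname(old), new)                              # both should give the same result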