R performance issues using gsub and sapply
I have a data frame consisting of 10+ million records (all_postcodes). [Edit] Here are a few records:
pcode    area  east    north   area2      area3      area4      area5
ab101aa  10    394251  806376  s92000003  s08000006  s12000033  s13002483
ab101ab  10    394232  806470  s92000003  s08000006  s12000033  s13002483
ab101af  10    394181  806429  s92000003  s08000006  s12000033  s13002483
ab101ag  10    394251  806376  s92000003  s08000006  s12000033  s13002483
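(For convenience, the sample above can be recreated as a data frame like this; the column classes are just a guess:)

all_postcodes <- read.table(text = "
pcode area east north area2 area3 area4 area5
ab101aa 10 394251 806376 s92000003 s08000006 s12000033 s13002483
ab101ab 10 394232 806470 s92000003 s08000006 s12000033 s13002483
ab101af 10 394181 806429 s92000003 s08000006 s12000033 s13002483
ab101ag 10 394251 806376 s92000003 s08000006 s12000033 s13002483
", header = TRUE, stringsAsFactors = FALSE)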
I want to create a new column containing a normalised version of one of the columns, using the following function:
pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}
I tried to execute it as follows:
all_postcodes$npcode <- sapply(all_postcodes$pcode, pcode_normalize)
but it takes too long. Any suggestions on how to improve the performance?
All of the functions used in pcode_normalize are vectorized, so there's no need to loop over the elements with sapply. It also looks like you're using strsplit to check for single spaces; grepl will be faster. Using fixed=TRUE in the calls to gsub and grepl will also speed things up, since you're not using regular expressions.
pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x, fixed = TRUE)
  sp <- grepl(" ", x, fixed = TRUE)
  x[!sp] <- paste(substr(x[!sp], 1, 4), substr(x[!sp], 5, 7))
  x
}
all_postcodes$npcode <- pcode_normalize(all_postcodes$pcode)
I couldn't test this, since you didn't provide example data, but it should put you on the right path.
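If you want to verify the speedup yourself, here is a rough timing sketch on synthetic postcodes (the data is made up, and pcode_normalize_orig is just your original function renamed so the two versions can coexist, so the absolute timings are illustrative only):

# synthetic 7-character codes in the same shape as the real postcodes
set.seed(1)
n <- 1e5
codes <- paste0("ab", sprintf("%02d", sample(0:99, n, replace = TRUE)), "1",
                sample(letters, n, replace = TRUE), sample(letters, n, replace = TRUE))

# the original element-wise function, renamed for the comparison
pcode_normalize_orig <- function (x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}

system.time(old <- sapply(codes, pcode_normalize_orig))  # loops over every element
system.time(new <- pcode_normalize(codes))               # one vectorized pass
identical(unname(old), new)                              # both should give the same result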