R performance issues using gsub and sapply
I have a data frame consisting of over 10 million records (all_postcodes). [Edit] Here are a few records:
    pcode   area east   north  area2     area3     area4     area5
    ab101aa 10   394251 806376 s92000003 s08000006 s12000033 s13002483
    ab101ab 10   394232 806470 s92000003 s08000006 s12000033 s13002483
    ab101af 10   394181 806429 s92000003 s08000006 s12000033 s13002483
    ab101ag 10   394251 806376 s92000003 s08000006 s12000033 s13002483

I want to create a new column containing normalised versions of one of the columns, using the following function:
    pcode_normalize <- function (x) {
      x <- gsub("  ", " ", x)
      if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
        x <- paste(substr(x, 1, 4), substr(x, 5, 7))
      }
      x
    }

I tried to execute it as follows:
    all_postcodes$npcode <- sapply(all_postcodes$pcode, pcode_normalize)

but it takes too long. Any suggestions on how to improve the performance?
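For example, applying the function to the first postcode above should insert a space before the final three characters, while an already-spaced value passes through unchanged:

    pcode_normalize("ab101aa")   # "ab10 1aa"
    pcode_normalize("ab10 1aa")  # unchanged: "ab10 1aa"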
All of the functions used in pcode_normalize are already vectorized, so there's no need to loop over each element with sapply. It also looks like you're using strsplit just to look for single spaces; grepl would be faster.
Using fixed=TRUE in the calls to gsub and grepl would also be faster, since you're not actually using regular expressions.
    pcode_normalize <- function (x) {
      x <- gsub("  ", " ", x, fixed = TRUE)
      sp <- grepl(" ", x, fixed = TRUE)
      x[!sp] <- paste(substr(x[!sp], 1, 4), substr(x[!sp], 5, 7))
      x
    }
    all_postcodes$npcode <- pcode_normalize(all_postcodes$pcode)

I couldn't test this, since you didn't provide any example data, but you should be on the right path.
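As a rough way to check the speedup (a sketch using made-up postcodes rather than the real table, and assuming the revised pcode_normalize above; pcode_normalize_slow is just a renamed copy of the original function, and timings will vary by machine), you could compare the element-by-element sapply call against the vectorized call, and regular-expression matching against fixed = TRUE:

    # Renamed copy of the original scalar version, for comparison only
    pcode_normalize_slow <- function (x) {
      x <- gsub("  ", " ", x)
      if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
        x <- paste(substr(x, 1, 4), substr(x, 5, 7))
      }
      x
    }

    x <- rep(c("ab101aa", "ab10 1aa"), 5e5)   # one million fake postcodes

    system.time(slow <- sapply(x, pcode_normalize_slow))  # loops in R
    system.time(fast <- pcode_normalize(x))               # vectorized version above
    identical(unname(slow), fast)                         # should be TRUE

    # fixed = TRUE skips the regular-expression engine entirely
    system.time(grepl(" ", x))
    system.time(grepl(" ", x, fixed = TRUE))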