I am attempting to run a calculation on each row of a large dataset using the carcass package. Each row of the dataset contains the values to feed to the function, and a call number for an associated vector stored in a list object. I have created an apply() function to cycle through each row of the dataset and assemble a vector of the results that I can then append to the dataset.
The problem is that the calculation is very slow: with 1000+ rows of data, it takes anywhere from several hours to days to run. To speed this up, I am trying to set up parallel processing with parLapply() from the parallel package, but it doesn't appear to be reading or including my list object or data frame correctly.
What I've tried so far:
library(carcass)
library(doParallel)
library(parallel)
no_cores <- detectCores(logical = TRUE)
no_cores
cl <- makeCluster(no_cores-1)
registerDoParallel(cl)
test_data<-data.frame(No.=c(39,48,16,23,7,1),p=c(0.05,0.05,0.05,0.05,0.05,0.05),c=rep(0.708,6),ID=c(1,1,2,2,2,2))
I<-list(c(rep(c(7,10),8),7),c(rep(7,10)))
all_samples<- as.data.frame((test_data))
all_IDs<-I
seq_id_all <- seq_len(nrow(all_samples))
# Write function: high maxN
function_test <- function(row, I, maxN = 600000) {
  tryCatch({
    # Calculate main posterior estimate
    p_main <- ettersonEq14(s = row[3], f = row[2], J = I[[row[4]]])
    N_main <- posteriorN(p = p_main, nf = row[1], maxN = maxN, plot = FALSE)
    # Return all estimates as a named vector
    return(N_main = N_main$HT.estimate)
  }, error = function(e) {
    message("Error processing row: ", paste(row, collapse = ", "), " - ", e$message)
    return(c(N_main = NA))
  })
}
clusterExport(cl,list('function_test','all_samples','I'))
system.time(results <- c(parLapply(cl,all_samples,I=I,fun=function_test)))
stopCluster(cl)
What I got: a list object of each column in the original dataset, with an NA under each.
What I want: a vector of the 6 HT.estimates, one for each row.
How it works normally using the test data and function above:
library(pbapply)
results <- t(pbapply(test_data, 1, function_test, I = I))
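For reference, here is a minimal base-R sketch of how the two calls differ in what they iterate over. It assumes that parLapply() treats its X argument the way lapply() does, i.e. a data frame is handled as a list of columns, whereas apply(X, 1, ...) (and its parallel counterpart parApply(cl, X, 1, ...)) walks over rows. The data frame df below is only a copy of the test data; no cluster or carcass call is involved.

```r
# Toy data frame with the same shape as test_data (values copied from above)
df <- data.frame(No. = c(39, 48, 16, 23, 7, 1),
                 p   = rep(0.05, 6),
                 c   = rep(0.708, 6),
                 ID  = c(1, 1, 2, 2, 2, 2))

# lapply() treats a data frame as a list of columns:
# the function is called 4 times, each time with a vector of length 6
by_column <- lapply(df, function(x) length(x))

# apply(..., 1, ...) walks over rows instead:
# the function is called 6 times, each time with a vector of length 4
by_row <- apply(df, 1, function(x) length(x))

length(by_column)  # 4: one result per column
length(by_row)     # 6: one result per row
```

If that assumption holds, it would be consistent with getting one result per column rather than one per row from the parallel call.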
I'm not sure where it's going wrong, and I'm new to both parallel processing and apply() functions in R; I've been banging my head against it for several hours now and gotten nowhere, so I figured I'd ask here. (I'm also new to StackExchange, so apologies if this is missing anything/formatted poorly.) Any suggestions would be appreciated.
Additional edits: I've added:
clusterCall(cl, function() library(carcass))
on the line under registerDoParallel(cl), and:
return(c(N_main = conditionMessage(e)))
to the error handler inside function_test. It still returns all NAs and does not print an error message.
error=function(e) ... function, perhaps something like return(c(N_main=conditionMessage(e))). From there, you can see the actual error text and act accordingly.
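A small self-contained sketch of that suggestion, using only base R. The functions risky() and safe() are hypothetical stand-ins for the real calculation and its wrapper; the point is only that the handler returns conditionMessage(e), so the error text comes back in the result vector instead of NA.

```r
# Hypothetical stand-in for the real calculation; it fails on purpose for x > 2
risky <- function(x) {
  if (x > 2) stop("value too large: ", x)
  x * 10
}

# Wrapper that returns the error text instead of NA when risky() fails
safe <- function(x) {
  tryCatch(
    as.character(risky(x)),
    error = function(e) conditionMessage(e)
  )
}

out <- vapply(1:4, safe, character(1))
out
# "10" "20" "value too large: 3" "value too large: 4"
```

Because the message is part of the return value, it does not depend on message() output from a worker process reaching the console.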