Need a more efficient threshold matching function for R


I'm not sure how best to ask this question, so feel free to edit the question title if there is more standard vocabulary to use here.

I have two 2-column data tables in R. The first is a list of unique 2-variable values (u), and it is shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find the 2-variable sets of values in d where both variables are within a given threshold.

Here's a minimal example. The actual data is much larger (see below; that's the problem), and is (obviously) not created randomly as in the example. In the actual data, u will have 600,000 to 1,000,000 values (rows) and d will have upwards of 10,000,000 rows.

# First, create a table of unique variable pairs (no 2-column duplicates)
u <- data.frame(pc1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                pc2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))

# Now, create a set of raw 2-variable pairs, which may include duplicates
d <- data.frame(pc1=sample(u$pc1,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100,
                pc2=sample(u$pc2,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100)

# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1

So, my first attempt was to loop through the values of u. It works nicely, but it is computationally intensive and takes quite a while to process the actual data.

# Make a list to hold the output of within-threshold rows
m <- list()
# Loop to find the values of d within threshold b of each value of u.
# The output list will have as many items as there are values of u,
# and each list item may contain several thousand matching rows in d.
# Note there's a timing command (system.time) in here to keep track of performance.
system.time({
  for(i in 1:nrow(u)){
    m <- c(m, list(which(abs(d$pc1-u$pc1[i])<b & abs(d$pc2-u$pc2[i])<b)))
  }
})
m
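One small, certain improvement to the loop itself, independent of the matching logic: c(m, list(...)) copies the accumulated list on every iteration, so preallocating the list and assigning by index avoids that repeated copying. A minimal sketch of the same loop with that change:

# Preallocate the output list instead of growing it with c(),
# which copies the whole accumulated list on every iteration
m <- vector("list", nrow(u))
system.time({
  for(i in 1:nrow(u)){
    m[[i]] <- which(abs(d$pc1-u$pc1[i])<b & abs(d$pc2-u$pc2[i])<b)
  }
})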

That works. But I thought using the function apply() would be more efficient. It is...

# Make a user-defined function for threshold matching
# (note: this name shadows base R's match(), which is harmless here but worth knowing)
match <- function(x,...){
  which(abs(d$pc1-x[1])<b & abs(d$pc2-x[2])<b)
}
# Run the function over the rows of u with the apply() command.
system.time({
  m <- apply(u,1,match)
})
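As an aside (my sketch, not part of the original question): apply(u, 1, ...) first converts the data frame to a matrix and still loops over rows in R, which is why the gain is modest. Iterating over the two column vectors with mapply() skips the row-wise conversion while producing the same list:

# mapply() walks the two column vectors in parallel, avoiding the
# row-by-row matrix conversion that apply(u, 1, ...) performs
system.time({
  m <- mapply(function(p1, p2) which(abs(d$pc1-p1)<b & abs(d$pc2-p2)<b),
              u$pc1, u$pc2, SIMPLIFY=FALSE)
})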

Again, the apply() function works and is faster than the loop, but only marginally so. This may just be a big-data problem that needs a bit more computing power (or more time!), but I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed things up. Outside-the-box approaches to finding these matching rows are welcome.

Somewhat sneaky:

library(IRanges)
ur <- with(u*100L, IRanges(pc2, pc1))
dr <- with(d*100L, IRanges(pc2, pc1))
hits <- findOverlaps(ur, dr + b*100L)

This should be fast once the number of rows gets sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps may improve performance.
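A hedged follow-up sketch (my addition, not part of the answer above): findOverlaps returns a Hits object of candidate (u row, d row) pairs, and because this trick collapses the two columns into a single interval, a candidate can pass the combined interval test without satisfying both per-column thresholds. Verifying candidates against the original condition and then splitting by u row reproduces the list structure of the earlier approaches; queryHits() and subjectHits() are the standard accessors for a Hits object.

# Extract the candidate pairs from the Hits object
qh <- queryHits(hits)
sh <- subjectHits(hits)
# Verify each candidate pair against the original per-column threshold
ok <- abs(d$pc1[sh]-u$pc1[qh]) < b & abs(d$pc2[sh]-u$pc2[qh]) < b
# Rebuild the same one-list-item-per-row-of-u structure as the loop version;
# the factor levels keep empty entries for rows of u with no matches
m <- split(sh[ok], factor(qh[ok], levels=seq_len(nrow(u))))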

