BiocNeighbors 1.20.1
The BiocNeighbors package provides several algorithms for approximate neighbor searches, namely the Annoy and HNSW methods demonstrated below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3365 387 4429 3465 9376 6313 2631 589 1196 5023
## [2,] 797 1399 8666 3960 6249 9065 3824 6220 7205 6688
## [3,] 3475 9791 8731 7414 6006 8139 9495 8093 986 2631
## [4,] 9765 2480 7507 8983 5599 6721 2587 6637 9028 4711
## [5,] 5483 7628 3967 5853 4548 8975 3147 9958 5897 9196
## [6,] 9439 2005 4096 1539 6159 5570 4068 430 9253 1052
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8092533 0.9486768 0.9561160 0.9899128 1.001937 1.0039014 1.0290518
## [2,] 0.9459962 0.9881017 1.0045476 1.0061874 1.012161 1.0141041 1.0566691
## [3,] 1.0371169 1.1213335 1.1217155 1.1229984 1.125019 1.1251875 1.1469314
## [4,] 0.9693286 0.9813520 0.9984147 1.0018961 1.019948 1.0381206 1.0564381
## [5,] 0.8171108 0.8676689 0.9020651 0.9142418 0.963010 0.9866728 0.9891987
## [6,] 1.0201658 1.0376300 1.0569617 1.0641049 1.067901 1.0690782 1.0736088
## [,8] [,9] [,10]
## [1,] 1.0356061 1.0424124 1.0511163
## [2,] 1.0612347 1.0883250 1.0936329
## [3,] 1.1523470 1.1568710 1.2112980
## [4,] 1.0602041 1.0615220 1.0905735
## [5,] 0.9909155 0.9919524 0.9941895
## [6,] 1.0899429 1.0903524 1.1001512
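Because the search is approximate, it can be useful to check how often the reported neighbors agree with an exact search. The following is a minimal sketch, assuming the exact KmknnParam() method from the same package as the reference; the smaller data set and the names approx, exact and recall are chosen purely for illustration.

```r
library(BiocNeighbors)

# Simulated data, smaller than the example above to keep the check quick.
nobs <- 2000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Approximate (Annoy) and exact (KMKNN) searches on the same data.
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())

# For each point, the proportion of approximate neighbors that
# also appear among its exact neighbors.
recall <- vapply(seq_len(nobs), function(i) {
    mean(approx$index[i,] %in% exact$index[i,])
}, numeric(1))
mean(recall)
```

A mean close to 1 indicates that the approximation loses few true neighbors for this particular data set; the value will vary with the data and the algorithm parameters.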
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4207 1925 9813 6743 5739
## [2,] 3594 1155 4905 7474 3241
## [3,] 3107 3576 5033 4005 8632
## [4,] 386 6812 3247 3232 2080
## [5,] 1848 2506 2118 8294 1904
## [6,] 5432 5558 7992 9020 1708
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.1086656 1.1229142 1.1396971 1.2005340 1.2044449
## [2,] 0.9193776 0.9433740 0.9795136 0.9867874 0.9871328
## [3,] 0.9664251 1.0471556 1.0674046 1.1045169 1.1432260
## [4,] 0.8694986 0.9339198 0.9925212 1.0182252 1.0460166
## [5,] 1.0522285 1.0948354 1.0979621 1.1414605 1.1561661
## [6,] 0.8329313 0.8960790 0.9602339 0.9741856 0.9886567
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
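As a short sketch of the switch described above, the same kind of search can be repeated with HnswParam(); the data are regenerated here so that the example is self-contained, and the name hout is illustrative only.

```r
library(BiocNeighbors)

# Simulated data with the same dimensions as the Annoy example.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Only the BNPARAM argument changes relative to the Annoy call.
hout <- findKNN(data, k=10, BNPARAM=HnswParam())
dim(hout$index)
dim(hout$distance)
```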
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
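The other options listed above can be sketched in the same way. This example assumes that the BiocParallel package is installed and uses MulticoreParam() with two workers as one possible backend (it falls back to a single worker on Windows); the variable names are illustrative.

```r
library(BiocNeighbors)
library(BiocParallel)

data <- matrix(runif(10000*20), ncol=20)

# Neighbors for the first 100 points only.
sub <- findKNN(data, k=5, subset=1:100, BNPARAM=AnnoyParam())
dim(sub$index)

# Skip the distance matrix when only the identities are needed.
nodist <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(nodist)

# Spread the search across two workers.
par.out <- findKNN(data, k=5, BNPARAM=AnnoyParam(),
    BPPARAM=MulticoreParam(workers=2))
dim(par.out$index)
```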
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
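As a brief illustration of the distance option, the sketch below runs an Annoy search with Manhattan distances and recomputes the first reported distance by hand. The names mout and manual are illustrative, and the comparison uses a tolerance because the index may store values at reduced precision.

```r
library(BiocNeighbors)

data <- matrix(runif(10000*20), ncol=20)

# Same search as before, but with Manhattan distances.
mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the distance from the first point to its first
# reported neighbor as the sum of absolute differences.
nn <- mout$index[1,1]
manual <- sum(abs(data[1,] - data[nn,]))
all.equal(manual, mout$distance[1,1], tolerance=1e-4)
```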
Both Annoy and HNSW generate indexing structures - a forest of trees and a series of graphs, respectively - that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can set the TMPDIR environment variable before starting R so that tempdir() points to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmppMkKua/file16de5a52146b73.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
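A minimal sketch of this workflow is shown below. It assumes that buildIndex() forwards an fname argument to the underlying Annoy builder, which should be checked against the documentation of buildAnnoy; the path idx.path is hypothetical and is placed under tempdir() here only so that the example cleans up after itself.

```r
library(BiocNeighbors)

data <- matrix(runif(10000*20), ncol=20)

# Hypothetical location for the index file; replace with a path
# that should outlive the current session.
idx.path <- file.path(tempdir(), "my_annoy.idx")

# Assumption: 'fname' is passed through buildIndex() to the Annoy builder.
pre <- buildIndex(data, BNPARAM=AnnoyParam(), fname=idx.path)
AnnoyIndex_path(pre)

# The pre-built index is used as before.
out <- findKNN(BNINDEX=pre, k=5)

# The user is responsible for removing the file once it is no longer needed.
unlink(AnnoyIndex_path(pre))
```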
sessionInfo()
## R version 4.3.2 Patched (2023-11-13 r85521)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.20.1 knitr_1.45 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.2 xfun_0.41
## [4] jsonlite_1.8.8 S4Vectors_0.40.2 htmltools_0.5.7
## [7] stats4_4.3.2 sass_0.4.8 rmarkdown_2.25
## [10] grid_4.3.2 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.37 BiocManager_1.30.22 compiler_4.3.2
## [19] codetools_0.2-19 Rcpp_1.0.11 BiocParallel_1.36.0
## [22] lattice_0.22-5 digest_0.6.33 R6_2.5.1
## [25] parallel_4.3.2 bslib_0.6.1 Matrix_1.6-4
## [28] tools_4.3.2 BiocGenerics_0.48.1 cachem_1.0.8