.. https://devguide.python.org/documentation/markup/#sections https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections # with overline, for parts : ###################################################################### * with overline, for chapters : ********************************************************************** = for sections : ====================================================================== - for subsections : ---------------------------------------------------------------------- ^ for subsubsections : ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ " for paragraphs : """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" .. # https://rsted.info.ucl.ac.be/ .. # https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#paragraph-level-markup .. # https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes .. # https://documatt.com/restructuredtext-reference/element/admonition.html .. # attention, caution, danger, error, hint, important, note, tip, warning, admonition, seealso .. # versionadded, versionchanged, deprecated, versionremoved, rubric, centered, hlist .. currentmodule:: scikitplot.cexternals._annoy .. _cexternals-annoy-index: ====================================================================== spotify/ANNoy (experimental) ====================================================================== This page documents the Annoy [0]_ integration shipped with scikit-plots. - High-level API: :py:mod:`~scikitplot.annoy` - Low-level bindings: :py:mod:`~scikitplot.cexternals._annoy` Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for approximate nearest-neighbor search in high-dimensional vector spaces. [1]_ TL;DR ----- - **What it is:** C++ library with Python bindings for approximate nearest-neighbor (ANN) search. [1]_ - **Origin:** Developed at Spotify (Hack Week). [1]_ - **Since:** Open sourced in 2013. [3]_ - **Index type:** Forest of random projection trees. [1]_ - **Storage:** File-based indexes can be memory-mapped (mmap) and shared across processes. [1]_ - **Tuning:** Use ``n_trees`` (build) and ``search_k`` (query) to trade accuracy for speed. [1]_ - **Metrics:** Euclidean, Manhattan, cosine (angular), Hamming, dot (inner product). [1]_ Quick start ----------- .. .. jupyter-execute .. .. code-block:: python .. prompt:: python >>> import random random.seed(0) # from annoy import AnnoyIndex # from scikitplot.cexternals._annoy import AnnoyIndex from scikitplot.annoy import AnnoyIndex f = 40 t = AnnoyIndex(f, "angular") for i in range(1000): v = [random.gauss(0, 1) for _ in range(f)] t.add_item(i, v) t.build(10) # n_trees t.save("test.ann") u = AnnoyIndex(f, "angular") u.load("test.ann") # memory-mapped (mmap) print(u.get_nns_by_item(0, 10)) Workflow -------- 1. Create an :class:`~.AnnoyIndex` with vector length ``f`` and a metric. [1]_ 2. Add items with :py:obj:`~.AnnoyIndex.add_item`. [1]_ 3. Build the forest with :py:obj:`~.AnnoyIndex.build`. [1]_ 4. Query with :py:obj:`~.AnnoyIndex.get_nns_by_item` or :py:obj:`~.AnnoyIndex.get_nns_by_vector`. [1]_ 5. Persist with :py:obj:`~.AnnoyIndex.save` and load with :py:obj:`~.AnnoyIndex.load`. [1]_ Important rules ~~~~~~~~~~~~~~~ - Every added vector must have length ``f``. - Add items before calling :py:obj:`~.AnnoyIndex.build`. [1]_ - After :py:obj:`~.AnnoyIndex.build`, the index is used for queries. To add more items, discard the forest with :py:obj:`~.AnnoyIndex.unbuild`, add items, and build again. Persistence and sharing ----------------------- Save and load ~~~~~~~~~~~~~ - :py:obj:`~.AnnoyIndex.save` writes the index to a file. - :py:obj:`~.AnnoyIndex.load` memory-maps (mmap) the file for fast loading and sharing across processes. [1]_ Prefault (optional) ~~~~~~~~~~~~~~~~~~~ Some builds expose a ``prefault`` option for :py:obj:`~.AnnoyIndex.load`. When enabled, the loader may aggressively fault pages into memory. This is platform dependent. [1]_ On-disk build (large datasets) ------------------------------ Annoy can build an index directly into a file on disk. This is intended for datasets that are too large to fit into memory during index construction. [1]_ Workflow ~~~~~~~~ 1. Create the index. 2. Call :py:obj:`~.AnnoyIndex.on_disk_build` **before** adding any items. [1]_ 3. Add items. 4. Call :py:obj:`~.AnnoyIndex.build`. 5. Query the index, or load it from other processes with :py:obj:`~.AnnoyIndex.load`. [1]_ Important rules ~~~~~~~~~~~~~~~ - Call :py:obj:`~.AnnoyIndex.on_disk_build` **before** :py:obj:`~.AnnoyIndex.add_item`. [1]_ - After building in this mode, there is no need to call :py:obj:`~.AnnoyIndex.save` because the file is already the backing store. [1]_ Example ~~~~~~~ .. .. jupyter-execute .. .. code-block:: python .. prompt:: python >>> import random from scikitplot.annoy import AnnoyIndex random.seed(0) f = 40 a = AnnoyIndex(f, "angular") a.on_disk_build("big.ann") for i in range(1000): v = [random.gauss(0, 1) for _ in range(f)] a.add_item(i, v) a.build(10) # In another process, load the same file (mmap): b = AnnoyIndex(f, "angular") b.load("big.ann") print(b.get_nns_by_item(0, 10)) Tuning ------ - ``n_trees`` (build time): Larger values usually improve recall but increase build time and memory usage. [1]_ - ``search_k`` (query time): Larger values usually improve recall but make queries slower. [1]_ If ``search_k`` is not provided, Annoy uses a default based on the number of trees and the requested neighbor count. [1]_ Practical tips -------------- Choose stable item ids ~~~~~~~~~~~~~~~~~~~~~~ Annoy uses non-negative integer item ids and allocates storage up to ``max(id) + 1``. [1]_ If your external ids are sparse or non-numeric, keep a separate mapping to compact ids. Multi-process serving ~~~~~~~~~~~~~~~~~~~~~ A common serving pattern is: 1. Build once and write an index file. 2. In each worker process, load the same file with :py:obj:`~.AnnoyIndex.load` (mmap) and query. [1]_ Developer notes (C++) --------------------- ``AnnoyIndex`` is not copyable (it contains atomic state). In C++14, avoid copy initialization from temporaries. Prefer direct initialization:: using Index = Annoy::AnnoyIndex; Index t(f); // C++14 compatible See also -------- .. seealso:: * :ref:`ANNoy ` * :ref:`cexternals/ANNoy (experimental) ` * https://github.com/spotify/annoy * https://pypi.org/project/annoy .. seealso:: * :py:obj:`~scikitplot.annoy.Index.from_low_level` * :py:mod:`pickle` (Python standard library) * Alternative ANN libraries: - https://github.com/nmslib/hnswlib - https://github.com/spotify/voyager References ---------- .. [0] `Spotify AB. (2013, Feb 20). "Approximate Nearest Neighbors Oh Yeah" Github. https://pypi.org/project/annoy `_ .. [1] https://github.com/spotify/annoy (README: API, mmap, prefault, on-disk build, tuning) .. [2] https://pypi.org/project/annoy/ (Project description) .. [3] https://erikbern.com/2015/05/03/annoy-now-without-boost-dependencies-and-with-python-3-support.html (History note: open sourced in 2013)