.. docs/source/user_guide/annoy/annoy_index_vector_database.rst

.. https://devguide.python.org/documentation/markup/#sections
   https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections

   # with overline, for parts:
   ######################################################################
   * with overline, for chapters:
   **********************************************************************
   = for sections:
   ======================================================================
   - for subsections:
   ----------------------------------------------------------------------
   ^ for subsubsections:
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   " for paragraphs:
   """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

.. # https://rsted.info.ucl.ac.be/
.. # https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#paragraph-level-markup
.. # https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
.. # https://documatt.com/restructuredtext-reference/element/admonition.html
.. # attention, caution, danger, error, hint, important, note, tip, warning, admonition, seealso
.. # versionadded, versionchanged, deprecated, versionremoved, rubric, centered, hlist

.. https://waldyrious.net/rst-playground/
.. https://rst-tutorial.yakimka.me/playground

.. currentmodule:: scikitplot.annoy

.. _annoy_index_vector_database:

======================================================================
Vector Similarity Search and Vector Database
======================================================================

This page explains vector databases in a simple way.

- A **vector database** stores vectors (numbers).
- It can search for *similar vectors* very fast.
- This is useful for AI apps.

A **vector database** stores, manages, and indexes high-dimensional
vectors and is designed for low-latency similarity queries.

Vector databases are popular for AI because they work well with
unstructured data such as text, images, and audio (after you convert
them into embeddings).

Vector similarity search
------------------------

Vector similarity search finds the items whose vectors are *closest* to
a query vector. It is widely used in retrieval tasks such as semantic
search, recommendations, and clustering.

Distance and similarity metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Different metrics define "closeness" in different ways:

- **Dot product (inner product)** is often used as a similarity score.

  - A larger :math:`\mathbf{u}\cdot\mathbf{v}` means more similar.
  - Some libraries expect a *distance* to minimize, so they use the
    **negative dot product**.

- **Cosine similarity** measures the *direction* (angle) between vectors.

  - Range: **-1 to 1** (for many embedding use cases, values are often
    **0 to 1**).
  - **1** means the vectors point in the same direction (most similar).
  - **0** means the vectors are orthogonal (no directional similarity).
  - **-1** means the vectors point in opposite directions (most dissimilar).

- **Cosine distance** converts cosine similarity into a distance.

  - A common definition is :math:`1 - \text{cosine\_similarity}`.
  - Range (with this definition): **0 to 2**.
  - **0** means most similar; values closer to **2** mean more dissimilar.

- **Euclidean (L2) distance** measures straight-line distance.

  - Think: *as-the-crow-flies* distance in space.
  - Larger values mean the vectors are farther apart.

- **Manhattan (L1) distance** measures grid-like distance.

  - Think: moving along city blocks (right/left/up/down).
  - Often more robust to outliers than L2.

- **Hamming distance** counts how many positions differ.

  - Used for **binary vectors** (0/1) or **equal-length strings**.
  - It is the number of indices where :math:`u_i \neq v_i`.

Formulas
~~~~~~~~

For two vectors :math:`\mathbf{u}` and :math:`\mathbf{v}` of length
:math:`k`:

.. note::

   Dot product is *not* scale-invariant: if you multiply
   :math:`\mathbf{u}` by 2, the dot product doubles. Minimizing
   :math:`d_{\text{dot}} = -(\mathbf{u}\cdot\mathbf{v})` produces the
   same ranking as maximizing :math:`\mathbf{u}\cdot\mathbf{v}` (it
   just flips the sign). If vectors are L2-normalized, dot product and
   cosine similarity become equivalent.

- **Dot product (similarity score)**:

  .. math::

     s_{\text{dot}}(\mathbf{u}, \mathbf{v})
     = \mathbf{u} \cdot \mathbf{v}
     = \sum_{i=1}^{k} u_i v_i

- **Negative dot product** (dot product as a distance to minimize):

  .. math::

     d_{\text{dot}}(\mathbf{u}, \mathbf{v})
     = -(\mathbf{u} \cdot \mathbf{v})
     = -\sum_{i=1}^{k} u_i v_i

- **Cosine similarity**:

  .. math::

     \text{cos\_sim}(\mathbf{u}, \mathbf{v})
     = \frac{\mathbf{u} \cdot \mathbf{v}}
            {\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}

- **Cosine distance** (common definition):

  .. math::

     d_{\text{cos}}(\mathbf{u}, \mathbf{v})
     = 1 - \text{cos\_sim}(\mathbf{u}, \mathbf{v})

- **Euclidean (L2) distance**:

  .. math::

     d_{2}(\mathbf{u}, \mathbf{v})
     = \sqrt{\sum_{i=1}^{k} (u_i - v_i)^2}

- **Manhattan (L1) distance**:

  .. math::

     d_{1}(\mathbf{u}, \mathbf{v})
     = \sum_{i=1}^{k} \lvert u_i - v_i \rvert

- **Hamming distance**:

  .. math::

     d_{\text{ham}}(\mathbf{u}, \mathbf{v})
     = \sum_{i=1}^{k} \mathbb{1}[u_i \neq v_i]
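The snippet below is a minimal NumPy sketch of these formulas
(illustrative only; the vectors ``u``, ``v``, ``a``, and ``b`` are made
up for the example):

.. code-block:: python

    import numpy as np

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([2.0, 4.0, 5.0])

    # Dot product: a similarity score (larger = more similar).
    dot = float(np.dot(u, v))

    # Negative dot product: the same ranking, expressed as a distance.
    d_dot = -dot

    # Cosine similarity: dot product scaled by both vector lengths.
    cos_sim = dot / (np.linalg.norm(u) * np.linalg.norm(v))

    # Cosine distance (common definition): 1 - cosine similarity.
    d_cos = 1.0 - cos_sim

    # Euclidean (L2) distance: straight-line distance.
    d2 = float(np.linalg.norm(u - v))

    # Manhattan (L1) distance: sum of absolute coordinate differences.
    d1 = float(np.sum(np.abs(u - v)))

    # Hamming distance: number of positions that differ (binary vectors).
    a = np.array([1, 0, 1, 1])
    b = np.array([1, 1, 0, 1])
    d_ham = int(np.sum(a != b))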
Vector database vs “vector index library”
------------------------------------------

A *vector index library* (example: Annoy) is usually a library that you
run inside your application process. A *vector database* is usually a
separate service (or a database extension) that focuses on:

- storing vectors + metadata
- indexing vectors for fast search
- scaling to large datasets and many users
- operational features (replication, backups, monitoring, access control)

Vector databases (example: pgvector with PostgreSQL) store vectors and
support similarity search, often using approximate nearest neighbor
(ANN) methods in the retrieval pipeline for fast results.

Pros and cons of vector search
------------------------------

Vector similarity search can be very effective:

- Efficient searching with special index structures (fast retrieval)
- High accuracy for semantic similarity (meaning-based matches)
- Range queries (search within a threshold)

But there are also limitations:

- High-dimensional data can be hard (needs special handling)
- Scalability can be challenging for very large datasets
- Distance metric choice matters (wrong metric = bad results)
- Indexing/storage needs can be high (large vectors take space)

5 practical tips
----------------

Instaclustr suggests these practical steps for good results:

1. Clean and normalize data (reduce noise; keep a common scale; see the
   sketch after this list)
2. Configure and tune algorithms (balance speed and accuracy)
3. Use sharding / partitioning for large datasets
4. Consider hardware acceleration (GPU/TPU) when needed
5. Handle high-dimensional data (e.g., dimensionality reduction when
   useful)
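Tip 1 also interacts with the metric choice: as noted above,
L2-normalized vectors make dot product and cosine similarity
equivalent. A minimal NumPy sketch (the array names and sizes are made
up for illustration):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

    # L2-normalize each row so every vector has unit length.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)

    # After normalization, a dot-product search ranks items exactly
    # like a cosine-similarity search.
    query = normalized[0]
    scores = normalized @ query          # dot product against all rows
    top10 = np.argsort(-scores)[:10]     # indices of the 10 best matches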
Open source options
-------------------

Instaclustr lists popular open source options, including:

Dedicated / vector-native options (examples):

- Elasticsearch
- Faiss
- Qdrant
- OpenSearch
- Chroma
- Milvus
- Weaviate

General-purpose databases with vector support (examples):

- PostgreSQL (via extensions such as pgvector)
- Others, depending on your stack

How to choose (simple rules)
----------------------------

Choose a vector index library (like Annoy; see the sketch at the end of
this page) when:

- you want something small and local
- you control the process memory
- you can rebuild the index when needed

Choose a vector database when:

- you need a shared service for many users/apps
- you need storage + metadata filters + operations (backup/monitoring)
- you need easy scaling and high availability
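For the index-library case, here is a minimal sketch using the upstream
``annoy`` package (the data, file name, and parameter values are made
up for illustration; adapt the import to your wrapper, e.g.
:mod:`scikitplot.annoy`, if its API differs):

.. code-block:: python

    import numpy as np
    from annoy import AnnoyIndex

    f = 64  # vector dimensionality
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(1000, f)).astype(np.float32)

    # "angular" is Annoy's cosine-like metric; "euclidean", "manhattan",
    # "hamming", and "dot" are also available.
    index = AnnoyIndex(f, "angular")
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)

    # More trees give better recall at the cost of a bigger index.
    index.build(10)
    index.save("demo.ann")  # saved indexes are memory-mapped on load

    # Query: ids and distances of the 10 approximate nearest neighbors.
    query = vectors[0]
    ids, dists = index.get_nns_by_vector(query, 10, include_distances=True)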