Vector Similarity Search and Vector Database#
This page explains vector databases and vector similarity search in a simple way.
A vector database stores, manages, and indexes high-dimensional vectors and is designed for low-latency similarity queries.
Vector databases are popular for AI applications because they work well with unstructured data like text, images, and audio (after you convert it into embeddings).
Vector similarity search#
Vector similarity search finds the items whose vectors are closest to a query vector. It is widely used in retrieval tasks such as semantic search, recommendations, and clustering.
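For example, a brute-force top-k search can be sketched in a few lines of NumPy (the dataset, dimensions, and k are toy values chosen purely for illustration):

```python
import numpy as np

vectors = np.random.rand(1000, 8)   # toy dataset: 1000 stored vectors, 8 dimensions
query = np.random.rand(8)

# Score every stored vector against the query with cosine similarity,
# then keep the indices of the 5 highest-scoring vectors.
scores = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
top_k = np.argsort(scores)[::-1][:5]
print(top_k)  # indices of the 5 most similar vectors
```

Real systems avoid scanning every vector like this; the index structures discussed below exist precisely to skip most of that work.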
Distance and similarity metrics#
Different metrics define “closeness” in different ways:
Dot product (inner product) is often used as a similarity score: a larger \(\mathbf{u}\cdot\mathbf{v}\) means more similar. Some libraries expect a distance to minimize, so they use the negative dot product instead.

Cosine similarity measures the direction (angle) between vectors. Its range is -1 to 1 (for many embedding use cases, values are often 0 to 1): 1 means the vectors point in the same direction (most similar), 0 means they are orthogonal (no directional similarity), and -1 means they point in opposite directions (most dissimilar).

Cosine distance converts cosine similarity into a distance. A common definition is \(1 - \text{cosine\_similarity}\), which ranges from 0 to 2: 0 means most similar, and values closer to 2 mean more dissimilar.

Euclidean (L2) distance measures straight-line distance: think of it as the as-the-crow-flies distance in space. Larger values mean the vectors are farther apart.

Manhattan (L1) distance measures grid-like distance, like moving along city blocks (right/left/up/down). It is often more robust to outliers than L2.

Hamming distance counts how many positions differ, i.e. the number of indices where \(u_i \neq v_i\). It is used for binary vectors (0/1) or equal-length strings.
Formulas#
For two vectors \(\mathbf{u}\) and \(\mathbf{v}\) of length \(k\):
Note
Dot product is not scale-invariant: if you multiply \(\mathbf{u}\) by 2, the dot product doubles. Minimizing \(d_{\text{dot}} = -(\mathbf{u}\cdot\mathbf{v})\) produces the same ranking as maximizing \(\mathbf{u}\cdot\mathbf{v}\) (it just flips the sign). If the vectors are L2-normalized, dot product and cosine similarity become equivalent.
Dot product (similarity score):

\[s_{\text{dot}}(\mathbf{u}, \mathbf{v}) = \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{k} u_i v_i\]

Negative dot product (dot product as a distance to minimize):

\[d_{\text{dot}}(\mathbf{u}, \mathbf{v}) = -(\mathbf{u} \cdot \mathbf{v}) = -\sum_{i=1}^{k} u_i v_i\]

Cosine similarity:

\[\text{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}\]

Cosine distance (common definition):

\[d_{\text{cos}}(\mathbf{u}, \mathbf{v}) = 1 - \text{cos\_sim}(\mathbf{u}, \mathbf{v})\]

Euclidean (L2) distance:

\[d_{2}(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{k} (u_i - v_i)^2}\]

Manhattan (L1) distance:

\[d_{1}(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{k} \lvert u_i - v_i \rvert\]

Hamming distance:

\[d_{\text{ham}}(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{k} \mathbb{1}[u_i \neq v_i]\]
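These formulas translate directly into code. Below is a minimal NumPy sketch of each metric (the helper names and example vectors are illustrative, not from any particular library); it also demonstrates the note above, that dot product matches cosine similarity after L2 normalization:

```python
import numpy as np

def dot_similarity(u, v):
    # Dot product as a similarity score (larger = more similar).
    return float(np.dot(u, v))

def cosine_similarity(u, v):
    # Angle-based similarity in [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    # Common definition: 1 - cosine similarity, in [0, 2].
    return 1.0 - cosine_similarity(u, v)

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

def manhattan_distance(u, v):
    return float(np.sum(np.abs(u - v)))

def hamming_distance(u, v):
    # Number of positions where the vectors differ.
    return int(np.sum(u != v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

# v = 2u, so cosine similarity is 1.0 even though the dot product scales:
print(cosine_similarity(u, v))   # 1.0
print(dot_similarity(u, v))      # 28.0

# After L2 normalization, dot product equals cosine similarity:
u_n, v_n = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.isclose(dot_similarity(u_n, v_n), cosine_similarity(u, v)))  # True
```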
Vector database vs “vector index library”#
A vector index library (example: Annoy) runs inside your application process and is managed by your own code.
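For instance, a minimal Annoy sketch looks roughly like this (the ids, vectors, and tree count are illustrative assumptions):

```python
from annoy import AnnoyIndex  # pip install annoy

dim = 3                             # vector dimensionality (toy value)
index = AnnoyIndex(dim, "angular")  # "angular" is Annoy's cosine-style metric

# Add a few toy vectors; you manage the integer ids yourself.
index.add_item(0, [1.0, 0.0, 0.0])
index.add_item(1, [0.0, 1.0, 0.0])
index.add_item(2, [0.9, 0.1, 0.0])

index.build(10)  # build 10 trees; more trees = better accuracy, larger index

# ids of the 2 nearest neighbors of the query vector
print(index.get_nns_by_vector([1.0, 0.0, 0.0], 2))
```

The index lives in your process memory, which is exactly the trade-off described here: simple and fast locally, but nothing is shared, replicated, or backed up for you.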
A vector database is usually a separate service (or a database extension) that focuses on:
storing vectors + metadata
indexing vectors for fast search
scaling to large datasets and many users
operational features (replication, backups, monitoring, access control)
Vector databases (example: PostgreSQL with the pgvector extension) store vectors alongside metadata and support similarity search, often using approximate nearest neighbor (ANN) indexes for fast retrieval.
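As an illustrative sketch (not an official recipe), querying pgvector from Python might look like this, assuming a reachable PostgreSQL instance, a hypothetical items table, and the psycopg2 driver with the pgvector helper package:

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # pip install pgvector psycopg2-binary

conn = psycopg2.connect("dbname=demo")  # hypothetical connection string
register_vector(conn)                   # teach the driver about the vector type

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, embedding vector(3))"
    )
    cur.execute(
        "INSERT INTO items (embedding) VALUES (%s), (%s)",
        (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])),
    )

    # "<->" is pgvector's L2 distance operator; "<=>" is cosine distance.
    cur.execute(
        "SELECT id FROM items ORDER BY embedding <-> %s LIMIT 1",
        (np.array([0.9, 0.1, 0.0]),),
    )
    print(cur.fetchone())

conn.commit()
```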
Pros and cons of vector search#
Vector similarity search can be very effective:
Efficient searching with special index structures (fast retrieval)
High accuracy for semantic similarity (meaning-based matches)
Range queries (search within a distance threshold; see the sketch after this list)
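For example, a range query can be sketched as a simple distance-threshold filter (a NumPy illustration; the 0.5 threshold is arbitrary):

```python
import numpy as np

vectors = np.random.rand(1000, 8)   # toy dataset: 1000 vectors, 8 dimensions
query = np.random.rand(8)

# Range query: return all vectors within L2 distance 0.5 of the query.
distances = np.linalg.norm(vectors - query, axis=1)
within_range = np.where(distances <= 0.5)[0]
print(f"{len(within_range)} vectors within the threshold")
```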
But there are also limitations:
High-dimensional data is harder to index and search efficiently (the "curse of dimensionality" calls for special handling)
Scalability can be challenging for very large datasets
Distance metric choice matters (wrong metric = bad results)
Indexing/storage needs can be high (large vectors take space)
5 practical tips#
Instaclustr suggests these practical steps for good results:
Clean and normalize data (reduce noise; keep a common scale)
Configure and tune algorithms (balance speed and accuracy)
Use sharding / partitioning for large datasets
Consider hardware acceleration (GPU/TPU) when needed
Handle high-dimensional data (e.g., dimensionality reduction when useful; see the sketch after this list)
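As a small sketch of the first and last tips (assuming scikit-learn; the sizes and target dimensionality are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA  # pip install scikit-learn

embeddings = np.random.rand(500, 384)  # toy stand-in for model embeddings

# Tip 1: normalize to unit length so dot product behaves like cosine similarity.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Tip 5: reduce dimensionality when it helps index size and query speed.
reduced = PCA(n_components=64).fit_transform(normalized)
print(reduced.shape)  # (500, 64)
```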
Open source options#
Instaclustr lists popular open source options including:
Dedicated search engines, libraries, and vector-native options (examples)
Elasticsearch
Faiss
Qdrant
OpenSearch
Chroma
Milvus
Weaviate
General-purpose databases with vector support (examples)
PostgreSQL (via extensions such as pgvector)
Others depending on your stack
How to choose (simple rules)#
Choose a vector index library (like Annoy) when:
you want something small and local
you control the process memory
you can rebuild the index when needed
Choose a vector database when:
you need a shared service for many users/apps
you need storage + metadata filters + operations (backup/monitoring)
you need easy scaling and high availability