Vector Similarity Search and Vector Database#
This page explains vector databases and vector similarity search in a simple way.
A vector database stores, manages, and indexes high-dimensional vectors and is designed for low-latency similarity queries.
Vector databases are popular for AI applications because they work well with unstructured data like text, images, and audio (after you convert it into embeddings).
Vector similarity search#
Vector similarity search finds the items whose vectors are closest to a query vector. It is widely used in retrieval tasks such as semantic search, recommendations, and clustering.
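For example, a brute-force top-k search can be sketched in a few lines of NumPy (the dataset, dimensions, and k are toy values chosen purely for illustration):

```python
import numpy as np

vectors = np.random.rand(1000, 8)   # toy dataset: 1000 stored vectors, 8 dimensions
query = np.random.rand(8)

# Score every stored vector against the query with cosine similarity,
# then keep the indices of the 5 highest-scoring vectors.
scores = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
top_k = np.argsort(scores)[::-1][:5]
print(top_k)  # indices of the 5 most similar vectors
```

Real systems avoid scanning every vector like this; the index structures discussed below exist precisely to skip most of that work.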
Distance and similarity metrics#
Different metrics define “closeness” in different ways:
Dot product (inner product) is often used as a similarity score: a larger \(\mathbf{u}\cdot\mathbf{v}\) means more similar. Some libraries expect a distance to minimize, so they use the negative dot product instead.

Cosine similarity measures the direction (angle) between vectors. Its range is -1 to 1 (for many embedding use cases, values are often 0 to 1): 1 means the vectors point in the same direction (most similar), 0 means they are orthogonal (no directional similarity), and -1 means they point in opposite directions (most dissimilar).

Cosine distance converts cosine similarity into a distance. A common definition is \(1 - \text{cosine\_similarity}\), which ranges from 0 to 2: 0 means most similar, and values closer to 2 mean more dissimilar.

Euclidean (L2) distance measures straight-line distance: think of it as the as-the-crow-flies distance in space. Larger values mean the vectors are farther apart.

Manhattan (L1) distance measures grid-like distance, like moving along city blocks (right/left/up/down). It is often more robust to outliers than L2.

Hamming distance counts how many positions differ, i.e. the number of indices where \(u_i \neq v_i\). It is used for binary vectors (0/1) or equal-length strings.
Formulas#
For two vectors \(\mathbf{u}\) and \(\mathbf{v}\) of length \(k\):
Note
Dot product is not scale-invariant: if you multiply \(\mathbf{u}\) by 2, the dot product doubles. Minimizing \(d_{\text{dot}} = -(\mathbf{u}\cdot\mathbf{v})\) produces the same ranking as maximizing \(\mathbf{u}\cdot\mathbf{v}\) (it just flips the sign). If the vectors are L2-normalized, dot product and cosine similarity become equivalent.
Dot product (similarity score):

\[s_{\text{dot}}(\mathbf{u}, \mathbf{v}) = \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{k} u_i v_i\]

Negative dot product (dot product as a distance to minimize):

\[d_{\text{dot}}(\mathbf{u}, \mathbf{v}) = -(\mathbf{u} \cdot \mathbf{v}) = -\sum_{i=1}^{k} u_i v_i\]

Cosine similarity:

\[\text{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}\]

Cosine distance (common definition):

\[d_{\text{cos}}(\mathbf{u}, \mathbf{v}) = 1 - \text{cos\_sim}(\mathbf{u}, \mathbf{v})\]

Euclidean (L2) distance:

\[d_{2}(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{k} (u_i - v_i)^2}\]

Manhattan (L1) distance:

\[d_{1}(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{k} \lvert u_i - v_i \rvert\]

Hamming distance:

\[d_{\text{ham}}(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{k} \mathbb{1}[u_i \neq v_i]\]
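These formulas translate directly into code. Below is a minimal NumPy sketch of each metric (the helper names and example vectors are illustrative, not from any particular library); it also demonstrates the note above, that dot product matches cosine similarity after L2 normalization:

```python
import numpy as np

def dot_similarity(u, v):
    # Dot product as a similarity score (larger = more similar).
    return float(np.dot(u, v))

def cosine_similarity(u, v):
    # Angle-based similarity in [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    # Common definition: 1 - cosine similarity, in [0, 2].
    return 1.0 - cosine_similarity(u, v)

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

def manhattan_distance(u, v):
    return float(np.sum(np.abs(u - v)))

def hamming_distance(u, v):
    # Number of positions where the vectors differ.
    return int(np.sum(u != v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

# v = 2u, so cosine similarity is 1.0 even though the dot product scales:
print(cosine_similarity(u, v))   # 1.0
print(dot_similarity(u, v))      # 28.0

# After L2 normalization, dot product equals cosine similarity:
u_n, v_n = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.isclose(dot_similarity(u_n, v_n), cosine_similarity(u, v)))  # True
```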
Vector database vs “vector index library”#
A vector index library (example: Annoy) runs inside your application process and is managed by your own code.
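For instance, a minimal Annoy sketch looks roughly like this (the ids, vectors, and tree count are illustrative assumptions):

```python
from annoy import AnnoyIndex  # pip install annoy

dim = 3                             # vector dimensionality (toy value)
index = AnnoyIndex(dim, "angular")  # "angular" is Annoy's cosine-style metric

# Add a few toy vectors; you manage the integer ids yourself.
index.add_item(0, [1.0, 0.0, 0.0])
index.add_item(1, [0.0, 1.0, 0.0])
index.add_item(2, [0.9, 0.1, 0.0])

index.build(10)  # build 10 trees; more trees = better accuracy, larger index

# ids of the 2 nearest neighbors of the query vector
print(index.get_nns_by_vector([1.0, 0.0, 0.0], 2))
```

The index lives in your process memory, which is exactly the trade-off described here: simple and fast locally, but nothing is shared, replicated, or backed up for you.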
A vector database is usually a separate service (or a database extension) that focuses on:
storing vectors + metadata
indexing vectors for fast search
scaling to large datasets and many users
operational features (replication, backups, monitoring, access control)
Vector databases (example: PostgreSQL with the pgvector extension) store vectors alongside metadata and support similarity search, often using approximate nearest neighbor (ANN) indexes for fast retrieval.
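As an illustrative sketch (not an official recipe), querying pgvector from Python might look like this, assuming a reachable PostgreSQL instance, a hypothetical items table, and the psycopg2 driver with the pgvector helper package:

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # pip install pgvector psycopg2-binary

conn = psycopg2.connect("dbname=demo")  # hypothetical connection string
register_vector(conn)                   # teach the driver about the vector type

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, embedding vector(3))"
    )
    cur.execute(
        "INSERT INTO items (embedding) VALUES (%s), (%s)",
        (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])),
    )

    # "<->" is pgvector's L2 distance operator; "<=>" is cosine distance.
    cur.execute(
        "SELECT id FROM items ORDER BY embedding <-> %s LIMIT 1",
        (np.array([0.9, 0.1, 0.0]),),
    )
    print(cur.fetchone())

conn.commit()
```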
Pros and cons of vector search#
Vector similarity search can be very effective:
Efficient searching with special index structures (fast retrieval)
High accuracy for semantic similarity (meaning-based matches)
Range queries (search within a distance threshold; see the sketch after this list)
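For example, a range query can be sketched as a simple distance-threshold filter (a NumPy illustration; the 0.5 threshold is arbitrary):

```python
import numpy as np

vectors = np.random.rand(1000, 8)   # toy dataset: 1000 vectors, 8 dimensions
query = np.random.rand(8)

# Range query: return all vectors within L2 distance 0.5 of the query.
distances = np.linalg.norm(vectors - query, axis=1)
within_range = np.where(distances <= 0.5)[0]
print(f"{len(within_range)} vectors within the threshold")
```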
But there are also limitations:
High-dimensional data is harder to index and search efficiently (the "curse of dimensionality" calls for special handling)
Scalability can be challenging for very large datasets
Distance metric choice matters (wrong metric = bad results)
Indexing/storage needs can be high (large vectors take space)
5 practical tips#
Instaclustr suggests these practical steps for good results:
Clean and normalize data (reduce noise; keep a common scale)
Configure and tune algorithms (balance speed and accuracy)
Use sharding / partitioning for large datasets
Consider hardware acceleration (GPU/TPU) when needed
Handle high-dimensional data (e.g., dimensionality reduction when useful; see the sketch after this list)
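As a small sketch of the first and last tips (assuming scikit-learn; the sizes and target dimensionality are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA  # pip install scikit-learn

embeddings = np.random.rand(500, 384)  # toy stand-in for model embeddings

# Tip 1: normalize to unit length so dot product behaves like cosine similarity.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Tip 5: reduce dimensionality when it helps index size and query speed.
reduced = PCA(n_components=64).fit_transform(normalized)
print(reduced.shape)  # (500, 64)
```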
Open source options#
Instaclustr lists popular open source options including:
Dedicated search engines, libraries, and vector-native options (examples)
Elasticsearch
Faiss
Qdrant
OpenSearch
Chroma
Milvus
Weaviate
General-purpose databases with vector support (examples)
PostgreSQL (via extensions such as pgvector)
Others depending on your stack
How to choose (simple rules)#
Choose a vector index library (like Annoy) when:
you want something small and local
you control the process memory
you can rebuild the index when needed
Choose a vector database when:
you need a shared service for many users/apps
you need storage + metadata filters + operations (backup/monitoring)
you need easy scaling and high availability