SimilarityIndex

class scikitplot.corpus.SimilarityIndex(config=None)

Multi-mode similarity index over CorpusDocument collections.

Parameters:
config : SearchConfig or None, optional

Default search configuration. Can be overridden per query.

See also

scikitplot.corpus._schema.MatchMode

Enum of match modes.

scikitplot.corpus._adapters

Convert results to LangChain / MCP format.

Notes

User note: Build the index once, query many times:

index = SimilarityIndex()
index.build(documents)
results = index.search("What did Hamlet say about death?")

Developer note: The index stores references to the original documents. If documents are mutated after building, results are undefined.
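
The developer note above can be illustrated with a toy index. This is a minimal sketch under stated assumptions, not the SimilarityIndex implementation: `ToyDocument` and `ToyIndex` are hypothetical stand-ins that keep references to the caller's documents and cache token sets at build time, which is one way a post-build mutation can make results inconsistent.

```python
from dataclasses import dataclass


@dataclass
class ToyDocument:
    """Hypothetical stand-in for CorpusDocument; only ``text`` is modelled."""
    text: str


class ToyIndex:
    """Hypothetical keyword index, NOT the SimilarityIndex implementation."""

    def build(self, documents):
        if not documents:
            raise ValueError("documents must not be empty")
        # Keep references to the caller's objects (no copy) ...
        self._docs = list(documents)
        # ... but precompute the token sets once, at build time.
        self._tokens = [set(d.text.lower().split()) for d in self._docs]

    def search(self, query):
        terms = set(query.lower().split())
        return [d for d, toks in zip(self._docs, self._tokens) if terms & toks]


docs = [ToyDocument("to be or not to be"), ToyDocument("all the world is a stage")]
index = ToyIndex()
index.build(docs)

hits_before = index.search("stage")

# Mutating a document after build() leaves the cached tokens stale.
docs[1].text = "brevity is the soul of wit"
hits_after = index.search("stage")  # still matches on the old tokens
stale_text = hits_after[0].text     # but the returned object shows the new text
```

In this sketch the query "stage" still returns the mutated document even though its text no longer contains that word. Rebuilding the index after any change avoids the mismatch.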

Examples

>>> index = SimilarityIndex()
>>> # index.build(corpus_documents)
>>> # results = index.search("quantum computing")

build(documents)

Build the index from CorpusDocument instances.

Parameters:
documents : Sequence[CorpusDocument]

Documents to index. Each document must have text; embedding, tokens, and normalized_text are optional.

Raises:
ValueError

If documents is empty.

Return type:

None

property has_embeddings: bool

Whether dense embeddings are indexed.

property n_documents: int

Number of indexed documents.

search(query, *, config=None, query_embedding=None)

Search the index.

Parameters:
query : str

Query text.

config : SearchConfig or None, optional

Override default config for this query.

query_embedding : array-like or None, optional

Pre-computed query embedding. Required for SEMANTIC mode if no embedding engine is attached.

Returns:
list[SearchResult]

Results sorted by descending score.

Return type:

list[SearchResult]
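
To make "results sorted by descending score" concrete for a pre-computed query embedding, here is a minimal sketch that assumes cosine similarity as the scoring function. The source does not state which metric SEMANTIC mode uses, so both the metric and the helper names `cosine` and `rank_by_embedding` are hypothetical and not part of scikitplot.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_embedding(query_embedding, doc_embeddings, top_k=3):
    """Return (doc_index, score) pairs sorted by descending score."""
    scored = [(i, cosine(query_embedding, e)) for i, e in enumerate(doc_embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D embeddings, assumed purely for illustration.
doc_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
query_embedding = [1.0, 0.0]

ranked = rank_by_embedding(query_embedding, doc_embeddings)
```

Here the first document points in the same direction as the query and ranks first, the diagonal vector ranks second, and the orthogonal vector ranks last.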