FixedWindowChunker

class scikitplot.corpus.FixedWindowChunker(config=None, multilang_config=None)

Produce fixed-size sliding-window chunks over a document.

Handles all scripts: detect_script identifies the script of the text, and split_cjk_chars splits East Asian scripts that are written without spaces into individual characters. Multilang metadata is attached to each chunk when enabled.

Parameters:

config (FixedWindowChunkerConfig | None, optional): Chunker configuration.
multilang_config (MultilangConfig | None, optional): Multilang feature flags.

Notes

User note (multilang): Fixed-window chunking is script-aware in token-unit mode (unit=TOKENS): CJK, Hiragana, and Katakana text is split at the character level rather than on whitespace. Char-unit mode (unit=CHARS) ignores grapheme-cluster boundaries and slices raw code points, so a window edge can fall inside a multi-code-point grapheme such as a base letter followed by a combining mark. This is acceptable for RAG pipelines that only need fixed-size embedding windows.
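The two behaviours described in the user note can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation: is_cjk_like and to_units are hypothetical helpers standing in for detect_script and split_cjk_chars, and the name-based script test is only a rough approximation.

```python
import unicodedata


def is_cjk_like(ch: str) -> bool:
    """Rough stand-in for script detection, based on Unicode character names."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA"))


def to_units(text: str) -> list[str]:
    """Whitespace tokens, except tokens containing CJK/kana: one unit per character."""
    units: list[str] = []
    for token in text.split():
        if any(is_cjk_like(ch) for ch in token):
            units.extend(token)  # a str is iterable, so this adds each character
        else:
            units.append(token)
    return units


latin_units = to_units("hello world")  # ['hello', 'world']
cjk_units = to_units("東京タワー")  # ['東', '京', 'タ', 'ワ', 'ー']

# Code point slicing ignores grapheme clusters: 'e' + U+0301 renders as one
# accented letter, but a 4-code-point window keeps only the bare 'e'.
accented = "cafe\u0301 au lait"
first_window = accented[:4]  # 'cafe'; the combining accent starts the next slice
```

If splitting inside a grapheme matters for a given encoder, normalizing the text to NFC before chunking reduces (but does not eliminate) such cases.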

Developer note: Inherits MultilangMixin. Every chunk produced by chunk() carries metadata["multilang"] when multilang_config.enabled=True.

Examples

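>>> # Assumption: the config class and WindowUnit are exported from
>>> # scikitplot.corpus alongside FixedWindowChunker; adjust if they live elsewhere.
>>> from scikitplot.corpus import (
...     FixedWindowChunker, FixedWindowChunkerConfig, WindowUnit
... )
>>> cfg = FixedWindowChunkerConfig(
...     window_size=20, step_size=10, unit=WindowUnit.CHARS
... )
>>> chunker = FixedWindowChunker(cfg)
>>> result = chunker.chunk("The quick brown fox jumps over the lazy dog")
>>> result.chunks[0].text
'The quick brown fox '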
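The window arithmetic behind this example can be sketched in plain Python. char_windows is a hypothetical helper, and the handling of the final short window is an assumption: the library may pad, drop, or merge the tail differently.

```python
def char_windows(text: str, window_size: int, step_size: int) -> list[tuple[int, str]]:
    """Return (start_offset, slice) pairs for a sliding character window."""
    windows = []
    start = 0
    while start < len(text):
        windows.append((start, text[start:start + window_size]))
        if start + window_size >= len(text):
            break  # this window already reaches the end of the text
        start += step_size
    return windows


sentence = "The quick brown fox jumps over the lazy dog"
windows = char_windows(sentence, window_size=20, step_size=10)
starts = [start for start, _ in windows]  # [0, 10, 20, 30]
first_text = windows[0][1]  # 'The quick brown fox '
```

With step_size smaller than window_size, consecutive windows overlap by window_size - step_size characters (10 here), which keeps context that straddles a window edge in at least one chunk.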

attach_embedding(chunk, vector, *, model_name=None, model_version=None)

Return a new Chunk with an embedding attached.

Does NOT mutate the original Chunk (frozen dataclass).

Parameters:

chunk (Chunk): Any chunk produced by this chunker.
vector (list[float]): Dense embedding vector.
model_name (str | None, optional): Encoder model name.
model_version (str | None, optional): Encoder model version.

Returns:

Chunk: New frozen instance with metadata["multilang"]["embedding"] populated and metadata["embedding"] set at top level for compatibility with EmbeddedChunk.

Notes

User note: For batch embedding, use attach_embedding_batch, which avoids per-chunk dict copies.

Developer note: Two embedding locations are written:

chunk.metadata["embedding"] — top-level key compatible with EmbeddedChunk and vector store adapters.
chunk.metadata["multilang"]["embedding"] — inside the multilang bundle for model provenance tracking.
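The non-mutating update and the two write locations can be sketched with a frozen dataclass and dataclasses.replace. DemoChunk is a hypothetical stand-in for the library's Chunk, and the shape of the nested "embedding" entry (vector plus model fields) is assumed for illustration, as is the "script" key in the demo data.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class DemoChunk:
    """Hypothetical stand-in for the library's Chunk; not the real class."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def attach_embedding_sketch(chunk, vector, *, model_name=None, model_version=None):
    """Return a new chunk; the input chunk and its metadata dict are left untouched."""
    multilang = dict(chunk.metadata.get("multilang", {}))  # copy, do not mutate
    multilang["embedding"] = {  # assumed layout of the provenance record
        "vector": list(vector),
        "model_name": model_name,
        "model_version": model_version,
    }
    metadata = {**chunk.metadata, "embedding": list(vector), "multilang": multilang}
    return replace(chunk, metadata=metadata)


original = DemoChunk("hello", {"multilang": {"script": "Latin"}})
embedded = attach_embedding_sketch(original, [0.1, 0.2], model_name="demo-encoder")
```

Copying the nested dict before writing matters here: frozen=True only blocks attribute assignment, so mutating chunk.metadata["multilang"] in place would still alter the original chunk.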

attach_embedding_batch(chunks, vectors, *, model_name=None, model_version=None)

Return a new list of chunks with embeddings attached.

Parameters:

chunks (list[Chunk]): Chunks from this chunker.
vectors (list[list[float]]): One embedding vector per chunk. Must have the same length as chunks.
model_name (str | None, optional): Encoder model name.
model_version (str | None, optional): Encoder model version.

Returns:

list[Chunk]: New list; originals are unmodified.

Raises:

ValueError: If len(chunks) != len(vectors).
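The length contract can be sketched as a check followed by a pairwise zip. attach_batch_sketch and the dict-based demo chunks are hypothetical; the real method works on Chunk instances and may build its results differently.

```python
def attach_batch_sketch(chunks, vectors, attach_one):
    """Pair each chunk with its vector; refuse mismatched lengths up front."""
    if len(chunks) != len(vectors):
        raise ValueError(f"got {len(chunks)} chunks but {len(vectors)} vectors")
    return [attach_one(chunk, vector) for chunk, vector in zip(chunks, vectors)]


demo_chunks = [{"text": "alpha"}, {"text": "beta"}]
demo_vectors = [[0.1, 0.2], [0.3, 0.4]]
with_vectors = attach_batch_sketch(
    demo_chunks,
    demo_vectors,
    lambda chunk, vector: {**chunk, "embedding": vector},  # builds a new dict
)
```

Checking lengths before zipping is the important step: zip silently truncates to the shorter input, which would otherwise drop chunks without any error.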

chunk(text, doc_id=None, extra_metadata=None)

Split text into fixed-window chunks.

Parameters:

text (str): Raw document text.
doc_id (str | None, optional): Document identifier stored in metadata.
extra_metadata (dict[str, Any] | None, optional): Additional key/value pairs merged into result metadata.

Returns:

ChunkResult: Chunks and aggregate metadata.

Raises:

TypeError: If text is not a str.
ValueError: If text is empty or whitespace-only.
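Because chunk raises on non-string and blank input, callers that ingest untrusted documents may want to filter first. The sketch below mirrors the documented error contract; check_text and chunkable are hypothetical helpers, not part of the library.

```python
def check_text(text):
    """Mirror the documented contract: TypeError for non-str, ValueError for blank."""
    if not isinstance(text, str):
        raise TypeError(f"text must be str, got {type(text).__name__}")
    if not text.strip():
        raise ValueError("text must not be empty or whitespace-only")
    return text


def chunkable(texts):
    """Keep only inputs that would pass check_text."""
    return [t for t in texts if isinstance(t, str) and t.strip()]


kept = chunkable(["First document.", "   ", "", None, "Second document."])
```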

chunk_batch(texts, doc_ids=None, extra_metadata=None)

Chunk a list of documents.

Parameters:

texts (list[str]): Input documents.
doc_ids (list[str] | None, optional): Parallel document identifiers.
extra_metadata (dict[str, Any] | None, optional): Shared metadata for every result.

Returns:

list[ChunkResult]: One result per document.

Raises:

TypeError: If texts is not a list.
ValueError: If the length of doc_ids does not match texts.
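The pairing of texts with doc_ids and the shared extra_metadata can be sketched as follows. chunk_batch_sketch and fake_chunk are hypothetical; fake_chunk stands in for a per-document chunk call and returns a plain dict rather than a ChunkResult.

```python
def chunk_batch_sketch(texts, chunk_one, doc_ids=None, extra_metadata=None):
    """Validate the batch, then call chunk_one once per document."""
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError("doc_ids must have the same length as texts")
    ids = doc_ids if doc_ids is not None else [None] * len(texts)
    return [
        chunk_one(text, doc_id=doc_id, extra_metadata=extra_metadata)
        for text, doc_id in zip(texts, ids)
    ]


def fake_chunk(text, doc_id=None, extra_metadata=None):
    """Stand-in for a single-document chunk call."""
    return {"doc_id": doc_id, "n_chars": len(text), **(extra_metadata or {})}


results = chunk_batch_sketch(
    ["one", "three"], fake_chunk, doc_ids=["a", "b"], extra_metadata={"source": "demo"}
)
```

In this sketch every result receives the same extra_metadata dict, matching the "shared metadata" description; per-document metadata would need a separate call to chunk for each text.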
