FixedWindowChunker

class scikitplot.corpus.FixedWindowChunker(config=None)

Produce fixed-size sliding-window chunks over a document.

Parameters:
config : FixedWindowChunkerConfig, optional

Chunker configuration, such as window_size, step_size, and unit.

Examples

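>>> # Import path for the config classes is assumed to match FixedWindowChunker.
>>> from scikitplot.corpus import (
...     FixedWindowChunker, FixedWindowChunkerConfig, WindowUnit
... )
>>> cfg = FixedWindowChunkerConfig(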
...     window_size=20, step_size=10, unit=WindowUnit.CHARS
... )
>>> chunker = FixedWindowChunker(cfg)
>>> result = chunker.chunk("The quick brown fox jumps over the lazy dog")
>>> result.chunks[0].text
'The quick brown fox '
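
To make the windowing arithmetic in the example concrete, here is a minimal pure-Python sketch of character-based sliding windows. It is not the library's implementation: `fixed_windows` is a hypothetical helper, and emitting a shorter final window when the text runs out is an assumption about how the tail is handled.

```python
def fixed_windows(text: str, window_size: int, step_size: int) -> list[str]:
    """Return character windows of up to window_size, advancing by step_size."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    chunks = []
    for start in range(0, len(text), step_size):
        # Slicing past the end of the string simply yields a shorter window.
        chunks.append(text[start:start + window_size])
        # Stop once a window has reached the end of the text (assumed behavior).
        if start + window_size >= len(text):
            break
    return chunks


chunks = fixed_windows(
    "The quick brown fox jumps over the lazy dog", window_size=20, step_size=10
)
print(chunks[0])    # 'The quick brown fox ' as in the example above
print(len(chunks))
```

With a step smaller than the window, consecutive chunks overlap by window_size - step_size characters, here 10.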

chunk(text, doc_id=None, extra_metadata=None)

Split text into fixed-window chunks.

Parameters:
text : str

Raw document text.

doc_id : str, optional

Document identifier stored in metadata.

extra_metadata : dict[str, Any], optional

Additional key/value pairs merged into result metadata.

Returns:
ChunkResult

Chunks and aggregate metadata.

Raises:
TypeError

If text is not a str.

ValueError

If text is empty or whitespace-only.
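
The two error conditions above can be sketched as a small standalone check. `validate_text` is a hypothetical helper written for illustration and the error messages are invented; only the exception types and their triggers come from the documentation.

```python
def validate_text(text: object) -> str:
    """Mirror the documented input contract: str only, and not blank."""
    if not isinstance(text, str):
        raise TypeError(f"text must be a str, got {type(text).__name__}")
    # str.strip() returns '' for empty and whitespace-only input alike.
    if not text.strip():
        raise ValueError("text must not be empty or whitespace-only")
    return text


checked = validate_text("The quick brown fox")
```

Callers that read documents from untrusted sources may want to catch ValueError and skip blank documents rather than let one empty file abort a run.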

chunk_batch(texts, doc_ids=None, extra_metadata=None)

Chunk a list of documents.

Parameters:
texts : list[str]

Input documents.

doc_ids : list[str], optional

Parallel document identifiers.

extra_metadata : dict[str, Any], optional

Shared metadata for every result.

Returns:
list[ChunkResult]

One result per document.

Raises:
TypeError

If texts is not a list.

ValueError

If doc_ids length mismatches texts.
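
The batch contract (parallel doc_ids, shared extra_metadata, one result per input) can be illustrated with a rough sketch. `chunk_batch_sketch` is hypothetical, and plain dicts stand in for ChunkResult, whose actual fields are not shown here; no real chunking is performed.

```python
from typing import Any, Optional


def chunk_batch_sketch(
    texts: list[str],
    doc_ids: Optional[list[str]] = None,
    extra_metadata: Optional[dict[str, Any]] = None,
) -> list[dict[str, Any]]:
    """Pair each text with its doc_id and a copy of the shared metadata."""
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError("doc_ids length must match texts")
    results = []
    for i, text in enumerate(texts):
        # Copy the shared metadata so results do not alias one dict.
        metadata = dict(extra_metadata or {})
        metadata["doc_id"] = doc_ids[i] if doc_ids is not None else None
        results.append({"text": text, "metadata": metadata})
    return results


results = chunk_batch_sketch(
    ["first document", "second document"],
    doc_ids=["d1", "d2"],
    extra_metadata={"source": "demo"},
)
```

Copying the metadata per result is a design choice of this sketch; it keeps a later mutation of one result's metadata from leaking into the others.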
