WordChunker

class scikitplot.corpus.WordChunker(config=None, gensim_dictionary=None, multilang_config=None)

Process a document at word level, producing normalised token chunks.

Each output Chunk contains:

  • text — space-joined normalised tokens (with optional n-grams).

  • metadata — token list, n-grams, token count, processing flags, optional Gensim BoW vector.

  • metadata["multilang"] — script detection, semanteme analysis, stopword counts, grapheme counts, preprocessing trace, timing, and raw text (when MultilangConfig is set).

Parameters:
config : WordChunkerConfig, optional

Processing configuration.

gensim_dictionary : gensim.corpora.Dictionary, optional

Pre-built Gensim dictionary. When provided (and config.build_gensim_corpus is True), each chunk’s metadata includes a "bow" entry holding its Gensim bag-of-words (BoW) vector.

multilang_config : MultilangConfig, optional

Multilang feature flags. Overrides config.multilang_config when provided explicitly.

Notes

User note (multilang): Set multilang_config=MultilangConfig(include_semantemes=True, include_raw_text=True, include_preprocessing_trace=True) to get the full per-token semanteme analysis dict, the preprocessing audit trail, and a raw-vs-normalised text comparison in every chunk.

Developer note: Inherits MultilangMixin. Token-level metadata (token_count, stopword_count, unique_token_count) is computed directly from the processed token list and forwarded to _ml_build_meta.

Examples

>>> cfg = WordChunkerConfig(stemmer=StemmingBackend.PORTER)
>>> chunker = WordChunker(cfg)
>>> result = chunker.chunk("The quick brown foxes are jumping over lazy dogs.")
>>> "token_count" in result.chunks[0].metadata
True

attach_embedding(chunk, vector, *, model_name=None, model_version=None)

Return a new Chunk with an embedding attached.

Does NOT mutate the original Chunk (frozen dataclass).

Parameters:
chunk : Chunk

Any chunk produced by this chunker.

vector : list[float]

Dense embedding vector.

model_name : str, optional

Encoder model name.

model_version : str, optional

Encoder model version.

Returns:
Chunk

New frozen instance with metadata["multilang"]["embedding"] populated and metadata["embedding"] set at top level for compatibility with EmbeddedChunk.

Notes

User note: For batch embedding, use attach_embedding_batch, which avoids per-chunk dict copies.

Developer note: Two embedding locations are written:

  1. chunk.metadata["embedding"] — top-level key compatible with EmbeddedChunk and vector store adapters.

  2. chunk.metadata["multilang"]["embedding"] — inside the multilang bundle for model provenance tracking.
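The non-mutating behaviour and the two write locations described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Chunk` class here is a hypothetical stand-in with only `text` and `metadata` fields, and the layout of the nested `"embedding"` dict (`vector`, `model_name`, `model_version`) is an assumption.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the library's frozen Chunk."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def attach_embedding(chunk, vector, *, model_name=None, model_version=None):
    """Return a copy of ``chunk`` carrying ``vector`` in both locations."""
    # Copy the outer dict and the nested "multilang" dict so the original
    # chunk's metadata is left untouched.
    meta = dict(chunk.metadata)
    multilang = dict(meta.get("multilang", {}))
    # Assumed layout: vector plus provenance inside the multilang bundle.
    multilang["embedding"] = {
        "vector": list(vector),
        "model_name": model_name,
        "model_version": model_version,
    }
    meta["multilang"] = multilang
    # Top-level key for vector store adapters.
    meta["embedding"] = list(vector)
    # dataclasses.replace builds a new frozen instance.
    return replace(chunk, metadata=meta)


original = Chunk(text="quick brown fox", metadata={"token_count": 3})
embedded = attach_embedding(original, [0.1, 0.2, 0.3], model_name="demo-encoder")
```

After this runs, `original.metadata` still has no `"embedding"` key, while `embedded` is a separate instance holding the vector in both places.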

attach_embedding_batch(chunks, vectors, *, model_name=None, model_version=None)

Return a new list of chunks with embeddings attached.

Parameters:
chunks : list[Chunk]

Chunks from this chunker.

vectors : list[list[float]]

One embedding vector per chunk. Must have the same length as chunks.

model_name : str, optional

Encoder model name.

model_version : str, optional

Encoder model version.

Returns:
list[Chunk]

New list; originals are unmodified.

Raises:
ValueError

If len(chunks) != len(vectors).
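
The length check and the "originals are unmodified" guarantee might look roughly like the sketch below. It is an illustration under simplifying assumptions: plain metadata dicts stand in for Chunk objects, and the error message wording is invented for the example.

```python
def attach_embedding_batch(metadatas, vectors):
    """Pair each metadata dict with its vector, returning new dicts."""
    # Documented contract: one vector per chunk, otherwise ValueError.
    if len(metadatas) != len(vectors):
        raise ValueError(
            f"got {len(metadatas)} chunks but {len(vectors)} vectors"
        )
    # Build fresh dicts so the inputs are never modified.
    return [
        {**meta, "embedding": list(vec)}
        for meta, vec in zip(metadatas, vectors)
    ]


metas = [{"token_count": 3}, {"token_count": 5}]
out = attach_embedding_batch(metas, [[0.1, 0.2], [0.3, 0.4]])

# A mismatched batch is rejected before any work is done.
try:
    attach_embedding_batch(metas, [[0.1, 0.2]])
except ValueError as exc:
    mismatch_error = str(exc)
```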

static build_gensim_dictionary(token_lists, no_below=2, no_above=0.9, keep_n=None)

Build a gensim.corpora.Dictionary from token lists.

Parameters:
token_lists : list[list[str]]

Processed token lists (one per document).

no_below : int

Drop tokens that appear in fewer than no_below documents.

no_above : float

Drop tokens that appear in more than the no_above fraction of documents (0.0-1.0).

keep_n : int, optional

Retain only the top keep_n most frequent tokens after filtering.

Returns:
gensim.corpora.Dictionary

Built and filtered dictionary.

Raises:
ImportError

If Gensim is not installed.

ValueError

If token_lists is empty.
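
To make the effect of no_below, no_above, and keep_n concrete, here is a pure-Python sketch of the filtering rules as described in the parameter list. It does not call Gensim and is not this method's implementation; `filter_vocabulary` is a hypothetical helper, and ranking by document frequency with alphabetical tie-breaking is an assumption made for determinism.

```python
from collections import Counter


def filter_vocabulary(token_lists, no_below=2, no_above=0.9, keep_n=None):
    """Return the tokens that survive document-frequency filtering."""
    if not token_lists:
        raise ValueError("token_lists must not be empty")
    num_docs = len(token_lists)
    # Document frequency: count each token at most once per document.
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    kept = [
        tok
        for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Most frequent first; ties broken alphabetically (assumption).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept if keep_n is None else kept[:keep_n]


docs = [
    ["fox", "jump", "dog"],
    ["fox", "run", "dog"],
    ["fox", "sleep"],
    ["cat", "sleep"],
]
# "fox" is in 3 of 4 documents (above the 0.5 ceiling); "jump", "run",
# and "cat" are in only one document (below the floor of 2).
vocab = filter_vocabulary(docs, no_below=2, no_above=0.5)
top_one = filter_vocabulary(docs, no_below=2, no_above=0.5, keep_n=1)
```

With these toy documents only "dog" and "sleep" survive both thresholds, and keep_n=1 trims the result to a single token.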

chunk(text, doc_id=None, extra_metadata=None)

Process text into word-level chunks.

Parameters:
text : str

Raw document text.

doc_id : str, optional

Document identifier stored in each chunk’s metadata.

extra_metadata : dict[str, Any], optional

Additional key/value pairs merged into result metadata.

Returns:
ChunkResult

Chunks and aggregate metadata.

Raises:
TypeError

If text is not a str.

ValueError

If text is empty or whitespace-only.

chunk_batch(texts, doc_ids=None, extra_metadata=None)

Process a list of documents into word-level chunks.

Parameters:
texts : list[str]

Input documents.

doc_ids : list[str], optional

Parallel document identifiers.

extra_metadata : dict[str, Any], optional

Shared metadata merged into every result.

Returns:
list[ChunkResult]

One result per document.

Raises:
TypeError

If texts is not a list.

ValueError

If doc_ids is provided and its length differs from that of texts.
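
The input checks listed under Raises can be sketched as a small validation step. This is only an illustration of the documented contract: `pair_texts_with_ids` is a hypothetical helper, the error messages are invented, and defaulting missing identifiers to None is an assumption.

```python
def pair_texts_with_ids(texts, doc_ids=None):
    """Validate batch inputs and pair each text with its identifier."""
    if not isinstance(texts, list):
        raise TypeError(f"texts must be a list, got {type(texts).__name__}")
    if doc_ids is None:
        # Assumption: documents without identifiers get None.
        doc_ids = [None] * len(texts)
    elif len(doc_ids) != len(texts):
        raise ValueError(f"got {len(texts)} texts but {len(doc_ids)} doc_ids")
    return list(zip(doc_ids, texts))


pairs = pair_texts_with_ids(["First document.", "Second document."], ["a", "b"])

# A bare string is rejected even though it is iterable.
try:
    pair_texts_with_ids("not a list")
except TypeError as exc:
    type_error = str(exc)
```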

static vocabulary_stats(token_lists)

Compute vocabulary statistics over a corpus.

Parameters:
token_lists : list[list[str]]

Processed token lists, one per document.

Returns:
dict[str, Any]

Dictionary with keys: vocab_size, total_tokens, unique_tokens, avg_tokens_per_doc, top_20_tokens.

Raises:
ValueError

If token_lists is empty.
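
A rough sketch of how statistics with these keys could be computed using collections.Counter is shown below. It is not the library's implementation. The documentation does not say how vocab_size and unique_tokens differ, so this sketch treats both as the number of distinct tokens, and it assumes top_20_tokens is a list of (token, count) pairs; both points are assumptions.

```python
from collections import Counter


def vocabulary_stats(token_lists):
    """Compute simple corpus-level vocabulary statistics."""
    if not token_lists:
        raise ValueError("token_lists must not be empty")
    # Collection frequency: every occurrence of every token counts.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {
        "vocab_size": len(counts),  # assumption: distinct tokens
        "total_tokens": total,
        "unique_tokens": len(counts),  # assumption: same as vocab_size
        "avg_tokens_per_doc": total / len(token_lists),
        "top_20_tokens": counts.most_common(20),  # assumed (token, count) pairs
    }


stats = vocabulary_stats([["fox", "dog", "fox"], ["dog", "cat"]])
```

For the two toy documents above, the corpus has five tokens in total, three distinct tokens, and an average of 2.5 tokens per document.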
