WordChunker
- class scikitplot.corpus.WordChunker(config=None, gensim_dictionary=None, multilang_config=None)
Process a document at word level, producing normalised token chunks.
Each output Chunk contains:
- text: space-joined normalised tokens (with optional n-grams).
- metadata: token list, n-grams, token count, processing flags, optional Gensim BoW vector.
- metadata["multilang"]: script detection, semanteme analysis, stopword counts, grapheme counts, preprocessing trace, timing, and raw text (when MultilangConfig is set).
- Parameters:
- config : WordChunkerConfig, optional
Processing configuration.
- gensim_dictionary : gensim.corpora.Dictionary, optional
Pre-built Gensim dictionary. When provided (and cfg.build_gensim_corpus is True), each chunk's metadata includes a "bow" Gensim BoW vector.
- multilang_config : MultilangConfig, optional
Multilang feature flags. Overrides config.multilang_config when provided explicitly.
- Parameters:
config (WordChunkerConfig | None)
gensim_dictionary (Any | None)
multilang_config (MultilangConfig | None)
Notes
User note (multilang): Set multilang_config=MultilangConfig(include_semantemes=True, include_raw_text=True, include_preprocessing_trace=True) to get the full per-token semanteme analysis dict, preprocessing audit trail, and raw-vs-normalised text comparison in every chunk.
Developer note: Inherits MultilangMixin. Token-level metadata (token_count, stopword_count, unique_token_count) is computed directly from the processed token list and forwarded to _ml_build_meta.
Examples
>>> cfg = WordChunkerConfig(stemmer=StemmingBackend.PORTER)
>>> chunker = WordChunker(cfg)
>>> result = chunker.chunk("The quick brown foxes are jumping over lazy dogs.")
>>> "token_count" in result.chunks[0].metadata
True
- attach_embedding(chunk, vector, *, model_name=None, model_version=None)
Return a new Chunk with an embedding attached. Does NOT mutate the original Chunk (frozen dataclass).
- Parameters:
- chunk : Chunk
Any chunk produced by this chunker.
- vector : list[float]
Dense embedding vector.
- model_name : str, optional
Encoder model name.
- model_version : str, optional
Encoder model version.
- Returns:
- Chunk
New frozen instance with metadata["multilang"]["embedding"] populated and metadata["embedding"] set at top level for compatibility with EmbeddedChunk.
- Return type:
Chunk
Notes
User note: For batch embedding, use attach_embedding_batch, which avoids per-chunk dict copies.
Developer note: Two embedding locations are written:
- chunk.metadata["embedding"]: top-level key compatible with EmbeddedChunk and vector store adapters.
- chunk.metadata["multilang"]["embedding"]: inside the multilang bundle for model provenance tracking.
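The non-mutating behaviour described for attach_embedding can be pictured with a small stand-in. This is a minimal sketch, not the library's implementation: SketchChunk and attach_embedding_sketch are hypothetical names, and the provenance keys written inside the multilang bundle (model_name, model_version) are an assumption based on the parameter names only.

```python
import copy
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SketchChunk:
    """Hypothetical stand-in for the library's frozen Chunk dataclass."""

    text: str
    metadata: dict = field(default_factory=dict)


def attach_embedding_sketch(chunk, vector, *, model_name=None, model_version=None):
    """Return a new chunk with the embedding written to both locations."""
    # Deep-copy so nested dicts of the original chunk are never touched.
    meta = copy.deepcopy(chunk.metadata)
    meta["embedding"] = list(vector)  # top-level key
    bundle = meta.setdefault("multilang", {})
    bundle["embedding"] = list(vector)  # copy inside the multilang bundle
    # Assumed provenance keys; the real key names are not documented here.
    bundle["model_name"] = model_name
    bundle["model_version"] = model_version
    # dataclasses.replace builds a new frozen instance instead of mutating.
    return replace(chunk, metadata=meta)


original = SketchChunk(text="quick brown fox", metadata={"token_count": 3})
embedded = attach_embedding_sketch(original, [0.1, 0.2, 0.3], model_name="demo-encoder")
```

After the call, original still has no "embedding" key, while embedded carries the vector in both places.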
- attach_embedding_batch(chunks, vectors, *, model_name=None, model_version=None)
Return a new list of chunks with embeddings attached.
- Parameters:
- chunks : list[Chunk]
Chunks from this chunker.
- vectors : list[list[float]]
One embedding vector per chunk. Must have the same length as chunks.
- model_name : str, optional
Encoder model name.
- model_version : str, optional
Encoder model version.
- Returns:
- list[Chunk]
New list; originals are unmodified.
- Raises:
- ValueError
If len(chunks) != len(vectors).
- Return type:
list[Chunk]
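The batch contract (one vector per chunk, ValueError on a length mismatch, originals left unmodified) can be sketched as follows. SketchChunk and attach_embedding_batch_sketch are hypothetical stand-ins, and only the top-level "embedding" key is written to keep the example short.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SketchChunk:
    """Hypothetical stand-in for the library's frozen Chunk dataclass."""

    text: str
    metadata: dict = field(default_factory=dict)


def attach_embedding_batch_sketch(chunks, vectors):
    """Pair each chunk with its vector and return new chunk objects."""
    if len(chunks) != len(vectors):
        raise ValueError(
            f"chunks and vectors differ in length: {len(chunks)} != {len(vectors)}"
        )
    out = []
    for chunk, vector in zip(chunks, vectors):
        # Build a fresh metadata dict so the original chunk stays unchanged.
        meta = {**chunk.metadata, "embedding": list(vector)}
        out.append(replace(chunk, metadata=meta))
    return out


chunks = [SketchChunk("quick brown"), SketchChunk("lazy dog")]
embedded = attach_embedding_batch_sketch(chunks, [[0.1, 0.2], [0.3, 0.4]])

# A mismatched batch is rejected before any chunk is processed.
try:
    attach_embedding_batch_sketch(chunks, [[0.1, 0.2]])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```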
- static build_gensim_dictionary(token_lists, no_below=2, no_above=0.9, keep_n=None)
Build a gensim.corpora.Dictionary from token lists.
- Parameters:
- token_lists : list[list[str]]
Processed token lists (one per document).
- no_below : int
Filter tokens appearing in fewer than no_below documents.
- no_above : float
Filter tokens appearing in more than no_above fraction of documents (0.0-1.0).
- keep_n : int, optional
Retain only the top keep_n most frequent tokens after filtering.
- Returns:
- gensim.corpora.Dictionary
Built and filtered dictionary.
- Raises:
- ImportError
If Gensim is not installed.
- ValueError
If token_lists is empty.
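The effect of no_below, no_above, and keep_n can be illustrated without Gensim. The parameter names match those of Gensim's Dictionary.filter_extremes, but whether this method delegates to it is not stated here, so the following is only a pure-Python sketch of the documented filtering rules. filter_vocabulary_sketch is a hypothetical helper, and ranking by document frequency with alphabetical tie-breaking is an assumption.

```python
from collections import Counter


def filter_vocabulary_sketch(token_lists, no_below=2, no_above=0.9, keep_n=None):
    """Return the tokens that survive document-frequency filtering."""
    if not token_lists:
        raise ValueError("token_lists must not be empty")
    num_docs = len(token_lists)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    kept = {
        tok: df
        for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }
    # Assumed ranking: highest document frequency first, ties alphabetical.
    ranked = sorted(kept, key=lambda tok: (-kept[tok], tok))
    return ranked if keep_n is None else ranked[:keep_n]


docs = [
    ["quick", "brown", "fox"],
    ["quick", "lazi", "dog"],
    ["quick", "brown", "dog"],
    ["quick", "fox"],
]
# "quick" occurs in all 4 documents (above 0.9 * 4) and "lazi" in only 1,
# so both are dropped.
vocab = filter_vocabulary_sketch(docs, no_below=2, no_above=0.9)
```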
- chunk(text, doc_id=None, extra_metadata=None)
Process text into word-level chunks.
- Parameters:
- text : str
Raw document text.
- doc_id : str, optional
Document identifier stored in each chunk's metadata.
- extra_metadata : dict[str, Any], optional
Additional key/value pairs merged into result metadata.
- Returns:
- ChunkResult
Chunks and aggregate metadata.
- Raises:
- TypeError
If text is not a str.
- ValueError
If text is empty or whitespace-only.
- Return type:
ChunkResult
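To make the input contract and the shape of a word-level chunk concrete, here is a minimal sketch under stated assumptions. word_chunk_sketch is a hypothetical function, not the library's code: the normalisation (lowercasing, alphanumeric tokens), the underscore-joined bigrams, and the metadata key names other than token_count are illustrative choices.

```python
import re


def word_chunk_sketch(text, doc_id=None, ngram=2):
    """Validate the input, normalise tokens, and build a chunk-like dict."""
    if not isinstance(text, str):
        raise TypeError(f"text must be str, got {type(text).__name__}")
    if not text.strip():
        raise ValueError("text is empty or whitespace-only")
    # Assumed normalisation: lowercase and keep alphanumeric runs only.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Sliding window of `ngram` consecutive tokens, joined with "_".
    ngrams = ["_".join(tokens[i : i + ngram]) for i in range(len(tokens) - ngram + 1)]
    return {
        "text": " ".join(tokens + ngrams),
        "metadata": {
            "doc_id": doc_id,
            "tokens": tokens,
            "ngrams": ngrams,
            "token_count": len(tokens),
        },
    }


chunk = word_chunk_sketch("The quick brown fox.", doc_id="doc-1")
```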
- chunk_batch(texts, doc_ids=None, extra_metadata=None)
Process a list of documents into word-level chunks.
- Parameters:
- texts : list[str]
Input documents.
- doc_ids : list[str], optional
Parallel document identifiers.
- extra_metadata : dict[str, Any], optional
Shared metadata merged into every result.
- Returns:
- list[ChunkResult]
One result per document.
- Raises:
- TypeError
If texts is not a list.
- ValueError
If doc_ids length mismatches texts.
- Return type:
list[ChunkResult]
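The batch validation rules (texts must be a list, doc_ids must be parallel to texts) can be sketched as below. pair_texts_with_ids_sketch is a hypothetical helper, and falling back to None identifiers when doc_ids is omitted is an assumption, since the default behaviour is not described here.

```python
def pair_texts_with_ids_sketch(texts, doc_ids=None):
    """Validate a batch and return (doc_id, text) pairs in input order."""
    if not isinstance(texts, list):
        raise TypeError(f"texts must be a list, got {type(texts).__name__}")
    if doc_ids is None:
        # Assumed fallback: no identifiers supplied, so use None for each text.
        doc_ids = [None] * len(texts)
    if len(doc_ids) != len(texts):
        raise ValueError(f"got {len(doc_ids)} doc_ids for {len(texts)} texts")
    return list(zip(doc_ids, texts))


pairs = pair_texts_with_ids_sketch(["first doc", "second doc"], ["a", "b"])
```

Each pair would then be handed to the single-document chunk call, giving one result per input document.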
- static vocabulary_stats(token_lists)
Compute vocabulary statistics over a corpus.
- Parameters:
- token_lists : list[list[str]]
Processed token lists, one per document.
- Returns:
- dict[str, Any]
Dictionary with keys: vocab_size, total_tokens, unique_tokens, avg_tokens_per_doc, top_20_tokens.
- Raises:
- ValueError
If token_lists is empty.
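Most of the listed statistics can be computed with a Counter, as in the sketch below. vocabulary_stats_sketch is a hypothetical function: vocab_size is left out because this page does not say how it differs from unique_tokens, and returning top_20_tokens as (token, count) pairs is an assumption.

```python
from collections import Counter


def vocabulary_stats_sketch(token_lists):
    """Compute corpus-level token statistics from per-document token lists."""
    if not token_lists:
        raise ValueError("token_lists must not be empty")
    # Corpus-wide term frequency across all documents.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {
        "total_tokens": total,
        "unique_tokens": len(counts),
        "avg_tokens_per_doc": total / len(token_lists),
        # Assumed format: (token, count) pairs, most frequent first.
        "top_20_tokens": counts.most_common(20),
    }


stats = vocabulary_stats_sketch([["quick", "brown", "fox"], ["quick", "dog"], ["quick"]])
```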
Gallery examples
corpus Knowledge and Information local .png with examples