WordChunker#
- class scikitplot.corpus.WordChunker(config=None, gensim_dictionary=None)[source]#
Process a document at word level, producing normalised token chunks.
Each output Chunk contains:
- text — space-joined normalised tokens (with optional n-grams).
- metadata — token list, n-grams, token count, processing flags, and an optional Gensim BoW vector.
- Parameters:
- config : WordChunkerConfig, optional
Processing configuration.
- gensim_dictionary : gensim.corpora.Dictionary, optional
Pre-built Gensim dictionary. When provided (and cfg.build_gensim_corpus is True), each chunk’s metadata includes a "bow" Gensim BoW vector.
Examples
>>> cfg = WordChunkerConfig(stemmer=StemmingBackend.PORTER)
>>> chunker = WordChunker(cfg)
>>> result = chunker.chunk("The quick brown foxes are jumping over lazy dogs.")
>>> "token_count" in result.chunks[0].metadata
True
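The chunk text can include optional n-grams alongside the unigram tokens. A minimal sketch of how bigrams might be generated and appended to a normalised token list before joining (illustrative only; add_ngrams is not part of the library API):

```python
# Illustrative sketch (not the library's implementation): appending
# n-grams, joined with "_", to a list of normalised tokens.
def add_ngrams(tokens, n=2):
    """Return tokens plus their n-grams (joined with '_')."""
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams

print(add_ngrams(["quick", "brown", "fox"]))
# ['quick', 'brown', 'fox', 'quick_brown', 'brown_fox']
```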
- static build_gensim_dictionary(token_lists, no_below=2, no_above=0.9, keep_n=None)[source]#
Build a gensim.corpora.Dictionary from token lists.
- Parameters:
- token_lists : list[list[str]]
Processed token lists (one per document).
- no_below : int
Filter out tokens appearing in fewer than no_below documents.
- no_above : float
Filter out tokens appearing in more than no_above fraction of documents (0.0-1.0).
- keep_n : int, optional
Retain only the top keep_n most frequent tokens after filtering.
- Returns:
- gensim.corpora.Dictionary
Built and filtered dictionary.
- Raises:
- ImportError
If Gensim is not installed.
- ValueError
If token_lists is empty.
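The no_below / no_above / keep_n semantics can be sketched in pure Python (Gensim's Dictionary.filter_extremes applies the same rules; this sketch is an illustration, not the library code, and does not require Gensim):

```python
from collections import Counter

# Pure-Python sketch of the vocabulary-filtering semantics described
# above: drop tokens seen in fewer than no_below documents or in more
# than no_above fraction of documents, then optionally keep only the
# keep_n most frequent survivors.
def filter_vocab(token_lists, no_below=2, no_above=0.9, keep_n=None):
    n_docs = len(token_lists)
    doc_freq = Counter()                      # number of docs each token appears in
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    kept = {t: df for t, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}
    if keep_n is not None:                    # retain the most frequent tokens
        kept = dict(sorted(kept.items(), key=lambda kv: -kv[1])[:keep_n])
    return sorted(kept)

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog", "bird"]]
# "cat" appears in all 3 docs (fraction 1.0 > 0.9), so it is filtered;
# "fish" and "bird" appear in only 1 doc (< no_below), so they are too.
print(filter_vocab(docs, no_below=2, no_above=0.9))  # ['dog']
```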
- chunk(text, doc_id=None, extra_metadata=None)[source]#
Process text into word-level chunks.
- Parameters:
- text : str
Raw document text.
- doc_id : str, optional
Document identifier stored in each chunk’s metadata.
- extra_metadata : dict[str, Any], optional
Additional key/value pairs merged into result metadata.
- Returns:
- ChunkResult
Chunks and aggregate metadata.
- Raises:
- TypeError
If text is not a str.
- ValueError
If text is empty or whitespace-only.
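The input validation and word-level normalisation performed by chunk() can be sketched as follows (names and the exact normalisation are illustrative assumptions, not the scikitplot implementation):

```python
import re

# Hypothetical sketch of chunk()'s contract: reject non-str and
# empty/whitespace-only input, normalise tokens, and return chunk text
# plus metadata including "token_count" and the optional doc_id.
def chunk_words(text, doc_id=None):
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    if not text.strip():
        raise ValueError("text is empty or whitespace-only")
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # lowercase word tokens
    metadata = {"tokens": tokens, "token_count": len(tokens), "doc_id": doc_id}
    return {"text": " ".join(tokens), "metadata": metadata}

result = chunk_words("The quick brown fox!", doc_id="doc-1")
print(result["metadata"]["token_count"])  # 4
```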
- chunk_batch(texts, doc_ids=None, extra_metadata=None)[source]#
Process a list of documents into word-level chunks.
- Parameters:
- texts : list[str]
Input documents.
- doc_ids : list[str], optional
Parallel document identifiers.
- extra_metadata : dict[str, Any], optional
Shared metadata merged into every result.
- Returns:
- list[ChunkResult]
One result per document.
- Raises:
- TypeError
If texts is not a list.
- ValueError
If the length of doc_ids does not match the length of texts.
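The batch-level checks described above (a TypeError for non-list input, a ValueError for a doc_ids length mismatch) can be sketched like this (an illustrative simplification, not the library code):

```python
# Hypothetical sketch of chunk_batch()'s input checks: validate the
# container and the parallel doc_ids list before per-document work.
def chunk_batch(texts, doc_ids=None):
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError("doc_ids length must match texts")
    ids = doc_ids if doc_ids is not None else [None] * len(texts)
    # One result per document, pairing each text with its identifier.
    return [{"doc_id": i, "text": t} for t, i in zip(texts, ids)]

print(len(chunk_batch(["first doc", "second doc"], doc_ids=["d1", "d2"])))  # 2
```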
- static vocabulary_stats(token_lists)[source]#
Compute vocabulary statistics over a corpus.
- Parameters:
- token_lists : list[list[str]]
Processed token lists, one per document.
- Returns:
- dict[str, Any]
Dictionary with keys:
vocab_size, total_tokens, unique_tokens, avg_tokens_per_doc, top_20_tokens.
- Raises:
- ValueError
If token_lists is empty.
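A pure-Python sketch of the statistics listed above, using their natural definitions (the key names come from the docs; the exact computation, and whether vocab_size and unique_tokens ever differ, is an assumption of this sketch):

```python
from collections import Counter

# Illustrative sketch of vocabulary_stats(): aggregate token counts
# over all documents and report corpus-level statistics.
def vocabulary_stats(token_lists):
    if not token_lists:
        raise ValueError("token_lists is empty")
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    return {
        "vocab_size": len(counts),             # distinct tokens
        "total_tokens": total,                 # all token occurrences
        "unique_tokens": len(counts),          # taken as distinct tokens here
        "avg_tokens_per_doc": total / len(token_lists),
        "top_20_tokens": counts.most_common(20),
    }

stats = vocabulary_stats([["cat", "dog", "cat"], ["dog", "fish"]])
print(stats["vocab_size"], stats["avg_tokens_per_doc"])  # 3 2.5
```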