ParagraphChunker
- class scikitplot.corpus.ParagraphChunker(config=None, multilang_config=None)
Split a document into paragraph-level Chunk objects.
Paragraph boundaries are blank lines (\n\n), which makes the split script-universal and dependency-free. Within each paragraph, the dominant Unicode script is detected and reported in chunk.metadata["multilang"]["script"].
- Parameters:
- config : ParagraphChunkerConfig, optional
Chunker configuration.
- multilang_config : MultilangConfig, optional
Multilang feature flags. Overrides config.multilang_config when provided explicitly. Default: MultilangConfig() (script detection + semanteme count only, no raw text / trace).
- Parameters:
config (ParagraphChunkerConfig | None)
multilang_config (MultilangConfig | None)
Notes
User note (multilang): Blank-line paragraph splitting works for every script. Set multilang_config=MultilangConfig(include_raw_text=True, include_preprocessing_trace=True) to attach a full preprocessing audit to each chunk. To add embeddings after chunking, call attach_embedding_batch.
Developer note: Inherits MultilangMixin. Initialised via self._ml_init(). The chunk() method enriches each paragraph chunk with a MultilangChunkMeta dict stored under chunk.metadata["multilang"].
Examples
>>> from scikitplot.corpus import ParagraphChunker
>>> chunker = ParagraphChunker()
>>> text = "First paragraph.\n\nSecond paragraph."
>>> result = chunker.chunk(text)
>>> len(result.chunks)
2
>>> # MultilangConfig must be imported as well; its module path is not shown here.
>>> chunker_ml = ParagraphChunker(
...     multilang_config=MultilangConfig(include_raw_text=True)
... )
>>> result_ml = chunker_ml.chunk("مرحبا.\n\nHello.")
>>> result_ml.chunks[0].metadata["multilang"]["script"]
'arabic'
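To make the blank-line splitting and per-paragraph script detection concrete, here is a minimal standalone sketch. It does not use scikitplot: split_paragraphs and dominant_script are hypothetical helpers, and the script guess relies on a rough heuristic (the first word of each letter's Unicode name), which may differ from how the library actually detects scripts.

```python
import re
import unicodedata
from collections import Counter


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces (hypothetical helper)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def dominant_script(paragraph: str) -> str:
    """Guess the dominant script from Unicode character names (rough heuristic)."""
    counts = Counter()
    for ch in paragraph:
        if ch.isalpha():
            # The first word of a character name is usually its script,
            # e.g. "LATIN SMALL LETTER A" or "ARABIC LETTER MEEM".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0].lower()] += 1
    return counts.most_common(1)[0][0] if counts else "unknown"


paragraphs = split_paragraphs("مرحبا.\n\nHello.")
scripts = [dominant_script(p) for p in paragraphs]
print(list(zip(paragraphs, scripts)))
```

Because the split only looks for blank lines, it needs no tokenizer or language model, which is what makes this kind of splitting usable across scripts.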
- attach_embedding(chunk, vector, *, model_name=None, model_version=None)
Return a new Chunk with an embedding attached.
Does NOT mutate the original Chunk (frozen dataclass).
- Parameters:
- chunk : Chunk
Any chunk produced by this chunker.
- vector : list[float]
Dense embedding vector.
- model_name : str, optional
Encoder model name.
- model_version : str, optional
Encoder model version.
- Returns:
- Chunk
New frozen instance with metadata["multilang"]["embedding"] populated and metadata["embedding"] set at top level for compatibility with EmbeddedChunk.
- Return type:
Chunk
Notes
User note: For batch embedding, use attach_embedding_batch, which avoids per-chunk dict copies.
Developer note: Two embedding locations are written:
- chunk.metadata["embedding"]: top-level key compatible with EmbeddedChunk and vector store adapters.
- chunk.metadata["multilang"]["embedding"]: inside the multilang bundle for model provenance tracking.
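The non-mutating behaviour and the two embedding locations can be sketched with a stand-in frozen dataclass. The Chunk class below is hypothetical, not the library's own, and the shape of the entry stored under metadata["multilang"]["embedding"] (vector plus model name and version) is an assumption made for illustration only.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Chunk:  # hypothetical stand-in, not scikitplot's Chunk
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def attach_embedding(chunk, vector, *, model_name=None, model_version=None):
    """Return a copy of ``chunk`` with the embedding written in both places."""
    metadata = dict(chunk.metadata)                  # copy the top-level dict
    multilang = dict(metadata.get("multilang", {}))  # copy the nested bundle too
    multilang["embedding"] = {                       # assumed shape, for illustration
        "vector": list(vector),
        "model_name": model_name,
        "model_version": model_version,
    }
    metadata["multilang"] = multilang
    metadata["embedding"] = list(vector)             # top-level key
    # dataclasses.replace builds a new frozen instance; the original is untouched.
    return replace(chunk, metadata=metadata)


original = Chunk("Hello.", {"multilang": {"script": "latin"}})
embedded = attach_embedding(original, [0.1, 0.2], model_name="demo-encoder")
print(embedded.metadata["embedding"])
```

Copying both the top-level dict and the nested multilang dict matters here: a frozen dataclass only blocks attribute assignment, so writing into a shared dict would still change the original chunk.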
- attach_embedding_batch(chunks, vectors, *, model_name=None, model_version=None)
Return a new list of chunks with embeddings attached.
- Parameters:
- chunks : list[Chunk]
Chunks from this chunker.
- vectors : list[list[float]]
One embedding vector per chunk. Must have the same length as chunks.
- model_name : str, optional
Encoder model name.
- model_version : str, optional
Encoder model version.
- Returns:
- list[Chunk]
New list; originals are unmodified.
- Raises:
- ValueError
If len(chunks) != len(vectors).
- Return type:
list[Chunk]
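A minimal sketch of the batch contract described above: one vector per chunk, a ValueError on a length mismatch, and a new list with the originals left unmodified. The Chunk class and the layout of the stored embedding entry are hypothetical stand-ins, and this sketch makes no attempt to reproduce the library's internal copy-avoidance.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Chunk:  # hypothetical stand-in, not scikitplot's Chunk
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def attach_embedding_batch(chunks, vectors, *, model_name=None, model_version=None):
    # Validate once, before building anything, so a mismatch never
    # produces a partially embedded list.
    if len(chunks) != len(vectors):
        raise ValueError(
            f"expected one vector per chunk, got {len(chunks)} chunks "
            f"and {len(vectors)} vectors"
        )
    provenance = {"model_name": model_name, "model_version": model_version}
    out = []
    for chunk, vector in zip(chunks, vectors):
        metadata = {**chunk.metadata, "embedding": list(vector)}
        metadata["multilang"] = {
            **chunk.metadata.get("multilang", {}),
            "embedding": {"vector": list(vector), **provenance},  # assumed shape
        }
        out.append(replace(chunk, metadata=metadata))
    return out


chunks = [Chunk("First paragraph."), Chunk("Second paragraph.")]
embedded = attach_embedding_batch(
    chunks, [[0.1, 0.2], [0.3, 0.4]], model_name="demo-encoder"
)

try:
    attach_embedding_batch(chunks, [[0.1, 0.2]])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)
```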
- chunk(text, doc_id=None, extra_metadata=None)
Split text into paragraph-level chunks.
- Parameters:
- text : str
Raw document text.
- doc_id : str, optional
Document identifier stored in each chunk’s metadata.
- extra_metadata : dict[str, Any], optional
Additional key/value pairs merged into the result metadata.
- Returns:
- ChunkResult
Chunks and aggregate metadata.
- Raises:
- TypeError
If text is not a str.
- ValueError
If text is empty or whitespace-only.
- Return type:
ChunkResult
- chunk_batch(texts, doc_ids=None, extra_metadata=None)
Chunk a list of documents.
- Parameters:
- texts : list[str]
Input documents.
- doc_ids : list[str], optional
Parallel document identifiers.
- extra_metadata : dict[str, Any], optional
Shared metadata for every result.
- Returns:
- list[ChunkResult]
One result per document.
- Raises:
- TypeError
If texts is not a list.
- ValueError
If doc_ids length mismatches texts.
- Return type:
list[ChunkResult]
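The validation rules documented for chunk and chunk_batch can be sketched together as follows. Both functions here are hypothetical stand-ins that return plain dicts rather than ChunkResult objects, and the error messages are invented for the example; only the exception types and the conditions that trigger them come from the reference above.

```python
import re


def chunk_one(text, doc_id=None, extra_metadata=None):
    """Hypothetical single-document chunker returning a plain dict."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    if not text.strip():
        raise ValueError("text must not be empty or whitespace-only")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "doc_id": doc_id,
        "chunks": paragraphs,
        "metadata": dict(extra_metadata or {}),  # copy so results do not share a dict
    }


def chunk_batch(texts, doc_ids=None, extra_metadata=None):
    """Hypothetical batch wrapper: one result per input document."""
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is None:
        doc_ids = [None] * len(texts)
    elif len(doc_ids) != len(texts):
        raise ValueError("doc_ids must have the same length as texts")
    return [chunk_one(t, d, extra_metadata) for t, d in zip(texts, doc_ids)]


results = chunk_batch(
    ["First.\n\nSecond.", "Only one paragraph."],
    doc_ids=["doc-a", "doc-b"],
    extra_metadata={"source": "demo"},
)
print([len(r["chunks"]) for r in results])
```

Checking the doc_ids length before chunking any document means a mismatched call fails immediately instead of after part of the batch has been processed.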