ParagraphChunker

class scikitplot.corpus.ParagraphChunker(config=None, multilang_config=None)

Split a document into paragraph-level Chunk objects.

Paragraph boundaries are blank lines (\n\n) — script-universal and dependency-free. Within each paragraph the dominant Unicode script is detected and reported in chunk.metadata["multilang"]["script"].
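
The splitting and script-detection behaviour described above can be illustrated with a standard-library sketch. This is not the library's implementation: split_paragraphs and dominant_script are hypothetical helpers, and taking the first word of each letter's Unicode character name is only an assumed heuristic for the script.

```python
import re
import unicodedata
from collections import Counter


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces (hypothetical helper)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def dominant_script(paragraph: str) -> str:
    """Return the most frequent script among the letters of `paragraph`.

    Assumed heuristic: Unicode character names begin with the script,
    e.g. "ARABIC LETTER MEEM" or "LATIN CAPITAL LETTER H".
    """
    counts = Counter()
    for ch in paragraph:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0].lower()] += 1
    return counts.most_common(1)[0][0] if counts else "unknown"


paragraphs = split_paragraphs("مرحبا.\n\nHello.")
scripts = [dominant_script(p) for p in paragraphs]
print(scripts)  # ['arabic', 'latin']
```

The sketch needs no third-party packages, which mirrors the dependency-free design noted above. The labels the real chunker reports may differ from these.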

Parameters:
config : ParagraphChunkerConfig, optional

Chunker configuration.

multilang_config : MultilangConfig, optional

Multilang feature flags. Overrides config.multilang_config when provided explicitly. Default: MultilangConfig() (script detection + semanteme count only, no raw text / trace).

Notes

User note (multilang): Blank-line paragraph splitting works for every script. Set multilang_config=MultilangConfig(include_raw_text=True, include_preprocessing_trace=True) to attach a full preprocessing audit to each chunk. To add embeddings after chunking, call attach_embedding_batch.

Developer note: Inherits MultilangMixin. Initialised via self._ml_init(). The chunk() method enriches each paragraph chunk with a MultilangChunkMeta dict stored under chunk.metadata["multilang"].

Examples

>>> chunker = ParagraphChunker()
>>> text = "First paragraph.\n\nSecond paragraph."
>>> result = chunker.chunk(text)
>>> len(result.chunks)
2
>>> chunker_ml = ParagraphChunker(
...     multilang_config=MultilangConfig(include_raw_text=True)
... )
>>> result_ml = chunker_ml.chunk("مرحبا.\n\nHello.")
>>> result_ml.chunks[0].metadata["multilang"]["script"]
'arabic'

attach_embedding(chunk, vector, *, model_name=None, model_version=None)

Return a new Chunk with an embedding attached.

Does NOT mutate the original Chunk (frozen dataclass).

Parameters:
chunk : Chunk

Any chunk produced by this chunker.

vector : list[float]

Dense embedding vector.

model_name : str, optional

Encoder model name.

model_version : str, optional

Encoder model version.

Returns:
Chunk

New frozen instance with metadata["multilang"]["embedding"] populated and metadata["embedding"] set at top level for compatibility with EmbeddedChunk.

Return type:

Chunk

Notes

User note: For batch embedding, use attach_embedding_batch, which avoids per-chunk dict copies.

Developer note: Two embedding locations are written:

  1. chunk.metadata["embedding"] — top-level key compatible with EmbeddedChunk and vector store adapters.

  2. chunk.metadata["multilang"]["embedding"] — inside the multilang bundle for model provenance tracking.
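
The non-mutating behaviour and the two embedding locations can be sketched with a frozen dataclass and dataclasses.replace. SketchChunk and attach_embedding_sketch are hypothetical stand-ins, not the library's Chunk or method, and the layout of the nested "embedding" entry (vector plus model fields) is an assumption made for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class SketchChunk:
    """Hypothetical stand-in for the library's frozen Chunk."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def attach_embedding_sketch(chunk, vector, *, model_name=None, model_version=None):
    """Return a new chunk; the input chunk and its dicts are left untouched."""
    # Copy the outer dict and the nested "multilang" dict before writing.
    metadata = dict(chunk.metadata)
    multilang = dict(metadata.get("multilang", {}))
    # Assumed layout for the provenance-carrying entry.
    multilang["embedding"] = {
        "vector": list(vector),
        "model_name": model_name,
        "model_version": model_version,
    }
    metadata["multilang"] = multilang
    # Top-level key, as described for vector store adapters.
    metadata["embedding"] = list(vector)
    return replace(chunk, metadata=metadata)


original = SketchChunk("Hello.", {"multilang": {"script": "latin"}})
embedded = attach_embedding_sketch(original, [0.1, 0.2], model_name="demo-encoder")
```

Copying both dict levels matters here: a frozen dataclass blocks attribute reassignment, but it does not stop in-place edits to a dict it holds.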

attach_embedding_batch(chunks, vectors, *, model_name=None, model_version=None)

Return a new list of chunks with embeddings attached.

Parameters:
chunks : list[Chunk]

Chunks from this chunker.

vectors : list[list[float]]

One embedding vector per chunk. Must have the same length as chunks.

model_name : str, optional

Encoder model name.

model_version : str, optional

Encoder model version.

Returns:
list[Chunk]

New list; originals are unmodified.

Raises:
ValueError

If len(chunks) != len(vectors).

Return type:

list[Chunk]
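
The length check and the "new list, originals unmodified" contract can be sketched as follows. As a simplification, each chunk is represented only by its metadata dict; attach_embedding_batch_sketch is a hypothetical helper, not the library method.

```python
def attach_embedding_batch_sketch(metadatas, vectors):
    """Pair each metadata dict with its vector and return new dicts."""
    if len(metadatas) != len(vectors):
        raise ValueError(
            f"expected one vector per chunk: got {len(metadatas)} chunks "
            f"and {len(vectors)} vectors"
        )
    # Build fresh dicts so the inputs are never modified in place.
    return [
        {**meta, "embedding": list(vec)}
        for meta, vec in zip(metadatas, vectors)
    ]


metas = [{"index": 0}, {"index": 1}]
with_vectors = attach_embedding_batch_sketch(metas, [[0.1, 0.2], [0.3, 0.4]])

try:
    attach_embedding_batch_sketch(metas, [[0.1, 0.2]])
except ValueError as exc:
    mismatch_message = str(exc)
```

Checking the lengths before the loop means a mismatch fails fast instead of zip silently dropping the unpaired items.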

chunk(text, doc_id=None, extra_metadata=None)

Split text into paragraph-level chunks.

Parameters:
text : str

Raw document text.

doc_id : str, optional

Document identifier stored in each chunk’s metadata.

extra_metadata : dict[str, Any], optional

Additional key/value pairs merged into the result metadata.

Returns:
ChunkResult

Chunks and aggregate metadata.

Raises:
TypeError

If text is not a str.

ValueError

If text is empty or whitespace-only.

Return type:

ChunkResult
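
Because empty or whitespace-only input raises ValueError and non-string input raises TypeError, callers may want to screen documents before chunking. The sketch below reproduces the two documented checks in a hypothetical validate_text helper; the error messages are illustrative, not the library's.

```python
def validate_text(text):
    """Apply the two documented input checks and return the text unchanged."""
    if not isinstance(text, str):
        raise TypeError(f"text must be a str, got {type(text).__name__}")
    if not text.strip():
        raise ValueError("text must not be empty or whitespace-only")
    return text


candidates = ["First paragraph.\n\nSecond paragraph.", "   \n\n  ", 42]
accepted, rejected = [], []
for item in candidates:
    try:
        accepted.append(validate_text(item))
    except (TypeError, ValueError) as exc:
        # Record the exception type so callers can log or report it.
        rejected.append(type(exc).__name__)
```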

chunk_batch(texts, doc_ids=None, extra_metadata=None)

Chunk a list of documents.

Parameters:
texts : list[str]

Input documents.

doc_ids : list[str], optional

Parallel document identifiers.

extra_metadata : dict[str, Any], optional

Shared metadata for every result.

Returns:
list[ChunkResult]

One result per document.

Raises:
TypeError

If texts is not a list.

ValueError

If the length of doc_ids does not match the length of texts.

Return type:

list[ChunkResult]
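
The batch contract (a list in, one result per document, optional parallel identifiers) can be sketched as below. chunk_batch_sketch and the chunk_one callable are hypothetical; the plain dict returned per document stands in for ChunkResult.

```python
def chunk_batch_sketch(texts, chunk_one, doc_ids=None):
    """Apply `chunk_one(text, doc_id)` to every document after validation."""
    if not isinstance(texts, list):
        raise TypeError(f"texts must be a list, got {type(texts).__name__}")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError(
            f"doc_ids has {len(doc_ids)} entries but texts has {len(texts)}"
        )
    # Without identifiers, pass None for every document.
    ids = doc_ids if doc_ids is not None else [None] * len(texts)
    return [chunk_one(text, doc_id) for text, doc_id in zip(texts, ids)]


def split_on_blank_lines(text, doc_id):
    """Toy per-document chunker used only for this illustration."""
    return {"doc_id": doc_id, "paragraphs": text.split("\n\n")}


results = chunk_batch_sketch(
    ["One.\n\nTwo.", "Three."],
    split_on_blank_lines,
    doc_ids=["doc-a", "doc-b"],
)
```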