ParagraphChunker
- class scikitplot.corpus.ParagraphChunker(config=None)
Split a document into paragraph-level Chunk objects.
- Parameters:
- config (ParagraphChunkerConfig, optional)
Chunker configuration.
Examples
>>> from scikitplot.corpus import ParagraphChunker
>>> chunker = ParagraphChunker()
>>> text = "First paragraph.\n\nSecond paragraph."
>>> result = chunker.chunk(text)
>>> len(result.chunks)
2
- chunk(text, doc_id=None, extra_metadata=None)
Split text into paragraph-level chunks.
- Parameters:
- text (str)
Raw document text.
- doc_id (str, optional)
Document identifier stored in each chunk’s metadata.
- extra_metadata (dict[str, Any], optional)
Additional key/value pairs merged into the result metadata.
- Returns:
- ChunkResult
Chunks and aggregate metadata.
- Raises:
- TypeError
If text is not a str.
- ValueError
If text is empty or whitespace-only.
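The contract described for chunk (type and emptiness checks, per-chunk doc_id, merged extra_metadata) can be illustrated with a small standard-library stand-in. This is only a sketch, not the scikitplot implementation: the function name chunk_sketch, the blank-line splitting rule, the plain-dict result layout, and the n_chunks key are all assumptions made for illustration.

```python
import re
from typing import Any, Optional


def chunk_sketch(
    text: str,
    doc_id: Optional[str] = None,
    extra_metadata: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical stand-in for the documented contract, not the library code."""
    # Mirror the documented Raises section.
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    if not text.strip():
        raise ValueError("text must not be empty or whitespace-only")

    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]

    # Assumption: each chunk is a dict that carries doc_id in its own metadata.
    chunks = [{"text": p, "metadata": {"doc_id": doc_id}} for p in paragraphs]

    # Assumption: extra_metadata is merged into result-level metadata.
    metadata = {"n_chunks": len(chunks), **(extra_metadata or {})}
    return {"chunks": chunks, "metadata": metadata}


result = chunk_sketch("First paragraph.\n\nSecond paragraph.", doc_id="doc-1")
print(len(result["chunks"]))  # 2
```

The real ChunkResult exposes its chunks as an attribute (result.chunks); the dictionary used here only keeps the sketch dependency-free.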
- chunk_batch(texts, doc_ids=None, extra_metadata=None)
Chunk a list of documents.
- Parameters:
- texts (list[str])
Input documents.
- doc_ids (list[str], optional)
Document identifiers, one per entry in texts and in the same order.
- extra_metadata (dict[str, Any], optional)
Shared metadata for every result.
- Returns:
- list[ChunkResult]
One result per document.
- Raises:
- TypeError
If texts is not a list.
- ValueError
If the length of doc_ids does not match the length of texts.
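The batch contract (texts must be a list, doc_ids must align with texts, one result per document) can be sketched in the same spirit. Again, this is an illustrative stand-in under stated assumptions rather than the library's code: chunk_batch_sketch, the "\n\n" splitting rule, and the dict-based results are hypothetical.

```python
from typing import Optional


def chunk_batch_sketch(
    texts: list[str],
    doc_ids: Optional[list[str]] = None,
) -> list[dict]:
    """Hypothetical stand-in for the documented batch contract."""
    # Mirror the documented Raises section.
    if not isinstance(texts, list):
        raise TypeError("texts must be a list")
    if doc_ids is not None and len(doc_ids) != len(texts):
        raise ValueError("doc_ids must have the same length as texts")

    # Pair every document with its identifier, or None when no ids are given.
    ids = doc_ids if doc_ids is not None else [None] * len(texts)

    results = []
    for text, doc_id in zip(texts, ids):
        # Assumption: a blank line ("\n\n") separates paragraphs.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        results.append({"doc_id": doc_id, "chunks": paragraphs})
    return results


batch = chunk_batch_sketch(["A.\n\nB.", "C."], doc_ids=["d1", "d2"])
print([len(r["chunks"]) for r in batch])  # [2, 1]
```

Validating the lengths up front means a mismatch is reported before any document is processed, instead of identifiers being silently dropped by zip.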