normalize_extractor_output
- scikitplot.corpus.normalize_extractor_output(raw, *, source_type=SourceType.UNKNOWN, section_type=SectionType.TEXT)
Coerce an extractor return value to a list of raw chunk dicts.
Every dict in the returned list is guaranteed to contain a "text" key. Missing "section_type" and "source_type" keys are filled with the supplied defaults.
- Parameters:
- raw : str, list[str], dict, or list[dict]
Value returned by a user-supplied extractor callable. Supported types:
str: Entire resource as a single text chunk.
list[str]: Multiple text chunks. All elements must be str.
dict: Single chunk with text and optional metadata. Must contain a "text" key whose value is a str.
list[dict]: Multiple chunks. Every element must be a dict with a "text" key.
- source_type : SourceType, optional
Default source type injected into chunks that do not specify "source_type". Default: UNKNOWN.
- section_type : SectionType, optional
Default section type injected into chunks that do not specify "section_type". Default: TEXT.
- Returns:
- list of dict
Normalised list of raw chunk dicts, each containing at minimum {"text": str}. The list may be empty when raw is an empty list.
- Raises:
- TypeError
If raw is not one of the four supported types, or if a list contains a mix of str and non-str elements, or if a list element is neither str nor dict.
- ValueError
If any dict in raw is missing a "text" key, or if the "text" value is not a str.
- Parameters:
raw (Any)
source_type (SourceType)
section_type (SectionType)
- Return type:
Notes
This function is intentionally pure (no side effects, deterministic). It is called by CustomReader and by the custom_extractor dispatch branches of PDFReader, ImageReader, AudioReader, and VideoReader.
Examples
Single string → one-element list:
>>> from scikitplot.corpus._readers._custom import normalize_extractor_output
>>> normalize_extractor_output("Hello world")
[{'text': 'Hello world', 'section_type': 'text', 'source_type': 'unknown'}]
List of strings → list of chunk dicts:
>>> normalize_extractor_output(["Page one", "Page two"])
[{'text': 'Page one', ...}, {'text': 'Page two', ...}]
Dict with extra metadata preserved:
>>> normalize_extractor_output({"text": "Hello", "page_number": 0})
[{'text': 'Hello', 'page_number': 0, 'section_type': 'text', 'source_type': 'unknown'}]
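The coercion and validation rules described above can be illustrated with a small stand-alone sketch. This is not the library's implementation: `normalize_sketch` is a hypothetical stand-in written only from the documented contract, and it uses the plain strings "unknown" and "text" in place of the SourceType and SectionType enum members (an assumption based on the values shown in the examples). Copying each input dict rather than mutating it is likewise an assumption, chosen to match the "no side effects" note.

```python
from typing import Any


def normalize_sketch(raw: Any, *, source_type: str = "unknown",
                     section_type: str = "text") -> list:
    """Hypothetical stand-in that mirrors the documented contract."""
    # A bare str or dict is treated as a single chunk.
    if isinstance(raw, (str, dict)):
        items = [raw]
    elif isinstance(raw, list):
        items = raw
    else:
        raise TypeError(f"unsupported extractor output: {type(raw).__name__}")

    # Documented rule: a list may not mix str and non-str elements.
    n_str = sum(isinstance(item, str) for item in items)
    if 0 < n_str < len(items):
        raise TypeError("list mixes str and non-str elements")

    chunks = []
    for item in items:
        if isinstance(item, str):
            chunk = {"text": item}
        elif isinstance(item, dict):
            if "text" not in item:
                raise ValueError('chunk dict is missing a "text" key')
            if not isinstance(item["text"], str):
                raise ValueError('"text" value must be a str')
            chunk = dict(item)  # assumption: copy so the caller's dict is untouched
        else:
            raise TypeError("list element is neither str nor dict")
        # Defaults are filled in only where the chunk does not already set them.
        chunk.setdefault("section_type", section_type)
        chunk.setdefault("source_type", source_type)
        chunks.append(chunk)
    return chunks


# A chunk that already sets "section_type" keeps its own value.
print(normalize_sketch({"text": "Intro", "section_type": "title"}))

# A mixed list is rejected with TypeError; a dict without "text" with ValueError.
for bad in (["a", {"text": "b"}], {"page_number": 0}):
    try:
        normalize_sketch(bad)
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, exc)
```

In the sketch, the mixed-list check runs before any per-element validation, so a list such as ["a", {"text": "b"}] fails with TypeError rather than being partially normalised. The order in which the real function performs these checks is not stated in the documentation.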