normalize_extractor_output#

scikitplot.corpus.normalize_extractor_output(raw, *, source_type=SourceType.UNKNOWN, section_type=SectionType.TEXT)[source]#

Coerce an extractor return value to a list of raw chunk dicts.

Every dict in the returned list is guaranteed to contain a "text" key. Missing "section_type" and "source_type" keys are filled with the supplied defaults.

Parameters:
raw : str, list[str], dict, or list[dict]

Value returned by a user-supplied extractor callable. Supported types:

str

Entire resource as a single text chunk.

list[str]

Multiple text chunks. All elements must be str.

dict

Single chunk with text and optional metadata. Must contain a "text" key whose value is a str.

list[dict]

Multiple chunks. Every element must be a dict with a "text" key.

source_type : SourceType, optional

Default source type injected into chunks that do not specify "source_type". Default: UNKNOWN.

section_type : SectionType, optional

Default section type injected into chunks that do not specify "section_type". Default: TEXT.

Returns:
list of dict

Normalized list of raw chunk dicts, each containing at minimum {"text": str}. The list may be empty when raw is an empty list.

Raises:
TypeError

If raw is not one of the four supported types, or if a list contains a mix of str and non-str elements, or if a list element is neither str nor dict.

ValueError

If any dict in raw is missing a "text" key, or if the "text" value is not a str.

Return type:

list[dict[str, Any]]

Notes

This function is intentionally pure (no side effects, deterministic). It is called by CustomReader and by the custom_extractor dispatch branches of PDFReader, ImageReader, AudioReader, and VideoReader.
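The dispatch and default-injection behaviour described above can be sketched roughly as follows. This is a simplified, hypothetical stand-in (`normalize_sketch` is not the library's implementation), and it substitutes plain strings for the SourceType and SectionType enum defaults:

```python
from typing import Any


def normalize_sketch(
    raw: Any, *, source_type: str = "unknown", section_type: str = "text"
) -> list[dict[str, Any]]:
    """Sketch of the normalization contract: coerce raw into a list of chunk dicts."""
    if isinstance(raw, str):
        # Entire resource as a single text chunk.
        chunks = [{"text": raw}]
    elif isinstance(raw, dict):
        # Single chunk with optional metadata; copy to avoid mutating the input.
        chunks = [dict(raw)]
    elif isinstance(raw, list):
        if all(isinstance(e, str) for e in raw):
            chunks = [{"text": e} for e in raw]
        elif all(isinstance(e, dict) for e in raw):
            chunks = [dict(e) for e in raw]
        else:
            # Mixed or unsupported element types.
            raise TypeError("list elements must be all str or all dict")
    else:
        raise TypeError(f"unsupported extractor return type: {type(raw).__name__}")

    for chunk in chunks:
        # Every chunk must carry a str "text" value.
        if not isinstance(chunk.get("text"), str):
            raise ValueError('every chunk must contain a str "text" value')
        # Fill in defaults only where the extractor did not supply them.
        chunk.setdefault("source_type", source_type)
        chunk.setdefault("section_type", section_type)
    return chunks
```

Note the function is pure in the sense stated above: it copies input dicts rather than mutating them, and its output depends only on its arguments.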

Examples

Single string → one-element list:

>>> from scikitplot.corpus._readers._custom import normalize_extractor_output
>>> normalize_extractor_output("Hello world")
[{'text': 'Hello world', 'section_type': 'text', 'source_type': 'unknown'}]

List of strings → list of chunk dicts:

>>> normalize_extractor_output(["Page one", "Page two"])
[{'text': 'Page one', ...}, {'text': 'Page two', ...}]

Dict with extra metadata preserved:

>>> normalize_extractor_output({"text": "Hello", "page_number": 0})
[{'text': 'Hello', 'page_number': 0, 'section_type': 'text', 'source_type': 'unknown'}]