AudioReader

class scikitplot.corpus.AudioReader(input_path, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, custom_extractor=None, custom_extractor_kwargs=<factory>, transcribe=False, whisper_model='base', classify=False, classifier=None, segment_duration=5.0, segment_overlap=1.0, extract_features=False, txt_as_single_chunk=False, max_file_bytes=5368709120)

Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.

Three extraction paths are attempted in order:

Companion file — zero-dependency, instant. The reader looks for .lrc, .srt, .vtt, or .txt files with the same stem as the audio. If found, the companion is parsed; transcription is never invoked.
Whisper transcription — opt-in. Enable with transcribe=True. Requires faster-whisper or openai-whisper.
Audio classification — opt-in. Enable with classify=True and provide a classifier callable. For non-speech audio (animal sounds, instruments, environmental sounds).
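The companion-file lookup described above can be sketched with pathlib alone. This is an illustration only: find_companion and COMPANION_EXTENSIONS are hypothetical names, and the precedence among the four extensions is an assumption, since this page lists them without stating which one wins.

```python
from pathlib import Path
from typing import Optional

# Assumed precedence; the documentation lists these extensions
# but does not say which one wins when several exist.
COMPANION_EXTENSIONS = (".lrc", ".srt", ".vtt", ".txt")


def find_companion(audio_path: Path) -> Optional[Path]:
    """Return the first sibling file that shares the audio file's stem."""
    for ext in COMPANION_EXTENSIONS:
        candidate = audio_path.with_suffix(ext)
        if candidate.is_file():
            return candidate
    return None
```

When this returns None, the reader would move on to transcription or classification only if those options are enabled.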

Parameters:

input_pathpathlib.Path

Path to the audio file.

transcribebool, optional

When True, fall back to Whisper ASR if no companion file is found. When False (default), a missing companion file causes the reader to yield no chunks (unless classify=True).

whisper_modelstr, optional

Whisper model size. One of "tiny", "base", "small", "medium", "large", "large-v2", "large-v3". Default: "base".

classifybool, optional

When True, apply audio classification using the classifier callable. Can be combined with transcribe: transcription produces speech text, classification produces non-speech labels. Default: False.

classifiercallable or None, optional

A callable for audio classification. Signature:

classifier(audio_path: Path, offset: float, duration: float)
    -> list[dict[str, Any]]

Must return dicts with "label" (str) and "confidence" (float). May include "text" (str). Required when classify=True.
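A minimal sketch of a callable that satisfies this contract, together with a small checker for the return shape. Both constant_classifier and check_classifier_output are hypothetical helpers written for illustration; they are not part of scikitplot.

```python
from pathlib import Path
from typing import Any


def constant_classifier(
    audio_path: Path, offset: float, duration: float
) -> list[dict[str, Any]]:
    """Toy classifier: ignores the audio and always reports one label."""
    return [{"label": "bird", "confidence": 0.95, "text": "bird chirping"}]


def check_classifier_output(results: list[dict[str, Any]]) -> None:
    """Raise ValueError if a result violates the documented contract."""
    for item in results:
        if not isinstance(item.get("label"), str):
            raise ValueError("each result needs a str 'label'")
        if not isinstance(item.get("confidence"), float):
            raise ValueError("each result needs a float 'confidence'")
        if "text" in item and not isinstance(item["text"], str):
            raise ValueError("'text', when present, must be a str")


# The toy classifier passes the check for an arbitrary 3-second window.
check_classifier_output(constant_classifier(Path("forest_sounds.wav"), 0.0, 3.0))
```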

segment_durationfloat, optional

Duration in seconds of each classification window when classify=True. Default: 5.0.

segment_overlapfloat, optional

Overlap in seconds between consecutive classification windows. Default: 1.0.
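One plausible way segment_duration and segment_overlap interact is a sliding window whose step is the duration minus the overlap. The sketch below assumes that layout (the reader's actual windowing is not shown on this page) and uses a hypothetical helper name. It also shows why the overlap must be smaller than the duration: otherwise the step would be zero or negative.

```python
def classification_windows(total_seconds, segment_duration=5.0, segment_overlap=1.0):
    """Return (offset, duration) pairs that cover total_seconds of audio."""
    if segment_duration <= segment_overlap:
        raise ValueError("segment_duration must be greater than segment_overlap")
    step = segment_duration - segment_overlap
    windows = []
    offset = 0.0
    while offset < total_seconds:
        # The final window is shortened so it does not run past the end.
        windows.append((offset, min(segment_duration, total_seconds - offset)))
        offset += step
    return windows


print(classification_windows(12.0))
# [(0.0, 5.0), (4.0, 5.0), (8.0, 4.0)]
```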

extract_featuresbool, optional

When True, extract audio features (MFCCs, chroma, spectral) for each segment and store them in metadata. Requires librosa. Default: False.

txt_as_single_chunkbool, optional

When a .txt companion is found, yield the entire file as one chunk (True) or one chunk per non-empty line (False). Default: False.
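The two modes can be illustrated with a small stand-alone function. txt_chunks is a hypothetical name, and stripping surrounding whitespace from each line is an assumption about the per-line mode.

```python
def txt_chunks(text: str, as_single_chunk: bool = False) -> list[str]:
    """Split a .txt companion into chunks, mirroring txt_as_single_chunk."""
    if as_single_chunk:
        return [text]
    # One chunk per non-empty line (whitespace-only lines are dropped).
    return [line.strip() for line in text.splitlines() if line.strip()]


lyrics = "First line\n\nSecond line\n"
print(txt_chunks(lyrics))                        # ['First line', 'Second line']
print(len(txt_chunks(lyrics, as_single_chunk=True)))  # 1
```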

max_file_bytesint, optional

Maximum file size in bytes. Default: 5 GB.

chunkerChunkerBase or None, optional

Inherited from DocumentReader.

filter_FilterBase or None, optional

Inherited from DocumentReader.

filename_overridestr or None, optional

Inherited from DocumentReader.

default_languagestr or None, optional

ISO 639-1 language code. Used as language hint for Whisper. Default: None (auto-detect).

Attributes:

file_typeslist of str: Class variable. Registered extensions: [".mp3", ".wav", ".flac", ".ogg", ".m4a", ".wma", ".aac", ".aiff", ".opus", ".wv"].

Raises:

ValueError: If whisper_model is not a valid Whisper model size.
ValueError: If classify=True but classifier is None.
ValueError: If segment_duration <= segment_overlap.
ImportError: If transcribe=True and no Whisper backend is installed.

Parameters:

input_path (Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_uri (str | None)
source_provenance (dict[str, Any])
custom_extractor (Any | None)
custom_extractor_kwargs (dict[str, Any])
transcribe (bool)
whisper_model (str)
classify (bool)
classifier (Callable[[...], list[dict[str, Any]]] | None)
segment_duration (float)
segment_overlap (float)
extract_features (bool)
txt_as_single_chunk (bool)
max_file_bytes (int)

See also

scikitplot.corpus._readers.VideoReader: Video/subtitle reader.
scikitplot.corpus._readers.TextReader: Plain-text file reader.

Notes

Scenario 11 — Beethoven MP3 + Music Notes Book:

Build a corpus of Beethoven recordings and a book of music notes. Use Whisper ASR to transcribe audio segments (or provide companion .lrc files with lyrics). Each audio segment carries timecode_start and timecode_end for temporal alignment. Use SimilarityIndex with MatchMode.SEMANTIC to find which book passages match which audio segments — like Shazam for text-to-audio alignment.

With extract_features=True, chroma features capture harmonic content that correlates with musical notation in the book.

Scenario 12 — Animal Sounds + Children’s Book (Bremen):

Build a corpus of animal sound recordings using classify=True with a classifier that labels sounds ("bird", "donkey", "cat", "dog", "rooster"). Each chunk carries metadata["audio_label"] and a text description. Use SimilarityIndex with MatchMode.KEYWORD to match labels against sentences in The Town Musicians of Bremen.

Chunk metadata keys (companion):

"text" — lyrics line or transcript text
"section_type" — SectionType.LYRICS (LRC) or SectionType.TEXT (SRT/VTT/TXT)
"timecode_start" — start time in seconds (float), if available
"timecode_end" — end time in seconds (float), if available
"source_type" — SourceType.AUDIO
"companion_format" — "lrc"/"srt"/"vtt"/"txt"

Chunk metadata keys (transcription):

"text" — Whisper-generated transcription
"section_type" — SectionType.TRANSCRIPT
"timecode_start" / "timecode_end" — segment timecodes
"confidence" — ASR confidence (when available)
"source_type" — SourceType.AUDIO

Chunk metadata keys (classification):

"text" — label text or description
"section_type" — SectionType.TEXT
"timecode_start" / "timecode_end" — window timecodes
"confidence" — classification confidence
"audio_label" — classification label string (in metadata)
"source_type" — SourceType.AUDIO

Examples

Companion LRC lyrics:

>>> from pathlib import Path
>>> from scikitplot.corpus import AudioReader
>>> reader = AudioReader(input_path=Path("beethoven_moonlight.mp3"))
>>> docs = list(reader.get_documents())
>>> for d in docs[:3]:
...     print(f"{d.timecode_start:.1f}s: {d.text[:50]}")
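For context on where timecode_start comes from in an LRC companion, here is a rough stand-alone parser for the common [mm:ss.xx] line format. It is a sketch, not the reader's parser: parse_lrc is a hypothetical name, and deriving timecode_end from the next line's start time is an assumption.

```python
import re

# Matches "[mm:ss]" or "[mm:ss.xx]" followed by the lyric text.
LRC_LINE = re.compile(r"^\[(\d+):(\d{2}(?:\.\d+)?)\](.*)$")


def parse_lrc(text: str) -> list[dict]:
    """Turn LRC text into chunk dicts keyed like the companion metadata."""
    chunks = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match is None:
            continue  # skips blank lines and ID tags such as [ar:Artist]
        minutes, seconds, lyric = match.groups()
        lyric = lyric.strip()
        if not lyric:
            continue
        chunks.append({
            "text": lyric,
            "timecode_start": int(minutes) * 60 + float(seconds),
            "companion_format": "lrc",
        })
    # Assumed convention: a line ends where the next one starts.
    for current, following in zip(chunks, chunks[1:]):
        current["timecode_end"] = following["timecode_start"]
    return chunks
```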

Whisper transcription:

>>> reader = AudioReader(
...     input_path=Path("lecture.mp3"),
...     transcribe=True,
...     whisper_model="small",
...     default_language="en",
... )
>>> docs = list(reader.get_documents())

Audio classification (animal sounds):

>>> def my_classifier(path, offset, duration):
...     # Your classification model here
...     return [{"label": "bird", "confidence": 0.95, "text": "bird chirping"}]
>>> reader = AudioReader(
...     input_path=Path("forest_sounds.wav"),
...     classify=True,
...     classifier=my_classifier,
...     segment_duration=3.0,
... )
>>> docs = list(reader.get_documents())

chunker: ChunkerBase | None = None: Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).

classifier: Callable[[...], list[dict[str, Any]]] | None = None

Audio classification callable. Signature:

classifier(audio_path: Path, offset: float, duration: float)
    -> list[dict[str, Any]]

Required when classify=True.

classify: bool = False#: Enable audio classification via classifier callable.

classmethod create(*input_path, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)

Instantiate the appropriate reader for one or more sources.

Accepts any mix of file paths, URL strings, and pathlib.Path objects — in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.

Parameters:

*input_pathstr or pathlib.Path

One or more source paths or URL strings. Each element is classified independently:

str matching ^https?:// (case-insensitive): treated as a URL and routed to from_url. Must be passed as a plain str, not wrapped in pathlib.Path; wrapping collapses the double slash (https:// → https:/) and breaks URL detection.
str not matching the URL pattern: treated as a local file path and converted to pathlib.Path internally.
pathlib.Path: always treated as a local file path and dispatched by extension via the reader registry.

Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents in order.
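The classification rule, and the reason a URL must not be wrapped in pathlib.Path, can be checked with the standard library. is_url is a hypothetical helper that applies the documented pattern; PurePosixPath is used so the demonstration behaves the same on every platform.

```python
import re
from pathlib import PurePosixPath

URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def is_url(source) -> bool:
    """True only for plain strings that start with http:// or https://."""
    return isinstance(source, str) and URL_PATTERN.match(source) is not None


url = "HTTPS://example.com/article"
wrapped = str(PurePosixPath(url))  # pathlib collapses the double slash

print(is_url(url))      # True
print(wrapped)          # HTTPS:/example.com/article
print(is_url(wrapped))  # False
```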

chunkerChunkerBase or None, optional

Chunker injected into every reader. Default: None.

filter_FilterBase or None, optional

Filter injected into every reader. Default: None (DefaultFilter).

filename_overridestr or None, optional

Override the input_path label. Only applied when input_path contains exactly one source. Default: None.

default_languagestr or None, optional

ISO 639-1 language code applied to all sources. Default: None.

source_typeSourceType, list[SourceType or None], or None, optional

Semantic label for the source kind. When input_path has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean “infer from extension / URL”. A single value is broadcast to all sources. Default: None.
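The broadcast rule can be sketched as follows. broadcast_source_type is a hypothetical helper, plain strings stand in for SourceType members, and raising ValueError on a length mismatch is an assumption; this page only says the list should have the same length as input_path.

```python
def broadcast_source_type(source_type, n_sources: int) -> list:
    """Expand source_type into one entry per source."""
    if isinstance(source_type, list):
        if len(source_type) != n_sources:
            raise ValueError(
                f"expected {n_sources} source_type entries, got {len(source_type)}"
            )
        return list(source_type)  # None entries mean "infer later"
    # A single value (or None) is repeated for every source.
    return [source_type] * n_sources


print(broadcast_source_type("podcast", 3))
# ['podcast', 'podcast', 'podcast']
print(broadcast_source_type(["podcast", None], 2))
# ['podcast', None]
```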

source_titlestr or None, optional

Title propagated into every yielded document. Default: None.

source_authorstr or None, optional

Author propagated into every yielded document. Default: None.

source_datestr or None, optional

ISO 8601 publication date. Default: None.

collection_idstr or None, optional

Corpus collection identifier. Default: None.

doistr or None, optional

Digital Object Identifier (file sources only). Default: None.

isbnstr or None, optional

ISBN (file sources only). Default: None.

**kwargsAny

Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).

Returns:

DocumentReader: A single reader when input_path has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when input_path has more than one element — it implements the same get_documents() interface and chains documents from all sub-readers in order.

Raises:

ValueError: If input_path is empty, or if a source URL is invalid, or if no reader is registered for a file’s extension.
TypeError: If any element of input_path is not a str or pathlib.Path.

Parameters:

input_path (str | pathlib.Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | list[SourceType | None] | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)

Return type:

Self

Notes

URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly — just pass the URL string to create.

Per-source source_type: When passing multiple input_path with different media types, supply a list:

DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)

Reader-specific kwargs (forwarded via **kwargs):

transcribe=True, whisper_model="small" → AudioReader, VideoReader
backend="easyocr" → ImageReader
prefer_backend="pypdf" → PDFReader
classify=True, classifier=fn → AudioReader

Examples

Single file (backward-compatible):

>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())

URL string auto-detected — no from_url() call required:

>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )

Mixed multi-source batch:

>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three

custom_extractor: Any | None = None

User-supplied extraction callable that replaces get_raw_chunks entirely for this reader instance.

When set, _iter_raw_chunks calls custom_extractor(self.input_path, **custom_extractor_kwargs) and normalises the return value through normalize_extractor_output. The built-in get_raw_chunks implementation is not called.

This hook is available on every reader class (ALTOReader, TextReader, PDFReader, ImageReader, etc.) without any subclassing — simply pass a callable at construction time.

Callable contract

def my_extractor(path: pathlib.Path, **kwargs) -> ExtractorOutput

where ExtractorOutput is str, list[str], dict, or list[dict] — the same contract as CustomReader.
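To make the four accepted shapes concrete, the sketch below maps each of them to a list of chunk dicts. It is a simplified stand-in written for illustration; the library's own normalize_extractor_output may behave differently (for example in validation or key handling).

```python
from typing import Any, Union

ExtractorOutput = Union[str, list[str], dict[str, Any], list[dict[str, Any]]]


def normalize_output_sketch(output: ExtractorOutput) -> list[dict[str, Any]]:
    """Coerce any accepted extractor return value into a list of dicts."""
    if isinstance(output, str):
        return [{"text": output}]
    if isinstance(output, dict):
        return [output]
    if isinstance(output, list):
        return [
            {"text": item} if isinstance(item, str) else item
            for item in output
        ]
    raise TypeError(f"unsupported extractor output: {type(output).__name__}")


print(normalize_output_sketch("one block"))
# [{'text': 'one block'}]
print(normalize_output_sketch(["a", {"text": "b", "page_number": 1}]))
# [{'text': 'a'}, {'text': 'b', 'page_number': 1}]
```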

Examples

Override PDF extraction with pdfplumber for a single reader:

import pdfplumber
from pathlib import Path
from scikitplot.corpus._base import DocumentReader

def plumber_fn(path, **kw):
    with pdfplumber.open(path) as pdf:
        return [{"text": p.extract_text() or "", "page_number": i}
                for i, p in enumerate(pdf.pages)]

reader = DocumentReader.create(
    Path("report.pdf"),
    custom_extractor=plumber_fn,
)
docs = list(reader.get_documents())

custom_extractor_kwargs: dict[str, Any]

Extra keyword arguments forwarded to custom_extractor on every invocation. Merged into the call as **custom_extractor_kwargs.

Examples

reader = DocumentReader.create(
    Path("report.pdf"),
    custom_extractor=my_fn,
    custom_extractor_kwargs={"password": "s3cret", "pages": [0, 1, 2]},
)

default_language: str | None = None#: ISO 639-1 language code to assign when the source has no language info.

extract_features: bool = False#: Extract audio features (MFCCs, chroma) via librosa.

property file_name: str

Effective filename used in document labels.

Returns filename_override when set; otherwise returns input_path.name.

Returns:

str: File name string (not a full path).

Examples

>>> from pathlib import Path
>>> reader = TextReader(input_path=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'

file_type: ClassVar[str | None] = None

Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".

For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.

file_types: ClassVar[list[str] | None] = ['.mp3', '.wav', '.flac', '.ogg', '.m4a', '.wma', '.aac', '.aiff', '.opus', '.wv']

List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions — e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].

When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
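The precedence rule between file_type and file_types, and the extension registry built from them, can be pictured with a toy model. All class and function names below are hypothetical stand-ins, and raising TypeError when neither attribute is set is an assumption. The last function mirrors the documented behaviour of returning a shallow copy of the registry.

```python
class ReaderBase:
    file_type = None
    file_types = None


class TextReaderSketch(ReaderBase):
    file_type = ".txt"


class ImageReaderSketch(ReaderBase):
    file_type = ".ignored"          # ignored because file_types is set
    file_types = [".png", ".jpg"]


def extensions_for(cls) -> list:
    """file_types wins over file_type when both are defined."""
    if cls.file_types:
        return list(cls.file_types)
    if cls.file_type:
        return [cls.file_type]
    raise TypeError(f"{cls.__name__} defines neither file_type nor file_types")


_REGISTRY = {
    ext: cls
    for cls in (TextReaderSketch, ImageReaderSketch)
    for ext in extensions_for(cls)
}


def subclass_by_type_sketch() -> dict:
    """Return a shallow copy so callers cannot mutate the registry."""
    return dict(_REGISTRY)
```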

filename_override: str | None = None#: Override for the input_path label in generated documents.

filter_: FilterBase | None = None#: Filter applied after chunking. None triggers the DefaultFilter.

classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)

Build a _MultiSourceReader from a manifest file.

The manifest is a text file with one source per line — either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.

Parameters:

manifest_pathstr or pathlib.Path

Path to the manifest file. Supported formats:

.txt / .manifest — one source per line.
.json — a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).

chunkerChunkerBase or None, optional

Chunker applied to all sources. Default: None.

filter_FilterBase or None, optional

Filter applied to all sources. Default: None.

default_languagestr or None, optional

ISO 639-1 language code. Default: None.

source_typeSourceType or None, optional

Override source type for all sources. Default: None.

source_titlestr or None, optional

Override title for all sources. Default: None.

source_authorstr or None, optional

Override author for all sources. Default: None.

source_datestr or None, optional

Override date for all sources. Default: None.

collection_idstr or None, optional

Collection identifier. Default: None.

doistr or None, optional

DOI override. Default: None.

isbnstr or None, optional

ISBN override. Default: None.

encodingstr, optional

Text encoding for .txt manifests. Default: "utf-8".

**kwargsAny

Forwarded to each reader constructor.

Returns:

_MultiSourceReader: Multi-source reader chaining all manifest entries.

Raises:

ValueError: If manifest_path does not exist or is empty after filtering blank and comment lines.
ValueError: If the manifest format is not recognised.

Parameters:

manifest_path (str | Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
encoding (str)
kwargs (Any)

Return type:

_MultiSourceReader

Notes

Per-entry overrides in JSON manifests: each entry may be an object with:

{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024"
}

String-level source_type values are coerced via SourceType(value) and an invalid value raises ValueError.
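A JSON manifest that mixes plain strings and per-entry objects could be read along these lines. parse_json_manifest is a hypothetical helper using only the json module; it does not perform the SourceType coercion described above, and raising ValueError for malformed entries is an assumption.

```python
import json


def parse_json_manifest(text: str) -> list[dict]:
    """Normalise a JSON manifest into a list of entry dicts."""
    entries = []
    for item in json.loads(text):
        if isinstance(item, str):
            entries.append({"source": item})
        elif isinstance(item, dict) and "source" in item:
            entries.append(dict(item))
        else:
            raise ValueError(f"manifest entry needs a 'source': {item!r}")
    return entries


manifest = """
[
    "scan.jpg",
    {"source": "https://example.com/report.pdf", "source_type": "research"}
]
"""
for entry in parse_json_manifest(manifest):
    print(entry["source"], entry.get("source_type"))
```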

Examples

Text manifest sources.txt:

# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg

Usage:

reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
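The text-manifest rules (one source per line, blank lines and lines starting with # ignored, an empty result rejected) amount to a few lines of Python. parse_text_manifest is a hypothetical helper shown only to make the rules concrete.

```python
def parse_text_manifest(text: str) -> list[str]:
    """Return the sources listed in a .txt manifest."""
    sources = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        sources.append(entry)
    if not sources:
        raise ValueError("manifest is empty after filtering")
    return sources


sample = "# WHO corpus\n\nhttps://youtu.be/rwPISgZcYIk\nscan.jpg\n"
print(parse_text_manifest(sample))
# ['https://youtu.be/rwPISgZcYIk', 'scan.jpg']
```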

classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)

Instantiate the appropriate reader for a URL source.

Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.

Parameters:

urlstr: Full URL string. Must start with http:// or https://.
chunkerChunkerBase or None, optional: Chunker to inject. Default: None.
filter_FilterBase or None, optional: Filter to inject. Default: None (DefaultFilter).
filename_overridestr or None, optional: Override for the input_path label. Default: None.
default_languagestr or None, optional: ISO 639-1 language code. Default: None.
source_typeSourceType or None, optional: Semantic label for the source. Default: None.
source_titlestr or None, optional: Title of the source work. Default: None.
source_authorstr or None, optional: Primary author. Default: None.
source_datestr or None, optional: Publication date in ISO 8601 format. Default: None.
collection_idstr or None, optional: Corpus collection identifier. Default: None.
doistr or None, optional: Digital Object Identifier. Default: None.
isbnstr or None, optional: International Standard Book Number. Default: None.
**kwargsAny: Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).

Returns:

DocumentReader: YouTubeReader or WebReader instance.

Raises:

ValueError: If url does not start with http:// or https://.
ImportError: If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).

Parameters:

url (str)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)

Return type:

Self

Notes

Prefer create for new code. Passing a URL string to create automatically calls from_url, so you rarely need to call from_url directly.

Examples

>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())

>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=rwPISgZcYIk")
>>> docs = list(yt.get_documents())
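The dispatch between the two URL readers can be pictured as a hostname check. How the library actually recognises YouTube URLs is not documented on this page, so the host list below is an assumption, and pick_url_reader is a hypothetical helper that returns a class name rather than a reader instance.

```python
from urllib.parse import urlparse

# Assumed set of YouTube hosts, for illustration only.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def pick_url_reader(url: str) -> str:
    """Return the name of the reader class that would handle url."""
    if not url.lower().startswith(("http://", "https://")):
        raise ValueError(f"not an http(s) URL: {url!r}")
    host = (urlparse(url).hostname or "").lower()
    return "YouTubeReader" if host in YOUTUBE_HOSTS else "WebReader"


print(pick_url_reader("https://www.youtube.com/watch?v=rwPISgZcYIk"))  # YouTubeReader
print(pick_url_reader("https://en.wikipedia.org/wiki/Python"))         # WebReader
```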

get_documents()

Yield validated CorpusDocument instances for the input file.

Orchestrates the full per-file pipeline:

validate_input — fail fast if file is missing.
get_raw_chunks — format-specific text extraction.
Chunker (if set) — sub-segments each raw block.
CorpusDocument construction with validated schema.
Filter — discards noise documents.

Yields:

CorpusDocument: Validated documents that passed the filter.

Raises:

ValueError: If the input file is missing or the format is invalid.

Return type:

Generator[CorpusDocument, None, None]

Notes

The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (input_path, chunk_index) is a unique key within one reader run.

Omitted-document statistics are logged at INFO level after processing each file.
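The chunk_index guarantee can be illustrated with a small generator that keeps a single counter across raw chunks and their sub-chunks. This is a simplified model of the pipeline, not the library code: iter_indexed_chunks is a hypothetical name and the chunker is modelled as any callable that maps a string to a list of strings.

```python
def iter_indexed_chunks(raw_chunks, chunker=None):
    """Yield chunk dicts with a counter that never resets within one run."""
    chunk_index = 0
    for raw in raw_chunks:
        # Without a chunker, each raw chunk is passed through as-is.
        pieces = chunker(raw["text"]) if chunker else [raw["text"]]
        for piece in pieces:
            yield {**raw, "text": piece, "chunk_index": chunk_index}
            chunk_index += 1


raw = [{"text": "first second"}, {"text": "third"}]
for doc in iter_indexed_chunks(raw, chunker=str.split):
    print(doc["chunk_index"], doc["text"])
# 0 first
# 1 second
# 2 third
```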

Examples

>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True

get_raw_chunks()

Attempts companion detection first. Falls back to Whisper only when transcribe=True and no companion was found. Classification via classify=True runs independently (can combine with transcription).

Yields:

dict: Keys always include "text" and "section_type". May include "timecode_start", "timecode_end", "confidence", "source_type", and format-specific keys.

Raises:

ValueError: If the file exceeds max_file_bytes.
ImportError: If transcribe=True and Whisper is not installed.

Return type:

Generator[dict[str, Any], None, None]

input_path: Path

Path to the source file.

For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.

max_file_bytes: int = 5368709120

Maximum audio file size. Default: 5 GB.

segment_duration: float = 5.0

Classification window duration in seconds. Default: 5.0.

segment_overlap: float = 1.0

Classification window overlap in seconds. Default: 1.0.

source_provenance: dict[str, Any]

Provenance overrides propagated into every yielded CorpusDocument.

Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.

source_uri: str | None = None

Original URI for URL-based readers (web pages, YouTube videos).

Set this to the full URL string when input_path is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.

Examples

>>> reader = WebReader(
...     input_path=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )

classmethod subclass_by_type()

Return a copy of the extension → reader class registry.

Returns:

dict: Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.

Return type:

dict[str, type[DocumentReader]]

Examples

>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True

classmethod supported_types()

Return a sorted list of file extensions supported by registered readers.

Returns:

list of str: Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].

Return type:

list[str]

Examples

>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']

transcribe: bool = False#: Enable Whisper ASR fallback when no companion file is found.

txt_as_single_chunk: bool = False#: Yield entire .txt companion as one chunk if True.

validate_input()

Assert that the input file exists and is readable.

Raises:

ValueError: If input_path does not exist or is not a regular file.

Return type:

None

Notes

Called automatically by get_documents before iterating. Can also be called eagerly after construction to fail fast.

Examples

>>> reader = DocumentReader.create(Path("missing.txt"))
>>> reader.validate_input()
Traceback (most recent call last):
    ...
ValueError: Input file does not exist: missing.txt

whisper_model: str = 'base'

Whisper model size for transcription. Default: "base".