AudioReader
- class scikitplot.corpus.AudioReader(input_file, chunker=None, filter_=None, filename_override=None, default_language=None, source_uri=None, source_provenance=<factory>, transcribe=False, whisper_model='base', classify=False, classifier=None, segment_duration=5.0, segment_overlap=1.0, extract_features=False, txt_as_single_chunk=False, max_file_bytes=5368709120)
Text extraction from audio files via companion transcript/lyrics parsing, Whisper ASR, and optional audio classification.
Three extraction paths are attempted in order:
1. Companion file (zero-dependency, instant). The reader looks for .lrc, .srt, .vtt, or .txt files with the same stem as the audio. If one is found, the companion is parsed and transcription is never invoked.
2. Whisper transcription (opt-in). Enable with transcribe=True. Requires faster-whisper or openai-whisper.
3. Audio classification (opt-in). Enable with classify=True and provide a classifier callable. Intended for non-speech audio (animal sounds, instruments, environmental sounds).
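As a rough illustration of the companion-file step, the sketch below looks for a sibling file that shares the audio file's stem. The helper name find_companion and the priority order among the four suffixes are assumptions made for this example; the reader's actual lookup order is not stated here.

```python
from pathlib import Path
from typing import Optional

# Assumed lookup order; the real priority among formats is not documented above.
COMPANION_SUFFIXES = (".lrc", ".srt", ".vtt", ".txt")


def find_companion(audio_path: Path) -> Optional[Path]:
    """Return the first sibling file with the same stem as the audio, or None."""
    for suffix in COMPANION_SUFFIXES:
        candidate = audio_path.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None
```

With song.mp3 and song.srt in the same directory, find_companion(Path("song.mp3")) would return the .srt path, and Whisper would not be needed.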
- Parameters:
- input_file : pathlib.Path
Path to the audio file.
- transcribe : bool, optional
When True, fall back to Whisper ASR if no companion file is found. When False (default), a missing companion file causes the reader to yield no chunks (unless classify=True).
- whisper_model : str, optional
Whisper model size. One of "tiny", "base", "small", "medium", "large", "large-v2", "large-v3". Default: "base".
- classify : bool, optional
When True, apply audio classification using the classifier callable. Can be combined with transcribe: transcription produces speech text, classification produces non-speech labels. Default: False.
- classifier : callable or None, optional
A callable for audio classification. Signature:
classifier(audio_path: Path, offset: float, duration: float) -> list[dict[str, Any]]
Must return dicts with "label" (str) and "confidence" (float). May include "text" (str). Required when classify=True.
- segment_duration : float, optional
Duration in seconds of each classification window when classify=True. Default: 5.0.
- segment_overlap : float, optional
Overlap in seconds between consecutive classification windows. Default: 1.0.
- extract_features : bool, optional
When True, extract audio features (MFCCs, chroma, spectral) for each segment and store them in metadata. Requires librosa. Default: False.
- txt_as_single_chunk : bool, optional
When a .txt companion is found, yield the entire file as one chunk (True) or one chunk per non-empty line (False). Default: False.
- max_file_bytes : int, optional
Maximum file size in bytes. Default: 5 GB.
- chunker : ChunkerBase or None, optional
Inherited from DocumentReader.
- filter_ : FilterBase or None, optional
Inherited from DocumentReader.
- filename_override : str or None, optional
Inherited from DocumentReader.
- default_language : str or None, optional
ISO 639-1 language code. Used as a language hint for Whisper. Default: None (auto-detect).
- Attributes:
- file_types : list of str
Class variable. Registered extensions: [".mp3", ".wav", ".flac", ".ogg", ".m4a", ".wma", ".aac", ".aiff", ".opus", ".wv"].
- Raises:
- ValueError
If whisper_model is not a valid Whisper model size.
- ValueError
If classify=True but classifier is None.
- ValueError
If segment_duration <= segment_overlap.
- ImportError
If transcribe=True and no Whisper backend is installed.
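The constraint segment_duration > segment_overlap follows from how overlapping windows advance: each window starts segment_duration - segment_overlap seconds after the previous one, so a non-positive step would never move forward. The sketch below is a hypothetical helper (classification_windows is not part of the library), and clipping the final window to the end of the audio is an assumption.

```python
def classification_windows(total_duration, segment_duration=5.0, segment_overlap=1.0):
    """Return (offset, duration) pairs covering total_duration seconds."""
    if segment_duration <= segment_overlap:
        raise ValueError("segment_duration must be greater than segment_overlap")
    step = segment_duration - segment_overlap  # distance between window starts
    windows = []
    offset = 0.0
    while offset < total_duration:
        # Assumption: the last window is clipped to the end of the audio.
        windows.append((offset, min(segment_duration, total_duration - offset)))
        offset += step
    return windows
```

For 12 seconds of audio with the defaults, the windows start at 0, 4, and 8 seconds, and each (offset, duration) pair is what a classifier callable would receive.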
- Parameters:
input_file (Path)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_uri (str | None)
transcribe (bool)
whisper_model (str)
classify (bool)
segment_duration (float)
segment_overlap (float)
extract_features (bool)
txt_as_single_chunk (bool)
max_file_bytes (int)
See also
scikitplot.corpus._readers.VideoReader: Video/subtitle reader.
scikitplot.corpus._readers.TextReader: Plain-text file reader.
Notes
Scenario 11 (Beethoven MP3 + Music Notes Book):
Build a corpus of Beethoven recordings and a book of music notes. Use Whisper ASR to transcribe audio segments (or provide companion .lrc files with lyrics). Each audio segment carries timecode_start and timecode_end for temporal alignment. Use SimilarityIndex with MatchMode.SEMANTIC to find which book passages match which audio segments, like Shazam for text-to-audio alignment.
With extract_features=True, chroma features capture harmonic content that correlates with musical notation in the book.
Scenario 12 (Animal Sounds + Children's Book, Bremen):
Build a corpus of animal sound recordings using classify=True with a classifier that labels sounds ("bird", "donkey", "cat", "dog", "rooster"). Each chunk carries metadata["audio_label"] and a text description. Use SimilarityIndex with MatchMode.KEYWORD to match labels against sentences in The Town Musicians of Bremen.
Chunk metadata keys (companion):
- "text": lyrics line or transcript text
- "section_type": SectionType.LYRICS (LRC) or SectionType.TEXT (SRT/VTT/TXT)
- "timecode_start": start time in seconds (float), if available
- "timecode_end": end time in seconds (float), if available
- "source_type": SourceType.AUDIO
- "companion_format": "lrc" / "srt" / "vtt" / "txt"
Chunk metadata keys (transcription):
- "text": Whisper-generated transcription
- "section_type": SectionType.TRANSCRIPT
- "timecode_start" / "timecode_end": segment timecodes
- "confidence": ASR confidence (when available)
- "source_type": SourceType.AUDIO
Chunk metadata keys (classification):
- "text": label text or description
- "section_type": SectionType.TEXT
- "timecode_start" / "timecode_end": window timecodes
- "confidence": classification confidence
- "audio_label": classification label string (in metadata)
- "source_type": SourceType.AUDIO
Examples
Companion LRC lyrics:
>>> from pathlib import Path
>>> from scikitplot.corpus import AudioReader
>>> reader = AudioReader(input_file=Path("beethoven_moonlight.mp3"))
>>> docs = list(reader.get_documents())
>>> for d in docs[:3]:
...     print(f"{d.timecode_start:.1f}s: {d.text[:50]}")
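To make the companion path more concrete, the sketch below parses a single LRC line of the common form [mm:ss.xx] text into a dict using the companion metadata keys from the Notes. The function name parse_lrc_line is hypothetical, the string "lyrics" stands in for SectionType.LYRICS, and lines with several timestamps are not handled; the reader's own parser may differ.

```python
import re

# One timestamp tag such as "[01:23.45]" followed by the lyric text.
_LRC_LINE = re.compile(r"^\[(\d+):(\d{2}(?:\.\d+)?)\]\s*(.*)$")


def parse_lrc_line(line):
    """Parse one LRC line into a chunk-like dict, or return None if it has no timestamp."""
    match = _LRC_LINE.match(line.strip())
    if match is None:
        return None  # e.g. ID tags like "[ar:Artist]" or blank lines
    minutes, seconds, text = match.groups()
    return {
        "text": text,
        "section_type": "lyrics",  # stand-in for SectionType.LYRICS
        "timecode_start": int(minutes) * 60 + float(seconds),
        "companion_format": "lrc",
    }
```

The timecode is converted to seconds as a float, which is the unit the metadata keys use.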
Whisper transcription:
>>> reader = AudioReader(
...     input_file=Path("lecture.mp3"),
...     transcribe=True,
...     whisper_model="small",
...     default_language="en",
... )
>>> docs = list(reader.get_documents())
Audio classification (animal sounds):
>>> def my_classifier(path, offset, duration):
...     # Your classification model here
...     return [{"label": "bird", "confidence": 0.95, "text": "bird chirping"}]
>>> reader = AudioReader(
...     input_file=Path("forest_sounds.wav"),
...     classify=True,
...     classifier=my_classifier,
...     segment_duration=3.0,
... )
>>> docs = list(reader.get_documents())
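The classifier contract can be pictured as a mapping from each returned dict to one chunk for the current window. The following is only a sketch under assumptions: classification_to_chunks is a hypothetical helper, the flat dict layout simplifies the real chunk/metadata split, "text" stands in for SectionType.TEXT, and falling back to the label when "text" is absent is a guess based on the "label text or description" wording.

```python
def classification_to_chunks(results, offset, duration):
    """Turn classifier output for one window into chunk-like dicts."""
    chunks = []
    for item in results:
        label = item["label"]
        chunks.append({
            "text": item.get("text", label),  # assumed fallback to the label
            "section_type": "text",           # stand-in for SectionType.TEXT
            "timecode_start": offset,
            "timecode_end": offset + duration,
            "confidence": float(item["confidence"]),
            "audio_label": label,
        })
    return chunks
```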
- chunker: ChunkerBase | None = None
Chunker to apply to each raw text block. None means each raw chunk is used as-is (one CorpusDocument per raw chunk).
- classifier: Callable[[...], list[dict[str, Any]]] | None = None
Audio classification callable. Signature:
classifier(audio_path: Path, offset: float, duration: float) -> list[dict[str, Any]]
Required when classify=True.
- classmethod create(*inputs, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)
Instantiate the appropriate reader for one or more sources.
Accepts any mix of file paths, URL strings, and pathlib.Path objects, in any order. URL strings (those starting with http:// or https://) are automatically detected and routed to from_url; everything else is treated as a local file path and dispatched by extension via the registry.
- Parameters:
- *inputs : pathlib.Path or str
One or more source paths or URL strings. Pass a single value for the common case; pass multiple values to get a _MultiSourceReader that chains all their documents.
- chunker : ChunkerBase or None, optional
Chunker injected into every reader. Default: None.
- filter_ : FilterBase or None, optional
Filter injected into every reader. Default: None (DefaultFilter).
- filename_override : str or None, optional
Override the source_file label. Only applied when inputs contains exactly one source. Default: None.
- default_language : str or None, optional
ISO 639-1 language code applied to all sources. Default: None.
- source_type : SourceType, list[SourceType or None], or None, optional
Semantic label for the source kind. When inputs has more than one element you may pass a list of the same length to assign a distinct type per source; None entries in the list mean "infer from extension / URL". A single value is broadcast to all sources. Default: None.
- source_title : str or None, optional
Title propagated into every yielded document. Default: None.
- source_author : str or None, optional
Author propagated into every yielded document. Default: None.
- source_date : str or None, optional
ISO 8601 publication date. Default: None.
- collection_id : str or None, optional
Corpus collection identifier. Default: None.
- doi : str or None, optional
Digital Object Identifier (file sources only). Default: None.
- isbn : str or None, optional
ISBN (file sources only). Default: None.
- **kwargs : Any
Extra keyword arguments forwarded verbatim to each concrete reader constructor (e.g. transcribe=True for AudioReader, backend="easyocr" for ImageReader).
- Returns:
- DocumentReader
A single reader when inputs has exactly one element (backward compatible with every existing call site). A _MultiSourceReader when inputs has more than one element; it implements the same get_documents() interface and chains documents from all sub-readers in order.
- Raises:
- ValueError
If inputs is empty, if a source URL is invalid, or if no reader is registered for a file's extension.
- TypeError
If any element of inputs is not a str or pathlib.Path.
- Parameters:
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | list[SourceType | None] | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)
- Return type:
DocumentReader
Notes
URL auto-detection: A str element is treated as a URL when it matches ^https?:// (case-insensitive). All other strings and all pathlib.Path objects are treated as local file paths. This means you no longer need to call from_url explicitly; just pass the URL string to create.
Per-source source_type: When passing multiple inputs with different media types, supply a list:
DocumentReader.create(
    Path("podcast.mp3"),
    "report.pdf",
    "https://iris.who.int/.../content",  # returns image/jpeg
    source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
)
Reader-specific kwargs (forwarded via **kwargs):
- transcribe=True, whisper_model="small" → AudioReader, VideoReader
- backend="easyocr" → ImageReader
- prefer_backend="pypdf" → PDFReader
- classify=True, classifier=fn → AudioReader
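The URL auto-detection rule in these notes can be expressed in a few lines. The sketch below is an illustration of the stated rule only; the helper name is_url is hypothetical and is not claimed to be the library's internal function.

```python
import re
from pathlib import Path

# Case-insensitive match on the scheme, as described in the notes.
_URL_RE = re.compile(r"^https?://", re.IGNORECASE)


def is_url(source):
    """True only for str inputs that start with http:// or https://."""
    return isinstance(source, str) and _URL_RE.match(source) is not None
```

Under this rule "report.pdf" and every pathlib.Path are treated as local file paths, while "HTTPS://example.com" is still recognised as a URL.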
Examples
Single file (backward-compatible):
>>> reader = DocumentReader.create(Path("hamlet.txt"))
>>> docs = list(reader.get_documents())
URL string auto-detected (no from_url() call required):
>>> reader = DocumentReader.create(
...     "https://en.wikipedia.org/wiki/Python_(programming_language)"
... )
Mixed multi-source batch:
>>> reader = DocumentReader.create(
...     Path("podcast.mp3"),
...     "report.pdf",
...     "https://iris.who.int/api/bitstreams/abc/content",
...     source_type=[SourceType.PODCAST, SourceType.RESEARCH, SourceType.IMAGE],
... )
>>> docs = list(reader.get_documents())  # chained stream from all three
- default_language: str | None = None
ISO 639-1 language code to assign when the source has no language info.
- property file_name: str
Effective filename used in document labels.
Returns filename_override when set; otherwise returns input_file.name.
- Returns:
- str
File name string (not a full path).
Examples
>>> from pathlib import Path
>>> reader = TextReader(input_file=Path("/data/corpus.txt"))
>>> reader.file_name
'corpus.txt'
- file_type: ClassVar[str | None] = None
Single file extension this reader handles (lowercase, including leading dot). E.g. ".txt", ".xml", ".zip".
For readers that handle multiple extensions, define file_types (plural) instead. Exactly one of file_type or file_types must be defined on every concrete subclass.
- file_types: ClassVar[list[str] | None] = ['.mp3', '.wav', '.flac', '.ogg', '.m4a', '.wma', '.aac', '.aiff', '.opus', '.wv']
List of file extensions this reader handles (lowercase, leading dot). Use instead of file_type when a single reader class should be registered for several extensions, e.g. an image reader for [".png", ".jpg", ".jpeg", ".gif", ".webp"].
When both file_type and file_types are defined on the same class, file_types takes precedence and file_type is ignored.
- filter_: FilterBase | None = None
Filter applied after chunking. None triggers the DefaultFilter.
- classmethod from_manifest(manifest_path, *, chunker=None, filter_=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, encoding='utf-8', **kwargs)
Build a _MultiSourceReader from a manifest file.
The manifest is a text file with one source per line, either a file path or a URL. Blank lines and lines starting with # are ignored. JSON manifests (a list of strings or objects) are also supported.
- Parameters:
- manifest_path : pathlib.Path or str
Path to the manifest file. Supported formats:
  - .txt / .manifest: one source per line.
  - .json: a JSON array of strings (sources) or objects with at least a "source" key (and optional "source_type", "source_title" per-entry overrides).
- chunker : ChunkerBase or None, optional
Chunker applied to all sources. Default: None.
- filter_ : FilterBase or None, optional
Filter applied to all sources. Default: None.
- default_language : str or None, optional
ISO 639-1 language code. Default: None.
- source_type : SourceType or None, optional
Override source type for all sources. Default: None.
- source_title : str or None, optional
Override title for all sources. Default: None.
- source_author : str or None, optional
Override author for all sources. Default: None.
- source_date : str or None, optional
Override date for all sources. Default: None.
- collection_id : str or None, optional
Collection identifier. Default: None.
- doi : str or None, optional
DOI override. Default: None.
- isbn : str or None, optional
ISBN override. Default: None.
- encoding : str, optional
Text encoding for .txt manifests. Default: "utf-8".
- **kwargs : Any
Forwarded to each reader constructor.
- Returns:
- _MultiSourceReader
Multi-source reader chaining all manifest entries.
- Raises:
- ValueError
If manifest_path does not exist or is empty after filtering blank and comment lines.
- ValueError
If the manifest format is not recognised.
- Return type:
_MultiSourceReader
Notes
Per-entry overrides in JSON manifests: each entry may be an object with:
{
    "source": "https://example.com/report.pdf",
    "source_type": "research",
    "source_title": "Annual Report 2024"
}
String-level source_type values are coerced via SourceType(value), and an invalid value raises ValueError.
Examples
Text manifest sources.txt:
# WHO corpus
https://www.who.int/europe/news/item/...
https://youtu.be/rwPISgZcYIk
WHO-EURO-2025.pdf
scan.jpg
Usage:
reader = DocumentReader.from_manifest(
    Path("sources.txt"),
    collection_id="who-corpus",
)
docs = list(reader.get_documents())
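The manifest rules described for from_manifest (skip blank lines and # comments, accept a JSON array of strings or objects with a "source" key, reject an empty result) can be sketched as below. read_manifest_sources is a hypothetical helper that only extracts the source strings; per-entry overrides and the actual reader construction are left out, and treating every non-.json suffix as a line-based manifest is an assumption.

```python
import json
from pathlib import Path


def read_manifest_sources(manifest_path, encoding="utf-8"):
    """Return the list of source strings named in a text or JSON manifest."""
    path = Path(manifest_path)
    text = path.read_text(encoding=encoding)
    if path.suffix.lower() == ".json":
        # Entries are either plain strings or objects with a "source" key.
        entries = json.loads(text)
        sources = [e if isinstance(e, str) else e["source"] for e in entries]
    else:
        # Assumption: any other suffix is read as one source per line.
        sources = []
        for line in text.splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                sources.append(stripped)
    if not sources:
        raise ValueError(f"Manifest is empty: {path}")
    return sources
```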
- classmethod from_url(url, *, chunker=None, filter_=None, filename_override=None, default_language=None, source_type=None, source_title=None, source_author=None, source_date=None, collection_id=None, doi=None, isbn=None, **kwargs)
Instantiate the appropriate reader for a URL source.
Dispatches to YouTubeReader for YouTube URLs and to WebReader for all other http:// / https:// URLs.
- Parameters:
- url : str
Full URL string. Must start with http:// or https://.
- chunker : ChunkerBase or None, optional
Chunker to inject. Default: None.
- filter_ : FilterBase or None, optional
Filter to inject. Default: None (DefaultFilter).
- filename_override : str or None, optional
Override for the source_file label. Default: None.
- default_language : str or None, optional
ISO 639-1 language code. Default: None.
- source_type : SourceType or None, optional
Semantic label for the source. Default: None.
- source_title : str or None, optional
Title of the source work. Default: None.
- source_author : str or None, optional
Primary author. Default: None.
- source_date : str or None, optional
Publication date in ISO 8601 format. Default: None.
- collection_id : str or None, optional
Corpus collection identifier. Default: None.
- doi : str or None, optional
Digital Object Identifier. Default: None.
- isbn : str or None, optional
International Standard Book Number. Default: None.
- **kwargs : Any
Additional kwargs forwarded to the reader constructor (e.g. include_auto_generated=False for YouTubeReader).
- Returns:
- DocumentReader
YouTubeReader or WebReader instance.
- Raises:
- ValueError
If url does not start with http:// or https://.
- ImportError
If the required reader class is not registered (i.e. scikitplot.corpus._readers has not been imported yet).
- Parameters:
url (str)
chunker (ChunkerBase | None)
filter_ (FilterBase | None)
filename_override (str | None)
default_language (str | None)
source_type (SourceType | None)
source_title (str | None)
source_author (str | None)
source_date (str | None)
collection_id (str | None)
doi (str | None)
isbn (str | None)
kwargs (Any)
- Return type:
DocumentReader
Notes
Prefer create for new code. Passing a URL string to create automatically calls from_url, so you rarely need to call from_url directly.
Examples
>>> reader = DocumentReader.from_url("https://en.wikipedia.org/wiki/Python")
>>> docs = list(reader.get_documents())
>>> yt = DocumentReader.from_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
>>> docs = list(yt.get_documents())
- get_documents()
Yield validated CorpusDocument instances for the input file.
Orchestrates the full per-file pipeline:
1. validate_input: fail fast if the file is missing.
2. get_raw_chunks: format-specific text extraction.
3. Chunker (if set): sub-segments each raw block.
4. CorpusDocument construction with validated schema.
5. Filter: discards noise documents.
- Yields:
- CorpusDocument
Validated documents that passed the filter.
- Raises:
- ValueError
If the input file is missing or the format is invalid.
- Return type:
Generator[CorpusDocument, None, None]
Notes
The global chunk_index counter is monotonically increasing across all raw chunks and sub-chunks for a single file, ensuring that (source_file, chunk_index) is a unique key within one reader run.
Omitted-document statistics are logged at INFO level after processing each file.
Examples
>>> from pathlib import Path
>>> reader = DocumentReader.create(Path("corpus.txt"))
>>> docs = list(reader.get_documents())
>>> all(isinstance(d, CorpusDocument) for d in docs)
True
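The pipeline and the chunk_index note for get_documents can be pictured with a plain generator. This is a simplified sketch, not the library's implementation: run_pipeline is a hypothetical name, dicts stand in for CorpusDocument, the keep callable stands in for a FilterBase, and advancing the counter before filtering is an assumption (uniqueness holds either way).

```python
def run_pipeline(raw_chunks, chunker=None, keep=None, source_file="demo.txt"):
    """Yield document-like dicts with a per-file, monotonically increasing chunk_index."""
    if keep is None:
        keep = lambda text: bool(text.strip())  # stand-in for a default filter
    chunk_index = 0
    for raw in raw_chunks:
        # Without a chunker, each raw chunk is used as-is.
        pieces = chunker(raw["text"]) if chunker else [raw["text"]]
        for piece in pieces:
            doc = {**raw, "text": piece, "source_file": source_file, "chunk_index": chunk_index}
            chunk_index += 1  # assumed to advance even when the document is filtered out
            if keep(doc["text"]):
                yield doc
```

Because the counter is shared across raw chunks and their sub-chunks, the pair (source_file, chunk_index) never repeats within one run.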
- get_raw_chunks()
Extract text from the audio via companion, transcription, or classification.
Attempts companion detection first. Falls back to Whisper only when transcribe=True and no companion was found. Classification via classify=True runs independently (can combine with transcription).
- Yields:
- dict
Keys always include "text" and "section_type". May include "timecode_start", "timecode_end", "confidence", "source_type", and format-specific keys.
- Raises:
- ValueError
If the file exceeds max_file_bytes.
- ImportError
If transcribe=True and Whisper is not installed.
- Return type:
- input_file: Path
Path to the source file.
For URL-based readers (WebReader, YouTubeReader), pass pathlib.Path(url_string) here and set source_uri to the original URL string. validate_input() is overridden in those subclasses to skip the file-existence check.
- source_provenance: dict[str, Any]
Provenance overrides propagated into every yielded CorpusDocument.
Keys may include "source_type", "source_title", "source_author", and "collection_id". Populated by create / from_url from their keyword arguments.
- source_uri: str | None = None
Original URI for URL-based readers (web pages, YouTube videos).
Set this to the full URL string when input_file is a synthetic pathlib.Path wrapping a URL. File-based readers leave this None.
Examples
>>> reader = WebReader(
...     input_file=Path("https://example.com/article"),
...     source_uri="https://example.com/article",
... )
- classmethod subclass_by_type()
Return a copy of the extension → reader class registry.
- Returns:
- dict
Mapping of file extension (str) → reader class. Returns a shallow copy so callers cannot accidentally mutate the registry.
- Return type:
dict
Examples
>>> registry = DocumentReader.subclass_by_type()
>>> ".txt" in registry
True
- classmethod supported_types()
Return a sorted list of file extensions supported by registered readers.
- Returns:
- list of str
Lowercase file extensions, each including the leading dot. E.g. ['.pdf', '.txt', '.xml', '.zip'].
- Return type:
list[str]
Examples
>>> DocumentReader.supported_types()
['.pdf', '.txt', '.xml', '.zip']
- validate_input()
Assert that the input file exists and is readable.
- Raises:
- ValueError
If input_file does not exist or is not a regular file.
- Return type:
None
Notes
Called automatically by get_documents before iterating. Can also be called eagerly after construction to fail fast.
Examples
>>> reader = DocumentReader.create(Path("missing.txt"))
>>> reader.validate_input()
Traceback (most recent call last):
...
ValueError: Input file does not exist: missing.txt
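A check equivalent to the documented behaviour can be written with pathlib alone. This is a minimal sketch as a free function (the library exposes it as a method); the first error message mirrors the traceback in the example, while the wording of the second message is an assumption.

```python
from pathlib import Path


def validate_input(input_file):
    """Raise ValueError unless input_file exists and is a regular file."""
    path = Path(input_file)
    if not path.exists():
        raise ValueError(f"Input file does not exist: {path}")
    if not path.is_file():
        # Assumed wording; directories and other non-regular paths land here.
        raise ValueError(f"Input path is not a regular file: {path}")
```

Calling it right after constructing a reader surfaces a bad path before any extraction work starts.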