CorpusSource
- class scikitplot.corpus.CorpusSource(kind, root=None, urls=<factory>, pattern='**/*', recursive=True, extensions=None, source_provenance=<factory>, follow_symlinks=True)
Declarative descriptor for one or more document sources.
CorpusSource is a value object: it describes where to find documents and what provenance metadata to attach. Actual file-system access is deferred to iter_entries.
- Parameters:
- kind : SourceKind
What kind of source this is.
- root : pathlib.Path or None
Base path (directory root, single file, or manifest file). None when kind=URL without a manifest.
- urls : list[str]
Explicit list of URLs. Only relevant when kind=URL.
- pattern : str
Glob pattern used when kind=DIRECTORY. Default: "**/*".
- recursive : bool
When True, globs descend into sub-directories. Default: True.
- extensions : list[str] or None
Whitelist of file extensions (lowercase, with leading dot) to include when globbing. None means accept all. Default: None.
- source_provenance : dict
Metadata propagated into every yielded SourceEntry (e.g. {"source_title": "Hamlet", "source_author": "Shakespeare"}).
- follow_symlinks : bool
Whether to follow symbolic links during directory traversal. Default: True.
See also
scikitplot.corpus._pipeline.CorpusPipeline
Consumes CorpusSource.
Examples
Single file:
>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusSource
>>> src = CorpusSource.from_file(Path("article.txt"))
>>> list(src.iter_entries())
[SourceEntry(path_or_url='article.txt', kind=<SourceKind.FILE: 'file'>, ...)]
Directory glob:
>>> src = CorpusSource.from_directory(Path("corpus/"), pattern="*.txt")
>>> entries = list(src.iter_entries())
URL list:
>>> src = CorpusSource.from_urls(["https://a.com/p1", "https://b.com/p2"])
URL manifest file:
>>> src = CorpusSource.from_manifest(Path("urls.txt"))
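The value-object idea described above (configuration held up front, provenance attached to each entry only when iterating) can be sketched with the standard library alone. SketchSource and SketchEntry are hypothetical stand-ins invented for this illustration; they are not the scikitplot classes, and the real field layout of SourceEntry is not shown in this page.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass(frozen=True)
class SketchEntry:
    """Hypothetical stand-in for SourceEntry."""
    path_or_url: str
    provenance: dict


@dataclass(frozen=True)
class SketchSource:
    """Hypothetical stand-in for CorpusSource: holds configuration only."""
    urls: tuple = ()
    source_provenance: dict = field(default_factory=dict)

    def iter_entries(self) -> Iterator[SketchEntry]:
        for url in self.urls:
            # Copy the metadata so entries do not share one mutable dict.
            yield SketchEntry(url, dict(self.source_provenance))


src = SketchSource(
    urls=("https://a.com/p1", "https://b.com/p2"),
    source_provenance={"source_title": "Hamlet"},
)
entries = list(src.iter_entries())
```

Every entry in this sketch carries its own copy of the provenance dict; whether the library copies or shares the dict is an assumption here.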
- count()
Return the total number of entries this source will yield.
Warning
For DIRECTORY sources this iterates over all matching files. For large directory trees (100k+ files) this may be slow.
- Returns:
- int
Number of entries.
- Return type:
int
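To see why counting a DIRECTORY source costs a full traversal, here is a minimal standard-library sketch. count_matching is a hypothetical helper, not the library's implementation; it assumes counting is done by walking the glob once.

```python
import tempfile
from pathlib import Path


def count_matching(root: Path, pattern: str = "**/*") -> int:
    """Hypothetical helper: count matching files without building a list."""
    # Every match must be visited once, so the cost grows with the tree size.
    return sum(1 for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("b", encoding="utf-8")
    n = count_matching(root)
```

The generator expression keeps memory flat, but the running time is still linear in the number of files visited.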
- classmethod from_directory(directory, pattern='**/*', recursive=True, extensions=None, source_provenance=None, follow_symlinks=True)
Create a source that globs a directory.
- Parameters:
- directory : pathlib.Path or str
Root directory to glob.
- pattern : str, optional
Glob pattern relative to directory. Default: "**/*" (all files recursively).
- recursive : bool, optional
Whether ** in pattern should recurse into sub-directories. Default: True.
- extensions : list[str] or None, optional
Whitelist of lowercase file extensions with leading dot. None accepts all. Default: None.
- source_provenance : dict, optional
Provenance metadata for all entries. Default: {}.
- follow_symlinks : bool, optional
Follow symlinks during traversal. Default: True.
- Returns:
- CorpusSource
- Return type:
CorpusSource
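The interaction between pattern and extensions can be illustrated with pathlib. glob_files below is a hypothetical helper, not the library's code; it assumes the whitelist is compared case-insensitively against each file suffix, and it does not model recursive or follow_symlinks.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Sequence


def glob_files(
    directory,
    pattern: str = "**/*",
    extensions: Optional[Sequence[str]] = None,
) -> Iterator[Path]:
    """Hypothetical helper: lazily yield files matching pattern and extensions."""
    allowed = None if extensions is None else {ext.lower() for ext in extensions}
    for path in Path(directory).glob(pattern):
        if not path.is_file():
            continue  # skip directories matched by the glob
        if allowed is not None and path.suffix.lower() not in allowed:
            continue  # extension is not in the whitelist
        yield path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a", encoding="utf-8")
    (root / "b.md").write_text("b", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "c.TXT").write_text("c", encoding="utf-8")

    # Recursive default pattern, filtered by extension.
    all_txt = sorted(p.name for p in glob_files(root, extensions=[".txt"]))
    # Non-recursive pattern: only the top level is searched.
    top_txt = sorted(p.name for p in glob_files(root, pattern="*.txt"))
```

In this sketch the pattern decides which paths are visited and the whitelist then narrows the result, so the two filters compose rather than replace each other.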
- classmethod from_file(path, source_provenance=None)
Create a source for a single local file.
- classmethod from_manifest(manifest_path, source_provenance=None)
Create a source from a UTF-8 manifest file (one entry per line).
Lines starting with # and blank lines are ignored. Each remaining line is treated as either a URL or a filesystem path.
- Parameters:
- manifest_path : pathlib.Path or str
Path to the manifest text file.
- source_provenance : dict, optional
Provenance metadata for all entries.
- Returns:
- CorpusSource
- Raises:
- ValueError
If the manifest file does not exist.
- Return type:
CorpusSource
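The manifest rules above (UTF-8, one entry per line, # comments and blank lines skipped) can be sketched as follows. read_manifest is a hypothetical helper, not the library's parser; classifying a line as a URL by its http:// or https:// prefix, and stripping surrounding whitespace before the comment check, are assumptions of this sketch.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def read_manifest(manifest_path) -> Iterator[Tuple[str, str]]:
    """Hypothetical helper: yield (kind, value) pairs from a manifest file."""
    path = Path(manifest_path)
    if not path.exists():
        # Mirrors the documented ValueError for a missing manifest.
        raise ValueError(f"manifest not found: {path}")
    with path.open(encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # blank line or comment
            kind = "url" if line.startswith(("http://", "https://")) else "file"
            yield kind, line


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "urls.txt"
    manifest.write_text(
        "# reading list\n\nhttps://a.com/p1\ndocs/article.txt\n",
        encoding="utf-8",
    )
    parsed = list(read_manifest(manifest))

    try:
        list(read_manifest(Path(tmp) / "absent.txt"))
        missing_raised = False
    except ValueError:
        missing_raised = True
```

Because read_manifest is a generator, the existence check in this sketch runs on first iteration, not when the function is called.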
- classmethod from_urls(urls, source_provenance=None)
Create a source from an explicit list of URLs.
- Parameters:
- urls : list[str]
List of http:// or https:// URLs.
- source_provenance : dict, optional
Provenance metadata for all entries.
- Returns:
- CorpusSource
- Raises:
- ValueError
If urls is empty or any entry is not a valid URL.
- Return type:
CorpusSource
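A check along the lines of the documented ValueError conditions might look like the sketch below. validate_urls is a hypothetical helper; what the library counts as a "valid URL" is not specified here, so this version assumes an http or https scheme plus a non-empty host.

```python
from urllib.parse import urlparse


def validate_urls(urls):
    """Hypothetical helper: reject empty lists and non-http(s) entries."""
    if not urls:
        raise ValueError("urls must not be empty")
    for url in urls:
        parts = urlparse(url)
        # Assumed rule: http/https scheme and a host component are required.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a valid http(s) URL: {url!r}")
    return list(urls)


def is_rejected(urls) -> bool:
    """Return True when validate_urls raises ValueError for the input."""
    try:
        validate_urls(urls)
    except ValueError:
        return True
    return False


accepted = validate_urls(["https://a.com/p1", "http://b.com/p2"])
```

Validating eagerly at construction time, as sketched, surfaces a bad URL before any fetching starts.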
- iter_entries()
Yield resolved SourceEntry objects for this source.
The generator is lazy: filesystem access happens per entry, not upfront. This keeps memory proportional to working-set size, not corpus size.
- Yields:
- SourceEntry
One entry per file or URL.
- Raises:
- ValueError
If configuration is invalid (delegated to validate).
- FileNotFoundError
If a FILE source path does not exist at iteration time.
- Return type:
Generator[SourceEntry, None, None]
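The phrase "at iteration time" is easiest to see with a plain generator. iter_file_entries below is a hypothetical stand-in, not the library's method; it only shows that a lazy generator touches the filesystem when an entry is requested, so a missing file surfaces mid-iteration.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def iter_file_entries(paths: Iterable) -> Iterator[Path]:
    """Hypothetical helper: check each path only when it is requested."""
    for p in map(Path, paths):
        if not p.exists():
            raise FileNotFoundError(p)
        yield p


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.txt"
    present.write_text("x", encoding="utf-8")
    missing = Path(tmp) / "missing.txt"

    gen = iter_file_entries([present, missing])  # nothing is checked yet
    first = next(gen)  # first path is checked here
    try:
        next(gen)  # second path is checked here and found missing
        raised = False
    except FileNotFoundError:
        raised = True
```

Callers that need all errors up front can wrap the generator in list(), at the cost of the memory benefit described above.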
- kind: SourceKind