CorpusSource

class scikitplot.corpus.CorpusSource(kind, root=None, urls=<factory>, pattern='**/*', recursive=True, extensions=None, source_provenance=<factory>, follow_symlinks=True)

Declarative descriptor for one or more document sources.

CorpusSource is a value object: it describes where to find documents and what provenance metadata to attach. Actual file-system access is deferred to iter_entries.

Parameters:
kind : SourceKind

What kind of source this is.

root : pathlib.Path or None

Base path (directory root, single file, or manifest file). None when kind=URL without a manifest.

urls : list[str]

Explicit list of URLs. Only relevant when kind=URL.

pattern : str

Glob pattern used when kind=DIRECTORY. Default: "**/*".

recursive : bool

When True, globs descend into sub-directories. Default: True.

extensions : list[str] or None

Whitelist of file extensions (lowercase, with leading dot) to include when globbing. None means accept all. Default: None.

source_provenance : dict

Metadata propagated into every yielded SourceEntry (e.g. {"source_title": "Hamlet", "source_author": "Shakespeare"}).

follow_symlinks : bool

Whether to follow symbolic links during directory traversal. Default: True.

See also

scikitplot.corpus._pipeline.CorpusPipeline

Consumes CorpusSource.

Examples

Single file:

>>> from pathlib import Path
>>> from scikitplot.corpus import CorpusSource
>>> src = CorpusSource.from_file(Path("article.txt"))
>>> list(src.iter_entries())
[SourceEntry(path_or_url='article.txt', kind=<SourceKind.FILE: 'file'>, ...)]

Directory glob:

>>> src = CorpusSource.from_directory(Path("corpus/"), pattern="*.txt")
>>> entries = list(src.iter_entries())

URL list:

>>> src = CorpusSource.from_urls(["https://a.com/p1", "https://b.com/p2"])

URL manifest file:

>>> src = CorpusSource.from_manifest(Path("urls.txt"))

count()

Return the total number of entries this source will yield.

Warning

For DIRECTORY sources this iterates all matching files. For large directory trees (100k+ files) this may be slow.

Returns:
int

Number of entries.

Return type:

int
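
Because entries come from a generator, counting them means consuming that generator once, which is why large directory trees can be slow. The following is a minimal sketch of the idea only, not the library's implementation; `count_entries` and `fake_entries` are hypothetical names standing in for the real method and its entry stream.

```python
from typing import Iterable, Iterator


def count_entries(entries: Iterable[object]) -> int:
    """Count items by consuming the iterable once, without building a list."""
    return sum(1 for _ in entries)


def fake_entries(n: int) -> Iterator[str]:
    # Hypothetical stand-in for a source's lazy entry generator.
    for i in range(n):
        yield f"doc_{i}.txt"


total = count_entries(fake_entries(5))
print(total)
```

Counting this way keeps memory flat, but the cost in time still grows with the number of entries.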

extensions: list[str] | None = None

classmethod from_directory(directory, pattern='**/*', recursive=True, extensions=None, source_provenance=None, follow_symlinks=True)

Create a source that globs a directory.

Parameters:
directory : pathlib.Path or str

Root directory to glob.

pattern : str, optional

Glob pattern relative to directory. Default: "**/*" (all files recursively).

recursive : bool, optional

Whether ** in pattern should recurse into sub-directories. Default: True.

extensions : list[str] or None, optional

Whitelist of lowercase file extensions with leading dot. None accepts all. Default: None.

source_provenance : dict, optional

Provenance metadata for all entries. Default: {}.

follow_symlinks : bool, optional

Follow symlinks during traversal. Default: True.

Returns:
CorpusSource
Return type:

CorpusSource
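
To make the interaction between pattern and extensions concrete, here is a standard-library sketch of directory globbing with an extension whitelist. It is an illustration under stated assumptions, not the code behind from_directory: `glob_files` is a hypothetical helper, it compares suffixes case-insensitively (an assumption), and it does not model the recursive or follow_symlinks options.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional


def glob_files(
    directory: Path,
    pattern: str = "**/*",
    extensions: Optional[list[str]] = None,
) -> Iterator[Path]:
    """Yield regular files under `directory` that match `pattern`."""
    # None means "accept every extension".
    allowed = None if extensions is None else {ext.lower() for ext in extensions}
    for path in directory.glob(pattern):
        if not path.is_file():
            continue  # skip directories matched by the glob
        if allowed is not None and path.suffix.lower() not in allowed:
            continue  # suffix is not whitelisted
        yield path


# Build a small throwaway tree and filter it down to text files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "sub" / "b.TXT").write_text("beta", encoding="utf-8")
    (root / "sub" / "c.pdf").write_text("gamma", encoding="utf-8")
    found = sorted(
        p.relative_to(root).as_posix()
        for p in glob_files(root, extensions=[".txt"])
    )

print(found)
```

Glob order depends on the filesystem, so the example sorts the results before using them.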

classmethod from_file(path, source_provenance=None)

Create a source for a single local file.

Parameters:
path : pathlib.Path or str

Path to the file.

source_provenance : dict, optional

Provenance metadata merged into every yielded entry.

Returns:
CorpusSource
Return type:

CorpusSource

classmethod from_manifest(manifest_path, source_provenance=None)

Create a source from a UTF-8 manifest file (one entry per line).

Lines starting with # and blank lines are ignored. Each non-comment line is treated as either a URL or a filesystem path.

Parameters:
manifest_path : pathlib.Path or str

Path to the manifest text file.

source_provenance : dict, optional

Provenance metadata for all entries.

Returns:
CorpusSource
Raises:
ValueError

If the manifest file does not exist.

Return type:

CorpusSource
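
The manifest rules described above (skip comments and blank lines, treat the rest as a URL or a path) can be sketched with the standard library. This is a hypothetical reader, not the library's parser: `read_manifest` is an illustrative name, and classifying a line as a URL by its http:// or https:// prefix is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def read_manifest(manifest_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (kind, value) pairs from a UTF-8 manifest, one entry per line."""
    with manifest_path.open(encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Assumption: anything with an http(s) prefix is a URL,
            # everything else is a filesystem path.
            is_url = line.startswith(("http://", "https://"))
            yield ("url" if is_url else "file", line)


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "urls.txt"
    manifest.write_text(
        "# reading list\n\nhttps://a.com/p1\nnotes/chapter1.txt\n",
        encoding="utf-8",
    )
    parsed = list(read_manifest(manifest))

print(parsed)
```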

classmethod from_urls(urls, source_provenance=None)

Create a source from an explicit list of URLs.

Parameters:
urls : list[str]

List of http:// or https:// URLs.

source_provenance : dict, optional

Provenance metadata for all entries.

Returns:
CorpusSource
Raises:
ValueError

If urls is empty or any entry is not a valid URL.

Return type:

CorpusSource
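
The ValueError conditions for from_urls can be illustrated with a small validator. This is a sketch under assumptions rather than the library's check: `check_urls` is a hypothetical helper, and "valid" is taken here to mean an http or https scheme plus a non-empty host.

```python
from urllib.parse import urlparse


def check_urls(urls: list[str]) -> list[str]:
    """Return a copy of `urls`, or raise ValueError if the list is unusable."""
    if not urls:
        raise ValueError("urls must not be empty")
    for url in urls:
        parts = urlparse(url)
        # Assumption: only http(s) URLs with a host count as valid.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a valid http(s) URL: {url!r}")
    return list(urls)


valid = check_urls(["https://a.com/p1", "https://b.com/p2"])
print(valid)
```

Validating at construction time surfaces a bad list immediately instead of partway through iteration.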

iter_entries()

Yield resolved SourceEntry objects for this source.

The generator is lazy: filesystem access happens per entry, not upfront. This keeps memory use proportional to the working set rather than to the size of the corpus.

Yields:
SourceEntry

One entry per file or URL.

Raises:
ValueError

If configuration is invalid (delegated to validate).

FileNotFoundError

If a FILE source path does not exist at iteration time.

Return type:

Generator[SourceEntry, None, None]
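
The practical effect of a lazy generator is that work happens only for the entries a caller actually pulls. The sketch below demonstrates this with a hypothetical stand-in, `lazy_entries`, that records each access in a `touched` list; it does not use or reproduce iter_entries itself.

```python
from itertools import islice
from typing import Iterator

touched: list[str] = []


def lazy_entries(names: list[str]) -> Iterator[str]:
    """Yield names one at a time, recording each access as it happens."""
    for name in names:
        touched.append(name)  # stands in for per-entry filesystem access
        yield name


gen = lazy_entries(["a.txt", "b.txt", "c.txt"])
first_two = list(islice(gen, 2))  # pull only two entries

print(first_two)
print(touched)
```

Only the two requested names are recorded; the third is never touched because the generator is not advanced past it.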

kind: SourceKind
pattern: str = '**/*'
recursive: bool = True
root: Path | None = None
source_provenance: dict[str, Any]
urls: list[str]

validate()

Assert that this source is internally consistent.

Raises:
ValueError

On any configuration inconsistency.

Return type:

None