CollectionManifest

class scikitplot.corpus.CollectionManifest(collection_id, title=None, author=None, source_date=None, language=None, description='', source_type=None, file_provenance=<factory>, tags=<factory>, expected_file_count=None)

Descriptor for a named corpus collection.

A CollectionManifest holds corpus-level provenance metadata (author, title, date, language) that is propagated into every CorpusDocument produced from this collection. It optionally carries per-file provenance overrides so that individual files within a multi-file corpus can have their own metadata.

Parameters:
collection_id : str

Unique identifier for this collection. Must be non-empty. Used as CorpusDocument.collection_id in all produced documents.

title : str or None, optional

Human-readable title of the collection. Default: None.

author : str or None, optional

Primary author or editor. Default: None.

source_date : str or None, optional

Publication or creation date in ISO 8601 format. Default: None.

language : str or None, optional

Default ISO 639-1 language code for all files. Default: None.

description : str, optional

Free-text description of the corpus. Default: "".

source_type : str or None, optional

Default SourceType value string for all files. Default: None.

file_provenance : dict[str, dict], optional

Per-file provenance overrides. Keys are filenames (basename only); values are dicts keyed by CorpusDocument provenance field names. For a given file, override values take precedence over the collection-level defaults. Default: {}.

tags : list[str], optional

Arbitrary tags for search / filtering. Default: [].

expected_file_count : int or None, optional

Expected number of source files. Used for completeness validation. Default: None (no check).

Raises:
ValueError

If collection_id is empty or whitespace-only, or if expected_file_count is negative.

Parameters:
  • collection_id (str)

  • title (str | None)

  • author (str | None)

  • source_date (str | None)

  • language (str | None)

  • description (str)

  • source_type (str | None)

  • file_provenance (dict[str, dict[str, Any]])

  • tags (list[str])

  • expected_file_count (int | None)

Examples

>>> manifest = CollectionManifest(
...     collection_id="gutenberg_shakespeare",
...     title="The Complete Works of Shakespeare",
...     author="Shakespeare, William",
...     source_date="1600",
...     language="en",
...     source_type="play",
... )
>>> manifest.to_provenance()
{'collection_id': 'gutenberg_shakespeare', 'source_title': '...', ...}
author: str | None = None
check_completeness(actual_file_count)

Return True if the actual file count matches expected_file_count.

Parameters:
actual_file_count : int

Number of files actually found in the collection directory.

Returns:
bool

True if actual_file_count equals expected_file_count. Always True when expected_file_count is None.

Parameters:

actual_file_count (int)

Return type:

bool
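The rule described above can be sketched as a stand-alone function. This is a minimal illustration, not the library's implementation: `check_completeness_sketch` is a hypothetical name, and it takes the expected count as an explicit argument instead of reading it from a manifest.

```python
from typing import Optional


def check_completeness_sketch(
    expected_file_count: Optional[int], actual_file_count: int
) -> bool:
    """Hypothetical stand-alone version of the documented rule."""
    if expected_file_count is None:
        # No expectation was declared, so there is nothing to check.
        return True
    return actual_file_count == expected_file_count


print(check_completeness_sketch(None, 7))  # no check configured
print(check_completeness_sketch(37, 37))   # counts match
print(check_completeness_sketch(37, 36))   # one file is missing
```

The sketch assumes an exact match is required, as the description states, so a surplus of files would fail the check just as a shortfall does.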

collection_id: str
description: str = ''
expected_file_count: int | None = None
file_provenance: dict[str, dict[str, Any]]
language: str | None = None
provenance_for_file(filename)

Return merged provenance for a specific file.

Starts with the collection-level defaults, then applies any per-file overrides from file_provenance. Keys are matched by basename only.

Parameters:
filename : str

Source filename (basename). Matched against file_provenance keys case-sensitively.

Returns:
dict[str, Any]

Merged provenance dict.

Parameters:

filename (str)

Return type:

dict[str, Any]

Examples

>>> manifest = CollectionManifest(
...     collection_id="c1",
...     author="Default Author",
...     file_provenance={"hamlet.xml": {"source_title": "Hamlet"}},
... )
>>> manifest.provenance_for_file("hamlet.xml")
{'collection_id': 'c1', 'source_author': 'Default Author',
 'source_title': 'Hamlet'}
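The merge order described above (defaults first, then per-file overrides) can be sketched with plain dicts. This is an illustration under assumptions, not the library's code: `merge_provenance_sketch` is a hypothetical helper, and stripping directory components with `os.path.basename` is one reading of "basename matching only".

```python
import os
from typing import Any, Dict


def merge_provenance_sketch(
    defaults: Dict[str, Any],
    file_provenance: Dict[str, Dict[str, Any]],
    filename: str,
) -> Dict[str, Any]:
    """Hypothetical helper: collection defaults overlaid with per-file overrides."""
    merged = dict(defaults)  # copy so the collection-level defaults stay untouched
    key = os.path.basename(filename)  # assumption: only the basename is compared
    merged.update(file_provenance.get(key, {}))  # overrides win on key collisions
    return merged


defaults = {"collection_id": "c1", "source_author": "Default Author"}
overrides = {
    "hamlet.xml": {"source_title": "Hamlet", "source_author": "Shakespeare, William"}
}

hamlet = merge_provenance_sketch(defaults, overrides, "hamlet.xml")
other = merge_provenance_sketch(defaults, overrides, "Hamlet.xml")

print(hamlet)  # override replaces source_author and adds source_title
print(other)   # case differs, so no override applies and only defaults remain
```

Because the lookup is case-sensitive, a key such as "Hamlet.xml" does not match "hamlet.xml", and the file falls back to the collection-level defaults.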
source_date: str | None = None
source_type: str | None = None
tags: list[str]
title: str | None = None
to_provenance()

Return a provenance dict suitable for create.

Only non-None values are included.

Returns:
dict[str, Any]

Keys are CorpusDocument provenance field names.

Return type:

dict[str, Any]
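The "only non-None values" rule can be sketched as a dict comprehension. This is a hypothetical stand-alone function, not the library's method; it covers only the three output keys that appear in the examples on this page (`collection_id`, `source_title`, `source_author`), and the key names for the remaining fields are not assumed here.

```python
from typing import Any, Dict, Optional


def to_provenance_sketch(
    collection_id: str,
    title: Optional[str] = None,
    author: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: map manifest fields to provenance keys, dropping None."""
    candidates = {
        "collection_id": collection_id,
        "source_title": title,
        "source_author": author,
    }
    # Keep a key only when a value was actually supplied.
    return {key: value for key, value in candidates.items() if value is not None}


partial = to_provenance_sketch("c1", author="Default Author")
print(partial)  # source_title is absent because title was left as None
```

Omitting unset keys, instead of emitting them as None, means a later merge with per-file overrides never overwrites a real value with an empty one.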

validate()

Assert that all invariants hold.

Raises:
ValueError

If any invariant is violated.

Return type:

None
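As a rough sketch of what such a check might look like, the following assumes the invariants are the two conditions listed under the class-level Raises section: a non-blank collection_id and a non-negative expected_file_count. `ManifestSketch` is a hypothetical stand-in, not the scikitplot class, and the real method may check more.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManifestSketch:
    """Hypothetical stand-in carrying only the two validated fields."""

    collection_id: str
    expected_file_count: Optional[int] = None

    def validate(self) -> None:
        # Reject empty and whitespace-only identifiers.
        if not self.collection_id.strip():
            raise ValueError("collection_id must be non-empty")
        # None means "no check"; any declared count must be zero or greater.
        if self.expected_file_count is not None and self.expected_file_count < 0:
            raise ValueError("expected_file_count must be non-negative")


ManifestSketch("c1", expected_file_count=0).validate()  # passes silently

for bad in (ManifestSketch("   "), ManifestSketch("c1", expected_file_count=-1)):
    try:
        bad.validate()
    except ValueError as exc:
        print(f"rejected: {exc}")
```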