CollectionManifest
- class scikitplot.corpus.CollectionManifest(collection_id, title=None, author=None, source_date=None, language=None, description='', source_type=None, file_provenance=<factory>, tags=<factory>, expected_file_count=None)
Descriptor for a named corpus collection.
A CollectionManifest holds corpus-level provenance metadata (author, title, date, language) that is propagated into every CorpusDocument produced from this collection. It optionally carries per-file provenance overrides so that individual files within a multi-file corpus can have their own metadata.
- Parameters:
- collection_id : str
Unique identifier for this collection. Must be non-empty. Used as CorpusDocument.collection_id in all produced documents.
- title : str or None, optional
Human-readable title of the collection. Default: None.
- author : str or None, optional
Primary author or editor. Default: None.
- source_date : str or None, optional
Publication or creation date in ISO 8601 format. Default: None.
- language : str or None, optional
Default ISO 639-1 language code for all files. Default: None.
- description : str, optional
Free-text description of the corpus. Default: "".
- source_type : str or None, optional
Default SourceType value string for all files. Default: None.
- file_provenance : dict[str, dict], optional
Per-file provenance overrides. Keys are filenames (basename only); values are dicts with any CorpusDocument provenance field names. Override values take precedence over the collection-level defaults. Default: {}.
- tags : list[str], optional
Arbitrary tags for search / filtering. Default: [].
- expected_file_count : int or None, optional
Expected number of source files. Used for completeness validation. Default: None (no check).
- Raises:
- ValueError
If collection_id is empty or whitespace-only, or if expected_file_count is negative.
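The two error conditions above can be illustrated with a small stand-in class. This is a minimal sketch, not the scikitplot implementation: ManifestSketch, is_valid, and the error messages are hypothetical names chosen for illustration, and only the two documented checks are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManifestSketch:
    """Hypothetical stand-in; NOT the scikitplot CollectionManifest class."""

    collection_id: str
    expected_file_count: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject empty or whitespace-only identifiers.
        if not self.collection_id or not self.collection_id.strip():
            raise ValueError("collection_id must be a non-empty string")
        # Reject negative counts; None means no completeness check.
        if self.expected_file_count is not None and self.expected_file_count < 0:
            raise ValueError("expected_file_count must be >= 0")


def is_valid(**kwargs) -> bool:
    """Return True if ManifestSketch accepts the given arguments."""
    try:
        ManifestSketch(**kwargs)
    except ValueError:
        return False
    return True
```

Under these assumptions, is_valid(collection_id="   ") and is_valid(collection_id="c1", expected_file_count=-1) both return False, while a count of 0 is accepted.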
Examples
>>> manifest = CollectionManifest(
...     collection_id="gutenberg_shakespeare",
...     title="The Complete Works of Shakespeare",
...     author="Shakespeare, William",
...     source_date="1600",
...     language="en",
...     source_type="play",
... )
>>> manifest.to_provenance()
{'collection_id': 'gutenberg_shakespeare', 'source_title': '...', ...}
- check_completeness(actual_file_count)
Return True if the actual file count matches expected_file_count.
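The comparison described here might look like the sketch below. The function name check_completeness_sketch is hypothetical, and returning True when no expected count is set is an assumption based on the "no check" default documented for expected_file_count; the library's actual behavior in that case is not stated here.

```python
from typing import Optional


def check_completeness_sketch(expected: Optional[int], actual: int) -> bool:
    """Hypothetical helper mirroring the documented comparison."""
    if expected is None:
        # Assumption: with no expected count there is nothing to verify.
        return True
    # Complete only when the counts match exactly.
    return actual == expected
```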
- provenance_for_file(filename)
Return merged provenance for a specific file.
Starts with collection-level defaults, then applies per-file overrides from file_provenance. Basename matching only.
- Parameters:
- filename : str
Source filename (basename). Matched against file_provenance keys case-sensitively.
- Returns:
- dict[str, Any]
Merged provenance dict.
Examples
>>> manifest = CollectionManifest(
...     collection_id="c1",
...     author="Default Author",
...     file_provenance={"hamlet.xml": {"source_title": "Hamlet"}},
... )
>>> manifest.provenance_for_file("hamlet.xml")
{'collection_id': 'c1', 'source_author': 'Default Author', 'source_title': 'Hamlet'}
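The merge order described for provenance_for_file (collection defaults first, per-file overrides second) can be sketched with plain dicts. merge_provenance is a hypothetical helper, not part of scikitplot. Two assumptions are baked in: defaults that are None are left out of the result, as the example output suggests, and a path passed in is reduced to its basename before lookup.

```python
import os
from typing import Any, Dict, Optional


def merge_provenance(
    defaults: Dict[str, Optional[Any]],
    file_provenance: Dict[str, Dict[str, Any]],
    filename: str,
) -> Dict[str, Any]:
    """Hypothetical sketch of the defaults-then-overrides merge."""
    # Assumption: unset (None) collection-level defaults are omitted.
    merged = {k: v for k, v in defaults.items() if v is not None}
    # Assumption: any directory part is dropped before matching.
    key = os.path.basename(filename)
    # Exact, case-sensitive lookup; override values win over defaults.
    merged.update(file_provenance.get(key, {}))
    return merged


defaults = {
    "collection_id": "c1",
    "source_author": "Default Author",
    "source_title": None,
}
overrides = {"hamlet.xml": {"source_title": "Hamlet"}}
hamlet = merge_provenance(defaults, overrides, "hamlet.xml")
```

Because the lookup is case-sensitive, "Hamlet.xml" would not match the "hamlet.xml" key in this sketch and would receive only the collection-level defaults.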