provenance_from_filename

scikitplot.corpus.provenance_from_filename(filename, source_type=None)

Extract provenance metadata from a source filename using heuristics.

Designed for corpora organised by the Project Gutenberg naming convention: Surname_Firstname_Title_Year.ext or Author-Title.ext. The function falls back gracefully: it always returns a dict, even when no pattern is detected.

Parameters:
filename : str

Source filename (basename or full path; only the basename is used).

source_type : str or None, optional

SourceType value string to include in the result. Default: None.

Returns:
dict[str, Any]

Dict with zero or more of: "source_author", "source_title", "source_date", "source_type". Suitable for passing as source_provenance to from_file.

Examples

>>> from scikitplot.corpus import provenance_from_filename
>>> provenance_from_filename("Shakespeare_William_Hamlet_1603.xml")
{'source_author': 'Shakespeare William', 'source_title': 'Hamlet',
 'source_date': '1603'}
>>> provenance_from_filename("dickens-great-expectations.txt")
{'source_author': 'Dickens', 'source_title': 'Great Expectations'}
>>> provenance_from_filename("document.pdf")
{}
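
The two naming patterns described above can be illustrated with a small standalone sketch. It is not the scikitplot implementation: the function name `sketch_provenance`, the regular expression, the title-casing of hyphenated names, and the choice to add "source_type" only when it is not None are all assumptions made for illustration. The sketch was written to reproduce the three documented examples and nothing more.

```python
import re
from pathlib import PurePath
from typing import Any, Optional


def sketch_provenance(filename: str, source_type: Optional[str] = None) -> dict[str, Any]:
    """Hypothetical stand-in for provenance_from_filename (illustration only)."""
    # Only the basename matters; drop directories and the extension.
    stem = PurePath(filename).stem
    result: dict[str, Any] = {}

    # Assumed pattern 1: Surname_Firstname_Title_Year
    match = re.fullmatch(r"([^_]+)_([^_]+)_(.+)_(\d{4})", stem)
    if match:
        surname, firstname, title, year = match.groups()
        result["source_author"] = f"{surname} {firstname}"
        result["source_title"] = title.replace("_", " ")
        result["source_date"] = year
    elif "-" in stem:
        # Assumed pattern 2: Author-Title (first hyphen separates the two)
        author, title = stem.split("-", 1)
        result["source_author"] = author.title()
        result["source_title"] = title.replace("-", " ").title()

    # Assumption: the key is added only when a value was supplied.
    if source_type is not None:
        result["source_type"] = source_type
    return result


print(sketch_provenance("Shakespeare_William_Hamlet_1603.xml"))
print(sketch_provenance("dickens-great-expectations.txt"))
print(sketch_provenance("document.pdf"))
```

Because every branch only adds keys to an initially empty dict, a filename that matches neither pattern yields an empty dict, which mirrors the "always returns a dict" behaviour described above. Real corpora will contain names that these two patterns misread (for example, hyphenated surnames), so the actual heuristics are likely more involved.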