provenance_from_filename
- scikitplot.corpus.provenance_from_filename(filename, source_type=None)
Extract provenance metadata from a source filename using heuristics.
Designed for corpora organised by the Project Gutenberg naming convention:
Surname_Firstname_Title_Year.ext or Author-Title.ext. Falls back gracefully: always returns a dict, even if no patterns are detected.
- Parameters:
- filename : str
Source filename (basename or full path; only the basename is used).
- source_type : str or None, optional
SourceType value string to include in the result. Default: None.
- Returns:
- dict[str, Any]
Dict with zero or more of:
"source_author", "source_title", "source_date", "source_type". Suitable for passing as source_provenance to from_file.
Examples
>>> provenance_from_filename("Shakespeare_William_Hamlet_1603.xml")
{'source_author': 'Shakespeare William', 'source_title': 'Hamlet', 'source_date': '1603'}
>>> provenance_from_filename("dickens-great-expectations.txt")
{'source_author': 'Dickens', 'source_title': 'Great Expectations'}
>>> provenance_from_filename("document.pdf")
{}
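To make the two filename patterns concrete, the following is a minimal sketch of how such a heuristic could be written. It is not the library's implementation: the function name sketch_provenance, the regular expression, and the title-casing of hyphenated names are all assumptions chosen only so that the three documented examples above produce the same dicts.

```python
import re
from pathlib import PurePath
from typing import Any, Optional


def sketch_provenance(filename: str, source_type: Optional[str] = None) -> dict[str, Any]:
    """Hypothetical stand-in for provenance_from_filename (illustration only)."""
    # Only the basename matters; .stem also drops the final extension.
    stem = PurePath(filename).stem
    result: dict[str, Any] = {}

    # Assumed pattern 1: Surname_Firstname_Title_Year (four-digit year last).
    match = re.fullmatch(r"([^_]+)_([^_]+)_(.+)_(\d{4})", stem)
    if match:
        surname, firstname, title, year = match.groups()
        result["source_author"] = f"{surname} {firstname}"
        result["source_title"] = title.replace("_", " ")
        result["source_date"] = year
    elif "-" in stem:
        # Assumed pattern 2: Author-Title, split on the first hyphen only.
        author, title = stem.split("-", 1)
        result["source_author"] = author.title()
        result["source_title"] = title.replace("-", " ").title()

    # Assumption: source_type is included only when the caller supplies it.
    if source_type is not None:
        result["source_type"] = source_type
    return result
```

Because neither branch fires for a name such as "document.pdf", the sketch returns an empty dict in that case, which mirrors the documented "always returns a dict" fallback.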