CorpusStats
- class scikitplot.corpus.CorpusStats(n_documents, n_tokens, n_chars, mean_tokens, median_tokens, min_tokens, max_tokens, language_counts, section_type_counts, source_type_counts, source_file_counts, collection_ids, has_embeddings, date_range)
Aggregate statistics over a CorpusDocument collection.
- Parameters:
- n_documents : int
Total document count.
- n_tokens : int
Total whitespace-delimited token count (sum of doc.word_count).
- n_chars : int
Total character count (sum of doc.char_count).
- mean_tokens : float
Average tokens per document. 0.0 when n_documents == 0.
- median_tokens : float
Median tokens per document. 0.0 when n_documents == 0.
- min_tokens : int
Minimum token count across documents.
- max_tokens : int
Maximum token count across documents.
- language_counts : dict[str, int]
Map of ISO 639-1 code → document count. None language is stored as "unknown".
- section_type_counts : dict[str, int]
Map of SectionType.value → document count.
- source_type_counts : dict[str, int]
Map of SourceType.value → document count.
- input_path_counts : dict[str, int]
Map of input_path → document count.
- collection_ids : list[str]
Sorted unique collection_id values (None excluded).
- has_embeddings : int
Number of documents where embedding is not None.
- date_range : tuple[str, str] or None
(earliest, latest) source_date values, or None if no documents have dates.
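The field definitions above can be illustrated with a minimal sketch that derives several of them from a list of documents. This is not the library's implementation: Doc is a hypothetical stand-in for CorpusDocument, summarize is a hypothetical helper, token and character counts are taken from a plain text field rather than doc.word_count and doc.char_count, source_date is assumed to be an ISO 8601 string so that string order matches date order, and the 0 returned for min_tokens and max_tokens on an empty collection is an assumption, since the reference does not state that case.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for CorpusDocument (only the fields used here)."""
    text: str
    language: Optional[str] = None
    collection_id: Optional[str] = None
    source_date: Optional[str] = None  # assumed ISO 8601, e.g. "2024-01-05"
    embedding: Optional[list] = None


def summarize(docs):
    """Compute a subset of the documented statistics as a plain dict."""
    # Whitespace-delimited token count per document.
    token_counts = [len(d.text.split()) for d in docs]
    # Documents without a date are ignored for the date range.
    dates = sorted(d.source_date for d in docs if d.source_date is not None)
    return {
        "n_documents": len(docs),
        "n_tokens": sum(token_counts),
        "n_chars": sum(len(d.text) for d in docs),
        # Documented behaviour: 0.0 when there are no documents.
        "mean_tokens": float(mean(token_counts)) if docs else 0.0,
        "median_tokens": float(median(token_counts)) if docs else 0.0,
        # Assumption: 0 for an empty collection (not stated in the reference).
        "min_tokens": min(token_counts, default=0),
        "max_tokens": max(token_counts, default=0),
        # A None language is counted under the key "unknown".
        "language_counts": dict(
            Counter(d.language if d.language is not None else "unknown" for d in docs)
        ),
        # Sorted unique values, None excluded.
        "collection_ids": sorted(
            {d.collection_id for d in docs if d.collection_id is not None}
        ),
        "has_embeddings": sum(d.embedding is not None for d in docs),
        "date_range": (dates[0], dates[-1]) if dates else None,
    }


docs = [
    Doc("a b c", language="en", collection_id="c2",
        source_date="2024-01-05", embedding=[0.1]),
    Doc("d e", collection_id="c1", source_date="2023-12-31"),
    Doc("f g h i", language="en"),
]
stats = summarize(docs)
print(stats["language_counts"], stats["date_range"])
```

Note that has_embeddings is a count rather than a boolean, which is why the sketch sums the per-document checks.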
- Parameters: