CorpusStats

class scikitplot.corpus.CorpusStats(n_documents, n_tokens, n_chars, mean_tokens, median_tokens, min_tokens, max_tokens, language_counts, section_type_counts, source_type_counts, source_file_counts, collection_ids, has_embeddings, date_range)

Aggregate statistics over a CorpusDocument collection.

Parameters:
n_documents : int

Total document count.

n_tokens : int

Total whitespace-delimited token count (sum of doc.word_count).

n_chars : int

Total character count (sum of doc.char_count).

mean_tokens : float

Average tokens per document. 0.0 when n_documents == 0.

median_tokens : float

Median tokens per document. 0.0 when n_documents == 0.

min_tokens : int

Minimum token count across documents.

max_tokens : int

Maximum token count across documents.

language_counts : dict[str, int]

Map of ISO 639-1 code → document count. Documents whose language is None are counted under the key "unknown".

section_type_counts : dict[str, int]

Map of SectionType.value → document count.

source_type_counts : dict[str, int]

Map of SourceType.value → document count.

source_file_counts : dict[str, int]

Map of input_path → document count.

collection_ids : list[str]

Sorted unique collection_id values (None excluded).

has_embeddings : int

Number of documents where embedding is not None.

date_range : tuple[str, str] or None

(earliest, latest) source_date values, or None if no documents have dates.
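The field definitions above can be illustrated with a minimal sketch that derives a subset of them from a list of documents. It does not use scikitplot itself: `Doc` is a hypothetical stand-in for CorpusDocument carrying only the attributes named in the descriptions, `aggregate` is an illustrative helper rather than the library's implementation, and the date handling assumes source_date values are ISO 8601 strings that sort chronologically as text. min_tokens and max_tokens are left out because their empty-corpus behavior is not documented here.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for CorpusDocument (not the real class)."""
    word_count: int
    char_count: int
    language: Optional[str] = None
    collection_id: Optional[str] = None
    embedding: Optional[list] = None
    source_date: Optional[str] = None  # assumed ISO 8601, e.g. "2024-01-02"


def aggregate(docs):
    """Compute a subset of the documented statistics as a plain dict."""
    tokens = [d.word_count for d in docs]
    dates = sorted(d.source_date for d in docs if d.source_date is not None)
    languages = Counter(
        d.language if d.language is not None else "unknown" for d in docs
    )
    return {
        "n_documents": len(docs),
        "n_tokens": sum(tokens),
        "n_chars": sum(d.char_count for d in docs),
        # Guard the empty case so both averages fall back to 0.0.
        "mean_tokens": float(mean(tokens)) if tokens else 0.0,
        "median_tokens": float(median(tokens)) if tokens else 0.0,
        "language_counts": dict(languages),
        # Sorted unique IDs, with None excluded.
        "collection_ids": sorted(
            {d.collection_id for d in docs if d.collection_id is not None}
        ),
        # A count of documents with an embedding, not a boolean flag.
        "has_embeddings": sum(1 for d in docs if d.embedding is not None),
        "date_range": (dates[0], dates[-1]) if dates else None,
    }


docs = [
    Doc(3, 10, "en", "a", [0.1], "2024-01-02"),
    Doc(5, 20, None, "b", None, "2023-05-01"),
    Doc(10, 40, "en", "a", None, None),
]
stats = aggregate(docs)
empty = aggregate([])
```

Note that has_embeddings is a count despite its boolean-sounding name, so a corpus with partial embedding coverage reports a value between 0 and n_documents.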

Attributes:
collection_ids: list[str]
date_range: tuple[str, str] | None
has_embeddings: int
language_counts: dict[str, int]
max_tokens: int
mean_tokens: float
median_tokens: float
min_tokens: int
n_chars: int
n_documents: int
n_tokens: int
section_type_counts: dict[str, int]
source_file_counts: dict[str, int]
source_type_counts: dict[str, int]
summary()

Return a human-readable one-page summary string.

Returns:
str

to_dict()

Return a JSON-safe dictionary representation of the stats.

Returns:
dict[str, Any]
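As a rough illustration of what "JSON-safe" implies for these fields, the sketch below round-trips a small stats-like dictionary through the standard json module. `StatsSketch` is a hypothetical stand-in with three of the documented fields, and `dataclasses.asdict` is used in place of to_dict(), whose actual implementation is not shown here. One point worth knowing either way: JSON has no tuple type, so a date_range tuple comes back as a list after decoding.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StatsSketch:
    """Hypothetical stand-in; not the real CorpusStats class."""
    n_documents: int
    language_counts: dict
    date_range: Optional[tuple]


stats = StatsSketch(
    n_documents=2,
    language_counts={"en": 1, "unknown": 1},
    date_range=("2023-05-01", "2024-01-02"),
)

# asdict stands in for to_dict() in this sketch.
payload = json.dumps(asdict(stats), sort_keys=True)
restored = json.loads(payload)

# JSON arrays decode to lists, so convert back if a tuple is needed.
restored_range = (
    tuple(restored["date_range"]) if restored["date_range"] is not None else None
)
```

Code that compares a decoded date_range against the original attribute should convert it back to a tuple first, as the last statement does.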