BuildResult

class scikitplot.corpus.BuildResult(documents=<factory>, n_sources=0, n_raw=0, n_filtered=0, n_normalised=0, n_enriched=0, n_embedded=0, index=None, errors=<factory>)

Result of a corpus build operation.

Parameters:

documents (list[CorpusDocument]): The processed documents.
n_sources (int): Number of source files/URLs processed.
n_raw (int): Total raw chunks before filtering.
n_filtered (int): Chunks removed by filtering.
n_normalised (int): Chunks that were text-normalised.
n_enriched (int): Chunks that were NLP-enriched.
n_embedded (int): Chunks that were embedded.
index (SimilarityIndex or None): Built similarity index (if build_index=True).
errors (list[tuple[str, Exception]]): (input_path, exception) pairs for failed sources.

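The `<factory>` defaults shown in the signature for documents and errors are how dataclasses render a default_factory, so each result should start with its own empty lists. The following is a minimal sketch of that behaviour; `BuildResultSketch` is a hypothetical stand-in that mirrors the documented fields and is not the library's class.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BuildResultSketch:
    """Hypothetical stand-in mirroring the documented BuildResult fields."""

    documents: list[Any] = field(default_factory=list)
    n_sources: int = 0
    n_raw: int = 0
    n_filtered: int = 0
    n_normalised: int = 0
    n_enriched: int = 0
    n_embedded: int = 0
    index: Any = None
    errors: list[tuple[str, Exception]] = field(default_factory=list)


# Each instance receives its own fresh lists, so mutating one result
# does not leak into another.
first = BuildResultSketch()
second = BuildResultSketch()
first.documents.append("placeholder document")
```

Using a factory here avoids the shared-mutable-default pitfall that a plain `documents=[]` default would introduce.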

Notes

User note: Access documents directly:

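# `builder` is assumed to be an already-configured corpus builder.
result = builder.build("./data/")
# Preview the first 80 characters of each processed document.
for doc in result.documents:
    print(doc.text[:80])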

documents: list[Any]

errors: list[tuple[str, Exception]]
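Because errors holds (input_path, exception) pairs, failed sources can be inspected without re-running the build. The sketch below groups failed paths by exception type; the paths and exceptions are invented sample data, not output from a real build.

```python
# Hypothetical failure records shaped like BuildResult.errors:
# a list of (input_path, exception) pairs.
errors: list[tuple[str, Exception]] = [
    ("./data/broken.pdf", ValueError("unreadable PDF header")),
    ("./data/missing.txt", FileNotFoundError("no such file")),
]

# Group failed paths by exception type to see which failure dominates.
failures_by_type: dict[str, list[str]] = {}
for input_path, exc in errors:
    failures_by_type.setdefault(type(exc).__name__, []).append(input_path)

for exc_name, paths in failures_by_type.items():
    print(f"{exc_name}: {', '.join(paths)}")
```

With a real result, the same loop would run over `result.errors` instead of the sample list.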

index: Any = None

property n_documents: int

Number of CorpusDocument instances in documents.

Returns:

int

n_embedded: int = 0

n_enriched: int = 0

n_filtered: int = 0

n_normalised: int = 0

n_raw: int = 0

n_sources: int = 0

property success_rate: float

Fraction of ingested sources that completed without error.

Returns:

float: (n_sources - len(errors)) / n_sources, a value in [0.0, 1.0]. Returns 1.0 when no sources were processed, which avoids division by zero.
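The documented formula can be sketched as a standalone function. This is an illustration of the stated arithmetic, not the library's implementation; the function name and its `n_errors` parameter (standing in for len(errors)) are assumptions.

```python
def success_rate(n_sources: int, n_errors: int) -> float:
    """Fraction of sources that completed without error (illustrative)."""
    if n_sources == 0:
        # Documented convention: an empty build counts as fully successful.
        return 1.0
    return (n_sources - n_errors) / n_sources


# Ten sources with two failures, then the empty-build edge case.
rate_partial = success_rate(n_sources=10, n_errors=2)
rate_empty = success_rate(n_sources=0, n_errors=0)
print(rate_partial, rate_empty)
```

Note that a rate of 1.0 alone does not distinguish a clean build from an empty one, so it may be worth checking n_sources alongside it.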

summary()

Return a multi-line human-readable build summary.

Returns:

str: Multi-line string reporting sources, documents, normalisation, enrichment, embedding counts, and any errors.
