LanguageDetectionNormalizer

class scikitplot.corpus.LanguageDetectionNormalizer(fallback_language=None, min_confidence=0.7, overwrite=False)

Detect document language and set CorpusDocument.language.

Uses langdetect (pip install langdetect), a port of Google’s language-detection library. If detection fails, or the detected language has a confidence below min_confidence, the normalizer falls back to fallback_language.

Parameters:

fallback_language (str or None, optional): ISO 639-1 language code to use when detection fails. None leaves language unchanged on failure. Default: None.
min_confidence (float, optional): Minimum probability threshold for accepting a detected language. Must be in [0.0, 1.0]. Default: 0.7.
overwrite (bool, optional): When False, skip detection if the document already has a non-None language field. Default: False.
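
The way the three parameters interact can be sketched as a small decision function. This is an illustrative sketch, not the library's implementation: choose_language and detect_candidates are hypothetical helpers, and the use of langdetect's detect_langs (which returns candidates with .lang and .prob) is an assumption about how the confidence is obtained.

```python
from typing import List, Optional, Sequence, Tuple


def detect_candidates(text: str) -> List[Tuple[str, float]]:
    """Hypothetical helper: return (language, probability) pairs from langdetect."""
    # Imported lazily so this module loads even when langdetect is missing.
    from langdetect import detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return [(c.lang, c.prob) for c in detect_langs(text)]
    except LangDetectException:
        # Raised, for example, when the text has no usable features.
        return []


def choose_language(
    current: Optional[str],
    candidates: Sequence[Tuple[str, float]],
    fallback_language: Optional[str] = None,
    min_confidence: float = 0.7,
    overwrite: bool = False,
) -> Optional[str]:
    """Hypothetical helper: pick the language a document should end up with."""
    # overwrite=False: an existing language is kept and detection is skipped.
    if current is not None and not overwrite:
        return current
    if candidates:
        lang, prob = max(candidates, key=lambda pair: pair[1])
        if prob >= min_confidence:
            return lang
    # Detection failed or was below the threshold.
    # fallback_language=None leaves the language unchanged.
    return fallback_language if fallback_language is not None else current
```

With these assumptions, choose_language(None, detect_candidates(text), fallback_language="en") would yield the detected code when it clears the threshold and "en" otherwise. Keeping the decision separate from the detector call makes the fallback rules easy to test without langdetect installed.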

Examples

>>> norm = LanguageDetectionNormalizer(fallback_language="en")
>>> doc = CorpusDocument.create("f.txt", 0, "The quick brown fox.")
>>> result = norm.normalize_doc(doc)
>>> result.language
'en'

normalize_doc(doc)

Detect language and update doc.language.

Parameters:

doc (CorpusDocument)

Returns:

CorpusDocument: New instance with language set (or unchanged on failure).

Raises:

ImportError: If langdetect is not installed.
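
The return and error behaviour described above could be modelled roughly as follows. This is a sketch under stated assumptions: Doc is a hypothetical frozen stand-in for CorpusDocument, and require and with_language are illustrative helpers rather than part of scikitplot. Whether the real method returns the very same object on failure is not stated in this reference; the sketch assumes it does.

```python
import importlib
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for CorpusDocument; only the fields used here."""

    text: str
    language: Optional[str] = None


def require(module_name: str, install_hint: str):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for language detection. "
            f"Install it with: {install_hint}"
        ) from exc


def with_language(doc: Doc, language: Optional[str]) -> Doc:
    """Return a copy carrying the new language, or doc itself if nothing changes."""
    if language is None or language == doc.language:
        return doc
    # dataclasses.replace builds a new instance; the input is never mutated.
    return replace(doc, language=language)
```

Deferring the import to the point of use means the ImportError surfaces only when detection is actually requested, for example via require("langdetect", "pip install langdetect"), and returning a new instance keeps the caller's original document intact.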
