LanguageDetectionNormalizer

class scikitplot.corpus.LanguageDetectionNormalizer(fallback_language=None, min_confidence=0.7, overwrite=False)

Detect document language and set CorpusDocument.language.

Uses langdetect (pip install langdetect), a Python port of Google’s language-detection library. If detection fails, or the detected language has a confidence below min_confidence, the provided fallback_language is used instead.
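
The acceptance rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `resolve_language` is a hypothetical helper, and the `(code, probability)` pairs stand in for whatever the detector returns. It assumes a confidence exactly equal to min_confidence is accepted, since only values below the threshold trigger the fallback.

```python
from typing import Optional, Sequence, Tuple


def resolve_language(
    candidates: Sequence[Tuple[str, float]],
    current: Optional[str] = None,
    fallback_language: Optional[str] = None,
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Pick a language code from detector output (hypothetical helper).

    candidates: (ISO 639-1 code, probability) pairs; an empty sequence
        represents a failed detection.
    current: the language already stored on the document, if any.
    """
    if candidates:
        # Take the most probable candidate and accept it if it clears the threshold.
        code, prob = max(candidates, key=lambda pair: pair[1])
        if prob >= min_confidence:
            return code
    # Detection failed or was not confident enough: use the fallback,
    # or leave the existing value unchanged when no fallback is given.
    return fallback_language if fallback_language is not None else current


confident = resolve_language([("en", 0.99)])
uncertain = resolve_language([("fr", 0.5)], fallback_language="en")
failed = resolve_language([], current="de")
```

Here `confident` is the detected code, `uncertain` resolves to the fallback because 0.5 is below the default threshold, and `failed` keeps the existing value because no fallback was supplied.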

Parameters:
fallback_language : str or None, optional

ISO 639-1 language code to use when detection fails. None leaves language unchanged on failure. Default: None.

min_confidence : float, optional

Minimum probability threshold for accepting a detected language. Must be in [0.0, 1.0]. Default: 0.7.

overwrite : bool, optional

When False, skip detection if the document already has a non-None language field. Default: False.
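
The constraints on min_confidence and overwrite can be expressed as two small checks. This is a sketch only: `validate_min_confidence` and `should_detect` are hypothetical names, and raising ValueError for an out-of-range threshold is an assumption, since the documentation does not say how the class reacts to invalid values.

```python
from typing import Optional


def validate_min_confidence(min_confidence: float) -> float:
    """Check that the threshold lies in [0.0, 1.0] (assumed to raise ValueError)."""
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError(
            f"min_confidence must be in [0.0, 1.0], got {min_confidence!r}"
        )
    return float(min_confidence)


def should_detect(current_language: Optional[str], overwrite: bool = False) -> bool:
    """Decide whether detection should run for a document.

    With overwrite=False, a document that already has a language is skipped.
    """
    return overwrite or current_language is None


skip_existing = should_detect("en")                     # language already set
force_existing = should_detect("en", overwrite=True)    # re-detect anyway
detect_missing = should_detect(None)                    # nothing set yet
```

With the default overwrite=False, only documents whose language is None go through detection, which avoids repeating work when a pipeline is re-run on already-annotated documents.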

Examples

>>> norm = LanguageDetectionNormalizer(fallback_language="en")
>>> doc = CorpusDocument.create("f.txt", 0, "The quick brown fox.")
>>> result = norm.normalize_doc(doc)
>>> result.language
'en'

normalize_doc(doc)

Detect language and update doc.language.

Parameters:
doc : CorpusDocument
Returns:
CorpusDocument

New instance with language set (or unchanged on failure).

Raises:
ImportError

If langdetect is not installed.
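
Because normalize_doc raises ImportError when langdetect is missing, it can be useful to check for the dependency before starting a long run. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, not part of scikitplot.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("langdetect"):
    print("langdetect is missing; install it with: pip install langdetect")
```

Checking up front surfaces the missing dependency once, rather than on the first document that reaches the normalizer.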
