detect_script

scikitplot.corpus.detect_script(text, *, sample_size=500, majority_threshold=0.55)

Detect the dominant Unicode script in text.

Samples up to sample_size characters for efficiency on long documents. Returns ScriptType.MIXED when no single script exceeds majority_threshold of all script characters found.

Parameters:
text : str

Input text to analyse. May be any length; only the first sample_size characters are examined.

sample_size : int, optional

Maximum number of characters to inspect. Default 500.

majority_threshold : float, optional

Fraction of script characters that a single script must reach to be declared dominant. Default 0.55 (55%).

Returns:
ScriptType

Detected dominant script.

Notes

User note: Detection is heuristic: it relies on Unicode code-point ranges and does not use a language model. For ambiguous text (for example, Arabic transliterated into Latin script, or mixed-script social media posts) the result may be ScriptType.MIXED. To override detection, pass an explicit script_hint to SentenceChunkerConfig.

Supported script families: Latin, CJK (Chinese/Japanese/Korean/Hangul), Arabic (including Persian/Ottoman/Urdu), Hebrew, Devanagari (Hindi/Sanskrit/Nepali), Greek, Cyrillic, Ethiopic, Georgian, Coptic (Egyptian proxy), Thai, Southeast Asian (Lao/Myanmar/Khmer), South Asian Dravidian (Tamil/Telugu/Kannada/Malayalam/Sinhala), Armenian, Tibetan.
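
The range-based heuristic described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the range table is a small hand-picked subset of Unicode blocks, the names _RANGES, script_of and sketch_detect_script are hypothetical, plain strings stand in for ScriptType members, and the strict "greater than" comparison against majority_threshold is an assumption.

```python
from collections import Counter

# Hypothetical, deliberately incomplete table of (first, last, script) ranges.
# The actual table used by detect_script is not shown on this page.
_RANGES = [
    (0x0041, 0x005A, "latin"),     # Basic Latin uppercase
    (0x0061, 0x007A, "latin"),     # Basic Latin lowercase
    (0x00C0, 0x024F, "latin"),     # Latin-1 letters to Latin Extended-B (also covers the multiplication and division signs)
    (0x0370, 0x03FF, "greek"),     # Greek and Coptic
    (0x1F00, 0x1FFF, "greek"),     # Greek Extended (polytonic)
    (0x0400, 0x04FF, "cyrillic"),  # Cyrillic
    (0x0590, 0x05FF, "hebrew"),    # Hebrew
    (0x0600, 0x06FF, "arabic"),    # Arabic
    (0x3040, 0x30FF, "cjk"),       # Hiragana and Katakana
    (0x4E00, 0x9FFF, "cjk"),       # CJK Unified Ideographs
]


def script_of(ch):
    """Return the script name for one character, or None if it is in no range."""
    cp = ord(ch)
    for first, last, name in _RANGES:
        if first <= cp <= last:
            return name
    return None


def sketch_detect_script(text, *, sample_size=500, majority_threshold=0.55):
    """Count script characters in the first sample_size characters and pick a winner."""
    counts = Counter()
    for ch in text[:sample_size]:
        name = script_of(ch)
        if name is not None:  # digits, punctuation and spaces are ignored
            counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    name, top = counts.most_common(1)[0]
    # Assumption: the dominant script must strictly exceed the threshold.
    return name if top / total > majority_threshold else "mixed"
```

In this sketch only characters that fall inside a known range count towards the denominator, which is why a string of digits and punctuation yields "unknown" rather than "mixed".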

Developer note: Unicode general categories, as returned by unicodedata.category(c), are not used here because they do not map cleanly to script families. Code-point ranges from the Unicode Standard are used instead.
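
A short standard-library check illustrates why general categories are not enough on their own: letters from three different scripts share the same category, and only the character name (or the code point) tells them apart. The variable names here are illustrative.

```python
import unicodedata

# One lowercase letter each from Latin, Greek and Cyrillic.
samples = ["a", "α", "я"]

# All three map to the same general category, "Ll" (Letter, lowercase).
categories = {ch: unicodedata.category(ch) for ch in samples}

# The character names, by contrast, do carry the script.
names = {ch: unicodedata.name(ch) for ch in samples}

for ch in samples:
    print(ch, categories[ch], names[ch])
# a Ll LATIN SMALL LETTER A
# α Ll GREEK SMALL LETTER ALPHA
# я Ll CYRILLIC SMALL LETTER YA
```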

Examples

>>> detect_script("Hello world")
<ScriptType.LATIN: 'latin'>
>>> detect_script("مرحبا بالعالم")
<ScriptType.ARABIC: 'arabic'>
>>> detect_script("こんにちは世界")
<ScriptType.CJK: 'cjk'>
>>> detect_script("Ἡ γλῶσσα")
<ScriptType.GREEK: 'greek'>
>>> detect_script("12345 !@#$%")
<ScriptType.UNKNOWN: 'unknown'>