detect_script
- scikitplot.corpus.detect_script(text, *, sample_size=500, majority_threshold=0.55)
Detect the dominant Unicode script in text.
Samples up to sample_size characters for efficiency on long documents. Returns
ScriptType.MIXED when no single script exceeds majority_threshold of all script characters found.
- Parameters:
- text : str
Input text to analyse. It may be any length; only the first sample_size characters are examined.
- sample_size : int, optional
Maximum number of characters to inspect. Default 500.
- majority_threshold : float, optional
Fraction of script characters a single script must reach to be declared dominant. Default 0.55 (55%).
- Returns:
- ScriptType
Detected dominant script.
Notes
User note: The detection is based on Unicode code-point ranges and is heuristic; it does not use a language model. For ambiguous texts (transliterated Arabic in Latin script, mixed-script social media posts) the result may be ScriptType.MIXED. Pass an explicit script_hint to SentenceChunkerConfig to override detection.
Supported script families: Latin, CJK (Chinese/Japanese/Korean/Hangul), Arabic (including Persian/Ottoman/Urdu), Hebrew, Devanagari (Hindi/Sanskrit/Nepali), Greek, Cyrillic, Ethiopic, Georgian, Coptic (Egyptian proxy), Thai, Southeast Asian (Lao/Myanmar/Khmer), South Asian Dravidian (Tamil/Telugu/Kannada/Malayalam/Sinhala), Armenian, Tibetan.
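The range-based heuristic in the user note can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the names `sketch_detect_script`, `_script_of`, and `_RANGES` are hypothetical, the range table covers only three scripts, and plain strings stand in for ScriptType members.

```python
from collections import Counter

# Hypothetical, heavily abridged range table (assumption: the real
# function covers many more scripts and Unicode blocks).
_RANGES = {
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "arabic": [(0x0600, 0x06FF)],
    "cyrillic": [(0x0400, 0x04FF)],
}


def _script_of(ch):
    """Return the script name for one character, or None if unmatched."""
    cp = ord(ch)
    for name, ranges in _RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def sketch_detect_script(text, *, sample_size=500, majority_threshold=0.55):
    """Illustrative majority vote over code-point ranges."""
    counts = Counter()
    # Only the first sample_size characters are examined.
    for ch in text[:sample_size]:
        script = _script_of(ch)
        if script is not None:  # digits, spaces, punctuation are skipped
            counts[script] += 1
    total = sum(counts.values())
    if total == 0:
        return "unknown"  # no script characters at all
    name, top = counts.most_common(1)[0]
    # A script is dominant only if it exceeds the threshold share.
    return name if top / total > majority_threshold else "mixed"


print(sketch_detect_script("Hello world"))   # latin
print(sketch_detect_script("12345 !@#$%"))   # unknown
print(sketch_detect_script("abc где"))       # mixed (3 Latin vs 3 Cyrillic)
```

Only characters that match a script range enter the denominator, which is why digits and punctuation do not dilute the vote in this sketch.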
Developer note: Unicode general categories from unicodedata.category(c) are not used here because they do not map cleanly to script families. Code-point ranges from the Unicode Standard are used instead.
Examples
>>> detect_script("Hello world")
<ScriptType.LATIN: 'latin'>
>>> detect_script("مرحبا بالعالم")
<ScriptType.ARABIC: 'arabic'>
>>> detect_script("こんにちは世界")
<ScriptType.CJK: 'cjk'>
>>> detect_script("Ἡ γλῶσσα")
<ScriptType.GREEK: 'greek'>
>>> detect_script("12345 !@#$%")
<ScriptType.UNKNOWN: 'unknown'>
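As a small illustration of the developer note, the standard-library snippet below shows why general categories alone cannot separate scripts: letters from unrelated scripts share the same category. The sample characters and the `categories` name are chosen here for the example and are not part of the library.

```python
import unicodedata

# One representative letter per script (illustrative choices).
samples = {"Latin": "a", "Greek": "α", "Arabic": "م", "CJK": "世"}

# Map each script label to the general category of its sample letter.
categories = {
    script: unicodedata.category(ch) for script, ch in samples.items()
}

for script, ch in samples.items():
    print(f"{script:7} U+{ord(ch):04X} {categories[script]}")
```

Latin and Greek lowercase letters both report `Ll`, while Arabic and CJK letters both report `Lo`, so the category identifies the kind of character rather than its script.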