split_cjk_chars
- scikitplot.corpus.split_cjk_chars(text)
Split text into individual CJK character tokens.
Runs of non-CJK text (Latin words, numbers) are split on whitespace and kept as whole tokens. This produces a mixed token list in which each CJK ideograph is its own token and each Latin/numeric word is its own token.
- Parameters:
- text : str
Input text that may contain CJK characters.
- Returns:
- list[str]
Mixed token list.
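For orientation, here is a minimal re-implementation sketch of the splitting rule described above. It assumes CJK detection via the common Unicode ranges for unified ideographs, kana, and hangul; the ranges scikitplot actually checks may differ.

import re

# Hypothetical CJK detector for illustration: CJK Unified Ideographs,
# Extension A, Hiragana, Katakana, and Hangul syllables.
_CJK_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")

def split_cjk_chars_sketch(text):
    """Split CJK characters into single tokens; whitespace-split the rest."""
    tokens = []
    buffer = []  # accumulates a contiguous non-CJK run
    for ch in text:
        if _CJK_RE.match(ch):
            # Flush any pending non-CJK run, splitting it on whitespace.
            tokens.extend("".join(buffer).split())
            buffer = []
            tokens.append(ch)  # each CJK character becomes its own token
        else:
            buffer.append(ch)
    tokens.extend("".join(buffer).split())
    return tokens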
Notes
User note: This is the recommended tokenization strategy for Chinese, Japanese (without furigana), and Korean when a dedicated morphological analyser (MeCab, jieba, kss) is not available. Character-level tokenization loses word-level semantics but ensures that CJK text is not treated as one giant “word” by whitespace splitters.
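To illustrate, contrast plain whitespace splitting with character-level splitting on mixed text; the output shown follows the behavior documented above.

>>> "機械学習 is fun".split()
['機械学習', 'is', 'fun']
>>> split_cjk_chars("機械学習 is fun")
['機', '械', '学', '習', 'is', 'fun']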
Developer note: Used by _tokenize_whitespace when unit=TOKENS and the text is detected as CJK.
Examples
>>> split_cjk_chars("你好 world 再见")
['你', '好', 'world', '再', '见']
>>> split_cjk_chars("Hello world")
['Hello', 'world']
>>> split_cjk_chars("abc 日本語 123")
['abc', '日', '本', '語', '123']
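For developers, a sketch of how the dispatch described in the note above might look. The names _tokenize_whitespace and unit=TOKENS come from that note; the detection helper _contains_cjk is hypothetical, and this is not the library's actual code.

import re

# Hypothetical detection helper; the real check inside scikitplot may differ.
_CJK_ANY = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")

def _contains_cjk(text):
    return _CJK_ANY.search(text) is not None

def _tokenize_whitespace(text, unit="TOKENS"):
    # Fall back to character-level CJK splitting when the text looks like CJK;
    # otherwise split on whitespace as usual.
    if unit == "TOKENS" and _contains_cjk(text):
        from scikitplot.corpus import split_cjk_chars
        return split_cjk_chars(text)
    return text.split()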