split_cjk_chars

scikitplot.corpus.split_cjk_chars(text)

Split text into individual CJK character tokens.

Non-CJK runs (Latin words, numbers) are split on whitespace and kept as whole tokens. The result is a mixed token list in which each CJK character is its own token and each Latin or numeric word is its own token.

Parameters:

text (str): Input text that may contain CJK characters.

Returns:

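list[str]: Mixed token list in input order, with one token per CJK character and one token per whitespace-delimited non-CJK run.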

Notes

User note: This is the recommended tokenization strategy for Chinese, Japanese (without furigana), and Korean when a dedicated morphological analyser (MeCab, jieba, kss) is not available. Character-level tokenization loses word-level semantics but ensures that CJK text is not treated as one giant “word” by whitespace splitters.
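To illustrate the problem the user note describes, the following minimal sketch uses only Python built-ins (not the library) to compare whitespace splitting with character-level splitting. The sample sentence is an arbitrary illustration, and `list(text)` matches the described behaviour only for text that is purely CJK with no spaces.

```python
# A Chinese sentence written, as usual, without spaces between words.
text = "我喜欢自然语言处理"

# Whitespace splitting finds no separators, so the whole sentence
# comes back as a single token.
whitespace_tokens = text.split()

# Character-level splitting yields one token per character.
char_tokens = list(text)

print(len(whitespace_tokens), len(char_tokens))
```

With whitespace splitting, downstream token counts and vocabulary statistics would treat the entire sentence as one unit; character-level tokens avoid that at the cost of word boundaries.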

Developer note: Used by _tokenize_whitespace when unit=TOKENS and the text is detected as CJK.

Examples

>>> split_cjk_chars("你好 world 再见")
['你', '好', 'world', '再', '见']
>>> split_cjk_chars("Hello world")
['Hello', 'world']
>>> split_cjk_chars("abc 日本語 123")
['abc', '日', '本', '語', '123']
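The behaviour shown in these examples can be approximated with a single regular expression. The sketch below is not the library's implementation: the function name `split_cjk_chars_sketch` and the Unicode ranges chosen are assumptions made for illustration, and the actual function may cover different ranges or handle mixed runs such as `"abc日本"` differently.

```python
import re

# Assumed character ranges (illustrative only; the library may differ).
_CJK_RANGES = (
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uf900-\ufaff"  # CJK Compatibility Ideographs
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\uac00-\ud7af"  # Hangul Syllables
)

# Match either a single CJK character, or a run of characters that are
# neither whitespace nor CJK.
_TOKEN_RE = re.compile(f"[{_CJK_RANGES}]|[^\\s{_CJK_RANGES}]+")


def split_cjk_chars_sketch(text: str) -> list[str]:
    """Hypothetical re-implementation: one token per CJK character,
    one token per whitespace-delimited non-CJK run."""
    return _TOKEN_RE.findall(text)


print(split_cjk_chars_sketch("你好 world 再见"))
print(split_cjk_chars_sketch("abc 日本語 123"))
```

Using one alternation keeps tokens in input order without a separate merge step; whitespace is never matched by either branch, so it is dropped implicitly.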