FunctionTokenizer#

class scikitplot.corpus.FunctionTokenizer(fn, name='custom')[source]#

Wrap any Callable[[str], list[str]] as a TokenizerProtocol.

Parameters:
fn : Callable[[str], list[str]]

Tokenization function. Must accept a single str argument and return a list[str].

name : str, optional

Human-readable name for logging and repr.

Notes

User note: Use this to plug in any tokenization library:

# Japanese: MeCab morphological analyzer in wakati (space-separated) mode
import MeCab
tagger = MeCab.Tagger("-Owakati")
tok = FunctionTokenizer(lambda text: tagger.parse(text).strip().split())

# Chinese: jieba word segmentation
import jieba
tok = FunctionTokenizer(lambda text: list(jieba.cut(text)))

Developer note: The wrapper stores only the callable; no model loading happens at construction time.
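This store-only behavior can be sketched as follows. This is a minimal, illustrative stand-in, not the library's actual source; the attribute names and error message are assumptions, though the type check mirrors the TypeError documented for tokenize.

```python
from typing import Callable


class FunctionTokenizer:
    """Illustrative stand-in for scikitplot.corpus.FunctionTokenizer."""

    def __init__(self, fn: Callable[[str], list[str]], name: str = "custom"):
        # Only the callable and name are stored; any heavy model loading
        # is deferred to the callable itself.
        self.fn = fn
        self.name = name

    def __repr__(self) -> str:
        return f"FunctionTokenizer(name={self.name!r})"

    def tokenize(self, text: str) -> list[str]:
        tokens = self.fn(text)
        if not isinstance(tokens, list):
            raise TypeError(
                f"{self.name!r} returned {type(tokens).__name__}, expected list"
            )
        return tokens


tok = FunctionTokenizer(str.split, name="whitespace")
print(tok.tokenize("hello world"))  # ['hello', 'world']
```

Because construction stores only the callable, an expensive resource such as a MeCab tagger is loaded once by the caller and merely captured by the lambda.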

Examples

>>> tok = FunctionTokenizer(str.split)
>>> tok.tokenize("hello world")
['hello', 'world']
>>> tok = FunctionTokenizer(lambda t: list(t), name="char_splitter")
>>> tok.tokenize("abc")
['a', 'b', 'c']
tokenize(text)[source]#

Tokenize text using the wrapped callable.

Parameters:
text : str

Input text.

Returns:
list[str]

Token list.

Raises:
TypeError

If the wrapped callable does not return a list.
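The error path can be demonstrated with a stand-in check equivalent to what tokenize performs (a sketch; the exact exception message is an assumption):

```python
def tokenize_checked(fn, text):
    # Stand-in for FunctionTokenizer.tokenize: apply fn, reject non-list results.
    tokens = fn(text)
    if not isinstance(tokens, list):
        raise TypeError(f"expected list, got {type(tokens).__name__}")
    return tokens


print(tokenize_checked(str.split, "hello world"))  # ['hello', 'world']

try:
    tokenize_checked(lambda t: tuple(t), "abc")  # returns a tuple, not a list
except TypeError as exc:
    print("TypeError:", exc)
```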
