FixedWindowChunkerConfig

class scikitplot.corpus.FixedWindowChunkerConfig(window_size=512, step_size=256, unit=WindowUnit.CHARS, min_length=10, include_offsets=True, strip_whitespace=True, multilang_config=None)

Configuration for FixedWindowChunker.

Parameters:
window_size : int

Size of each chunk, measured in the units given by unit.

step_size : int

Stride between consecutive chunk starts. step_size == window_size gives non-overlapping chunks; step_size < window_size gives sliding-window overlap.

unit : WindowUnit

Measurement unit: CHARS (default) or TOKENS.

min_length : int

Minimum length, in characters, required to keep the last (possibly partial) chunk.

include_offsets : bool

Whether to compute and store character offsets.

strip_whitespace : bool

Whether to strip leading/trailing whitespace from each chunk.

multilang_config : MultilangConfig | None

Optional multilingual configuration; None by default.
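
Example:

The interaction of window_size, step_size, and min_length can be sketched in plain Python. The helper below, fixed_windows, is hypothetical and is not the library's implementation. It assumes character units, that a final chunk shorter than min_length is dropped, and that the reported offsets refer to the window before whitespace stripping.

```python
def fixed_windows(text, window_size=512, step_size=256,
                  min_length=10, strip_whitespace=True):
    """Hypothetical sketch: return (start, end, chunk) tuples for character windows."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")

    chunks = []
    for start in range(0, len(text), step_size):
        # Clamp the window at the end of the text; the last chunk may be partial.
        end = min(start + window_size, len(text))
        chunk = text[start:end]
        if strip_whitespace:
            chunk = chunk.strip()

        is_last = end == len(text)
        # Assumption: only the final chunk is subject to the min_length check.
        if is_last and len(chunk) < min_length:
            break
        # Offsets describe the unstripped window (an assumption of this sketch).
        chunks.append((start, end, chunk))
        if is_last:
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters

# step_size < window_size: each window overlaps the previous one by 5 characters.
overlapping = [start for start, _, _ in fixed_windows(sample, 10, 5, 3)]

# step_size == window_size: windows tile the text without overlap.
tiled = [start for start, _, _ in fixed_windows(sample, 10, 10, 3)]
```

In this sketch, overlapping holds the start offsets 0, 5, 10, 15, 20 and tiled holds 0, 10, 20. Appending two more characters to sample and keeping min_length=3 would leave tiled unchanged, because the two-character tail falls below the minimum.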

include_offsets: bool = True
min_length: int = 10
multilang_config: MultilangConfig | None = None
step_size: int = 256
strip_whitespace: bool = True
unit: WindowUnit = 'chars'
window_size: int = 512