LLMTrainingExporter#

class scikitplot.corpus.LLMTrainingExporter(engine=None, default_system_prompt='You are a helpful assistant.')[source]#

Export a corpus with embeddings to LLM training formats.

Orchestrates the full pipeline from a list[CorpusDocument] to training-ready files and datasets.

Parameters:
engine : MultimodalEmbeddingEngine or EmbeddingEngine or None

Embedding engine. When None, existing doc.embedding values are used as-is; raises ValueError if a document lacks an embedding where one is required.

default_system_prompt : str, optional

System prompt prepended to all OpenAI fine-tuning conversations. Default: "You are a helpful assistant.".

Parameters:
  • engine (Any | None)

  • default_system_prompt (str)

Examples

OpenAI fine-tuning (chat format):

>>> exporter = LLMTrainingExporter(engine)
>>> exporter.to_openai_finetuning_jsonl(
...     docs,
...     output_path=Path("train.jsonl"),
...     system_prompt="Answer medical questions accurately.",
...     response_fn=lambda doc: doc.metadata.get("answer", ""),
... )

HuggingFace SFT dataset:

>>> ds = exporter.to_huggingface_training_dataset(
...     docs,
...     tokenizer_name="gpt2",
...     task="clm",
... )

Pure embedding matrix for vector DB or contrastive training:

>>> matrix, meta_df = exporter.to_embedding_matrix(docs)
>>> matrix.shape  # (N, D)
(1024, 512)
default_system_prompt: str = 'You are a helpful assistant.'#
engine: Any | None = None#
log_to_mlflow(documents, *, run_name=None, artifact_dir='corpus_embeddings', log_params=True)[source]#

Log embedding matrix and metadata as MLflow artifacts.

Parameters:
documents : list[CorpusDocument]

Documents with embeddings.

run_name : str or None, optional

MLflow run name. Uses the active run when None.

artifact_dir : str, optional

Directory inside the MLflow artifact store. Default: "corpus_embeddings".

log_params : bool, optional

Log engine config as MLflow params. Default: True.

Raises:
ImportError

If mlflow is not installed.

Return type:

None
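The artifact layout this method produces can be sketched without MLflow itself. The file names (`embeddings.npy`, `metadata.csv`) and directory layout below are illustrative assumptions, not the exporter's guaranteed output; with mlflow installed, the final step would hand the directory to `mlflow.log_artifacts`:

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

# Stand-in data: a small (N, D) embedding matrix plus per-document ids.
embeddings = np.random.rand(3, 4).astype(np.float32)
doc_ids = ["doc-0", "doc-1", "doc-2"]

# Write the two artifacts into a local staging directory.
artifact_dir = Path(tempfile.mkdtemp()) / "corpus_embeddings"
artifact_dir.mkdir(parents=True)

np.save(artifact_dir / "embeddings.npy", embeddings)
with open(artifact_dir / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doc_id"])
    writer.writerows([[d] for d in doc_ids])

# With mlflow installed, the exporter would then run something like:
#   with mlflow.start_run(run_name=run_name):
#       mlflow.log_artifacts(str(artifact_dir), artifact_path="corpus_embeddings")
print(sorted(p.name for p in artifact_dir.iterdir()))
# ['embeddings.npy', 'metadata.csv']
```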

to_embedding_matrix(documents, *, include_metadata=True, output_path=None)[source]#

Export embeddings as a (N, D) NumPy matrix with metadata.

Parameters:
documents : list[CorpusDocument]

Documents to export. Those without embeddings are embedded via the engine when one is set; otherwise a ValueError is raised.

include_metadata : bool, optional

Build a metadata DataFrame (or dict of lists). Default: True.

output_path : pathlib.Path or str or None, optional

When set, saves {output_path}.npy (matrix) and {output_path}.csv (metadata). Default: None.

Returns:
matrix : ndarray of shape (N, D), float32
metadata : pandas.DataFrame or dict[str, list]

Metadata table with doc_id, source_file, source_type, modality, content_hash, chunk_index columns. Returns plain dict when pandas is not installed.

Raises:
ValueError

If any document lacks an embedding and engine=None.

Return type:

tuple[ndarray[tuple[Any, …], dtype[_ScalarT]], Any]
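The matrix/metadata pair returned above can be sketched with plain NumPy. The `Doc` class here is a hypothetical stand-in for `CorpusDocument`, and the dict-of-lists metadata mirrors the documented fallback used when pandas is not installed:

```python
import numpy as np

# Hypothetical minimal documents standing in for CorpusDocument:
# each carries a doc_id and a precomputed embedding.
class Doc:
    def __init__(self, doc_id, embedding):
        self.doc_id = doc_id
        self.embedding = embedding

docs = [Doc(f"doc-{i}", [float(i)] * 4) for i in range(3)]

# Stack per-document embeddings into the (N, D) float32 matrix the
# method returns, plus a plain dict-of-lists metadata table.
matrix = np.asarray([d.embedding for d in docs], dtype=np.float32)
metadata = {"doc_id": [d.doc_id for d in docs]}

print(matrix.shape)   # (3, 4)
print(matrix.dtype)   # float32
```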

to_huggingface_training_dataset(documents, *, tokenizer_name='gpt2', max_length=512, task='clm', text_field='text', label_field=None, include_embeddings=False, stride=0)[source]#

Build a HuggingFace datasets.Dataset for LLM training.

Parameters:
documents : list[CorpusDocument]

Documents to tokenize.

tokenizer_name : str, optional

HuggingFace tokenizer name or local path. Default: "gpt2".

max_length : int, optional

Maximum token sequence length. Sequences are truncated (and optionally strided). Default: 512.

task : {"clm", "mlm", "sft"}, optional

Training objective.

"clm": causal language model (labels = input_ids).
"mlm": masked language model (15% random masking).
"sft": supervised fine-tuning (requires label_field to be set).

Default: "clm".

text_field : str, optional

Document attribute to tokenize. Default: "text".

label_field : str or None, optional

Attribute to use as classification label (for "sft"). Default: None.

include_embeddings : bool, optional

Add "embedding" column. Default: False.

stride : int, optional

Overlap between windows when splitting long texts. Default: 0 (no stride).

Returns:
datasets.Dataset

Tokenized training dataset. Falls back to a plain dict of lists when datasets is not installed.

Raises:
ImportError

If transformers is not installed.

Parameters:
  • documents (list[Any])

  • tokenizer_name (str)

  • max_length (int)

  • task (str)

  • text_field (str)

  • label_field (str | None)

  • include_embeddings (bool)

  • stride (int)

Return type:

Any
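The interaction of max_length and stride can be sketched with a self-contained windowing helper: long token sequences are split into windows of at most max_length tokens, with stride tokens of overlap between consecutive windows. This mirrors the documented behavior only; the actual implementation relies on the HuggingFace tokenizer's own overflow handling, and `window_tokens` is a hypothetical name:

```python
def window_tokens(token_ids, max_length, stride=0):
    """Split token_ids into windows of max_length with `stride` overlap."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride if stride else max_length
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end
    return windows

tokens = list(range(10))
print(window_tokens(tokens, max_length=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```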

to_openai_finetuning_jsonl(documents, output_path, *, system_prompt=None, response_fn=None, user_field='text', include_embeddings=False, skip_empty=True)[source]#

Export documents as OpenAI chat fine-tuning JSONL.

Each line is a valid fine-tuning example:

{
  "messages": [
    {"role": "system", "content": "<system_prompt>"},
    {"role": "user",   "content": "<doc.text>"},
    {"role": "assistant", "content": "<response_fn(doc)>"}
  ],
  "embedding": [...]    // optional, only when include_embeddings=True
}
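One line of that JSONL can be assembled by hand with the standard library. The document here is a plain-dict stand-in for CorpusDocument, and the response_fn mirrors the documented default of reading the answer from metadata:

```python
import json

# Stand-in document: text plus an answer stored in metadata.
doc = {"text": "What is hypertension?",
       "metadata": {"answer": "Persistently elevated blood pressure."}}

system_prompt = "Answer medical questions accurately."
response_fn = lambda d: d["metadata"].get("answer", "")

# Build one chat fine-tuning record in the format shown above.
record = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": doc["text"]},
        {"role": "assistant", "content": response_fn(doc)},
    ]
}

line = json.dumps(record)  # one line of the output .jsonl
print(line)
```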
Parameters:
documents : list[CorpusDocument]

Documents to export.

output_path : pathlib.Path or str

Destination .jsonl file.

system_prompt : str or None, optional

System message. Defaults to self.default_system_prompt.

response_fn : callable or None, optional

A callable fn(doc) -> str producing the assistant response for each document. When None, uses doc.metadata.get("answer", ""), which suits corpora whose answers are stored in metadata.

user_field : str, optional

Document attribute to use as the user message. Default: "text".

include_embeddings : bool, optional

Append "embedding" key to each record. Requires that embeddings are present (call _ensure_embedded first or set an engine). Default: False.

skip_empty : bool, optional

Skip documents with empty user content. Default: True.

Returns:
pathlib.Path

Path to the written .jsonl file.

Return type:

Path

Notes

OpenAI fine-tuning requires at minimum a "user" message and an "assistant" message in each example. Provide response_fn to generate meaningful assistant turns; otherwise documents whose metadata lacks an "answer" key are exported with an empty assistant message. Such partial examples only become usable for supervised fine-tuning once you fill in the assistant responses separately.