LLMTrainingExporter#

class scikitplot.corpus.LLMTrainingExporter(engine=None, default_system_prompt='You are a helpful assistant.')[source]#

Export a corpus with embeddings to LLM training formats.

Orchestrates the full pipeline from a list[CorpusDocument] to training-ready files and datasets.

Parameters:
engine : MultimodalEmbeddingEngine or EmbeddingEngine or None

Embedding engine. When None, existing doc.embedding values are used as-is; raises ValueError if a document lacks an embedding where one is required.

default_system_prompt : str, optional

System prompt prepended to all OpenAI fine-tuning conversations. Default: "You are a helpful assistant.".

Parameters:
  • engine (Any | None)

  • default_system_prompt (str)

Examples

OpenAI fine-tuning (chat format):

>>> exporter = LLMTrainingExporter(engine)
>>> exporter.to_openai_finetuning_jsonl(
...     docs,
...     output_path=Path("train.jsonl"),
...     system_prompt="Answer medical questions accurately.",
...     response_fn=lambda doc: doc.metadata.get("answer", ""),
... )

HuggingFace SFT dataset:

>>> ds = exporter.to_huggingface_training_dataset(
...     docs,
...     tokenizer_name="gpt2",
...     task="clm",
... )

Pure embedding matrix for vector DB or contrastive training:

>>> matrix, meta_df = exporter.to_embedding_matrix(docs)
>>> matrix.shape  # (N, D)
(1024, 512)
default_system_prompt: str = 'You are a helpful assistant.'#
engine: Any | None = None#
log_to_mlflow(documents, *, run_name=None, artifact_dir='corpus_embeddings', log_params=True)[source]#

Log embedding matrix and metadata as MLflow artifacts.

Parameters:
documents : list[CorpusDocument]

Documents with embeddings.

run_name : str or None, optional

MLflow run name. Uses the active run when None.

artifact_dir : str, optional

Directory inside the MLflow artifact store. Default: "corpus_embeddings".

log_params : bool, optional

Log engine config as MLflow params. Default: True.

Raises:
ImportError

If mlflow is not installed.

Return type:

None
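The artifact layout this method produces can be sketched without MLflow itself. The file names (`embeddings.npy`, `metadata.csv`) and directory layout below are illustrative assumptions, not the exporter's guaranteed output; with mlflow installed, the final step would hand the directory to `mlflow.log_artifacts`:

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

# Stand-in data: a small (N, D) embedding matrix plus per-document ids.
embeddings = np.random.rand(3, 4).astype(np.float32)
doc_ids = ["doc-0", "doc-1", "doc-2"]

# Write the two artifacts into a local staging directory.
artifact_dir = Path(tempfile.mkdtemp()) / "corpus_embeddings"
artifact_dir.mkdir(parents=True)

np.save(artifact_dir / "embeddings.npy", embeddings)
with open(artifact_dir / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doc_id"])
    writer.writerows([[d] for d in doc_ids])

# With mlflow installed, the exporter would then run something like:
#   with mlflow.start_run(run_name=run_name):
#       mlflow.log_artifacts(str(artifact_dir), artifact_path="corpus_embeddings")
print(sorted(p.name for p in artifact_dir.iterdir()))
# ['embeddings.npy', 'metadata.csv']
```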

to_embedding_matrix(documents, *, include_metadata=True, output_path=None)[source]#

Export embeddings as a (N, D) NumPy matrix with metadata.

Parameters:
documents : list[CorpusDocument]

Documents to export. Those without embeddings are embedded via the engine when one is set; otherwise a ValueError is raised.

include_metadata : bool, optional

Build a metadata DataFrame (or dict of lists). Default: True.

output_path : pathlib.Path or str or None, optional

When set, saves {output_path}.npy (matrix) and {output_path}.csv (metadata). Default: None.

Returns:
matrix : ndarray of shape (N, D), float32
metadata : pandas.DataFrame or dict[str, list]

Metadata table with doc_id, source_file, source_type, modality, content_hash, chunk_index columns. Returns plain dict when pandas is not installed.

Raises:
ValueError

If any document lacks an embedding and engine=None.

Return type:

tuple[ndarray[tuple[Any, …], dtype[_ScalarT]], Any]
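The matrix/metadata pair returned above can be sketched with plain NumPy. The `Doc` class here is a hypothetical stand-in for `CorpusDocument`, and the dict-of-lists metadata mirrors the documented fallback used when pandas is not installed:

```python
import numpy as np

# Hypothetical minimal documents standing in for CorpusDocument:
# each carries a doc_id and a precomputed embedding.
class Doc:
    def __init__(self, doc_id, embedding):
        self.doc_id = doc_id
        self.embedding = embedding

docs = [Doc(f"doc-{i}", [float(i)] * 4) for i in range(3)]

# Stack per-document embeddings into the (N, D) float32 matrix the
# method returns, plus a plain dict-of-lists metadata table.
matrix = np.asarray([d.embedding for d in docs], dtype=np.float32)
metadata = {"doc_id": [d.doc_id for d in docs]}

print(matrix.shape)   # (3, 4)
print(matrix.dtype)   # float32
```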

to_huggingface_training_dataset(documents, *, tokenizer_name='gpt2', max_length=512, task='clm', text_field='text', label_field=None, include_embeddings=False, stride=0)[source]#

Build a HuggingFace datasets.Dataset for LLM training.

Parameters:
documents : list[CorpusDocument]

Documents to tokenize.

tokenizer_name : str, optional

HuggingFace tokenizer name or local path. Default: "gpt2".

max_length : int, optional

Maximum token sequence length. Sequences are truncated (and optionally strided). Default: 512.

task : {"clm", "mlm", "sft"}, optional

Training objective.

"clm": causal language model (labels = input_ids).
"mlm": masked language model (15% random masking).
"sft": supervised fine-tuning (requires label_field to be set).

Default: "clm".

text_field : str, optional

Document attribute to tokenize. Default: "text".

label_field : str or None, optional

Attribute to use as classification label (for "sft"). Default: None.

include_embeddings : bool, optional

Add "embedding" column. Default: False.

stride : int, optional

Overlap between windows when splitting long texts. Default: 0 (no stride).

Returns:
datasets.Dataset

Tokenized training dataset. Falls back to a plain dict of lists when datasets is not installed.

Raises:
ImportError

If transformers is not installed.

Parameters:
  • documents (list[Any])

  • tokenizer_name (str)

  • max_length (int)

  • task (str)

  • text_field (str)

  • label_field (str | None)

  • include_embeddings (bool)

  • stride (int)

Return type:

Any
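The interaction of max_length and stride can be sketched with a self-contained windowing helper: long token sequences are split into windows of at most max_length tokens, with stride tokens of overlap between consecutive windows. This mirrors the documented behavior only; the actual implementation relies on the HuggingFace tokenizer's own overflow handling, and `window_tokens` is a hypothetical name:

```python
def window_tokens(token_ids, max_length, stride=0):
    """Split token_ids into windows of max_length with `stride` overlap."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride if stride else max_length
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end
    return windows

tokens = list(range(10))
print(window_tokens(tokens, max_length=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```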

to_openai_finetuning_jsonl(documents, output_path, *, system_prompt=None, response_fn=None, user_field='text', include_embeddings=False, skip_empty=True)[source]#

Export documents as OpenAI chat fine-tuning JSONL.

Each line is a valid fine-tuning example:

{
  "messages": [
    {"role": "system", "content": "<system_prompt>"},
    {"role": "user",   "content": "<doc.text>"},
    {"role": "assistant", "content": "<response_fn(doc)>"}
  ],
  "embedding": [...]    // optional, only when include_embeddings=True
}
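One line of that JSONL can be assembled by hand with the standard library. The document here is a plain-dict stand-in for CorpusDocument, and the response_fn mirrors the documented default of reading the answer from metadata:

```python
import json

# Stand-in document: text plus an answer stored in metadata.
doc = {"text": "What is hypertension?",
       "metadata": {"answer": "Persistently elevated blood pressure."}}

system_prompt = "Answer medical questions accurately."
response_fn = lambda d: d["metadata"].get("answer", "")

# Build one chat fine-tuning record in the format shown above.
record = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": doc["text"]},
        {"role": "assistant", "content": response_fn(doc)},
    ]
}

line = json.dumps(record)  # one line of the output .jsonl
print(line)
```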
Parameters:
documents : list[CorpusDocument]

Documents to export.

output_path : pathlib.Path or str

Destination .jsonl file.

system_prompt : str or None, optional

System message. Defaults to self.default_system_prompt.

response_fn : callable or None, optional

A callable fn(doc) -> str producing the assistant response for each document. When None, uses doc.metadata.get("answer", ""), which suits corpora whose answers are stored in metadata.

user_field : str, optional

Document attribute to use as the user message. Default: "text".

include_embeddings : bool, optional

Append "embedding" key to each record. Requires that embeddings are present (call _ensure_embedded first or set an engine). Default: False.

skip_empty : bool, optional

Skip documents with empty user content. Default: True.

Returns:
pathlib.Path

Path to the written .jsonl file.

Return type:

Path

Notes

OpenAI fine-tuning requires at minimum a "user" message and an "assistant" message in each example. Provide response_fn to generate meaningful assistant turns; otherwise documents whose metadata lacks an "answer" key are exported with an empty assistant message. Such partial examples only become usable for supervised fine-tuning once you fill in the assistant responses separately.