LLMTrainingExporter#
- class scikitplot.corpus.LLMTrainingExporter(engine=None, default_system_prompt='You are a helpful assistant.')[source]#
Export a corpus with embeddings to LLM training formats.
Orchestrates the full journey from list[CorpusDocument] → training-ready files / datasets.

- Parameters:
- engine : MultimodalEmbeddingEngine or EmbeddingEngine or None
  Embedding engine. When None, existing doc.embedding values are used as-is; raises ValueError if a document lacks an embedding where one is required.
- default_system_prompt : str, optional
  System prompt prepended to all OpenAI fine-tuning conversations. Default: "You are a helpful assistant.".
Examples
OpenAI fine-tuning (chat format):
>>> exporter = LLMTrainingExporter(engine)
>>> exporter.to_openai_finetuning_jsonl(
...     docs,
...     output_path=Path("train.jsonl"),
...     system_prompt="Answer medical questions accurately.",
...     response_fn=lambda doc: doc.metadata.get("answer", ""),
... )
HuggingFace SFT dataset:
>>> ds = exporter.to_huggingface_training_dataset(
...     docs,
...     tokenizer_name="gpt2",
...     task="clm",
... )
Pure embedding matrix for vector DB or contrastive training:
>>> matrix, meta_df = exporter.to_embedding_matrix(docs)
>>> matrix.shape  # (N, D)
(1024, 512)
- log_to_mlflow(documents, *, run_name=None, artifact_dir='corpus_embeddings', log_params=True)[source]#
Log embedding matrix and metadata as MLflow artifacts.
- Parameters:
- documents : list[CorpusDocument]
  Documents with embeddings.
- run_name : str or None, optional
  MLflow run name. Uses the active run when None.
- artifact_dir : str, optional
  Directory inside the MLflow artifact store. Default: "corpus_embeddings".
- log_params : bool, optional
  Log engine config as MLflow params. Default: True.
- Raises:
- ImportError
  If mlflow is not installed.
- Return type:
None
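What gets logged can be sketched without MLflow itself: the method is described as logging an embedding matrix and metadata as artifacts, which amounts to staging a matrix file plus a metadata table that mlflow.log_artifacts would then upload. A minimal standalone sketch of that staging step, using hypothetical dicts in place of CorpusDocument:

```python
# Stage an embeddings matrix and a metadata table in a local directory;
# mlflow.log_artifacts(staging, artifact_path="corpus_embeddings") would
# then upload it. Doc dicts here are hypothetical stand-ins.
import csv
import tempfile
from pathlib import Path

import numpy as np

docs = [
    {"doc_id": "d0", "embedding": [0.1, 0.2, 0.3]},
    {"doc_id": "d1", "embedding": [0.4, 0.5, 0.6]},
]

staging = Path(tempfile.mkdtemp())

# (N, D) float32 matrix, one row per document
matrix = np.asarray([d["embedding"] for d in docs], dtype=np.float32)
np.save(staging / "embeddings.npy", matrix)

# Companion metadata table, one row per document
with open(staging / "metadata.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["doc_id"])
    writer.writeheader()
    for d in docs:
        writer.writerow({"doc_id": d["doc_id"]})

print(matrix.shape)  # (2, 3)
```

The file names and directory layout here are illustrative assumptions, not the library's actual artifact names.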
- to_embedding_matrix(documents, *, include_metadata=True, output_path=None)[source]#
Export embeddings as an (N, D) NumPy matrix with metadata.

- Parameters:
- documents : list[CorpusDocument]
  Documents. Those without embeddings are embedded via the engine when one is set; otherwise a ValueError is raised.
- include_metadata : bool, optional
  Build a metadata DataFrame (or dict of lists). Default: True.
- output_path : pathlib.Path or str or None, optional
  When set, saves {output_path}.npy (matrix) and {output_path}.csv (metadata). Default: None.
- Returns:
- matrix : ndarray of shape (N, D), float32
- metadata : pandas.DataFrame or dict[str, list]
  Metadata table with doc_id, source_file, source_type, modality, content_hash, chunk_index columns. Returns a plain dict when pandas is not installed.
- Raises:
- ValueError
  If any document lacks an embedding and engine=None.
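The core contract of this method can be sketched in a few lines, assuming hypothetical doc dicts in place of CorpusDocument and a made-up engine.embed method:

```python
# Sketch: stack per-document embeddings into an (N, D) float32 matrix,
# raising ValueError when a document lacks an embedding and no engine
# is configured. Not the library's implementation, just the contract.
import numpy as np

def to_matrix(documents, engine=None):
    rows = []
    for doc in documents:
        emb = doc.get("embedding")
        if emb is None:
            if engine is None:
                raise ValueError(f"document {doc['doc_id']!r} has no embedding")
            emb = engine.embed(doc["text"])  # hypothetical engine API
        rows.append(emb)
    return np.asarray(rows, dtype=np.float32)

matrix = to_matrix([
    {"doc_id": "a", "embedding": [1.0, 0.0]},
    {"doc_id": "b", "embedding": [0.0, 1.0]},
])
print(matrix.shape)  # (2, 2)
```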
- to_huggingface_training_dataset(documents, *, tokenizer_name='gpt2', max_length=512, task='clm', text_field='text', label_field=None, include_embeddings=False, stride=0)[source]#
Build a HuggingFace datasets.Dataset for LLM training.

- Parameters:
- documents : list[CorpusDocument]
  Documents to tokenize.
- tokenizer_name : str, optional
  HuggingFace tokenizer name or local path. Default: "gpt2".
- max_length : int, optional
  Maximum token sequence length. Sequences are truncated (and optionally strided). Default: 512.
- task : {"clm", "mlm", "sft"}, optional
  Training objective.
  - "clm": causal language model, labels = input_ids.
  - "mlm": masked language model, 15% random masking.
  - "sft": supervised fine-tuning, requires label_field to be set.
  Default: "clm".
- text_field : str, optional
  Document attribute to tokenize. Default: "text".
- label_field : str or None, optional
  Attribute to use as classification label (for "sft"). Default: None.
- include_embeddings : bool, optional
  Add an "embedding" column. Default: False.
- stride : int, optional
  Overlap between windows when splitting long texts. Default: 0 (no stride).
- Returns:
- datasets.Dataset
  Tokenized training dataset. Falls back to a plain dict of lists when datasets is not installed.
- Raises:
- ImportError
  If transformers is not installed.
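How stride plausibly interacts with max_length when a long text is split can be sketched as follows; this is an illustration of overlapping windowing for task="clm" (labels = input_ids), not the library's actual tokenization code:

```python
# Split a token-id sequence into windows of at most max_length tokens,
# with `stride` tokens of overlap between consecutive windows.
def split_with_stride(input_ids, max_length, stride=0):
    step = max_length - stride  # window start advances by this many tokens
    windows = []
    start = 0
    while start < len(input_ids):
        chunk = input_ids[start:start + max_length]
        # For causal LM training, labels are a copy of the inputs.
        windows.append({"input_ids": chunk, "labels": list(chunk)})
        if start + max_length >= len(input_ids):
            break
        start += step
    return windows

windows = split_with_stride(list(range(10)), max_length=4, stride=2)
print([w["input_ids"] for w in windows])
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With stride=0 the windows are disjoint; a positive stride repeats the last `stride` tokens of each window at the start of the next, which preserves context across window boundaries.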
- to_openai_finetuning_jsonl(documents, output_path, *, system_prompt=None, response_fn=None, user_field='text', include_embeddings=False, skip_empty=True)[source]#
Export documents as OpenAI chat fine-tuning JSONL.
Each line is a valid fine-tuning example:
{
  "messages": [
    {"role": "system", "content": "<system_prompt>"},
    {"role": "user", "content": "<doc.text>"},
    {"role": "assistant", "content": "<response_fn(doc)>"}
  ],
  "embedding": [...]  // optional, only when include_embeddings=True
}
- Parameters:
- documents : list[CorpusDocument]
  Documents to export.
- output_path : pathlib.Path or str
  Destination .jsonl file.
- system_prompt : str or None, optional
  System message. Defaults to self.default_system_prompt.
- response_fn : callable or None, optional
  fn(doc) → str producing the assistant response for each document. When None, uses doc.metadata.get("answer", ""), suitable when answers are stored in metadata.
- user_field : str, optional
  Document attribute to use as the user message. Default: "text".
- include_embeddings : bool, optional
  Append an "embedding" key to each record. Requires that embeddings are present (call _ensure_embedded first or set an engine). Default: False.
- skip_empty : bool, optional
  Skip documents with empty user content. Default: True.
- Returns:
- pathlib.Path
  Path to the written .jsonl file.
Notes
OpenAI fine-tuning requires at minimum a "user" message and an "assistant" message per example. Provide response_fn to generate meaningful assistant turns; otherwise the export produces single-turn user-only examples (system + user, no assistant reply), which become valid for supervised fine-tuning only once you add assistant responses separately.
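A standalone sketch of the record shape and skip_empty behaviour described above, using plain dicts in place of CorpusDocument and the default "answer" metadata lookup:

```python
# Write one chat-format fine-tuning example per line, skipping documents
# with empty user content. Illustrative only; doc dicts are hypothetical.
import json
import tempfile
from pathlib import Path

def write_finetuning_jsonl(docs, path, system_prompt="You are a helpful assistant."):
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for doc in docs:
            user = doc.get("text", "")
            if not user:
                continue  # skip_empty=True behaviour
            record = {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user},
                    {"role": "assistant",
                     "content": doc.get("metadata", {}).get("answer", "")},
                ]
            }
            fh.write(json.dumps(record) + "\n")
    return path

docs = [
    {"text": "What is BP?", "metadata": {"answer": "Blood pressure."}},
    {"text": ""},  # skipped because the user content is empty
]
out = write_finetuning_jsonl(docs, Path(tempfile.mkdtemp()) / "train.jsonl")
lines = out.read_text().splitlines()
print(len(lines))  # 1
```

Each emitted line parses as a standalone JSON object with a system, user, and assistant message, matching the record shape shown earlier.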