process_html_directory

scikitplot._externals._sphinx_ext._sphinx_ai_assistant.process_html_directory(input_dir, *, output_dir=None, selectors=None, theme_preset=None, exclude_patterns=None, strip_tags=None, max_workers=None, recursive=True, generate_llms=False, base_url='', llms_txt_max_entries=None, llms_txt_full_content=False)

Walk any HTML directory tree and convert pages to Markdown.

This function does not depend on Sphinx and works with the HTML output of any static-site generator (Sphinx, MkDocs, Jekyll, Hugo, Hexo, Docusaurus, VitePress, GitBook) as well as with plain HTML.

Parameters:

input_dir (str or pathlib.Path): Root directory containing .html files.
output_dir (str or pathlib.Path or None, optional): Directory where .md files are written, mirroring the directory structure of input_dir. When None (default), .md files are written alongside each .html file (inline mode).
selectors (list of str or None, optional): CSS selectors tried in order to locate the main content element. When None, uses the module default combined with theme_preset.
theme_preset (str or None, optional): Theme name from _THEME_SELECTOR_PRESETS (e.g. "mkdocs_material", "jekyll", "plain_html"). Merged with selectors.
exclude_patterns (list of str or None, optional): Path substrings to skip. Defaults to ["genindex", "search", "py-modindex", "_sources", "_static"].
strip_tags (list of str or None, optional): HTML tag names removed (with content) before conversion. Defaults to ["script", "style", "nav", "footer", "header"].
max_workers (int or None, optional): Maximum parallel worker processes. None → auto-detect (CPU count or 1).
recursive (bool, optional): When True (default), recurse into subdirectories. When False, only the top-level .html files are processed.
generate_llms (bool, optional): When True, write an llms.txt index file after conversion.
base_url (str, optional): Base URL prepended to .md paths in llms.txt.
llms_txt_max_entries (int or None, optional): Cap on the number of entries in llms.txt.
llms_txt_full_content (bool, optional): When True, embed full Markdown content inline in llms.txt.
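
The interaction between output_dir and exclude_patterns can be easier to follow with a small stand-alone sketch. The helper below, planned_markdown_path, is hypothetical and is not part of the module; it only mimics the documented behaviour (mirrored output tree, inline mode when output_dir is None, substring-based exclusion). It assumes the patterns are matched against the path relative to input_dir, which the documentation above does not state.

```python
from pathlib import Path

# Defaults copied from the exclude_patterns description above.
DEFAULT_EXCLUDES = ["genindex", "search", "py-modindex", "_sources", "_static"]


def planned_markdown_path(html_path, input_dir, output_dir=None, exclude_patterns=None):
    """Return where the .md file would be written, or None if the page is skipped.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    html_path, input_dir = Path(html_path), Path(input_dir)
    patterns = DEFAULT_EXCLUDES if exclude_patterns is None else exclude_patterns
    relative = html_path.relative_to(input_dir)

    # Assumption: a pattern excludes a page when it occurs anywhere in the
    # relative path (plain substring test, not a glob or regex).
    if any(pattern in relative.as_posix() for pattern in patterns):
        return None

    if output_dir is None:
        # Inline mode: the .md file sits next to the .html file.
        return html_path.with_suffix(".md")

    # Mirror mode: same relative layout under output_dir.
    return Path(output_dir) / relative.with_suffix(".md")
```

Under this sketch, plain substring matching means that a page such as research.html would also be skipped by the default "search" pattern, so it may be worth passing a narrower exclude_patterns list when page names overlap with the defaults.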

Returns:

dict: {"generated": int, "skipped": int, "errors": int}, giving the number of files converted, skipped, and failed, respectively.
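
In a build script, the returned counts can be turned into a pass/fail signal. The following is a minimal sketch; conversion_exit_code and its allow_errors parameter are hypothetical names introduced here, and only the three dictionary keys come from the description above.

```python
def conversion_exit_code(stats, *, allow_errors=0):
    """Map the returned stats dict to a shell-style exit code (0 means success).

    Hypothetical helper; only the keys "generated", "skipped" and "errors"
    are taken from the documented return value.
    """
    missing = {"generated", "skipped", "errors"} - stats.keys()
    if missing:
        raise KeyError(f"stats is missing keys: {sorted(missing)}")
    # Fail the build when more pages errored than the caller tolerates.
    return 0 if stats["errors"] <= allow_errors else 1
```

A CI step could then call sys.exit(conversion_exit_code(stats)) after the conversion so that failed pages stop the pipeline.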

Raises:

ValueError: If input_dir does not exist or is not a directory.
ImportError: If beautifulsoup4 or markdownify is not installed.
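
Both error conditions can be checked before starting a long conversion. The sketch below uses only the standard library; preflight is a hypothetical helper, not something the module provides. It relies on the fact that beautifulsoup4 is imported under the name bs4, while markdownify keeps its package name.

```python
from importlib.util import find_spec
from pathlib import Path


def preflight(input_dir):
    """Return a list of human-readable problems; an empty list means ready to run.

    Hypothetical helper mirroring the documented ValueError and ImportError cases.
    """
    problems = []
    if not Path(input_dir).is_dir():
        problems.append(f"{input_dir} does not exist or is not a directory")
    # (import name, pip package name) pairs for the two optional dependencies.
    for module, package in (("bs4", "beautifulsoup4"), ("markdownify", "markdownify")):
        if find_spec(module) is None:
            problems.append(f"missing dependency: pip install {package}")
    return problems
```

Reporting every problem at once, rather than failing on the first, is a design choice of this sketch; the function itself raises on the first condition it encounters.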

Return type:

dict[str, int]

Examples

>>> from scikitplot._externals._sphinx_ext._sphinx_ai_assistant import (
...     process_html_directory,
... )
>>> stats = process_html_directory(
...     "/site/_build",
...     theme_preset="mkdocs_material",
...     generate_llms=True,
...     base_url="https://example.com",
... )
>>> print(stats)
{'generated': 42, 'skipped': 3, 'errors': 0}