process_html_directory

scikitplot._externals._sphinx_ext._sphinx_ai_assistant.process_html_directory(input_dir, *, output_dir=None, selectors=None, theme_preset=None, exclude_patterns=None, strip_tags=None, max_workers=None, recursive=True, generate_llms=False, base_url='', llms_txt_max_entries=None, llms_txt_full_content=False)

Walk any HTML directory tree and convert pages to Markdown.

This function does not depend on Sphinx and works on the HTML output of any static-site generator (Sphinx, MkDocs, Jekyll, Hugo, Hexo, Docusaurus, VitePress, GitBook) as well as on plain HTML.

Parameters:
input_dir : str or pathlib.Path

Root directory containing .html files.

output_dir : str or pathlib.Path or None, optional

Directory where .md files are written, mirroring the directory structure of input_dir. When None (default), .md files are written alongside each .html file (inline mode).

selectors : list of str or None, optional

CSS selectors tried in order to locate the main content element. When None, uses the module default combined with theme_preset.

theme_preset : str or None, optional

Theme name from _THEME_SELECTOR_PRESETS (e.g. "mkdocs_material", "jekyll", "plain_html"). Merged with selectors.

exclude_patterns : list of str or None, optional

Path substrings to skip. Defaults to ["genindex", "search", "py-modindex", "_sources", "_static"].

strip_tags : list of str or None, optional

HTML tag names removed (with content) before conversion. Defaults to ["script", "style", "nav", "footer", "header"].

max_workers : int or None, optional

Maximum parallel worker processes. None → auto-detect (CPU count or 1).

recursive : bool, optional

When True (default), recurse into subdirectories. When False, only the top-level .html files are processed.

generate_llms : bool, optional

When True, write an llms.txt index file after conversion.

base_url : str, optional

Base URL prepended to .md paths in llms.txt.

llms_txt_max_entries : int or None, optional

Cap on the number of entries in llms.txt.

llms_txt_full_content : bool, optional

When True, embed full Markdown content inline in llms.txt.
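
The interaction between input_dir, output_dir, exclude_patterns, and recursive can be pictured with a small standalone sketch. It is not the library's implementation: plan_conversions is a hypothetical helper, and the substring match against the relative path is an assumption based on the wording "Path substrings to skip".

```python
from pathlib import Path

# Default list copied from the exclude_patterns description.
DEFAULT_EXCLUDES = ["genindex", "search", "py-modindex", "_sources", "_static"]


def plan_conversions(input_dir, output_dir=None, exclude_patterns=None, recursive=True):
    """Return (html_path, md_path) pairs. Hypothetical helper, for illustration only."""
    root = Path(input_dir)
    excludes = DEFAULT_EXCLUDES if exclude_patterns is None else exclude_patterns
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*.html") if recursive else root.glob("*.html")
    # Inline mode writes next to the source; otherwise the tree is mirrored.
    target_root = root if output_dir is None else Path(output_dir)
    pairs = []
    for html_path in sorted(candidates):
        rel = html_path.relative_to(root)
        # Assumption: a pattern matches when it is a substring of the relative path.
        if any(pattern in rel.as_posix() for pattern in excludes):
            continue
        pairs.append((html_path, (target_root / rel).with_suffix(".md")))
    return pairs
```

Under a plain substring rule like the one assumed here, a pattern such as "search" would also skip a page named research.html, so it may be worth checking the stats for unexpected skips.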

Returns:
dict

{"generated": int, "skipped": int, "errors": int}: the number of Markdown files generated, files skipped, and files that failed with an error.
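
A caller will typically want to act on these counts, for example to fail a build step when any page could not be converted. The helper below is a minimal sketch of that idea; summarize is a hypothetical name and relies only on the three keys listed above.

```python
def summarize(stats):
    """Format the returned counts as one status line. Hypothetical helper."""
    # Any error marks the run as failed; skipped files alone do not.
    status = "FAILED" if stats["errors"] else "OK"
    return (
        f"{status}: {stats['generated']} generated, "
        f"{stats['skipped']} skipped, {stats['errors']} errors"
    )
```

For the counts shown in the Examples section, summarize(stats) would return "OK: 42 generated, 3 skipped, 0 errors".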

Raises:
ValueError

If input_dir does not exist or is not a directory.

ImportError

If beautifulsoup4 or markdownify is not installed.
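
Both failure modes can be checked before the call, which may give clearer messages in a build script. The sketch below assumes nothing about the function's internals; preflight is a hypothetical helper, and the only facts it relies on are that beautifulsoup4 is imported as bs4 and markdownify under its own name.

```python
from importlib.util import find_spec
from pathlib import Path

# Maps the pip package name to the module name it is imported as.
REQUIRED_MODULES = {"beautifulsoup4": "bs4", "markdownify": "markdownify"}


def preflight(input_dir):
    """Return a list of problems; an empty list means ready. Hypothetical helper."""
    problems = []
    if not Path(input_dir).is_dir():
        problems.append(f"{input_dir} does not exist or is not a directory")
    for package, module in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module cannot be located.
        if find_spec(module) is None:
            problems.append(f"missing dependency: pip install {package}")
    return problems
```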

Return type:

dict[str, int]

Examples

>>> from scikitplot._externals._sphinx_ext._sphinx_ai_assistant import process_html_directory
>>> stats = process_html_directory(
...     "/site/_build",
...     theme_preset="mkdocs_material",
...     generate_llms=True,
...     base_url="https://example.com",
... )
>>> print(stats)
{'generated': 42, 'skipped': 3, 'errors': 0}
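
The call above enables generate_llms with a base_url. How the index entries are actually laid out is not documented here, so the following is only a sketch under stated assumptions: entries are Markdown links as in the llms.txt proposal, base_url is joined to each relative .md path with a single slash, and llms_txt_max_entries truncates the list. llms_entries is a hypothetical helper.

```python
from pathlib import PurePosixPath


def llms_entries(md_paths, base_url="", max_entries=None):
    """Build Markdown link lines for an llms.txt-style index. Hypothetical helper."""
    # Assumption: the cap is a simple truncation of the entry list.
    selected = md_paths if max_entries is None else md_paths[:max_entries]
    lines = []
    for rel in selected:
        rel_path = PurePosixPath(rel)
        # Assumption: base_url and the relative path are joined with one slash.
        if base_url:
            url = f"{base_url.rstrip('/')}/{rel_path.as_posix()}"
        else:
            url = rel_path.as_posix()
        # The file stem stands in for a page title in this sketch.
        lines.append(f"- [{rel_path.stem}]({url})")
    return lines
```

With base_url="https://example.com", the path api/foo.md would become the entry "- [foo](https://example.com/api/foo.md)".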