process_html_directory
- scikitplot._externals._sphinx_ext._sphinx_ai_assistant.process_html_directory(input_dir, *, output_dir=None, selectors=None, theme_preset=None, exclude_patterns=None, strip_tags=None, max_workers=None, recursive=True, generate_llms=False, base_url='', llms_txt_max_entries=None, llms_txt_full_content=False)
Walk any HTML directory tree and convert pages to Markdown.
This function is entirely Sphinx-free and works with any static-site generator: Sphinx, MkDocs, Jekyll, Hugo, Hexo, Docusaurus, VitePress, GitBook, or plain HTML.
- Parameters:
- input_dir : str or pathlib.Path
Root directory containing .html files.
- output_dir : str or pathlib.Path or None, optional
Directory where .md files are written, mirroring the directory structure of input_dir. When None (default), .md files are written alongside each .html file (inline mode).
- selectors : list of str or None, optional
CSS selectors tried in order to locate the main content element. When None, uses the module default combined with theme_preset.
- theme_preset : str or None, optional
Theme name from _THEME_SELECTOR_PRESETS (e.g. "mkdocs_material", "jekyll", "plain_html"). Merged with selectors.
- exclude_patterns : list of str or None, optional
Path substrings to skip. Defaults to ["genindex", "search", "py-modindex", "_sources", "_static"].
- strip_tags : list of str or None, optional
HTML tag names removed (with content) before conversion. Defaults to ["script", "style", "nav", "footer", "header"].
- max_workers : int or None, optional
Maximum parallel worker processes. None → auto-detect (CPU count or 1).
- recursive : bool, optional
When True (default), recurse into subdirectories. When False, only the top-level .html files are processed.
- generate_llms : bool, optional
When True, write an llms.txt index file after conversion.
- base_url : str, optional
Base URL prepended to .md paths in llms.txt.
- llms_txt_max_entries : int or None, optional
Cap on the number of entries in llms.txt.
- llms_txt_full_content : bool, optional
When True, embed full Markdown content inline in llms.txt.
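The interaction of output_dir, exclude_patterns, and recursive can be easier to see in code. The following is a minimal sketch, not the library's implementation: plan_outputs is a hypothetical helper, and it assumes that exclusion is a plain substring test against the path relative to input_dir. It only computes which .md path each .html file would map to and converts nothing.

```python
import tempfile
from pathlib import Path

# Default list copied from the exclude_patterns description.
DEFAULT_EXCLUDES = ["genindex", "search", "py-modindex", "_sources", "_static"]


def plan_outputs(input_dir, output_dir=None, exclude_patterns=None, recursive=True):
    """Hypothetical helper: map each .html file to the .md path it would get."""
    root = Path(input_dir)
    excludes = DEFAULT_EXCLUDES if exclude_patterns is None else exclude_patterns
    pages = root.rglob("*.html") if recursive else root.glob("*.html")
    plan = {}
    for page in sorted(pages):
        rel = page.relative_to(root)
        # Assumption: a pattern matches if it is a substring of the relative path.
        if any(pattern in rel.as_posix() for pattern in excludes):
            continue
        if output_dir is None:
            target = page.with_suffix(".md")  # inline mode: next to the .html file
        else:
            target = Path(output_dir) / rel.with_suffix(".md")  # mirrored tree
        plan[rel.as_posix()] = target
    return plan


# Build a tiny throwaway site to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    (site / "api").mkdir(parents=True)
    for name in ("index.html", "search.html", "api/module.html"):
        (site / name).write_text("<html></html>")
    mirrored = plan_outputs(site, output_dir=Path(tmp) / "md")
    top_only = plan_outputs(site, recursive=False)
```

Under the substring assumption, a pattern such as "search" would also skip a page named research.html, so narrower patterns may be preferable when page names overlap.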
- Returns:
- dict
{"generated": int, "skipped": int, "errors": int}: counts of files converted, skipped, and errored.
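One way to use the returned counts is as a build gate. The sketch below is illustrative only: check_conversion and max_error_rate are hypothetical names, the stats dictionary is a hand-written stand-in for a real return value, and it assumes that skipped files should not count as conversion attempts.

```python
def check_conversion(stats, max_error_rate=0.0):
    """Hypothetical helper: raise if too many pages failed to convert."""
    # Assumption: skipped pages are not counted as attempts.
    attempted = stats["generated"] + stats["errors"]
    error_rate = stats["errors"] / attempted if attempted else 0.0
    if error_rate > max_error_rate:
        raise RuntimeError(
            f"{stats['errors']} of {attempted} pages failed to convert"
        )
    return error_rate


# Stand-in for the dictionary returned by process_html_directory.
stats = {"generated": 42, "skipped": 3, "errors": 0}
rate = check_conversion(stats)
```

With the default threshold, any nonzero errors count raises; a looser max_error_rate tolerates a fraction of failed pages.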
- Raises:
- ValueError
If input_dir does not exist or is not a directory.
- ImportError
If beautifulsoup4 or markdownify is not installed.
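To surface a missing dependency before starting a long conversion, a caller could run a pre-flight check. This is a sketch under stated assumptions, not part of the documented API: missing_dependencies and REQUIRED are hypothetical names. It relies only on the standard-library importlib.util.find_spec and on the fact that the beautifulsoup4 distribution is imported as bs4.

```python
import importlib.util

# Distribution name -> import name. beautifulsoup4 is imported as "bs4".
REQUIRED = {"beautifulsoup4": "bs4", "markdownify": "markdownify"}


def missing_dependencies(required=REQUIRED):
    """Return the distribution names whose modules cannot be found."""
    return [
        dist
        for dist, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_dependencies()
if missing:
    print("Install before converting: pip install " + " ".join(missing))
```

The check only looks the modules up and does not import them, so it stays cheap even when both packages are present.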
Examples
>>> from scikitplot._externals._sphinx_ext._sphinx_ai_assistant import process_html_directory
>>> stats = process_html_directory(
...     "/site/_build",
...     theme_preset="mkdocs_material",
...     generate_llms=True,
...     base_url="https://example.com",
... )
>>> print(stats)
{'generated': 42, 'skipped': 3, 'errors': 0}