HTMLStripNormalizer
- class scikitplot.corpus.HTMLStripNormalizer(use_beautifulsoup=False, parser='html.parser', decode_entities=True)
Remove HTML and XML tags from the document text.
Two modes are available:
- use_beautifulsoup=False (default): regex-based stripping. Zero additional dependencies; handles well-formed HTML.
- use_beautifulsoup=True: uses bs4.BeautifulSoup for robust parsing of malformed or deeply nested HTML. Requires pip install beautifulsoup4.
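To illustrate why the regex mode is described as suitable for well-formed HTML, here is a minimal standard-library sketch of regex-based tag stripping. The helper name strip_tags_regex and the pattern are assumptions made for illustration; they are not taken from scikitplot and may differ from what the class does internally.

```python
import re

# Hypothetical helper, not part of scikitplot: a naive tag pattern that
# matches "<", then one or more non-">" characters, then ">".
_TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(text: str) -> str:
    """Remove anything that looks like an HTML/XML tag."""
    return _TAG_RE.sub("", text)


# Well-formed markup: every tag is matched and removed cleanly.
well_formed = "<p>Hello <b>world</b>.</p>"
stripped = strip_tags_regex(well_formed)
print(repr(stripped))  # 'Hello world.'

# A ">" inside an attribute value ends the match early, so part of the
# tag leaks into the output. Input like this is where a real parser is
# the safer choice.
tricky = '<a title="a > b">link</a>'
tricky_stripped = strip_tags_regex(tricky)
print(repr(tricky_stripped))  # ' b">link'
```

The second input shows the trade-off: a pattern-based approach needs no extra dependencies but has no notion of attribute quoting or nesting, which a parser such as BeautifulSoup tracks.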
- Parameters:
  - use_beautifulsoup : bool, optional
    Use BeautifulSoup for parsing. Default: False.
  - parser : str, optional
    BeautifulSoup parser ("html.parser", "lxml", "html5lib"). Ignored when use_beautifulsoup=False. Default: "html.parser" (stdlib, no extra deps).
  - decode_entities : bool, optional
    Decode HTML entities (e.g. &amp; → &). Default: True.
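The effect of a decode_entities-style switch can be sketched with the standard library's html.unescape. The function strip_html below is a hypothetical stand-in, not the scikitplot implementation; it strips tags first and decodes entities afterwards, which is an assumption about ordering that the documentation above does not confirm.

```python
import html
import re

# Hypothetical tag pattern, for illustration only.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str, decode_entities: bool = True) -> str:
    """Strip tags, then optionally decode HTML entities."""
    no_tags = _TAG_RE.sub("", text)
    # html.unescape converts named and numeric character references
    # (e.g. "&amp;" or "&#38;") to the characters they stand for.
    return html.unescape(no_tags) if decode_entities else no_tags


raw = "<p>Fish &amp; chips</p>"
decoded = strip_html(raw)
kept = strip_html(raw, decode_entities=False)
print(decoded)  # Fish & chips
print(kept)     # Fish &amp; chips
```

Decoding after stripping keeps an escaped "&lt;" in the text from being mistaken for the start of a tag by the pattern.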
Examples
>>> norm = HTMLStripNormalizer()
>>> doc = CorpusDocument.create("f.txt", 0, "<p>Hello <b>world</b>.</p>")
>>> norm.normalize_doc(doc).normalized_text
'Hello world.'