HTMLStripNormalizer

class scikitplot.corpus.HTMLStripNormalizer(use_beautifulsoup=False, parser='html.parser', decode_entities=True)

Remove HTML and XML tags from the document text.

Two modes are available:

  • use_beautifulsoup=False (default): regex-based stripping. Zero additional dependencies; handles well-formed HTML.

  • use_beautifulsoup=True: uses bs4.BeautifulSoup for robust parsing of malformed or deeply nested HTML. Requires pip install beautifulsoup4.
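The difference between the two modes can be sketched with standalone helpers. This is a minimal illustration, not the scikitplot implementation: the function names strip_tags_regex and strip_tags_bs4 and the tag pattern are assumptions made for this example.

```python
import re

# Naive tag pattern (an assumption for this sketch): "<", then any run of
# characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(text: str) -> str:
    """Remove everything that looks like a tag. No parsing is involved."""
    return TAG_RE.sub("", text)


def strip_tags_bs4(text: str, parser: str = "html.parser") -> str:
    """Parse the markup with BeautifulSoup and keep only the text nodes."""
    from bs4 import BeautifulSoup  # needs: pip install beautifulsoup4

    return BeautifulSoup(text, parser).get_text()


# Well-formed input: the regex approach is sufficient.
print(strip_tags_regex("<p>Hello <b>world</b>.</p>"))  # Hello world.

# A ">" inside an attribute value ends the naive match too early,
# so part of the tag leaks into the output.
print(strip_tags_regex("<p title='a>b'>text</p>"))  # b'>text
```

The second call shows the kind of input where a regex cannot tell markup from text, which is the case the BeautifulSoup mode is meant for.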

Parameters:
use_beautifulsoup : bool, optional

Use BeautifulSoup for parsing. Default: False.

parser : str, optional

BeautifulSoup parser ("html.parser", "lxml", or "html5lib"). Ignored when use_beautifulsoup=False. Default: "html.parser" (stdlib, no extra deps).

decode_entities : bool, optional

Decode HTML entities (e.g. &amp; to &). Default: True.
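Entity decoding interacts with tag stripping, because decoded text can itself look like markup. The sketch below uses only the standard library (html.unescape and a naive tag regex chosen for this example) to show why the order of the two steps matters. The order this class uses internally is not stated here, so treat this as an illustration of the general issue only.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive tag pattern, for illustration only

raw = "<p>Use &lt;b&gt; for bold &amp; &lt;i&gt; for italics.</p>"

# Strip first, then decode: escaped markup survives as literal text.
strip_then_decode = html.unescape(TAG_RE.sub("", raw))

# Decode first, then strip: the escaped markup becomes real tags and is lost.
decode_then_strip = TAG_RE.sub("", html.unescape(raw))

print(strip_then_decode)  # Use <b> for bold & <i> for italics.
print(decode_then_strip)  # Use  for bold &  for italics.
```

Passing decode_entities=False would presumably leave sequences such as &amp; untouched in the output, which can be useful when the text is re-embedded in HTML later.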

Examples

>>> norm = HTMLStripNormalizer()
>>> doc = CorpusDocument.create("f.txt", 0, "<p>Hello <b>world</b>.</p>")
>>> norm.normalize_doc(doc).normalized_text
'Hello world.'

normalize_doc(doc)

Strip HTML tags from the document text.

Parameters:
doc : CorpusDocument
Returns:
CorpusDocument
Raises:
ImportError

If use_beautifulsoup=True and beautifulsoup4 is not installed.
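An ImportError of this kind usually comes from a lazy import of the optional dependency. The following is a generic sketch of that pattern, not the library's code: require_optional is a hypothetical helper, and the error message is made up for this example.

```python
import importlib


def require_optional(module_name: str, pip_name: str):
    """Import an optional module, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this option; "
            f"install it with: pip install {pip_name}"
        ) from exc


# Caller-side handling: fall back to the dependency-free mode when
# beautifulsoup4 is not available.
try:
    require_optional("bs4", "beautifulsoup4")
    use_beautifulsoup = True
except ImportError:
    use_beautifulsoup = False

print(use_beautifulsoup)
```

A caller could pass the resulting flag as the use_beautifulsoup argument so the same pipeline runs with or without the extra package installed.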
