- exceptions.Exception(exceptions.BaseException)
  - HTMLParseError
- markupbase.ParserBase
  - HTMLParser
class HTMLParser(markupbase.ParserBase)
Find tags and other markup and call handler functions.

Usage:
    p = HTMLParser()
    p.feed(data)
    ...
    p.close()

Start tags are handled by calling self.handle_starttag() or
self.handle_startendtag(); end tags by self.handle_endtag(). The
data between tags is passed from the parser to the derived class
by calling self.handle_data() with the data as argument (the data
may be split up in arbitrary chunks). Entity references are
passed by calling self.handle_entityref() with the entity
reference as the argument. Numeric character references are
passed to self.handle_charref() with the string containing the
reference as the argument.
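The usage pattern above can be sketched with a small subclass. This is a minimal illustration, not part of the module: the class name `EventRecorder` and its `events` list are hypothetical, and the import falls back to `html.parser` on the assumption that the code may be run under Python 3, where this module was renamed.

```python
try:
    from html.parser import HTMLParser  # Python 3 name (assumption: may run here)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module documented on this page


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records each handler call as a tuple."""

    def __init__(self):
        # Explicit base call: ParserBase is an old-style class in Python 2,
        # so super() is not available there.
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


p = EventRecorder()
p.feed('<p class="x">Hello</p>')
p.close()
events = p.events
```

Subclassing and overriding only the handlers of interest is the intended extension point; handlers that are not overridden do nothing by default.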
Methods defined here:
- __init__(self)
- Initialize and reset this instance.
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- clear_cdata_mode(self)
- close(self)
- Handle any buffered data.
- error(self, message)
- feed(self, data)
- Feed data to the parser.
Call this as often as you want, with as little or as much text
as you want (may include '\n').
- get_starttag_text(self)
- Return full source of start tag: '<...>'.
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_charref(self, name)
- # Overridable -- handle character reference
- handle_comment(self, data)
- # Overridable -- handle comment
- handle_data(self, data)
- # Overridable -- handle data
- handle_decl(self, decl)
- # Overridable -- handle declaration
- handle_endtag(self, tag)
- # Overridable -- handle end tag
- handle_entityref(self, name)
- # Overridable -- handle entity reference
- handle_pi(self, data)
- # Overridable -- handle processing instruction
- handle_startendtag(self, tag, attrs)
- # Overridable -- finish processing of start+end tag: <tag.../>
- handle_starttag(self, tag, attrs)
- # Overridable -- handle start tag
- parse_bogus_comment(self, i, report=1)
- # Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
- parse_endtag(self, i)
- # Internal -- parse endtag, return end or -1 if incomplete
- parse_html_declaration(self, i)
- # Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase
- parse_pi(self, i)
- # Internal -- parse processing instr, return end or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return end or -1 if not terminated
- reset(self)
- Reset this instance. Loses all unprocessed data.
- set_cdata_mode(self, elem)
- unescape(self, s)
- unknown_decl(self, data)
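As a rough sketch of how feed(), close() and get_starttag_text() fit together, the example below feeds a document in pieces that cut through a start tag and through the text. The class name `ChunkDemo` and its attributes are hypothetical, and the Python 3 import fallback is an assumption. Because handle_data() may receive the text in arbitrary chunks, the pieces are joined before use.

```python
try:
    from html.parser import HTMLParser  # Python 3 name (assumption: may run here)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module documented on this page


class ChunkDemo(HTMLParser):
    """Hypothetical subclass: collects text pieces and raw start-tag source."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.text_parts = []
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the source of the tag being handled.
        self.raw_tags.append(self.get_starttag_text())

    def handle_data(self, data):
        # Text may arrive in several calls, so only accumulate it here.
        self.text_parts.append(data)


p = ChunkDemo()
# The input is split mid-tag and mid-word on purpose.
for chunk in ['<a hr', 'ef="x">Hel', 'lo</a>']:
    p.feed(chunk)
p.close()  # flush anything still buffered

text = "".join(p.text_parts)
raw_tags = p.raw_tags
```

An incomplete start tag is held back until a later feed() call completes it, which is why the handler still sees the whole tag.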
Data and other attributes defined here:
- CDATA_CONTENT_ELEMENTS = ('script', 'style')
- entitydefs = None
Methods inherited from markupbase.ParserBase:
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.
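The inherited getpos() can be called from inside a handler to report where the current construct starts. The sketch below is illustrative only: `PositionLogger` and `positions` are hypothetical names, and the Python 3 import fallback is an assumption. Line numbers start at 1 and offsets at 0.

```python
try:
    from html.parser import HTMLParser  # Python 3 name (assumption: may run here)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module documented on this page


class PositionLogger(HTMLParser):
    """Hypothetical subclass: records (tag, (line, offset)) per start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple for the tag start.
        self.positions.append((tag, self.getpos()))


p = PositionLogger()
p.feed("line one\n<b>x</b>")
p.close()
positions = p.positions
```

This is handy for error messages that need to point back into the source document.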