Python: module htmllib

htmllib

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/htmllib.py

HTML 2.0 parser. See the HTML 2.0 specification: http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_toc.html

Modules

sgmllib

Classes



sgmllib.SGMLParseError(exceptions.RuntimeError)
    HTMLParseError

sgmllib.SGMLParser(markupbase.ParserBase)
    HTMLParser

class HTMLParseError(sgmllib.SGMLParseError)

    Error raised when an HTML document can't be parsed.

Method resolution order:

HTMLParseError

sgmllib.SGMLParseError

exceptions.RuntimeError

exceptions.StandardError

exceptions.Exception

exceptions.BaseException

__builtin__.object

Data descriptors inherited from sgmllib.SGMLParseError:

__weakref__

list of weak references to the object (if defined)

Methods inherited from exceptions.RuntimeError:

__init__(...)
x.__init__(...) initializes x; see help(type(x)) for signature

Data and other attributes inherited from exceptions.RuntimeError:

__new__ = <built-in method __new__ of type object>
T.__new__(S, ...) -> a new object with type S, a subtype of T

Methods inherited from exceptions.BaseException:

__delattr__(...)
x.__delattr__('name') <==> del x.name

__getattribute__(...)
x.__getattribute__('name') <==> x.name

__getitem__(...)
x.__getitem__(y) <==> x[y]

__getslice__(...)
x.__getslice__(i, j) <==> x[i:j]
Use of negative indices is not supported.

__reduce__(...)

__repr__(...)
x.__repr__() <==> repr(x)

__setattr__(...)
x.__setattr__('name', value) <==> x.name = value

__setstate__(...)

__str__(...)
x.__str__() <==> str(x)

__unicode__(...)

Data descriptors inherited from exceptions.BaseException:

__dict__

args

message

class HTMLParser(sgmllib.SGMLParser)

    This is the basic HTML parser class. It supports all entity names required by the XHTML 1.0 Recommendation. It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.

Method resolution order:

HTMLParser

sgmllib.SGMLParser

markupbase.ParserBase

Methods defined here:

__init__(self, formatter, verbose=0)
Creates an instance of the HTMLParser class. The formatter parameter is the formatter instance associated with the parser.
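
As a minimal sketch of how the formatter argument is used, the following wires the parser to a formatter.AbstractFormatter that writes plain text through a formatter.DumbWriter. It assumes Python 2, where htmllib and formatter are available (htmllib was removed in Python 3.0). The helper name render_text and the sample markup are illustrative and not part of the module.

```python
try:
    import htmllib      # Python 2 only
    import formatter
    from StringIO import StringIO
except ImportError:
    htmllib = None      # htmllib is not available on this interpreter


def render_text(html):
    """Render HTML to plain text, or return None if htmllib is missing.

    render_text is a hypothetical helper written for this example.
    """
    if htmllib is None:
        return None
    out = StringIO()
    writer = formatter.DumbWriter(out)
    # The parser sends text and style events to the formatter it is given.
    parser = htmllib.HTMLParser(formatter.AbstractFormatter(writer))
    parser.feed(html)
    parser.close()
    return out.getvalue()


text = render_text('<p>Hello <b>world</b></p>')
```

Passing formatter.NullFormatter() instead discards all output, which is convenient when only the parser's side effects (such as anchorlist) are of interest.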

anchor_bgn(self, href, name, type)
This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names.  The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document.  The list of hyperlinks is available as the data attribute anchorlist.

anchor_end(self)
This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by the anchor_bgn() method.
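
The default anchor handling described above can be observed with a short sketch. It assumes Python 2 (htmllib does not exist in Python 3); the helper name collect_links and the sample markup are hypothetical. Character data is buffered with save_bgn() and save_end() so that the footnote markers added by anchor_end() can be inspected.

```python
try:
    import htmllib      # Python 2 only
    import formatter
except ImportError:
    htmllib = None      # htmllib is not available on this interpreter


def collect_links(html):
    """Return (anchorlist, buffered text), or None if htmllib is missing.

    collect_links is a hypothetical helper written for this example.
    """
    if htmllib is None:
        return None
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.save_bgn()               # buffer text instead of formatting it
    parser.feed(html)
    parser.close()
    text = parser.save_end()        # includes the markers from anchor_end()
    return parser.anchorlist, text


result = collect_links('<a href="a.html">A</a> and <a href="b.html">B</a>')
```

Each HREF value is appended to anchorlist in document order, and the marker emitted at the end of each anchor refers to its position in that list.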

ddpop(self, bl=0)

do_base(self, attrs)

do_br(self, attrs)

do_dd(self, attrs)

do_dt(self, attrs)

do_hr(self, attrs)

do_img(self, attrs)

do_isindex(self, attrs)

do_li(self, attrs)

do_link(self, attrs)

do_meta(self, attrs)

do_nextid(self, attrs)

do_p(self, attrs)

do_plaintext(self, attrs)

end_a(self)

end_address(self)

end_b(self)

end_blockquote(self)

end_body(self)

end_cite(self)

end_code(self)

end_dir(self)

end_dl(self)

end_em(self)

end_h1(self)

end_h2(self)

end_h3(self)

end_h4(self)

end_h5(self)

end_h6(self)

end_head(self)

end_html(self)

end_i(self)

end_kbd(self)

end_listing(self)

end_menu(self)

end_ol(self)

end_pre(self)

end_samp(self)

end_strong(self)

end_title(self)

end_tt(self)

end_ul(self)

end_var(self)

end_xmp(self)

error(self, message)

handle_data(self, data)

handle_image(self, src, alt, *args)
This method is called to handle images. The default implementation simply passes the alt value to the handle_data() method.
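
A subclass can override handle_image() to record image sources while keeping the default behaviour. The sketch below assumes Python 2; the names collect_images and ImageCollector are hypothetical and exist only for illustration.

```python
try:
    import htmllib      # Python 2 only
    import formatter
except ImportError:
    htmllib = None      # htmllib is not available on this interpreter


def collect_images(html):
    """Return a list of (src, alt) pairs, or None if htmllib is missing."""
    if htmllib is None:
        return None

    class ImageCollector(htmllib.HTMLParser):
        def reset(self):
            htmllib.HTMLParser.reset(self)
            self.images = []

        def handle_image(self, src, alt, *args):
            self.images.append((src, alt))
            # Keep the default behaviour: forward the alt text as data.
            htmllib.HTMLParser.handle_image(self, src, alt, *args)

    parser = ImageCollector(formatter.NullFormatter())
    parser.feed(html)
    parser.close()
    return parser.images


images = collect_images('<img src="logo.png" alt="Logo">')
```

State is initialised in reset() rather than __init__() because the base class constructor calls reset() itself.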

reset(self)

save_bgn(self)
Begins saving character data in a buffer instead of sending it to the formatter object. Retrieve the stored data via the save_end() method.  Calls to the save_bgn() / save_end() pair may not be nested.

save_end(self)
Ends buffering character data and returns all data saved since the preceding call to the save_bgn() method. If the nofill flag is false, whitespace is collapsed to single spaces.  A call to this method without a preceding call to the save_bgn() method will raise a TypeError exception.
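
The save_bgn() / save_end() pair is typically called from tag handlers. The following sketch, assuming Python 2, buffers the text of each <h1> element; the names extract_headings and HeadingParser are hypothetical.

```python
try:
    import htmllib      # Python 2 only
    import formatter
except ImportError:
    htmllib = None      # htmllib is not available on this interpreter


def extract_headings(html):
    """Return the text of every <h1> element, or None if htmllib is missing."""
    if htmllib is None:
        return None

    class HeadingParser(htmllib.HTMLParser):
        def reset(self):
            htmllib.HTMLParser.reset(self)
            self.headings = []

        def start_h1(self, attrs):
            self.save_bgn()                         # start buffering

        def end_h1(self):
            self.headings.append(self.save_end())   # whitespace is collapsed

    parser = HeadingParser(formatter.NullFormatter())
    parser.feed(html)
    parser.close()
    return parser.headings


headings = extract_headings('<h1>Hello\n   world</h1><p>body</p>')
```

Because the pair cannot be nested, this sketch would misbehave if an <h1> contained another element whose handlers also call save_bgn(), such as <title>.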

start_a(self, attrs)

start_address(self, attrs)

start_b(self, attrs)

start_blockquote(self, attrs)

start_body(self, attrs)

start_cite(self, attrs)

start_code(self, attrs)

start_dir(self, attrs)

start_dl(self, attrs)

start_em(self, attrs)

start_h1(self, attrs)

start_h2(self, attrs)

start_h3(self, attrs)

start_h4(self, attrs)

start_h5(self, attrs)

start_h6(self, attrs)

start_head(self, attrs)

start_html(self, attrs)

start_i(self, attrs)

start_kbd(self, attrs)

start_listing(self, attrs)

start_menu(self, attrs)

start_ol(self, attrs)

start_pre(self, attrs)

start_samp(self, attrs)

start_strong(self, attrs)

start_title(self, attrs)

start_tt(self, attrs)

start_ul(self, attrs)

start_var(self, attrs)

start_xmp(self, attrs)

unknown_endtag(self, tag)

unknown_starttag(self, tag, attrs)

Data and other attributes defined here:

entitydefs = {'AElig': '\xc6', 'Aacute': '\xc1', 'Acirc': '\xc2', 'Agrave': '\xc0', 'Alpha': '&#913;', 'Aring': '\xc5', 'Atilde': '\xc3', 'Auml': '\xc4', 'Beta': '&#914;', 'Ccedil': '\xc7', ...}

Methods inherited from sgmllib.SGMLParser:

close(self)
Handle the remaining data.

convert_charref(self, name)
Convert character reference, may be overridden.

convert_codepoint(self, codepoint)

convert_entityref(self, name)
Convert entity references. As an alternative to overriding this method, one can tailor the results by setting up the self.entitydefs mapping appropriately.
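
Tailoring the entitydefs mapping can be sketched as follows, assuming Python 2. The entity name brand, its replacement text, and the names expand_entities and BrandParser are all hypothetical and chosen only for illustration.

```python
try:
    import htmllib      # Python 2 only
    import formatter
except ImportError:
    htmllib = None      # htmllib is not available on this interpreter


def expand_entities(text):
    """Return text with entities expanded, or None if htmllib is missing."""
    if htmllib is None:
        return None

    class BrandParser(htmllib.HTMLParser):
        # Copy the inherited table, then add a made-up entity.
        entitydefs = dict(htmllib.HTMLParser.entitydefs, brand='ACME')

    parser = BrandParser(formatter.NullFormatter())
    parser.save_bgn()       # buffer the converted text for inspection
    parser.feed(text)
    parser.close()
    return parser.save_end()


result = expand_entities('Made by &brand;.')
```

Copying the mapping in the subclass avoids mutating the table shared by every HTMLParser instance.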

feed(self, data)
Feed some data to the parser. Call this as often as you want, with as little or as much text as you want (may include '\n'). (This just saves the text; all the processing is done by goahead().)

finish_endtag(self, tag)
# Internal -- finish processing of end tag

finish_shorttag(self, tag, data)
# Internal -- finish parsing of <tag/data/ (same as <tag>data</tag>)

finish_starttag(self, tag, attrs)
# Internal -- finish processing of start tag
# Return -1 for unknown tag, 0 for open-only tag, 1 for balanced tag

get_starttag_text(self)

goahead(self, end)
# Internal -- handle data as far as reasonable.  May leave state
# and data to be processed by a subsequent call.  If 'end' is
# true, force handling all data as if followed by EOF marker.

handle_charref(self, name)
Handle character reference, no need to override.

handle_comment(self, data)
# Example -- handle comment, could be overridden

handle_decl(self, decl)
# Example -- handle declaration, could be overridden

handle_endtag(self, tag, method)
# Overridable -- handle end tag

handle_entityref(self, name)
Handle entity references, no need to override.

handle_pi(self, data)
# Example -- handle processing instruction, could be overridden

handle_starttag(self, tag, method, attrs)
# Overridable -- handle start tag

parse_endtag(self, i)
# Internal -- parse endtag

parse_pi(self, i)
# Internal -- parse processing instr, return length or -1 if not terminated

parse_starttag(self, i)
# Internal -- handle starttag, return length or -1 if not terminated

report_unbalanced(self, tag)
# Example -- report an unbalanced </...> tag.

setliteral(self, *args)
Enter literal mode (CDATA). Intended for derived classes only.

setnomoretags(self)
Enter literal mode (CDATA) till EOF. Intended for derived classes only.

unknown_charref(self, ref)

unknown_entityref(self, ref)

Data and other attributes inherited from sgmllib.SGMLParser:

entity_or_charref = <_sre.SRE_Pattern object>

Methods inherited from markupbase.ParserBase:

getpos(self)
Return current line number and offset.

parse_comment(self, i, report=1)
# Internal -- parse comment, return length or -1 if not terminated

parse_declaration(self, i)
# Internal -- parse declaration (for use by subclasses).

parse_marked_section(self, i, report=1)
# Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>

unknown_decl(self, data)
# To be overridden -- handlers for unknown objects

updatepos(self, i, j)
# Internal -- update line number and offset.  This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.

Data

__all__ = ['HTMLParser', 'HTMLParseError']