The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword argument:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

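
To make the difference between builders concrete, the following sketch
parses the same markup with the ``etree`` and ``dom`` builders and inspects
the result. The variable names are illustrative only, and the namespaced tag
noted in the comments assumes the parser's default settings in a recent
html5lib release.

```python
import html5lib

# The same markup, parsed with two of the builders named above.
markup = "<p>Hello World!"

etree_root = html5lib.parse(markup, treebuilder="etree")
dom_document = html5lib.parse(markup, treebuilder="dom")

# The etree builder returns the root <html> element. With default settings
# its tag is qualified with the XHTML namespace.
print(etree_root.tag)

# The dom builder returns an xml.dom.minidom Document instead.
print(dom_document.documentElement.tagName)
paragraph = dom_document.getElementsByTagName("p")[0]
print(paragraph.firstChild.data)
```

Which builder to pick mostly depends on the API the rest of your code
expects: ``ElementTree``-style elements or DOM nodes.
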
The implementation of builders can be found in `html5lib/treebuilders/

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...   print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways; consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation for the available options.
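
As a rough illustration of such customization, the sketch below renders the
same tree twice: once with the default serializer and once with two options
changed. It assumes a recent html5lib release in which ``HTMLSerializer``
accepts the ``omit_optional_tags`` and ``quote_attr_values`` keyword
arguments and offers a ``render`` method that returns a single string; check
the serializer documentation of your installed version before relying on
these names.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
walker = html5lib.getTreeWalker("etree")

# Default settings: optional tags such as <html> and <body> are omitted.
default_html = HTMLSerializer().render(walker(element))

# Assumed options: keep every tag and always quote attribute values.
verbose_serializer = HTMLSerializer(
    omit_optional_tags=False,
    quote_attr_values="always",
)
verbose_html = verbose_serializer.render(walker(element))

print(default_html)
print(verbose_html)
```

A fresh walker stream is created for each call because the stream is an
iterator and is consumed by serialization.
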
Filters
~~~~~~~
You can alter the stream content with filters provided by html5lib:
* :class:`alphabeticalattributes.Filter
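
A filter wraps a tree walker stream and yields a modified stream, which can
then be passed on to the serializer. The sketch below applies the
alphabetical attributes filter; the import path
``html5lib.filters.alphabeticalattributes`` and the sample markup are
assumptions made for illustration.

```python
import html5lib
from html5lib.filters import alphabeticalattributes
from html5lib.serializer import HTMLSerializer

# Attributes are deliberately written out of alphabetical order.
element = html5lib.parse('<p title="t" class="c" id="i">text')
walker = html5lib.getTreeWalker("etree")

# Wrap the walker stream so that each start tag's attributes are sorted
# by name before the serializer sees them.
stream = alphabeticalattributes.Filter(walker(element))
html = HTMLSerializer().render(stream)

print(html)
```

Because filters consume and produce the same kind of token stream, several
of them can be chained by wrapping one filter in another.
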
")
HTMLTokenizer
~~~~~~~~~~~~~
This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py