Diffstat (limited to 'testing/web-platform/tests/tools/html5lib/doc/movingparts.rst')
-rw-r--r-- | testing/web-platform/tests/tools/html5lib/doc/movingparts.rst | 209 |
1 files changed, 209 insertions, 0 deletions
The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.


Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword argument:

.. code-block:: python

  import html5lib
  # SomeTreeBuilder is a placeholder for any tree builder class
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.

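
As a rough illustration of the builder names above, the following sketch
parses the same fragment with the default ``etree`` builder and with the
``dom`` builder, then prints the root of each result. The variable names
are made up for this example, and the exact tag text may vary with parser
options such as namespacing.

.. code-block:: python

  import html5lib

  markup = "<p>Hello World!"

  # Default builder: the result is an xml.etree element for <html>.
  etree_document = html5lib.parse(markup)

  # Builder chosen by name: the result is an xml.dom.minidom document.
  dom_document = html5lib.parse(markup, treebuilder="dom")

  print(etree_document.tag)
  print(dom_document.documentElement.tagName)
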


Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with a couple of handy tools.


HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of string fragments
(or bytes, if an encoding is given).

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...   print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways; consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation for details.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  ``LintError`` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the stream which are not necessary to produce valid
  HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre>`` or ``<textarea>`` tags.

To use a filter, simply wrap it around a stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)


Tree adapters
-------------

Tree adapters are used to translate one type of tree to another. More
documentation pending, sorry.


Encoding discovery
------------------

Parsed trees are always Unicode. However, a wide variety of input
encodings is supported. The encoding of the document is determined in
the following way:

* The encoding may be explicitly specified by passing the name of the
  encoding as the ``encoding`` parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  ``HTMLParser`` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
  5 specification).

* If no encoding can be found and the chardet library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

  import html5lib
  from html5lib import sanitizer

  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
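
As a rough end-to-end sketch of how the parts described above can fit
together, the following example parses a fragment into a ``dom`` tree,
walks it, passes the stream through the sanitizer filter and renders the
result with the serializer. It uses the filter rather than the
``HTMLSanitizer`` tokenizer, the variable names are illustrative only,
and the exact serialized text may differ between html5lib versions.

.. code-block:: python

  import html5lib
  from html5lib.filters import sanitizer
  from html5lib.serializer import HTMLSerializer

  markup = "<p>Surprise!<script>alert('Boo!');</script>"

  # Parse into a minidom tree and obtain a streaming view of it.
  dom = html5lib.parse(markup, treebuilder="dom")
  walker = html5lib.getTreeWalker("dom")

  # Wrap the token stream in the sanitizer filter; markup it does not
  # allow is turned into plain text tokens.
  clean_stream = sanitizer.Filter(walker(dom))

  # render() joins the serialized fragments into a single string.
  clean_html = HTMLSerializer().render(clean_stream)
  print(clean_html)

Because the disallowed ``<script>`` element reaches the serializer as
text, it should come out escaped rather than as live markup.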