1 files changed, 209 insertions, 0 deletions
diff --git a/testing/web-platform/tests/tools/html5lib/doc/movingparts.rst b/testing/web-platform/tests/tools/html5lib/doc/movingparts.rst
new file mode 100644
index 000000000..36539785a
--- /dev/null
+++ b/testing/web-platform/tests/tools/html5lib/doc/movingparts.rst
@@ -0,0 +1,209 @@
+The moving parts
+================
+
+html5lib consists of a number of components, which are responsible for
+handling its features.
+
+
+Tree builders
+-------------
+
+The parser reads HTML by tokenizing the content and building a tree that
+the user can later access. There are three main types of trees that
+html5lib can build:
+
+* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+  which can be found in the standard library. Whenever possible, the
+  accelerated ``ElementTree`` implementation (i.e.
+  ``xml.etree.cElementTree`` on Python 2.x) is used.
+
+* ``dom`` - builds a tree based on ``xml.dom.minidom``.
+
+* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+  API.  The performance gains are relatively small compared to using the
+  accelerated ``ElementTree`` module.
+
+You can specify the builder by name when using the shorthand API:
+
+.. code-block:: python
+
+  import html5lib
+  with open("mydocument.html", "rb") as f:
+      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
+
+When instantiating a parser object, you have to pass a tree builder
+class in the ``tree`` keyword argument:
+
+.. code-block:: python
+
+  import html5lib
+  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
+  document = parser.parse("<p>Hello World!")
+
+To get a builder class by name, use the ``getTreeBuilder`` function:
+
+.. code-block:: python
+
+  import html5lib
+  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+  minidom_document = parser.parse("<p>Hello World!")
+
+The implementation of builders can be found in `html5lib/treebuilders/
+<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.
+
+
+Tree walkers
+------------
+
+Once a tree is ready, you can work on it either manually, or using
+a tree walker, which provides a streaming view of the tree. html5lib
+provides walkers for all three supported types of trees (``etree``,
+``dom`` and ``lxml``).
+
+The implementation of walkers can be found in `html5lib/treewalkers/
+<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
+
+Walkers make consuming HTML easier. html5lib uses them to provide you
+with a couple of handy tools.
+
+
+HTMLSerializer
+~~~~~~~~~~~~~~
+
+The serializer lets you write HTML back as a stream of bytes.
+
+.. code-block:: pycon
+
+  >>> import html5lib
+  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
+  >>> walker = html5lib.getTreeWalker("etree")
+  >>> stream = walker(element)
+  >>> s = html5lib.serializer.HTMLSerializer()
+  >>> output = s.serialize(stream)
+  >>> for item in output:
+  ...   print("%r" % item)
+  '<p'
+  ' '
+  'xml:lang'
+  '='
+  'pl'
+  '>'
+  'Witam wszystkich'
+
+You can customize the serializer's behaviour in a variety of ways; consult
+the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
+documentation for details.
+
+
+Filters
+~~~~~~~
+
+You can alter the stream content with filters provided by html5lib:
+
+* :class:`alphabeticalattributes.Filter
+  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
+  tags to be in alphabetical order
+
+* :class:`inject_meta_charset.Filter
+  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
+  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
+  the document
+
+* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
+  ``LintError`` exceptions on invalid tag and attribute names, invalid
+  PCDATA, etc.
+
+* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
+  removes tags from the stream which are not necessary to produce valid
+  HTML
+
+* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
+  unsafe markup and CSS. Elements that are known to be safe are passed
+  through and the rest is converted to visible text. The default
+  configuration of the sanitizer follows the `WHATWG Sanitization Rules
+  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
+
+* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
+  collapses all whitespace characters to single spaces unless they're in
+  ``<pre>`` or ``<textarea>`` elements.
+
+To use a filter, simply wrap it around a stream:
+
+.. code-block:: pycon
+
+  >>> import html5lib
+  >>> from html5lib.filters import sanitizer
+  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
+  >>> walker = html5lib.getTreeWalker("dom")
+  >>> stream = walker(dom)
+  >>> clean_stream = sanitizer.Filter(stream)
+
+
+Tree adapters
+-------------
+
+Tree adapters translate one type of tree into another. More
+documentation is pending, sorry.
+
+
+Encoding discovery
+------------------
+
+Parsed trees are always Unicode. However, a wide variety of input
+encodings is supported. The encoding of the document is determined in
+the following way:
+
+* The encoding may be explicitly specified by passing the name of the
+  encoding as the ``encoding`` parameter to the
+  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
+  ``HTMLParser`` objects.
+
+* If no encoding is specified, the parser will attempt to detect the
+  encoding from a ``<meta>`` element in the first 512 bytes of the
+  document (this is only a partial implementation of the current HTML
+  5 specification).
+
+* If no encoding can be found and the chardet library is available, an
+  attempt will be made to sniff the encoding from the byte pattern.
+
+* If all else fails, the default encoding will be used. This is usually
+  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
+  a common fallback used by Web browsers.
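+
+The order of these steps can be sketched as a small stand-alone function.
+This is only an illustration and not html5lib's actual implementation:
+the ``discover_encoding`` name, the ``DEFAULT_ENCODING`` constant and the
+simplified ``<meta>`` pattern are assumptions made for the example.
+
+.. code-block:: python
+
+  import re
+
+  # Assumed fallback; the default is "usually" Windows-1252.
+  DEFAULT_ENCODING = "windows-1252"
+
+  # Simplified pattern for charset=... inside a <meta> tag. The prescan
+  # described in the HTML specification is considerably more involved.
+  META_CHARSET = re.compile(
+      br"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_:.-]+)", re.IGNORECASE)
+
+
+  def discover_encoding(data, explicit=None):
+      """Return (encoding, source) for a byte string."""
+      # 1. An explicitly supplied encoding always wins.
+      if explicit is not None:
+          return explicit, "explicit"
+
+      # 2. Look for a <meta> declaration in the first 512 bytes only.
+      match = META_CHARSET.search(data[:512])
+      if match:
+          return match.group(1).decode("ascii").lower(), "meta"
+
+      # 3. Sniff the byte pattern with chardet, if it is installed.
+      try:
+          import chardet
+      except ImportError:
+          chardet = None
+      if chardet is not None:
+          guess = chardet.detect(data)["encoding"]
+          if guess:
+              return guess.lower(), "chardet"
+
+      # 4. Nothing found: fall back to the default encoding.
+      return DEFAULT_ENCODING, "default"
+
+For example, ``discover_encoding(b'<meta charset="ISO-8859-2"><p>Witam')``
+reports the ``meta`` source, while a declaration that starts after the
+first 512 bytes is ignored and the later steps are tried instead.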
+
+
+Tokenizers
+----------
+
+The part of the parser responsible for translating a raw input stream
+into meaningful tokens is the tokenizer. html5lib currently provides
+two: ``HTMLTokenizer`` and ``HTMLSanitizer``.
+
+To set up a tokenizer, simply pass it when instantiating
+a :class:`~html5lib.html5parser.HTMLParser`:
+
+.. code-block:: python
+
+  import html5lib
+  from html5lib import sanitizer
+
+  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
+  p.parse("<p>Surprise!<script>alert('Boo!');</script>")
+
+HTMLTokenizer
+~~~~~~~~~~~~~
+
+This is the default tokenizer, the heart of html5lib. The implementation
+can be found in `html5lib/tokenizer.py
+<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
+
+HTMLSanitizer
+~~~~~~~~~~~~~
+
+This is a tokenizer that removes unsafe markup and CSS styles from the
+input. Elements that are known to be safe are passed through and the
+rest is converted to visible text. The default configuration of the
+sanitizer follows the `WHATWG Sanitization Rules
+<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
+
+The implementation can be found in `html5lib/sanitizer.py
+<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.