1 files changed, 1147 insertions, 0 deletions
diff --git a/parser/html/java/htmlparser/doc/tokenization.txt b/parser/html/java/htmlparser/doc/tokenization.txt
new file mode 100644
index 000000000..21cd7f6e2
--- /dev/null
+++ b/parser/html/java/htmlparser/doc/tokenization.txt
@@ -0,0 +1,1147 @@
+   WHATWG
+
+HTML 5
+
+Draft Recommendation — 7 February 2009
+
+    8.2.4 Tokenization
+
+   Implementations must act as if they used the following state machine to
+   tokenise HTML. The state machine must start in the data state. Most
+   states consume a single character, which may have various side-effects,
+   and then either switch the state machine to a new state to reconsume
+   the same character, switch it to a new state (to consume the next
+   character), or repeat the same state (to consume the next character).
+   Some states have more complicated behavior and can consume several
+   characters before switching to another state.
+
+   The exact behavior of certain states depends on a content model flag
+   that is set after certain tokens are emitted. The flag has several
+   states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in
+   the PCDATA state. In the RCDATA and CDATA states, a further escape flag
+   is used to control the behavior of the tokeniser. It is either true or
+   false, and initially must be set to the false state. The insertion mode
+   and the stack of open elements also affect tokenization.
+
+   The output of the tokenization step is a series of zero or more of the
+   following tokens: DOCTYPE, start tag, end tag, comment, character,
+   end-of-file. DOCTYPE tokens have a name, a public identifier, a system
+   identifier, and a force-quirks flag. When a DOCTYPE token is created,
+   its name, public identifier, and system identifier must be marked as
+   missing (which is a distinct state from the empty string), and the
+   force-quirks flag must be set to off (its other state is on). Start and
+   end tag tokens have a tag name, a self-closing flag, and a list of
+   attributes, each of which has a name and a value. When a start or end
+   tag token is created, its self-closing flag must be unset (its other
+   state is that it be set), and its attributes list must be empty.
+   Comment and character tokens have data.
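
A minimal Python sketch of the token types described above. The class and field names (`Doctype`, `Tag`, `public_id`, and so on) are illustrative assumptions, not names from the specification. `None` is used to model "missing", which keeps it distinct from the empty string.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Doctype:
    # None models "missing", which is a different state from "".
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # "off" on creation


@dataclass
class Tag:
    kind: str                   # "start" or "end"
    name: str = ""
    self_closing: bool = False  # "unset" on creation
    # (name, value) pairs; the list starts out empty
    attributes: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Comment:
    data: str = ""


@dataclass
class Character:
    data: str


class EndOfFile:
    """Marker token; carries no data."""
```

With this representation, `Doctype(public_id="")` and `Doctype()` compare unequal, which is the missing-versus-empty distinction the text requires.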
+
+   When a token is emitted, it must immediately be handled by the tree
+   construction stage. The tree construction stage can affect the state of
+   the content model flag, and can insert additional characters into the
+   stream. (For example, the script element can result in scripts
+   executing and using the dynamic markup insertion APIs to insert
+   characters into the stream being tokenised.)
+
+   When a start tag token is emitted with its self-closing flag set, if
+   the flag is not acknowledged when it is processed by the tree
+   construction stage, that is a parse error.
+
+   When an end tag token is emitted, the content model flag must be
+   switched to the PCDATA state.
+
+   When an end tag token is emitted with attributes, that is a parse
+   error.
+
+   When an end tag token is emitted with its self-closing flag set, that
+   is a parse error.
+
+   Before each step of the tokeniser, the user agent must first check the
+   parser pause flag. If it is true, then the tokeniser must abort the
+   processing of any nested invocations of the tokeniser, yielding control
+   back to the caller. If it is false, then the user agent may check to
+   see if either one of the scripts in the list of scripts that will
+   execute as soon as possible or the first script in the list of scripts
+   that will execute asynchronously has completed loading. If one has,
+   then it must be executed and removed from its list.
+
+   The tokeniser state machine consists of the states defined in the
+   following subsections.
+
+      8.2.4.1 Data state
+
+   Consume the next input character:
+
+   U+0026 AMPERSAND (&)
+          When the content model flag is set to one of the PCDATA or
+          RCDATA states and the escape flag is false: switch to the
+          character reference data state.
+          Otherwise: treat it as per the "anything else" entry below.
+
+   U+002D HYPHEN-MINUS (-)
+          If the content model flag is set to either the RCDATA state or
+          the CDATA state, and the escape flag is false, and there are at
+          least three characters before this one in the input stream, and
+          the last four characters in the input stream, including this
+          one, are U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D
+          HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the
+          escape flag to true.
+
+          In any case, emit the input character as a character token. Stay
+          in the data state.
+
+   U+003C LESS-THAN SIGN (<)
+          When the content model flag is set to the PCDATA state: switch
+          to the tag open state.
+          When the content model flag is set to either the RCDATA state or
+          the CDATA state, and the escape flag is false: switch to the tag
+          open state.
+          Otherwise: treat it as per the "anything else" entry below.
+
+   U+003E GREATER-THAN SIGN (>)
+          If the content model flag is set to either the RCDATA state or
+          the CDATA state, and the escape flag is true, and the last three
+          characters in the input stream including this one are U+002D
+          HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN
+          ("-->"), set the escape flag to false.
+
+          In any case, emit the input character as a character token. Stay
+          in the data state.
+
+   EOF
+          Emit an end-of-file token.
+
+   Anything else
+          Emit the input character as a character token. Stay in the data
+          state.
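
The data state rules above can be sketched as a single step function. This is a simplified illustration, not the reference algorithm: the function name, the string labels for actions, and the `consumed` parameter (the input consumed so far, ending with the current character) are all assumptions made for the example. `None` stands in for EOF.

```python
def data_state_step(ch, consumed, content_model, escape):
    """Return (action, escape) for one character seen in the data state.

    ch            -- current input character, or None for EOF
    consumed      -- input consumed so far, ending with ch
    content_model -- "PCDATA", "RCDATA", "CDATA" or "PLAINTEXT"
    escape        -- current value of the escape flag
    """
    if ch is None:
        return "emit-eof", escape
    cdata_like = content_model in ("RCDATA", "CDATA")
    if ch == "&" and content_model in ("PCDATA", "RCDATA") and not escape:
        return "switch:character reference data", escape
    if ch == "-":
        # endswith("<!--") also guarantees three characters precede this one.
        if cdata_like and not escape and consumed.endswith("<!--"):
            escape = True
        return "emit", escape
    if ch == "<":
        if content_model == "PCDATA" or (cdata_like and not escape):
            return "switch:tag open", escape
        return "emit", escape
    if ch == ">":
        if cdata_like and escape and consumed.endswith("-->"):
            escape = False
        return "emit", escape
    # "Anything else", including "&" and "<" when their conditions fail.
    return "emit", escape
```

For example, in the RCDATA state the second hyphen of `<!--` turns the escape flag on, and a later `<` is then emitted as a plain character until `-->` turns the flag off again.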
+
+      8.2.4.2 Character reference data state
+
+   (This cannot happen if the content model flag is set to the CDATA
+   state.)
+
+   Attempt to consume a character reference, with no additional allowed
+   character.
+
+   If nothing is returned, emit a U+0026 AMPERSAND character token.
+
+   Otherwise, emit the character token that was returned.
+
+   Finally, switch to the data state.
+
+      8.2.4.3 Tag open state
+
+   The behavior of this state depends on the content model flag.
+
+   If the content model flag is set to the RCDATA or CDATA states
+          Consume the next input character. If it is a U+002F SOLIDUS (/)
+          character, switch to the close tag open state. Otherwise, emit a
+          U+003C LESS-THAN SIGN character token and reconsume the current
+          input character in the data state.
+
+   If the content model flag is set to the PCDATA state
+          Consume the next input character:
+
+        U+0021 EXCLAMATION MARK (!)
+                Switch to the markup declaration open state.
+
+        U+002F SOLIDUS (/)
+                Switch to the close tag open state.
+
+        U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
+                LETTER Z
+                Create a new start tag token, set its tag name to the
+                lowercase version of the input character (add 0x0020 to
+                the character's code point), then switch to the tag name
+                state. (Don't emit the token yet; further details will be
+                filled in before it is emitted.)
+
+        U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
+                Create a new start tag token, set its tag name to the
+                input character, then switch to the tag name state. (Don't
+                emit the token yet; further details will be filled in
+                before it is emitted.)
+
+        U+003E GREATER-THAN SIGN (>)
+                Parse error. Emit a U+003C LESS-THAN SIGN character token
+                and a U+003E GREATER-THAN SIGN character token. Switch to
+                the data state.
+
+        U+003F QUESTION MARK (?)
+                Parse error. Switch to the bogus comment state.
+
+        Anything else
+                Parse error. Emit a U+003C LESS-THAN SIGN character token
+                and reconsume the current input character in the data
+                state.
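
A minimal sketch of the PCDATA branch of the tag open state, to make the case analysis concrete. The `Step` record and its field names are assumptions introduced for this example; `None` again stands in for EOF, which falls under "anything else".

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    next_state: str
    parse_error: bool = False
    emit: str = ""                  # character tokens to emit, in order
    reconsume: bool = False         # reprocess the same character in next_state
    tag_name: Optional[str] = None  # set when a new start tag token is created


def tag_open_pcdata(ch):
    """Handle the character after '<' when the content model flag is PCDATA."""
    if ch == "!":
        return Step("markup declaration open")
    if ch == "/":
        return Step("close tag open")
    if ch is not None and "A" <= ch <= "Z":
        # Lowercase by adding 0x0020 to the code point.
        return Step("tag name", tag_name=chr(ord(ch) + 0x20))
    if ch is not None and "a" <= ch <= "z":
        return Step("tag name", tag_name=ch)
    if ch == ">":
        return Step("data", parse_error=True, emit="<>")
    if ch == "?":
        return Step("bogus comment", parse_error=True)
    return Step("data", parse_error=True, emit="<", reconsume=True)
```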
+
+      8.2.4.4 Close tag open state
+
+   If the content model flag is set to the RCDATA or CDATA states but no
+   start tag token has ever been emitted by this instance of the tokeniser
+   (fragment case), or, if the content model flag is set to the RCDATA or
+   CDATA states and the next few characters do not match the tag name of
+   the last start tag token emitted (compared in an ASCII case-insensitive
+   manner), or if they do but they are not immediately followed by one of
+   the following characters:
+     * U+0009 CHARACTER TABULATION
+     * U+000A LINE FEED (LF)
+     * U+000C FORM FEED (FF)
+     * U+0020 SPACE
+     * U+003E GREATER-THAN SIGN (>)
+     * U+002F SOLIDUS (/)
+     * EOF
+
+   ...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
+   character token, and switch to the data state to process the next input
+   character.
+
+   Otherwise, if the content model flag is set to the PCDATA state, or if
+   the next few characters do match that tag name, consume the next input
+   character:
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Create a new end tag token, set its tag name to the lowercase
+          version of the input character (add 0x0020 to the character's
+          code point), then switch to the tag name state. (Don't emit the
+          token yet; further details will be filled in before it is
+          emitted.)
+
+   U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
+          Create a new end tag token, set its tag name to the input
+          character, then switch to the tag name state. (Don't emit the
+          token yet; further details will be filled in before it is
+          emitted.)
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Switch to the data state.
+
+   EOF
+          Parse error. Emit a U+003C LESS-THAN SIGN character token and a
+          U+002F SOLIDUS character token. Reconsume the EOF character in
+          the data state.
+
+   Anything else
+          Parse error. Switch to the bogus comment state.
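
The RCDATA/CDATA check at the start of the close tag open state can be sketched as a lookahead predicate. The function and parameter names are assumptions for illustration: `upcoming` is the input that follows `</`, and `last_start_tag` is the name of the last start tag emitted, or `None` in the fragment case. Only ASCII letters are case-folded, so `str.lower()` is deliberately avoided.

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Characters that may follow the tag name; EOF is handled separately.
_TERMINATORS = {"\t", "\n", "\f", " ", ">", "/"}


def ascii_lower(s):
    """Lowercase A-Z only, leaving every other character unchanged."""
    return s.translate(_ASCII_LOWER)


def matches_last_start_tag(upcoming, last_start_tag):
    """Return True if '</' + upcoming begins a matching end tag."""
    if last_start_tag is None:  # fragment case: no start tag emitted yet
        return False
    n = len(last_start_tag)
    if ascii_lower(upcoming[:n]) != ascii_lower(last_start_tag):
        return False
    following = upcoming[n:n + 1]
    # An empty slice means the input ended right after the name (EOF).
    return following == "" or following in _TERMINATORS
```

When the predicate is false in the RCDATA or CDATA states, the tokeniser emits `<` and `/` as character tokens and returns to the data state.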
+
+      8.2.4.5 Tag name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Switch to the before attribute name state.
+
+   U+002F SOLIDUS (/)
+          Switch to the self-closing start tag state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current tag token. Switch to the data state.
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Append the lowercase version of the current input character (add
+          0x0020 to the character's code point) to the current tag token's
+          tag name. Stay in the tag name state.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Append the current input character to the current tag token's
+          tag name. Stay in the tag name state.
+
+      8.2.4.6 Before attribute name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the before attribute name state.
+
+   U+002F SOLIDUS (/)
+          Switch to the self-closing start tag state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current tag token. Switch to the data state.
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Start a new attribute in the current tag token. Set that
+          attribute's name to the lowercase version of the current input
+          character (add 0x0020 to the character's code point), and its
+          value to the empty string. Switch to the attribute name state.
+
+   U+0022 QUOTATION MARK (")
+   U+0027 APOSTROPHE (')
+   U+003D EQUALS SIGN (=)
+          Parse error. Treat it as per the "anything else" entry below.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Start a new attribute in the current tag token. Set that
+          attribute's name to the current input character, and its value
+          to the empty string. Switch to the attribute name state.
+
+      8.2.4.7 Attribute name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Switch to the after attribute name state.
+
+   U+002F SOLIDUS (/)
+          Switch to the self-closing start tag state.
+
+   U+003D EQUALS SIGN (=)
+          Switch to the before attribute value state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current tag token. Switch to the data state.
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Append the lowercase version of the current input character (add
+          0x0020 to the character's code point) to the current attribute's
+          name. Stay in the attribute name state.
+
+   U+0022 QUOTATION MARK (")
+   U+0027 APOSTROPHE (')
+          Parse error. Treat it as per the "anything else" entry below.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Append the current input character to the current attribute's
+          name. Stay in the attribute name state.
+
+   When the user agent leaves the attribute name state (and before
+   emitting the tag token, if appropriate), the complete attribute's name
+   must be compared to the other attributes on the same token; if there is
+   already an attribute on the token with the exact same name, then this
+   is a parse error and the new attribute must be dropped, along with the
+   value that gets associated with it (if any).
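
A small sketch of the duplicate-attribute rule. It simplifies the text above by taking the name and value together, whereas the tokeniser makes the comparison as soon as it leaves the attribute name state; `add_attribute` and its return convention are assumptions for illustration.

```python
def add_attribute(attributes, name, value):
    """Append (name, value) unless the name is already present.

    Returns True if the attribute was kept, False if it was dropped
    (which the tokeniser would also report as a parse error).
    """
    if any(existing == name for existing, _ in attributes):
        return False
    attributes.append((name, value))
    return True
```

Because attribute names are lowercased as they are built, an exact string comparison is enough here; the first occurrence wins.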
+
+      8.2.4.8 After attribute name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the after attribute name state.
+
+   U+002F SOLIDUS (/)
+          Switch to the self-closing start tag state.
+
+   U+003D EQUALS SIGN (=)
+          Switch to the before attribute value state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current tag token. Switch to the data state.
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Start a new attribute in the current tag token. Set that
+          attribute's name to the lowercase version of the current input
+          character (add 0x0020 to the character's code point), and its
+          value to the empty string. Switch to the attribute name state.
+
+   U+0022 QUOTATION MARK (")
+   U+0027 APOSTROPHE (')
+          Parse error. Treat it as per the "anything else" entry below.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Start a new attribute in the current tag token. Set that
+          attribute's name to the current input character, and its value
+          to the empty string. Switch to the attribute name state.
+
+      8.2.4.9 Before attribute value state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the before attribute value state.
+
+   U+0022 QUOTATION MARK (")
+          Switch to the attribute value (double-quoted) state.
+
+   U+0026 AMPERSAND (&)
+          Switch to the attribute value (unquoted) state and reconsume
+          this input character.
+
+   U+0027 APOSTROPHE (')
+          Switch to the attribute value (single-quoted) state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Emit the current tag token. Switch to the data
+          state.
+
+   U+003D EQUALS SIGN (=)
+          Parse error. Treat it as per the "anything else" entry below.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Append the current input character to the current attribute's
+          value. Switch to the attribute value (unquoted) state.
+
+      8.2.4.10 Attribute value (double-quoted) state
+
+   Consume the next input character:
+
+   U+0022 QUOTATION MARK (")
+          Switch to the after attribute value (quoted) state.
+
+   U+0026 AMPERSAND (&)
+          Switch to the character reference in attribute value state, with
+          the additional allowed character being U+0022 QUOTATION MARK
+          (").
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Append the current input character to the current attribute's
+          value. Stay in the attribute value (double-quoted) state.
+
+      8.2.4.11 Attribute value (single-quoted) state
+
+   Consume the next input character:
+
+   U+0027 APOSTROPHE (')
+          Switch to the after attribute value (quoted) state.
+
+   U+0026 AMPERSAND (&)
+          Switch to the character reference in attribute value state, with
+          the additional allowed character being U+0027 APOSTROPHE (').
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Append the current input character to the current attribute's
+          value. Stay in the attribute value (single-quoted) state.
+
+      8.2.4.12 Attribute value (unquoted) state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Switch to the before attribute name state.
+
+   U+0026 AMPERSAND (&)
+          Switch to the character reference in attribute value state, with
+          no additional allowed character.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current tag token. Switch to the data state.
+
+   U+0022 QUOTATION MARK (")
+   U+0027 APOSTROPHE (')
+   U+003D EQUALS SIGN (=)
+          Parse error. Treat it as per the "anything else" entry below.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Append the current input character to the current attribute's
+          value. Stay in the attribute value (unquoted) state.
+
+      8.2.4.13 Character reference in attribute value state
+
+   Attempt to consume a character reference.
+
+   If nothing is returned, append a U+0026 AMPERSAND character to the
+   current attribute's value.
+
+   Otherwise, append the returned character token to the current
+   attribute's value.
+
+   Finally, switch back to the attribute value state that you were in when
+   you were switched into this state.
+
+      8.2.4.14 After attribute value (quoted) state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Switch to the before attribute name state.
+
+   U+002F SOLIDUS (/)
+          Switch to the self-closing start tag state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current tag token. Switch to the data state.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Parse error. Reconsume the character in the before attribute
+          name state.
+
+      8.2.4.15 Self-closing start tag state
+
+   Consume the next input character:
+
+   U+003E GREATER-THAN SIGN (>)
+          Set the self-closing flag of the current tag token. Emit the
+          current tag token. Switch to the data state.
+
+   EOF
+          Parse error. Emit the current tag token. Reconsume the EOF
+          character in the data state.
+
+   Anything else
+          Parse error. Reconsume the character in the before attribute
+          name state.
+
+      8.2.4.16 Bogus comment state
+
+   (This can only happen if the content model flag is set to the PCDATA
+   state.)
+
+   Consume every character up to and including the first U+003E
+   GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
+   comes first. Emit a comment token whose data is the concatenation of
+   all the characters starting from and including the character that
+   caused the state machine to switch into the bogus comment state, up to
+   and including the character immediately before the last consumed
+   character (i.e. up to the character just before the U+003E or EOF
+   character). (If the comment was started by the end of the file (EOF),
+   the token is empty.)
+
+   Switch to the data state.
+
+   If the end of the file was reached, reconsume the EOF character.
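
The bogus comment state amounts to a scan for the first `>`. A minimal sketch, assuming the whole input is available as a string; the function name and the returned triple are illustrative choices, with `start` being the index of the character that caused the switch into this state.

```python
def bogus_comment(text, start):
    """Return (comment_data, next_index, hit_eof).

    comment_data excludes the terminating '>'. When no '>' is found the
    comment runs to the end of the input and hit_eof is True, meaning the
    caller should reconsume EOF in the data state.
    """
    end = text.find(">", start)
    if end == -1:
        return text[start:], len(text), True
    return text[start:end], end + 1, False
```

If `start` is already at the end of the input, the returned data is the empty string, matching the remark that a comment started by EOF is empty.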
+
+      8.2.4.17 Markup declaration open state
+
+   (This can only happen if the content model flag is set to the PCDATA
+   state.)
+
+   If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
+   consume those two characters, create a comment token whose data is the
+   empty string, and switch to the comment start state.
+
+   Otherwise, if the next seven characters are an ASCII case-insensitive
+   match for the word "DOCTYPE", then consume those characters and switch
+   to the DOCTYPE state.
+
+   Otherwise, if the insertion mode is "in foreign content" and the
+   current node is not an element in the HTML namespace and the next seven
+   characters are an ASCII case-sensitive match for the string "[CDATA["
+   (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
+   character before and after), then consume those characters and switch
+   to the CDATA section state (which is unrelated to the content model
+   flag's CDATA state).
+
+   Otherwise, this is a parse error. Switch to the bogus comment state.
+   The next character that is consumed, if any, is the first character
+   that will be in the comment.
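
The lookahead in the markup declaration open state can be sketched as follows. The names are assumptions for illustration; in particular `cdata_allowed` stands in for the condition that the insertion mode is "in foreign content" and the current node is not an element in the HTML namespace, which the tokeniser would have to ask the tree construction stage about.

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def markup_declaration_open(text, pos, cdata_allowed=False):
    """text[pos:] is the input right after '<!'.

    Returns (next_state, next_pos, parse_error).
    """
    if text.startswith("--", pos):
        return "comment start", pos + 2, False
    # ASCII case-insensitive match for "DOCTYPE".
    if text[pos:pos + 7].translate(_ASCII_LOWER) == "doctype":
        return "DOCTYPE", pos + 7, False
    # Case-sensitive match for "[CDATA[".
    if cdata_allowed and text.startswith("[CDATA[", pos):
        return "CDATA section", pos + 7, False
    # Nothing is consumed; the next character starts the bogus comment.
    return "bogus comment", pos, True
```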
+
+      8.2.4.18 Comment start state
+
+   Consume the next input character:
+
+   U+002D HYPHEN-MINUS (-)
+          Switch to the comment start dash state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Emit the comment token. Switch to the data state.
+
+   EOF
+          Parse error. Emit the comment token. Reconsume the EOF character
+          in the data state.
+
+   Anything else
+          Append the input character to the comment token's data. Switch
+          to the comment state.
+
+      8.2.4.19 Comment start dash state
+
+   Consume the next input character:
+
+   U+002D HYPHEN-MINUS (-)
+          Switch to the comment end state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Emit the comment token. Switch to the data state.
+
+   EOF
+          Parse error. Emit the comment token. Reconsume the EOF character
+          in the data state.
+
+   Anything else
+          Append a U+002D HYPHEN-MINUS (-) character and the input
+          character to the comment token's data. Switch to the comment
+          state.
+
+      8.2.4.20 Comment state
+
+   Consume the next input character:
+
+   U+002D HYPHEN-MINUS (-)
+          Switch to the comment end dash state.
+
+   EOF
+          Parse error. Emit the comment token. Reconsume the EOF character
+          in the data state.
+
+   Anything else
+          Append the input character to the comment token's data. Stay in
+          the comment state.
+
+      8.2.4.21 Comment end dash state
+
+   Consume the next input character:
+
+   U+002D HYPHEN-MINUS (-)
+          Switch to the comment end state.
+
+   EOF
+          Parse error. Emit the comment token. Reconsume the EOF character
+          in the data state.
+
+   Anything else
+          Append a U+002D HYPHEN-MINUS (-) character and the input
+          character to the comment token's data. Switch to the comment
+          state.
+
+      8.2.4.22 Comment end state
+
+   Consume the next input character:
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the comment token. Switch to the data state.
+
+   U+002D HYPHEN-MINUS (-)
+          Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
+          comment token's data. Stay in the comment end state.
+
+   EOF
+          Parse error. Emit the comment token. Reconsume the EOF character
+          in the data state.
+
+   Anything else
+          Parse error. Append two U+002D HYPHEN-MINUS (-) characters and
+          the input character to the comment token's data. Switch to the
+          comment state.
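
The five comment states (comment start through comment end) can be exercised together with a small loop. This is a sketch under simplifying assumptions: the input is a complete string, parse errors are merely counted, and the function name and return shape are made up for the example.

```python
def read_comment(text, pos=0):
    """text[pos:] is the input right after '<!--'.

    Returns (comment_data, parse_error_count, next_pos).
    """
    state = "comment start"
    data = []
    errors = 0
    while True:
        if pos >= len(text):
            # EOF is handled identically in all five states: parse error,
            # emit the comment, leave EOF for the data state.
            return "".join(data), errors + 1, pos
        ch = text[pos]
        pos += 1
        if state == "comment start":
            if ch == "-":
                state = "comment start dash"
            elif ch == ">":
                return "".join(data), errors + 1, pos
            else:
                data.append(ch)
                state = "comment"
        elif state == "comment start dash":
            if ch == "-":
                state = "comment end"
            elif ch == ">":
                return "".join(data), errors + 1, pos
            else:
                data.append("-" + ch)
                state = "comment"
        elif state == "comment":
            if ch == "-":
                state = "comment end dash"
            else:
                data.append(ch)
        elif state == "comment end dash":
            if ch == "-":
                state = "comment end"
            else:
                data.append("-" + ch)
                state = "comment"
        else:  # comment end
            if ch == ">":
                return "".join(data), errors, pos
            errors += 1
            if ch == "-":
                data.append("-")
            else:
                data.append("--" + ch)
                state = "comment"
```

Tracing `a--b-->` through the loop shows the recovery rule in the comment end state: the stray `--` is a parse error but is kept in the comment data.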
+
+      8.2.4.23 DOCTYPE state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Switch to the before DOCTYPE name state.
+
+   Anything else
+          Parse error. Reconsume the current character in the before
+          DOCTYPE name state.
+
+      8.2.4.24 Before DOCTYPE name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the before DOCTYPE name state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Create a new DOCTYPE token. Set its force-quirks
+          flag to on. Emit the token. Switch to the data state.
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Create a new DOCTYPE token. Set the token's name to the
+          lowercase version of the input character (add 0x0020 to the
+          character's code point). Switch to the DOCTYPE name state.
+
+   EOF
+          Parse error. Create a new DOCTYPE token. Set its force-quirks
+          flag to on. Emit the token. Reconsume the EOF character in the
+          data state.
+
+   Anything else
+          Create a new DOCTYPE token. Set the token's name to the current
+          input character. Switch to the DOCTYPE name state.
+
+      8.2.4.25 DOCTYPE name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Switch to the after DOCTYPE name state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current DOCTYPE token. Switch to the data state.
+
+   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
+          Append the lowercase version of the input character (add 0x0020
+          to the character's code point) to the current DOCTYPE token's
+          name. Stay in the DOCTYPE name state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Append the current input character to the current DOCTYPE
+          token's name. Stay in the DOCTYPE name state.
+
+      8.2.4.26 After DOCTYPE name state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the after DOCTYPE name state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          If the six characters starting from the current input character
+          are an ASCII case-insensitive match for the word "PUBLIC", then
+          consume those characters and switch to the before DOCTYPE public
+          identifier state.
+
+          Otherwise, if the six characters starting from the current input
+          character are an ASCII case-insensitive match for the word
+          "SYSTEM", then consume those characters and switch to the before
+          DOCTYPE system identifier state.
+
+          Otherwise, this is a parse error. Set the DOCTYPE token's
+          force-quirks flag to on. Switch to the bogus DOCTYPE state.
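
The "anything else" branch of the after DOCTYPE name state is a six-character keyword lookahead. A minimal sketch, with hypothetical names: `pos` is the index of the current input character, and the third element of the result says whether the force-quirks flag should be set to on.

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def after_doctype_name_keyword(text, pos):
    """Return (next_state, next_pos, set_force_quirks)."""
    # ASCII case-insensitive comparison of the six characters at pos.
    word = text[pos:pos + 6].translate(_ASCII_LOWER)
    if word == "public":
        return "before DOCTYPE public identifier", pos + 6, False
    if word == "system":
        return "before DOCTYPE system identifier", pos + 6, False
    # Parse error: only the current character has been consumed.
    return "bogus DOCTYPE", pos + 1, True
```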
+
+      8.2.4.27 Before DOCTYPE public identifier state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the before DOCTYPE public identifier state.
+
+   U+0022 QUOTATION MARK (")
+          Set the DOCTYPE token's public identifier to the empty string
+          (not missing), then switch to the DOCTYPE public identifier
+          (double-quoted) state.
+
+   U+0027 APOSTROPHE (')
+          Set the DOCTYPE token's public identifier to the empty string
+          (not missing), then switch to the DOCTYPE public identifier
+          (single-quoted) state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Switch to the bogus DOCTYPE state.
+
+      8.2.4.28 DOCTYPE public identifier (double-quoted) state
+
+   Consume the next input character:
+
+   U+0022 QUOTATION MARK (")
+          Switch to the after DOCTYPE public identifier state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Append the current input character to the current DOCTYPE
+          token's public identifier. Stay in the DOCTYPE public identifier
+          (double-quoted) state.
+
+      8.2.4.29 DOCTYPE public identifier (single-quoted) state
+
+   Consume the next input character:
+
+   U+0027 APOSTROPHE (')
+          Switch to the after DOCTYPE public identifier state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Append the current input character to the current DOCTYPE
+          token's public identifier. Stay in the DOCTYPE public identifier
+          (single-quoted) state.
+
+      8.2.4.30 After DOCTYPE public identifier state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the after DOCTYPE public identifier state.
+
+   U+0022 QUOTATION MARK (")
+          Set the DOCTYPE token's system identifier to the empty string
+          (not missing), then switch to the DOCTYPE system identifier
+          (double-quoted) state.
+
+   U+0027 APOSTROPHE (')
+          Set the DOCTYPE token's system identifier to the empty string
+          (not missing), then switch to the DOCTYPE system identifier
+          (single-quoted) state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Switch to the bogus DOCTYPE state.
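+
+   The case analysis of the after DOCTYPE public identifier state can be
+   read as a pure function from one input character to a set of
+   actions. The sketch below is non-normative; the State enum, the Step
+   type, and the use of -1 as an EOF marker are assumptions made for
+   illustration only.
+
```java
/** Illustrative sketch only; class and member names are hypothetical. */
final class AfterPublicIdentifierSketch {
    enum State {
        AFTER_PUBLIC_ID, SYSTEM_ID_DOUBLE_QUOTED, SYSTEM_ID_SINGLE_QUOTED, DATA, BOGUS_DOCTYPE
    }

    static final int EOF = -1; // stand-in for the end-of-file marker

    static final class Step {
        final State next;
        final boolean parseError;
        final boolean forceQuirks;   // set the force-quirks flag to on
        final boolean clearSystemId; // set the system identifier to "" (not missing)
        final boolean emit;          // emit the DOCTYPE token

        Step(State next, boolean parseError, boolean forceQuirks,
             boolean clearSystemId, boolean emit) {
            this.next = next;
            this.parseError = parseError;
            this.forceQuirks = forceQuirks;
            this.clearSystemId = clearSystemId;
            this.emit = emit;
        }
    }

    /** Returns the actions for one consumed character (or EOF). */
    static Step step(int c) {
        switch (c) {
            case '\t': case '\n': case '\f': case ' ':
                return new Step(State.AFTER_PUBLIC_ID, false, false, false, false);
            case '"':
                return new Step(State.SYSTEM_ID_DOUBLE_QUOTED, false, false, true, false);
            case '\'':
                return new Step(State.SYSTEM_ID_SINGLE_QUOTED, false, false, true, false);
            case '>':
                return new Step(State.DATA, false, false, false, true);
            case EOF:
                return new Step(State.DATA, true, true, false, true);
            default:
                return new Step(State.BOGUS_DOCTYPE, true, true, false, false);
        }
    }
}
```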
+
+      8.2.4.31 Before DOCTYPE system identifier state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the before DOCTYPE system identifier state.
+
+   U+0022 QUOTATION MARK (")
+          Set the DOCTYPE token's system identifier to the empty string
+          (not missing), then switch to the DOCTYPE system identifier
+          (double-quoted) state.
+
+   U+0027 APOSTROPHE (')
+          Set the DOCTYPE token's system identifier to the empty string
+          (not missing), then switch to the DOCTYPE system identifier
+          (single-quoted) state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Switch to the bogus DOCTYPE state.
+
+      8.2.4.32 DOCTYPE system identifier (double-quoted) state
+
+   Consume the next input character:
+
+   U+0022 QUOTATION MARK (")
+          Switch to the after DOCTYPE system identifier state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Append the current input character to the current DOCTYPE
+          token's system identifier. Stay in the DOCTYPE system identifier
+          (double-quoted) state.
+
+      8.2.4.33 DOCTYPE system identifier (single-quoted) state
+
+   Consume the next input character:
+
+   U+0027 APOSTROPHE (')
+          Switch to the after DOCTYPE system identifier state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Append the current input character to the current DOCTYPE
+          token's system identifier. Stay in the DOCTYPE system identifier
+          (single-quoted) state.
+
+      8.2.4.34 After DOCTYPE system identifier state
+
+   Consume the next input character:
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+          Stay in the after DOCTYPE system identifier state.
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the current DOCTYPE token. Switch to the data state.
+
+   EOF
+          Parse error. Set the DOCTYPE token's force-quirks flag to on.
+          Emit that DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Parse error. Switch to the bogus DOCTYPE state. (This does not
+          set the DOCTYPE token's force-quirks flag to on.)
+
+      8.2.4.35 Bogus DOCTYPE state
+
+   Consume the next input character:
+
+   U+003E GREATER-THAN SIGN (>)
+          Emit the DOCTYPE token. Switch to the data state.
+
+   EOF
+          Emit the DOCTYPE token. Reconsume the EOF character in the data
+          state.
+
+   Anything else
+          Stay in the bogus DOCTYPE state.
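+
+   Because the bogus DOCTYPE state ignores everything up to the next
+   U+003E GREATER-THAN SIGN or EOF, it amounts to a simple scan. A
+   minimal sketch, assuming the input is held in a String and that the
+   caller emits the DOCTYPE token afterwards; BogusDoctypeSketch and
+   skip are hypothetical names.
+
```java
/** Illustrative sketch only; class and member names are hypothetical. */
final class BogusDoctypeSketch {
    /**
     * Returns the index of the first character the data state should see:
     * just past the next '>', or the end of input when EOF comes first.
     * The force-quirks flag is not touched in this state.
     */
    static int skip(String input, int start) {
        int gt = input.indexOf('>', start);
        return gt < 0 ? input.length() : gt + 1;
    }
}
```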
+
+      8.2.4.36 CDATA section state
+
+   (This can only happen if the content model flag is set to the PCDATA
+   state, and is unrelated to the content model flag's CDATA state.)
+
+   Consume every character up to the next occurrence of the three
+   character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
+   BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
+   whichever comes first. Emit a series of character tokens consisting of
+   all the characters consumed except the matching three character
+   sequence at the end (if one was found before the end of the file).
+
+   Switch to the data state.
+
+   If the end of the file was reached, reconsume the EOF character.
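+
+   The CDATA section state can be sketched as a search for the
+   terminating sequence. This is a non-normative illustration; the
+   class CdataSectionSketch and its Result type are assumptions, and
+   the consumed text is returned as one string rather than as a series
+   of character tokens.
+
```java
/** Illustrative sketch only; class and member names are hypothetical. */
final class CdataSectionSketch {
    static final class Result {
        final String text;        // characters to emit as character tokens
        final int position;       // index where the data state resumes
        final boolean reachedEof; // true when no "]]>" was found

        Result(String text, int position, boolean reachedEof) {
            this.text = text;
            this.position = position;
            this.reachedEof = reachedEof;
        }
    }

    /** Consumes from {@code start} up to the next "]]>" or the end of input. */
    static Result consume(String input, int start) {
        int end = input.indexOf("]]>", start);
        if (end < 0) {
            // EOF came first: everything is emitted and EOF is reconsumed.
            return new Result(input.substring(start), input.length(), true);
        }
        // The three-character terminator is consumed but not emitted.
        return new Result(input.substring(start, end), end + 3, false);
    }
}
```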
+
+      8.2.4.37 Tokenizing character references
+
+   This section defines how to consume a character reference. This
+   definition is used when parsing character references in text and in
+   attributes.
+
+   The behavior depends on the identity of the next character (the one
+   immediately after the U+0026 AMPERSAND character):
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+   U+003C LESS-THAN SIGN
+   U+0026 AMPERSAND
+   EOF
+   The additional allowed character, if there is one
+          Not a character reference. No characters are consumed, and
+          nothing is returned. (This is not an error, either.)
+
+   U+0023 NUMBER SIGN (#)
+          Consume the U+0023 NUMBER SIGN.
+
+          The behavior further depends on the character after the U+0023
+          NUMBER SIGN:
+
+        U+0078 LATIN SMALL LETTER X
+        U+0058 LATIN CAPITAL LETTER X
+                Consume the X.
+
+                Follow the steps below, but using the range of characters
+                U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061
+                LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER
+                F, and U+0041 LATIN CAPITAL LETTER A through to U+0046
+                LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).
+
+                When it comes to interpreting the number, interpret it as
+                a hexadecimal number.
+
+        Anything else
+                Follow the steps below, but using the range of characters
+                U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just
+                0-9).
+
+                When it comes to interpreting the number, interpret it as
+                a decimal number.
+
+          Consume as many characters as match the range of characters
+          given above.
+
+          If no characters match the range, then don't consume any
+          characters (and unconsume the U+0023 NUMBER SIGN character and,
+          if appropriate, the X character). This is a parse error; nothing
+          is returned.
+
+          Otherwise, if the next character is a U+003B SEMICOLON, consume
+          that too. If it isn't, there is a parse error.
+
+          If one or more characters match the range, then take them all
+          and interpret the string of characters as a number (either
+          hexadecimal or decimal as appropriate).
+
+          If that number is one of the numbers in the first column of the
+          following table, then this is a parse error. Find the row with
+          that number in the first column, and return a character token
+          for the Unicode character given in the second column of that
+          row.
+
+          Number Unicode character
+          0x0D   U+000A LINE FEED (LF)
+          0x80   U+20AC EURO SIGN ('€')
+          0x81   U+FFFD REPLACEMENT CHARACTER
+          0x82   U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
+          0x83   U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
+          0x84   U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
+          0x85   U+2026 HORIZONTAL ELLIPSIS ('…')
+          0x86   U+2020 DAGGER ('†')
+          0x87   U+2021 DOUBLE DAGGER ('‡')
+          0x88   U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
+          0x89   U+2030 PER MILLE SIGN ('‰')
+          0x8A   U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
+          0x8B   U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
+          0x8C   U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
+          0x8D   U+FFFD REPLACEMENT CHARACTER
+          0x8E   U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
+          0x8F   U+FFFD REPLACEMENT CHARACTER
+          0x90   U+FFFD REPLACEMENT CHARACTER
+          0x91   U+2018 LEFT SINGLE QUOTATION MARK ('‘')
+          0x92   U+2019 RIGHT SINGLE QUOTATION MARK ('’')
+          0x93   U+201C LEFT DOUBLE QUOTATION MARK ('“')
+          0x94   U+201D RIGHT DOUBLE QUOTATION MARK ('”')
+          0x95   U+2022 BULLET ('•')
+          0x96   U+2013 EN DASH ('–')
+          0x97   U+2014 EM DASH ('—')
+          0x98   U+02DC SMALL TILDE ('˜')
+          0x99   U+2122 TRADE MARK SIGN ('™')
+          0x9A   U+0161 LATIN SMALL LETTER S WITH CARON ('š')
+          0x9B   U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
+          0x9C   U+0153 LATIN SMALL LIGATURE OE ('œ')
+          0x9D   U+FFFD REPLACEMENT CHARACTER
+          0x9E   U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
+          0x9F   U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
+
+          Otherwise, if the number is in the range 0x0000 to 0x0008,
+          0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to
+          0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+          0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+          0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+          0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+          0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+          0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
+          a parse error; return a character token for U+FFFD REPLACEMENT
+          CHARACTER instead.
+
+          Otherwise, return a character token for the Unicode character
+          whose code point is that number.
+
+   Anything else
+          Consume the maximum number of characters possible, with the
+          consumed characters matching one of the identifiers in the first
+          column of the named character references table (in a
+          case-sensitive manner).
+
+          If no match can be made, then this is a parse error. No
+          characters are consumed, and nothing is returned.
+
+          If the last character matched is not a U+003B SEMICOLON (;),
+          there is a parse error.
+
+          If the character reference is being consumed as part of an
+          attribute, and the last character matched is not a U+003B
+          SEMICOLON (;), and the next character is in the range U+0030
+          DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A
+          to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A
+          to U+007A LATIN SMALL LETTER Z, then, for historical reasons,
+          all the characters that were matched after the U+0026 AMPERSAND
+          (&) must be unconsumed, and nothing is returned.
+
+          Otherwise, return a character token for the character
+          corresponding to the character reference name (as given by the
+          second column of the named character references table).
+
+          For example, if the markup contains "I'm &notit; I tell you",
+          the character reference is parsed as "not", as in "I'm ¬it; I
+          tell you". But if the markup were "I'm &notin; I tell you", the
+          character reference would be parsed as "notin;", resulting in
+          "I'm ∉ I tell you".
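+
+   The two branches above can be sketched as follows. This is a
+   minimal, non-normative illustration: CharacterReferenceSketch,
+   resolveNumeric and matchNamed are hypothetical names, the NAMED map
+   is a three-entry excerpt standing in for the full named character
+   references table, and resolveNumeric assumes the digits have
+   already been parsed into a non-negative number.
+
```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; class and member names are hypothetical. */
final class CharacterReferenceSketch {
    // Replacements for 0x80..0x9F, in the order of the table above.
    private static final int[] C1 = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };

    // Tiny excerpt standing in for the named character references table.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("not", "\u00AC");
        NAMED.put("not;", "\u00AC");
        NAMED.put("notin;", "\u2209");
    }

    /** Maps an already-parsed number to the code point of the returned token. */
    static int resolveNumeric(long n) {
        if (n == 0x0D) return 0x0A;                          // table row 0x0D
        if (n >= 0x80 && n <= 0x9F) return C1[(int) (n - 0x80)];
        // 0x80..0x9F was handled above, so only 0x7F remains of 0x7F..0x9F.
        boolean disallowed = n <= 0x08 || n == 0x0B || (n >= 0x0E && n <= 0x1F)
                || n == 0x7F || (n >= 0xD800 && n <= 0xDFFF)
                || (n >= 0xFDD0 && n <= 0xFDEF)
                || (n & 0xFFFE) == 0xFFFE                    // 0xFFFE, 0xFFFF, 0x1FFFE, ...
                || n > 0x10FFFF;
        return disallowed ? 0xFFFD : (int) n;
    }

    /** Returns the replacement text, or null when nothing is returned. */
    static String matchNamed(String input, int start, boolean inAttribute) {
        String best = null;
        for (String name : NAMED.keySet()) {
            // Keep the longest name that matches at the position after '&'.
            if (input.startsWith(name, start)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        if (best == null) return null; // parse error, nothing consumed
        int after = start + best.length();
        boolean noSemicolon = !best.endsWith(";"); // a parse error in either context
        if (inAttribute && noSemicolon && after < input.length()
                && isAsciiAlphanumeric(input.charAt(after))) {
            return null; // historical rule: unconsume everything after '&'
        }
        return NAMED.get(best);
    }

    private static boolean isAsciiAlphanumeric(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
}
```
+
+   In this sketch matchNamed receives the text that follows the U+0026
+   AMPERSAND; a real tokeniser would also report the parse errors and
+   track how many characters were consumed.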