author | Matt A. Tobin <email@mattatobin.com> | 2020-01-15 14:56:04 -0500 |
---|---|---|
committer | Matt A. Tobin <email@mattatobin.com> | 2020-01-15 14:56:04 -0500 |
commit | 6168dbe21f5f83b906e562ea0ab232d499b275a6 (patch) | |
tree | 658a4b27554c85ebcaad655fc83f2c2bb99e8e80 /parser/html/java/htmlparser/doc/tokenization.txt | |
parent | 09314667a692fedff8564fc347c8a3663474faa6 (diff) | |
Add java htmlparser sources that match the original 52-level state
https://hg.mozilla.org/projects/htmlparser/
Commit: abe62ab2a9b69ccb3b5d8a231ec1ae11154c571d
Diffstat (limited to 'parser/html/java/htmlparser/doc/tokenization.txt')
-rw-r--r-- | parser/html/java/htmlparser/doc/tokenization.txt | 1147 |
1 files changed, 1147 insertions, 0 deletions
diff --git a/parser/html/java/htmlparser/doc/tokenization.txt b/parser/html/java/htmlparser/doc/tokenization.txt new file mode 100644 index 000000000..21cd7f6e2 --- /dev/null +++ b/parser/html/java/htmlparser/doc/tokenization.txt @@ -0,0 +1,1147 @@ + #8.2 Parsing HTML documents Table of contents 8.2.5 Tree construction + + WHATWG + +HTML 5 + +Draft Recommendation — 7 February 2009 + + ← 8.2 Parsing HTML documents – Table of contents – 8.2.5 Tree + construction → + + 8.2.4 Tokenization + + Implementations must act as if they used the following state machine to + tokenise HTML. The state machine must start in the data state. Most + states consume a single character, which may have various side-effects, + and either switches the state machine to a new state to reconsume the + same character, or switches it to a new state (to consume the next + character), or repeats the same state (to consume the next character). + Some states have more complicated behavior and can consume several + characters before switching to another state. + + The exact behavior of certain states depends on a content model flag + that is set after certain tokens are emitted. The flag has several + states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in + the PCDATA state. In the RCDATA and CDATA states, a further escape flag + is used to control the behavior of the tokeniser. It is either true or + false, and initially must be set to the false state. The insertion mode + and the stack of open elements also affects tokenization. + + The output of the tokenization step is a series of zero or more of the + following tokens: DOCTYPE, start tag, end tag, comment, character, + end-of-file. DOCTYPE tokens have a name, a public identifier, a system + identifier, and a force-quirks flag. 
When a DOCTYPE token is created, + its name, public identifier, and system identifier must be marked as + missing (which is a distinct state from the empty string), and the + force-quirks flag must be set to off (its other state is on). Start and + end tag tokens have a tag name, a self-closing flag, and a list of + attributes, each of which has a name and a value. When a start or end + tag token is created, its self-closing flag must be unset (its other + state is that it be set), and its attributes list must be empty. + Comment and character tokens have data. + + When a token is emitted, it must immediately be handled by the tree + construction stage. The tree construction stage can affect the state of + the content model flag, and can insert additional characters into the + stream. (For example, the script element can result in scripts + executing and using the dynamic markup insertion APIs to insert + characters into the stream being tokenised.) + + When a start tag token is emitted with its self-closing flag set, if + the flag is not acknowledged when it is processed by the tree + construction stage, that is a parse error. + + When an end tag token is emitted, the content model flag must be + switched to the PCDATA state. + + When an end tag token is emitted with attributes, that is a parse + error. + + When an end tag token is emitted with its self-closing flag set, that + is a parse error. + + Before each step of the tokeniser, the user agent must first check the + parser pause flag. If it is true, then the tokeniser must abort the + processing of any nested invocations of the tokeniser, yielding control + back to the caller. If it is false, then the user agent may then check + to see if either one of the scripts in the list of scripts that will + execute as soon as possible or the first script in the list of scripts + that will execute asynchronously, has completed loading. If one has, + then it must be executed and removed from its list. 
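The token shapes described above can be sketched in a few lines. This is only an illustration, not the htmlparser implementation: the class and function names (DoctypeToken, TagToken, end_tag_parse_errors) are hypothetical, and None is assumed here as the way to represent "missing" so that it stays distinct from the empty string.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DoctypeToken:
    # None stands for "missing", which the text distinguishes from "".
    name: Optional[str] = None
    public_identifier: Optional[str] = None
    system_identifier: Optional[str] = None
    force_quirks: bool = False  # "off" when the token is created


@dataclass
class TagToken:
    kind: str                    # "start" or "end"
    tag_name: str = ""
    self_closing: bool = False   # unset when the token is created
    attributes: list = field(default_factory=list)  # [name, value] pairs


def end_tag_parse_errors(token):
    """Collect the parse errors the text attaches to emitting an end tag."""
    errors = []
    if token.kind == "end" and token.attributes:
        errors.append("end tag with attributes")
    if token.kind == "end" and token.self_closing:
        errors.append("end tag with self-closing flag")
    return errors
```

Keeping "missing" as None rather than "" matters later, because the DOCTYPE identifier states explicitly set an identifier to the empty string "(not missing)" once a quote has been seen.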
+ + The tokeniser state machine consists of the states defined in the + following subsections. + + 8.2.4.1 Data state + + Consume the next input character: + + U+0026 AMPERSAND (&) + When the content model flag is set to one of the PCDATA or + RCDATA states and the escape flag is false: switch to the + character reference data state. + Otherwise: treat it as per the "anything else" entry below. + + U+002D HYPHEN-MINUS (-) + If the content model flag is set to either the RCDATA state or + the CDATA state, and the escape flag is false, and there are at + least three characters before this one in the input stream, and + the last four characters in the input stream, including this + one, are U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D + HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the + escape flag to true. + + In any case, emit the input character as a character token. Stay + in the data state. + + U+003C LESS-THAN SIGN (<) + When the content model flag is set to the PCDATA state: switch + to the tag open state. + When the content model flag is set to either the RCDATA state or + the CDATA state, and the escape flag is false: switch to the tag + open state. + Otherwise: treat it as per the "anything else" entry below. + + U+003E GREATER-THAN SIGN (>) + If the content model flag is set to either the RCDATA state or + the CDATA state, and the escape flag is true, and the last three + characters in the input stream including this one are U+002D + HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN + ("-->"), set the escape flag to false. + + In any case, emit the input character as a character token. Stay + in the data state. + + EOF + Emit an end-of-file token. + + Anything else + Emit the input character as a character token. Stay in the data + state. + + 8.2.4.2 Character reference data state + + (This cannot happen if the content model flag is set to the CDATA + state.) 
+ + Attempt to consume a character reference, with no additional allowed + character. + + If nothing is returned, emit a U+0026 AMPERSAND character token. + + Otherwise, emit the character token that was returned. + + Finally, switch to the data state. + + 8.2.4.3 Tag open state + + The behavior of this state depends on the content model flag. + + If the content model flag is set to the RCDATA or CDATA states + Consume the next input character. If it is a U+002F SOLIDUS (/) + character, switch to the close tag open state. Otherwise, emit a + U+003C LESS-THAN SIGN character token and reconsume the current + input character in the data state. + + If the content model flag is set to the PCDATA state + Consume the next input character: + + U+0021 EXCLAMATION MARK (!) + Switch to the markup declaration open state. + + U+002F SOLIDUS (/) + Switch to the close tag open state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL + LETTER Z + Create a new start tag token, set its tag name to the + lowercase version of the input character (add 0x0020 to + the character's code point), then switch to the tag name + state. (Don't emit the token yet; further details will be + filled in before it is emitted.) + + U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z + Create a new start tag token, set its tag name to the + input character, then switch to the tag name state. (Don't + emit the token yet; further details will be filled in + before it is emitted.) + + U+003E GREATER-THAN SIGN (>) + Parse error. Emit a U+003C LESS-THAN SIGN character token + and a U+003E GREATER-THAN SIGN character token. Switch to + the data state. + + U+003F QUESTION MARK (?) + Parse error. Switch to the bogus comment state. + + Anything else + Parse error. Emit a U+003C LESS-THAN SIGN character token + and reconsume the current input character in the data + state. 
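As a rough sketch of the PCDATA branch of the tag open state (8.2.4.3), the per-character dispatch could look like the following. The function name and the shape of the returned tuple are assumptions made for this example; None is used to stand for end of input, which the state does not list separately and which therefore falls under "anything else".

```python
def tag_open_pcdata(ch):
    """Classify one character consumed in the tag open state (PCDATA).

    `ch` is a single character, or None for end of input.
    Returns (next_state, parse_error, new_tag_name, emitted_chars).
    """
    if ch == "!":
        return ("markup declaration open", False, None, "")
    if ch == "/":
        return ("close tag open", False, None, "")
    if ch is not None and "A" <= ch <= "Z":
        # Lowercase by adding 0x0020 to the code point.
        return ("tag name", False, chr(ord(ch) + 0x20), "")
    if ch is not None and "a" <= ch <= "z":
        return ("tag name", False, ch, "")
    if ch == ">":
        return ("data", True, None, "<>")
    if ch == "?":
        return ("bogus comment", True, None, "")
    # Anything else: emit "<" and let the data state reconsume `ch`.
    return ("data", True, None, "<")
```

For example, tag_open_pcdata("D") starts a tag named "d", while tag_open_pcdata("1") reports a parse error and emits a literal "<".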
+ + 8.2.4.4 Close tag open state + + If the content model flag is set to the RCDATA or CDATA states but no + start tag token has ever been emitted by this instance of the tokeniser + (fragment case), or, if the content model flag is set to the RCDATA or + CDATA states and the next few characters do not match the tag name of + the last start tag token emitted (compared in an ASCII case-insensitive + manner), or if they do but they are not immediately followed by one of + the following characters: + * U+0009 CHARACTER TABULATION + * U+000A LINE FEED (LF) + * U+000C FORM FEED (FF) + * U+0020 SPACE + * U+003E GREATER-THAN SIGN (>) + * U+002F SOLIDUS (/) + * EOF + + ...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS + character token, and switch to the data state to process the next input + character. + + Otherwise, if the content model flag is set to the PCDATA state, or if + the next few characters do match that tag name, consume the next input + character: + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Create a new end tag token, set its tag name to the lowercase + version of the input character (add 0x0020 to the character's + code point), then switch to the tag name state. (Don't emit the + token yet; further details will be filled in before it is + emitted.) + + U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z + Create a new end tag token, set its tag name to the input + character, then switch to the tag name state. (Don't emit the + token yet; further details will be filled in before it is + emitted.) + + U+003E GREATER-THAN SIGN (>) + Parse error. Switch to the data state. + + EOF + Parse error. Emit a U+003C LESS-THAN SIGN character token and a + U+002F SOLIDUS character token. Reconsume the EOF character in + the data state. + + Anything else + Parse error. Switch to the bogus comment state. 
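The RCDATA/CDATA check at the top of the close tag open state (8.2.4.4) can be illustrated with a small helper. This is a sketch under stated assumptions: the names are hypothetical, `rest` is assumed to be the input immediately after "</", and reaching the end of `rest` right after the name is treated as EOF.

```python
import string

# Fold only A-Z, as an ASCII case-insensitive comparison requires.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Characters that may follow a matching tag name (EOF is handled separately).
TERMINATORS = {"\t", "\n", "\f", " ", ">", "/"}


def matches_last_start_tag(rest, last_start_tag_name):
    """Return True if `rest` begins with an end tag for the last start tag.

    `last_start_tag_name` is None when no start tag has been emitted
    (the fragment case), in which case nothing can match.
    """
    if last_start_tag_name is None:
        return False
    n = len(last_start_tag_name)
    candidate = rest[:n].translate(_ASCII_LOWER)
    if candidate != last_start_tag_name.translate(_ASCII_LOWER):
        return False
    # The name must be followed by EOF or one of the listed characters.
    return len(rest) == n or rest[n] in TERMINATORS
```

When the helper returns False in the RCDATA or CDATA states, the tokeniser emits "<" and "/" as character tokens and returns to the data state; otherwise it goes on to consume the next character as described above.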
+ + 8.2.4.5 Tag name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Append the lowercase version of the current input character (add + 0x0020 to the character's code point) to the current tag token's + tag name. Stay in the tag name state. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Append the current input character to the current tag token's + tag name. Stay in the tag name state. + + 8.2.4.6 Before attribute name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Start a new attribute in the current tag token. Set that + attribute's name to the lowercase version of the current input + character (add 0x0020 to the character's code point), and its + value to the empty string. Switch to the attribute name state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + U+003D EQUALS SIGN (=) + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Start a new attribute in the current tag token. Set that + attribute's name to the current input character, and its value + to the empty string. Switch to the attribute name state. 
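To make the tag name state (8.2.4.5) concrete, here is a minimal sketch that runs it over a string. The function names and the return tuple are assumptions for illustration only; a real tokeniser would also emit the tag token and handle reconsumption of EOF itself.

```python
def ascii_lower(ch):
    """Lowercase A-Z by adding 0x20; leave every other character alone."""
    return chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch


def run_tag_name_state(text, pos, name):
    """Run the tag name state from text[pos], extending `name`.

    Returns (name, next_state, pos, parse_error), where pos is the index
    of the next unconsumed character.
    """
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch in "\t\n\f ":
            return name, "before attribute name", pos, False
        if ch == "/":
            return name, "self-closing start tag", pos, False
        if ch == ">":
            return name, "data", pos, False  # the tag token is emitted here
        name += ascii_lower(ch)
    # EOF: parse error; the tag is emitted and EOF goes to the data state.
    return name, "data", pos, True
```

Given the remainder "IV class" after a leading "d", the loop yields the name "div" and hands over to the before attribute name state.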
+ + 8.2.4.7 Attribute name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the after attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003D EQUALS SIGN (=) + Switch to the before attribute value state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Append the lowercase version of the current input character (add + 0x0020 to the character's code point) to the current attribute's + name. Stay in the attribute name state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Append the current input character to the current attribute's + name. Stay in the attribute name state. + + When the user agent leaves the attribute name state (and before + emitting the tag token, if appropriate), the complete attribute's name + must be compared to the other attributes on the same token; if there is + already an attribute on the token with the exact same name, then this + is a parse error and the new attribute must be dropped, along with the + value that gets associated with it (if any). + + 8.2.4.8 After attribute name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003D EQUALS SIGN (=) + Switch to the before attribute value state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. 
+ + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Start a new attribute in the current tag token. Set that + attribute's name to the lowercase version of the current input + character (add 0x0020 to the character's code point), and its + value to the empty string. Switch to the attribute name state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Start a new attribute in the current tag token. Set that + attribute's name to the current input character, and its value + to the empty string. Switch to the attribute name state. + + 8.2.4.9 Before attribute value state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before attribute value state. + + U+0022 QUOTATION MARK (") + Switch to the attribute value (double-quoted) state. + + U+0026 AMPERSAND (&) + Switch to the attribute value (unquoted) state and reconsume + this input character. + + U+0027 APOSTROPHE (') + Switch to the attribute value (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Emit the current tag token. Switch to the data + state. + + U+003D EQUALS SIGN (=) + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Switch to the attribute value (unquoted) state. + + 8.2.4.10 Attribute value (double-quoted) state + + Consume the next input character: + + U+0022 QUOTATION MARK (") + Switch to the after attribute value (quoted) state. 
+ + U+0026 AMPERSAND (&) + Switch to the character reference in attribute value state, with + the additional allowed character being U+0022 QUOTATION MARK + ("). + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Stay in the attribute value (double-quoted) state. + + 8.2.4.11 Attribute value (single-quoted) state + + Consume the next input character: + + U+0027 APOSTROPHE (') + Switch to the after attribute value (quoted) state. + + U+0026 AMPERSAND (&) + Switch to the character reference in attribute value state, with + the additional allowed character being U+0027 APOSTROPHE ('). + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Stay in the attribute value (single-quoted) state. + + 8.2.4.12 Attribute value (unquoted) state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before attribute name state. + + U+0026 AMPERSAND (&) + Switch to the character reference in attribute value state, with + no additional allowed character. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + U+003D EQUALS SIGN (=) + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Stay in the attribute value (unquoted) state. + + 8.2.4.13 Character reference in attribute value state + + Attempt to consume a character reference. + + If nothing is returned, append a U+0026 AMPERSAND character to the + current attribute's value. 
+ + Otherwise, append the returned character token to the current + attribute's value. + + Finally, switch back to the attribute value state that you were in when + you were switched into this state. + + 8.2.4.14 After attribute value (quoted) state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Parse error. Reconsume the character in the before attribute + name state. + + 8.2.4.15 Self-closing start tag state + + Consume the next input character: + + U+003E GREATER-THAN SIGN (>) + Set the self-closing flag of the current tag token. Emit the + current tag token. Switch to the data state. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Parse error. Reconsume the character in the before attribute + name state. + + 8.2.4.16 Bogus comment state + + (This can only happen if the content model flag is set to the PCDATA + state.) + + Consume every character up to and including the first U+003E + GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever + comes first. Emit a comment token whose data is the concatenation of + all the characters starting from and including the character that + caused the state machine to switch into the bogus comment state, up to + and including the character immediately before the last consumed + character (i.e. up to the character just before the U+003E or EOF + character). (If the comment was started by the end of the file (EOF), + the token is empty.) + + Switch to the data state. + + If the end of the file was reached, reconsume the EOF character. 
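The bogus comment state (8.2.4.16) amounts to a single search for the first ">". A minimal sketch follows; the function name and return tuple are hypothetical, and `start` is assumed to index the character that caused the switch into this state.

```python
def bogus_comment(text, start):
    """Consume a bogus comment beginning at text[start].

    Returns (comment_data, pos, hit_eof), where pos is the index of the
    next unconsumed character.
    """
    end = text.find(">", start)
    if end == -1:
        # EOF came first: the rest is comment data and EOF is reconsumed.
        return text[start:], len(text), True
    # The ">" is consumed but is not part of the comment data.
    return text[start:end], end + 1, False
```

If `start` is already at the end of the input, the comment data is the empty string, which matches the note that a comment started by EOF is empty.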
+ + 8.2.4.17 Markup declaration open state + + (This can only happen if the content model flag is set to the PCDATA + state.) + + If the next two characters are both U+002D HYPHEN-MINUS (-) characters, + consume those two characters, create a comment token whose data is the + empty string, and switch to the comment start state. + + Otherwise, if the next seven characters are an ASCII case-insensitive + match for the word "DOCTYPE", then consume those characters and switch + to the DOCTYPE state. + + Otherwise, if the insertion mode is "in foreign content" and the + current node is not an element in the HTML namespace and the next seven + characters are an ASCII case-sensitive match for the string "[CDATA[" + (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET + character before and after), then consume those characters and switch + to the CDATA section state (which is unrelated to the content model + flag's CDATA state). + + Otherwise, this is a parse error. Switch to the bogus comment state. + The next character that is consumed, if any, is the first character + that will be in the comment. + + 8.2.4.18 Comment start state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment start dash state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Emit the comment token. Switch to the data state. + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append the input character to the comment token's data. Switch + to the comment state. + + 8.2.4.19 Comment start dash state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment end state + + U+003E GREATER-THAN SIGN (>) + Parse error. Emit the comment token. Switch to the data state. + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. 
+ + Anything else + Append a U+002D HYPHEN-MINUS (-) character and the input + character to the comment token's data. Switch to the comment + state. + + 8.2.4.20 Comment state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment end dash state + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append the input character to the comment token's data. Stay in + the comment state. + + 8.2.4.21 Comment end dash state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment end state + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append a U+002D HYPHEN-MINUS (-) character and the input + character to the comment token's data. Switch to the comment + state. + + 8.2.4.22 Comment end state + + Consume the next input character: + + U+003E GREATER-THAN SIGN (>) + Emit the comment token. Switch to the data state. + + U+002D HYPHEN-MINUS (-) + Parse error. Append a U+002D HYPHEN-MINUS (-) character to the + comment token's data. Stay in the comment end state. + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Parse error. Append two U+002D HYPHEN-MINUS (-) characters and + the input character to the comment token's data. Switch to the + comment state. + + 8.2.4.23 DOCTYPE state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before DOCTYPE name state. + + Anything else + Parse error. Reconsume the current character in the before + DOCTYPE name state. + + 8.2.4.24 Before DOCTYPE name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before DOCTYPE name state. + + U+003E GREATER-THAN SIGN (>) + Parse error. 
Create a new DOCTYPE token. Set its force-quirks + flag to on. Emit the token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Create a new DOCTYPE token. Set the token's name to the + lowercase version of the input character (add 0x0020 to the + character's code point). Switch to the DOCTYPE name state. + + EOF + Parse error. Create a new DOCTYPE token. Set its force-quirks + flag to on. Emit the token. Reconsume the EOF character in the + data state. + + Anything else + Create a new DOCTYPE token. Set the token's name to the current + input character. Switch to the DOCTYPE name state. + + 8.2.4.25 DOCTYPE name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the after DOCTYPE name state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Append the lowercase version of the input character (add 0x0020 + to the character's code point) to the current DOCTYPE token's + name. Stay in the DOCTYPE name state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's name. Stay in the DOCTYPE name state. + + 8.2.4.26 After DOCTYPE name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after DOCTYPE name state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. 
+ + Anything else + If the six characters starting from the current input character + are an ASCII case-insensitive match for the word "PUBLIC", then + consume those characters and switch to the before DOCTYPE public + identifier state. + + Otherwise, if the six characters starting from the current input + character are an ASCII case-insensitive match for the word + "SYSTEM", then consume those characters and switch to the before + DOCTYPE system identifier state. + + Otherwise, this is a parse error. Set the DOCTYPE token's + force-quirks flag to on. Switch to the bogus DOCTYPE state. + + 8.2.4.27 Before DOCTYPE public identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before DOCTYPE public identifier state. + + U+0022 QUOTATION MARK (") + Set the DOCTYPE token's public identifier to the empty string + (not missing), then switch to the DOCTYPE public identifier + (double-quoted) state. + + U+0027 APOSTROPHE (') + Set the DOCTYPE token's public identifier to the empty string + (not missing), then switch to the DOCTYPE public identifier + (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Switch to the bogus DOCTYPE state. + + 8.2.4.28 DOCTYPE public identifier (double-quoted) state + + Consume the next input character: + + U+0022 QUOTATION MARK (") + Switch to the after DOCTYPE public identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. 
Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's public identifier. Stay in the DOCTYPE public identifier + (double-quoted) state. + + 8.2.4.29 DOCTYPE public identifier (single-quoted) state + + Consume the next input character: + + U+0027 APOSTROPHE (') + Switch to the after DOCTYPE public identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's public identifier. Stay in the DOCTYPE public identifier + (single-quoted) state. + + 8.2.4.30 After DOCTYPE public identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after DOCTYPE public identifier state. + + U+0022 QUOTATION MARK (") + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (double-quoted) state. + + U+0027 APOSTROPHE (') + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Switch to the bogus DOCTYPE state. 
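The two quoted public identifier states (8.2.4.28 and 8.2.4.29) differ only in the closing quote character, so one illustrative routine can cover both. The sketch below assumes the opening quote has already been consumed; the function name and return tuple are hypothetical.

```python
def doctype_public_identifier(text, pos, quote):
    """Run a quoted DOCTYPE public identifier state from text[pos].

    `quote` is the closing quote character, either '"' or "'".
    Returns (identifier, force_quirks, next_state, pos).
    """
    identifier = ""  # empty string, not "missing", once a quote was seen
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == quote:
            return identifier, False, "after DOCTYPE public identifier", pos
        if ch == ">":
            # Parse error: the DOCTYPE token is emitted with force-quirks on.
            return identifier, True, "data", pos
        identifier += ch
    # EOF: parse error, force-quirks on, EOF reconsumed in the data state.
    return identifier, True, "data", pos
```

A ">" inside the quotes ends the DOCTYPE early and forces quirks mode, which is why an unterminated identifier never swallows the rest of the document.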
+ + 8.2.4.31 Before DOCTYPE system identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before DOCTYPE system identifier state. + + U+0022 QUOTATION MARK (") + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (double-quoted) state. + + U+0027 APOSTROPHE (') + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Switch to the bogus DOCTYPE state. + + 8.2.4.32 DOCTYPE system identifier (double-quoted) state + + Consume the next input character: + + U+0022 QUOTATION MARK (") + Switch to the after DOCTYPE system identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's system identifier. Stay in the DOCTYPE system identifier + (double-quoted) state. + + 8.2.4.33 DOCTYPE system identifier (single-quoted) state + + Consume the next input character: + + U+0027 APOSTROPHE (') + Switch to the after DOCTYPE system identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. 
+ + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's system identifier. Stay in the DOCTYPE system identifier + (single-quoted) state. + + 8.2.4.34 After DOCTYPE system identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after DOCTYPE system identifier state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Switch to the bogus DOCTYPE state. (This does not + set the DOCTYPE token's force-quirks flag to on.) + + 8.2.4.35 Bogus DOCTYPE state + + Consume the next input character: + + U+003E GREATER-THAN SIGN (>) + Emit the DOCTYPE token. Switch to the data state. + + EOF + Emit the DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Stay in the bogus DOCTYPE state. + + 8.2.4.36 CDATA section state + + (This can only happen if the content model flag is set to the PCDATA + state, and is unrelated to the content model flag's CDATA state.) + + Consume every character up to the next occurrence of the three + character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE + BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF), + whichever comes first. Emit a series of character tokens consisting of + all the characters consumed except the matching three character + sequence at the end (if one was found before the end of the file). + + Switch to the data state. + + If the end of the file was reached, reconsume the EOF character. 
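The CDATA section state (8.2.4.36) can be sketched the same way as a search for the terminating "]]>". The function name and return tuple below are assumptions for illustration; `pos` is assumed to point just past the "<![CDATA[" opener.

```python
def cdata_section(text, pos):
    """Consume a CDATA section body starting at text[pos].

    Returns (characters, pos, hit_eof); `characters` is the text to emit
    as character tokens and pos is the next unconsumed index.
    """
    end = text.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to EOF is emitted as characters.
        return text[pos:], len(text), True
    # The "]]>" terminator is consumed but not emitted.
    return text[pos:end], end + 3, False
```

Note that a lone "]]" that is not followed by ">" is ordinary content, so "a]]b]]>" yields the characters "a]]b".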
+
+   8.2.4.37 Tokenizing character references
+
+   This section defines how to consume a character reference. This
+   definition is used when parsing character references in text and in
+   attributes.
+
+   The behavior depends on the identity of the next character (the one
+   immediately after the U+0026 AMPERSAND character):
+
+   U+0009 CHARACTER TABULATION
+   U+000A LINE FEED (LF)
+   U+000C FORM FEED (FF)
+   U+0020 SPACE
+   U+003C LESS-THAN SIGN
+   U+0026 AMPERSAND
+   EOF
+   The additional allowed character, if there is one
+          Not a character reference. No characters are consumed, and
+          nothing is returned. (This is not an error, either.)
+
+   U+0023 NUMBER SIGN (#)
+          Consume the U+0023 NUMBER SIGN.
+
+          The behavior further depends on the character after the U+0023
+          NUMBER SIGN:
+
+          U+0078 LATIN SMALL LETTER X
+          U+0058 LATIN CAPITAL LETTER X
+                 Consume the X.
+
+                 Follow the steps below, but using the range of characters
+                 U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061
+                 LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER
+                 F, and U+0041 LATIN CAPITAL LETTER A, through to U+0046
+                 LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).
+
+                 When it comes to interpreting the number, interpret it as
+                 a hexadecimal number.
+
+          Anything else
+                 Follow the steps below, but using the range of characters
+                 U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just
+                 0-9).
+
+                 When it comes to interpreting the number, interpret it as
+                 a decimal number.
+
+          Consume as many characters as match the range of characters
+          given above.
+
+          If no characters match the range, then don't consume any
+          characters (and unconsume the U+0023 NUMBER SIGN character and,
+          if appropriate, the X character). This is a parse error; nothing
+          is returned.
+
+          Otherwise, if the next character is a U+003B SEMICOLON, consume
+          that too. If it isn't, there is a parse error.
+
+          If one or more characters match the range, then take them all
+          and interpret the string of characters as a number (either
+          hexadecimal or decimal as appropriate).
+
+          If that number is one of the numbers in the first column of the
+          following table, then this is a parse error. Find the row with
+          that number in the first column, and return a character token
+          for the Unicode character given in the second column of that
+          row.
+
+          Number  Unicode character
+          0x0D    U+000A LINE FEED (LF)
+          0x80    U+20AC EURO SIGN ('€')
+          0x81    U+FFFD REPLACEMENT CHARACTER
+          0x82    U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
+          0x83    U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
+          0x84    U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
+          0x85    U+2026 HORIZONTAL ELLIPSIS ('…')
+          0x86    U+2020 DAGGER ('†')
+          0x87    U+2021 DOUBLE DAGGER ('‡')
+          0x88    U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
+          0x89    U+2030 PER MILLE SIGN ('‰')
+          0x8A    U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
+          0x8B    U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
+          0x8C    U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
+          0x8D    U+FFFD REPLACEMENT CHARACTER
+          0x8E    U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
+          0x8F    U+FFFD REPLACEMENT CHARACTER
+          0x90    U+FFFD REPLACEMENT CHARACTER
+          0x91    U+2018 LEFT SINGLE QUOTATION MARK ('‘')
+          0x92    U+2019 RIGHT SINGLE QUOTATION MARK ('’')
+          0x93    U+201C LEFT DOUBLE QUOTATION MARK ('“')
+          0x94    U+201D RIGHT DOUBLE QUOTATION MARK ('”')
+          0x95    U+2022 BULLET ('•')
+          0x96    U+2013 EN DASH ('–')
+          0x97    U+2014 EM DASH ('—')
+          0x98    U+02DC SMALL TILDE ('˜')
+          0x99    U+2122 TRADE MARK SIGN ('™')
+          0x9A    U+0161 LATIN SMALL LETTER S WITH CARON ('š')
+          0x9B    U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
+          0x9C    U+0153 LATIN SMALL LIGATURE OE ('œ')
+          0x9D    U+FFFD REPLACEMENT CHARACTER
+          0x9E    U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
+          0x9F    U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
+
+          Otherwise, if the number is in the range 0x0000 to 0x0008,
+          0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to
+          0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+          0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+          0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+          0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+          0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+          0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
+          a parse error; return a character token for the U+FFFD
+          REPLACEMENT CHARACTER character instead.
+
+          Otherwise, return a character token for the Unicode character
+          whose code point is that number.
+
+   Anything else
+          Consume the maximum number of characters possible, with the
+          consumed characters matching one of the identifiers in the first
+          column of the named character references table (in a
+          case-sensitive manner).
+
+          If no match can be made, then this is a parse error. No
+          characters are consumed, and nothing is returned.
+
+          If the last character matched is not a U+003B SEMICOLON (;),
+          there is a parse error.
+
+          If the character reference is being consumed as part of an
+          attribute, and the last character matched is not a U+003B
+          SEMICOLON (;), and the next character is in the range U+0030
+          DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A
+          to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A
+          to U+007A LATIN SMALL LETTER Z, then, for historical reasons,
+          all the characters that were matched after the U+0026 AMPERSAND
+          (&) must be unconsumed, and nothing is returned.
+
+          Otherwise, return a character token for the character
+          corresponding to the character reference name (as given by the
+          second column of the named character references table).
+
+          If the markup contains I'm &notit; I tell you, the character
+          reference is parsed as "not", as in, I'm ¬it; I tell you. But if
+          the markup was I'm &notin; I tell you, the character reference
+          would be parsed as "notin;", resulting in I'm ∉ I tell you.
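
The character reference rules above can be illustrated with a short, non-normative Python sketch. The function names, the (character, new position, parse error) return shape, and the tiny NAMED table are assumptions made for this example; the real named character references table is far larger and is not reproduced here.

```python
import string

# Replacement table from the text: 0x0D, then 0x80 through 0x9F in order.
REMAP = {0x0D: 0x000A}
REMAP.update(zip(range(0x80, 0xA0), [
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
]))

# Hypothetical stand-in for the named character references table.
NAMED = {"not": "\u00AC", "not;": "\u00AC", "notin;": "\u2209",
         "amp": "&", "amp;": "&"}


def is_disallowed(n):
    """True for the numbers that must become U+FFFD."""
    return (n <= 0x08 or 0x0E <= n <= 0x1F or 0x7F <= n <= 0x9F
            or 0xD800 <= n <= 0xDFFF or 0xFDD0 <= n <= 0xFDEF
            or n == 0x0B or (n & 0xFFFE) == 0xFFFE or n > 0x10FFFF)


def consume_numeric(s, pos):
    """`pos` is the index of '#'. Returns (char or None, new_pos, error)."""
    start = pos
    pos += 1  # consume '#'
    if pos < len(s) and s[pos] in "xX":
        pos += 1
        digits, base = string.hexdigits, 16
    else:
        digits, base = string.digits, 10
    first = pos
    while pos < len(s) and s[pos] in digits:
        pos += 1
    if pos == first:
        return None, start, True  # unconsume '#' (and the X); parse error
    number = int(s[first:pos], base)
    error = True  # a missing semicolon is a parse error
    if pos < len(s) and s[pos] == ";":
        pos += 1
        error = False
    if number in REMAP:
        return chr(REMAP[number]), pos, True
    if is_disallowed(number):
        return "\uFFFD", pos, True
    return chr(number), pos, error


def consume_named(s, pos, in_attribute):
    """Longest case-sensitive match against NAMED, starting at `pos`."""
    matches = [name for name in NAMED if s.startswith(name, pos)]
    if not matches:
        return None, pos, True
    best = max(matches, key=len)
    end = pos + len(best)
    error = not best.endswith(";")
    alnum = string.ascii_letters + string.digits
    if in_attribute and error and end < len(s) and s[end] in alnum:
        return None, pos, error  # historical rule: unconsume everything
    return NAMED[best], end, error


def consume_character_reference(s, pos, additional_allowed=None,
                                in_attribute=False):
    """`pos` is the index just after the U+0026 AMPERSAND."""
    if pos >= len(s) or s[pos] in "\t\n\f <&" or s[pos] == additional_allowed:
        return None, pos, False  # not a character reference, not an error
    if s[pos] == "#":
        return consume_numeric(s, pos)
    return consume_named(s, pos, in_attribute)
```

With this sketch, calling consume_character_reference on the text I'm &notit; I tell you at the index just after the ampersand matches only "not" and reports a parse error, while the same call on I'm &notin; I tell you matches "notin;" without one. The noncharacter test relies on every listed value of the form 0xFFFE, 0xFFFF, 0x1FFFE, and so on having the bit pattern 0xFFFE in its low 16 bits once the last bit is masked off.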