author | Matt A. Tobin <email@mattatobin.com> | 2020-01-15 14:56:04 -0500 |
---|---|---|
committer | Matt A. Tobin <email@mattatobin.com> | 2020-01-15 14:56:04 -0500 |
commit | 6168dbe21f5f83b906e562ea0ab232d499b275a6 (patch) | |
tree | 658a4b27554c85ebcaad655fc83f2c2bb99e8e80 /parser/html/java/htmlparser/doc/tree-construction.txt | |
parent | 09314667a692fedff8564fc347c8a3663474faa6 (diff) | |
Add java htmlparser sources that match the original 52-level state
https://hg.mozilla.org/projects/htmlparser/
Commit: abe62ab2a9b69ccb3b5d8a231ec1ae11154c571d
Diffstat (limited to 'parser/html/java/htmlparser/doc/tree-construction.txt')
-rw-r--r-- | parser/html/java/htmlparser/doc/tree-construction.txt | 2201 |
1 files changed, 2201 insertions, 0 deletions
diff --git a/parser/html/java/htmlparser/doc/tree-construction.txt b/parser/html/java/htmlparser/doc/tree-construction.txt new file mode 100644 index 000000000..0febf147a --- /dev/null +++ b/parser/html/java/htmlparser/doc/tree-construction.txt @@ -0,0 +1,2201 @@ + #8.2.4 Tokenization Table of contents 8.4 Serializing HTML fragments + + WHATWG + +HTML 5 + +Draft Recommendation — 13 January 2009 + + ← 8.2.4 Tokenization – Table of contents – 8.4 Serializing HTML + fragments → + + 8.2.5 Tree construction + + The input to the tree construction stage is a sequence of tokens from + the tokenization stage. The tree construction stage is associated with + a DOM Document object when a parser is created. The "output" of this + stage consists of dynamically modifying or extending that document's + DOM tree. + + This specification does not define when an interactive user agent has + to render the Document so that it is available to the user, or when it + has to begin accepting user input. + + As each token is emitted from the tokeniser, the user agent must + process the token according to the rules given in the section + corresponding to the current insertion mode. + + When the steps below require the UA to insert a character into a node, + if that node has a child immediately before where the character is to + be inserted, and that child is a Text node, and that Text node was the + last node that the parser inserted into the document, then the + character must be appended to that Text node; otherwise, a new Text + node whose data is just that character must be inserted in the + appropriate place. + + DOM mutation events must not fire for changes caused by the UA parsing + the document. (Conceptually, the parser is not mutating the DOM, it is + constructing it.) This includes the parsing of any content inserted + using document.write() and document.writeln() calls. 
[DOM3EVENTS] + + Not all of the tag names mentioned below are conformant tag names in + this specification; many are included to handle legacy content. They + still form part of the algorithm that implementations are required to + implement to claim conformance. + + The algorithm described below places no limit on the depth of the DOM + tree generated, or on the length of tag names, attribute names, + attribute values, text nodes, etc. While implementors are encouraged to + avoid arbitrary limits, it is recognized that practical concerns will + likely force user agents to impose nesting depths. + + 8.2.5.1 Creating and inserting elements + + When the steps below require the UA to create an element for a token in + a particular namespace, the UA must create a node implementing the + interface appropriate for the element type corresponding to the tag + name of the token in the given namespace (as given in the specification + that defines that element, e.g. for an a element in the HTML namespace, + this specification defines it to be the HTMLAnchorElement interface), + with the tag name being the name of that element, with the node being + in the given namespace, and with the attributes on the node being those + given in the given token. + + The interface appropriate for an element in the HTML namespace that is + not defined in this specification is HTMLElement. The interface + appropriate for an element in another namespace that is not defined by + that namespace's specification is Element. + + When a resettable element is created in this manner, its reset + algorithm must be invoked once the attributes are set. (This + initializes the element's value and checkedness based on the element's + attributes.) 
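The element-creation rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Token` and `Element` classes and the one-entry `INTERFACES` table are hypothetical stand-ins, interfaces are represented by their names as strings, and the reset algorithm for resettable elements is left out.

```python
from dataclasses import dataclass, field

HTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, deliberately tiny interface table. A real user agent would
# cover every element defined by the specifications it implements.
INTERFACES = {
    (HTML_NS, "a"): "HTMLAnchorElement",
}


@dataclass
class Token:
    tag_name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Element:
    tag_name: str
    namespace: str
    interface: str
    attributes: dict
    children: list = field(default_factory=list)


def create_element_for_token(token, namespace):
    """Create an element for a token in the given namespace."""
    # Unknown HTML elements fall back to HTMLElement; unknown elements in
    # any other namespace fall back to Element.
    default = "HTMLElement" if namespace == HTML_NS else "Element"
    interface = INTERFACES.get((namespace, token.tag_name), default)
    # The attributes on the node are those given in the token.
    return Element(token.tag_name, namespace, interface, dict(token.attributes))
```

The fallback is chosen per namespace, so an unrecognized tag name never fails; it only yields the more generic interface.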
+ __________________________________________________________________ + + When the steps below require the UA to insert an HTML element for a + token, the UA must first create an element for the token in the HTML + namespace, and then append this node to the current node, and push it + onto the stack of open elements so that it is the new current node. + + The steps below may also require that the UA insert an HTML element in + a particular place, in which case the UA must follow the same steps + except that it must insert or append the new node in the location + specified instead of appending it to the current node. (This happens in + particular during the parsing of tables with invalid content.) + + If an element created by the insert an HTML element algorithm is a + form-associated element, and the form element pointer is not null, and + the newly created element doesn't have a form attribute, the user agent + must associate the newly created element with the form element pointed + to by the form element pointer before inserting it wherever it is to be + inserted. + __________________________________________________________________ + + When the steps below require the UA to insert a foreign element for a + token, the UA must first create an element for the token in the given + namespace, and then append this node to the current node, and push it + onto the stack of open elements so that it is the new current node. If + the newly created element has an xmlns attribute in the XMLNS namespace + whose value is not exactly the same as the element's namespace, that is + a parse error. + + When the steps below require the user agent to adjust MathML attributes + for a token, then, if the token has an attribute named definitionurl, + change its name to definitionURL (note the case difference). 
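As a rough sketch of the "insert an HTML element" steps described above, the following models the stack of open elements and the form element pointer. The `Node` and `TreeBuilder` classes are hypothetical, and `FORM_ASSOCIATED` is an assumed, incomplete subset of form-associated element names chosen only for illustration.

```python
class Node:
    def __init__(self, tag_name, attributes=None):
        self.tag_name = tag_name
        self.attributes = dict(attributes or {})
        self.children = []
        self.form_owner = None


# Assumed subset of form-associated element names, for illustration only.
FORM_ASSOCIATED = {"button", "input", "select", "textarea"}


class TreeBuilder:
    def __init__(self, root):
        self.open_elements = [root]   # stack of open elements, bottom is last
        self.form_pointer = None      # the form element pointer

    @property
    def current_node(self):
        return self.open_elements[-1]

    def insert_html_element(self, tag_name, attributes=None):
        node = Node(tag_name, attributes)
        # Associate with the form element pointer before insertion, unless
        # the new element carries its own form attribute.
        if (node.tag_name in FORM_ASSOCIATED
                and self.form_pointer is not None
                and "form" not in node.attributes):
            node.form_owner = self.form_pointer
        # Append to the current node, then make it the new current node.
        self.current_node.children.append(node)
        self.open_elements.append(node)
        return node
```

The variant that inserts "in a particular place" would differ only in where the node is attached; the association and stack push stay the same.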
+ + When the steps below require the user agent to adjust foreign + attributes for a token, then, if any of the attributes on the token + match the strings given in the first column of the following table, let + the attribute be a namespaced attribute, with the prefix being the + string given in the corresponding cell in the second column, the local + name being the string given in the corresponding cell in the third + column, and the namespace being the namespace given in the + corresponding cell in the fourth column. (This fixes the use of + namespaced attributes, in particular xml:lang.) + + Attribute name Prefix Local name Namespace + xlink:actuate xlink actuate XLink namespace + xlink:arcrole xlink arcrole XLink namespace + xlink:href xlink href XLink namespace + xlink:role xlink role XLink namespace + xlink:show xlink show XLink namespace + xlink:title xlink title XLink namespace + xlink:type xlink type XLink namespace + xml:base xml base XML namespace + xml:lang xml lang XML namespace + xml:space xml space XML namespace + xmlns (none) xmlns XMLNS namespace + xmlns:xlink xmlns xlink XMLNS namespace + __________________________________________________________________ + + The generic CDATA element parsing algorithm and the generic RCDATA + element parsing algorithm consist of the following steps. These + algorithms are always invoked in response to a start tag token. + 1. Insert an HTML element for the token. + 2. If the algorithm that was invoked is the generic CDATA element + parsing algorithm, switch the tokeniser's content model flag to the + CDATA state; otherwise the algorithm invoked was the generic RCDATA + element parsing algorithm, switch the tokeniser's content model + flag to the RCDATA state. + 3. Let the original insertion mode be the current insertion mode. + 4. Then, switch the insertion mode to "in CDATA/RCDATA". 
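The two attribute adjustments above amount to table lookups. The sketch below transcribes the twelve-row table into a dictionary; the function names and the tuple layout of the result are assumptions made for the example, and the namespace URIs are the standard XLink, XML, and XMLNS namespace names.

```python
XLINK_NS = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"

# attribute name -> (prefix, local name, namespace), as in the table above
FOREIGN_ATTRIBUTES = {
    "xlink:actuate": ("xlink", "actuate", XLINK_NS),
    "xlink:arcrole": ("xlink", "arcrole", XLINK_NS),
    "xlink:href": ("xlink", "href", XLINK_NS),
    "xlink:role": ("xlink", "role", XLINK_NS),
    "xlink:show": ("xlink", "show", XLINK_NS),
    "xlink:title": ("xlink", "title", XLINK_NS),
    "xlink:type": ("xlink", "type", XLINK_NS),
    "xml:base": ("xml", "base", XML_NS),
    "xml:lang": ("xml", "lang", XML_NS),
    "xml:space": ("xml", "space", XML_NS),
    "xmlns": (None, "xmlns", XMLNS_NS),
    "xmlns:xlink": ("xmlns", "xlink", XMLNS_NS),
}


def adjust_mathml_attributes(attributes):
    """Rename definitionurl to definitionURL; leave everything else alone."""
    return {("definitionURL" if name == "definitionurl" else name): value
            for name, value in attributes.items()}


def adjust_foreign_attributes(attributes):
    """Return (prefix, local_name, namespace, value) for each attribute."""
    adjusted = []
    for name, value in attributes.items():
        # Attributes not in the table stay un-namespaced.
        prefix, local, ns = FOREIGN_ATTRIBUTES.get(name, (None, name, None))
        adjusted.append((prefix, local, ns, value))
    return adjusted
```

Note that an attribute such as `xlink:foo` is not in the table, so it keeps its full name as the local name and gets no namespace.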
+ + 8.2.5.2 Closing elements that have implied end tags + + When the steps below require the UA to generate implied end tags, then, + while the current node is a dd element, a dt element, an li element, an + option element, an optgroup element, a p element, an rp element, or an + rt element, the UA must pop the current node off the stack of open + elements. + + If a step requires the UA to generate implied end tags but lists an + element to exclude from the process, then the UA must perform the above + steps as if that element was not in the above list. + + 8.2.5.3 Foster parenting + + Foster parenting happens when content is misnested in tables. + + When a node node is to be foster parented, the node node must be + inserted into the foster parent element, and the current table must be + marked as tainted. (Once the current table has been tainted, whitespace + characters are inserted into the foster parent element instead of the + current node.) + + The foster parent element is the parent element of the last table + element in the stack of open elements, if there is a table element and + it has such a parent element. If there is no table element in the stack + of open elements (fragment case), then the foster parent element is the + first element in the stack of open elements (the html element). + Otherwise, if there is a table element in the stack of open elements, + but the last table element in the stack of open elements has no parent, + or its parent node is not an element, then the foster parent element is + the element before the last table element in the stack of open + elements. + + If the foster parent element is the parent element of the last table + element in the stack of open elements, then node must be inserted + immediately before the last table element in the stack of open elements + in the foster parent element; otherwise, node must be appended to the + foster parent element. 
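The implied-end-tag loop and the choice of foster parent described above can be sketched as follows. The `Node` class is hypothetical, and its `parent` attribute is assumed to hold only element parents (so "no parent" and "parent is not an element" are both modelled as `None`). The handling of tainted tables is reduced to setting a flag.

```python
IMPLIED_END_TAGS = {"dd", "dt", "li", "option", "optgroup", "p", "rp", "rt"}


class Node:
    def __init__(self, tag_name, parent=None):
        self.tag_name = tag_name
        self.parent = parent        # parent element, or None
        self.children = []
        self.tainted = False
        if parent is not None:
            parent.children.append(self)


def generate_implied_end_tags(open_elements, exclude=None):
    """Pop implied-end-tag elements off the stack, skipping `exclude`."""
    names = IMPLIED_END_TAGS - {exclude}
    while open_elements and open_elements[-1].tag_name in names:
        open_elements.pop()


def foster_parent(open_elements, node):
    """Insert node into the foster parent element and return that element."""
    table = next((e for e in reversed(open_elements)
                  if e.tag_name == "table"), None)
    if table is None:
        # Fragment case: the first element in the stack (the html element).
        target = open_elements[0]
        target.children.append(node)
    elif table.parent is not None:
        # Insert immediately before the last table in its parent.
        target = table.parent
        target.children.insert(target.children.index(table), node)
        table.tainted = True
    else:
        # The table has no element parent: use the entry before it.
        target = open_elements[open_elements.index(table) - 1]
        target.children.append(node)
        table.tainted = True
    node.parent = target
    return target
```

Only the case where the foster parent is the table's actual parent inserts before the table; the other two cases append.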
+ + 8.2.5.4 The "initial" insertion mode + + When the insertion mode is "initial", tokens must be handled as + follows: + + A character token that is one of one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Ignore the token. + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + If the DOCTYPE token's name is not a case-sensitive match for + the string "html", or if the token's public identifier is + neither missing nor a case-sensitive match for the string + "XSLT-compat", or if the token's system identifier is not + missing, then there is a parse error (this is the DOCTYPE parse + error). Conformance checkers may, instead of reporting this + error, switch to a conformance checking mode for another + language (e.g. based on the DOCTYPE token a conformance checker + could recognize that the document is an HTML4-era document, and + defer to an HTML4 conformance checker.) + + Append a DocumentType node to the Document node, with the name + attribute set to the name given in the DOCTYPE token; the + publicId attribute set to the public identifier given in the + DOCTYPE token, or the empty string if the public identifier was + missing; the systemId attribute set to the system identifier + given in the DOCTYPE token, or the empty string if the system + identifier was missing; and the other attributes specific to + DocumentType objects set to null and empty lists as appropriate. + Associate the DocumentType node with the Document object so that + it is returned as the value of the doctype attribute of the + Document object. + + Then, if the DOCTYPE token matches one of the conditions in the + following list, then set the document to quirks mode: + + + The force-quirks flag is set to on. + + The name is set to anything other than "HTML". 
+ + The public identifier starts with: "+//Silmaril//dtd html Pro + v0r11 19970101//" + + The public identifier starts with: "-//AdvaSoft Ltd//DTD HTML + 3.0 asWedit + extensions//" + + The public identifier starts with: "-//AS//DTD HTML 3.0 + asWedit + extensions//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Level 1//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Level 2//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Strict Level 1//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Strict Level 2//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Strict//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0//" + + The public identifier starts with: "-//IETF//DTD HTML 2.1E//" + + The public identifier starts with: "-//IETF//DTD HTML 3.0//" + + The public identifier starts with: "-//IETF//DTD HTML 3.2 + Final//" + + The public identifier starts with: "-//IETF//DTD HTML 3.2//" + + The public identifier starts with: "-//IETF//DTD HTML 3//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 0//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 1//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 2//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 3//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 0//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 1//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 2//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 3//" + + The public identifier starts with: "-//IETF//DTD HTML + Strict//" + + The public identifier starts with: "-//IETF//DTD HTML//" + + The public identifier starts with: "-//Metrius//DTD Metrius + Presentational//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 2.0 HTML Strict//" + + The public identifier starts with: 
"-//Microsoft//DTD Internet + Explorer 2.0 HTML//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 2.0 Tables//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 3.0 HTML Strict//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 3.0 HTML//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 3.0 Tables//" + + The public identifier starts with: "-//Netscape Comm. + Corp.//DTD HTML//" + + The public identifier starts with: "-//Netscape Comm. + Corp.//DTD Strict HTML//" + + The public identifier starts with: "-//O'Reilly and + Associates//DTD HTML 2.0//" + + The public identifier starts with: "-//O'Reilly and + Associates//DTD HTML Extended 1.0//" + + The public identifier starts with: "-//O'Reilly and + Associates//DTD HTML Extended Relaxed 1.0//" + + The public identifier starts with: "-//SoftQuad Software//DTD + HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//" + + The public identifier starts with: "-//SoftQuad//DTD HoTMetaL + PRO 4.0::19971010::extensions to HTML 4.0//" + + The public identifier starts with: "-//Spyglass//DTD HTML 2.0 + Extended//" + + The public identifier starts with: "-//SQ//DTD HTML 2.0 + HoTMetaL + extensions//" + + The public identifier starts with: "-//Sun Microsystems + Corp.//DTD HotJava HTML//" + + The public identifier starts with: "-//Sun Microsystems + Corp.//DTD HotJava Strict HTML//" + + The public identifier starts with: "-//W3C//DTD HTML 3 + 1995-03-24//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2 + Draft//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2 + Final//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2S + Draft//" + + The public identifier starts with: "-//W3C//DTD HTML 4.0 + Frameset//" + + The public identifier starts with: "-//W3C//DTD HTML 4.0 + Transitional//" + + The public identifier 
starts with: "-//W3C//DTD HTML + Experimental 19960712//" + + The public identifier starts with: "-//W3C//DTD HTML + Experimental 970421//" + + The public identifier starts with: "-//W3C//DTD W3 HTML//" + + The public identifier starts with: "-//W3O//DTD W3 HTML 3.0//" + + The public identifier is set to: "-//W3O//DTD W3 HTML Strict + 3.0//EN//" + + The public identifier starts with: "-//WebTechs//DTD Mozilla + HTML 2.0//" + + The public identifier starts with: "-//WebTechs//DTD Mozilla + HTML//" + + The public identifier is set to: "-/W3C/DTD HTML 4.0 + Transitional/EN" + + The public identifier is set to: "HTML" + + The system identifier is set to: + "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd" + + The system identifier is missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Frameset//" + + The system identifier is missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Transitional//" + + Otherwise, if the DOCTYPE token matches one of the conditions in + the following list, then set the document to limited quirks + mode: + + + The public identifier starts with: "-//W3C//DTD XHTML 1.0 + Frameset//" + + The public identifier starts with: "-//W3C//DTD XHTML 1.0 + Transitional//" + + The system identifier is not missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Frameset//" + + The system identifier is not missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Transitional//" + + The name, system identifier, and public identifier strings must + be compared to the values given in the lists above in an ASCII + case-insensitive manner. A system identifier whose value is the + empty string is not considered missing for the purposes of the + conditions above. + + Then, switch the insertion mode to "before html". + + Anything else + Parse error. + + Set the document to quirks mode. + + Switch the insertion mode to "before html", then reprocess the + current token. 
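The DOCTYPE checks above reduce to a handful of prefix and equality tests. The sketch below is illustrative only: `QUIRKS_PREFIXES` is deliberately abbreviated to three of the many prefixes listed above, the function name and the "no quirks" label for the default case are assumptions, and a missing identifier is represented as `None` so that an empty string still counts as present.

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(value):
    """Lowercase only A-Z, for ASCII case-insensitive comparison."""
    return value.translate(_ASCII_LOWER)


# Abbreviated on purpose: only three of the prefixes from the list above.
QUIRKS_PREFIXES = (
    "-//ietf//dtd html 2.0//",
    "-//w3c//dtd html 3.2 final//",
    "-//w3c//dtd html 4.0 transitional//",
)
QUIRKS_EXACT_PUBLIC = {
    "-//w3o//dtd w3 html strict 3.0//en//",
    "-/w3c/dtd html 4.0 transitional/en",
    "html",
}
QUIRKS_EXACT_SYSTEM = "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd"
HTML401_PREFIXES = (
    "-//w3c//dtd html 4.01 frameset//",
    "-//w3c//dtd html 4.01 transitional//",
)
LIMITED_PREFIXES = (
    "-//w3c//dtd xhtml 1.0 frameset//",
    "-//w3c//dtd xhtml 1.0 transitional//",
)


def document_mode(name, public_id=None, system_id=None, force_quirks=False):
    """Classify a DOCTYPE token; None means the identifier is missing."""
    name = ascii_lower(name or "")
    public = ascii_lower(public_id or "")
    system = None if system_id is None else ascii_lower(system_id)
    if (force_quirks
            or name != "html"
            or public.startswith(QUIRKS_PREFIXES)
            or public in QUIRKS_EXACT_PUBLIC
            or system == QUIRKS_EXACT_SYSTEM
            or (system is None and public.startswith(HTML401_PREFIXES))):
        return "quirks"
    if (public.startswith(LIMITED_PREFIXES)
            or (system is not None and public.startswith(HTML401_PREFIXES))):
        return "limited quirks"
    return "no quirks"
```

The HTML 4.01 Frameset and Transitional identifiers are the only ones whose outcome depends on whether a system identifier is present, which is why they appear in both branches.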
+ + 8.2.5.5 The "before html" insertion mode + + When the insertion mode is "before html", tokens must be handled as + follows: + + A DOCTYPE token + Parse error. Ignore the token. + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A character token that is one of one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Ignore the token. + + A start tag whose tag name is "html" + Create an element for the token in the HTML namespace. Append it + to the Document object. Put this element in the stack of open + elements. + + If the token has an attribute "manifest", then resolve the value + of that attribute to an absolute URL, and if that is successful, + run the application cache selection algorithm with the resulting + absolute URL. Otherwise, if there is no such attribute or + resolving it fails, run the application cache selection + algorithm with no manifest. The algorithm must be passed the + Document object. + + Switch the insertion mode to "before head". + + Anything else + Create an HTMLElement node with the tag name html, in the HTML + namespace. Append it to the Document object. Put this element in + the stack of open elements. + + Run the application cache selection algorithm with no manifest, + passing it the Document object. + + Switch the insertion mode to "before head", then reprocess the + current token. + + Should probably make end tags be ignored, so that "</head><!-- + --><html>" puts the comment before the root node (or should we?) + + The root element can end up being removed from the Document object, + e.g. by scripts; nothing in particular happens in such cases, content + continues being appended to the nodes as described in the next section. 
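One way to picture the "before html" rules above is as a function that returns the next insertion mode together with a flag saying whether the current token must be reprocessed. This is a simplified sketch: tokens are modelled as plain dictionaries and nodes as tuples (both assumptions), and the application cache selection step is omitted entirely.

```python
WHITESPACE = {"\t", "\n", "\f", " "}


def before_html(token, document, open_elements):
    """Handle one token; return (next_insertion_mode, reprocess_token)."""
    kind = token["type"]
    if kind == "doctype":
        # Parse error; the token is ignored.
        return "before html", False
    if kind == "comment":
        document.append(("comment", token["data"]))
        return "before html", False
    if kind == "character" and token["data"] in WHITESPACE:
        return "before html", False
    if kind == "start_tag" and token["name"] == "html":
        root = ("element", "html", dict(token.get("attributes", {})))
        document.append(root)
        open_elements.append(root)
        return "before head", False
    # Anything else: synthesize the html element, then reprocess the token
    # in the next insertion mode.
    root = ("element", "html", {})
    document.append(root)
    open_elements.append(root)
    return "before head", True
```

Returning a reprocess flag keeps the "then reprocess the current token" wording explicit, so a driver loop can feed the same token to the next mode's handler.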
+ + 8.2.5.6 The "before head" insertion mode + + When the insertion mode is "before head", tokens must be handled as + follows: + + A character token that is one of one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Ignore the token. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "head" + Insert an HTML element for the token. + + Set the head element pointer to the newly created head element. + + Switch the insertion mode to "in head". + + An end tag whose tag name is one of: "head", "br" + Act as if a start tag token with the tag name "head" and no + attributes had been seen, then reprocess the current token. + + Any other end tag + Parse error. Ignore the token. + + Anything else + Act as if a start tag token with the tag name "head" and no + attributes had been seen, then reprocess the current token. + + This will result in an empty head element being generated, with + the current token being reprocessed in the "after head" + insertion mode. + + 8.2.5.7 The "in head" insertion mode + + When the insertion mode is "in head", tokens must be handled as + follows: + + A character token that is one of one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. 
+ + A start tag whose tag name is one of: "base", "command", "eventsource", + "link" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "meta" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + If the element has a charset attribute, and its value is a + supported encoding, and the confidence is currently tentative, + then change the encoding to the encoding given by the value of + the charset attribute. + + Otherwise, if the element has a content attribute, and applying + the algorithm for extracting an encoding from a Content-Type to + its value returns a supported encoding encoding, and the + confidence is currently tentative, then change the encoding to + the encoding encoding. + + A start tag whose tag name is "title" + Follow the generic RCDATA element parsing algorithm. + + A start tag whose tag name is "noscript", if the scripting flag is + enabled + + A start tag whose tag name is one of: "noframes", "style" + Follow the generic CDATA element parsing algorithm. + + A start tag whose tag name is "noscript", if the scripting flag is + disabled + Insert an HTML element for the token. + + Switch the insertion mode to "in head noscript". + + A start tag whose tag name is "script" + + 1. Create an element for the token in the HTML namespace. + 2. Mark the element as being "parser-inserted". + This ensures that, if the script is external, any + document.write() calls in the script will execute in-line, + instead of blowing the document away, as would happen in most + other cases. It also prevents the script from executing until + the end tag is seen. + 3. If the parser was originally created for the HTML fragment + parsing algorithm, then mark the script element as "already + executed". 
(fragment case) + 4. Append the new element to the current node. + 5. Switch the tokeniser's content model flag to the CDATA state. + 6. Let the original insertion mode be the current insertion mode. + 7. Switch the insertion mode to "in CDATA/RCDATA". + + An end tag whose tag name is "head" + Pop the current node (which will be the head element) off the + stack of open elements. + + Switch the insertion mode to "after head". + + An end tag whose tag name is "br" + Act as described in the "anything else" entry below. + + A start tag whose tag name is "head" + Any other end tag + Parse error. Ignore the token. + + Anything else + Act as if an end tag token with the tag name "head" had been + seen, and reprocess the current token. + + In certain UAs, some elements don't trigger the "in body" mode + straight away, but instead get put into the head. Do we want to + copy that? + + 8.2.5.8 The "in head noscript" insertion mode + + When the insertion mode is "in head noscript", tokens must be handled + as follows: + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end tag whose tag name is "noscript" + Pop the current node (which will be a noscript element) from the + stack of open elements; the new current node will be a head + element. + + Switch the insertion mode to "in head". + + A character token that is one of one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + + A comment token + A start tag whose tag name is one of: "link", "meta", "noframes", + "style" + Process the token using the rules for the "in head" insertion + mode. + + An end tag whose tag name is "br" + Act as described in the "anything else" entry below. + + A start tag whose tag name is one of: "head", "noscript" + Any other end tag + Parse error. Ignore the token. + + Anything else + Parse error. 
Act as if an end tag with the tag name "noscript" + had been seen and reprocess the current token. + + 8.2.5.9 The "after head" insertion mode + + When the insertion mode is "after head", tokens must be handled as + follows: + + A character token that is one of one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "body" + Insert an HTML element for the token. + + Switch the insertion mode to "in body". + + A start tag whose tag name is "frameset" + Insert an HTML element for the token. + + Switch the insertion mode to "in frameset". + + A start tag token whose tag name is one of: "base", "link", "meta", + "noframes", "script", "style", "title" + Parse error. + + Push the node pointed to by the head element pointer onto the + stack of open elements. + + Process the token using the rules for the "in head" insertion + mode. + + Remove the node pointed to by the head element pointer from the + stack of open elements. + + An end tag whose tag name is "br" + Act as described in the "anything else" entry below. + + A start tag whose tag name is "head" + Any other end tag + Parse error. Ignore the token. + + Anything else + Act as if a start tag token with the tag name "body" and no + attributes had been seen, and then reprocess the current token. + + 8.2.5.10 The "in body" insertion mode + + When the insertion mode is "in body", tokens must be handled as + follows: + + A character token + Reconstruct the active formatting elements, if any. + + Insert the token's character into the current node. 
+ + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Parse error. For each attribute on the token, check to see if + the attribute is already present on the top element of the stack + of open elements. If it is not, add the attribute and its + corresponding value to that element. + + A start tag token whose tag name is one of: "base", "command", + "eventsource", "link", "meta", "noframes", "script", "style", + "title" + Process the token using the rules for the "in head" insertion + mode. + + A start tag whose tag name is "body" + Parse error. + + If the second element on the stack of open elements is not a + body element, or, if the stack of open elements has only one + node on it, then ignore the token. (fragment case) + + Otherwise, for each attribute on the token, check to see if the + attribute is already present on the body element (the second + element) on the stack of open elements. If it is not, add the + attribute and its corresponding value to that element. + + An end-of-file token + If there is a node in the stack of open elements that is not + either a dd element, a dt element, an li element, a p element, a + tbody element, a td element, a tfoot element, a th element, a + thead element, a tr element, the body element, or the html + element, then this is a parse error. + + Stop parsing. + + An end tag whose tag name is "body" + If the stack of open elements does not have a body element in + scope, this is a parse error; ignore the token. + + Otherwise, if there is a node in the stack of open elements that + is not either a dd element, a dt element, an li element, a p + element, a tbody element, a td element, a tfoot element, a th + element, a thead element, a tr element, the body element, or the + html element, then this is a parse error. 
+ + Switch the insertion mode to "after body". + + An end tag whose tag name is "html" + Act as if an end tag with tag name "body" had been seen, then, + if that token wasn't ignored, reprocess the current token. + + The fake end tag token here can only be ignored in the fragment + case. + + A start tag whose tag name is one of: "address", "article", "aside", + "blockquote", "center", "datagrid", "details", "dialog", "dir", + "div", "dl", "fieldset", "figure", "footer", "header", "menu", + "nav", "ol", "p", "section", "ul" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + A start tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", + "h6" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + If the current node is an element whose tag name is one of "h1", + "h2", "h3", "h4", "h5", or "h6", then this is a parse error; pop + the current node off the stack of open elements. + + Insert an HTML element for the token. + + A start tag whose tag name is one of: "pre", "listing" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + If the next token is a U+000A LINE FEED (LF) character token, + then ignore that token and move on to the next one. (Newlines at + the start of pre blocks are ignored as an authoring + convenience.) + + A start tag whose tag name is "form" + If the form element pointer is not null, then this is a parse + error; ignore the token. + + Otherwise: + + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token, and set the form element + pointer to point to the element created. 
+ + A start tag whose tag name is "li" + Run the following algorithm: + + 1. Initialize node to be the current node (the bottommost node of + the stack). + 2. If node is an li element, then act as if an end tag with the + tag name "li" had been seen, then jump to the last step. + 3. If node is not in the formatting category, and is not in the + phrasing category, and is not an address, div, or p element, + then jump to the last step. + 4. Otherwise, set node to the previous entry in the stack of open + elements and return to step 2. + 5. This is the last step. + If the stack of open elements has a p element in scope, then + act as if an end tag with the tag name "p" had been seen. + Finally, insert an HTML element for the token. + + A start tag whose tag name is one of: "dd", "dt" + Run the following algorithm: + + 1. Initialize node to be the current node (the bottommost node of + the stack). + 2. If node is a dd or dt element, then act as if an end tag with + the same tag name as node had been seen, then jump to the last + step. + 3. If node is not in the formatting category, and is not in the + phrasing category, and is not an address, div, or p element, + then jump to the last step. + 4. Otherwise, set node to the previous entry in the stack of open + elements and return to step 2. + 5. This is the last step. + If the stack of open elements has a p element in scope, then + act as if an end tag with the tag name "p" had been seen. + Finally, insert an HTML element for the token. + + A start tag whose tag name is "plaintext" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + Switch the content model flag to the PLAINTEXT state. 
+ + Once a start tag with the tag name "plaintext" has been seen, + that will be the last token ever seen other than character + tokens (and the end-of-file token), because there is no way to + switch the content model flag out of the PLAINTEXT state. + + An end tag whose tag name is one of: "address", "article", "aside", + "blockquote", "center", "datagrid", "details", "dialog", "dir", + "div", "dl", "fieldset", "figure", "footer", "header", + "listing", "menu", "nav", "ol", "pre", "section", "ul" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + + An end tag whose tag name is "form" + Let node be the element that the form element pointer is set to. + + Set the form element pointer to null. + + If node is null or the stack of open elements does not have node + in scope, then this is a parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not node, then this is a parse error. + 3. Remove node from the stack of open elements. + + An end tag whose tag name is "p" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; act as if a start tag with the tag name p had been + seen, then reprocess the current token. + + Otherwise, run these steps: + + 1. Generate implied end tags, except for elements with the same + tag name as the token. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. 
Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + + An end tag whose tag name is one of: "dd", "dt", "li" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags, except for elements with the same + tag name as the token. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + + An end tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6" + If the stack of open elements does not have an element in scope + whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6", + then this is a parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6" + has been popped from the stack. + + An end tag whose tag name is "sarcasm" + Take a deep breath, then act as described in the "any other end + tag" entry below. 
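As a rough sketch of the heading end tag rule above, the stack of open elements can be modelled as a list of tag names with the current node last. The scope boundaries and the set of elements closed by "generate implied end tags" are defined outside this section, so the two sets below are assumptions, as are the function names.

```python
HEADINGS = frozenset({"h1", "h2", "h3", "h4", "h5", "h6"})

# Assumption: elements that end the "in scope" search (defined elsewhere).
SCOPE_BOUNDARIES = frozenset(
    {"applet", "button", "caption", "html", "marquee", "object",
     "table", "td", "th"})

# Assumption: elements closed by "generate implied end tags".
IMPLIED_END = frozenset({"dd", "dt", "li", "option", "optgroup", "p", "rp", "rt"})


def has_in_scope(stack, targets):
    """Walk from the current node upwards until a target or boundary is hit."""
    for tag in reversed(stack):
        if tag in targets:
            return True
        if tag in SCOPE_BOUNDARIES:
            return False
    return False


def end_heading(stack, tag):
    """Handle an h1-h6 end tag; returns the list of parse errors raised."""
    errors = []
    if not has_in_scope(stack, HEADINGS):
        errors.append("no heading in scope; token ignored")
        return errors
    while stack[-1] in IMPLIED_END:          # step 1
        stack.pop()
    if stack[-1] != tag:                     # step 2
        errors.append("current node does not match end tag")
    while stack.pop() not in HEADINGS:       # step 3: any heading closes it
        pass
    return errors


stack = ["html", "body", "h1", "p"]
print(end_heading(stack, "h2"), stack)
```

Note that a mismatched heading end tag still closes the open heading; it only adds a parse error.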
+ + A start tag whose tag name is "a" + If the list of active formatting elements contains an element + whose tag name is "a" between the end of the list and the last + marker on the list (or the start of the list if there is no + marker on the list), then this is a parse error; act as if an + end tag with the tag name "a" had been seen, then remove that + element from the list of active formatting elements and the + stack of open elements if the end tag didn't already remove it + (it might not have if the element is not in table scope). + + In the non-conforming stream + <a href="a">a<table><a href="b">b</table>x, the first a element + would be closed upon seeing the second one, and the "x" + character would be inside a link to "b", not to "a". This is + despite the fact that the outer a element is not in table scope + (meaning that a regular </a> end tag at the start of the table + wouldn't close the outer a element). + + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. Add that element to the + list of active formatting elements. + + A start tag whose tag name is one of: "b", "big", "em", "font", "i", + "s", "small", "strike", "strong", "tt", "u" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. Add that element to the + list of active formatting elements. + + A start tag whose tag name is "nobr" + Reconstruct the active formatting elements, if any. + + If the stack of open elements has a nobr element in scope, then + this is a parse error; act as if an end tag with the tag name + "nobr" had been seen, then once again reconstruct the active + formatting elements, if any. + + Insert an HTML element for the token. Add that element to the + list of active formatting elements. + + An end tag whose tag name is one of: "a", "b", "big", "em", "font", + "i", "nobr", "s", "small", "strike", "strong", "tt", "u" + Follow these steps: + + 1. 
Let the formatting element be the last element in the list of + active formatting elements that: + o is between the end of the list and the last scope marker + in the list, if any, or the start of the list otherwise, + and + o has the same tag name as the token. + If there is no such node, or, if that node is also in the + stack of open elements but the element is not in scope, then + this is a parse error; ignore the token, and abort these + steps. + Otherwise, if there is such a node, but that node is not in + the stack of open elements, then this is a parse error; remove + the element from the list, and abort these steps. + Otherwise, there is a formatting element and that element is + in the stack and is in scope. If the element is not the + current node, this is a parse error. In any case, proceed with + the algorithm as written in the following steps. + 2. Let the furthest block be the topmost node in the stack of + open elements that is lower in the stack than the formatting + element, and is not an element in the phrasing or formatting + categories. There might not be one. + 3. If there is no furthest block, then the UA must skip the + subsequent steps and instead just pop all the nodes from the + bottom of the stack of open elements, from the current node up + to and including the formatting element, and remove the + formatting element from the list of active formatting + elements. + 4. Let the common ancestor be the element immediately above the + formatting element in the stack of open elements. + 5. If the furthest block has a parent node, then remove the + furthest block from its parent node. + 6. Let a bookmark note the position of the formatting element in + the list of active formatting elements relative to the + elements on either side of it in the list. + 7. Let node and last node be the furthest block. Follow these + steps: + 1. Let node be the element immediately above node in the + stack of open elements. + 2. 
If node is not in the list of active formatting elements, + then remove node from the stack of open elements and then + go back to step 1. + 3. Otherwise, if node is the formatting element, then go to + the next step in the overall algorithm. + 4. Otherwise, if last node is the furthest block, then move + the aforementioned bookmark to be immediately after the + node in the list of active formatting elements. + 5. If node has any children, perform a shallow clone of + node, replace the entry for node in the list of active + formatting elements with an entry for the clone, replace + the entry for node in the stack of open elements with an + entry for the clone, and let node be the clone. + 6. Insert last node into node, first removing it from its + previous parent node if any. + 7. Let last node be node. + 8. Return to step 1 of this inner set of steps. + 8. If the common ancestor node is a table, tbody, tfoot, thead, + or tr element, then, foster parent whatever last node ended up + being in the previous step. + Otherwise, append whatever last node ended up being in the + previous step to the common ancestor node, first removing it + from its previous parent node if any. + 9. Perform a shallow clone of the formatting element. + 10. Take all of the child nodes of the furthest block and append + them to the clone created in the last step. + 11. Append that clone to the furthest block. + 12. Remove the formatting element from the list of active + formatting elements, and insert the clone into the list of + active formatting elements at the position of the + aforementioned bookmark. + 13. Remove the formatting element from the stack of open elements, + and insert the clone into the stack of open elements + immediately below the position of the furthest block in that + stack. + 14. Jump back to step 1 in this series of steps. + + The way these steps are defined, only elements in the formatting + category ever get cloned by this algorithm. 
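Steps 2 and 3 of the algorithm above (finding the furthest block, and the simple case where there is none) can be sketched as below. This is not a full implementation: the cloning and reparenting steps are omitted, the `Node` class is a stand-in for DOM elements, and `PHRASING_SAMPLE` is an assumed, deliberately tiny subset of the phrasing category, which is defined outside this section.

```python
FORMATTING = frozenset({"a", "b", "big", "em", "font", "i", "nobr", "s",
                        "small", "strike", "strong", "tt", "u"})

# Assumption: illustrative subset of the phrasing category only.
PHRASING_SAMPLE = frozenset({"span", "code", "img", "label", "sub", "sup"})


class Node:
    """Hypothetical stand-in for an element; identity matters, not equality."""

    def __init__(self, tag):
        self.tag = tag

    def __repr__(self):
        return "<%s>" % self.tag


def find_furthest_block(stack, formatting_element):
    """Step 2: topmost node below the formatting element that is neither
    formatting nor phrasing. The stack's current node is the last entry."""
    start = next(i for i, n in enumerate(stack) if n is formatting_element)
    for node in stack[start + 1:]:
        if node.tag not in FORMATTING and node.tag not in PHRASING_SAMPLE:
            return node
    return None


def pop_through_formatting_element(stack, active_formatting, formatting_element):
    """Step 3: no furthest block, so pop up to and including the element."""
    while stack.pop() is not formatting_element:
        pass
    active_formatting.remove(formatting_element)


html, body, b, i = Node("html"), Node("body"), Node("b"), Node("i")
stack, active = [html, body, b, i], [b, i]
if find_furthest_block(stack, b) is None:
    pop_through_formatting_element(stack, active, b)
print(stack, active)
```

In the example, `</b>` arrives while `<i>` is still open; `<i>` leaves the stack but stays in the list of active formatting elements, so it can be reconstructed later.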
+ + Because of the way this algorithm causes elements to change + parents, it has been dubbed the "adoption agency algorithm" (in + contrast with other possible algorithms for dealing with + misnested content, which included the "incest algorithm", the + "secret affair algorithm", and the "Heisenberg algorithm"). + + A start tag whose tag name is "button" + If the stack of open elements has a button element in scope, + then this is a parse error; act as if an end tag with the tag + name "button" had been seen, then reprocess the token. + + Otherwise: + + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + Insert a marker at the end of the list of active formatting + elements. + + A start tag token whose tag name is one of: "applet", "marquee", + "object" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + Insert a marker at the end of the list of active formatting + elements. + + An end tag token whose tag name is one of: "applet", "button", + "marquee", "object" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + 4. Clear the list of active formatting elements up to the last + marker. + + A start tag whose tag name is "xmp" + Reconstruct the active formatting elements, if any. + + Follow the generic CDATA element parsing algorithm. + + A start tag whose tag name is "table" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token.
+ + Switch the insertion mode to "in table". + + A start tag whose tag name is one of: "area", "basefont", "bgsound", + "br", "embed", "img", "input", "spacer", "wbr" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is one of: "param", "source" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "hr" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "image" + Parse error. Change the token's tag name to "img" and reprocess + it. (Don't ask.) + + A start tag whose tag name is "isindex" + Parse error. + + If the form element pointer is not null, then ignore the token. + + Otherwise: + + Acknowledge the token's self-closing flag, if it is set. + + Act as if a start tag token with the tag name "form" had been + seen. + + If the token has an attribute called "action", set the action + attribute on the resulting form element to the value of the + "action" attribute of the token. + + Act as if a start tag token with the tag name "hr" had been + seen. + + Act as if a start tag token with the tag name "p" had been seen. + + Act as if a start tag token with the tag name "label" had been + seen. + + Act as if a stream of character tokens had been seen (see below + for what they should say). + + Act as if a start tag token with the tag name "input" had been + seen, with all the attributes from the "isindex" token except + "name", "action", and "prompt". 
Set the name attribute of the + resulting input element to the value "isindex". + + Act as if a stream of character tokens had been seen (see below + for what they should say). + + Act as if an end tag token with the tag name "label" had been + seen. + + Act as if an end tag token with the tag name "p" had been seen. + + Act as if a start tag token with the tag name "hr" had been + seen. + + Act as if an end tag token with the tag name "form" had been + seen. + + If the token has an attribute with the name "prompt", then the + first stream of characters must be the same string as given in + that attribute, and the second stream of characters must be + empty. Otherwise, the two streams of character tokens together + should, together with the input element, express the equivalent + of "This is a searchable index. Insert your search keywords + here: (input field)" in the user's preferred language. + + A start tag whose tag name is "textarea" + + 1. Insert an HTML element for the token. + 2. If the next token is a U+000A LINE FEED (LF) character token, + then ignore that token and move on to the next one. (Newlines + at the start of textarea elements are ignored as an authoring + convenience.) + 3. Switch the tokeniser's content model flag to the RCDATA state. + 4. Let the original insertion mode be the current insertion mode. + 5. Switch the insertion mode to "in CDATA/RCDATA". + + A start tag whose tag name is one of: "iframe", "noembed" + A start tag whose tag name is "noscript", if the scripting flag is + enabled + Follow the generic CDATA element parsing algorithm. + + A start tag whose tag name is "select" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + If the insertion mode is one of "in table", "in caption", "in + column group", "in table body", "in row", or "in cell", then + switch the insertion mode to "in select in table". Otherwise, + switch the insertion mode to "in select".
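The mode switch at the end of the "select" start tag rule above amounts to a membership test. A minimal sketch, assuming insertion modes are represented as plain strings (the function name is made up for the example):

```python
# The six table-related insertion modes named in the "select" rule.
TABLE_MODES = frozenset({
    "in table", "in caption", "in column group",
    "in table body", "in row", "in cell",
})


def mode_after_select_start_tag(current_mode):
    """Pick the insertion mode to switch to after inserting a select element."""
    if current_mode in TABLE_MODES:
        return "in select in table"
    return "in select"


print(mode_after_select_start_tag("in cell"))
print(mode_after_select_start_tag("in body"))
```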
+ + A start tag whose tag name is one of: "optgroup", "option" + If the stack of open elements has an option element in scope, + then act as if an end tag with the tag name "option" had been + seen. + + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + A start tag whose tag name is one of: "rp", "rt" + If the stack of open elements has a ruby element in scope, then + generate implied end tags. If the current node is not then a + ruby element, this is a parse error; pop all the nodes from the + current node up to the node immediately before the bottommost + ruby element on the stack of open elements. + + Insert an HTML element for the token. + + An end tag whose tag name is "br" + Parse error. Act as if a start tag token with the tag name "br" + had been seen. Ignore the end tag token. + + A start tag whose tag name is "math" + Reconstruct the active formatting elements, if any. + + Adjust MathML attributes for the token. (This fixes the case of + MathML attributes that are not all lowercase.) + + Adjust foreign attributes for the token. (This fixes the use of + namespaced attributes, in particular XLink.) + + Insert a foreign element for the token, in the MathML namespace. + + If the token has its self-closing flag set, pop the current node + off the stack of open elements and acknowledge the token's + self-closing flag. + + Otherwise, let the secondary insertion mode be the current + insertion mode, and then switch the insertion mode to "in + foreign content". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "frame", "frameset", "head", "tbody", "td", "tfoot", "th", + "thead", "tr" + Parse error. Ignore the token. + + Any other start tag + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + This element will be a phrasing element. + + Any other end tag + Run the following steps: + + 1. 
Initialize node to be the current node (the bottommost node of + the stack). + 2. If node has the same tag name as the end tag token, then: + 1. Generate implied end tags. + 2. If the tag name of the end tag token does not match the + tag name of the current node, this is a parse error. + 3. Pop all the nodes from the current node up to node, + including node, then stop these steps. + 3. Otherwise, if node is in neither the formatting category nor + the phrasing category, then this is a parse error; ignore the + token, and abort these steps. + 4. Set node to the previous entry in the stack of open elements. + 5. Return to step 2. + + 8.2.5.11 The "in CDATA/RCDATA" insertion mode + + When the insertion mode is "in CDATA/RCDATA", tokens must be handled as + follows: + + A character token + Insert the token's character into the current node. + + An end-of-file token + Parse error. + + If the current node is a script element, mark the script element + as "already executed". + + Pop the current node off the stack of open elements. + + Switch the insertion mode to the original insertion mode and + reprocess the current token. + + An end tag whose tag name is "script" + Let script be the current node (which will be a script element). + + Pop the current node off the stack of open elements. + + Switch the insertion mode to the original insertion mode. + + Let the old insertion point have the same value as the current + insertion point. Let the insertion point be just before the next + input character. + + Increment the parser's script nesting level by one. + + Run the script. This might cause some script to execute, which + might cause new characters to be inserted into the tokeniser, + and might cause the tokeniser to output more tokens, resulting + in a reentrant invocation of the parser. + + Decrement the parser's script nesting level by one. If the + parser's script nesting level is zero, then set the parser pause + flag to false. 
+ + Let the insertion point have the value of the old insertion + point. (In other words, restore the insertion point to the value + it had before the previous paragraph. This value might be the + "undefined" value.) + + At this stage, if there is a pending external script, then: + + If the tree construction stage is being called reentrantly, say + from a call to document.write(): + Set the parser pause flag to true, and abort the + processing of any nested invocations of the tokeniser, + yielding control back to the caller. (Tokenization will + resume when the caller returns to the "outer" tree + construction stage.) + + Otherwise: + Follow these steps: + + 1. Let the script be the pending external script. There is + no longer a pending external script. + 2. Pause until the script has completed loading. + 3. Let the insertion point be just before the next input + character. + 4. Execute the script. + 5. Let the insertion point be undefined again. + 6. If there is once again a pending external script, then + repeat these steps from step 1. + + Any other end tag + Pop the current node off the stack of open elements. + + Switch the insertion mode to the original insertion mode. + + 8.2.5.12 The "in table" insertion mode + + When the insertion mode is "in table", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + If the current table is tainted, then act as described in the + "anything else" entry below. + + Otherwise, insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "caption" + Clear the stack back to a table context. (See below.) + + Insert a marker at the end of the list of active formatting + elements.
+ + Insert an HTML element for the token, then switch the insertion + mode to "in caption". + + A start tag whose tag name is "colgroup" + Clear the stack back to a table context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in column group". + + A start tag whose tag name is "col" + Act as if a start tag token with the tag name "colgroup" had + been seen, then reprocess the current token. + + A start tag whose tag name is one of: "tbody", "tfoot", "thead" + Clear the stack back to a table context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in table body". + + A start tag whose tag name is one of: "td", "th", "tr" + Act as if a start tag token with the tag name "tbody" had been + seen, then reprocess the current token. + + A start tag whose tag name is "table" + Parse error. Act as if an end tag token with the tag name + "table" had been seen, then, if that token wasn't ignored, + reprocess the current token. + + The fake end tag token here can only be ignored in the fragment + case. + + An end tag whose tag name is "table" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Pop elements from this stack until a table element has been + popped from the stack. + + Reset the insertion mode appropriately. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html", "tbody", "td", "tfoot", "th", "thead", "tr" + Parse error. Ignore the token. + + A start tag whose tag name is one of: "style", "script" + If the current table is tainted then act as described in the + "anything else" entry below. + + Otherwise, process the token using the rules for the "in head" + insertion mode. 
+ + A start tag whose tag name is "input" + If the token does not have an attribute with the name "type", or + if it does, but that attribute's value is not an ASCII + case-insensitive match for the string "hidden", or, if the + current table is tainted, then: act as described in the + "anything else" entry below. + + Otherwise: + + Parse error. + + Insert an HTML element for the token. + + Pop that input element off the stack of open elements. + + An end-of-file token + If the current node is not the root html element, then this is a + parse error. + + It can only be the current node in the fragment case. + + Stop parsing. + + Anything else + Parse error. Process the token using the rules for the "in body" + insertion mode, except that if the current node is a table, + tbody, tfoot, thead, or tr element, then, whenever a node would + be inserted into the current node, it must instead be foster + parented. + + When the steps above require the UA to clear the stack back to a table + context, it means that the UA must, while the current node is not a + table element or an html element, pop elements from the stack of open + elements. + + The current node being an html element after this process is a fragment + case. + + 8.2.5.13 The "in caption" insertion mode + + When the insertion mode is "in caption", tokens must be handled as + follows: + + An end tag whose tag name is "caption" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Generate implied end tags. + + Now, if the current node is not a caption element, then this is + a parse error. + + Pop elements from this stack until a caption element has been + popped from the stack. + + Clear the list of active formatting elements up to the last + marker. + + Switch the insertion mode to "in table". 
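The "clear the stack back to a table context" operation defined above can be sketched with the stack of open elements modelled as a list of tag names, current node last. The helper takes the set of context tags as a parameter so that the same loop could serve the table body and table row variants; that generalisation and the function name are assumptions of this sketch.

```python
TABLE_CONTEXT = frozenset({"table"})


def clear_stack_back_to(stack, context_tags):
    """Pop until the current node is a context element or the html element."""
    while stack and stack[-1] not in context_tags and stack[-1] != "html":
        stack.pop()
    return stack


stack = ["html", "body", "table", "tbody", "tr", "td", "b"]
print(clear_stack_back_to(stack, TABLE_CONTEXT))
```

Stopping at the html element covers the fragment case, where no table element is on the stack.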
+ + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "td", "tfoot", "th", "thead", "tr" + + An end tag whose tag name is "table" + Parse error. Act as if an end tag with the tag name "caption" + had been seen, then, if that token wasn't ignored, reprocess the + current token. + + The fake end tag token here can only be ignored in the fragment + case. + + An end tag whose tag name is one of: "body", "col", "colgroup", "html", + "tbody", "td", "tfoot", "th", "thead", "tr" + Parse error. Ignore the token. + + Anything else + Process the token using the rules for the "in body" insertion + mode. + + 8.2.5.14 The "in column group" insertion mode + + When the insertion mode is "in column group", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "col" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + An end tag whose tag name is "colgroup" + If the current node is the root html element, then this is a + parse error; ignore the token. (fragment case) + + Otherwise, pop the current node (which will be a colgroup + element) from the stack of open elements. Switch the insertion + mode to "in table". + + An end tag whose tag name is "col" + Parse error. Ignore the token. + + An end-of-file token + If the current node is the root html element, then stop parsing.
+ (fragment case) + + Otherwise, act as described in the "anything else" entry below. + + Anything else + Act as if an end tag with the tag name "colgroup" had been seen, + and then, if that token wasn't ignored, reprocess the current + token. + + The fake end tag token here can only be ignored in the fragment + case. + + 8.2.5.15 The "in table body" insertion mode + + When the insertion mode is "in table body", tokens must be handled as + follows: + + A start tag whose tag name is "tr" + Clear the stack back to a table body context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in row". + + A start tag whose tag name is one of: "th", "td" + Parse error. Act as if a start tag with the tag name "tr" had + been seen, then reprocess the current token. + + An end tag whose tag name is one of: "tbody", "tfoot", "thead" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. + + Otherwise: + + Clear the stack back to a table body context. (See below.) + + Pop the current node from the stack of open elements. Switch the + insertion mode to "in table". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "tfoot", "thead" + + An end tag whose tag name is "table" + If the stack of open elements does not have a tbody, thead, or + tfoot element in table scope, this is a parse error. Ignore the + token. (fragment case) + + Otherwise: + + Clear the stack back to a table body context. (See below.) + + Act as if an end tag with the same tag name as the current node + ("tbody", "tfoot", or "thead") had been seen, then reprocess the + current token. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html", "td", "th", "tr" + Parse error. Ignore the token. + + Anything else + Process the token using the rules for the "in table" insertion + mode. 
+ + When the steps above require the UA to clear the stack back to a table + body context, it means that the UA must, while the current node is not + a tbody, tfoot, thead, or html element, pop elements from the stack of + open elements. + + The current node being an html element after this process is a fragment + case. + + 8.2.5.16 The "in row" insertion mode + + When the insertion mode is "in row", tokens must be handled as follows: + + A start tag whose tag name is one of: "th", "td" + Clear the stack back to a table row context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in cell". + + Insert a marker at the end of the list of active formatting + elements. + + An end tag whose tag name is "tr" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Clear the stack back to a table row context. (See below.) + + Pop the current node (which will be a tr element) from the stack + of open elements. Switch the insertion mode to "in table body". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "tfoot", "thead", "tr" + + An end tag whose tag name is "table" + Act as if an end tag with the tag name "tr" had been seen, then, + if that token wasn't ignored, reprocess the current token. + + The fake end tag token here can only be ignored in the fragment + case. + + An end tag whose tag name is one of: "tbody", "tfoot", "thead" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. + + Otherwise, act as if an end tag with the tag name "tr" had been + seen, then reprocess the current token. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html", "td", "th" + Parse error. Ignore the token. 
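The "act as if a start tag ... had been seen, then reprocess the current token" steps in the "in table", "in column group", "in table body" and "in row" modes chain together, so a single table-related start tag can imply several elements. The sketch below traces that chain for start tags only; it ignores stack manipulation, parse error reporting and every other token type, and the function name is hypothetical.

```python
def implied_start_tags(mode, tag):
    """Return (implied start tags, insertion mode after `tag` is inserted)."""
    implied = []
    while True:
        if mode == "in table":
            if tag == "caption":
                return implied, "in caption"
            if tag == "colgroup":
                return implied, "in column group"
            if tag in ("tbody", "tfoot", "thead"):
                return implied, "in table body"
            if tag == "col":
                implied.append("colgroup")      # fake colgroup, then reprocess
                mode = "in column group"
                continue
            if tag in ("td", "th", "tr"):
                implied.append("tbody")         # fake tbody, then reprocess
                mode = "in table body"
                continue
        elif mode == "in column group":
            if tag == "col":
                return implied, "in column group"   # col is popped at once
        elif mode == "in table body":
            if tag == "tr":
                return implied, "in row"
            if tag in ("td", "th"):
                implied.append("tr")            # parse error; fake tr, reprocess
                mode = "in row"
                continue
        elif mode == "in row":
            if tag in ("td", "th"):
                return implied, "in cell"
        raise ValueError("not covered by this sketch: %r in %r" % (tag, mode))


print(implied_start_tags("in table", "td"))
```

For a bare `<td>` directly inside `<table>`, the trace shows a tbody and then a tr being implied before the cell itself is inserted.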
+ + Anything else + Process the token using the rules for the "in table" insertion + mode. + + When the steps above require the UA to clear the stack back to a table + row context, it means that the UA must, while the current node is not a + tr element or an html element, pop elements from the stack of open + elements. + + The current node being an html element after this process is a fragment + case. + + 8.2.5.17 The "in cell" insertion mode + + When the insertion mode is "in cell", tokens must be handled as + follows: + + An end tag whose tag name is one of: "td", "th" + If the stack of open elements does not have an element in table + scope with the same tag name as that of the token, then this is + a parse error and the token must be ignored. + + Otherwise: + + Generate implied end tags. + + Now, if the current node is not an element with the same tag + name as the token, then this is a parse error. + + Pop elements from this stack until an element with the same tag + name as the token has been popped from the stack. + + Clear the list of active formatting elements up to the last + marker. + + Switch the insertion mode to "in row". (The current node will be + a tr element at this point.) + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "td", "tfoot", "th", "thead", "tr" + If the stack of open elements does not have a td or th element + in table scope, then this is a parse error; ignore the token. + (fragment case) + + Otherwise, close the cell (see below) and reprocess the current + token. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html" + Parse error. Ignore the token. 
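The "td"/"th" end tag rule of the "in cell" mode above can be sketched as follows, with the stack of open elements as a list of tag names (current node last) and the list of active formatting elements as a list that may contain a marker object. The table scope boundaries and the implied end tag set are defined outside this section and are assumptions here, as are all names.

```python
MARKER = object()  # stand-in for a marker in the active formatting list

# Assumption: elements that end the "in table scope" search.
TABLE_SCOPE_BOUNDARIES = frozenset({"html", "table"})

# Assumption: elements closed by "generate implied end tags".
IMPLIED_END = frozenset({"dd", "dt", "li", "option", "optgroup", "p", "rp", "rt"})


def has_in_table_scope(stack, tag):
    for open_tag in reversed(stack):
        if open_tag == tag:
            return True
        if open_tag in TABLE_SCOPE_BOUNDARIES:
            return False
    return False


def end_cell(stack, active_formatting, tag):
    """Handle a td/th end tag; returns (new mode or None, parse errors)."""
    errors = []
    if not has_in_table_scope(stack, tag):
        errors.append("cell not in table scope; token ignored")
        return None, errors
    while stack[-1] in IMPLIED_END:              # generate implied end tags
        stack.pop()
    if stack[-1] != tag:
        errors.append("current node does not match end tag")
    while stack.pop() != tag:                    # pop through the cell
        pass
    while active_formatting:                     # clear up to the last marker
        if active_formatting.pop() is MARKER:
            break
    return "in row", errors


stack = ["html", "body", "table", "tbody", "tr", "td", "b", "p"]
active = ["i", MARKER, "b"]
print(end_cell(stack, active, "td"), stack, active)
```

The marker pushed when the cell was opened is what keeps formatting elements from outside the cell (the `i` entry here) in the list.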
+ + An end tag whose tag name is one of: "table", "tbody", "tfoot", + "thead", "tr" + If the stack of open elements does not have an element in table + scope with the same tag name as that of the token (which can + only happen for "tbody", "tfoot" and "thead", or, in the + fragment case), then this is a parse error and the token must be + ignored. + + Otherwise, close the cell (see below) and reprocess the current + token. + + Anything else + Process the token using the rules for the "in body" insertion + mode. + + Where the steps above say to close the cell, they mean to run the + following algorithm: + 1. If the stack of open elements has a td element in table scope, then + act as if an end tag token with the tag name "td" had been seen. + 2. Otherwise, the stack of open elements will have a th element in + table scope; act as if an end tag token with the tag name "th" had + been seen. + + The stack of open elements cannot have both a td and a th element in + table scope at the same time, nor can it have neither when the + insertion mode is "in cell". + + 8.2.5.18 The "in select" insertion mode + + When the insertion mode is "in select", tokens must be handled as + follows: + + A character token + Insert the token's character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "option" + If the current node is an option element, act as if an end tag + with the tag name "option" had been seen. + + Insert an HTML element for the token. + + A start tag whose tag name is "optgroup" + If the current node is an option element, act as if an end tag + with the tag name "option" had been seen. 
+ + If the current node is an optgroup element, act as if an end tag + with the tag name "optgroup" had been seen. + + Insert an HTML element for the token. + + An end tag whose tag name is "optgroup" + First, if the current node is an option element, and the node + immediately before it in the stack of open elements is an + optgroup element, then act as if an end tag with the tag name + "option" had been seen. + + If the current node is an optgroup element, then pop that node + from the stack of open elements. Otherwise, this is a parse + error; ignore the token. + + An end tag whose tag name is "option" + If the current node is an option element, then pop that node + from the stack of open elements. Otherwise, this is a parse + error; ignore the token. + + An end tag whose tag name is "select" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Pop elements from the stack of open elements until a select + element has been popped from the stack. + + Reset the insertion mode appropriately. + + A start tag whose tag name is "select" + Parse error. Act as if the token had been an end tag with the + tag name "select" instead. + + A start tag whose tag name is one of: "input", "textarea" + Parse error. Act as if an end tag with the tag name "select" had + been seen, and reprocess the token. + + A start tag token whose tag name is "script" + Process the token using the rules for the "in head" insertion + mode. + + An end-of-file token + If the current node is not the root html element, then this is a + parse error. + + It can only be the current node in the fragment case. + + Stop parsing. + + Anything else + Parse error. Ignore the token. 
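The option and optgroup rules of the "in select" insertion mode above can be sketched as operations on the stack of open elements. This is a minimal illustration, not the parser's actual code: the function name `in_select`, the representation of the stack as a list of tag names, and the boolean return value are all assumptions made for the example, and only the four option/optgroup rules are covered.

```python
def in_select(stack, kind, name):
    """Apply one "in select" rule for option/optgroup tokens.

    `stack` is the stack of open elements as a list of tag names, with the
    current node last (an assumption of this sketch). Returns True if the
    token was handled, False if it was a parse error and ignored.
    """
    def current():
        return stack[-1] if stack else None

    if kind == "start" and name == "option":
        if current() == "option":
            stack.pop()  # act as if an end tag "option" had been seen
        stack.append("option")
        return True

    if kind == "start" and name == "optgroup":
        if current() == "option":
            stack.pop()  # implied end tag "option"
        if current() == "optgroup":
            stack.pop()  # implied end tag "optgroup"
        stack.append("optgroup")
        return True

    if kind == "end" and name == "optgroup":
        # An open option directly inside an optgroup is closed first.
        if current() == "option" and len(stack) >= 2 and stack[-2] == "optgroup":
            stack.pop()
        if current() == "optgroup":
            stack.pop()
            return True
        return False  # parse error; ignore the token

    if kind == "end" and name == "option":
        if current() == "option":
            stack.pop()
            return True
        return False  # parse error; ignore the token

    raise ValueError("token not covered by this sketch")
```

Representing elements by tag name alone is enough here because these four rules only ever inspect the current node and the node immediately before it.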
+ + 8.2.5.19 The "in select in table" insertion mode + + When the insertion mode is "in select in table", tokens must be handled + as follows: + + A start tag whose tag name is one of: "caption", "table", "tbody", + "tfoot", "thead", "tr", "td", "th" + Parse error. Act as if an end tag with the tag name "select" had + been seen, and reprocess the token. + + An end tag whose tag name is one of: "caption", "table", "tbody", + "tfoot", "thead", "tr", "td", "th" + Parse error. + + If the stack of open elements has an element in table scope with + the same tag name as that of the token, then act as if an end + tag with the tag name "select" had been seen, and reprocess the + token. Otherwise, ignore the token. + + Anything else + Process the token using the rules for the "in select" insertion + mode. + + 8.2.5.20 The "in foreign content" insertion mode + + When the insertion mode is "in foreign content", tokens must be handled + as follows: + + A character token + Insert the token's character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mi element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mo element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mn element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an ms element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mtext element in the MathML namespace. + + A start tag, if the current node is an element in the HTML namespace. 
+ An end tag + Process the token using the rules for the secondary insertion + mode. + + If, after doing so, the insertion mode is still "in foreign + content", but there is no element in scope that has a namespace + other than the HTML namespace, switch the insertion mode to the + secondary insertion mode. + + A start tag whose tag name is one of: "b", "big", "blockquote", "body", + "br", "center", "code", "dd", "div", "dl", "dt", "em", "embed", + "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "i", "img", + "li", "listing", "menu", "meta", "nobr", "ol", "p", "pre", + "ruby", "s", "small", "span", "strong", "strike", "sub", "sup", + "table", "tt", "u", "ul", "var" + + A start tag whose tag name is "font", if the token has any attributes + named "color", "face", or "size" + + An end-of-file token + Parse error. + + Pop elements from the stack of open elements until the current + node is in the HTML namespace. + + Switch the insertion mode to the secondary insertion mode, and + reprocess the token. + + Any other start tag + If the current node is an element in the MathML namespace, + adjust MathML attributes for the token. (This fixes the case of + MathML attributes that are not all lowercase.) + + Adjust foreign attributes for the token. (This fixes the use of + namespaced attributes, in particular XLink in SVG.) + + Insert a foreign element for the token, in the same namespace as + the current node. + + If the token has its self-closing flag set, pop the current node + off the stack of open elements and acknowledge the token's + self-closing flag. + + 8.2.5.21 The "after body" insertion mode + + When the insertion mode is "after body", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Process the token using the rules for the "in body" insertion + mode. 
+ + A comment token + Append a Comment node to the first element in the stack of open + elements (the html element), with the data attribute set to the + data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end tag whose tag name is "html" + If the parser was originally created as part of the HTML + fragment parsing algorithm, this is a parse error; ignore the + token. (fragment case) + + Otherwise, switch the insertion mode to "after after body". + + An end-of-file token + Stop parsing. + + Anything else + Parse error. Switch the insertion mode to "in body" and + reprocess the token. + + 8.2.5.22 The "in frameset" insertion mode + + When the insertion mode is "in frameset", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "frameset" + Insert an HTML element for the token. + + An end tag whose tag name is "frameset" + If the current node is the root html element, then this is a + parse error; ignore the token. (fragment case) + + Otherwise, pop the current node from the stack of open elements. + + If the parser was not originally created as part of the HTML + fragment parsing algorithm (fragment case), and the current node + is no longer a frameset element, then switch the insertion mode + to "after frameset". + + A start tag whose tag name is "frame" + Insert an HTML element for the token. 
Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "noframes" + Process the token using the rules for the "in head" insertion + mode. + + An end-of-file token + If the current node is not the root html element, then this is a + parse error. + + It can only be the current node in the fragment case. + + Stop parsing. + + Anything else + Parse error. Ignore the token. + + 8.2.5.23 The "after frameset" insertion mode + + When the insertion mode is "after frameset", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end tag whose tag name is "html" + Switch the insertion mode to "after after frameset". + + A start tag whose tag name is "noframes" + Process the token using the rules for the "in head" insertion + mode. + + An end-of-file token + Stop parsing. + + Anything else + Parse error. Ignore the token. + + This doesn't handle UAs that don't support frames, or that do support + frames but want to show the NOFRAMES content. Supporting the former is + easy; supporting the latter is harder. + + 8.2.5.24 The "after after body" insertion mode + + When the insertion mode is "after after body", tokens must be handled + as follows: + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. 
+ + A DOCTYPE token + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end-of-file token + Stop parsing. + + Anything else + Parse error. Switch the insertion mode to "in body" and + reprocess the token. + + 8.2.5.25 The "after after frameset" insertion mode + + When the insertion mode is "after after frameset", tokens must be + handled as follows: + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end-of-file token + Stop parsing. + + A start tag whose tag name is "noframes" + Process the token using the rules for the "in head" insertion + mode. + + Anything else + Parse error. Ignore the token. + + 8.2.6 The end + + Once the user agent stops parsing the document, the user agent must + follow the steps in this section. + + First, the current document readiness must be set to "interactive". + + Then, the rules for when a script completes loading start applying + (script execution is no longer managed by the parser). + + If any of the scripts in the list of scripts that will execute as soon + as possible have completed loading, or if the list of scripts that will + execute asynchronously is not empty and the first script in that list + has completed loading, then the user agent must act as if those scripts + just completed loading, following the rules given for that in the + script element definition. 
+ + Then, if the list of scripts that will execute when the document has + finished parsing is not empty, and the first item in this list has + already completed loading, then the user agent must act as if that + script just finished loading. + + By this point, there will be no scripts that have loaded but have not + yet been executed. + + The user agent must then fire a simple event called DOMContentLoaded at + the Document. + + Once everything that delays the load event has completed, the user + agent must set the current document readiness to "complete", and then + fire a load event at the body element. + + delaying the load event for things like image loads allows for intranet + port scans (even without javascript!). Should we really encode that + into the spec? + + 8.2.7 Coercing an HTML DOM into an infoset + + When an application uses an HTML parser in conjunction with an XML + pipeline, it is possible that the constructed DOM is not compatible + with the XML tool chain in certain subtle ways. For example, an XML + toolchain might not be able to represent attributes with the name + xmlns, since they conflict with the Namespaces in XML syntax. There is + also some data that the HTML parser generates that isn't included in + the DOM itself. This section specifies some rules for handling these + issues. + + If the XML API being used doesn't support DOCTYPEs, the tool may drop + DOCTYPEs altogether. + + If the XML API doesn't support attributes in no namespace that are + named "xmlns", attributes whose names start with "xmlns:", or + attributes in the XMLNS namespace, then the tool may drop such + attributes. + + The tool may annotate the output with any namespace declarations + required for proper operation. 
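The attribute-dropping rule of section 8.2.7 above can be illustrated with a short filter. This is a sketch under stated assumptions: attributes are modelled as `(namespace, name, value)` tuples with `None` standing for "no namespace", and the function name `drop_xmlns_attributes` is hypothetical. The XMLNS namespace URI is the one this document lists in section 8.3.

```python
# XMLNS namespace URI as listed in section 8.3 of this document.
XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/"


def drop_xmlns_attributes(attrs):
    """Return the attributes a namespace-strict XML API could still accept.

    `attrs` is an iterable of (namespace, name, value) tuples; this tuple
    layout is an assumption of the sketch, not a real DOM API.
    """
    kept = []
    for namespace, name, value in attrs:
        if namespace is None and name == "xmlns":
            continue  # attribute in no namespace named "xmlns"
        if name.startswith("xmlns:"):
            continue  # attribute whose name starts with "xmlns:"
        if namespace == XMLNS_NAMESPACE:
            continue  # attribute in the XMLNS namespace
        kept.append((namespace, name, value))
    return kept
```

A tool that drops these attributes would typically then add back whatever namespace declarations its output format needs, as the preceding paragraph allows.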
+ + If the XML API being used restricts the allowable characters in the + local names of elements and attributes, then the tool may map all + element and attribute local names that the API wouldn't support to a + set of names that are allowed, by replacing any character that isn't + supported with the uppercase letter U and the five digits of the + character's Unicode codepoint when expressed in hexadecimal, using + digits 0-9 and capital letters A-F as the symbols, in increasing + numeric order. + + For example, the element name foo<bar, which can be output by the HTML + parser, though it is neither a legal HTML element name nor a + well-formed XML element name, would be converted into fooU0003Cbar, + which is a well-formed XML element name (though it's still not legal in + HTML by any means). + + As another example, consider the attribute xlink:href. Used on a MathML + element, it becomes, after being adjusted, an attribute with a prefix + "xlink" and a local name "href". However, used on an HTML element, it + becomes an attribute with no prefix and the local name "xlink:href", + which is not a valid NCName, and thus might not be accepted by an XML + API. It could thus get converted, becoming "xlinkU0003Ahref". + + The resulting names from this conversion conveniently can't clash with + any attribute generated by the HTML parser, since those are all either + lowercase or those listed in the adjust foreign attributes algorithm's + table. + + If the XML API restricts comments from having two consecutive U+002D + HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE + character between any such offending characters. + + If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS + character (-), the tool may insert a single U+0020 SPACE character at + the end of such comments. 
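The name and comment coercions described above can be sketched as follows. The set of characters treated as "supported" is an assumption of this example: it uses a deliberately narrow ASCII subset of the XML name rules, whereas a real tool would ask its XML API (or apply the full NCName production). The function names are hypothetical. Code points above U+FFFFF need six hexadecimal digits, a case the five-digit wording above does not address; this sketch simply emits all six.

```python
import string

# Simplified stand-in for "characters the XML API supports" (assumption):
# ASCII letters and "_" anywhere; digits, "-" and "." after the first character.
_NAME_START = set(string.ascii_letters + "_")
_NAME_REST = _NAME_START | set(string.digits + "-.")


def escape_local_name(name):
    """Replace each unsupported character with "U" plus its code point
    written as at least five uppercase hexadecimal digits."""
    out = []
    for index, ch in enumerate(name):
        allowed = _NAME_START if index == 0 else _NAME_REST
        if ch in allowed:
            out.append(ch)
        else:
            out.append("U%05X" % ord(ch))
    return "".join(out)


def coerce_comment(data):
    """Separate consecutive hyphens with a space and pad a trailing hyphen."""
    while "--" in data:
        data = data.replace("--", "- -")
    if data.endswith("-"):
        data += " "
    return data
```

Applied to the names used as examples in the text, `escape_local_name("foo<bar")` gives `fooU0003Cbar` and `escape_local_name("xlink:href")` gives `xlinkU0003Ahref`.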
+ + If the XML API restricts allowed characters in character data, the tool + may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE + character, and any other literal non-XML character with a U+FFFD + REPLACEMENT CHARACTER. + + If the tool has no way to convey out-of-band information, then the tool + may drop the following information: + * Whether the document is set to no quirks mode, limited quirks mode, + or quirks mode + * The association between form controls and forms that aren't their + nearest form element ancestor (use of the form element pointer in + the parser) + + The mutations allowed by this section apply after the HTML parser's + rules have been applied. For example, a <a::> start tag will be closed + by a </a::> end tag, and never by a </aU0003AU0003A> end tag, even if + the user agent is using the rules above to then generate an actual + element in the DOM with the name aU0003AU0003A for that start tag. + + 8.3 Namespaces + + The HTML namespace is: http://www.w3.org/1999/xhtml + + The MathML namespace is: http://www.w3.org/1998/Math/MathML + + The SVG namespace is: http://www.w3.org/2000/svg + + The XLink namespace is: http://www.w3.org/1999/xlink + + The XML namespace is: http://www.w3.org/XML/1998/namespace + + The XMLNS namespace is: http://www.w3.org/2000/xmlns/ |
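Returning to the character data rule of section 8.2.7 (form feed becomes a space, any other non-XML character becomes U+FFFD), a minimal sketch might look like the following. It assumes that "non-XML character" means a character outside the Char production of XML 1.0; the text above does not spell this out, so treat that reading, and the function names, as assumptions of the example.

```python
def is_xml_char(code_point):
    """True if the code point matches the XML 1.0 Char production
    (assumed here to be what "XML character" refers to)."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def coerce_character_data(text):
    """Map U+000C FORM FEED to U+0020 SPACE and every other character
    that is not an XML character to U+FFFD REPLACEMENT CHARACTER."""
    out = []
    for ch in text:
        if ch == "\f":
            out.append(" ")
        elif is_xml_char(ord(ch)):
            out.append(ch)
        else:
            out.append("\ufffd")
    return "".join(out)
```

Form feed is checked first because it is itself outside the Char production and would otherwise be replaced with U+FFFD rather than a space.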