From 6168dbe21f5f83b906e562ea0ab232d499b275a6 Mon Sep 17 00:00:00 2001
From: "Matt A. Tobin"
Date: Wed, 15 Jan 2020 14:56:04 -0500
Subject: Add java htmlparser sources that match the original 52-level state

https://hg.mozilla.org/projects/htmlparser/
Commit: abe62ab2a9b69ccb3b5d8a231ec1ae11154c571d
---
 parser/html/java/htmlparser/doc/README | 15 +
 .../htmlparser/doc/named-character-references.html | 4 +
 parser/html/java/htmlparser/doc/tokenization.txt | 1147 ++++++++++
 .../html/java/htmlparser/doc/tree-construction.txt | 2201 ++++++++++++++++++++
 4 files changed, 3367 insertions(+)
 create mode 100644 parser/html/java/htmlparser/doc/README
 create mode 100644 parser/html/java/htmlparser/doc/named-character-references.html
 create mode 100644 parser/html/java/htmlparser/doc/tokenization.txt
 create mode 100644 parser/html/java/htmlparser/doc/tree-construction.txt

(limited to 'parser/html/java/htmlparser/doc')

diff --git a/parser/html/java/htmlparser/doc/README b/parser/html/java/htmlparser/doc/README
new file mode 100644
index 000000000..e0132a41e
--- /dev/null
+++ b/parser/html/java/htmlparser/doc/README
@@ -0,0 +1,15 @@
+tokenization.txt represents the state of the spec implemented in Tokenizer.java.
+
+To get a diffable version corresponding to the current spec:
+lynx -display_charset=utf-8 -dump -nolist http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html > current.txt
+
+tree-construction.txt represents the state of the spec implemented in TreeBuilder.java.
+
+To get a diffable version corresponding to the current spec:
+lynx -display_charset=utf-8 -dump -nolist http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html > current.txt
+
+
+The text of the files in this directory comes from the WHATWG HTML 5 spec
+which carries the following notice:
+© Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and Opera Software ASA.
+You are granted a license to use, reproduce and create derivative works of this document.
diff --git a/parser/html/java/htmlparser/doc/named-character-references.html b/parser/html/java/htmlparser/doc/named-character-references.html
new file mode 100644
index 000000000..5f05a991f
--- /dev/null
+++ b/parser/html/java/htmlparser/doc/named-character-references.html
@@ -0,0 +1,4 @@
+
+
+
Name Character(s) Glyph
AElig; U+000C6 Æ
AMP; U+00026 &
Aacute; U+000C1 Á
Abreve; U+00102 Ă
Acirc; U+000C2 Â
Acy; U+00410 А
Afr; U+1D504 𝔄
Agrave; U+000C0 À
Alpha; U+00391 Α
Amacr; U+00100 Ā
And; U+02A53
Aogon; U+00104 Ą
Aopf; U+1D538 𝔸
ApplyFunction; U+02061
Aring; U+000C5 Å
Ascr; U+1D49C 𝒜
Assign; U+02254
Atilde; U+000C3 Ã
Auml; U+000C4 Ä
Backslash; U+02216
Barv; U+02AE7
Barwed; U+02306
Bcy; U+00411 Б
Because; U+02235
Bernoullis; U+0212C
Beta; U+00392 Β
Bfr; U+1D505 𝔅
Bopf; U+1D539 𝔹
Breve; U+002D8 ˘
Bscr; U+0212C
Bumpeq; U+0224E
CHcy; U+00427 Ч
COPY; U+000A9 ©
Cacute; U+00106 Ć
Cap; U+022D2
CapitalDifferentialD; U+02145
Cayleys; U+0212D
Ccaron; U+0010C Č
Ccedil; U+000C7 Ç
Ccirc; U+00108 Ĉ
Cconint; U+02230
Cdot; U+0010A Ċ
Cedilla; U+000B8 ¸
CenterDot; U+000B7 ·
Cfr; U+0212D
Chi; U+003A7 Χ
CircleDot; U+02299
CircleMinus; U+02296
CirclePlus; U+02295
CircleTimes; U+02297
ClockwiseContourIntegral; U+02232
CloseCurlyDoubleQuote; U+0201D
CloseCurlyQuote; U+02019
Colon; U+02237
Colone; U+02A74
Congruent; U+02261
Conint; U+0222F
ContourIntegral; U+0222E
Copf; U+02102
Coproduct; U+02210
CounterClockwiseContourIntegral; U+02233
Cross; U+02A2F
Cscr; U+1D49E 𝒞
Cup; U+022D3
CupCap; U+0224D
DD; U+02145
DDotrahd; U+02911
DJcy; U+00402 Ђ
DScy; U+00405 Ѕ
DZcy; U+0040F Џ
Dagger; U+02021
Darr; U+021A1
Dashv; U+02AE4
Dcaron; U+0010E Ď
Dcy; U+00414 Д
Del; U+02207
Delta; U+00394 Δ
Dfr; U+1D507 𝔇
DiacriticalAcute; U+000B4 ´
DiacriticalDot; U+002D9 ˙
DiacriticalDoubleAcute; U+002DD ˝
DiacriticalGrave; U+00060 `
DiacriticalTilde; U+002DC ˜
Diamond; U+022C4
DifferentialD; U+02146
Dopf; U+1D53B 𝔻
Dot; U+000A8 ¨
DotDot; U+020DC ◌⃜
DotEqual; U+02250
DoubleContourIntegral; U+0222F
DoubleDot; U+000A8 ¨
DoubleDownArrow; U+021D3
DoubleLeftArrow; U+021D0
DoubleLeftRightArrow; U+021D4
DoubleLeftTee; U+02AE4
DoubleLongLeftArrow; U+027F8
DoubleLongLeftRightArrow; U+027FA
DoubleLongRightArrow; U+027F9
DoubleRightArrow; U+021D2
DoubleRightTee; U+022A8
DoubleUpArrow; U+021D1
DoubleUpDownArrow; U+021D5
DoubleVerticalBar; U+02225
DownArrow; U+02193
DownArrowBar; U+02913
DownArrowUpArrow; U+021F5
DownBreve; U+00311 ◌̑
DownLeftRightVector; U+02950
DownLeftTeeVector; U+0295E
DownLeftVector; U+021BD
DownLeftVectorBar; U+02956
DownRightTeeVector; U+0295F
DownRightVector; U+021C1
DownRightVectorBar; U+02957
DownTee; U+022A4
DownTeeArrow; U+021A7
Downarrow; U+021D3
Dscr; U+1D49F 𝒟
Dstrok; U+00110 Đ
ENG; U+0014A Ŋ
ETH; U+000D0 Ð
Eacute; U+000C9 É
Ecaron; U+0011A Ě
Ecirc; U+000CA Ê
Ecy; U+0042D Э
Edot; U+00116 Ė
Efr; U+1D508 𝔈
Egrave; U+000C8 È
Element; U+02208
Emacr; U+00112 Ē
EmptySmallSquare; U+025FB
EmptyVerySmallSquare; U+025AB
Eogon; U+00118 Ę
Eopf; U+1D53C 𝔼
Epsilon; U+00395 Ε
Equal; U+02A75
EqualTilde; U+02242
Equilibrium; U+021CC
Escr; U+02130
Esim; U+02A73
Eta; U+00397 Η
Euml; U+000CB Ë
Exists; U+02203
ExponentialE; U+02147
Fcy; U+00424 Ф
Ffr; U+1D509 𝔉
FilledSmallSquare; U+025FC
FilledVerySmallSquare; U+025AA
Fopf; U+1D53D 𝔽
ForAll; U+02200
Fouriertrf; U+02131
Fscr; U+02131
GJcy; U+00403 Ѓ
GT; U+0003E >
Gamma; U+00393 Γ
Gammad; U+003DC Ϝ
Gbreve; U+0011E Ğ
Gcedil; U+00122 Ģ
Gcirc; U+0011C Ĝ
Gcy; U+00413 Г
Gdot; U+00120 Ġ
Gfr; U+1D50A 𝔊
Gg; U+022D9
Gopf; U+1D53E 𝔾
GreaterEqual; U+02265
GreaterEqualLess; U+022DB
GreaterFullEqual; U+02267
GreaterGreater; U+02AA2
GreaterLess; U+02277
GreaterSlantEqual; U+02A7E
GreaterTilde; U+02273
Gscr; U+1D4A2 𝒢
Gt; U+0226B
HARDcy; U+0042A Ъ
Hacek; U+002C7 ˇ
Hat; U+0005E ^
Hcirc; U+00124 Ĥ
Hfr; U+0210C
HilbertSpace; U+0210B
Hopf; U+0210D
HorizontalLine; U+02500
Hscr; U+0210B
Hstrok; U+00126 Ħ
HumpDownHump; U+0224E
HumpEqual; U+0224F
IEcy; U+00415 Е
IJlig; U+00132 Ĳ
IOcy; U+00401 Ё
Iacute; U+000CD Í
Icirc; U+000CE Î
Icy; U+00418 И
Idot; U+00130 İ
Ifr; U+02111
Igrave; U+000CC Ì
Im; U+02111
Imacr; U+0012A Ī
ImaginaryI; U+02148
Implies; U+021D2
Int; U+0222C
Integral; U+0222B
Intersection; U+022C2
InvisibleComma; U+02063
InvisibleTimes; U+02062
Iogon; U+0012E Į
Iopf; U+1D540 𝕀
Iota; U+00399 Ι
Iscr; U+02110
Itilde; U+00128 Ĩ
Iukcy; U+00406 І
Iuml; U+000CF Ï
Jcirc; U+00134 Ĵ
Jcy; U+00419 Й
Jfr; U+1D50D 𝔍
Jopf; U+1D541 𝕁
Jscr; U+1D4A5 𝒥
Jsercy; U+00408 Ј
Jukcy; U+00404 Є
KHcy; U+00425 Х
KJcy; U+0040C Ќ
Kappa; U+0039A Κ
Kcedil; U+00136 Ķ
Kcy; U+0041A К
Kfr; U+1D50E 𝔎
Kopf; U+1D542 𝕂
Kscr; U+1D4A6 𝒦
LJcy; U+00409 Љ
LT; U+0003C <
Lacute; U+00139 Ĺ
Lambda; U+0039B Λ
Lang; U+027EA
Laplacetrf; U+02112
Larr; U+0219E
Lcaron; U+0013D Ľ
Lcedil; U+0013B Ļ
Lcy; U+0041B Л
LeftAngleBracket; U+027E8
LeftArrow; U+02190
LeftArrowBar; U+021E4
LeftArrowRightArrow; U+021C6
LeftCeiling; U+02308
LeftDoubleBracket; U+027E6
LeftDownTeeVector; U+02961
LeftDownVector; U+021C3
LeftDownVectorBar; U+02959
LeftFloor; U+0230A
LeftRightArrow; U+02194
LeftRightVector; U+0294E
LeftTee; U+022A3
LeftTeeArrow; U+021A4
LeftTeeVector; U+0295A
LeftTriangle; U+022B2
LeftTriangleBar; U+029CF
LeftTriangleEqual; U+022B4
LeftUpDownVector; U+02951
LeftUpTeeVector; U+02960
LeftUpVector; U+021BF
LeftUpVectorBar; U+02958
LeftVector; U+021BC
LeftVectorBar; U+02952
Leftarrow; U+021D0
Leftrightarrow; U+021D4
LessEqualGreater; U+022DA
LessFullEqual; U+02266
LessGreater; U+02276
LessLess; U+02AA1
LessSlantEqual; U+02A7D
LessTilde; U+02272
Lfr; U+1D50F 𝔏
Ll; U+022D8
Lleftarrow; U+021DA
Lmidot; U+0013F Ŀ
LongLeftArrow; U+027F5
LongLeftRightArrow; U+027F7
LongRightArrow; U+027F6
Longleftarrow; U+027F8
Longleftrightarrow; U+027FA
Longrightarrow; U+027F9
Lopf; U+1D543 𝕃
LowerLeftArrow; U+02199
LowerRightArrow; U+02198
Lscr; U+02112
Lsh; U+021B0
Lstrok; U+00141 Ł
Lt; U+0226A
Map; U+02905
Mcy; U+0041C М
MediumSpace; U+0205F
Mellintrf; U+02133
Mfr; U+1D510 𝔐
MinusPlus; U+02213
Mopf; U+1D544 𝕄
Mscr; U+02133
Mu; U+0039C Μ
NJcy; U+0040A Њ
Nacute; U+00143 Ń
Ncaron; U+00147 Ň
Ncedil; U+00145 Ņ
Ncy; U+0041D Н
NegativeMediumSpace; U+0200B
NegativeThickSpace; U+0200B
NegativeThinSpace; U+0200B
NegativeVeryThinSpace; U+0200B
NestedGreaterGreater; U+0226B
NestedLessLess; U+0226A
NewLine; U+0000A
Nfr; U+1D511 𝔑
NoBreak; U+02060
NonBreakingSpace; U+000A0  
Nopf; U+02115
Not; U+02AEC
NotCongruent; U+02262
NotCupCap; U+0226D
NotDoubleVerticalBar; U+02226
NotElement; U+02209
NotEqual; U+02260
NotEqualTilde; U+02242 U+00338 ≂̸
NotExists; U+02204
NotGreater; U+0226F
NotGreaterEqual; U+02271
NotGreaterFullEqual; U+02267 U+00338 ≧̸
NotGreaterGreater; U+0226B U+00338 ≫̸
NotGreaterLess; U+02279
NotGreaterSlantEqual; U+02A7E U+00338 ⩾̸
NotGreaterTilde; U+02275
NotHumpDownHump; U+0224E U+00338 ≎̸
NotHumpEqual; U+0224F U+00338 ≏̸
NotLeftTriangle; U+022EA
NotLeftTriangleBar; U+029CF U+00338 ⧏̸
NotLeftTriangleEqual; U+022EC
NotLess; U+0226E
NotLessEqual; U+02270
NotLessGreater; U+02278
NotLessLess; U+0226A U+00338 ≪̸
NotLessSlantEqual; U+02A7D U+00338 ⩽̸
NotLessTilde; U+02274
NotNestedGreaterGreater; U+02AA2 U+00338 ⪢̸
NotNestedLessLess; U+02AA1 U+00338 ⪡̸
NotPrecedes; U+02280
NotPrecedesEqual; U+02AAF U+00338 ⪯̸
NotPrecedesSlantEqual; U+022E0
NotReverseElement; U+0220C
NotRightTriangle; U+022EB
NotRightTriangleBar; U+029D0 U+00338 ⧐̸
NotRightTriangleEqual; U+022ED
NotSquareSubset; U+0228F U+00338 ⊏̸
NotSquareSubsetEqual; U+022E2
NotSquareSuperset; U+02290 U+00338 ⊐̸
NotSquareSupersetEqual; U+022E3
NotSubset; U+02282 U+020D2 ⊂⃒
NotSubsetEqual; U+02288
NotSucceeds; U+02281
NotSucceedsEqual; U+02AB0 U+00338 ⪰̸
NotSucceedsSlantEqual; U+022E1
NotSucceedsTilde; U+0227F U+00338 ≿̸
NotSuperset; U+02283 U+020D2 ⊃⃒
NotSupersetEqual; U+02289
NotTilde; U+02241
NotTildeEqual; U+02244
NotTildeFullEqual; U+02247
NotTildeTilde; U+02249
NotVerticalBar; U+02224
Nscr; U+1D4A9 𝒩
Ntilde; U+000D1 Ñ
Nu; U+0039D Ν
OElig; U+00152 Œ
Oacute; U+000D3 Ó
Ocirc; U+000D4 Ô
Ocy; U+0041E О
Odblac; U+00150 Ő
Ofr; U+1D512 𝔒
Ograve; U+000D2 Ò
Omacr; U+0014C Ō
Omega; U+003A9 Ω
Omicron; U+0039F Ο
Oopf; U+1D546 𝕆
OpenCurlyDoubleQuote; U+0201C
OpenCurlyQuote; U+02018
Or; U+02A54
Oscr; U+1D4AA 𝒪
Oslash; U+000D8 Ø
Otilde; U+000D5 Õ
Otimes; U+02A37
Ouml; U+000D6 Ö
OverBar; U+0203E
OverBrace; U+023DE
OverBracket; U+023B4
OverParenthesis; U+023DC
PartialD; U+02202
Pcy; U+0041F П
Pfr; U+1D513 𝔓
Phi; U+003A6 Φ
Pi; U+003A0 Π
PlusMinus; U+000B1 ±
Poincareplane; U+0210C
Popf; U+02119
Pr; U+02ABB
Precedes; U+0227A
PrecedesEqual; U+02AAF
PrecedesSlantEqual; U+0227C
PrecedesTilde; U+0227E
Prime; U+02033
Product; U+0220F
Proportion; U+02237
Proportional; U+0221D
Pscr; U+1D4AB 𝒫
Psi; U+003A8 Ψ
QUOT; U+00022 "
Qfr; U+1D514 𝔔
Qopf; U+0211A
Qscr; U+1D4AC 𝒬
RBarr; U+02910
REG; U+000AE ®
Racute; U+00154 Ŕ
Rang; U+027EB
Rarr; U+021A0
Rarrtl; U+02916
Rcaron; U+00158 Ř
Rcedil; U+00156 Ŗ
Rcy; U+00420 Р
Re; U+0211C
ReverseElement; U+0220B
ReverseEquilibrium; U+021CB
ReverseUpEquilibrium; U+0296F
Rfr; U+0211C
Rho; U+003A1 Ρ
RightAngleBracket; U+027E9
RightArrow; U+02192
RightArrowBar; U+021E5
RightArrowLeftArrow; U+021C4
RightCeiling; U+02309
RightDoubleBracket; U+027E7
RightDownTeeVector; U+0295D
RightDownVector; U+021C2
RightDownVectorBar; U+02955
RightFloor; U+0230B
RightTee; U+022A2
RightTeeArrow; U+021A6
RightTeeVector; U+0295B
RightTriangle; U+022B3
RightTriangleBar; U+029D0
RightTriangleEqual; U+022B5
RightUpDownVector; U+0294F
RightUpTeeVector; U+0295C
RightUpVector; U+021BE
RightUpVectorBar; U+02954
RightVector; U+021C0
RightVectorBar; U+02953
Rightarrow; U+021D2
Ropf; U+0211D
RoundImplies; U+02970
Rrightarrow; U+021DB
Rscr; U+0211B
Rsh; U+021B1
RuleDelayed; U+029F4
SHCHcy; U+00429 Щ
SHcy; U+00428 Ш
SOFTcy; U+0042C Ь
Sacute; U+0015A Ś
Sc; U+02ABC
Scaron; U+00160 Š
Scedil; U+0015E Ş
Scirc; U+0015C Ŝ
Scy; U+00421 С
Sfr; U+1D516 𝔖
ShortDownArrow; U+02193
ShortLeftArrow; U+02190
ShortRightArrow; U+02192
ShortUpArrow; U+02191
Sigma; U+003A3 Σ
SmallCircle; U+02218
Sopf; U+1D54A 𝕊
Sqrt; U+0221A
Square; U+025A1
SquareIntersection; U+02293
SquareSubset; U+0228F
SquareSubsetEqual; U+02291
SquareSuperset; U+02290
SquareSupersetEqual; U+02292
SquareUnion; U+02294
Sscr; U+1D4AE 𝒮
Star; U+022C6
Sub; U+022D0
Subset; U+022D0
SubsetEqual; U+02286
Succeeds; U+0227B
SucceedsEqual; U+02AB0
SucceedsSlantEqual; U+0227D
SucceedsTilde; U+0227F
SuchThat; U+0220B
Sum; U+02211
Sup; U+022D1
Superset; U+02283
SupersetEqual; U+02287
Supset; U+022D1
THORN; U+000DE Þ
TRADE; U+02122
TSHcy; U+0040B Ћ
TScy; U+00426 Ц
Tab; U+00009
Tau; U+003A4 Τ
Tcaron; U+00164 Ť
Tcedil; U+00162 Ţ
Tcy; U+00422 Т
Tfr; U+1D517 𝔗
Therefore; U+02234
Theta; U+00398 Θ
ThickSpace; U+0205F U+0200A   
ThinSpace; U+02009
Tilde; U+0223C
TildeEqual; U+02243
TildeFullEqual; U+02245
TildeTilde; U+02248
Topf; U+1D54B 𝕋
TripleDot; U+020DB ◌⃛
Tscr; U+1D4AF 𝒯
Tstrok; U+00166 Ŧ
Uacute; U+000DA Ú
Uarr; U+0219F
Uarrocir; U+02949
Ubrcy; U+0040E Ў
Ubreve; U+0016C Ŭ
Ucirc; U+000DB Û
Ucy; U+00423 У
Udblac; U+00170 Ű
Ufr; U+1D518 𝔘
Ugrave; U+000D9 Ù
Umacr; U+0016A Ū
UnderBar; U+0005F _
UnderBrace; U+023DF
UnderBracket; U+023B5
UnderParenthesis; U+023DD
Union; U+022C3
UnionPlus; U+0228E
Uogon; U+00172 Ų
Uopf; U+1D54C 𝕌
UpArrow; U+02191
UpArrowBar; U+02912
UpArrowDownArrow; U+021C5
UpDownArrow; U+02195
UpEquilibrium; U+0296E
UpTee; U+022A5
UpTeeArrow; U+021A5
Uparrow; U+021D1
Updownarrow; U+021D5
UpperLeftArrow; U+02196
UpperRightArrow; U+02197
Upsi; U+003D2 ϒ
Upsilon; U+003A5 Υ
Uring; U+0016E Ů
Uscr; U+1D4B0 𝒰
Utilde; U+00168 Ũ
Uuml; U+000DC Ü
VDash; U+022AB
Vbar; U+02AEB
Vcy; U+00412 В
Vdash; U+022A9
Vdashl; U+02AE6
Vee; U+022C1
Verbar; U+02016
Vert; U+02016
VerticalBar; U+02223
VerticalLine; U+0007C |
VerticalSeparator; U+02758
VerticalTilde; U+02240
VeryThinSpace; U+0200A
Vfr; U+1D519 𝔙
Vopf; U+1D54D 𝕍
Vscr; U+1D4B1 𝒱
Vvdash; U+022AA
Wcirc; U+00174 Ŵ
Wedge; U+022C0
Wfr; U+1D51A 𝔚
Wopf; U+1D54E 𝕎
Wscr; U+1D4B2 𝒲
Xfr; U+1D51B 𝔛
Xi; U+0039E Ξ
Xopf; U+1D54F 𝕏
Xscr; U+1D4B3 𝒳
YAcy; U+0042F Я
YIcy; U+00407 Ї
YUcy; U+0042E Ю
Yacute; U+000DD Ý
Ycirc; U+00176 Ŷ
Ycy; U+0042B Ы
Yfr; U+1D51C 𝔜
Yopf; U+1D550 𝕐
Yscr; U+1D4B4 𝒴
Yuml; U+00178 Ÿ
ZHcy; U+00416 Ж
Zacute; U+00179 Ź
Zcaron; U+0017D Ž
Zcy; U+00417 З
Zdot; U+0017B Ż
ZeroWidthSpace; U+0200B
Zeta; U+00396 Ζ
Zfr; U+02128
Zopf; U+02124
Zscr; U+1D4B5 𝒵
aacute; U+000E1 á
abreve; U+00103 ă
ac; U+0223E
acE; U+0223E U+00333 ∾̳
acd; U+0223F
acirc; U+000E2 â
acute; U+000B4 ´
acy; U+00430 а
aelig; U+000E6 æ
af; U+02061
afr; U+1D51E 𝔞
agrave; U+000E0 à
alefsym; U+02135
aleph; U+02135
alpha; U+003B1 α
amacr; U+00101 ā
amalg; U+02A3F ⨿
amp; U+00026 &
and; U+02227
andand; U+02A55
andd; U+02A5C
andslope; U+02A58
andv; U+02A5A
ang; U+02220
ange; U+029A4
angle; U+02220
angmsd; U+02221
angmsdaa; U+029A8
angmsdab; U+029A9
angmsdac; U+029AA
angmsdad; U+029AB
angmsdae; U+029AC
angmsdaf; U+029AD
angmsdag; U+029AE
angmsdah; U+029AF
angrt; U+0221F
angrtvb; U+022BE
angrtvbd; U+0299D
angsph; U+02222
angst; U+000C5 Å
angzarr; U+0237C
aogon; U+00105 ą
aopf; U+1D552 𝕒
ap; U+02248
apE; U+02A70
apacir; U+02A6F
ape; U+0224A
apid; U+0224B
apos; U+00027 '
approx; U+02248
approxeq; U+0224A
aring; U+000E5 å
ascr; U+1D4B6 𝒶
ast; U+0002A *
asymp; U+02248
asympeq; U+0224D
atilde; U+000E3 ã
auml; U+000E4 ä
awconint; U+02233
awint; U+02A11
bNot; U+02AED
backcong; U+0224C
backepsilon; U+003F6 ϶
backprime; U+02035
backsim; U+0223D
backsimeq; U+022CD
barvee; U+022BD
barwed; U+02305
barwedge; U+02305
bbrk; U+023B5
bbrktbrk; U+023B6
bcong; U+0224C
bcy; U+00431 б
bdquo; U+0201E
becaus; U+02235
because; U+02235
bemptyv; U+029B0
bepsi; U+003F6 ϶
bernou; U+0212C
beta; U+003B2 β
beth; U+02136
between; U+0226C
bfr; U+1D51F 𝔟
bigcap; U+022C2
bigcirc; U+025EF
bigcup; U+022C3
bigodot; U+02A00
bigoplus; U+02A01
bigotimes; U+02A02
bigsqcup; U+02A06
bigstar; U+02605
bigtriangledown; U+025BD
bigtriangleup; U+025B3
biguplus; U+02A04
bigvee; U+022C1
bigwedge; U+022C0
bkarow; U+0290D
blacklozenge; U+029EB
blacksquare; U+025AA
blacktriangle; U+025B4
blacktriangledown; U+025BE
blacktriangleleft; U+025C2
blacktriangleright; U+025B8
blank; U+02423
blk12; U+02592
blk14; U+02591
blk34; U+02593
block; U+02588
bne; U+0003D U+020E5 =⃥
bnequiv; U+02261 U+020E5 ≡⃥
bnot; U+02310
bopf; U+1D553 𝕓
bot; U+022A5
bottom; U+022A5
bowtie; U+022C8
boxDL; U+02557
boxDR; U+02554
boxDl; U+02556
boxDr; U+02553
boxH; U+02550
boxHD; U+02566
boxHU; U+02569
boxHd; U+02564
boxHu; U+02567
boxUL; U+0255D
boxUR; U+0255A
boxUl; U+0255C
boxUr; U+02559
boxV; U+02551
boxVH; U+0256C
boxVL; U+02563
boxVR; U+02560
boxVh; U+0256B
boxVl; U+02562
boxVr; U+0255F
boxbox; U+029C9
boxdL; U+02555
boxdR; U+02552
boxdl; U+02510
boxdr; U+0250C
boxh; U+02500
boxhD; U+02565
boxhU; U+02568
boxhd; U+0252C
boxhu; U+02534
boxminus; U+0229F
boxplus; U+0229E
boxtimes; U+022A0
boxuL; U+0255B
boxuR; U+02558
boxul; U+02518
boxur; U+02514
boxv; U+02502
boxvH; U+0256A
boxvL; U+02561
boxvR; U+0255E
boxvh; U+0253C
boxvl; U+02524
boxvr; U+0251C
bprime; U+02035
breve; U+002D8 ˘
brvbar; U+000A6 ¦
bscr; U+1D4B7 𝒷
bsemi; U+0204F
bsim; U+0223D
bsime; U+022CD
bsol; U+0005C \
bsolb; U+029C5
bsolhsub; U+027C8
bull; U+02022
bullet; U+02022
bump; U+0224E
bumpE; U+02AAE
bumpe; U+0224F
bumpeq; U+0224F
cacute; U+00107 ć
cap; U+02229
capand; U+02A44
capbrcup; U+02A49
capcap; U+02A4B
capcup; U+02A47
capdot; U+02A40
caps; U+02229 U+0FE00 ∩︀
caret; U+02041
caron; U+002C7 ˇ
ccaps; U+02A4D
ccaron; U+0010D č
ccedil; U+000E7 ç
ccirc; U+00109 ĉ
ccups; U+02A4C
ccupssm; U+02A50
cdot; U+0010B ċ
cedil; U+000B8 ¸
cemptyv; U+029B2
cent; U+000A2 ¢
centerdot; U+000B7 ·
cfr; U+1D520 𝔠
chcy; U+00447 ч
check; U+02713
checkmark; U+02713
chi; U+003C7 χ
cir; U+025CB
cirE; U+029C3
circ; U+002C6 ˆ
circeq; U+02257
circlearrowleft; U+021BA
circlearrowright; U+021BB
circledR; U+000AE ®
circledS; U+024C8
circledast; U+0229B
circledcirc; U+0229A
circleddash; U+0229D
cire; U+02257
cirfnint; U+02A10
cirmid; U+02AEF
cirscir; U+029C2
clubs; U+02663
clubsuit; U+02663
colon; U+0003A :
colone; U+02254
coloneq; U+02254
comma; U+0002C ,
commat; U+00040 @
comp; U+02201
compfn; U+02218
complement; U+02201
complexes; U+02102
cong; U+02245
congdot; U+02A6D
conint; U+0222E
copf; U+1D554 𝕔
coprod; U+02210
copy; U+000A9 ©
copysr; U+02117
crarr; U+021B5
cross; U+02717
cscr; U+1D4B8 𝒸
csub; U+02ACF
csube; U+02AD1
csup; U+02AD0
csupe; U+02AD2
ctdot; U+022EF
cudarrl; U+02938
cudarrr; U+02935
cuepr; U+022DE
cuesc; U+022DF
cularr; U+021B6
cularrp; U+0293D
cup; U+0222A
cupbrcap; U+02A48
cupcap; U+02A46
cupcup; U+02A4A
cupdot; U+0228D
cupor; U+02A45
cups; U+0222A U+0FE00 ∪︀
curarr; U+021B7
curarrm; U+0293C
curlyeqprec; U+022DE
curlyeqsucc; U+022DF
curlyvee; U+022CE
curlywedge; U+022CF
curren; U+000A4 ¤
curvearrowleft; U+021B6
curvearrowright; U+021B7
cuvee; U+022CE
cuwed; U+022CF
cwconint; U+02232
cwint; U+02231
cylcty; U+0232D
dArr; U+021D3
dHar; U+02965
dagger; U+02020
daleth; U+02138
darr; U+02193
dash; U+02010
dashv; U+022A3
dbkarow; U+0290F
dblac; U+002DD ˝
dcaron; U+0010F ď
dcy; U+00434 д
dd; U+02146
ddagger; U+02021
ddarr; U+021CA
ddotseq; U+02A77
deg; U+000B0 °
delta; U+003B4 δ
demptyv; U+029B1
dfisht; U+0297F ⥿
dfr; U+1D521 𝔡
dharl; U+021C3
dharr; U+021C2
diam; U+022C4
diamond; U+022C4
diamondsuit; U+02666
diams; U+02666
die; U+000A8 ¨
digamma; U+003DD ϝ
disin; U+022F2
div; U+000F7 ÷
divide; U+000F7 ÷
divideontimes; U+022C7
divonx; U+022C7
djcy; U+00452 ђ
dlcorn; U+0231E
dlcrop; U+0230D
dollar; U+00024 $
dopf; U+1D555 𝕕
dot; U+002D9 ˙
doteq; U+02250
doteqdot; U+02251
dotminus; U+02238
dotplus; U+02214
dotsquare; U+022A1
doublebarwedge; U+02306
downarrow; U+02193
downdownarrows; U+021CA
downharpoonleft; U+021C3
downharpoonright; U+021C2
drbkarow; U+02910
drcorn; U+0231F
drcrop; U+0230C
dscr; U+1D4B9 𝒹
dscy; U+00455 ѕ
dsol; U+029F6
dstrok; U+00111 đ
dtdot; U+022F1
dtri; U+025BF
dtrif; U+025BE
duarr; U+021F5
duhar; U+0296F
dwangle; U+029A6
dzcy; U+0045F џ
dzigrarr; U+027FF
eDDot; U+02A77
eDot; U+02251
eacute; U+000E9 é
easter; U+02A6E
ecaron; U+0011B ě
ecir; U+02256
ecirc; U+000EA ê
ecolon; U+02255
ecy; U+0044D э
edot; U+00117 ė
ee; U+02147
efDot; U+02252
efr; U+1D522 𝔢
eg; U+02A9A
egrave; U+000E8 è
egs; U+02A96
egsdot; U+02A98
el; U+02A99
elinters; U+023E7
ell; U+02113
els; U+02A95
elsdot; U+02A97
emacr; U+00113 ē
empty; U+02205
emptyset; U+02205
emptyv; U+02205
emsp; U+02003
emsp13; U+02004
emsp14; U+02005
eng; U+0014B ŋ
ensp; U+02002
eogon; U+00119 ę
eopf; U+1D556 𝕖
epar; U+022D5
eparsl; U+029E3
eplus; U+02A71
epsi; U+003B5 ε
epsilon; U+003B5 ε
epsiv; U+003F5 ϵ
eqcirc; U+02256
eqcolon; U+02255
eqsim; U+02242
eqslantgtr; U+02A96
eqslantless; U+02A95
equals; U+0003D =
equest; U+0225F
equiv; U+02261
equivDD; U+02A78
eqvparsl; U+029E5
erDot; U+02253
erarr; U+02971
escr; U+0212F
esdot; U+02250
esim; U+02242
eta; U+003B7 η
eth; U+000F0 ð
euml; U+000EB ë
euro; U+020AC
excl; U+00021 !
exist; U+02203
expectation; U+02130
exponentiale; U+02147
fallingdotseq; U+02252
fcy; U+00444 ф
female; U+02640
ffilig; U+0FB03
fflig; U+0FB00
ffllig; U+0FB04
ffr; U+1D523 𝔣
filig; U+0FB01
fjlig; U+00066 U+0006A fj
flat; U+0266D
fllig; U+0FB02
fltns; U+025B1
fnof; U+00192 ƒ
fopf; U+1D557 𝕗
forall; U+02200
fork; U+022D4
forkv; U+02AD9
fpartint; U+02A0D
frac12; U+000BD ½
frac13; U+02153
frac14; U+000BC ¼
frac15; U+02155
frac16; U+02159
frac18; U+0215B
frac23; U+02154
frac25; U+02156
frac34; U+000BE ¾
frac35; U+02157
frac38; U+0215C
frac45; U+02158
frac56; U+0215A
frac58; U+0215D
frac78; U+0215E
frasl; U+02044
frown; U+02322
fscr; U+1D4BB 𝒻
gE; U+02267
gEl; U+02A8C
gacute; U+001F5 ǵ
gamma; U+003B3 γ
gammad; U+003DD ϝ
gap; U+02A86
gbreve; U+0011F ğ
gcirc; U+0011D ĝ
gcy; U+00433 г
gdot; U+00121 ġ
ge; U+02265
gel; U+022DB
geq; U+02265
geqq; U+02267
geqslant; U+02A7E
ges; U+02A7E
gescc; U+02AA9
gesdot; U+02A80
gesdoto; U+02A82
gesdotol; U+02A84
gesl; U+022DB U+0FE00 ⋛︀
gesles; U+02A94
gfr; U+1D524 𝔤
gg; U+0226B
ggg; U+022D9
gimel; U+02137
gjcy; U+00453 ѓ
gl; U+02277
glE; U+02A92
gla; U+02AA5
glj; U+02AA4
gnE; U+02269
gnap; U+02A8A
gnapprox; U+02A8A
gne; U+02A88
gneq; U+02A88
gneqq; U+02269
gnsim; U+022E7
gopf; U+1D558 𝕘
grave; U+00060 `
gscr; U+0210A
gsim; U+02273
gsime; U+02A8E
gsiml; U+02A90
gt; U+0003E >
gtcc; U+02AA7
gtcir; U+02A7A
gtdot; U+022D7
gtlPar; U+02995
gtquest; U+02A7C
gtrapprox; U+02A86
gtrarr; U+02978
gtrdot; U+022D7
gtreqless; U+022DB
gtreqqless; U+02A8C
gtrless; U+02277
gtrsim; U+02273
gvertneqq; U+02269 U+0FE00 ≩︀
gvnE; U+02269 U+0FE00 ≩︀
hArr; U+021D4
hairsp; U+0200A
half; U+000BD ½
hamilt; U+0210B
hardcy; U+0044A ъ
harr; U+02194
harrcir; U+02948
harrw; U+021AD
hbar; U+0210F
hcirc; U+00125 ĥ
hearts; U+02665
heartsuit; U+02665
hellip; U+02026
hercon; U+022B9
hfr; U+1D525 𝔥
hksearow; U+02925
hkswarow; U+02926
hoarr; U+021FF
homtht; U+0223B
hookleftarrow; U+021A9
hookrightarrow; U+021AA
hopf; U+1D559 𝕙
horbar; U+02015
hscr; U+1D4BD 𝒽
hslash; U+0210F
hstrok; U+00127 ħ
hybull; U+02043
hyphen; U+02010
iacute; U+000ED í
ic; U+02063
icirc; U+000EE î
icy; U+00438 и
iecy; U+00435 е
iexcl; U+000A1 ¡
iff; U+021D4
ifr; U+1D526 𝔦
igrave; U+000EC ì
ii; U+02148
iiiint; U+02A0C
iiint; U+0222D
iinfin; U+029DC
iiota; U+02129
ijlig; U+00133 ĳ
imacr; U+0012B ī
image; U+02111
imagline; U+02110
imagpart; U+02111
imath; U+00131 ı
imof; U+022B7
imped; U+001B5 Ƶ
in; U+02208
incare; U+02105
infin; U+0221E
infintie; U+029DD
inodot; U+00131 ı
int; U+0222B
intcal; U+022BA
integers; U+02124
intercal; U+022BA
intlarhk; U+02A17
intprod; U+02A3C
iocy; U+00451 ё
iogon; U+0012F į
iopf; U+1D55A 𝕚
iota; U+003B9 ι
iprod; U+02A3C
iquest; U+000BF ¿
iscr; U+1D4BE 𝒾
isin; U+02208
isinE; U+022F9
isindot; U+022F5
isins; U+022F4
isinsv; U+022F3
isinv; U+02208
it; U+02062
itilde; U+00129 ĩ
iukcy; U+00456 і
iuml; U+000EF ï
jcirc; U+00135 ĵ
jcy; U+00439 й
jfr; U+1D527 𝔧
jmath; U+00237 ȷ
jopf; U+1D55B 𝕛
jscr; U+1D4BF 𝒿
jsercy; U+00458 ј
jukcy; U+00454 є
kappa; U+003BA κ
kappav; U+003F0 ϰ
kcedil; U+00137 ķ
kcy; U+0043A к
kfr; U+1D528 𝔨
kgreen; U+00138 ĸ
khcy; U+00445 х
kjcy; U+0045C ќ
kopf; U+1D55C 𝕜
kscr; U+1D4C0 𝓀
lAarr; U+021DA
lArr; U+021D0
lAtail; U+0291B
lBarr; U+0290E
lE; U+02266
lEg; U+02A8B
lHar; U+02962
lacute; U+0013A ĺ
laemptyv; U+029B4
lagran; U+02112
lambda; U+003BB λ
lang; U+027E8
langd; U+02991
langle; U+027E8
lap; U+02A85
laquo; U+000AB «
larr; U+02190
larrb; U+021E4
larrbfs; U+0291F
larrfs; U+0291D
larrhk; U+021A9
larrlp; U+021AB
larrpl; U+02939
larrsim; U+02973
larrtl; U+021A2
lat; U+02AAB
latail; U+02919
late; U+02AAD
lates; U+02AAD U+0FE00 ⪭︀
lbarr; U+0290C
lbbrk; U+02772
lbrace; U+0007B {
lbrack; U+0005B [
lbrke; U+0298B
lbrksld; U+0298F
lbrkslu; U+0298D
lcaron; U+0013E ľ
lcedil; U+0013C ļ
lceil; U+02308
lcub; U+0007B {
lcy; U+0043B л
ldca; U+02936
ldquo; U+0201C
ldquor; U+0201E
ldrdhar; U+02967
ldrushar; U+0294B
ldsh; U+021B2
le; U+02264
leftarrow; U+02190
leftarrowtail; U+021A2
leftharpoondown; U+021BD
leftharpoonup; U+021BC
leftleftarrows; U+021C7
leftrightarrow; U+02194
leftrightarrows; U+021C6
leftrightharpoons; U+021CB
leftrightsquigarrow; U+021AD
leftthreetimes; U+022CB
leg; U+022DA
leq; U+02264
leqq; U+02266
leqslant; U+02A7D
les; U+02A7D
lescc; U+02AA8
lesdot; U+02A7F ⩿
lesdoto; U+02A81
lesdotor; U+02A83
lesg; U+022DA U+0FE00 ⋚︀
lesges; U+02A93
lessapprox; U+02A85
lessdot; U+022D6
lesseqgtr; U+022DA
lesseqqgtr; U+02A8B
lessgtr; U+02276
lesssim; U+02272
lfisht; U+0297C
lfloor; U+0230A
lfr; U+1D529 𝔩
lg; U+02276
lgE; U+02A91
lhard; U+021BD
lharu; U+021BC
lharul; U+0296A
lhblk; U+02584
ljcy; U+00459 љ
ll; U+0226A
llarr; U+021C7
llcorner; U+0231E
llhard; U+0296B
lltri; U+025FA
lmidot; U+00140 ŀ
lmoust; U+023B0
lmoustache; U+023B0
lnE; U+02268
lnap; U+02A89
lnapprox; U+02A89
lne; U+02A87
lneq; U+02A87
lneqq; U+02268
lnsim; U+022E6
loang; U+027EC
loarr; U+021FD
lobrk; U+027E6
longleftarrow; U+027F5
longleftrightarrow; U+027F7
longmapsto; U+027FC
longrightarrow; U+027F6
looparrowleft; U+021AB
looparrowright; U+021AC
lopar; U+02985
lopf; U+1D55D 𝕝
loplus; U+02A2D
lotimes; U+02A34
lowast; U+02217
lowbar; U+0005F _
loz; U+025CA
lozenge; U+025CA
lozf; U+029EB
lpar; U+00028 (
lparlt; U+02993
lrarr; U+021C6
lrcorner; U+0231F
lrhar; U+021CB
lrhard; U+0296D
lrm; U+0200E
lrtri; U+022BF
lsaquo; U+02039
lscr; U+1D4C1 𝓁
lsh; U+021B0
lsim; U+02272
lsime; U+02A8D
lsimg; U+02A8F
lsqb; U+0005B [
lsquo; U+02018
lsquor; U+0201A
lstrok; U+00142 ł
lt; U+0003C <
ltcc; U+02AA6
ltcir; U+02A79
ltdot; U+022D6
lthree; U+022CB
ltimes; U+022C9
ltlarr; U+02976
ltquest; U+02A7B
ltrPar; U+02996
ltri; U+025C3
ltrie; U+022B4
ltrif; U+025C2
lurdshar; U+0294A
luruhar; U+02966
lvertneqq; U+02268 U+0FE00 ≨︀
lvnE; U+02268 U+0FE00 ≨︀
mDDot; U+0223A
macr; U+000AF ¯
male; U+02642
malt; U+02720
maltese; U+02720
map; U+021A6
mapsto; U+021A6
mapstodown; U+021A7
mapstoleft; U+021A4
mapstoup; U+021A5
marker; U+025AE
mcomma; U+02A29
mcy; U+0043C м
mdash; U+02014
measuredangle; U+02221
mfr; U+1D52A 𝔪
mho; U+02127
micro; U+000B5 µ
mid; U+02223
midast; U+0002A *
midcir; U+02AF0
middot; U+000B7 ·
minus; U+02212
minusb; U+0229F
minusd; U+02238
minusdu; U+02A2A
mlcp; U+02ADB
mldr; U+02026
mnplus; U+02213
models; U+022A7
mopf; U+1D55E 𝕞
mp; U+02213
mscr; U+1D4C2 𝓂
mstpos; U+0223E
mu; U+003BC μ
multimap; U+022B8
mumap; U+022B8
nGg; U+022D9 U+00338 ⋙̸
nGt; U+0226B U+020D2 ≫⃒
nGtv; U+0226B U+00338 ≫̸
nLeftarrow; U+021CD
nLeftrightarrow; U+021CE
nLl; U+022D8 U+00338 ⋘̸
nLt; U+0226A U+020D2 ≪⃒
nLtv; U+0226A U+00338 ≪̸
nRightarrow; U+021CF
nVDash; U+022AF
nVdash; U+022AE
nabla; U+02207
nacute; U+00144 ń
nang; U+02220 U+020D2 ∠⃒
nap; U+02249
napE; U+02A70 U+00338 ⩰̸
napid; U+0224B U+00338 ≋̸
napos; U+00149 ŉ
napprox; U+02249
natur; U+0266E
natural; U+0266E
naturals; U+02115
nbsp; U+000A0  
nbump; U+0224E U+00338 ≎̸
nbumpe; U+0224F U+00338 ≏̸
ncap; U+02A43
ncaron; U+00148 ň
ncedil; U+00146 ņ
ncong; U+02247
ncongdot; U+02A6D U+00338 ⩭̸
ncup; U+02A42
ncy; U+0043D н
ndash; U+02013
ne; U+02260
neArr; U+021D7
nearhk; U+02924
nearr; U+02197
nearrow; U+02197
nedot; U+02250 U+00338 ≐̸
nequiv; U+02262
nesear; U+02928
nesim; U+02242 U+00338 ≂̸
nexist; U+02204
nexists; U+02204
nfr; U+1D52B 𝔫
ngE; U+02267 U+00338 ≧̸
nge; U+02271
ngeq; U+02271
ngeqq; U+02267 U+00338 ≧̸
ngeqslant; U+02A7E U+00338 ⩾̸
nges; U+02A7E U+00338 ⩾̸
ngsim; U+02275
ngt; U+0226F
ngtr; U+0226F
nhArr; U+021CE
nharr; U+021AE
nhpar; U+02AF2
ni; U+0220B
nis; U+022FC
nisd; U+022FA
niv; U+0220B
njcy; U+0045A њ
nlArr; U+021CD
nlE; U+02266 U+00338 ≦̸
nlarr; U+0219A
nldr; U+02025
nle; U+02270
nleftarrow; U+0219A
nleftrightarrow; U+021AE
nleq; U+02270
nleqq; U+02266 U+00338 ≦̸
nleqslant; U+02A7D U+00338 ⩽̸
nles; U+02A7D U+00338 ⩽̸
nless; U+0226E
nlsim; U+02274
nlt; U+0226E
nltri; U+022EA
nltrie; U+022EC
nmid; U+02224
nopf; U+1D55F 𝕟
not; U+000AC ¬
notin; U+02209
notinE; U+022F9 U+00338 ⋹̸
notindot; U+022F5 U+00338 ⋵̸
notinva; U+02209
notinvb; U+022F7
notinvc; U+022F6
notni; U+0220C
notniva; U+0220C
notnivb; U+022FE
notnivc; U+022FD
npar; U+02226
nparallel; U+02226
nparsl; U+02AFD U+020E5 ⫽⃥
npart; U+02202 U+00338 ∂̸
npolint; U+02A14
npr; U+02280
nprcue; U+022E0
npre; U+02AAF U+00338 ⪯̸
nprec; U+02280
npreceq; U+02AAF U+00338 ⪯̸
nrArr; U+021CF
nrarr; U+0219B
nrarrc; U+02933 U+00338 ⤳̸
nrarrw; U+0219D U+00338 ↝̸
nrightarrow; U+0219B
nrtri; U+022EB
nrtrie; U+022ED
nsc; U+02281
nsccue; U+022E1
nsce; U+02AB0 U+00338 ⪰̸
nscr; U+1D4C3 𝓃
nshortmid; U+02224
nshortparallel; U+02226
nsim; U+02241
nsime; U+02244
nsimeq; U+02244
nsmid; U+02224
nspar; U+02226
nsqsube; U+022E2
nsqsupe; U+022E3
nsub; U+02284
nsubE; U+02AC5 U+00338 ⫅̸
nsube; U+02288
nsubset; U+02282 U+020D2 ⊂⃒
nsubseteq; U+02288
nsubseteqq; U+02AC5 U+00338 ⫅̸
nsucc; U+02281
nsucceq; U+02AB0 U+00338 ⪰̸
nsup; U+02285
nsupE; U+02AC6 U+00338 ⫆̸
nsupe; U+02289
nsupset; U+02283 U+020D2 ⊃⃒
nsupseteq; U+02289
nsupseteqq; U+02AC6 U+00338 ⫆̸
ntgl; U+02279
ntilde; U+000F1 ñ
ntlg; U+02278
ntriangleleft; U+022EA
ntrianglelefteq; U+022EC
ntriangleright; U+022EB
ntrianglerighteq; U+022ED
nu; U+003BD ν
num; U+00023 #
numero; U+02116
numsp; U+02007
nvDash; U+022AD
nvHarr; U+02904
nvap; U+0224D U+020D2 ≍⃒
nvdash; U+022AC
nvge; U+02265 U+020D2 ≥⃒
nvgt; U+0003E U+020D2 >⃒
nvinfin; U+029DE
nvlArr; U+02902
nvle; U+02264 U+020D2 ≤⃒
nvlt; U+0003C U+020D2 <⃒
nvltrie; U+022B4 U+020D2 ⊴⃒
nvrArr; U+02903
nvrtrie; U+022B5 U+020D2 ⊵⃒
nvsim; U+0223C U+020D2 ∼⃒
nwArr; U+021D6
nwarhk; U+02923
nwarr; U+02196
nwarrow; U+02196
nwnear; U+02927
oS; U+024C8
oacute; U+000F3 ó
oast; U+0229B
ocir; U+0229A
ocirc; U+000F4 ô
ocy; U+0043E о
odash; U+0229D
odblac; U+00151 ő
odiv; U+02A38
odot; U+02299
odsold; U+029BC
oelig; U+00153 œ
ofcir; U+029BF ⦿
ofr; U+1D52C 𝔬
ogon; U+002DB ˛
ograve; U+000F2 ò
ogt; U+029C1
ohbar; U+029B5
ohm; U+003A9 Ω
oint; U+0222E
olarr; U+021BA
olcir; U+029BE
olcross; U+029BB
oline; U+0203E
olt; U+029C0
omacr; U+0014D ō
omega; U+003C9 ω
omicron; U+003BF ο
omid; U+029B6
ominus; U+02296
oopf; U+1D560 𝕠
opar; U+029B7
operp; U+029B9
oplus; U+02295
or; U+02228
orarr; U+021BB
ord; U+02A5D
order; U+02134
orderof; U+02134
ordf; U+000AA ª
ordm; U+000BA º
origof; U+022B6
oror; U+02A56
orslope; U+02A57
orv; U+02A5B
oscr; U+02134
oslash; U+000F8 ø
osol; U+02298
otilde; U+000F5 õ
otimes; U+02297
otimesas; U+02A36
ouml; U+000F6 ö
ovbar; U+0233D
par; U+02225
para; U+000B6
parallel; U+02225
parsim; U+02AF3
parsl; U+02AFD
part; U+02202
pcy; U+0043F п
percnt; U+00025 %
period; U+0002E .
permil; U+02030
perp; U+022A5
pertenk; U+02031
pfr; U+1D52D 𝔭
phi; U+003C6 φ
phiv; U+003D5 ϕ
phmmat; U+02133
phone; U+0260E
pi; U+003C0 π
pitchfork; U+022D4
piv; U+003D6 ϖ
planck; U+0210F
planckh; U+0210E
plankv; U+0210F
plus; U+0002B +
plusacir; U+02A23
plusb; U+0229E
pluscir; U+02A22
plusdo; U+02214
plusdu; U+02A25
pluse; U+02A72
plusmn; U+000B1 ±
plussim; U+02A26
plustwo; U+02A27
pm; U+000B1 ±
pointint; U+02A15
popf; U+1D561 𝕡
pound; U+000A3 £
pr; U+0227A
prE; U+02AB3
prap; U+02AB7
prcue; U+0227C
pre; U+02AAF
prec; U+0227A
precapprox; U+02AB7
preccurlyeq; U+0227C
preceq; U+02AAF
precnapprox; U+02AB9
precneqq; U+02AB5
precnsim; U+022E8
precsim; U+0227E
prime; U+02032
primes; U+02119
prnE; U+02AB5
prnap; U+02AB9
prnsim; U+022E8
prod; U+0220F
profalar; U+0232E
profline; U+02312
profsurf; U+02313
prop; U+0221D
propto; U+0221D
prsim; U+0227E
prurel; U+022B0
pscr; U+1D4C5 𝓅
psi; U+003C8 ψ
puncsp; U+02008
qfr; U+1D52E 𝔮
qint; U+02A0C
qopf; U+1D562 𝕢
qprime; U+02057
qscr; U+1D4C6 𝓆
quaternions; U+0210D
quatint; U+02A16
quest; U+0003F ?
questeq; U+0225F
quot; U+00022 "
rAarr; U+021DB
rArr; U+021D2
rAtail; U+0291C
rBarr; U+0290F
rHar; U+02964
race; U+0223D U+00331 ∽̱
racute; U+00155 ŕ
radic; U+0221A
raemptyv; U+029B3
rang; U+027E9
rangd; U+02992
range; U+029A5
rangle; U+027E9
raquo; U+000BB »
rarr; U+02192
rarrap; U+02975
rarrb; U+021E5
rarrbfs; U+02920
rarrc; U+02933
rarrfs; U+0291E
rarrhk; U+021AA
rarrlp; U+021AC
rarrpl; U+02945
rarrsim; U+02974
rarrtl; U+021A3
rarrw; U+0219D
ratail; U+0291A
ratio; U+02236
rationals; U+0211A
rbarr; U+0290D
rbbrk; U+02773
rbrace; U+0007D }
rbrack; U+0005D ]
rbrke; U+0298C
rbrksld; U+0298E
rbrkslu; U+02990
rcaron; U+00159 ř
rcedil; U+00157 ŗ
rceil; U+02309
rcub; U+0007D }
rcy; U+00440 р
rdca; U+02937
rdldhar; U+02969
rdquo; U+0201D
rdquor; U+0201D
rdsh; U+021B3
real; U+0211C
realine; U+0211B
realpart; U+0211C
reals; U+0211D
rect; U+025AD
reg; U+000AE ®
rfisht; U+0297D
rfloor; U+0230B
rfr; U+1D52F 𝔯
rhard; U+021C1
rharu; U+021C0
rharul; U+0296C
rho; U+003C1 ρ
rhov; U+003F1 ϱ
rightarrow; U+02192
rightarrowtail; U+021A3
rightharpoondown; U+021C1
rightharpoonup; U+021C0
rightleftarrows; U+021C4
rightleftharpoons; U+021CC
rightrightarrows; U+021C9
rightsquigarrow; U+0219D
rightthreetimes; U+022CC
ring; U+002DA ˚
risingdotseq; U+02253
rlarr; U+021C4
rlhar; U+021CC
rlm; U+0200F
rmoust; U+023B1
rmoustache; U+023B1
rnmid; U+02AEE
roang; U+027ED
roarr; U+021FE
robrk; U+027E7
ropar; U+02986
ropf; U+1D563 𝕣
roplus; U+02A2E
rotimes; U+02A35
rpar; U+00029 )
rpargt; U+02994
rppolint; U+02A12
rrarr; U+021C9
rsaquo; U+0203A
rscr; U+1D4C7 𝓇
rsh; U+021B1
rsqb; U+0005D ]
rsquo; U+02019
rsquor; U+02019
rthree; U+022CC
rtimes; U+022CA
rtri; U+025B9
rtrie; U+022B5
rtrif; U+025B8
rtriltri; U+029CE
ruluhar; U+02968
rx; U+0211E
sacute; U+0015B ś
sbquo; U+0201A
sc; U+0227B
scE; U+02AB4
scap; U+02AB8
scaron; U+00161 š
sccue; U+0227D
sce; U+02AB0
scedil; U+0015F ş
scirc; U+0015D ŝ
scnE; U+02AB6
scnap; U+02ABA
scnsim; U+022E9
scpolint; U+02A13
scsim; U+0227F
scy; U+00441 с
sdot; U+022C5
sdotb; U+022A1
sdote; U+02A66
seArr; U+021D8
searhk; U+02925
searr; U+02198
searrow; U+02198
sect; U+000A7 §
semi; U+0003B ;
seswar; U+02929
setminus; U+02216
setmn; U+02216
sext; U+02736
sfr; U+1D530 𝔰
sfrown; U+02322
sharp; U+0266F
shchcy; U+00449 щ
shcy; U+00448 ш
shortmid; U+02223
shortparallel; U+02225
shy; U+000AD ­
sigma; U+003C3 σ
sigmaf; U+003C2 ς
sigmav; U+003C2 ς
sim; U+0223C
simdot; U+02A6A
sime; U+02243
simeq; U+02243
simg; U+02A9E
simgE; U+02AA0
siml; U+02A9D
simlE; U+02A9F
simne; U+02246
simplus; U+02A24
simrarr; U+02972
slarr; U+02190
smallsetminus; U+02216
smashp; U+02A33
smeparsl; U+029E4
smid; U+02223
smile; U+02323
smt; U+02AAA
smte; U+02AAC
smtes; U+02AAC U+0FE00 ⪬︀
softcy; U+0044C ь
sol; U+0002F /
solb; U+029C4
solbar; U+0233F
sopf; U+1D564 𝕤
spades; U+02660
spadesuit; U+02660
spar; U+02225
sqcap; U+02293
sqcaps; U+02293 U+0FE00 ⊓︀
sqcup; U+02294
sqcups; U+02294 U+0FE00 ⊔︀
sqsub; U+0228F
sqsube; U+02291
sqsubset; U+0228F
sqsubseteq; U+02291
sqsup; U+02290
sqsupe; U+02292
sqsupset; U+02290
sqsupseteq; U+02292
squ; U+025A1
square; U+025A1
squarf; U+025AA
squf; U+025AA
srarr; U+02192
sscr; U+1D4C8 𝓈
ssetmn; U+02216
ssmile; U+02323
sstarf; U+022C6
star; U+02606
starf; U+02605
straightepsilon; U+003F5 ϵ
straightphi; U+003D5 ϕ
strns; U+000AF ¯
sub; U+02282
subE; U+02AC5
subdot; U+02ABD
sube; U+02286
subedot; U+02AC3
submult; U+02AC1
subnE; U+02ACB
subne; U+0228A
subplus; U+02ABF ⪿
subrarr; U+02979
subset; U+02282
subseteq; U+02286
subseteqq; U+02AC5
subsetneq; U+0228A
subsetneqq; U+02ACB
subsim; U+02AC7
subsub; U+02AD5
subsup; U+02AD3
succ; U+0227B
succapprox; U+02AB8
succcurlyeq; U+0227D
succeq; U+02AB0
succnapprox; U+02ABA
succneqq; U+02AB6
succnsim; U+022E9
succsim; U+0227F
sum; U+02211
sung; U+0266A
sup; U+02283
sup1; U+000B9 ¹
sup2; U+000B2 ²
sup3; U+000B3 ³
supE; U+02AC6
supdot; U+02ABE
supdsub; U+02AD8
supe; U+02287
supedot; U+02AC4
suphsol; U+027C9
suphsub; U+02AD7
suplarr; U+0297B
supmult; U+02AC2
supnE; U+02ACC
supne; U+0228B
supplus; U+02AC0
supset; U+02283
supseteq; U+02287
supseteqq; U+02AC6
supsetneq; U+0228B
supsetneqq; U+02ACC
supsim; U+02AC8
supsub; U+02AD4
supsup; U+02AD6
swArr; U+021D9
swarhk; U+02926
swarr; U+02199
swarrow; U+02199
swnwar; U+0292A
szlig; U+000DF ß
target; U+02316
tau; U+003C4 τ
tbrk; U+023B4
tcaron; U+00165 ť
tcedil; U+00163 ţ
tcy; U+00442 т
tdot; U+020DB ◌⃛
telrec; U+02315
tfr; U+1D531 𝔱
there4; U+02234
therefore; U+02234
theta; U+003B8 θ
thetasym; U+003D1 ϑ
thetav; U+003D1 ϑ
thickapprox; U+02248
thicksim; U+0223C
thinsp; U+02009
thkap; U+02248
thksim; U+0223C
thorn; U+000FE þ
tilde; U+002DC ˜
times; U+000D7 ×
timesb; U+022A0
timesbar; U+02A31
timesd; U+02A30
tint; U+0222D
toea; U+02928
top; U+022A4
topbot; U+02336
topcir; U+02AF1
topf; U+1D565 𝕥
topfork; U+02ADA
tosa; U+02929
tprime; U+02034
trade; U+02122
triangle; U+025B5
triangledown; U+025BF
triangleleft; U+025C3
trianglelefteq; U+022B4
triangleq; U+0225C
triangleright; U+025B9
trianglerighteq; U+022B5
tridot; U+025EC
trie; U+0225C
triminus; U+02A3A
triplus; U+02A39
trisb; U+029CD
tritime; U+02A3B
trpezium; U+023E2
tscr; U+1D4C9 𝓉
tscy; U+00446 ц
tshcy; U+0045B ћ
tstrok; U+00167 ŧ
twixt; U+0226C
twoheadleftarrow; U+0219E
twoheadrightarrow; U+021A0
uArr; U+021D1
uHar; U+02963
uacute; U+000FA ú
uarr; U+02191
ubrcy; U+0045E ў
ubreve; U+0016D ŭ
ucirc; U+000FB û
ucy; U+00443 у
udarr; U+021C5
udblac; U+00171 ű
udhar; U+0296E
ufisht; U+0297E
ufr; U+1D532 𝔲
ugrave; U+000F9 ù
uharl; U+021BF
uharr; U+021BE
uhblk; U+02580
ulcorn; U+0231C
ulcorner; U+0231C
ulcrop; U+0230F
ultri; U+025F8
umacr; U+0016B ū
uml; U+000A8 ¨
uogon; U+00173 ų
uopf; U+1D566 𝕦
uparrow; U+02191
updownarrow; U+02195
upharpoonleft; U+021BF
upharpoonright; U+021BE
uplus; U+0228E
upsi; U+003C5 υ
upsih; U+003D2 ϒ
upsilon; U+003C5 υ
upuparrows; U+021C8
urcorn; U+0231D
urcorner; U+0231D
urcrop; U+0230E
uring; U+0016F ů
urtri; U+025F9
uscr; U+1D4CA 𝓊
utdot; U+022F0
utilde; U+00169 ũ
utri; U+025B5
utrif; U+025B4
uuarr; U+021C8
uuml; U+000FC ü
uwangle; U+029A7
vArr; U+021D5
vBar; U+02AE8
vBarv; U+02AE9
vDash; U+022A8
vangrt; U+0299C
varepsilon; U+003F5 ϵ
varkappa; U+003F0 ϰ
varnothing; U+02205
varphi; U+003D5 ϕ
varpi; U+003D6 ϖ
varpropto; U+0221D
varr; U+02195
varrho; U+003F1 ϱ
varsigma; U+003C2 ς
varsubsetneq; U+0228A U+0FE00 ⊊︀
varsubsetneqq; U+02ACB U+0FE00 ⫋︀
varsupsetneq; U+0228B U+0FE00 ⊋︀
varsupsetneqq; U+02ACC U+0FE00 ⫌︀
vartheta; U+003D1 ϑ
vartriangleleft; U+022B2
vartriangleright; U+022B3
vcy; U+00432 в
vdash; U+022A2
vee; U+02228
veebar; U+022BB
veeeq; U+0225A
vellip; U+022EE
verbar; U+0007C |
vert; U+0007C |
vfr; U+1D533 𝔳
vltri; U+022B2
vnsub; U+02282 U+020D2 ⊂⃒
vnsup; U+02283 U+020D2 ⊃⃒
vopf; U+1D567 𝕧
vprop; U+0221D
vrtri; U+022B3
vscr; U+1D4CB 𝓋
vsubnE; U+02ACB U+0FE00 ⫋︀
vsubne; U+0228A U+0FE00 ⊊︀
vsupnE; U+02ACC U+0FE00 ⫌︀
vsupne; U+0228B U+0FE00 ⊋︀
vzigzag; U+0299A
wcirc; U+00175 ŵ
wedbar; U+02A5F
wedge; U+02227
wedgeq; U+02259
weierp; U+02118
wfr; U+1D534 𝔴
wopf; U+1D568 𝕨
wp; U+02118
wr; U+02240
wreath; U+02240
wscr; U+1D4CC 𝓌
xcap; U+022C2
xcirc; U+025EF
xcup; U+022C3
xdtri; U+025BD
xfr; U+1D535 𝔵
xhArr; U+027FA
xharr; U+027F7
xi; U+003BE ξ
xlArr; U+027F8
xlarr; U+027F5
xmap; U+027FC
xnis; U+022FB
xodot; U+02A00
xopf; U+1D569 𝕩
xoplus; U+02A01
xotime; U+02A02
xrArr; U+027F9
xrarr; U+027F6
xscr; U+1D4CD 𝓍
xsqcup; U+02A06
xuplus; U+02A04
xutri; U+025B3
xvee; U+022C1
xwedge; U+022C0
yacute; U+000FD ý
yacy; U+0044F я
ycirc; U+00177 ŷ
ycy; U+0044B ы
yen; U+000A5 ¥
yfr; U+1D536 𝔶
yicy; U+00457 ї
yopf; U+1D56A 𝕪
yscr; U+1D4CE 𝓎
yucy; U+0044E ю
yuml; U+000FF ÿ
zacute; U+0017A ź
zcaron; U+0017E ž
zcy; U+00437 з
zdot; U+0017C ż
zeetrf; U+02128
zeta; U+003B6 ζ
zfr; U+1D537 𝔷
zhcy; U+00436 ж
zigrarr; U+021DD
zopf; U+1D56B 𝕫
zscr; U+1D4CF 𝓏
zwj; U+0200D
zwnj; U+0200C
AElig U+000C6 Æ
AMP U+00026 &
Aacute U+000C1 Á
Acirc U+000C2 Â
Agrave U+000C0 À
Aring U+000C5 Å
Atilde U+000C3 Ã
Auml U+000C4 Ä
COPY U+000A9 ©
Ccedil U+000C7 Ç
ETH U+000D0 Ð
Eacute U+000C9 É
Ecirc U+000CA Ê
Egrave U+000C8 È
Euml U+000CB Ë
GT U+0003E >
Iacute U+000CD Í
Icirc U+000CE Î
Igrave U+000CC Ì
Iuml U+000CF Ï
LT U+0003C <
Ntilde U+000D1 Ñ
Oacute U+000D3 Ó
Ocirc U+000D4 Ô
Ograve U+000D2 Ò
Oslash U+000D8 Ø
Otilde U+000D5 Õ
Ouml U+000D6 Ö
QUOT U+00022 "
REG U+000AE ®
THORN U+000DE Þ
Uacute U+000DA Ú
Ucirc U+000DB Û
Ugrave U+000D9 Ù
Uuml U+000DC Ü
Yacute U+000DD Ý
aacute U+000E1 á
acirc U+000E2 â
acute U+000B4 ´
aelig U+000E6 æ
agrave U+000E0 à
amp U+00026 &
aring U+000E5 å
atilde U+000E3 ã
auml U+000E4 ä
brvbar U+000A6 ¦
ccedil U+000E7 ç
cedil U+000B8 ¸
cent U+000A2 ¢
copy U+000A9 ©
curren U+000A4 ¤
deg U+000B0 °
divide U+000F7 ÷
eacute U+000E9 é
ecirc U+000EA ê
egrave U+000E8 è
eth U+000F0 ð
euml U+000EB ë
frac12 U+000BD ½
frac14 U+000BC ¼
frac34 U+000BE ¾
gt U+0003E >
iacute U+000ED í
icirc U+000EE î
iexcl U+000A1 ¡
igrave U+000EC ì
iquest U+000BF ¿
iuml U+000EF ï
laquo U+000AB «
lt U+0003C <
macr U+000AF ¯
micro U+000B5 µ
middot U+000B7 ·
nbsp U+000A0  
not U+000AC ¬
ntilde U+000F1 ñ
oacute U+000F3 ó
ocirc U+000F4 ô
ograve U+000F2 ò
ordf U+000AA ª
ordm U+000BA º
oslash U+000F8 ø
otilde U+000F5 õ
ouml U+000F6 ö
para U+000B6
plusmn U+000B1 ±
pound U+000A3 £
quot U+00022 "
raquo U+000BB »
reg U+000AE ®
sect U+000A7 §
shy U+000AD ­
sup1 U+000B9 ¹
sup2 U+000B2 ²
sup3 U+000B3 ³
szlig U+000DF ß
thorn U+000FE þ
times U+000D7 ×
uacute U+000FA ú
ucirc U+000FB û
ugrave U+000F9 ù
uml U+000A8 ¨
uuml U+000FC ü
yacute U+000FD ý
yen U+000A5 ¥
yuml U+000FF ÿ
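The table above maps each reference name to one or two code points. Names listed without a trailing semicolon are the legacy forms, which are also recognised when unterminated. The following is a minimal Python sketch of how such a table could be consulted. It assumes a small hand-copied excerpt (`NAMED_REFS`) and a hypothetical helper `match_named_reference` that picks the longest matching name; it is an illustration only, not the lookup used by Tokenizer.java.

```python
# Excerpt copied from the table above: name -> tuple of code points.
# Names ending in ";" are the terminated forms; names without it are the
# legacy forms listed at the end of the table.
NAMED_REFS = {
    "par;": (0x2225,),
    "para;": (0xB6,),
    "para": (0xB6,),
    "parallel;": (0x2225,),
    "quot;": (0x22,),
    "quot": (0x22,),
    "race;": (0x223D, 0x331),  # one name can expand to two code points
}


def match_named_reference(text, table=NAMED_REFS):
    """Return (name, replacement) for the longest name `text` starts with.

    `text` is the input immediately after the "&". Returns None when no
    name in the table matches. (Hypothetical helper for illustration.)
    """
    best = None
    for name in table:
        if text.startswith(name) and (best is None or len(name) > len(best)):
            best = name
    if best is None:
        return None
    return best, "".join(chr(cp) for cp in table[best])
```

A linear scan keeps the sketch short; a real tokenizer would more likely use a sorted array or a trie so that the longest match can be found while consuming input one character at a time.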
"), set the escape flag to false. + + In any case, emit the input character as a character token. Stay + in the data state. + + EOF + Emit an end-of-file token. + + Anything else + Emit the input character as a character token. Stay in the data + state. + + 8.2.4.2 Character reference data state + + (This cannot happen if the content model flag is set to the CDATA + state.) + + Attempt to consume a character reference, with no additional allowed + character. + + If nothing is returned, emit a U+0026 AMPERSAND character token. + + Otherwise, emit the character token that was returned. + + Finally, switch to the data state. + + 8.2.4.3 Tag open state + + The behavior of this state depends on the content model flag. + + If the content model flag is set to the RCDATA or CDATA states + Consume the next input character. If it is a U+002F SOLIDUS (/) + character, switch to the close tag open state. Otherwise, emit a + U+003C LESS-THAN SIGN character token and reconsume the current + input character in the data state. + + If the content model flag is set to the PCDATA state + Consume the next input character: + + U+0021 EXCLAMATION MARK (!) + Switch to the markup declaration open state. + + U+002F SOLIDUS (/) + Switch to the close tag open state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL + LETTER Z + Create a new start tag token, set its tag name to the + lowercase version of the input character (add 0x0020 to + the character's code point), then switch to the tag name + state. (Don't emit the token yet; further details will be + filled in before it is emitted.) + + U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z + Create a new start tag token, set its tag name to the + input character, then switch to the tag name state. (Don't + emit the token yet; further details will be filled in + before it is emitted.) + + U+003E GREATER-THAN SIGN (>) + Parse error. 
Emit a U+003C LESS-THAN SIGN character token + and a U+003E GREATER-THAN SIGN character token. Switch to + the data state. + + U+003F QUESTION MARK (?) + Parse error. Switch to the bogus comment state. + + Anything else + Parse error. Emit a U+003C LESS-THAN SIGN character token + and reconsume the current input character in the data + state. + + 8.2.4.4 Close tag open state + + If the content model flag is set to the RCDATA or CDATA states but no + start tag token has ever been emitted by this instance of the tokeniser + (fragment case), or, if the content model flag is set to the RCDATA or + CDATA states and the next few characters do not match the tag name of + the last start tag token emitted (compared in an ASCII case-insensitive + manner), or if they do but they are not immediately followed by one of + the following characters: + * U+0009 CHARACTER TABULATION + * U+000A LINE FEED (LF) + * U+000C FORM FEED (FF) + * U+0020 SPACE + * U+003E GREATER-THAN SIGN (>) + * U+002F SOLIDUS (/) + * EOF + + ...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS + character token, and switch to the data state to process the next input + character. + + Otherwise, if the content model flag is set to the PCDATA state, or if + the next few characters do match that tag name, consume the next input + character: + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Create a new end tag token, set its tag name to the lowercase + version of the input character (add 0x0020 to the character's + code point), then switch to the tag name state. (Don't emit the + token yet; further details will be filled in before it is + emitted.) + + U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z + Create a new end tag token, set its tag name to the input + character, then switch to the tag name state. (Don't emit the + token yet; further details will be filled in before it is + emitted.) + + U+003E GREATER-THAN SIGN (>) + Parse error. 
Switch to the data state. + + EOF + Parse error. Emit a U+003C LESS-THAN SIGN character token and a + U+002F SOLIDUS character token. Reconsume the EOF character in + the data state. + + Anything else + Parse error. Switch to the bogus comment state. + + 8.2.4.5 Tag name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Append the lowercase version of the current input character (add + 0x0020 to the character's code point) to the current tag token's + tag name. Stay in the tag name state. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Append the current input character to the current tag token's + tag name. Stay in the tag name state. + + 8.2.4.6 Before attribute name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Start a new attribute in the current tag token. Set that + attribute's name to the lowercase version of the current input + character (add 0x0020 to the character's code point), and its + value to the empty string. Switch to the attribute name state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + U+003D EQUALS SIGN (=) + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. 
Reconsume the EOF + character in the data state. + + Anything else + Start a new attribute in the current tag token. Set that + attribute's name to the current input character, and its value + to the empty string. Switch to the attribute name state. + + 8.2.4.7 Attribute name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the after attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003D EQUALS SIGN (=) + Switch to the before attribute value state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Append the lowercase version of the current input character (add + 0x0020 to the character's code point) to the current attribute's + name. Stay in the attribute name state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Append the current input character to the current attribute's + name. Stay in the attribute name state. + + When the user agent leaves the attribute name state (and before + emitting the tag token, if appropriate), the complete attribute's name + must be compared to the other attributes on the same token; if there is + already an attribute on the token with the exact same name, then this + is a parse error and the new attribute must be dropped, along with the + value that gets associated with it (if any). + + 8.2.4.8 After attribute name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. 
+ + U+003D EQUALS SIGN (=) + Switch to the before attribute value state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Start a new attribute in the current tag token. Set that + attribute's name to the lowercase version of the current input + character (add 0x0020 to the character's code point), and its + value to the empty string. Switch to the attribute name state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Start a new attribute in the current tag token. Set that + attribute's name to the current input character, and its value + to the empty string. Switch to the attribute name state. + + 8.2.4.9 Before attribute value state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before attribute value state. + + U+0022 QUOTATION MARK (") + Switch to the attribute value (double-quoted) state. + + U+0026 AMPERSAND (&) + Switch to the attribute value (unquoted) state and reconsume + this input character. + + U+0027 APOSTROPHE (') + Switch to the attribute value (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Emit the current tag token. Switch to the data + state. + + U+003D EQUALS SIGN (=) + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Switch to the attribute value (unquoted) state. 
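The duplicate-attribute rule stated at the end of the attribute name state (8.2.4.7) can be sketched as follows. This is a minimal Python illustration that assumes a tag token's attributes are held in a plain dict; `add_attribute` is a hypothetical helper, not a function from the parser sources.

```python
def add_attribute(attributes, name, value):
    """Add (name, value) to a token's attributes unless the name is taken.

    Returns True if the attribute was kept, and False if it was dropped
    as a duplicate, which the spec text treats as a parse error. The name
    is expected to be lowercased already by the attribute name state.
    """
    if name in attributes:
        return False  # first occurrence wins; the new value is discarded
    attributes[name] = value
    return True
```

For example, adding `class="a"` and then `class="b"` to the same token leaves only `class="a"` on it.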
+ + 8.2.4.10 Attribute value (double-quoted) state + + Consume the next input character: + + U+0022 QUOTATION MARK (") + Switch to the after attribute value (quoted) state. + + U+0026 AMPERSAND (&) + Switch to the character reference in attribute value state, with + the additional allowed character being U+0022 QUOTATION MARK + ("). + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Stay in the attribute value (double-quoted) state. + + 8.2.4.11 Attribute value (single-quoted) state + + Consume the next input character: + + U+0027 APOSTROPHE (') + Switch to the after attribute value (quoted) state. + + U+0026 AMPERSAND (&) + Switch to the character reference in attribute value state, with + the additional allowed character being U+0027 APOSTROPHE ('). + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Stay in the attribute value (single-quoted) state. + + 8.2.4.12 Attribute value (unquoted) state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before attribute name state. + + U+0026 AMPERSAND (&) + Switch to the character reference in attribute value state, with + no additional allowed character. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + U+0022 QUOTATION MARK (") + U+0027 APOSTROPHE (') + U+003D EQUALS SIGN (=) + Parse error. Treat it as per the "anything else" entry below. + + EOF + Parse error. Emit the current tag token. Reconsume the character + in the data state. + + Anything else + Append the current input character to the current attribute's + value. Stay in the attribute value (unquoted) state. 
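The before attribute value state and the three attribute value states above can be condensed into one small routine. The Python sketch below is written under stated assumptions: character references are not expanded ("&" is kept literally), the terminating ">" or whitespace of an unquoted value is left unconsumed for the caller, and the error labels are made-up strings rather than names from the spec or from Tokenizer.java.

```python
WHITESPACE = "\t\n\f "  # U+0009, U+000A, U+000C, U+0020


def read_attribute_value(text, pos=0):
    """Read one attribute value starting just after the "=".

    Returns (value, next_pos, parse_errors). Hypothetical helper.
    """
    errors = []
    n = len(text)
    # Before attribute value state: skip whitespace.
    while pos < n and text[pos] in WHITESPACE:
        pos += 1
    if pos == n:
        errors.append("eof-before-value")
        return "", pos, errors
    ch = text[pos]
    if ch in "\"'":
        # Double-quoted / single-quoted states: read up to the same quote.
        end = text.find(ch, pos + 1)
        if end == -1:
            errors.append("eof-in-quoted-value")
            return text[pos + 1:], n, errors
        return text[pos + 1:end], end + 1, errors
    if ch == ">":
        # ">" straight after "=" is a parse error; the tag ends here.
        errors.append("missing-value")
        return "", pos, errors
    # Unquoted state: the value runs until whitespace or ">".
    start = pos
    while pos < n and text[pos] not in WHITESPACE and text[pos] != ">":
        if text[pos] in "\"'=":
            # Parse error, but the character is still part of the value.
            errors.append("unexpected-character-in-unquoted-value")
        pos += 1
    if pos == n:
        errors.append("eof-in-unquoted-value")
    return text[start:pos], pos, errors
```

Using `str.find` for the quoted states is a shortcut that only works because this sketch ignores "&"; a tokenizer that expands character references has to step through the value character by character, as the states above do.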
+ + 8.2.4.13 Character reference in attribute value state + + Attempt to consume a character reference. + + If nothing is returned, append a U+0026 AMPERSAND character to the + current attribute's value. + + Otherwise, append the returned character token to the current + attribute's value. + + Finally, switch back to the attribute value state that you were in when + you were switched into this state. + + 8.2.4.14 After attribute value (quoted) state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before attribute name state. + + U+002F SOLIDUS (/) + Switch to the self-closing start tag state. + + U+003E GREATER-THAN SIGN (>) + Emit the current tag token. Switch to the data state. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Parse error. Reconsume the character in the before attribute + name state. + + 8.2.4.15 Self-closing start tag state + + Consume the next input character: + + U+003E GREATER-THAN SIGN (>) + Set the self-closing flag of the current tag token. Emit the + current tag token. Switch to the data state. + + EOF + Parse error. Emit the current tag token. Reconsume the EOF + character in the data state. + + Anything else + Parse error. Reconsume the character in the before attribute + name state. + + 8.2.4.16 Bogus comment state + + (This can only happen if the content model flag is set to the PCDATA + state.) + + Consume every character up to and including the first U+003E + GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever + comes first. Emit a comment token whose data is the concatenation of + all the characters starting from and including the character that + caused the state machine to switch into the bogus comment state, up to + and including the character immediately before the last consumed + character (i.e.
up to the character just before the U+003E or EOF + character). (If the comment was started by the end of the file (EOF), + the token is empty.) + + Switch to the data state. + + If the end of the file was reached, reconsume the EOF character. + + 8.2.4.17 Markup declaration open state + + (This can only happen if the content model flag is set to the PCDATA + state.) + + If the next two characters are both U+002D HYPHEN-MINUS (-) characters, + consume those two characters, create a comment token whose data is the + empty string, and switch to the comment start state. + + Otherwise, if the next seven characters are an ASCII case-insensitive + match for the word "DOCTYPE", then consume those characters and switch + to the DOCTYPE state. + + Otherwise, if the insertion mode is "in foreign content" and the + current node is not an element in the HTML namespace and the next seven + characters are an ASCII case-sensitive match for the string "[CDATA[" + (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET + character before and after), then consume those characters and switch + to the CDATA section state (which is unrelated to the content model + flag's CDATA state). + + Otherwise, this is a parse error. Switch to the bogus comment state. + The next character that is consumed, if any, is the first character + that will be in the comment. + + 8.2.4.18 Comment start state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment start dash state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Emit the comment token. Switch to the data state. + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append the input character to the comment token's data. Switch + to the comment state. + + 8.2.4.19 Comment start dash state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment end state + + U+003E GREATER-THAN SIGN (>) + Parse error. 
Emit the comment token. Switch to the data state. + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append a U+002D HYPHEN-MINUS (-) character and the input + character to the comment token's data. Switch to the comment + state. + + 8.2.4.20 Comment state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment end dash state + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append the input character to the comment token's data. Stay in + the comment state. + + 8.2.4.21 Comment end dash state + + Consume the next input character: + + U+002D HYPHEN-MINUS (-) + Switch to the comment end state + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Append a U+002D HYPHEN-MINUS (-) character and the input + character to the comment token's data. Switch to the comment + state. + + 8.2.4.22 Comment end state + + Consume the next input character: + + U+003E GREATER-THAN SIGN (>) + Emit the comment token. Switch to the data state. + + U+002D HYPHEN-MINUS (-) + Parse error. Append a U+002D HYPHEN-MINUS (-) character to the + comment token's data. Stay in the comment end state. + + EOF + Parse error. Emit the comment token. Reconsume the EOF character + in the data state. + + Anything else + Parse error. Append two U+002D HYPHEN-MINUS (-) characters and + the input character to the comment token's data. Switch to the + comment state. + + 8.2.4.23 DOCTYPE state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the before DOCTYPE name state. + + Anything else + Parse error. Reconsume the current character in the before + DOCTYPE name state. 
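The five comment states (8.2.4.18 through 8.2.4.22) above form a small state machine that can be written out directly. The Python sketch below is one possible reading of them, assuming the input position is just after "<!--"; the state labels and the function name `read_comment` are invented for the illustration, and parse errors are only counted rather than reported.

```python
def read_comment(text, pos=0):
    """Run the comment states on `text`, starting just after "<!--".

    Returns (data, next_pos, parse_error_count). Hypothetical helper.
    """
    data = []
    errors = 0
    state = "start"
    while True:
        if pos >= len(text):
            # EOF is a parse error in every comment state; the comment is
            # emitted with whatever data was collected so far.
            return "".join(data), pos, errors + 1
        ch = text[pos]
        pos += 1
        if state == "start":            # comment start state
            if ch == "-":
                state = "start_dash"
            elif ch == ">":
                return "".join(data), pos, errors + 1
            else:
                data.append(ch)
                state = "comment"
        elif state == "start_dash":     # comment start dash state
            if ch == "-":
                state = "end"
            elif ch == ">":
                return "".join(data), pos, errors + 1
            else:
                data.append("-" + ch)
                state = "comment"
        elif state == "comment":        # comment state
            if ch == "-":
                state = "end_dash"
            else:
                data.append(ch)
        elif state == "end_dash":       # comment end dash state
            if ch == "-":
                state = "end"
            else:
                data.append("-" + ch)
                state = "comment"
        else:                           # comment end state
            if ch == ">":
                return "".join(data), pos, errors
            errors += 1
            if ch == "-":
                data.append("-")        # stay in the comment end state
            else:
                data.append("--" + ch)
                state = "comment"
```

Note how "--" inside a comment is tolerated: it is recorded as a parse error and the two hyphens are put back into the comment data.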
+ + 8.2.4.24 Before DOCTYPE name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before DOCTYPE name state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Create a new DOCTYPE token. Set its force-quirks + flag to on. Emit the token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Create a new DOCTYPE token. Set the token's name to the + lowercase version of the input character (add 0x0020 to the + character's code point). Switch to the DOCTYPE name state. + + EOF + Parse error. Create a new DOCTYPE token. Set its force-quirks + flag to on. Emit the token. Reconsume the EOF character in the + data state. + + Anything else + Create a new DOCTYPE token. Set the token's name to the current + input character. Switch to the DOCTYPE name state. + + 8.2.4.25 DOCTYPE name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Switch to the after DOCTYPE name state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z + Append the lowercase version of the input character (add 0x0020 + to the character's code point) to the current DOCTYPE token's + name. Stay in the DOCTYPE name state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's name. Stay in the DOCTYPE name state. + + 8.2.4.26 After DOCTYPE name state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after DOCTYPE name state. 
+ + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + If the six characters starting from the current input character + are an ASCII case-insensitive match for the word "PUBLIC", then + consume those characters and switch to the before DOCTYPE public + identifier state. + + Otherwise, if the six characters starting from the current input + character are an ASCII case-insensitive match for the word + "SYSTEM", then consume those characters and switch to the before + DOCTYPE system identifier state. + + Otherwise, this is a parse error. Set the DOCTYPE token's + force-quirks flag to on. Switch to the bogus DOCTYPE state. + + 8.2.4.27 Before DOCTYPE public identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before DOCTYPE public identifier state. + + U+0022 QUOTATION MARK (") + Set the DOCTYPE token's public identifier to the empty string + (not missing), then switch to the DOCTYPE public identifier + (double-quoted) state. + + U+0027 APOSTROPHE (') + Set the DOCTYPE token's public identifier to the empty string + (not missing), then switch to the DOCTYPE public identifier + (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Switch to the bogus DOCTYPE state.
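The "ASCII case-insensitive match" used by the after DOCTYPE name state (8.2.4.26) for "PUBLIC" and "SYSTEM" relies on the same lowercasing rule the tag and attribute states spell out: add 0x0020 to the code point of A through Z and leave every other character alone. A minimal Python sketch, with `ascii_lower` and `doctype_keyword` as hypothetical helper names:

```python
def ascii_lower(s):
    """Lowercase only U+0041..U+005A by adding 0x20 to the code point."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def doctype_keyword(text, pos):
    """Check the six characters at `pos` for PUBLIC or SYSTEM.

    Returns ("public" | "system", pos + 6) on a match and (None, pos)
    otherwise; in the latter case the state above reports a parse error,
    sets force-quirks and switches to the bogus DOCTYPE state.
    """
    word = ascii_lower(text[pos:pos + 6])
    if word in ("public", "system"):
        return word, pos + 6
    return None, pos
```

The sketch avoids `str.lower()` on purpose: that method also maps non-ASCII letters, which is broader than the A to Z range the states above describe.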
+ + 8.2.4.28 DOCTYPE public identifier (double-quoted) state + + Consume the next input character: + + U+0022 QUOTATION MARK (") + Switch to the after DOCTYPE public identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's public identifier. Stay in the DOCTYPE public identifier + (double-quoted) state. + + 8.2.4.29 DOCTYPE public identifier (single-quoted) state + + Consume the next input character: + + U+0027 APOSTROPHE (') + Switch to the after DOCTYPE public identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's public identifier. Stay in the DOCTYPE public identifier + (single-quoted) state. + + 8.2.4.30 After DOCTYPE public identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after DOCTYPE public identifier state. + + U+0022 QUOTATION MARK (") + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (double-quoted) state. + + U+0027 APOSTROPHE (') + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. 
+ + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Switch to the bogus DOCTYPE state. + + 8.2.4.31 Before DOCTYPE system identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the before DOCTYPE system identifier state. + + U+0022 QUOTATION MARK (") + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (double-quoted) state. + + U+0027 APOSTROPHE (') + Set the DOCTYPE token's system identifier to the empty string + (not missing), then switch to the DOCTYPE system identifier + (single-quoted) state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Switch to the bogus DOCTYPE state. + + 8.2.4.32 DOCTYPE system identifier (double-quoted) state + + Consume the next input character: + + U+0022 QUOTATION MARK (") + Switch to the after DOCTYPE system identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's system identifier. Stay in the DOCTYPE system identifier + (double-quoted) state. 
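The four quoted identifier states (public or system, double-quoted or single-quoted) share one shape: accumulate characters until the matching quote, a premature U+003E, or EOF. The following Python sketch shows that shape under stated assumptions; the function name, the string-plus-position input model and the outcome labels are all hypothetical.

```python
def scan_quoted_identifier(text, pos, quote):
    """Scan an identifier starting at pos, just after the opening quote.

    Returns (identifier, new_pos, outcome) where outcome is:
      "closed": the matching quote was seen; move to the "after" state
      "gt":     premature '>'; parse error, force-quirks on, emit token
      "eof":    end of input; parse error, force-quirks on, emit token,
                and reconsume EOF in the data state
    """
    chars = []
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == quote:
            return "".join(chars), pos, "closed"
        if ch == ">":
            return "".join(chars), pos, "gt"
        chars.append(ch)  # "Anything else": append to the identifier
    return "".join(chars), pos, "eof"
```

Note that the other kind of quote is ordinary data inside the identifier, which is why the quote character is passed in rather than hard-coded.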
+ + 8.2.4.33 DOCTYPE system identifier (single-quoted) state + + Consume the next input character: + + U+0027 APOSTROPHE (') + Switch to the after DOCTYPE system identifier state. + + U+003E GREATER-THAN SIGN (>) + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Append the current input character to the current DOCTYPE + token's system identifier. Stay in the DOCTYPE system identifier + (single-quoted) state. + + 8.2.4.34 After DOCTYPE system identifier state + + Consume the next input character: + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + Stay in the after DOCTYPE system identifier state. + + U+003E GREATER-THAN SIGN (>) + Emit the current DOCTYPE token. Switch to the data state. + + EOF + Parse error. Set the DOCTYPE token's force-quirks flag to on. + Emit that DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Parse error. Switch to the bogus DOCTYPE state. (This does not + set the DOCTYPE token's force-quirks flag to on.) + + 8.2.4.35 Bogus DOCTYPE state + + Consume the next input character: + + U+003E GREATER-THAN SIGN (>) + Emit the DOCTYPE token. Switch to the data state. + + EOF + Emit the DOCTYPE token. Reconsume the EOF character in the data + state. + + Anything else + Stay in the bogus DOCTYPE state. + + 8.2.4.36 CDATA section state + + (This can only happen if the content model flag is set to the PCDATA + state, and is unrelated to the content model flag's CDATA state.) + + Consume every character up to the next occurrence of the three + character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE + BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF), + whichever comes first. 
Emit a series of character tokens consisting of + all the characters consumed except the matching three character + sequence at the end (if one was found before the end of the file). + + Switch to the data state. + + If the end of the file was reached, reconsume the EOF character. + + 8.2.4.37 Tokenizing character references + + This section defines how to consume a character reference. This + definition is used when parsing character references in text and in + attributes. + + The behavior depends on the identity of the next character (the one + immediately after the U+0026 AMPERSAND character): + + U+0009 CHARACTER TABULATION + U+000A LINE FEED (LF) + U+000C FORM FEED (FF) + U+0020 SPACE + U+003C LESS-THAN SIGN + U+0026 AMPERSAND + EOF + The additional allowed character, if there is one + Not a character reference. No characters are consumed, and + nothing is returned. (This is not an error, either.) + + U+0023 NUMBER SIGN (#) + Consume the U+0023 NUMBER SIGN. + + The behavior further depends on the character after the U+0023 + NUMBER SIGN: + + U+0078 LATIN SMALL LETTER X + U+0058 LATIN CAPITAL LETTER X + Consume the X. + + Follow the steps below, but using the range of characters + U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061 + LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER + F, and U+0041 LATIN CAPITAL LETTER A, through to U+0046 + LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f). + + When it comes to interpreting the number, interpret it as + a hexadecimal number. + + Anything else + Follow the steps below, but using the range of characters + U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just + 0-9). + + When it comes to interpreting the number, interpret it as + a decimal number. + + Consume as many characters as match the range of characters + given above. + + If no characters match the range, then don't consume any + characters (and unconsume the U+0023 NUMBER SIGN character and, + if appropriate, the X character). 
This is a parse error; nothing + is returned. + + Otherwise, if the next character is a U+003B SEMICOLON, consume + that too. If it isn't, there is a parse error. + + If one or more characters match the range, then take them all + and interpret the string of characters as a number (either + hexadecimal or decimal as appropriate). + + If that number is one of the numbers in the first column of the + following table, then this is a parse error. Find the row with + that number in the first column, and return a character token + for the Unicode character given in the second column of that + row. + + Number Unicode character + 0x0D U+000A LINE FEED (LF) + 0x80 U+20AC EURO SIGN ('€') + 0x81 U+FFFD REPLACEMENT CHARACTER + 0x82 U+201A SINGLE LOW-9 QUOTATION MARK ('‚') + 0x83 U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ') + 0x84 U+201E DOUBLE LOW-9 QUOTATION MARK ('„') + 0x85 U+2026 HORIZONTAL ELLIPSIS ('…') + 0x86 U+2020 DAGGER ('†') + 0x87 U+2021 DOUBLE DAGGER ('‡') + 0x88 U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ') + 0x89 U+2030 PER MILLE SIGN ('‰') + 0x8A U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š') + 0x8B U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹') + 0x8C U+0152 LATIN CAPITAL LIGATURE OE ('Œ') + 0x8D U+FFFD REPLACEMENT CHARACTER + 0x8E U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž') + 0x8F U+FFFD REPLACEMENT CHARACTER + 0x90 U+FFFD REPLACEMENT CHARACTER + 0x91 U+2018 LEFT SINGLE QUOTATION MARK ('‘') + 0x92 U+2019 RIGHT SINGLE QUOTATION MARK ('’') + 0x93 U+201C LEFT DOUBLE QUOTATION MARK ('“') + 0x94 U+201D RIGHT DOUBLE QUOTATION MARK ('”') + 0x95 U+2022 BULLET ('•') + 0x96 U+2013 EN DASH ('–') + 0x97 U+2014 EM DASH ('—') + 0x98 U+02DC SMALL TILDE ('˜') + 0x99 U+2122 TRADE MARK SIGN ('™') + 0x9A U+0161 LATIN SMALL LETTER S WITH CARON ('š') + 0x9B U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›') + 0x9C U+0153 LATIN SMALL LIGATURE OE ('œ') + 0x9D U+FFFD REPLACEMENT CHARACTER + 0x9E U+017E LATIN SMALL LETTER Z WITH CARON ('ž') + 0x9F U+0178 LATIN CAPITAL 
LETTER Y WITH DIAERESIS ('Ÿ') + + Otherwise, if the number is in the range 0x0000 to 0x0008, + 0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to + 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, + 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, + 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, + 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, + 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, + 0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is + a parse error; return a character token for the U+FFFD + REPLACEMENT CHARACTER character instead. + + Otherwise, return a character token for the Unicode character + whose code point is that number. + + Anything else + Consume the maximum number of characters possible, with the + consumed characters matching one of the identifiers in the first + column of the named character references table (in a + case-sensitive manner). + + If no match can be made, then this is a parse error. No + characters are consumed, and nothing is returned. + + If the last character matched is not a U+003B SEMICOLON (;), + there is a parse error. + + If the character reference is being consumed as part of an + attribute, and the last character matched is not a U+003B + SEMICOLON (;), and the next character is in the range U+0030 + DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A + to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A + to U+007A LATIN SMALL LETTER Z, then, for historical reasons, + all the characters that were matched after the U+0026 AMPERSAND + (&) must be unconsumed, and nothing is returned. + + Otherwise, return a character token for the character + corresponding to the character reference name (as given by the + second column of the named character references table). + + If the markup contains I'm &notit; I tell you, the character + reference is parsed as "not", as in, I'm ¬it; I tell you.
But if + the markup was I'm &notin; I tell you, the character reference + would be parsed as "notin;", resulting in I'm ∉ I tell you. diff --git a/parser/html/java/htmlparser/doc/tree-construction.txt b/parser/html/java/htmlparser/doc/tree-construction.txt new file mode 100644 index 000000000..0febf147a --- /dev/null +++ b/parser/html/java/htmlparser/doc/tree-construction.txt @@ -0,0 +1,2201 @@ + #8.2.4 Tokenization Table of contents 8.4 Serializing HTML fragments + + WHATWG + +HTML 5 + +Draft Recommendation — 13 January 2009 + + ← 8.2.4 Tokenization – Table of contents – 8.4 Serializing HTML + fragments → + + 8.2.5 Tree construction + + The input to the tree construction stage is a sequence of tokens from + the tokenization stage. The tree construction stage is associated with + a DOM Document object when a parser is created. The "output" of this + stage consists of dynamically modifying or extending that document's + DOM tree. + + This specification does not define when an interactive user agent has + to render the Document so that it is available to the user, or when it + has to begin accepting user input. + + As each token is emitted from the tokeniser, the user agent must + process the token according to the rules given in the section + corresponding to the current insertion mode. + + When the steps below require the UA to insert a character into a node, + if that node has a child immediately before where the character is to + be inserted, and that child is a Text node, and that Text node was the + last node that the parser inserted into the document, then the + character must be appended to that Text node; otherwise, a new Text + node whose data is just that character must be inserted in the + appropriate place. + + DOM mutation events must not fire for changes caused by the UA parsing + the document. (Conceptually, the parser is not mutating the DOM, it is + constructing it.)
This includes the parsing of any content inserted + using document.write() and document.writeln() calls. [DOM3EVENTS] + + Not all of the tag names mentioned below are conformant tag names in + this specification; many are included to handle legacy content. They + still form part of the algorithm that implementations are required to + implement to claim conformance. + + The algorithm described below places no limit on the depth of the DOM + tree generated, or on the length of tag names, attribute names, + attribute values, text nodes, etc. While implementors are encouraged to + avoid arbitrary limits, it is recognized that practical concerns will + likely force user agents to impose nesting depth limits. + + 8.2.5.1 Creating and inserting elements + + When the steps below require the UA to create an element for a token in + a particular namespace, the UA must create a node implementing the + interface appropriate for the element type corresponding to the tag + name of the token in the given namespace (as given in the specification + that defines that element, e.g. for an a element in the HTML namespace, + this specification defines it to be the HTMLAnchorElement interface), + with the tag name being the name of that element, with the node being + in the given namespace, and with the attributes on the node being those + given in the given token. + + The interface appropriate for an element in the HTML namespace that is + not defined in this specification is HTMLElement. The interface + appropriate for an element in another namespace that is not defined by + that namespace's specification is Element. + + When a resettable element is created in this manner, its reset + algorithm must be invoked once the attributes are set. (This + initializes the element's value and checkedness based on the element's + attributes.)
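The interface fallback rule for created elements (HTMLElement for unknown elements in the HTML namespace, Element for unknown elements elsewhere) can be sketched as a lookup with a default. This is a minimal sketch: the registry below is deliberately tiny and hypothetical, containing only the a element named in the text; the namespace URIs are the standard XHTML and SVG ones.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative registry only; a real one covers every defined element.
KNOWN_INTERFACES = {
    (HTML_NS, "a"): "HTMLAnchorElement",
}


def interface_for(namespace, tag_name):
    """Return the name of the DOM interface to use for a new element."""
    known = KNOWN_INTERFACES.get((namespace, tag_name))
    if known is not None:
        return known
    # Fallbacks for elements the relevant specification does not define.
    return "HTMLElement" if namespace == HTML_NS else "Element"
```

For example, interface_for(HTML_NS, "blink") falls back to "HTMLElement", while an undefined element in the SVG namespace falls back to "Element".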
+ __________________________________________________________________ + + When the steps below require the UA to insert an HTML element for a + token, the UA must first create an element for the token in the HTML + namespace, and then append this node to the current node, and push it + onto the stack of open elements so that it is the new current node. + + The steps below may also require that the UA insert an HTML element in + a particular place, in which case the UA must follow the same steps + except that it must insert or append the new node in the location + specified instead of appending it to the current node. (This happens in + particular during the parsing of tables with invalid content.) + + If an element created by the insert an HTML element algorithm is a + form-associated element, and the form element pointer is not null, and + the newly created element doesn't have a form attribute, the user agent + must associate the newly created element with the form element pointed + to by the form element pointer before inserting it wherever it is to be + inserted. + __________________________________________________________________ + + When the steps below require the UA to insert a foreign element for a + token, the UA must first create an element for the token in the given + namespace, and then append this node to the current node, and push it + onto the stack of open elements so that it is the new current node. If + the newly created element has an xmlns attribute in the XMLNS namespace + whose value is not exactly the same as the element's namespace, that is + a parse error. + + When the steps below require the user agent to adjust MathML attributes + for a token, then, if the token has an attribute named definitionurl, + change its name to definitionURL (note the case difference). 
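The MathML attribute adjustment above amounts to a single case-correcting rename. A minimal Python sketch, assuming (hypothetically) that a token's attributes are held in a dict of name to value:

```python
def adjust_mathml_attributes(attributes):
    """Return a copy of attributes with definitionurl renamed to definitionURL.

    All other attribute names and all values are left untouched.
    """
    return {
        ("definitionURL" if name == "definitionurl" else name): value
        for name, value in attributes.items()
    }
```

The rename is needed because the tokenizer lowercases attribute names, while MathML defines the attribute in mixed case.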
+ + When the steps below require the user agent to adjust foreign + attributes for a token, then, if any of the attributes on the token + match the strings given in the first column of the following table, let + the attribute be a namespaced attribute, with the prefix being the + string given in the corresponding cell in the second column, the local + name being the string given in the corresponding cell in the third + column, and the namespace being the namespace given in the + corresponding cell in the fourth column. (This fixes the use of + namespaced attributes, in particular xml:lang.) + + Attribute name Prefix Local name Namespace + xlink:actuate xlink actuate XLink namespace + xlink:arcrole xlink arcrole XLink namespace + xlink:href xlink href XLink namespace + xlink:role xlink role XLink namespace + xlink:show xlink show XLink namespace + xlink:title xlink title XLink namespace + xlink:type xlink type XLink namespace + xml:base xml base XML namespace + xml:lang xml lang XML namespace + xml:space xml space XML namespace + xmlns (none) xmlns XMLNS namespace + xmlns:xlink xmlns xlink XMLNS namespace + __________________________________________________________________ + + The generic CDATA element parsing algorithm and the generic RCDATA + element parsing algorithm consist of the following steps. These + algorithms are always invoked in response to a start tag token. + 1. Insert an HTML element for the token. + 2. If the algorithm that was invoked is the generic CDATA element + parsing algorithm, switch the tokeniser's content model flag to the + CDATA state; otherwise the algorithm invoked was the generic RCDATA + element parsing algorithm, switch the tokeniser's content model + flag to the RCDATA state. + 3. Let the original insertion mode be the current insertion mode. + 4. Then, switch the insertion mode to "in CDATA/RCDATA". 
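The four steps of the generic CDATA and RCDATA element parsing algorithms can be sketched as follows. The ParserState class and its field names are assumptions made for this sketch, and pushing the tag name onto a list stands in for the full "insert an HTML element" procedure.

```python
from dataclasses import dataclass, field


@dataclass
class ParserState:
    insertion_mode: str = "in head"
    original_insertion_mode: str = ""
    content_model: str = "PCDATA"  # the tokeniser's content model flag
    open_elements: list = field(default_factory=list)


def generic_text_element(state, tag_name, kind):
    """Run the generic parsing algorithm; kind is "CDATA" or "RCDATA"."""
    if kind not in ("CDATA", "RCDATA"):
        raise ValueError("kind must be 'CDATA' or 'RCDATA'")
    # Step 1: insert an HTML element for the token (simplified stand-in).
    state.open_elements.append(tag_name)
    # Step 2: switch the tokeniser's content model flag.
    state.content_model = kind
    # Step 3: remember the mode to return to later.
    state.original_insertion_mode = state.insertion_mode
    # Step 4: switch the insertion mode.
    state.insertion_mode = "in CDATA/RCDATA"
```

Step 3 has to run before step 4; otherwise the original insertion mode would be overwritten with "in CDATA/RCDATA" and could not be restored.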
+ + 8.2.5.2 Closing elements that have implied end tags + + When the steps below require the UA to generate implied end tags, then, + while the current node is a dd element, a dt element, an li element, an + option element, an optgroup element, a p element, an rp element, or an + rt element, the UA must pop the current node off the stack of open + elements. + + If a step requires the UA to generate implied end tags but lists an + element to exclude from the process, then the UA must perform the above + steps as if that element was not in the above list. + + 8.2.5.3 Foster parenting + + Foster parenting happens when content is misnested in tables. + + When a node node is to be foster parented, the node node must be + inserted into the foster parent element, and the current table must be + marked as tainted. (Once the current table has been tainted, whitespace + characters are inserted into the foster parent element instead of the + current node.) + + The foster parent element is the parent element of the last table + element in the stack of open elements, if there is a table element and + it has such a parent element. If there is no table element in the stack + of open elements (fragment case), then the foster parent element is the + first element in the stack of open elements (the html element). + Otherwise, if there is a table element in the stack of open elements, + but the last table element in the stack of open elements has no parent, + or its parent node is not an element, then the foster parent element is + the element before the last table element in the stack of open + elements. + + If the foster parent element is the parent element of the last table + element in the stack of open elements, then node must be inserted + immediately before the last table element in the stack of open elements + in the foster parent element; otherwise, node must be appended to the + foster parent element. 
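The choice of foster parent element and insertion position described above can be sketched with a toy node model. The Node class, its tainted flag and the list-based stack of open elements are assumptions of this sketch; it also assumes the stack begins with the html element, so a table element is never at index 0.

```python
class Node:
    def __init__(self, name, is_element=True):
        self.name = name
        self.is_element = is_element
        self.parent = None
        self.children = []
        self.tainted = False  # only meaningful for table elements

    def append(self, child):
        child.parent = self
        self.children.append(child)

    def insert_before(self, child, reference):
        child.parent = self
        self.children.insert(self.children.index(reference), child)


def foster_parent(node, open_elements):
    """Insert node following the foster parenting rules."""
    tables = [i for i, el in enumerate(open_elements) if el.name == "table"]
    if not tables:
        # Fragment case: use the first element in the stack (html).
        open_elements[0].append(node)
        return
    index = tables[-1]
    table = open_elements[index]  # the last table in the stack
    table.tainted = True
    parent = table.parent
    if parent is not None and parent.is_element:
        # Insert immediately before the table in its parent element.
        parent.insert_before(node, table)
    else:
        # Table has no element parent: use the element before it in the stack.
        open_elements[index - 1].append(node)
```

In the common case the misnested content therefore ends up as the previous sibling of the table rather than inside it.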
+ + 8.2.5.4 The "initial" insertion mode + + When the insertion mode is "initial", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Ignore the token. + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + If the DOCTYPE token's name is not a case-sensitive match for + the string "html", or if the token's public identifier is + neither missing nor a case-sensitive match for the string + "XSLT-compat", or if the token's system identifier is not + missing, then there is a parse error (this is the DOCTYPE parse + error). Conformance checkers may, instead of reporting this + error, switch to a conformance checking mode for another + language (e.g. based on the DOCTYPE token a conformance checker + could recognize that the document is an HTML4-era document, and + defer to an HTML4 conformance checker.) + + Append a DocumentType node to the Document node, with the name + attribute set to the name given in the DOCTYPE token; the + publicId attribute set to the public identifier given in the + DOCTYPE token, or the empty string if the public identifier was + missing; the systemId attribute set to the system identifier + given in the DOCTYPE token, or the empty string if the system + identifier was missing; and the other attributes specific to + DocumentType objects set to null and empty lists as appropriate. + Associate the DocumentType node with the Document object so that + it is returned as the value of the doctype attribute of the + Document object. + + Then, if the DOCTYPE token matches one of the conditions in the + following list, then set the document to quirks mode: + + + The force-quirks flag is set to on. + + The name is set to anything other than "HTML".
+ + The public identifier starts with: "+//Silmaril//dtd html Pro + v0r11 19970101//" + + The public identifier starts with: "-//AdvaSoft Ltd//DTD HTML + 3.0 asWedit + extensions//" + + The public identifier starts with: "-//AS//DTD HTML 3.0 + asWedit + extensions//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Level 1//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Level 2//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Strict Level 1//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Strict Level 2//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0 + Strict//" + + The public identifier starts with: "-//IETF//DTD HTML 2.0//" + + The public identifier starts with: "-//IETF//DTD HTML 2.1E//" + + The public identifier starts with: "-//IETF//DTD HTML 3.0//" + + The public identifier starts with: "-//IETF//DTD HTML 3.2 + Final//" + + The public identifier starts with: "-//IETF//DTD HTML 3.2//" + + The public identifier starts with: "-//IETF//DTD HTML 3//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 0//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 1//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 2//" + + The public identifier starts with: "-//IETF//DTD HTML Level + 3//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 0//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 1//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 2//" + + The public identifier starts with: "-//IETF//DTD HTML Strict + Level 3//" + + The public identifier starts with: "-//IETF//DTD HTML + Strict//" + + The public identifier starts with: "-//IETF//DTD HTML//" + + The public identifier starts with: "-//Metrius//DTD Metrius + Presentational//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 2.0 HTML Strict//" + + The public identifier starts with: 
"-//Microsoft//DTD Internet + Explorer 2.0 HTML//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 2.0 Tables//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 3.0 HTML Strict//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 3.0 HTML//" + + The public identifier starts with: "-//Microsoft//DTD Internet + Explorer 3.0 Tables//" + + The public identifier starts with: "-//Netscape Comm. + Corp.//DTD HTML//" + + The public identifier starts with: "-//Netscape Comm. + Corp.//DTD Strict HTML//" + + The public identifier starts with: "-//O'Reilly and + Associates//DTD HTML 2.0//" + + The public identifier starts with: "-//O'Reilly and + Associates//DTD HTML Extended 1.0//" + + The public identifier starts with: "-//O'Reilly and + Associates//DTD HTML Extended Relaxed 1.0//" + + The public identifier starts with: "-//SoftQuad Software//DTD + HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//" + + The public identifier starts with: "-//SoftQuad//DTD HoTMetaL + PRO 4.0::19971010::extensions to HTML 4.0//" + + The public identifier starts with: "-//Spyglass//DTD HTML 2.0 + Extended//" + + The public identifier starts with: "-//SQ//DTD HTML 2.0 + HoTMetaL + extensions//" + + The public identifier starts with: "-//Sun Microsystems + Corp.//DTD HotJava HTML//" + + The public identifier starts with: "-//Sun Microsystems + Corp.//DTD HotJava Strict HTML//" + + The public identifier starts with: "-//W3C//DTD HTML 3 + 1995-03-24//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2 + Draft//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2 + Final//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2//" + + The public identifier starts with: "-//W3C//DTD HTML 3.2S + Draft//" + + The public identifier starts with: "-//W3C//DTD HTML 4.0 + Frameset//" + + The public identifier starts with: "-//W3C//DTD HTML 4.0 + Transitional//" + + The public identifier 
starts with: "-//W3C//DTD HTML + Experimental 19960712//" + + The public identifier starts with: "-//W3C//DTD HTML + Experimental 970421//" + + The public identifier starts with: "-//W3C//DTD W3 HTML//" + + The public identifier starts with: "-//W3O//DTD W3 HTML 3.0//" + + The public identifier is set to: "-//W3O//DTD W3 HTML Strict + 3.0//EN//" + + The public identifier starts with: "-//WebTechs//DTD Mozilla + HTML 2.0//" + + The public identifier starts with: "-//WebTechs//DTD Mozilla + HTML//" + + The public identifier is set to: "-/W3C/DTD HTML 4.0 + Transitional/EN" + + The public identifier is set to: "HTML" + + The system identifier is set to: + "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd" + + The system identifier is missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Frameset//" + + The system identifier is missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Transitional//" + + Otherwise, if the DOCTYPE token matches one of the conditions in + the following list, then set the document to limited quirks + mode: + + + The public identifier starts with: "-//W3C//DTD XHTML 1.0 + Frameset//" + + The public identifier starts with: "-//W3C//DTD XHTML 1.0 + Transitional//" + + The system identifier is not missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Frameset//" + + The system identifier is not missing and the public identifier + starts with: "-//W3C//DTD HTML 4.01 Transitional//" + + The name, system identifier, and public identifier strings must + be compared to the values given in the lists above in an ASCII + case-insensitive manner. A system identifier whose value is the + empty string is not considered missing for the purposes of the + conditions above. + + Then, switch the insertion mode to "before html". + + Anything else + Parse error. + + Set the document to quirks mode. + + Switch the insertion mode to "before html", then reprocess the + current token. 
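The quirks mode decision above is a sequence of ASCII case-insensitive comparisons. The Python sketch below illustrates the control flow only: QUIRKS_PREFIXES holds just three of the many prefixes listed above, the function and constant names are hypothetical, and "no quirks" is this sketch's label for a document that matches neither list.

```python
import string

_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Abbreviated on purpose: the full list above has many more entries.
QUIRKS_PREFIXES = (
    "-//ietf//dtd html 2.0//",
    "-//w3c//dtd html 3.2 final//",
    "-//w3c//dtd html 4.0 transitional//",
)
QUIRKS_EXACT_PUBLIC = (
    "-//w3o//dtd w3 html strict 3.0//en//",
    "-/w3c/dtd html 4.0 transitional/en",
    "html",
)
QUIRKS_SYSTEM = "http://www.ibm.com/data/dtd/v11/ibmxhtml1-transitional.dtd"
HTML401_PREFIXES = (
    "-//w3c//dtd html 4.01 frameset//",
    "-//w3c//dtd html 4.01 transitional//",
)
LIMITED_PREFIXES = (
    "-//w3c//dtd xhtml 1.0 frameset//",
    "-//w3c//dtd xhtml 1.0 transitional//",
)


def ascii_lower(s):
    """Lowercase A-Z only, leaving non-ASCII characters untouched."""
    return s.translate(_LOWER)


def document_mode(name, public_id, system_id, force_quirks=False):
    """Identifiers are None when missing; "" is present but empty."""
    name = ascii_lower(name or "")
    pub = ascii_lower(public_id) if public_id is not None else None
    sys_id = ascii_lower(system_id) if system_id is not None else None
    if force_quirks or name != "html":
        return "quirks"
    if pub is not None:
        if pub.startswith(QUIRKS_PREFIXES) or pub in QUIRKS_EXACT_PUBLIC:
            return "quirks"
        if sys_id is None and pub.startswith(HTML401_PREFIXES):
            return "quirks"
    if sys_id == QUIRKS_SYSTEM:
        return "quirks"
    if pub is not None:
        if pub.startswith(LIMITED_PREFIXES):
            return "limited quirks"
        if sys_id is not None and pub.startswith(HTML401_PREFIXES):
            return "limited quirks"
    return "no quirks"
```

Modelling a missing identifier as None and an empty one as "" mirrors the rule that an empty system identifier is not considered missing, which is what separates quirks from limited quirks for the HTML 4.01 Frameset and Transitional identifiers.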
+ + 8.2.5.5 The "before html" insertion mode + + When the insertion mode is "before html", tokens must be handled as + follows: + + A DOCTYPE token + Parse error. Ignore the token. + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Ignore the token. + + A start tag whose tag name is "html" + Create an element for the token in the HTML namespace. Append it + to the Document object. Put this element in the stack of open + elements. + + If the token has an attribute "manifest", then resolve the value + of that attribute to an absolute URL, and if that is successful, + run the application cache selection algorithm with the resulting + absolute URL. Otherwise, if there is no such attribute or + resolving it fails, run the application cache selection + algorithm with no manifest. The algorithm must be passed the + Document object. + + Switch the insertion mode to "before head". + + Anything else + Create an HTMLElement node with the tag name html, in the HTML + namespace. Append it to the Document object. Put this element in + the stack of open elements. + + Run the application cache selection algorithm with no manifest, + passing it the Document object. + + Switch the insertion mode to "before head", then reprocess the + current token. + + Should probably make end tags be ignored, so that "" puts the comment before the root node (or should we?) + + The root element can end up being removed from the Document object, + e.g. by scripts; nothing in particular happens in such cases, content + continues being appended to the nodes as described in the next section.
+ + 8.2.5.6 The "before head" insertion mode + + When the insertion mode is "before head", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Ignore the token. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "head" + Insert an HTML element for the token. + + Set the head element pointer to the newly created head element. + + Switch the insertion mode to "in head". + + An end tag whose tag name is one of: "head", "br" + Act as if a start tag token with the tag name "head" and no + attributes had been seen, then reprocess the current token. + + Any other end tag + Parse error. Ignore the token. + + Anything else + Act as if a start tag token with the tag name "head" and no + attributes had been seen, then reprocess the current token. + + This will result in an empty head element being generated, with + the current token being reprocessed in the "after head" + insertion mode. + + 8.2.5.7 The "in head" insertion mode + + When the insertion mode is "in head", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode.
+ + A start tag whose tag name is one of: "base", "command", "eventsource", + "link" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "meta" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + If the element has a charset attribute, and its value is a + supported encoding, and the confidence is currently tentative, + then change the encoding to the encoding given by the value of + the charset attribute. + + Otherwise, if the element has a content attribute, and applying + the algorithm for extracting an encoding from a Content-Type to + its value returns a supported encoding encoding, and the + confidence is currently tentative, then change the encoding to + the encoding encoding. + + A start tag whose tag name is "title" + Follow the generic RCDATA element parsing algorithm. + + A start tag whose tag name is "noscript", if the scripting flag is + enabled + + A start tag whose tag name is one of: "noframes", "style" + Follow the generic CDATA element parsing algorithm. + + A start tag whose tag name is "noscript", if the scripting flag is + disabled + Insert an HTML element for the token. + + Switch the insertion mode to "in head noscript". + + A start tag whose tag name is "script" + + 1. Create an element for the token in the HTML namespace. + 2. Mark the element as being "parser-inserted". + This ensures that, if the script is external, any + document.write() calls in the script will execute in-line, + instead of blowing the document away, as would happen in most + other cases. It also prevents the script from executing until + the end tag is seen. + 3. If the parser was originally created for the HTML fragment + parsing algorithm, then mark the script element as "already + executed". 
(fragment case) + 4. Append the new element to the current node. + 5. Switch the tokeniser's content model flag to the CDATA state. + 6. Let the original insertion mode be the current insertion mode. + 7. Switch the insertion mode to "in CDATA/RCDATA". + + An end tag whose tag name is "head" + Pop the current node (which will be the head element) off the + stack of open elements. + + Switch the insertion mode to "after head". + + An end tag whose tag name is "br" + Act as described in the "anything else" entry below. + + A start tag whose tag name is "head" + Any other end tag + Parse error. Ignore the token. + + Anything else + Act as if an end tag token with the tag name "head" had been + seen, and reprocess the current token. + + In certain UAs, some elements don't trigger the "in body" mode + straight away, but instead get put into the head. Do we want to + copy that? + + 8.2.5.8 The "in head noscript" insertion mode + + When the insertion mode is "in head noscript", tokens must be handled + as follows: + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end tag whose tag name is "noscript" + Pop the current node (which will be a noscript element) from the + stack of open elements; the new current node will be a head + element. + + Switch the insertion mode to "in head". + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + + A comment token + A start tag whose tag name is one of: "link", "meta", "noframes", + "style" + Process the token using the rules for the "in head" insertion + mode. + + An end tag whose tag name is "br" + Act as described in the "anything else" entry below. + + A start tag whose tag name is one of: "head", "noscript" + Any other end tag + Parse error. Ignore the token. + + Anything else + Parse error.
Act as if an end tag with the tag name "noscript" + had been seen and reprocess the current token. + + 8.2.5.9 The "after head" insertion mode + + When the insertion mode is "after head", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "body" + Insert an HTML element for the token. + + Switch the insertion mode to "in body". + + A start tag whose tag name is "frameset" + Insert an HTML element for the token. + + Switch the insertion mode to "in frameset". + + A start tag token whose tag name is one of: "base", "link", "meta", + "noframes", "script", "style", "title" + Parse error. + + Push the node pointed to by the head element pointer onto the + stack of open elements. + + Process the token using the rules for the "in head" insertion + mode. + + Remove the node pointed to by the head element pointer from the + stack of open elements. + + An end tag whose tag name is "br" + Act as described in the "anything else" entry below. + + A start tag whose tag name is "head" + Any other end tag + Parse error. Ignore the token. + + Anything else + Act as if a start tag token with the tag name "body" and no + attributes had been seen, and then reprocess the current token. + + 8.2.5.10 The "in body" insertion mode + + When the insertion mode is "in body", tokens must be handled as + follows: + + A character token + Reconstruct the active formatting elements, if any. + + Insert the token's character into the current node.
+ + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Parse error. For each attribute on the token, check to see if + the attribute is already present on the top element of the stack + of open elements. If it is not, add the attribute and its + corresponding value to that element. + + A start tag token whose tag name is one of: "base", "command", + "eventsource", "link", "meta", "noframes", "script", "style", + "title" + Process the token using the rules for the "in head" insertion + mode. + + A start tag whose tag name is "body" + Parse error. + + If the second element on the stack of open elements is not a + body element, or, if the stack of open elements has only one + node on it, then ignore the token. (fragment case) + + Otherwise, for each attribute on the token, check to see if the + attribute is already present on the body element (the second + element) on the stack of open elements. If it is not, add the + attribute and its corresponding value to that element. + + An end-of-file token + If there is a node in the stack of open elements that is not + either a dd element, a dt element, an li element, a p element, a + tbody element, a td element, a tfoot element, a th element, a + thead element, a tr element, the body element, or the html + element, then this is a parse error. + + Stop parsing. + + An end tag whose tag name is "body" + If the stack of open elements does not have a body element in + scope, this is a parse error; ignore the token. + + Otherwise, if there is a node in the stack of open elements that + is not either a dd element, a dt element, an li element, a p + element, a tbody element, a td element, a tfoot element, a th + element, a thead element, a tr element, the body element, or the + html element, then this is a parse error. 
+ + Switch the insertion mode to "after body". + + An end tag whose tag name is "html" + Act as if an end tag with tag name "body" had been seen, then, + if that token wasn't ignored, reprocess the current token. + + The fake end tag token here can only be ignored in the fragment + case. + + A start tag whose tag name is one of: "address", "article", "aside", + "blockquote", "center", "datagrid", "details", "dialog", "dir", + "div", "dl", "fieldset", "figure", "footer", "header", "menu", + "nav", "ol", "p", "section", "ul" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + A start tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", + "h6" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + If the current node is an element whose tag name is one of "h1", + "h2", "h3", "h4", "h5", or "h6", then this is a parse error; pop + the current node off the stack of open elements. + + Insert an HTML element for the token. + + A start tag whose tag name is one of: "pre", "listing" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + If the next token is a U+000A LINE FEED (LF) character token, + then ignore that token and move on to the next one. (Newlines at + the start of pre blocks are ignored as an authoring + convenience.) + + A start tag whose tag name is "form" + If the form element pointer is not null, then this is a parse + error; ignore the token. + + Otherwise: + + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token, and set the form element + pointer to point to the element created. 
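Several of the start tag rules above share one pattern: close an open p element if one is in scope, then insert the new element, with an extra check for nested headings. The following is a minimal Python sketch of that pattern, not the TreeBuilder.java implementation. The stack is modelled as a list of tag-name strings, the names has_in_scope, close_p and start_block are hypothetical, and the SCOPE_BOUNDARIES set is an assumption because this excerpt does not define "in scope".

```python
# Assumption: this excerpt does not define "in scope", so the boundary set
# below is an illustrative placeholder, not a quotation from the spec.
SCOPE_BOUNDARIES = {"applet", "button", "caption", "html", "marquee",
                    "object", "table", "td", "th"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def has_in_scope(stack, name):
    """Walk from the current node upward; stop at a scope boundary."""
    for node in reversed(stack):
        if node == name:
            return True
        if node in SCOPE_BOUNDARIES:
            return False
    return False


def close_p(stack):
    """Simplified 'act as if an end tag p had been seen': pop through the p."""
    while stack.pop() != "p":
        pass


def start_block(stack, name):
    """Apply the shared start tag pattern; return the number of parse errors."""
    errors = 0
    if has_in_scope(stack, "p"):
        close_p(stack)
    # Heading rule: a heading directly inside another heading is an error,
    # and the open heading is popped before the new one is inserted.
    if name in HEADINGS and stack[-1] in HEADINGS:
        errors += 1
        stack.pop()
    stack.append(name)  # stands in for "insert an HTML element for the token"
    return errors
```

For example, calling start_block on the stack html, body, p with the name div leaves html, body, div. The sketch only tracks the stack and ignores the DOM insertion itself.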
+ + A start tag whose tag name is "li" + Run the following algorithm: + + 1. Initialize node to be the current node (the bottommost node of + the stack). + 2. If node is an li element, then act as if an end tag with the + tag name "li" had been seen, then jump to the last step. + 3. If node is not in the formatting category, and is not in the + phrasing category, and is not an address, div, or p element, + then jump to the last step. + 4. Otherwise, set node to the previous entry in the stack of open + elements and return to step 2. + 5. This is the last step. + If the stack of open elements has a p element in scope, then + act as if an end tag with the tag name "p" had been seen. + Finally, insert an HTML element for the token. + + A start tag whose tag name is one of: "dd", "dt" + Run the following algorithm: + + 1. Initialize node to be the current node (the bottommost node of + the stack). + 2. If node is a dd or dt element, then act as if an end tag with + the same tag name as node had been seen, then jump to the last + step. + 3. If node is not in the formatting category, and is not in the + phrasing category, and is not an address, div, or p element, + then jump to the last step. + 4. Otherwise, set node to the previous entry in the stack of open + elements and return to step 2. + 5. This is the last step. + If the stack of open elements has a p element in scope, then + act as if an end tag with the tag name "p" had been seen. + Finally, insert an HTML element for the token. + + A start tag whose tag name is "plaintext" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + Switch the content model flag to the PLAINTEXT state. 
+ + Once a start tag with the tag name "plaintext" has been seen, + that will be the last token ever seen other than character + tokens (and the end-of-file token), because there is no way to + switch the content model flag out of the PLAINTEXT state. + + An end tag whose tag name is one of: "address", "article", "aside", + "blockquote", "center", "datagrid", "details", "dialog", "dir", + "div", "dl", "fieldset", "figure", "footer", "header", + "listing", "menu", "nav", "ol", "pre", "section", "ul" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + + An end tag whose tag name is "form" + Let node be the element that the form element pointer is set to. + + Set the form element pointer to null. + + If node is null or the stack of open elements does not have node + in scope, then this is a parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not node, then this is a parse error. + 3. Remove node from the stack of open elements. + + An end tag whose tag name is "p" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; act as if a start tag with the tag name p had been + seen, then reprocess the current token. + + Otherwise, run these steps: + + 1. Generate implied end tags, except for elements with the same + tag name as the token. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. 
Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + + An end tag whose tag name is one of: "dd", "dt", "li" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags, except for elements with the same + tag name as the token. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + + An end tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6" + If the stack of open elements does not have an element in scope + whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6", + then this is a parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + whose tag name is one of "h1", "h2", "h3", "h4", "h5", or "h6" + has been popped from the stack. + + An end tag whose tag name is "sarcasm" + Take a deep breath, then act as described in the "any other end + tag" entry below. 
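The end tag rules above repeat the same three steps: generate implied end tags, flag a parse error if the current node does not match, then pop until the matching element is gone. The following is a rough Python sketch of those steps under stated assumptions. The stack is a list of tag-name strings, the function names are hypothetical, and both the IMPLIED set and the SCOPE_BOUNDARIES set are assumptions because neither is defined in this excerpt.

```python
# Assumptions: neither set is defined in this excerpt; both are placeholders.
IMPLIED = {"dd", "dt", "li", "option", "optgroup", "p", "rp", "rt"}
SCOPE_BOUNDARIES = {"applet", "button", "caption", "html", "marquee",
                    "object", "table", "td", "th"}


def has_in_scope(stack, name):
    """Walk from the current node upward; stop at a scope boundary."""
    for node in reversed(stack):
        if node == name:
            return True
        if node in SCOPE_BOUNDARIES:
            return False
    return False


def generate_implied_end_tags(stack, except_for=None):
    """Pop implied-end-tag elements off the bottom of the stack."""
    while stack and stack[-1] in IMPLIED and stack[-1] != except_for:
        stack.pop()


def end_tag(stack, name, keep_same_name=False):
    """Return (ignored, parse_errors) for a block-style end tag.

    keep_same_name=True models the "dd", "dt", "li" and "p" variants, which
    generate implied end tags except for elements named like the token.
    """
    if not has_in_scope(stack, name):
        return True, 1          # parse error; the token is ignored
    errors = 0
    generate_implied_end_tags(stack, name if keep_same_name else None)
    if stack[-1] != name:
        errors += 1             # misnested content is still popped below
    while stack.pop() != name:
        pass
    return False, errors
```

The sketch leaves out the special fallback for an unmatched "p" end tag (which implies a start tag first) and the heading variant, which matches any of h1 to h6 rather than one name.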
+ + A start tag whose tag name is "a" + If the list of active formatting elements contains an element + whose tag name is "a" between the end of the list and the last + marker on the list (or the start of the list if there is no + marker on the list), then this is a parse error; act as if an + end tag with the tag name "a" had been seen, then remove that + element from the list of active formatting elements and the + stack of open elements if the end tag didn't already remove it + (it might not have if the element is not in table scope). + + In the non-conforming stream + <a href="a">a<table><a href="b">b</table>
x, the first a element + would be closed upon seeing the second one, and the "x" + character would be inside a link to "b", not to "a". This is + despite the fact that the outer a element is not in table scope + (meaning that a regular end tag at the start of the table + wouldn't close the outer a element). + + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. Add that element to the + list of active formatting elements. + + A start tag whose tag name is one of: "b", "big", "em", "font", "i", + "s", "small", "strike", "strong", "tt", "u" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. Add that element to the + list of active formatting elements. + + A start tag whose tag name is "nobr" + Reconstruct the active formatting elements, if any. + + If the stack of open elements has a nobr element in scope, then + this is a parse error; act as if an end tag with the tag name + "nobr" had been seen, then once again reconstruct the active + formatting elements, if any. + + Insert an HTML element for the token. Add that element to the + list of active formatting elements. + + An end tag whose tag name is one of: "a", "b", "big", "em", "font", + "i", "nobr", "s", "small", "strike", "strong", "tt", "u" + Follow these steps: + + 1. Let the formatting element be the last element in the list of + active formatting elements that: + o is between the end of the list and the last scope marker + in the list, if any, or the start of the list otherwise, + and + o has the same tag name as the token. + If there is no such node, or, if that node is also in the + stack of open elements but the element is not in scope, then + this is a parse error; ignore the token, and abort these + steps. + Otherwise, if there is such a node, but that node is not in + the stack of open elements, then this is a parse error; remove + the element from the list, and abort these steps. 
+ Otherwise, there is a formatting element and that element is + in the stack and is in scope. If the element is not the + current node, this is a parse error. In any case, proceed with + the algorithm as written in the following steps. + 2. Let the furthest block be the topmost node in the stack of + open elements that is lower in the stack than the formatting + element, and is not an element in the phrasing or formatting + categories. There might not be one. + 3. If there is no furthest block, then the UA must skip the + subsequent steps and instead just pop all the nodes from the + bottom of the stack of open elements, from the current node up + to and including the formatting element, and remove the + formatting element from the list of active formatting + elements. + 4. Let the common ancestor be the element immediately above the + formatting element in the stack of open elements. + 5. If the furthest block has a parent node, then remove the + furthest block from its parent node. + 6. Let a bookmark note the position of the formatting element in + the list of active formatting elements relative to the + elements on either side of it in the list. + 7. Let node and last node be the furthest block. Follow these + steps: + 1. Let node be the element immediately above node in the + stack of open elements. + 2. If node is not in the list of active formatting elements, + then remove node from the stack of open elements and then + go back to step 1. + 3. Otherwise, if node is the formatting element, then go to + the next step in the overall algorithm. + 4. Otherwise, if last node is the furthest block, then move + the aforementioned bookmark to be immediately after the + node in the list of active formatting elements. + 5. 
If node has any children, perform a shallow clone of + node, replace the entry for node in the list of active + formatting elements with an entry for the clone, replace + the entry for node in the stack of open elements with an + entry for the clone, and let node be the clone. + 6. Insert last node into node, first removing it from its + previous parent node if any. + 7. Let last node be node. + 8. Return to step 1 of this inner set of steps. + 8. If the common ancestor node is a table, tbody, tfoot, thead, + or tr element, then, foster parent whatever last node ended up + being in the previous step. + Otherwise, append whatever last node ended up being in the + previous step to the common ancestor node, first removing it + from its previous parent node if any. + 9. Perform a shallow clone of the formatting element. + 10. Take all of the child nodes of the furthest block and append + them to the clone created in the last step. + 11. Append that clone to the furthest block. + 12. Remove the formatting element from the list of active + formatting elements, and insert the clone into the list of + active formatting elements at the position of the + aforementioned bookmark. + 13. Remove the formatting element from the stack of open elements, + and insert the clone into the stack of open elements + immediately below the position of the furthest block in that + stack. + 14. Jump back to step 1 in this series of steps. + + The way these steps are defined, only elements in the formatting + category ever get cloned by this algorithm. + + Because of the way this algorithm causes elements to change + parents, it has been dubbed the "adoption agency algorithm" (in + contrast with other possible algorithms for dealing with + misnested content, which included the "incest algorithm", the + "secret affair algorithm", and the "Heisenberg algorithm").
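Steps 2 and 3 of the algorithm above decide whether the expensive re-parenting is needed at all. The following is a small Python sketch of just those two steps, offered as an illustration rather than the full adoption agency algorithm. The Element class and both function names are hypothetical, and the category test is passed in as a predicate because the phrasing and formatting categories are not defined in this excerpt.

```python
class Element:
    """Hypothetical stand-in for a DOM element; compared by identity."""

    def __init__(self, name):
        self.name = name


def furthest_block_index(stack, fmt_index, is_phrasing_or_formatting):
    """Step 2: first entry below the formatting element that is in
    neither category. The list runs from the html element (index 0)
    down to the current node, so 'topmost lower' is the first match."""
    for i in range(fmt_index + 1, len(stack)):
        if not is_phrasing_or_formatting(stack[i]):
            return i
    return None  # "There might not be one."


def pop_through_formatting(stack, active, fmt_index):
    """Step 3: no furthest block, so pop from the current node up to and
    including the formatting element and drop it from the active list."""
    formatting_element = stack[fmt_index]
    del stack[fmt_index:]
    active.remove(formatting_element)
```

When furthest_block_index returns None, pop_through_formatting is the whole job. Otherwise steps 4 to 14 apply, which this sketch does not attempt.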
+ + A start tag whose tag name is "button" + If the stack of open elements has a button element in scope, + then this is a parse error; act as if an end tag with the tag + name "button" had been seen, then reprocess the token. + + Otherwise: + + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + Insert a marker at the end of the list of active formatting + elements. + + A start tag token whose tag name is one of: "applet", "marquee", + "object" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + Insert a marker at the end of the list of active formatting + elements. + + An end tag token whose tag name is one of: "applet", "button", + "marquee", "object" + If the stack of open elements does not have an element in scope + with the same tag name as that of the token, then this is a + parse error; ignore the token. + + Otherwise, run these steps: + + 1. Generate implied end tags. + 2. If the current node is not an element with the same tag name + as that of the token, then this is a parse error. + 3. Pop elements from the stack of open elements until an element + with the same tag name as the token has been popped from the + stack. + 4. Clear the list of active formatting elements up to the last + marker. + + A start tag whose tag name is "xmp" + Reconstruct the active formatting elements, if any. + + Follow the generic CDATA element parsing algorithm. + + A start tag whose tag name is "table" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. + + Switch the insertion mode to "in table". + + A start tag whose tag name is one of: "area", "basefont", "bgsound", + "br", "embed", "img", "input", "spacer", "wbr" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. 
Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is one of: "param", "source" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "hr" + If the stack of open elements has a p element in scope, then act + as if an end tag with the tag name "p" had been seen. + + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "image" + Parse error. Change the token's tag name to "img" and reprocess + it. (Don't ask.) + + A start tag whose tag name is "isindex" + Parse error. + + If the form element pointer is not null, then ignore the token. + + Otherwise: + + Acknowledge the token's self-closing flag, if it is set. + + Act as if a start tag token with the tag name "form" had been + seen. + + If the token has an attribute called "action", set the action + attribute on the resulting form element to the value of the + "action" attribute of the token. + + Act as if a start tag token with the tag name "hr" had been + seen. + + Act as if a start tag token with the tag name "p" had been seen. + + Act as if a start tag token with the tag name "label" had been + seen. + + Act as if a stream of character tokens had been seen (see below + for what they should say). + + Act as if a start tag token with the tag name "input" had been + seen, with all the attributes from the "isindex" token except + "name", "action", and "prompt". Set the name attribute of the + resulting input element to the value "isindex". + + Act as if a stream of character tokens had been seen (see below + for what they should say). + + Act as if an end tag token with the tag name "label" had been + seen. 
+ + Act as if an end tag token with the tag name "p" had been seen. + + Act as if a start tag token with the tag name "hr" had been + seen. + + Act as if an end tag token with the tag name "form" had been + seen. + + If the token has an attribute with the name "prompt", then the + first stream of characters must be the same string as given in + that attribute, and the second stream of characters must be + empty. Otherwise, the two streams of character tokens together + should, together with the input element, express the equivalent + of "This is a searchable index. Insert your search keywords + here: (input field)" in the user's preferred language. + + A start tag whose tag name is "textarea" + + 1. Insert an HTML element for the token. + 2. If the next token is a U+000A LINE FEED (LF) character token, + then ignore that token and move on to the next one. (Newlines + at the start of textarea elements are ignored as an authoring + convenience.) + 3. Switch the tokeniser's content model flag to the RCDATA state. + 4. Let the original insertion mode be the current insertion mode. + 5. Switch the insertion mode to "in CDATA/RCDATA". + + A start tag whose tag name is one of: "iframe", "noembed" + A start tag whose tag name is "noscript", if the scripting flag is + enabled + Follow the generic CDATA element parsing algorithm. + + A start tag whose tag name is "select" + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + If the insertion mode is one of "in table", "in caption", "in + column group", "in table body", "in row", or "in cell", then + switch the insertion mode to "in select in table". Otherwise, + switch the insertion mode to "in select". + + A start tag whose tag name is one of: "optgroup", "option" + If the stack of open elements has an option element in scope, + then act as if an end tag with the tag name "option" had been + seen. + + Reconstruct the active formatting elements, if any.
+ + Insert an HTML element for the token. + + A start tag whose tag name is one of: "rp", "rt" + If the stack of open elements has a ruby element in scope, then + generate implied end tags. If the current node is not then a + ruby element, this is a parse error; pop all the nodes from the + current node up to the node immediately before the bottommost + ruby element on the stack of open elements. + + Insert an HTML element for the token. + + An end tag whose tag name is "br" + Parse error. Act as if a start tag token with the tag name "br" + had been seen. Ignore the end tag token. + + A start tag whose tag name is "math" + Reconstruct the active formatting elements, if any. + + Adjust MathML attributes for the token. (This fixes the case of + MathML attributes that are not all lowercase.) + + Adjust foreign attributes for the token. (This fixes the use of + namespaced attributes, in particular XLink.) + + Insert a foreign element for the token, in the MathML namespace. + + If the token has its self-closing flag set, pop the current node + off the stack of open elements and acknowledge the token's + self-closing flag. + + Otherwise, let the secondary insertion mode be the current + insertion mode, and then switch the insertion mode to "in + foreign content". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "frame", "frameset", "head", "tbody", "td", "tfoot", "th", + "thead", "tr" + Parse error. Ignore the token. + + Any other start tag + Reconstruct the active formatting elements, if any. + + Insert an HTML element for the token. + + This element will be a phrasing element. + + Any other end tag + Run the following steps: + + 1. Initialize node to be the current node (the bottommost node of + the stack). + 2. If node has the same tag name as the end tag token, then: + 1. Generate implied end tags. + 2. If the tag name of the end tag token does not match the + tag name of the current node, this is a parse error. + 3. 
Pop all the nodes from the current node up to node, + including node, then stop these steps. + 3. Otherwise, if node is in neither the formatting category nor + the phrasing category, then this is a parse error; ignore the + token, and abort these steps. + 4. Set node to the previous entry in the stack of open elements. + 5. Return to step 2. + + 8.2.5.11 The "in CDATA/RCDATA" insertion mode + + When the insertion mode is "in CDATA/RCDATA", tokens must be handled as + follows: + + A character token + Insert the token's character into the current node. + + An end-of-file token + Parse error. + + If the current node is a script element, mark the script element + as "already executed". + + Pop the current node off the stack of open elements. + + Switch the insertion mode to the original insertion mode and + reprocess the current token. + + An end tag whose tag name is "script" + Let script be the current node (which will be a script element). + + Pop the current node off the stack of open elements. + + Switch the insertion mode to the original insertion mode. + + Let the old insertion point have the same value as the current + insertion point. Let the insertion point be just before the next + input character. + + Increment the parser's script nesting level by one. + + Run the script. This might cause some script to execute, which + might cause new characters to be inserted into the tokeniser, + and might cause the tokeniser to output more tokens, resulting + in a reentrant invocation of the parser. + + Decrement the parser's script nesting level by one. If the + parser's script nesting level is zero, then set the parser pause + flag to false. + + Let the insertion point have the value of the old insertion + point. (In other words, restore the insertion point to the value + it had before the previous paragraph. This value might be the + "undefined" value.) 
+ + At this stage, if there is a pending external script, then: + + If the tree construction stage is being called reentrantly, say + from a call to document.write(): + Set the parser pause flag to true, and abort the + processing of any nested invocations of the tokeniser, + yielding control back to the caller. (Tokenization will + resume when the caller returns to the "outer" tree + construction stage.) + + Otherwise: + Follow these steps: + + 1. Let the script be the pending external script. There is + no longer a pending external script. + 2. Pause until the script has completed loading. + 3. Let the insertion point be just before the next input + character. + 4. Execute the script. + 5. Let the insertion point be undefined again. + 6. If there is once again a pending external script, then + repeat these steps from step 1. + + Any other end tag + Pop the current node off the stack of open elements. + + Switch the insertion mode to the original insertion mode. + + 8.2.5.12 The "in table" insertion mode + + When the insertion mode is "in table", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + If the current table is tainted, then act as described in the + "anything else" entry below. + + Otherwise, insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "caption" + Clear the stack back to a table context. (See below.) + + Insert a marker at the end of the list of active formatting + elements. + + Insert an HTML element for the token, then switch the insertion + mode to "in caption". + + A start tag whose tag name is "colgroup" + Clear the stack back to a table context. (See below.)
+ + Insert an HTML element for the token, then switch the insertion + mode to "in column group". + + A start tag whose tag name is "col" + Act as if a start tag token with the tag name "colgroup" had + been seen, then reprocess the current token. + + A start tag whose tag name is one of: "tbody", "tfoot", "thead" + Clear the stack back to a table context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in table body". + + A start tag whose tag name is one of: "td", "th", "tr" + Act as if a start tag token with the tag name "tbody" had been + seen, then reprocess the current token. + + A start tag whose tag name is "table" + Parse error. Act as if an end tag token with the tag name + "table" had been seen, then, if that token wasn't ignored, + reprocess the current token. + + The fake end tag token here can only be ignored in the fragment + case. + + An end tag whose tag name is "table" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Pop elements from this stack until a table element has been + popped from the stack. + + Reset the insertion mode appropriately. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html", "tbody", "td", "tfoot", "th", "thead", "tr" + Parse error. Ignore the token. + + A start tag whose tag name is one of: "style", "script" + If the current table is tainted then act as described in the + "anything else" entry below. + + Otherwise, process the token using the rules for the "in head" + insertion mode. + + A start tag whose tag name is "input" + If the token does not have an attribute with the name "type", or + if it does, but that attribute's value is not an ASCII + case-insensitive match for the string "hidden", or, if the + current table is tainted, then: act as described in the + "anything else" entry below. 
+ + Otherwise: + + Parse error. + + Insert an HTML element for the token. + + Pop that input element off the stack of open elements. + + An end-of-file token + If the current node is not the root html element, then this is a + parse error. + + It can only be the current node in the fragment case. + + Stop parsing. + + Anything else + Parse error. Process the token using the rules for the "in body" + insertion mode, except that if the current node is a table, + tbody, tfoot, thead, or tr element, then, whenever a node would + be inserted into the current node, it must instead be foster + parented. + + When the steps above require the UA to clear the stack back to a table + context, it means that the UA must, while the current node is not a + table element or an html element, pop elements from the stack of open + elements. + + The current node being an html element after this process is a fragment + case. + + 8.2.5.13 The "in caption" insertion mode + + When the insertion mode is "in caption", tokens must be handled as + follows: + + An end tag whose tag name is "caption" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Generate implied end tags. + + Now, if the current node is not a caption element, then this is + a parse error. + + Pop elements from this stack until a caption element has been + popped from the stack. + + Clear the list of active formatting elements up to the last + marker. + + Switch the insertion mode to "in table". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "td", "tfoot", "th", "thead", "tr" + + An end tag whose tag name is "table" + Parse error. Act as if an end tag with the tag name "caption" + had been seen, then, if that token wasn't ignored, reprocess the + current token. + + The fake end tag token here can only be ignored in the fragment + case. 
+ + An end tag whose tag name is one of: "body", "col", "colgroup", "html", + "tbody", "td", "tfoot", "th", "thead", "tr" + Parse error. Ignore the token. + + Anything else + Process the token using the rules for the "in body" insertion + mode. + + 8.2.5.14 The "in column group" insertion mode + + When the insertion mode is "in column group", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "col" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + An end tag whose tag name is "colgroup" + If the current node is the root html element, then this is a + parse error; ignore the token. (fragment case) + + Otherwise, pop the current node (which will be a colgroup + element) from the stack of open elements. Switch the insertion + mode to "in table". + + An end tag whose tag name is "col" + Parse error. Ignore the token. + + An end-of-file token + If the current node is the root html element, then stop parsing. + (fragment case) + + Otherwise, act as described in the "anything else" entry below. + + Anything else + Act as if an end tag with the tag name "colgroup" had been seen, + and then, if that token wasn't ignored, reprocess the current + token. + + The fake end tag token here can only be ignored in the fragment + case.
+ + 8.2.5.15 The "in table body" insertion mode + + When the insertion mode is "in table body", tokens must be handled as + follows: + + A start tag whose tag name is "tr" + Clear the stack back to a table body context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in row". + + A start tag whose tag name is one of: "th", "td" + Parse error. Act as if a start tag with the tag name "tr" had + been seen, then reprocess the current token. + + An end tag whose tag name is one of: "tbody", "tfoot", "thead" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. + + Otherwise: + + Clear the stack back to a table body context. (See below.) + + Pop the current node from the stack of open elements. Switch the + insertion mode to "in table". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "tfoot", "thead" + + An end tag whose tag name is "table" + If the stack of open elements does not have a tbody, thead, or + tfoot element in table scope, this is a parse error. Ignore the + token. (fragment case) + + Otherwise: + + Clear the stack back to a table body context. (See below.) + + Act as if an end tag with the same tag name as the current node + ("tbody", "tfoot", or "thead") had been seen, then reprocess the + current token. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html", "td", "th", "tr" + Parse error. Ignore the token. + + Anything else + Process the token using the rules for the "in table" insertion + mode. + + When the steps above require the UA to clear the stack back to a table + body context, it means that the UA must, while the current node is not + a tbody, tfoot, thead, or html element, pop elements from the stack of + open elements. + + The current node being an html element after this process is a fragment + case. 
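As a rough illustration of the "clear the stack back to a table body context" step defined above, the following Java sketch models the stack of open elements as a deque of tag names. The class and method names (TableContextSketch, clearStackBackTo), the string-based stack, and the extra emptiness guard are assumptions made for this example; they are not taken from TreeBuilder.java.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Hypothetical sketch: the stack of open elements is modelled as a deque of
// tag names whose head is the current node.
final class TableContextSketch {
    // Elements at which "clear the stack back to a table body context" stops.
    static final Set<String> TABLE_BODY_CONTEXT = Set.of("tbody", "tfoot", "thead", "html");

    // Pops elements while the current node is not one of the stop elements.
    // The isEmpty() guard is a defensive addition; in the spec the html
    // element is always at the bottom of the stack.
    static void clearStackBackTo(Deque<String> openElements, Set<String> stopAt) {
        while (!openElements.isEmpty() && !stopAt.contains(openElements.peek())) {
            openElements.pop();
        }
    }

    public static void main(String[] args) {
        Deque<String> stack = new ArrayDeque<>();
        for (String name : new String[] {"html", "body", "table", "tbody", "tr", "td"}) {
            stack.push(name); // the last element pushed is the current node
        }
        clearStackBackTo(stack, TABLE_BODY_CONTEXT);
        System.out.println(stack.peek()); // prints "tbody"
    }
}
```

The same helper would cover the table context and the table row context by passing a different stop set ("table" and "html", or "tr" and "html").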
+ + 8.2.5.16 The "in row" insertion mode + + When the insertion mode is "in row", tokens must be handled as follows: + + A start tag whose tag name is one of: "th", "td" + Clear the stack back to a table row context. (See below.) + + Insert an HTML element for the token, then switch the insertion + mode to "in cell". + + Insert a marker at the end of the list of active formatting + elements. + + An end tag whose tag name is "tr" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Clear the stack back to a table row context. (See below.) + + Pop the current node (which will be a tr element) from the stack + of open elements. Switch the insertion mode to "in table body". + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "tfoot", "thead", "tr" + + An end tag whose tag name is "table" + Act as if an end tag with the tag name "tr" had been seen, then, + if that token wasn't ignored, reprocess the current token. + + The fake end tag token here can only be ignored in the fragment + case. + + An end tag whose tag name is one of: "tbody", "tfoot", "thead" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. + + Otherwise, act as if an end tag with the tag name "tr" had been + seen, then reprocess the current token. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html", "td", "th" + Parse error. Ignore the token. + + Anything else + Process the token using the rules for the "in table" insertion + mode. + + When the steps above require the UA to clear the stack back to a table + row context, it means that the UA must, while the current node is not a + tr element or an html element, pop elements from the stack of open + elements. 
+ + The current node being an html element after this process is a fragment + case. + + 8.2.5.17 The "in cell" insertion mode + + When the insertion mode is "in cell", tokens must be handled as + follows: + + An end tag whose tag name is one of: "td", "th" + If the stack of open elements does not have an element in table + scope with the same tag name as that of the token, then this is + a parse error and the token must be ignored. + + Otherwise: + + Generate implied end tags. + + Now, if the current node is not an element with the same tag + name as the token, then this is a parse error. + + Pop elements from this stack until an element with the same tag + name as the token has been popped from the stack. + + Clear the list of active formatting elements up to the last + marker. + + Switch the insertion mode to "in row". (The current node will be + a tr element at this point.) + + A start tag whose tag name is one of: "caption", "col", "colgroup", + "tbody", "td", "tfoot", "th", "thead", "tr" + If the stack of open elements does not have a td or th element + in table scope, then this is a parse error; ignore the token. + (fragment case) + + Otherwise, close the cell (see below) and reprocess the current + token. + + An end tag whose tag name is one of: "body", "caption", "col", + "colgroup", "html" + Parse error. Ignore the token. + + An end tag whose tag name is one of: "table", "tbody", "tfoot", + "thead", "tr" + If the stack of open elements does not have an element in table + scope with the same tag name as that of the token (which can + only happen for "tbody", "tfoot" and "thead", or, in the + fragment case), then this is a parse error and the token must be + ignored. + + Otherwise, close the cell (see below) and reprocess the current + token. + + Anything else + Process the token using the rules for the "in body" insertion + mode. + + Where the steps above say to close the cell, they mean to run the + following algorithm: + 1. 
If the stack of open elements has a td element in table scope, then + act as if an end tag token with the tag name "td" had been seen. + 2. Otherwise, the stack of open elements will have a th element in + table scope; act as if an end tag token with the tag name "th" had + been seen. + + The stack of open elements cannot have both a td and a th element in + table scope at the same time, nor can it have neither when the + insertion mode is "in cell". + + 8.2.5.18 The "in select" insertion mode + + When the insertion mode is "in select", tokens must be handled as + follows: + + A character token + Insert the token's character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "option" + If the current node is an option element, act as if an end tag + with the tag name "option" had been seen. + + Insert an HTML element for the token. + + A start tag whose tag name is "optgroup" + If the current node is an option element, act as if an end tag + with the tag name "option" had been seen. + + If the current node is an optgroup element, act as if an end tag + with the tag name "optgroup" had been seen. + + Insert an HTML element for the token. + + An end tag whose tag name is "optgroup" + First, if the current node is an option element, and the node + immediately before it in the stack of open elements is an + optgroup element, then act as if an end tag with the tag name + "option" had been seen. + + If the current node is an optgroup element, then pop that node + from the stack of open elements. Otherwise, this is a parse + error; ignore the token. 
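The "optgroup" end tag rule just described can be sketched in Java as follows. This is only an illustration under stated assumptions: the stack of open elements is a list of tag names whose last entry is the current node, the names SelectModeSketch and endOptgroup are hypothetical, and "acting as if an option end tag had been seen" is reduced to popping the option element, which is what this mode's "option" end tag rule does when the current node is an option element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the "in select" handling of an </optgroup> end tag.
final class SelectModeSketch {
    // Returns true if an optgroup element was popped, false if the token is a
    // parse error and must be ignored.
    static boolean endOptgroup(List<String> openElements) {
        int n = openElements.size();
        // An option directly inside an optgroup is closed implicitly first.
        if (n >= 2
                && openElements.get(n - 1).equals("option")
                && openElements.get(n - 2).equals("optgroup")) {
            openElements.remove(n - 1);
        }
        int top = openElements.size() - 1;
        if (top >= 0 && openElements.get(top).equals("optgroup")) {
            openElements.remove(top);
            return true;
        }
        return false; // parse error; ignore the token
    }

    public static void main(String[] args) {
        List<String> stack =
                new ArrayList<>(Arrays.asList("html", "body", "select", "optgroup", "option"));
        System.out.println(endOptgroup(stack)); // prints "true"
        System.out.println(stack);              // prints "[html, body, select]"
    }
}
```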
+ + An end tag whose tag name is "option" + If the current node is an option element, then pop that node + from the stack of open elements. Otherwise, this is a parse + error; ignore the token. + + An end tag whose tag name is "select" + If the stack of open elements does not have an element in table + scope with the same tag name as the token, this is a parse + error. Ignore the token. (fragment case) + + Otherwise: + + Pop elements from the stack of open elements until a select + element has been popped from the stack. + + Reset the insertion mode appropriately. + + A start tag whose tag name is "select" + Parse error. Act as if the token had been an end tag with the + tag name "select" instead. + + A start tag whose tag name is one of: "input", "textarea" + Parse error. Act as if an end tag with the tag name "select" had + been seen, and reprocess the token. + + A start tag token whose tag name is "script" + Process the token using the rules for the "in head" insertion + mode. + + An end-of-file token + If the current node is not the root html element, then this is a + parse error. + + It can only be the current node in the fragment case. + + Stop parsing. + + Anything else + Parse error. Ignore the token. + + 8.2.5.19 The "in select in table" insertion mode + + When the insertion mode is "in select in table", tokens must be handled + as follows: + + A start tag whose tag name is one of: "caption", "table", "tbody", + "tfoot", "thead", "tr", "td", "th" + Parse error. Act as if an end tag with the tag name "select" had + been seen, and reprocess the token. + + An end tag whose tag name is one of: "caption", "table", "tbody", + "tfoot", "thead", "tr", "td", "th" + Parse error. + + If the stack of open elements has an element in table scope with + the same tag name as that of the token, then act as if an end + tag with the tag name "select" had been seen, and reprocess the + token. Otherwise, ignore the token. 
+ + Anything else + Process the token using the rules for the "in select" insertion + mode. + + 8.2.5.20 The "in foreign content" insertion mode + + When the insertion mode is "in foreign content", tokens must be handled + as follows: + + A character token + Insert the token's character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mi element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mo element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mn element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an ms element in the MathML namespace. + + A start tag whose tag name is neither "mglyph" nor "malignmark", if the + current node is an mtext element in the MathML namespace. + + A start tag, if the current node is an element in the HTML namespace. + An end tag + Process the token using the rules for the secondary insertion + mode. + + If, after doing so, the insertion mode is still "in foreign + content", but there is no element in scope that has a namespace + other than the HTML namespace, switch the insertion mode to the + secondary insertion mode. 
+ + A start tag whose tag name is one of: "b", "big", "blockquote", "body", + "br", "center", "code", "dd", "div", "dl", "dt", "em", "embed", + "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "i", "img", + "li", "listing", "menu", "meta", "nobr", "ol", "p", "pre", + "ruby", "s", "small", "span", "strong", "strike", "sub", "sup", + "table", "tt", "u", "ul", "var" + + A start tag whose tag name is "font", if the token has any attributes + named "color", "face", or "size" + + An end-of-file token + Parse error. + + Pop elements from the stack of open elements until the current + node is in the HTML namespace. + + Switch the insertion mode to the secondary insertion mode, and + reprocess the token. + + Any other start tag + If the current node is an element in the MathML namespace, + adjust MathML attributes for the token. (This fixes the case of + MathML attributes that are not all lowercase.) + + Adjust foreign attributes for the token. (This fixes the use of + namespaced attributes, in particular XLink in SVG.) + + Insert a foreign element for the token, in the same namespace as + the current node. + + If the token has its self-closing flag set, pop the current node + off the stack of open elements and acknowledge the token's + self-closing flag. + + 8.2.5.21 The "after body" insertion mode + + When the insertion mode is "after body", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Process the token using the rules for the "in body" insertion + mode. + + A comment token + Append a Comment node to the first element in the stack of open + elements (the html element), with the data attribute set to the + data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode.
+ + An end tag whose tag name is "html" + If the parser was originally created as part of the HTML + fragment parsing algorithm, this is a parse error; ignore the + token. (fragment case) + + Otherwise, switch the insertion mode to "after after body". + + An end-of-file token + Stop parsing. + + Anything else + Parse error. Switch the insertion mode to "in body" and + reprocess the token. + + 8.2.5.22 The "in frameset" insertion mode + + When the insertion mode is "in frameset", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + A start tag whose tag name is "frameset" + Insert an HTML element for the token. + + An end tag whose tag name is "frameset" + If the current node is the root html element, then this is a + parse error; ignore the token. (fragment case) + + Otherwise, pop the current node from the stack of open elements. + + If the parser was not originally created as part of the HTML + fragment parsing algorithm (fragment case), and the current node + is no longer a frameset element, then switch the insertion mode + to "after frameset". + + A start tag whose tag name is "frame" + Insert an HTML element for the token. Immediately pop the + current node off the stack of open elements. + + Acknowledge the token's self-closing flag, if it is set. + + A start tag whose tag name is "noframes" + Process the token using the rules for the "in head" insertion + mode. + + An end-of-file token + If the current node is not the root html element, then this is a + parse error.
+ + It can only be the current node in the fragment case. + + Stop parsing. + + Anything else + Parse error. Ignore the token. + + 8.2.5.23 The "after frameset" insertion mode + + When the insertion mode is "after frameset", tokens must be handled as + follows: + + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + Insert the character into the current node. + + A comment token + Append a Comment node to the current node with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + Parse error. Ignore the token. + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end tag whose tag name is "html" + Switch the insertion mode to "after after frameset". + + A start tag whose tag name is "noframes" + Process the token using the rules for the "in head" insertion + mode. + + An end-of-file token + Stop parsing. + + Anything else + Parse error. Ignore the token. + + This doesn't handle UAs that don't support frames, or that do support + frames but want to show the NOFRAMES content. Supporting the former is + easy; supporting the latter is harder. + + 8.2.5.24 The "after after body" insertion mode + + When the insertion mode is "after after body", tokens must be handled + as follows: + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end-of-file token + Stop parsing. + + Anything else + Parse error. Switch the insertion mode to "in body" and + reprocess the token.
+ + 8.2.5.25 The "after after frameset" insertion mode + + When the insertion mode is "after after frameset", tokens must be + handled as follows: + + A comment token + Append a Comment node to the Document object with the data + attribute set to the data given in the comment token. + + A DOCTYPE token + A character token that is one of U+0009 CHARACTER TABULATION, + U+000A LINE FEED (LF), U+000C FORM FEED (FF), or U+0020 SPACE + + A start tag whose tag name is "html" + Process the token using the rules for the "in body" insertion + mode. + + An end-of-file token + Stop parsing. + + A start tag whose tag name is "noframes" + Process the token using the rules for the "in head" insertion + mode. + + Anything else + Parse error. Ignore the token. + + 8.2.6 The end + + Once the user agent stops parsing the document, the user agent must + follow the steps in this section. + + First, the current document readiness must be set to "interactive". + + Then, the rules for when a script completes loading start applying + (script execution is no longer managed by the parser). + + If any of the scripts in the list of scripts that will execute as soon + as possible have completed loading, or if the list of scripts that will + execute asynchronously is not empty and the first script in that list + has completed loading, then the user agent must act as if those scripts + just completed loading, following the rules given for that in the + script element definition. + + Then, if the list of scripts that will execute when the document has + finished parsing is not empty, and the first item in this list has + already completed loading, then the user agent must act as if that + script just finished loading. + + By this point, there will be no scripts that have loaded but have not + yet been executed. + + The user agent must then fire a simple event called DOMContentLoaded at + the Document.
+ + Once everything that delays the load event has completed, the user + agent must set the current document readiness to "complete", and then + fire a load event at the body element. + + Delaying the load event for things like image loads allows for intranet + port scans (even without JavaScript!). Should we really encode that + into the spec? + + 8.2.7 Coercing an HTML DOM into an infoset + + When an application uses an HTML parser in conjunction with an XML + pipeline, it is possible that the constructed DOM is not compatible + with the XML toolchain in certain subtle ways. For example, an XML + toolchain might not be able to represent attributes with the name + xmlns, since they conflict with the Namespaces in XML syntax. There is + also some data that the HTML parser generates that isn't included in + the DOM itself. This section specifies some rules for handling these + issues. + + If the XML API being used doesn't support DOCTYPEs, the tool may drop + DOCTYPEs altogether. + + If the XML API doesn't support attributes in no namespace that are + named "xmlns", attributes whose names start with "xmlns:", or + attributes in the XMLNS namespace, then the tool may drop such + attributes. + + The tool may annotate the output with any namespace declarations + required for proper operation. + + If the XML API being used restricts the allowable characters in the + local names of elements and attributes, then the tool may map all + element and attribute local names that the API wouldn't support to a + set of names that are allowed, by replacing any character that isn't + supported with the uppercase letter U and the five digits of the + character's Unicode codepoint when expressed in hexadecimal, using + digits 0-9 and capital letters A-F as the symbols, in increasing + numeric order.
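The local-name mapping described above can be sketched in Java as follows. This is a minimal illustration, not the htmlparser implementation: the class name InfosetNameCoercion, the method coerce, and the predicate ASCII_NAME_CHAR (a deliberately conservative stand-in for "characters the XML API supports") are all assumptions made for the example.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of the "U" + five hexadecimal digits escaping rule.
final class InfosetNameCoercion {
    // Assumed stand-in for the set of characters the XML API accepts:
    // ASCII letters, ASCII digits, '_', '-' and '.'.
    static final IntPredicate ASCII_NAME_CHAR = cp ->
            (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9') || cp == '_' || cp == '-' || cp == '.';

    // Replaces every unsupported code point with 'U' followed by its value as
    // zero-padded, upper-case hexadecimal (at least five digits).
    static String coerce(String localName, IntPredicate supported) {
        StringBuilder out = new StringBuilder();
        localName.codePoints().forEach(cp -> {
            if (supported.test(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append('U').append(String.format("%05X", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(coerce("a::", ASCII_NAME_CHAR)); // prints "aU0003AU0003A"
    }
}
```

Note that code points above U+FFFFF need six hexadecimal digits, so this sketch emits six digits for them; the text above only speaks of five.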
+ + For example, the element name foo [...] a <a::> start tag will be closed + by a </a::> end tag, and never by a </aU0003AU0003A> end tag, even if + the user agent is using the rules above to then generate an actual + element in the DOM with the name aU0003AU0003A for that start tag. + + 8.3 Namespaces + + The HTML namespace is: http://www.w3.org/1999/xhtml + + The MathML namespace is: http://www.w3.org/1998/Math/MathML + + The SVG namespace is: http://www.w3.org/2000/svg + + The XLink namespace is: http://www.w3.org/1999/xlink + + The XML namespace is: http://www.w3.org/XML/1998/namespace + + The XMLNS namespace is: http://www.w3.org/2000/xmlns/ -- cgit v1.2.3