HTML

This document describes the complete handling of HTML in magellan. It first covers the parsing process: how HTML is lexically analyzed and then interpreted. After the parsing process is discussed, we give a detailed analysis of each HTML tag, the attributes that are supported, the values for those attributes, and how the tag is treated by magellan.

Parsing

HTML is tokenized by an HTML scanner. The scanner is fed Unicode data to parse. Stream converters are used to translate from various encodings to Unicode. The scanner separates the input stream into tokens.

The HTML parsing engine uses the HTML scanner for lexical analysis. The parsing engine operates by processing the input stream in a set of well-defined steps.

Tag Processing

The parser processes each tag by locating a "tag handler" for it. The HTML parser itself serves as the tag handler for all of the built-in tags documented below. Tag attribute handling is done during translation of tags into content. This mapping translates the tag attributes into content data and into style data. The translation to style data is documented below by indicating the mapping from tag attributes to their CSS1 (plus extensions) equivalents.
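The handler lookup and the split of attributes into content data and style data can be sketched as follows. This is only an illustration, not magellan's code: the names `HANDLERS`, `builtin_handler` and `process_tag` are hypothetical, plain dictionaries stand in for content and style data, and the mapping of ALIGN to the CSS1 `text-align` property is an assumption made for the example.

```python
from typing import Callable, Dict, Optional, Tuple

Attrs = Dict[str, str]
# A handler returns (content data, style data) for one tag.
Handler = Callable[[str, Attrs], Tuple[Attrs, Attrs]]

# Values taken from the "divalign" definition in this document.
DIVALIGN = {"left", "right", "center", "justify"}


def builtin_handler(tag: str, attrs: Attrs) -> Tuple[Attrs, Attrs]:
    """Split tag attributes into content data and CSS-like style data."""
    content: Attrs = {}
    style: Attrs = {}
    for name, value in attrs.items():
        key = name.lower()
        if key == "align" and value.lower() in DIVALIGN:
            # Assumed mapping for illustration: ALIGN -> text-align.
            style["text-align"] = value.lower()
        else:
            # Everything else is kept as content data in this sketch.
            content[key] = value
    return content, style


# Hypothetical registry: one built-in handler serves every built-in tag.
HANDLERS: Dict[str, Handler] = {name: builtin_handler for name in ("P", "DIV", "H1")}


def process_tag(tag: str, attrs: Attrs) -> Optional[Tuple[Attrs, Attrs]]:
    """Locate the handler for a tag and let it translate the attributes."""
    name = tag.upper()
    handler = HANDLERS.get(name)
    if handler is None:
        # What the real parser does for unknown tags is not covered here.
        return None
    return handler(name, attrs)
```

For example, `process_tag("div", {"ALIGN": "Center", "ID": "intro"})` yields content data holding `id` and style data holding `text-align`.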

Special Hacks

The following list describes hacks added to the magellan parsing engine to deal with Navigator compatibility. These are just the parser hacks, not the layout or presentation hacks. Most hacks were introduced for HTML syntax error recovery. HTML does not specify much about how to handle those error conditions. Netscape has made a big effort to render pages with imperfect HTML. For many reasons, new browsers need to stay compatible in this area.

TODO: List of 6.0 features incompatible with 4.0

Tags (Categorically sorted)

All line breaks are conditional. If the x coordinate is at the current left margin then a soft line break does nothing. Hard line breaks are ignored if the last tag did a hard line break.
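The two rules above can be sketched as a small state machine. The class name `LineBreaker` and its fields are hypothetical, and the sketch assumes that any text, or a soft break that actually takes effect, resets the "last tag did a hard line break" state; the document does not spell that out.

```python
class LineBreaker:
    """Tracks just enough layout state to decide whether a break happens."""

    def __init__(self, left_margin: int = 0) -> None:
        self.left_margin = left_margin
        self.x = left_margin
        self.last_was_hard_break = False
        self.breaks = []  # record of the breaks that actually took effect

    def add_text(self, width: int) -> None:
        """Advance the x coordinate; text ends any run of hard breaks."""
        self.x += width
        self.last_was_hard_break = False

    def _emit(self, kind: str) -> None:
        self.breaks.append(kind)
        self.x = self.left_margin

    def soft_break(self) -> bool:
        """A soft break does nothing when x is already at the left margin."""
        if self.x == self.left_margin:
            return False
        self._emit("soft")
        self.last_was_hard_break = False  # assumption, see lead-in
        return True

    def hard_break(self) -> bool:
        """A hard break is ignored if the last tag did a hard break."""
        if self.last_was_hard_break:
            return False
        self._emit("hard")
        self.last_was_hard_break = True
        return True
```

With this sketch, two consecutive hard breaks produce a single break, and a soft break issued at the start of a line is dropped.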

divalign = left | right | center | justify
alignparam = abscenter | left | right | texttop | absbottom | baseline | center | bottom | top | middle | absmiddle
colorspec = named-color | #xyz | #xxyyzz | #xxxyyyzzz | #xxxxyyyyzzzz
clip = [auto | value-or-pct-xy](1..4) (pct of width for even coordinates; pct of height for odd coordinates)
value-or-pct = an integer with an optional %; if the percent is present, any following characters are ignored!
coord-list = XXX
whitespace-strip = remove leading and trailing and any embedded whitespace that is not an actual space (e.g. newlines)
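As an illustration of three of the definitions above, here is a minimal sketch of parsers for value-or-pct, the hex forms of colorspec, and whitespace-strip. All function names are hypothetical. Two points are assumptions rather than documented behavior: a value with trailing characters but no percent sign is rejected, and named colors are simply not handled. Color components are returned unscaled, together with the number of hex digits per component, because the document does not say how the different widths are normalized.

```python
import re
import string

_VALUE_OR_PCT = re.compile(r"\s*(\d+)\s*(%.*)?", re.S)


def parse_value_or_pct(text):
    """Return (value, is_percent), or None if the text does not match."""
    m = _VALUE_OR_PCT.fullmatch(text)
    if m is None:
        return None  # assumption: trailing junk without a % is an error
    # Anything after the percent sign is swallowed by the second group.
    return int(m.group(1)), m.group(2) is not None


def parse_hex_colorspec(text):
    """Return ((r, g, b), digits_per_component) for #xyz ... #xxxxyyyyzzzz."""
    if not text.startswith("#"):
        return None  # named colors are outside this sketch
    digits = text[1:]
    if len(digits) not in (3, 6, 9, 12):
        return None
    if any(c not in string.hexdigits for c in digits):
        return None
    n = len(digits) // 3
    r, g, b = (int(digits[i:i + n], 16) for i in (0, n, 2 * n))
    return (r, g, b), n


def whitespace_strip(text):
    """Trim both ends, then drop embedded whitespace other than spaces."""
    trimmed = text.strip()
    return "".join(c for c in trimmed if c == " " or not c.isspace())
```

For instance, `parse_value_or_pct("50%abc")` is treated as 50 percent, and `whitespace_strip` keeps the space in `"a b"` but removes an embedded newline.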

Head objects:

TITLE BASE META LINK HEAD HTML STYLE FRAMESET FRAME NOFRAMES


Body objects:

BODY LAYER, ILAYER NOLAYER P ADDRESS PLAINTEXT, XMP LISTING PRE NOBR CENTER DIV H1-H6

A note regarding closing paragraphs: Any time a close paragraph is done (for any tag), if the top of the alignment stack has a tag named "P", then a conditional soft line break is done and the alignment is popped.
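The close-paragraph rule can be sketched as below. The function name `close_paragraph`, the representation of the alignment stack as a list of (tag name, alignment) pairs with the top at the end, and the `soft_break` callback are all assumptions made for the example.

```python
def close_paragraph(alignment_stack, soft_break):
    """Apply the close-paragraph rule; return True if an alignment was popped.

    alignment_stack: list of (tag_name, alignment) pairs, top at the end.
    soft_break: callable that performs a conditional soft line break.
    """
    if alignment_stack and alignment_stack[-1][0] == "P":
        soft_break()           # conditional soft line break
        alignment_stack.pop()  # drop the alignment pushed by the P tag
        return True
    return False
```

If the top of the stack was pushed by some other tag, such as DIV, the sketch leaves the stack alone and performs no break.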


TABLE TR TH, TD CAPTION MULTICOL


BLOCKQUOTE UL, OL, MENU, DIR DL LI DD DT


A STRIKE, S, TT, CODE, SAMPLE, KBD, B, STRONG, I, EM, VAR, CITE, BLINK, BIG, SMALL, U, INLINEINPUT, SPELL SUP, SUB SPAN FONT

A note regarding the style stack: The pop of the stack first checks whether the top of the stack is an ANCHOR tag. If it is not an anchor, the top item is unconditionally popped. If the top of the style stack is an anchor tag, the code searches downward for either the bottom of the stack or the first style stack entry not created by an anchor tag. If that entry is followed by another entry, it is removed from the stack (an out-of-order pop, in other words). In this case the anchor style stack entry is left untouched.
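One reading of the style stack pop can be sketched as follows. The function name `pop_style` is hypothetical, stack entries are reduced to the name of the tag that created them, and "A" stands for an entry created by an anchor tag. The case where every entry was created by an anchor is not fully specified by the note; this sketch assumes nothing is removed then.

```python
def pop_style(stack):
    """Pop one style entry; return the removed entry, or None.

    stack: list of tag names, top of the stack at the end of the list.
    """
    if not stack:
        return None
    if stack[-1] != "A":
        # Top is not an anchor: ordinary, unconditional pop.
        return stack.pop()
    # Top is an anchor: search downward for the first non-anchor entry.
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] != "A":
            # Out-of-order pop; the anchor entries above stay in place.
            return stack.pop(i)
    # Only anchor entries found: assumed to leave the stack untouched.
    return None
```

For a stack of B, I, A (top last), the sketch removes I and leaves B and A on the stack.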


text, entities IMG, IMAGE HR BR WBR EMBED NOEMBED APPLET PARAM OBJECT MAP AREA SERVER SPACER


SCRIPT NOSCRIPT


FORM ISINDEX INPUT SELECT OPTION TEXTAREA KEYGEN


BASEFONT 


Unsupported

NSCP_CLOSE, NSCP_OPEN, NSCP_REBLOCK, MQUOTE, CELL, SUBDOC, CERTIFICATE, INLINEINPUTTHICK, INLINEINPUTDOTTED, COLORMAP, HYPE, SPELL, NSDT