 October 2001, --Jcid
 Last update: Oct 2001

                        ---------------
                        THE HTML PARSER
                        ---------------


   Dillo's  parser is more than just a HTML parser, it does XHTML
and  plain  text  also.  It  has  parsing 'modes' that define its
behaviour while working:

   typedef enum {
     DILLO_HTML_PARSE_MODE_INIT,
     DILLO_HTML_PARSE_MODE_STASH,
     DILLO_HTML_PARSE_MODE_STASH_AND_BODY,
     DILLO_HTML_PARSE_MODE_SCRIPT,
     DILLO_HTML_PARSE_MODE_BODY,
     DILLO_HTML_PARSE_MODE_VERBATIM,
     DILLO_HTML_PARSE_MODE_PRE
   } DilloHtmlParseMode;


   The  parser  works  upon  a token-grained basis. i.e. the data
stream is parsed into tokens and the parser is fed with them. The
process  is  simple:  whenever  the  cache  has new data, it gets
passed to Html_write, which groups data into tokens and calls the
appropriate functions for the token type (TAG, SPACE or WORD).

------
TOKENS
------

  * A chunk of WHITE SPACE     --> Html_process_space


  * TAG                        --> Html_process_tag

    The tag start is defined by two adjacent characters:

      first : '<'
      second: ALPHA | '/' | '!' | '?'

      Note: comments are discarded ( <!-- ... --> )

    The tag end is defined by the first '>' character after the
    TAG start. 

    Note: The HTML 4.01 sec. 3.2.2 states that "Attribute/value
    pairs appear before the final '>' of an element's start tag",
    but it doesn't define how to discriminate the "final" '>'.

    Dillo assumes the first '>' as "final" for two reasons:

    1)  It's  a  common  mistake  for human authors to mistype or
       forget  one  of the quote marks of an attribute value, and
       keeping the parser seeking for the closing quote, leads to
       much worse rendering (it may skip significative amounts of
       well written HTML).
    2)  Authors  can  use  &gt;  instead  of '>' inside attribute
       values.



  * WORD                       --> Html_process_word

    A  word is anything that doesn't start with SPACE, and that's
    outside  of  a  tag, up to the first SPACE or tag start.

    SPACE = ' ' | \n | \r | \t | \f | \v


-----------------
THE PARSING STACK 
-----------------

  The parsing state of the document is kept in a stack:

  struct _DilloHtml {
     [...]
     DilloHtmlState *stack;
     gint stack_top; /* Index to the top of the stack [0 based] */
     gint stack_max;
     [...]
  };

  struct _DilloHtmlState {
    char *tag;
    DwStyle *style, *table_cell_style;
    DilloHtmlParseMode parse_mode;
    DilloHtmlTableMode table_mode;
    gint list_level;
    gint list_number;
    DwWidget *page, *table;
    gint32  current_bg_color;
  };


  Basically,  when a TAG is processed, a new state is pushed into
the  'stack'  and  its  'style'  is  set  to  reflect the desired
appearance (details in DwStyle.txt).

  That way, when a word is processed later (added to the Dw), all
the information is within the top state.

  Closing TAGs just pop the stack.


