How to specify tokens
If S is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 -> r1
d2 -> r2
...
dn -> rn
where each di is a distinct name, and each ri is a regular expression over the symbols in S ∪ {d1, d2, ..., di-1}, i.e. the basic symbols and the previously defined names. By restricting each ri to symbols of S and the previously defined names, we can construct a regular expression over S for any ri by repeatedly replacing regular-expression names by the expressions they denote. If ri used dk for some k >= i, then ri might be recursively defined, and this substitution process would not terminate.
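The substitution process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name expand_definitions, the {name} notation for referring to an earlier definition, and the sample definitions letter, digit and id are all hypothetical and chosen for this example. Each ri may mention only names defined before it, so a reference to a later name (or to di itself) is rejected instead of being expanded forever.

```python
import re

def expand_definitions(definitions):
    """Expand an ordered list of (name, expression) pairs into plain regexes.

    Assumption for this sketch: a reference to an earlier definition is
    written as {name} inside the expression.
    """
    expanded = {}
    for name, body in definitions:
        if name in expanded:
            raise ValueError(f"name {name!r} is defined twice")

        def substitute(match, current=name):
            ref = match.group(1)
            # Only previously defined names are allowed, which rules out
            # recursive definitions and guarantees termination.
            if ref not in expanded:
                raise ValueError(
                    f"{current!r} refers to {ref!r}, which is not defined earlier"
                )
            return "(?:" + expanded[ref] + ")"

        # Replace every {name} by the already expanded expression it denotes.
        expanded[name] = re.sub(r"\{([A-Za-z_]\w*)\}", substitute, body)
    return expanded

# Hypothetical regular definition for identifiers.
definitions = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]
expanded = expand_definitions(definitions)
```

After the call, expanded["id"] is a regular expression over the basic symbols only, so it can be handed directly to re.fullmatch. A definition such as ("a", "{a}b") raises ValueError, because a refers to itself.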
So, we treat tokens as terminal symbols in the grammar for the source language. The lexeme matched by the pattern for the token consists of a string of characters in the source program and can be treated as a lexical unit. The lexical analyzer collects information about tokens into their associated attributes. As a practical matter, a token usually has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept. The pointer becomes the attribute for the token.
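A minimal sketch of this idea in Python follows. The class names Token, SymbolEntry and SymbolTable, the function tokenize, and the tiny token set (id, num and a few one-character operators) are assumptions made for illustration, not part of the original notes. The "pointer" is modelled as an index into the list of symbol-table entries.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SymbolEntry:
    lexeme: str  # information kept about the token; only the lexeme here

@dataclass
class Token:
    name: str                        # token name, a terminal of the grammar
    attribute: Optional[int] = None  # index of the symbol-table entry, if any

class SymbolTable:
    def __init__(self):
        self.entries = []
        self._index = {}

    def install(self, lexeme):
        """Return the index of the entry for lexeme, creating it if needed."""
        if lexeme not in self._index:
            self._index[lexeme] = len(self.entries)
            self.entries.append(SymbolEntry(lexeme))
        return self._index[lexeme]

# One named group per token class; leading whitespace is skipped.
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<op>[=+*-]))"
)

def tokenize(source, table):
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind = match.lastgroup
        lexeme = match.group(kind)
        if kind == "op":
            # Operators need no attribute; the lexeme serves as the token name.
            tokens.append(Token(lexeme))
        else:
            # The attribute points at the symbol-table entry for the lexeme.
            tokens.append(Token(kind, table.install(lexeme)))
        pos = match.end()
    return tokens

table = SymbolTable()
tokens = tokenize("count = count + 1", table)
```

Both occurrences of count produce an id token with the same attribute, because install returns the existing entry instead of creating a second one. The parser sees only the token names; anything else it needs about a lexeme is reached through the attribute.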