Lexical Analysis
. Sentences consist of string of tokens (a syntactic category)
for example, number, identifier, keyword, string
. Sequences of characters in a token is a lexeme
for example, 100.01, counter, const, "How are you?"
. Rule of description is a pattern
for example, letter(letter/digit)*
. Discard whatever does not contribute to parsing like white spaces ( blanks, tabs, newlines ) and comments
. construct constants: convert numbers to token num and pass number as its attribute, for example, integer 31 becomes <num, 31>
. recognize keyword and identifiers
for example counter = counter + increment
becomes id = id + id /*check if id is a keyword*/
We often use the terms "token", "pattern" and "lexeme" while studying lexical analysis. Lets see what each term stands for.
Token: A token is a syntactic category. Sentences consist of a string of tokens. For example number, identifier, keyword, string etc are tokens.
Lexeme: Sequence of characters in a token is a lexeme. For example 100.01, counter, const, "How are you?" etc are lexemes.
Pattern: Rule of description is a pattern. For example letter (letter | digit)* is a pattern to symbolize a set of strings which consist of a letter followed by a letter or digit.
In general, there is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. This pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. The patterns are specified using regular expressions. For example, in the Pascal statement
Const pi = 3.1416;
The substring pi is a lexeme for the token "identifier".
We discard whatever does not contribute to parsing like white spaces (blanks, tabs, new lines) and comments.
When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. For example, the pattern num matches both 1 and 0 but it is essential for the code generator to know what string was actually matched. The lexical analyzer collects information about tokens into their associated attributes. For example integer 31 becomes <num, 31>. So, the constants are constructed by converting numbers to token 'num' and passing the number as its attribute. Similarly, we recognize keywords and identifiers. For example count = count + inc becomes id = id + id.
|