diff --git a/usr.bin/lex/Makefile b/usr.bin/lex/Makefile index 7f88e6eea4de..41d4afddceb7 100644 --- a/usr.bin/lex/Makefile +++ b/usr.bin/lex/Makefile @@ -1,4 +1,4 @@ -# $Id$ +# $Id: Makefile,v 1.1.1.1 1994/08/24 13:10:33 csgr Exp $ # # By default, flex will be configured to generate 8-bit scanners only if the # -8 flag is given. If you want it to always generate 8-bit scanners, add @@ -6,13 +6,12 @@ # of all uncompressed scanners. # # Bootstrapping of lex is handled automatically. -# ALso note that flex.skel no longer gets installed. +# Also note that flex.skel no longer gets installed. # # XXX Todo: # Install as lex++, and install FlexLexer.h PROG= lex -LINKS= ${BINDIR}/lex ${BINDIR}/flex #LINKS+= ${BINDIR}/lex ${BINDIR}/lex++ ${BINDIR}/flex ${BINDIR}/flex++ SRCS= ccl.c dfa.c ecs.c gen.c main.c misc.c nfa.c parse.y \ @@ -20,10 +19,9 @@ SRCS= ccl.c dfa.c ecs.c gen.c main.c misc.c nfa.c parse.y \ OBJS+= scan.o LFLAGS+= -is CFLAGS+= -I. -I${.CURDIR} -MAN1= flex.1 flexdoc.1 -MLINKS= flex.1 lex.1 flexdoc.1 lexdoc.1 +MAN1= lex.1 lexdoc.1 -CLEANFILES+= parse.c parse.h scan.c y.tab.h +CLEANFILES+= parse.c parse.h scan.c y.tab.h y.tab.c SUBDIR= lib diff --git a/usr.bin/lex/lex.1 b/usr.bin/lex/lex.1 new file mode 100644 index 000000000000..6aba4d6977ac --- /dev/null +++ b/usr.bin/lex/lex.1 @@ -0,0 +1,1001 @@ +.TH FLEX 1 "November 1993" "Version 2.4" +.SH NAME +flex \- fast lexical analyzer generator +.SH SYNOPSIS +.B flex +.B [\-bcdfhilnpstvwBFILTV78+ \-C[aefFmr] \-Pprefix \-Sskeleton] +.I [filename ...] +.SH DESCRIPTION +.I flex +is a tool for generating +.I scanners: +programs which recognized lexical patterns in text. +.I flex +reads +the given input files, or its standard input if no file names are given, +for a description of a scanner to generate. The description is in +the form of pairs +of regular expressions and C code, called +.I rules. flex +generates as output a C source file, +.B lex.yy.c, +which defines a routine +.B yylex(). 
+This file is compiled and linked with the +.B \-lfl +library to produce an executable. When the executable is run, +it analyzes its input for occurrences +of the regular expressions. Whenever it finds one, it executes +the corresponding C code. +.PP +For full documentation, see +.B flexdoc(1). +This manual entry is intended for use as a quick reference. +.SH OPTIONS +.I flex +has the following options: +.TP +.B \-b +generate backing-up information to +.I lex.backup. +This is a list of scanner states which require backing up and the input +characters on which they do so. By adding rules one can remove +backing-up states. If all backing-up states are eliminated and +.B \-Cf +or +.B \-CF +is used, the generated scanner will run faster. +.TP +.B \-c +is a do-nothing, deprecated option included for POSIX compliance. +.IP +.B NOTE: +in previous releases of +.I flex +.B \-c +specified table-compression options. This functionality is +now given by the +.B \-C +flag. To ease the the impact of this change, when +.I flex +encounters +.B \-c, +it currently issues a warning message and assumes that +.B \-C +was desired instead. In the future this "promotion" of +.B \-c +to +.B \-C +will go away in the name of full POSIX compliance (unless +the POSIX meaning is removed first). +.TP +.B \-d +makes the generated scanner run in +.I debug +mode. Whenever a pattern is recognized and the global +.B yy_flex_debug +is non-zero (which is the default), the scanner will +write to +.I stderr +a line of the form: +.nf + + --accepting rule at line 53 ("the matched text") + +.fi +The line number refers to the location of the rule in the file +defining the scanner (i.e., the file that was fed to flex). Messages +are also generated when the scanner backs up, accepts the +default rule, reaches the end of its input buffer (or encounters +a NUL; the two look the same as far as the scanner's concerned), +or reaches an end-of-file. +.TP +.B \-f +specifies +.I fast scanner. 
+No table compression is done and stdio is bypassed. +The result is large but fast. This option is equivalent to +.B \-Cfr +(see below). +.TP +.B \-h +generates a "help" summary of +.I flex's +options to +.I stderr +and then exits. +.TP +.B \-i +instructs +.I flex +to generate a +.I case-insensitive +scanner. The case of letters given in the +.I flex +input patterns will +be ignored, and tokens in the input will be matched regardless of case. The +matched text given in +.I yytext +will have the preserved case (i.e., it will not be folded). +.TP +.B \-l +turns on maximum compatibility with the original AT&T lex implementation, +at a considerable performance cost. This option is incompatible with +.B \-+, \-f, \-F, \-Cf, +or +.B \-CF. +See +.I flexdoc(1) +for details. +.TP +.B \-n +is another do-nothing, deprecated option included only for +POSIX compliance. +.TP +.B \-p +generates a performance report to stderr. The report +consists of comments regarding features of the +.I flex +input file which will cause a loss of performance in the resulting scanner. +If you give the flag twice, you will also get comments regarding +features that lead to minor performance losses. +.TP +.B \-s +causes the +.I default rule +(that unmatched scanner input is echoed to +.I stdout) +to be suppressed. If the scanner encounters input that does not +match any of its rules, it aborts with an error. +.TP +.B \-t +instructs +.I flex +to write the scanner it generates to standard output instead +of +.B lex.yy.c. +.TP +.B \-v +specifies that +.I flex +should write to +.I stderr +a summary of statistics regarding the scanner it generates. +.TP +.B \-w +suppresses warning messages. +.TP +.B \-B +instructs +.I flex +to generate a +.I batch +scanner instead of an +.I interactive +scanner (see +.B \-I +below). See +.I flexdoc(1) +for details. Scanners using +.B \-Cf +or +.B \-CF +compression options automatically specify this option, too. 
+.TP +.B \-F +specifies that the +.ul +fast +scanner table representation should be used (and stdio bypassed). +This representation is about as fast as the full table representation +.B (-f), +and for some sets of patterns will be considerably smaller (and for +others, larger). It cannot be used with the +.B \-+ +option. See +.B flexdoc(1) +for more details. +.IP +This option is equivalent to +.B \-CFr +(see below). +.TP +.B \-I +instructs +.I flex +to generate an +.I interactive +scanner, that is, a scanner which stops immediately rather than +looking ahead if it knows +that the currently scanned text cannot be part of a longer rule's match. +This is the opposite of +.I batch +scanners (see +.B \-B +above). See +.B flexdoc(1) +for details. +.IP +Note, +.B \-I +cannot be used in conjunction with +.I full +or +.I fast tables, +i.e., the +.B \-f, \-F, \-Cf, +or +.B \-CF +flags. For other table compression options, +.B \-I +is the default. +.TP +.B \-L +instructs +.I flex +not to generate +.B #line +directives in +.B lex.yy.c. +The default is to generate such directives so error +messages in the actions will be correctly +located with respect to the original +.I flex +input file, and not to +the fairly meaningless line numbers of +.B lex.yy.c. +.TP +.B \-T +makes +.I flex +run in +.I trace +mode. It will generate a lot of messages to +.I stderr +concerning +the form of the input and the resultant non-deterministic and deterministic +finite automata. This option is mostly for use in maintaining +.I flex. +.TP +.B \-V +prints the version number to +.I stderr +and exits. +.TP +.B \-7 +instructs +.I flex +to generate a 7-bit scanner, which can save considerable table space, +especially when using +.B \-Cf +or +.B \-CF +(and, at most sites, +.B \-7 +is on by default for these options. To see if this is the case, use the +.B -v +verbose flag and check the flag summary it reports). +.TP +.B \-8 +instructs +.I flex +to generate an 8-bit scanner. 
This is the default except for the +.B \-Cf +and +.B \-CF +compression options, for which the default is site-dependent, and +can be checked by inspecting the flag summary generated by the +.B \-v +option. +.TP +.B \-+ +specifies that you want flex to generate a C++ +scanner class. See the section on Generating C++ Scanners in +.I flexdoc(1) +for details. +.TP +.B \-C[aefFmr] +controls the degree of table compression and scanner optimization. +.IP +.B \-Ca +trade off larger tables in the generated scanner for faster performance +because the elements of the tables are better aligned for memory access +and computation. This option can double the size of the tables used by +your scanner. +.IP +.B \-Ce +directs +.I flex +to construct +.I equivalence classes, +i.e., sets of characters +which have identical lexical properties. +Equivalence classes usually give +dramatic reductions in the final table/object file sizes (typically +a factor of 2-5) and are pretty cheap performance-wise (one array +look-up per character scanned). +.IP +.B \-Cf +specifies that the +.I full +scanner tables should be generated - +.I flex +should not compress the +tables by taking advantages of similar transition functions for +different states. +.IP +.B \-CF +specifies that the alternate fast scanner representation (described in +.B flexdoc(1)) +should be used. This option cannot be used with +.B \-+. +.IP +.B \-Cm +directs +.I flex +to construct +.I meta-equivalence classes, +which are sets of equivalence classes (or characters, if equivalence +classes are not being used) that are commonly used together. Meta-equivalence +classes are often a big win when using compressed tables, but they +have a moderate performance impact (one or two "if" tests and one +array look-up per character scanned). +.IP +.B \-Cr +causes the generated scanner to +.I bypass +using stdio for input. 
In general this option results in a minor +performance gain only worthwhile if used in conjunction with +.B \-Cf +or +.B \-CF. +It can cause surprising behavior if you use stdio yourself to +read from +.I yyin +prior to calling the scanner. +.IP +A lone +.B \-C +specifies that the scanner tables should be compressed but neither +equivalence classes nor meta-equivalence classes should be used. +.IP +The options +.B \-Cf +or +.B \-CF +and +.B \-Cm +do not make sense together - there is no opportunity for meta-equivalence +classes if the table is not being compressed. Otherwise the options +may be freely mixed. +.IP +The default setting is +.B \-Cem, +which specifies that +.I flex +should generate equivalence classes +and meta-equivalence classes. This setting provides the highest +degree of table compression. You can trade off +faster-executing scanners at the cost of larger tables with +the following generally being true: +.nf + + slowest & smallest + -Cem + -Cm + -Ce + -C + -C{f,F}e + -C{f,F} + -C{f,F}a + fastest & largest + +.fi +.IP +.B \-C +options are cumulative. +.TP +.B \-Pprefix +changes the default +.I "yy" +prefix used by +.I flex +to be +.I prefix +instead. See +.I flexdoc(1) +for a description of all the global variables and file names that +this affects. +.TP +.B \-Sskeleton_file +overrides the default skeleton file from which +.I flex +constructs its scanners. You'll never need this option unless you are doing +.I flex +maintenance or development. +.SH SUMMARY OF FLEX REGULAR EXPRESSIONS +The patterns in the input are written using an extended set of regular +expressions. These are: +.nf + + x match the character 'x' + . any character except newline + [xyz] a "character class"; in this case, the pattern + matches either an 'x', a 'y', or a 'z' + [abj-oZ] a "character class" with a range in it; matches + an 'a', a 'b', any letter from 'j' through 'o', + or a 'Z' + [^A-Z] a "negated character class", i.e., any character + but those in the class. 
In this case, any + character EXCEPT an uppercase letter. + [^A-Z\\n] any character EXCEPT an uppercase letter or + a newline + r* zero or more r's, where r is any regular expression + r+ one or more r's + r? zero or one r's (that is, "an optional r") + r{2,5} anywhere from two to five r's + r{2,} two or more r's + r{4} exactly 4 r's + {name} the expansion of the "name" definition + (see above) + "[xyz]\\"foo" + the literal string: [xyz]"foo + \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', + then the ANSI-C interpretation of \\x. + Otherwise, a literal 'X' (used to escape + operators such as '*') + \\123 the character with octal value 123 + \\x2a the character with hexadecimal value 2a + (r) match an r; parentheses are used to override + precedence (see below) + + + rs the regular expression r followed by the + regular expression s; called "concatenation" + + + r|s either an r or an s + + + r/s an r but only if it is followed by an s. The + s is not part of the matched text. This type + of pattern is called "trailing context". + ^r an r, but only at the beginning of a line + r$ an r, but only at the end of a line. Equivalent + to "r/\\n". + + + <s>r an r, but only in start condition s (see + below for discussion of start conditions) + <s1,s2,s3>r + same, but in any of start conditions s1, + s2, or s3 + <*>r an r in any start condition, even an exclusive one. + + + <<EOF>> an end-of-file + <s1,s2><<EOF>> + an end-of-file when in start condition s1 or s2 + +.fi +The regular expressions listed above are grouped according to +precedence, from highest precedence at the top to lowest at the bottom. +Those grouped together have equal precedence. +.PP +Some notes on patterns: +.IP - +Negated character classes +.I match newlines +unless "\\n" (or an equivalent escape sequence) is one of the +characters explicitly present in the negated character class +(e.g., "[^A-Z\\n]"). +.IP - +A rule can have at most one instance of trailing context (the '/' operator +or the '$' operator). 
The start condition, '^', and "<<EOF>>" patterns +can only occur at the beginning of a pattern, and, as well as with '/' and '$', +cannot be grouped inside parentheses. The following are all illegal: +.nf + + foo/bar$ + foo|(bar$) + foo|^bar + <sc1>foo<sc2>bar + +.fi +.SH SUMMARY OF SPECIAL ACTIONS +In addition to arbitrary C code, the following can appear in actions: +.IP - +.B ECHO +copies yytext to the scanner's output. +.IP - +.B BEGIN +followed by the name of a start condition places the scanner in the +corresponding start condition. +.IP - +.B REJECT +directs the scanner to proceed on to the "second best" rule which matched the +input (or a prefix of the input). +.B yytext +and +.B yyleng +are set up appropriately. Note that +.B REJECT +is a particularly expensive feature in terms of scanner performance; +if it is used in +.I any +of the scanner's actions it will slow down +.I all +of the scanner's matching. Furthermore, +.B REJECT +cannot be used with the +.B \-f +or +.B \-F +options. +.IP +Note also that unlike the other special actions, +.B REJECT +is a +.I branch; +code immediately following it in the action will +.I not +be executed. +.IP - +.B yymore() +tells the scanner that the next time it matches a rule, the corresponding +token should be +.I appended +onto the current value of +.B yytext +rather than replacing it. +.IP - +.B yyless(n) +returns all but the first +.I n +characters of the current token back to the input stream, where they +will be rescanned when the scanner looks for the next match. +.B yytext +and +.B yyleng +are adjusted appropriately (e.g., +.B yyleng +will now be equal to +.I n +). +.IP - +.B unput(c) +puts the character +.I c +back onto the input stream. It will be the next character scanned. +.IP - +.B input() +reads the next character from the input stream (this routine is called +.B yyinput() +if the scanner is compiled using +.B C++). +.IP - +.B yyterminate() +can be used in lieu of a return statement in an action. 
It terminates +the scanner and returns a 0 to the scanner's caller, indicating "all done". +.IP +By default, +.B yyterminate() +is also called when an end-of-file is encountered. It is a macro and +may be redefined. +.IP - +.B YY_NEW_FILE +is an action available only in <<EOF>> rules. It means "Okay, I've +set up a new input file, continue scanning". It is no longer required; +you can just assign +.I yyin +to point to a new file in the <<EOF>> action. +.IP - +.B yy_create_buffer( file, size ) +takes a +.I FILE +pointer and an integer +.I size. +It returns a YY_BUFFER_STATE +handle to a new input buffer large enough to accommodate +.I size +characters and associated with the given file. When in doubt, use +.B YY_BUF_SIZE +for the size. +.IP - +.B yy_switch_to_buffer( new_buffer ) +switches the scanner's processing to scan for tokens from +the given buffer, which must be a YY_BUFFER_STATE. +.IP - +.B yy_delete_buffer( buffer ) +deletes the given buffer. +.SH VALUES AVAILABLE TO THE USER +.IP - +.B char *yytext +holds the text of the current token. It may be modified but not lengthened +(you cannot append characters to the end). Modifying the last character +may affect the activity of rules anchored using '^' during the next scan; +see +.B flexdoc(1) +for details. +.IP +If the special directive +.B %array +appears in the first section of the scanner description, then +.B yytext +is instead declared +.B char yytext[YYLMAX], +where +.B YYLMAX +is a macro definition that you can redefine in the first section +if you don't like the default value (generally 8KB). Using +.B %array +results in somewhat slower scanners, but the value of +.B yytext +becomes immune to calls to +.I input() +and +.I unput(), +which potentially destroy its value when +.B yytext +is a character pointer. The opposite of +.B %array +is +.B %pointer, +which is the default. +.IP +You cannot use +.B %array +when generating C++ scanner classes +(the +.B \-+ +flag). 
+.IP - +.B int yyleng +holds the length of the current token. +.IP - +.B FILE *yyin +is the file which by default +.I flex +reads from. It may be redefined but doing so only makes sense before +scanning begins or after an EOF has been encountered. Changing it in +the midst of scanning will have unexpected results since +.I flex +buffers its input; use +.B yyrestart() +instead. +Once scanning terminates because an end-of-file +has been seen, +.B +you can assign +.I yyin +at the new input file and then call the scanner again to continue scanning. +.IP - +.B void yyrestart( FILE *new_file ) +may be called to point +.I yyin +at the new input file. The switch-over to the new file is immediate +(any previously buffered-up input is lost). Note that calling +.B yyrestart() +with +.I yyin +as an argument thus throws away the current input buffer and continues +scanning the same input file. +.IP - +.B FILE *yyout +is the file to which +.B ECHO +actions are done. It can be reassigned by the user. +.IP - +.B YY_CURRENT_BUFFER +returns a +.B YY_BUFFER_STATE +handle to the current buffer. +.IP - +.B YY_START +returns an integer value corresponding to the current start +condition. You can subsequently use this value with +.B BEGIN +to return to that start condition. +.SH MACROS AND FUNCTIONS YOU CAN REDEFINE +.IP - +.B YY_DECL +controls how the scanning routine is declared. +By default, it is "int yylex()", or, if prototypes are being +used, "int yylex(void)". This definition may be changed by redefining +the "YY_DECL" macro. Note that +if you give arguments to the scanning routine using a +K&R-style/non-prototyped function declaration, you must terminate +the definition with a semi-colon (;). +.IP - +The nature of how the scanner +gets its input can be controlled by redefining the +.B YY_INPUT +macro. +YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". 
Its +action is to place up to +.I max_size +characters in the character array +.I buf +and return in the integer variable +.I result +either the +number of characters read or the constant YY_NULL (0 on Unix systems) +to indicate EOF. The default YY_INPUT reads from the +global file-pointer "yyin". +A sample redefinition of YY_INPUT (in the definitions +section of the input file): +.nf + + %{ + #undef YY_INPUT + #define YY_INPUT(buf,result,max_size) \\ + { \\ + int c = getchar(); \\ + result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\ + } + %} + +.fi +.IP - +When the scanner receives an end-of-file indication from YY_INPUT, +it then checks the +.B yywrap() +function. If +.B yywrap() +returns false (zero), then it is assumed that the +function has gone ahead and set up +.I yyin +to point to another input file, and scanning continues. If it returns +true (non-zero), then the scanner terminates, returning 0 to its +caller. +.IP +The default +.B yywrap() +always returns 1. +.IP - +YY_USER_ACTION +can be redefined to provide an action +which is always executed prior to the matched rule's action. +.IP - +The macro +.B YY_USER_INIT +may be redefined to provide an action which is always executed before +the first scan. +.IP - +In the generated scanner, the actions are all gathered in one large +switch statement and separated using +.B YY_BREAK, +which may be redefined. By default, it is simply a "break", to separate +each rule's action from the following rule's. +.SH FILES +.TP +.B \-lfl +library with which to link scanners to obtain the default versions +of +.I yywrap() +and/or +.I main(). +.TP +.I lex.yy.c +generated scanner (called +.I lexyy.c +on some systems). +.TP +.I lex.yy.cc +generated C++ scanner class, when using +.B -+. +.TP +.I <FlexLexer.h> +header file defining the C++ scanner base class, +.B FlexLexer, +and its derived class, +.B yyFlexLexer. +.TP +.I flex.skl +skeleton scanner. This file is only used when building flex, not when +flex executes. 
+.TP +.I lex.backup +backing-up information for +.B \-b +flag (called +.I lex.bck +on some systems). +.SH "SEE ALSO" +.PP +flexdoc(1), lex(1), yacc(1), sed(1), awk(1). +.PP +M. E. Lesk and E. Schmidt, +.I LEX \- Lexical Analyzer Generator +.SH DIAGNOSTICS +.PP +.I reject_used_but_not_detected undefined +or +.PP +.I yymore_used_but_not_detected undefined - +These errors can occur at compile time. They indicate that the +scanner uses +.B REJECT +or +.B yymore() +but that +.I flex +failed to notice the fact, meaning that +.I flex +scanned the first two sections looking for occurrences of these actions +and failed to find any, but somehow you snuck some in (via a #include +file, for example). Make an explicit reference to the action in your +.I flex +input file. (Note that previously +.I flex +supported a +.B %used/%unused +mechanism for dealing with this problem; this feature is still supported +but now deprecated, and will go away soon unless the author hears from +people who can argue compellingly that they need it.) +.PP +.I flex scanner jammed - +a scanner compiled with +.B \-s +has encountered an input string which wasn't matched by +any of its rules. +.PP +.I warning, rule cannot be matched +indicates that the given rule +cannot be matched because it follows other rules that will +always match the same text as it. See +.I flexdoc(1) +for an example. +.PP +.I warning, +.B \-s +.I +option given but default rule can be matched +means that it is possible (perhaps only in a particular start condition) +that the default rule (match any single character) is the only one +that will match a particular input. Since +.PP +.I scanner input buffer overflowed - +a scanner rule matched more text than the available dynamic memory. +.PP +.I token too large, exceeds YYLMAX - +your scanner uses +.B %array +and one of its rules matched a string longer than the +.B YYLMAX +constant (8K bytes by default). 
You can increase the value by +#define'ing +.B YYLMAX +in the definitions section of your +.I flex +input. +.PP +.I scanner requires \-8 flag to +.I use the character 'x' - +Your scanner specification includes recognizing the 8-bit character +.I 'x' +and you did not specify the \-8 flag, and your scanner defaulted to 7-bit +because you used the +.B \-Cf +or +.B \-CF +table compression options. +.PP +.I flex scanner push-back overflow - +you used +.B unput() +to push back so much text that the scanner's buffer could not hold +both the pushed-back text and the current token in +.B yytext. +Ideally the scanner should dynamically resize the buffer in this case, but at +present it does not. +.PP +.I +input buffer overflow, can't enlarge buffer because scanner uses REJECT - +the scanner was working on matching an extremely large token and needed +to expand the input buffer. This doesn't work with scanners that use +.B +REJECT. +.PP +.I +fatal flex scanner internal error--end of buffer missed - +This can occur in an scanner which is reentered after a long-jump +has jumped out (or over) the scanner's activation frame. Before +reentering the scanner, use: +.nf + + yyrestart( yyin ); + +.fi +or use C++ scanner classes (the +.B \-+ +option), which are fully reentrant. +.SH AUTHOR +Vern Paxson, with the help of many ideas and much inspiration from +Van Jacobson. Original version by Jef Poskanzer. +.PP +See flexdoc(1) for additional credits and the address to send comments to. +.SH DEFICIENCIES / BUGS +.PP +Some trailing context +patterns cannot be properly matched and generate +warning messages ("dangerous trailing context"). These are +patterns where the ending of the +first part of the rule matches the beginning of the second +part, such as "zx*/xy*", where the 'x*' matches the 'x' at +the beginning of the trailing context. (Note that the POSIX draft +states that the text matched by such patterns is undefined.) 
+.PP +For some trailing context rules, parts which are actually fixed-length are +not recognized as such, leading to the abovementioned performance loss. +In particular, parts using '|' or {n} (such as "foo{3}") are always +considered variable-length. +.PP +Combining trailing context with the special '|' action can result in +.I fixed +trailing context being turned into the more expensive +.I variable +trailing context. For example, in the following: +.nf + + %% + abc | + xyz/def + +.fi +.PP +Use of +.B unput() +or +.B input() +invalidates yytext and yyleng, unless the +.B %array +directive +or the +.B \-l +option has been used. +.PP +Use of unput() to push back more text than was matched can +result in the pushed-back text matching a beginning-of-line ('^') +rule even though it didn't come at the beginning of the line +(though this is rare!). +.PP +Pattern-matching of NUL's is substantially slower than matching other +characters. +.PP +Dynamic resizing of the input buffer is slow, as it entails rescanning +all the text matched so far by the current (generally huge) token. +.PP +.I flex +does not generate correct #line directives for code internal +to the scanner; thus, bugs in +.I flex.skl +yield bogus line numbers. +.PP +Due to both buffering of input and read-ahead, you cannot intermix +calls to routines, such as, for example, +.B getchar(), +with +.I flex +rules and expect it to work. Call +.B input() +instead. +.PP +The total table entries listed by the +.B \-v +flag excludes the number of table entries needed to determine +what rule has been matched. The number of entries is equal +to the number of DFA states if the scanner does not use +.B REJECT, +and somewhat greater than the number of states if it does. +.PP +.B REJECT +cannot be used with the +.B \-f +or +.B \-F +options. +.PP +The +.I flex +internal algorithms need documentation. 
diff --git a/usr.bin/lex/lexdoc.1 b/usr.bin/lex/lexdoc.1 new file mode 100644 index 000000000000..b80d5699abca --- /dev/null +++ b/usr.bin/lex/lexdoc.1 @@ -0,0 +1,3045 @@ +.TH FLEXDOC 1 "November 1993" "Version 2.4" +.SH NAME +flexdoc \- documentation for flex, fast lexical analyzer generator +.SH SYNOPSIS +.B flex +.B [\-bcdfhilnpstvwBFILTV78+ \-C[aefFmr] \-Pprefix \-Sskeleton] +.I [filename ...] +.SH DESCRIPTION +.I flex +is a tool for generating +.I scanners: +programs which recognized lexical patterns in text. +.I flex +reads +the given input files, or its standard input if no file names are given, +for a description of a scanner to generate. The description is in +the form of pairs +of regular expressions and C code, called +.I rules. flex +generates as output a C source file, +.B lex.yy.c, +which defines a routine +.B yylex(). +This file is compiled and linked with the +.B \-lfl +library to produce an executable. When the executable is run, +it analyzes its input for occurrences +of the regular expressions. Whenever it finds one, it executes +the corresponding C code. +.SH SOME SIMPLE EXAMPLES +.PP +First some simple examples to get the flavor of how one uses +.I flex. +The following +.I flex +input specifies a scanner which whenever it encounters the string +"username" will replace it with the user's login name: +.nf + + %% + username printf( "%s", getlogin() ); + +.fi +By default, any text not matched by a +.I flex +scanner +is copied to the output, so the net effect of this scanner is +to copy its input file to its output with each occurrence +of "username" expanded. +In this input, there is just one rule. "username" is the +.I pattern +and the "printf" is the +.I action. +The "%%" marks the beginning of the rules. +.PP +Here's another simple example: +.nf + + int num_lines = 0, num_chars = 0; + + %% + \\n ++num_lines; ++num_chars; + . 
++num_chars; + + %% + main() + { + yylex(); + printf( "# of lines = %d, # of chars = %d\\n", + num_lines, num_chars ); + } + +.fi +This scanner counts the number of characters and the number +of lines in its input (it produces no output other than the +final report on the counts). The first line +declares two globals, "num_lines" and "num_chars", which are accessible +both inside +.B yylex() +and in the +.B main() +routine declared after the second "%%". There are two rules, one +which matches a newline ("\\n") and increments both the line count and +the character count, and one which matches any character other than +a newline (indicated by the "." regular expression). +.PP +A somewhat more complicated example: +.nf + + /* scanner for a toy Pascal-like language */ + + %{ + /* need this for the call to atof() below */ + #include <math.h> + %} + + DIGIT [0-9] + ID [a-z][a-z0-9]* + + %% + + {DIGIT}+ { + printf( "An integer: %s (%d)\\n", yytext, + atoi( yytext ) ); + } + + {DIGIT}+"."{DIGIT}* { + printf( "A float: %s (%g)\\n", yytext, + atof( yytext ) ); + } + + if|then|begin|end|procedure|function { + printf( "A keyword: %s\\n", yytext ); + } + + {ID} printf( "An identifier: %s\\n", yytext ); + + "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); + + "{"[^}\\n]*"}" /* eat up one-line comments */ + + [ \\t\\n]+ /* eat up whitespace */ + + . printf( "Unrecognized character: %s\\n", yytext ); + + %% + + main( argc, argv ) + int argc; + char **argv; + { + ++argv, --argc; /* skip over program name */ + if ( argc > 0 ) + yyin = fopen( argv[0], "r" ); + else + yyin = stdin; + + yylex(); + } + +.fi +This is the beginnings of a simple scanner for a language like +Pascal. It identifies different types of +.I tokens +and reports on what it has seen. +.PP +The details of this example will be explained in the following +sections. 
+.SH FORMAT OF THE INPUT FILE +The +.I flex +input file consists of three sections, separated by a line with just +.B %% +in it: +.nf + + definitions + %% + rules + %% + user code + +.fi +The +.I definitions +section contains declarations of simple +.I name +definitions to simplify the scanner specification, and declarations of +.I start conditions, +which are explained in a later section. +.PP +Name definitions have the form: +.nf + + name definition + +.fi +The "name" is a word beginning with a letter or an underscore ('_') +followed by zero or more letters, digits, '_', or '-' (dash). +The definition is taken to begin at the first non-white-space character +following the name and continuing to the end of the line. +The definition can subsequently be referred to using "{name}", which +will expand to "(definition)". For example, +.nf + + DIGIT [0-9] + ID [a-z][a-z0-9]* + +.fi +defines "DIGIT" to be a regular expression which matches a +single digit, and +"ID" to be a regular expression which matches a letter +followed by zero-or-more letters-or-digits. +A subsequent reference to +.nf + + {DIGIT}+"."{DIGIT}* + +.fi +is identical to +.nf + + ([0-9])+"."([0-9])* + +.fi +and matches one-or-more digits followed by a '.' followed +by zero-or-more digits. +.PP +The +.I rules +section of the +.I flex +input contains a series of rules of the form: +.nf + + pattern action + +.fi +where the pattern must be unindented and the action must begin +on the same line. +.PP +See below for a further description of patterns and actions. +.PP +Finally, the user code section is simply copied to +.B lex.yy.c +verbatim. +It is used for companion routines which call or are called +by the scanner. The presence of this section is optional; +if it is missing, the second +.B %% +in the input file may be skipped, too. +.PP +In the definitions and rules sections, any +.I indented +text or text enclosed in +.B %{ +and +.B %} +is copied verbatim to the output (with the %{}'s removed). 
+The %{}'s must appear unindented on lines by themselves. +.PP +In the rules section, +any indented or %{} text appearing before the +first rule may be used to declare variables +which are local to the scanning routine and (after the declarations) +code which is to be executed whenever the scanning routine is entered. +Other indented or %{} text in the rule section is still copied to the output, +but its meaning is not well-defined and it may well cause compile-time +errors (this feature is present for +.I POSIX +compliance; see below for other such features). +.PP +In the definitions section (but not in the rules section), +an unindented comment (i.e., a line +beginning with "/*") is also copied verbatim to the output up +to the next "*/". +.SH PATTERNS +The patterns in the input are written using an extended set of regular +expressions. These are: +.nf + + x match the character 'x' + . any character except newline + [xyz] a "character class"; in this case, the pattern + matches either an 'x', a 'y', or a 'z' + [abj-oZ] a "character class" with a range in it; matches + an 'a', a 'b', any letter from 'j' through 'o', + or a 'Z' + [^A-Z] a "negated character class", i.e., any character + but those in the class. In this case, any + character EXCEPT an uppercase letter. + [^A-Z\\n] any character EXCEPT an uppercase letter or + a newline + r* zero or more r's, where r is any regular expression + r+ one or more r's + r? zero or one r's (that is, "an optional r") + r{2,5} anywhere from two to five r's + r{2,} two or more r's + r{4} exactly 4 r's + {name} the expansion of the "name" definition + (see above) + "[xyz]\\"foo" + the literal string: [xyz]"foo + \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', + then the ANSI-C interpretation of \\x. 
+ Otherwise, a literal 'X' (used to escape + operators such as '*') + \\123 the character with octal value 123 + \\x2a the character with hexadecimal value 2a + (r) match an r; parentheses are used to override + precedence (see below) + + + rs the regular expression r followed by the + regular expression s; called "concatenation" + + + r|s either an r or an s + + + r/s an r but only if it is followed by an s. The + s is not part of the matched text. This type + of pattern is called "trailing context". + ^r an r, but only at the beginning of a line + r$ an r, but only at the end of a line. Equivalent + to "r/\\n". + + + <s>r an r, but only in start condition s (see + below for discussion of start conditions) + <s1,s2,s3>r + same, but in any of start conditions s1, + s2, or s3 + <*>r an r in any start condition, even an exclusive one. + + + <<EOF>> an end-of-file + <s1,s2><<EOF>> + an end-of-file when in start condition s1 or s2 + +.fi +Note that inside of a character class, all regular expression operators +lose their special meaning except escape ('\\') and the character class +operators, '-', ']', and, at the beginning of the class, '^'. +.PP +The regular expressions listed above are grouped according to +precedence, from highest precedence at the top to lowest at the bottom. +Those grouped together have equal precedence. For example, +.nf + + foo|bar* + +.fi +is the same as +.nf + + (foo)|(ba(r*)) + +.fi +since the '*' operator has higher precedence than concatenation, +and concatenation higher than alternation ('|'). This pattern +therefore matches +.I either +the string "foo" +.I or +the string "ba" followed by zero-or-more r's.
+To match "foo" or zero-or-more "bar"'s, use: +.nf + + foo|(bar)* + +.fi +and to match zero-or-more "foo"'s-or-"bar"'s: +.nf + + (foo|bar)* + +.fi +.PP +Some notes on patterns: +.IP - +A negated character class such as the example "[^A-Z]" +above +.I will match a newline +unless "\\n" (or an equivalent escape sequence) is one of the +characters explicitly present in the negated character class +(e.g., "[^A-Z\\n]"). This is unlike how many other regular +expression tools treat negated character classes, but unfortunately +the inconsistency is historically entrenched. +Matching newlines means that a pattern like [^"]* can match the entire +input unless there's another quote in the input. +.IP - +A rule can have at most one instance of trailing context (the '/' operator +or the '$' operator). The start condition, '^', and "<<EOF>>" patterns +can only occur at the beginning of a pattern, and, as well as with '/' and '$', +cannot be grouped inside parentheses. A '^' which does not occur at +the beginning of a rule or a '$' which does not occur at the end of +a rule loses its special properties and is treated as a normal character. +.IP +The following are illegal: +.nf + + foo/bar$ + foo<sc1>bar + +.fi +Note that the first of these can be written "foo/bar\\n". +.IP +The following will result in '$' or '^' being treated as a normal character: +.nf + + foo|(bar$) + foo|^bar + +.fi +If what's wanted is a "foo" or a bar-followed-by-a-newline, the following +could be used (the special '|' action is explained below): +.nf + + foo | + bar$ /* action goes here */ + +.fi +A similar trick will work for matching a foo or a +bar-at-the-beginning-of-a-line. +.SH HOW THE INPUT IS MATCHED +When the generated scanner is run, it analyzes its input looking +for strings which match any of its patterns.
If it finds more than +one match, it takes the one matching the most text (for trailing +context rules, this includes the length of the trailing part, even +though it will then be returned to the input). If it finds two +or more matches of the same length, the +rule listed first in the +.I flex +input file is chosen. +.PP +Once the match is determined, the text corresponding to the match +(called the +.I token) +is made available in the global character pointer +.B yytext, +and its length in the global integer +.B yyleng. +The +.I action +corresponding to the matched pattern is then executed (a more +detailed description of actions follows), and then the remaining +input is scanned for another match. +.PP +If no match is found, then the +.I default rule +is executed: the next character in the input is considered matched and +copied to the standard output. Thus, the simplest legal +.I flex +input is: +.nf + + %% + +.fi +which generates a scanner that simply copies its input (one character +at a time) to its output. +.PP +Note that +.B yytext +can be defined in two different ways: either as a character +.I pointer +or as a character +.I array. +You can control which definition +.I flex +uses by including one of the special directives +.B %pointer +or +.B %array +in the first (definitions) section of your flex input. The default is +.B %pointer, +unless you use the +.B -l +lex compatibility option, in which case +.B yytext +will be an array. +The advantage of using +.B %pointer +is substantially faster scanning and no buffer overflow when matching +very large tokens (unless you run out of dynamic memory). The disadvantage +is that you are restricted in how your actions can modify +.B yytext +(see the next section), and calls to the +.B input() +and +.B unput() +functions destroy the present contents of +.B yytext, +which can be a considerable porting headache when moving between different +.I lex +versions. 
+.PP +The advantage of +.B %array +is that you can then modify +.B yytext +to your heart's content, and calls to +.B input() +and +.B unput() +do not destroy +.B yytext +(see below). Furthermore, existing +.I lex +programs sometimes access +.B yytext +externally using declarations of the form: +.nf + extern char yytext[]; +.fi +This definition is erroneous when used with +.B %pointer, +but correct for +.B %array. +.PP +.B %array +defines +.B yytext +to be an array of +.B YYLMAX +characters, which defaults to a fairly large value. You can change +the size by simply #define'ing +.B YYLMAX +to a different value in the first section of your +.I flex +input. As mentioned above, with +.B %pointer +yytext grows dynamically to accommodate large tokens. While this means your +.B %pointer +scanner can accommodate very large tokens (such as matching entire blocks +of comments), bear in mind that each time the scanner must resize +.B yytext +it also must rescan the entire token from the beginning, so matching such +tokens can prove slow. +.B yytext +presently does +.I not +dynamically grow if a call to +.B unput() +results in too much text being pushed back; instead, a run-time error results. +.PP +Also note that you cannot use +.B %array +with C++ scanner classes +(the +.B \-+ +option; see below). +.SH ACTIONS +Each pattern in a rule has a corresponding action, which can be any +arbitrary C statement. The pattern ends at the first non-escaped +whitespace character; the remainder of the line is its action. If the +action is empty, then when the pattern is matched the input token +is simply discarded. For example, here is the specification for a program +which deletes all occurrences of "zap me" from its input: +.nf + + %% + "zap me" + +.fi +(It will copy all other characters in the input to the output since +they will be matched by the default rule.)
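To make the interplay of the empty action and the default rule concrete, the "zap me" scanner behaves roughly like the plain C below. This is a sketch under stated assumptions, not code that flex generates: zap_copy() is a hypothetical name, and the input and output are strings instead of yyin and yyout.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: copies `in` to `out`, dropping every occurrence
 * of "zap me". `out` must have room for strlen(in) + 1 bytes.
 */
void zap_copy(const char *in, char *out)
{
    static const char pattern[] = "zap me";
    const size_t len = sizeof pattern - 1;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (strncmp(in, pattern, len) == 0)
            in += len;          /* empty action: the matched token is discarded */
        else
            *out++ = *in++;     /* default rule: echo one unmatched character */
    }
    *out = '\0';
}
```

Each pass through the loop corresponds to one match attempt at the current input position: either the pattern matches and its (empty) action runs, or the default rule consumes a single character.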
+.PP +Here is a program which compresses multiple blanks and tabs down to +a single blank, and throws away whitespace found at the end of a line: +.nf + + %% + [ \\t]+ putchar( ' ' ); + [ \\t]+$ /* ignore this token */ + +.fi +.PP +If the action contains a '{', then the action spans till the balancing '}' +is found, and the action may cross multiple lines. +.I flex +knows about C strings and comments and won't be fooled by braces found +within them, but also allows actions to begin with +.B %{ +and will consider the action to be all the text up to the next +.B %} +(regardless of ordinary braces inside the action). +.PP +An action consisting solely of a vertical bar ('|') means "same as +the action for the next rule." See below for an illustration. +.PP +Actions can include arbitrary C code, including +.B return +statements to return a value to whatever routine called +.B yylex(). +Each time +.B yylex() +is called it continues processing tokens from where it last left +off until it either reaches +the end of the file or executes a return. +.PP +Actions are free to modify +.B yytext +except for lengthening it (adding +characters to its end--these will overwrite later characters in the +input stream). Modifying the final character of yytext may alter +whether when scanning resumes rules anchored with '^' are active. +Specifically, changing the final character of yytext to a newline will +activate such rules on the next scan, and changing it to anything else +will deactivate the rules. Users should not rely on this behavior being +present in future releases. Finally, note that none of this paragraph +applies when using +.B %array +(see above). +.PP +Actions are free to modify +.B yyleng +except they should not do so if the action also includes use of +.B yymore() +(see below). +.PP +There are a number of special directives which can be included within +an action: +.IP - +.B ECHO +copies yytext to the scanner's output. 
+.IP - +.B BEGIN +followed by the name of a start condition places the scanner in the +corresponding start condition (see below). +.IP - +.B REJECT +directs the scanner to proceed on to the "second best" rule which matched the +input (or a prefix of the input). The rule is chosen as described +above in "How the Input is Matched", and +.B yytext +and +.B yyleng +are set up appropriately. +It may either be one which matched as much text +as the originally chosen rule but came later in the +.I flex +input file, or one which matched less text. +For example, the following will both count the +words in the input and call the routine special() whenever "frob" is seen: +.nf + + int word_count = 0; + %% + + frob special(); REJECT; + [^ \\t\\n]+ ++word_count; + +.fi +Without the +.B REJECT, +any "frob"'s in the input would not be counted as words, since the +scanner normally executes only one action per token. +Multiple +.B REJECT's +are allowed, each one finding the next best choice to the currently +active rule. For example, when the following scanner scans the token +"abcd", it will write "abcdabcaba" to the output: +.nf + + %% + a | + ab | + abc | + abcd ECHO; REJECT; + .|\\n /* eat up any unmatched character */ + +.fi +(The first three rules share the fourth's action since they use +the special '|' action.) +.B REJECT +is a particularly expensive feature in terms of scanner performance; +if it is used in +.I any +of the scanner's actions it will slow down +.I all +of the scanner's matching. Furthermore, +.B REJECT +cannot be used with the +.I -Cf +or +.I -CF +options (see below). +.IP +Note also that unlike the other special actions, +.B REJECT +is a +.I branch; +code immediately following it in the action will +.I not +be executed. +.IP - +.B yymore() +tells the scanner that the next time it matches a rule, the corresponding +token should be +.I appended +onto the current value of +.B yytext +rather than replacing it.
For example, given the input "mega-kludge" +the following will write "mega-mega-kludge" to the output: +.nf + + %% + mega- ECHO; yymore(); + kludge ECHO; + +.fi +First "mega-" is matched and echoed to the output. Then "kludge" +is matched, but the previous "mega-" is still hanging around at the +beginning of +.B yytext +so the +.B ECHO +for the "kludge" rule will actually write "mega-kludge". +The presence of +.B yymore() +in the scanner's action entails a minor performance penalty in the +scanner's matching speed. +.IP - +.B yyless(n) +returns all but the first +.I n +characters of the current token back to the input stream, where they +will be rescanned when the scanner looks for the next match. +.B yytext +and +.B yyleng +are adjusted appropriately (e.g., +.B yyleng +will now be equal to +.I n +). For example, on the input "foobar" the following will write out +"foobarbar": +.nf + + %% + foobar ECHO; yyless(3); + [a-z]+ ECHO; + +.fi +An argument of 0 to +.B yyless +will cause the entire current input string to be scanned again. Unless you've +changed how the scanner will subsequently process its input (using +.B BEGIN, +for example), this will result in an endless loop. +.PP +Note that +.B yyless +is a macro and can only be used in the flex input file, not from +other source files. +.IP - +.B unput(c) +puts the character +.I c +back onto the input stream. It will be the next character scanned. +The following action will take the current token and cause it +to be rescanned enclosed in parentheses. +.nf + + { + int i; + unput( ')' ); + for ( i = yyleng - 1; i >= 0; --i ) + unput( yytext[i] ); + unput( '(' ); + } + +.fi +Note that since each +.B unput() +puts the given character back at the +.I beginning +of the input stream, pushing back strings must be done back-to-front. +Also note that you cannot put back +.B EOF +to attempt to mark the input stream with an end-of-file. +.IP - +.B input() +reads the next character from the input stream. 
For example, +the following is one way to eat up C comments: +.nf + + %% + "/*" { + register int c; + + for ( ; ; ) + { + while ( (c = input()) != '*' && + c != EOF ) + ; /* eat up text of comment */ + + if ( c == '*' ) + { + while ( (c = input()) == '*' ) + ; + if ( c == '/' ) + break; /* found the end */ + } + + if ( c == EOF ) + { + error( "EOF in comment" ); + break; + } + } + } + +.fi +(Note that if the scanner is compiled using +.B C++, +then +.B input() +is instead referred to as +.B yyinput(), +in order to avoid a name clash with the +.B C++ +stream by the name of +.I input.) +.IP - +.B yyterminate() +can be used in lieu of a return statement in an action. It terminates +the scanner and returns a 0 to the scanner's caller, indicating "all done". +By default, +.B yyterminate() +is also called when an end-of-file is encountered. It is a macro and +may be redefined. +.SH THE GENERATED SCANNER +The output of +.I flex +is the file +.B lex.yy.c, +which contains the scanning routine +.B yylex(), +a number of tables used by it for matching tokens, and a number +of auxiliary routines and macros. By default, +.B yylex() +is declared as follows: +.nf + + int yylex() + { + ... various definitions and the actions in here ... + } + +.fi +(If your environment supports function prototypes, then it will +be "int yylex( void )".) This definition may be changed by defining +the "YY_DECL" macro. For example, you could use: +.nf + + #define YY_DECL float lexscan( a, b ) float a, b; + +.fi +to give the scanning routine the name +.I lexscan, +returning a float, and taking two floats as arguments. Note that +if you give arguments to the scanning routine using a +K&R-style/non-prototyped function declaration, you must terminate +the definition with a semi-colon (;). +.PP +Whenever +.B yylex() +is called, it scans tokens from the global input file +.I yyin +(which defaults to stdin). 
It continues until it either reaches +an end-of-file (at which point it returns the value 0) or +one of its actions executes a +.I return +statement. +.PP +If the scanner reaches an end-of-file, subsequent calls are undefined +unless either +.I yyin +is pointed at a new input file (in which case scanning continues from +that file), or +.B yyrestart() +is called. +.B yyrestart() +takes one argument, a +.B FILE * +pointer, and initializes +.I yyin +for scanning from that file. Essentially there is no difference between +just assigning +.I yyin +to a new input file or using +.B yyrestart() +to do so; the latter is available for compatibility with previous versions +of +.I flex, +and because it can be used to switch input files in the middle of scanning. +It can also be used to throw away the current input buffer, by calling +it with an argument of +.I yyin. +.PP +If +.B yylex() +stops scanning due to executing a +.I return +statement in one of the actions, the scanner may then be called again and it +will resume scanning where it left off. +.PP +By default (and for purposes of efficiency), the scanner uses +block-reads rather than simple +.I getc() +calls to read characters from +.I yyin. +The nature of how it gets its input can be controlled by defining the +.B YY_INPUT +macro. +YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its +action is to place up to +.I max_size +characters in the character array +.I buf +and return in the integer variable +.I result +either the +number of characters read or the constant YY_NULL (0 on Unix systems) +to indicate EOF. The default YY_INPUT reads from the +global file-pointer "yyin". +.PP +A sample definition of YY_INPUT (in the definitions +section of the input file): +.nf + + %{ + #define YY_INPUT(buf,result,max_size) \\ + { \\ + int c = getchar(); \\ + result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\ + } + %} + +.fi +This definition will change the input processing to occur +one character at a time. 
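The YY_INPUT contract described above (store up to max_size characters in buf, report the count, or report YY_NULL at end of input) can also be written as an ordinary function, which is sometimes easier to test than a macro. The sketch below reads from an in-memory string; struct string_source, fill_buffer() and SKETCH_NULL are hypothetical names invented for this illustration, and SKETCH_NULL merely stands in for YY_NULL with the value 0 quoted above.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NULL 0   /* stand-in for YY_NULL; 0 per the text above */

/* Hypothetical in-memory input source; not part of flex. */
struct string_source {
    const char *data;   /* NUL-terminated text to hand to the scanner */
    size_t pos;         /* offset of the next unread character */
};

/*
 * Stores up to max_size characters in buf and returns how many were
 * stored, or SKETCH_NULL once the source is exhausted.
 */
int fill_buffer(struct string_source *src, char *buf, int max_size)
{
    size_t left, n;

    assert(src != NULL && buf != NULL && max_size > 0);
    left = strlen(src->data + src->pos);
    if (left == 0)
        return SKETCH_NULL;             /* end of input */
    n = left < (size_t)max_size ? left : (size_t)max_size;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;                      /* remember where to resume */
    return (int)n;
}
```

A YY_INPUT definition could then assign the return value of such a function to result; that wiring is left out here because it depends on how the scanner is built.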
+.PP +You can also add in things like keeping track of the +input line number this way; but don't expect your scanner to +go very fast. +.PP +When the scanner receives an end-of-file indication from YY_INPUT, +it then checks the +.B yywrap() +function. If +.B yywrap() +returns false (zero), then it is assumed that the +function has gone ahead and set up +.I yyin +to point to another input file, and scanning continues. If it returns +true (non-zero), then the scanner terminates, returning 0 to its +caller. +.PP +The default +.B yywrap() +always returns 1. +.PP +The scanner writes its +.B ECHO +output to the +.I yyout +global (default, stdout), which may be redefined by the user simply +by assigning it to some other +.B FILE +pointer. +.SH START CONDITIONS +.I flex +provides a mechanism for conditionally activating rules. Any rule +whose pattern is prefixed with "<sc>" will only be active when +the scanner is in the start condition named "sc". For example, +.nf + + <STRING>[^"]* { /* eat up the string body ... */ + ... + } + +.fi +will be active only when the scanner is in the "STRING" start +condition, and +.nf + + <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ + ... + } + +.fi +will be active only when the current start condition is +either "INITIAL", "STRING", or "QUOTE". +.PP +Start conditions +are declared in the definitions (first) section of the input +using unindented lines beginning with either +.B %s +or +.B %x +followed by a list of names. +The former declares +.I inclusive +start conditions, the latter +.I exclusive +start conditions. A start condition is activated using the +.B BEGIN +action. Until the next +.B BEGIN +action is executed, rules with the given start +condition will be active and +rules with other start conditions will be inactive. +If the start condition is +.I inclusive, +then rules with no start conditions at all will also be active. +If it is +.I exclusive, +then +.I only +rules qualified with the start condition will be active.
+A set of rules contingent on the same exclusive start condition +describes a scanner which is independent of any of the other rules in the +.I flex +input. Because of this, +exclusive start conditions make it easy to specify "mini-scanners" +which scan portions of the input that are syntactically different +from the rest (e.g., comments). +.PP +If the distinction between inclusive and exclusive start conditions +is still a little vague, here's a simple example illustrating the +connection between the two. The set of rules: +.nf + + %s example + %% + <example>foo /* do something */ + +.fi +is equivalent to +.nf + + %x example + %% + <INITIAL,example>foo /* do something */ + +.fi +.PP +Also note that the special start-condition specifier +.B <*> +matches every start condition. Thus, the above example could also +have been written: +.nf + + %x example + %% + <*>foo /* do something */ + +.fi +.PP +The default rule (to +.B ECHO +any unmatched character) remains active in start conditions. +.PP +.B BEGIN(0) +returns to the original state where only the rules with +no start conditions are active. This state can also be +referred to as the start-condition "INITIAL", so +.B BEGIN(INITIAL) +is equivalent to +.B BEGIN(0). +(The parentheses around the start condition name are not required but +are considered good style.) +.PP +.B BEGIN +actions can also be given as indented code at the beginning +of the rules section. For example, the following will cause +the scanner to enter the "SPECIAL" start condition whenever +.I yylex() +is called and the global variable +.I enter_special +is true: +.nf + + int enter_special; + + %x SPECIAL + %% + if ( enter_special ) + BEGIN(SPECIAL); + + <SPECIAL>blahblahblah + ...more rules follow... + +.fi +.PP +To illustrate the uses of start conditions, +here is a scanner which provides two different interpretations +of a string like "123.456". By default it will treat it as +three tokens, the integer "123", a dot ('.'), and the integer "456".
+But if the string is preceded earlier in the line by the string +"expect-floats" +it will treat it as a single token, the floating-point number +123.456: +.nf + + %{ + #include <stdlib.h> + %} + %s expect + + %% + expect-floats BEGIN(expect); + + <expect>[0-9]+"."[0-9]+ { + printf( "found a float, = %f\\n", + atof( yytext ) ); + } + <expect>\\n { + /* that's the end of the line, so + * we need another "expect-floats" + * before we'll recognize any more + * numbers + */ + BEGIN(INITIAL); + } + + [0-9]+ { + printf( "found an integer, = %d\\n", + atoi( yytext ) ); + } + + "." printf( "found a dot\\n" ); + +.fi +Here is a scanner which recognizes (and discards) C comments while +maintaining a count of the current input line. +.nf + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\\n]* /* eat anything that's not a '*' */ + <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ + <comment>\\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); + +.fi +This scanner goes to a bit of trouble to match as much +text as possible with each rule. In general, when attempting to write +a high-speed scanner try to match as much as possible in each rule, as +it's a big win. +.PP +Note that start-condition names are really integer values and +can be stored as such. Thus, the above could be extended in the +following fashion: +.nf + + %x comment foo + %% + int line_num = 1; + int comment_caller; + + "/*" { + comment_caller = INITIAL; + BEGIN(comment); + } + + ... + + <foo>"/*" { + comment_caller = foo; + BEGIN(comment); + } + + <comment>[^*\\n]* /* eat anything that's not a '*' */ + <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ + <comment>\\n ++line_num; + <comment>"*"+"/" BEGIN(comment_caller); + +.fi +Furthermore, you can access the current start condition using +the integer-valued +.B YY_START +macro.
 For example, the above assignments to +.I comment_caller +could instead be written +.nf + + comment_caller = YY_START; +.fi +.PP +Note that start conditions do not have their own name-space; %s's and %x's +declare names in the same fashion as #define's. +.PP +Finally, here's an example of how to match C-style quoted strings using +exclusive start conditions, including expanded escape sequences (but +not including checking for a string that's too long): +.nf + + %x str + + %% + char string_buf[MAX_STR_CONST]; + char *string_buf_ptr; + + + \\" string_buf_ptr = string_buf; BEGIN(str); + + <str>\\" { /* saw closing quote - all done */ + BEGIN(INITIAL); + *string_buf_ptr = '\\0'; + /* return string constant token type and + * value to parser + */ + } + + <str>\\n { + /* error - unterminated string constant */ + /* generate error message */ + } + + <str>\\\\[0-7]{1,3} { + /* octal escape sequence */ + int result; + + (void) sscanf( yytext + 1, "%o", &result ); + + if ( result > 0xff ) + /* error, constant is out-of-bounds */ + + *string_buf_ptr++ = result; + } + + <str>\\\\[0-9]+ { + /* generate error - bad escape sequence; something + * like '\\48' or '\\0777777' + */ + } + + <str>\\\\n *string_buf_ptr++ = '\\n'; + <str>\\\\t *string_buf_ptr++ = '\\t'; + <str>\\\\r *string_buf_ptr++ = '\\r'; + <str>\\\\b *string_buf_ptr++ = '\\b'; + <str>\\\\f *string_buf_ptr++ = '\\f'; + + <str>\\\\(.|\\n) *string_buf_ptr++ = yytext[1]; + + <str>[^\\\\\\n\\"]+ { + char *yytext_ptr = yytext; + + while ( *yytext_ptr ) + *string_buf_ptr++ = *yytext_ptr++; + } + +.fi +.SH MULTIPLE INPUT BUFFERS +Some scanners (such as those which support "include" files) +require reading from several input streams. As +.I flex +scanners do a large amount of buffering, one cannot control +where the next input will be read from by simply writing a +.B YY_INPUT +which is sensitive to the scanning context.
+.B YY_INPUT +is only called when the scanner reaches the end of its buffer, which +may be a long time after scanning a statement such as an "include" +which requires switching the input source. +.PP +To negotiate these sorts of problems, +.I flex +provides a mechanism for creating and switching between multiple +input buffers. An input buffer is created by using: +.nf + + YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) + +.fi +which takes a +.I FILE +pointer and a size and creates a buffer associated with the given +file and large enough to hold +.I size +characters (when in doubt, use +.B YY_BUF_SIZE +for the size). It returns a +.B YY_BUFFER_STATE +handle, which may then be passed to other routines: +.nf + + void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) + +.fi +switches the scanner's input buffer so subsequent tokens will +come from +.I new_buffer. +Note that +.B yy_switch_to_buffer() +may be used by yywrap() to set things up for continued scanning, instead +of opening a new file and pointing +.I yyin +at it. +.nf + + void yy_delete_buffer( YY_BUFFER_STATE buffer ) + +.fi +is used to reclaim the storage associated with a buffer. +.PP +.B yy_new_buffer() +is an alias for +.B yy_create_buffer(), +provided for compatibility with the C++ use of +.I new +and +.I delete +for creating and destroying dynamic objects. +.PP +Finally, the +.B YY_CURRENT_BUFFER +macro returns a +.B YY_BUFFER_STATE +handle to the current buffer. +.PP +Here is an example of using these features for writing a scanner +which expands include files (the +.B <<EOF>> +feature is discussed below): +.nf + + /* the "incl" state is used for picking up the name + * of an include file + */ + %x incl + + %{ + #define MAX_INCLUDE_DEPTH 10 + YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; + int include_stack_ptr = 0; + %} + + %% + include BEGIN(incl); + + [a-z]+ ECHO; + [^a-z\\n]*\\n?
 ECHO; + + <incl>[ \\t]* /* eat the whitespace */ + <incl>[^ \\t\\n]+ { /* got the include file name */ + if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) + { + fprintf( stderr, "Includes nested too deeply" ); + exit( 1 ); + } + + include_stack[include_stack_ptr++] = + YY_CURRENT_BUFFER; + + yyin = fopen( yytext, "r" ); + + if ( ! yyin ) + error( ... ); + + yy_switch_to_buffer( + yy_create_buffer( yyin, YY_BUF_SIZE ) ); + + BEGIN(INITIAL); + } + + <<EOF>> { + if ( --include_stack_ptr < 0 ) + { + yyterminate(); + } + + else + { + yy_delete_buffer( YY_CURRENT_BUFFER ); + yy_switch_to_buffer( + include_stack[include_stack_ptr] ); + } + } + +.fi +.SH END-OF-FILE RULES +The special rule "<<EOF>>" indicates +actions which are to be taken when an end-of-file is +encountered and yywrap() returns non-zero (i.e., indicates +no further files to process). The action must finish +by doing one of four things: +.IP - +assigning +.I yyin +to a new input file (in previous versions of flex, after doing the +assignment you had to call the special action +.B YY_NEW_FILE; +this is no longer necessary); +.IP - +executing a +.I return +statement; +.IP - +executing the special +.B yyterminate() +action; +.IP - +or, switching to a new buffer using +.B yy_switch_to_buffer() +as shown in the example above. +.PP +<<EOF>> rules may not be used with other +patterns; they may only be qualified with a list of start +conditions. If an unqualified <<EOF>> rule is given, it +applies to +.I all +start conditions which do not already have <<EOF>> actions. To +specify an <<EOF>> rule for only the initial start condition, use +.nf + + <INITIAL><<EOF>> + +.fi +.PP +These rules are useful for catching things like unclosed comments. +An example: +.nf + + %x quote + %% + + ...other rules for dealing with quotes...
+ + <quote><<EOF>> { + error( "unterminated quote" ); + yyterminate(); + } + <<EOF>> { + if ( *++filelist ) + yyin = fopen( *filelist, "r" ); + else + yyterminate(); + } + +.fi +.SH MISCELLANEOUS MACROS +The macro +.B YY_USER_ACTION +can be defined to provide an action +which is always executed prior to the matched rule's action. For example, +it could be #define'd to call a routine to convert yytext to lower-case. +.PP +The macro +.B YY_USER_INIT +may be defined to provide an action which is always executed before +the first scan (and before the scanner's internal initializations are done). +For example, it could be used to call a routine to read +in a data table or open a logging file. +.PP +In the generated scanner, the actions are all gathered in one large +switch statement and separated using +.B YY_BREAK, +which may be redefined. By default, it is simply a "break", to separate +each rule's action from the following rule's. +Redefining +.B YY_BREAK +allows, for example, C++ users to +#define YY_BREAK to do nothing (while being very careful that every +rule ends with a "break" or a "return"!) to avoid suffering from +unreachable statement warnings where, because a rule's action ends with +"return", the +.B YY_BREAK +is inaccessible. +.SH INTERFACING WITH YACC +One of the main uses of +.I flex +is as a companion to the +.I yacc +parser-generator. +.I yacc +parsers expect to call a routine named +.B yylex() +to find the next input token. The routine is supposed to +return the type of the next token as well as putting any associated +value in the global +.B yylval. +To use +.I flex +with +.I yacc, +one specifies the +.B \-d +option to +.I yacc +to instruct it to generate the file +.B y.tab.h +containing definitions of all the +.B %tokens +appearing in the +.I yacc +input. This file is then included in the +.I flex +scanner.
 For example, if one of the tokens is "TOK_NUMBER", +part of the scanner might look like: +.nf + + %{ + #include "y.tab.h" + %} + + %% + + [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; + +.fi +.SH OPTIONS +.I flex +has the following options: +.TP +.B \-b +Generate backing-up information to +.I lex.backup. +This is a list of scanner states which require backing up +and the input characters on which they do so. By adding rules one +can remove backing-up states. If all backing-up states +are eliminated and +.B \-Cf +or +.B \-CF +is used, the generated scanner will run faster (see the +.B \-p +flag). Only users who wish to squeeze every last cycle out of their +scanners need worry about this option. (See the section on Performance +Considerations below.) +.TP +.B \-c +is a do-nothing, deprecated option included for POSIX compliance. +.IP +.B NOTE: +in previous releases of +.I flex +.B \-c +specified table-compression options. This functionality is +now given by the +.B \-C +flag. To ease the impact of this change, when +.I flex +encounters +.B \-c, +it currently issues a warning message and assumes that +.B \-C +was desired instead. In the future this "promotion" of +.B \-c +to +.B \-C +will go away in the name of full POSIX compliance (unless +the POSIX meaning is removed first). +.TP +.B \-d +makes the generated scanner run in +.I debug +mode. Whenever a pattern is recognized and the global +.B yy_flex_debug +is non-zero (which is the default), +the scanner will write to +.I stderr +a line of the form: +.nf + + --accepting rule at line 53 ("the matched text") + +.fi +The line number refers to the location of the rule in the file +defining the scanner (i.e., the file that was fed to flex). Messages +are also generated when the scanner backs up, accepts the +default rule, reaches the end of its input buffer (or encounters +a NUL; at this point, the two look the same as far as the scanner's concerned), +or reaches an end-of-file.
+.TP +.B \-f +specifies +.I fast scanner. +No table compression is done and stdio is bypassed. +The result is large but fast. This option is equivalent to +.B \-Cfr +(see below). +.TP +.B \-h +generates a "help" summary of +.I flex's +options to +.I stderr +and then exits. +.TP +.B \-i +instructs +.I flex +to generate a +.I case-insensitive +scanner. The case of letters given in the +.I flex +input patterns will +be ignored, and tokens in the input will be matched regardless of case. The +matched text given in +.I yytext +will have the preserved case (i.e., it will not be folded). +.TP +.B \-l +turns on maximum compatibility with the original AT&T +.I lex +implementation. Note that this does not mean +.I full +compatibility. Use of this option costs a considerable amount of +performance, and it cannot be used with the +.B \-+, -f, -F, -Cf, +or +.B -CF +options. For details on the compatibilities it provides, see the section +"Incompatibilities With Lex And POSIX" below. +.TP +.B \-n +is another do-nothing, deprecated option included only for +POSIX compliance. +.TP +.B \-p +generates a performance report to stderr. The report +consists of comments regarding features of the +.I flex +input file which will cause a serious loss of performance in the resulting +scanner. If you give the flag twice, you will also get comments regarding +features that lead to minor performance losses. +.IP +Note that the use of +.B REJECT +and variable trailing context (see the Bugs section in flex(1)) +entails a substantial performance penalty; use of +.I yymore(), +the +.B ^ +operator, +and the +.B \-I +flag entail minor performance penalties. +.TP +.B \-s +causes the +.I default rule +(that unmatched scanner input is echoed to +.I stdout) +to be suppressed. If the scanner encounters input that does not +match any of its rules, it aborts with an error. This option is +useful for finding holes in a scanner's rule set. 
+.TP +.B \-t +instructs +.I flex +to write the scanner it generates to standard output instead +of +.B lex.yy.c. +.TP +.B \-v +specifies that +.I flex +should write to +.I stderr +a summary of statistics regarding the scanner it generates. +Most of the statistics are meaningless to the casual +.I flex +user, but the first line identifies the version of +.I flex +(same as reported by +.B \-V), +and the next line the flags used when generating the scanner, including +those that are on by default. +.TP +.B \-w +suppresses warning messages. +.TP +.B \-B +instructs +.I flex +to generate a +.I batch +scanner, the opposite of +.I interactive +scanners generated by +.B \-I +(see below). In general, you use +.B \-B +when you are +.I certain +that your scanner will never be used interactively, and you want to +squeeze a +.I little +more performance out of it. If your goal is instead to squeeze out a +.I lot +more performance, you should be using the +.B \-Cf +or +.B \-CF +options (discussed below), which turn on +.B \-B +automatically anyway. +.TP +.B \-F +specifies that the +.ul +fast +scanner table representation should be used (and stdio +bypassed). This representation is +about as fast as the full table representation +.B (-f), +and for some sets of patterns will be considerably smaller (and for +others, larger). In general, if the pattern set contains both "keywords" +and a catch-all, "identifier" rule, such as in the set: +.nf + + "case" return TOK_CASE; + "switch" return TOK_SWITCH; + ... + "default" return TOK_DEFAULT; + [a-z]+ return TOK_ID; + +.fi +then you're better off using the full table representation. If only +the "identifier" rule is present and you then use a hash table or some such +to detect the keywords, you're better off using +.B -F. +.IP +This option is equivalent to +.B \-CFr +(see below). It cannot be used with +.B \-+. +.TP +.B \-I +instructs +.I flex +to generate an +.I interactive +scanner. 
An interactive scanner is one that only looks ahead to decide +what token has been matched if it absolutely must. It turns out that +always looking one extra character ahead, even if the scanner has already +seen enough text to disambiguate the current token, is a bit faster than +only looking ahead when necessary. But scanners that always look ahead +give dreadful interactive performance; for example, when a user types +a newline, it is not recognized as a newline token until they enter +.I another +token, which often means typing in another whole line. +.IP +.I Flex +scanners default to +.I interactive +unless you use the +.B \-Cf +or +.B \-CF +table-compression options (see below). That's because if you're looking +for high-performance you should be using one of these options, so if you +didn't, +.I flex +assumes you'd rather trade off a bit of run-time performance for intuitive +interactive behavior. Note also that you +.I cannot +use +.B \-I +in conjunction with +.B \-Cf +or +.B \-CF. +Thus, this option is not really needed; it is on by default for all those +cases in which it is allowed. +.IP +You can force a scanner to +.I not +be interactive by using +.B \-B +(see above). +.TP +.B \-L +instructs +.I flex +not to generate +.B #line +directives. Without this option, +.I flex +peppers the generated scanner +with #line directives so error messages in the actions will be correctly +located with respect to the original +.I flex +input file, and not to +the fairly meaningless line numbers of +.B lex.yy.c. +(Unfortunately +.I flex +does not presently generate the necessary directives +to "retarget" the line numbers for those parts of +.B lex.yy.c +which it generated. So if there is an error in the generated code, +a meaningless line number is reported.) +.TP +.B \-T +makes +.I flex +run in +.I trace +mode. It will generate a lot of messages to +.I stderr +concerning +the form of the input and the resultant non-deterministic and deterministic +finite automata. 
This option is mostly for use in maintaining +.I flex. +.TP +.B \-V +prints the version number to +.I stderr +and exits. +.TP +.B \-7 +instructs +.I flex +to generate a 7-bit scanner, i.e., one which can only recognize 7-bit +characters in its input. The advantage of using +.B \-7 +is that the scanner's tables can be up to half the size of those generated +using the +.B \-8 +option (see below). The disadvantage is that such scanners often hang +or crash if their input contains an 8-bit character. +.IP +Note, however, that unless you generate your scanner using the +.B \-Cf +or +.B \-CF +table compression options, use of +.B \-7 +will save only a small amount of table space, and make your scanner +considerably less portable. +.I Flex's +default behavior is to generate an 8-bit scanner unless you use +.B \-Cf +or +.B \-CF, +in which case +.I flex +defaults to generating 7-bit scanners unless your site was always +configured to generate 8-bit scanners (as will often be the case +with non-USA sites). You can tell whether flex generated a 7-bit +or an 8-bit scanner by inspecting the flag summary in the +.B \-v +output as described above. +.IP +Note that if you use +.B \-Cfe +or +.B \-CFe +(those table compression options, but also using equivalence classes as +discussed below), flex still defaults to generating an 8-bit +scanner, since usually with these compression options full 8-bit tables +are not much more expensive than 7-bit tables. +.TP +.B \-8 +instructs +.I flex +to generate an 8-bit scanner, i.e., one which can recognize 8-bit +characters. This flag is only needed for scanners generated using +.B \-Cf +or +.B \-CF, +as otherwise flex defaults to generating an 8-bit scanner anyway. +.IP +See the discussion of +.B \-7 +above for flex's default behavior and the tradeoffs between 7-bit +and 8-bit scanners. +.TP +.B \-+ +specifies that you want flex to generate a C++ +scanner class. See the section on Generating C++ Scanners below for +details.
+.TP +.B \-C[aefFmr] +controls the degree of table compression and, more generally, trade-offs +between small scanners and fast scanners. +.IP +.B \-Ca +("align") instructs flex to trade off larger tables in the +generated scanner for faster performance because the elements of +the tables are better aligned for memory access and computation. On some +RISC architectures, fetching and manipulating longwords is more efficient +than with smaller-sized datums such as shortwords. This option can +double the size of the tables used by your scanner. +.IP +.B \-Ce +directs +.I flex +to construct +.I equivalence classes, +i.e., sets of characters +which have identical lexical properties (for example, if the only +appearance of digits in the +.I flex +input is in the character class +"[0-9]" then the digits '0', '1', ..., '9' will all be put +in the same equivalence class). Equivalence classes usually give +dramatic reductions in the final table/object file sizes (typically +a factor of 2-5) and are pretty cheap performance-wise (one array +look-up per character scanned). +.IP +.B \-Cf +specifies that the +.I full +scanner tables should be generated - +.I flex +should not compress the +tables by taking advantage of similar transition functions for +different states. +.IP +.B \-CF +specifies that the alternate fast scanner representation (described +above under the +.B \-F +flag) +should be used. This option cannot be used with +.B \-+. +.IP +.B \-Cm +directs +.I flex +to construct +.I meta-equivalence classes, +which are sets of equivalence classes (or characters, if equivalence +classes are not being used) that are commonly used together. Meta-equivalence +classes are often a big win when using compressed tables, but they +have a moderate performance impact (one or two "if" tests and one +array look-up per character scanned). +.IP +.B \-Cr +causes the generated scanner to +.I bypass +use of the standard I/O library (stdio) for input.
Instead of calling +.B fread() +or +.B getc(), +the scanner will use the +.B read() +system call, resulting in a performance gain which varies from system +to system, but in general is probably negligible unless you are also using +.B \-Cf +or +.B \-CF. +Using +.B \-Cr +can cause strange behavior if, for example, you read from +.I yyin +using stdio prior to calling the scanner (because the scanner will miss +whatever text your previous reads left in the stdio input buffer). +.IP +.B \-Cr +has no effect if you define +.B YY_INPUT +(see The Generated Scanner above). +.IP +A lone +.B \-C +specifies that the scanner tables should be compressed but neither +equivalence classes nor meta-equivalence classes should be used. +.IP +The options +.B \-Cf +or +.B \-CF +and +.B \-Cm +do not make sense together - there is no opportunity for meta-equivalence +classes if the table is not being compressed. Otherwise the options +may be freely mixed, and are cumulative. +.IP +The default setting is +.B \-Cem, +which specifies that +.I flex +should generate equivalence classes +and meta-equivalence classes. This setting provides the highest +degree of table compression. You can trade off +faster-executing scanners at the cost of larger tables with +the following generally being true: +.nf + + slowest & smallest + -Cem + -Cm + -Ce + -C + -C{f,F}e + -C{f,F} + -C{f,F}a + fastest & largest + +.fi +Note that scanners with the smallest tables are usually generated and +compiled the quickest, so +during development you will usually want to use the default, maximal +compression. +.IP +.B \-Cfe +is often a good compromise between speed and size for production +scanners. +.TP +.B \-Pprefix +changes the default +.I "yy" +prefix used by +.I flex +for all globally-visible variable and function names to instead be +.I prefix. +For example, +.B \-Pfoo +changes the name of +.B yytext +to +.B footext. +It also changes the name of the default output file from +.B lex.yy.c +to +.B lex.foo.c. 
+Here are all of the names affected: +.nf + + yyFlexLexer + yy_create_buffer + yy_delete_buffer + yy_flex_debug + yy_init_buffer + yy_load_buffer_state + yy_switch_to_buffer + yyin + yyleng + yylex + yyout + yyrestart + yytext + yywrap + +.fi +Within your scanner itself, you can still refer to the global variables +and functions using either version of their name; but externally, they +have the modified name. +.IP +This option lets you easily link together multiple +.I flex +programs into the same executable. Note, though, that using this +option also renames +.B yywrap(), +so you now +.I must +provide your own (appropriately-named) version of the routine for your +scanner, as linking with +.B \-lfl +no longer provides one for you by default. +.TP +.B \-Sskeleton_file +overrides the default skeleton file from which +.I flex +constructs its scanners. You'll never need this option unless you are doing +.I flex +maintenance or development. +.SH PERFORMANCE CONSIDERATIONS +The main design goal of +.I flex +is that it generate high-performance scanners. It has been optimized +for dealing well with large sets of rules. Aside from the effects on +scanner speed of the table compression +.B \-C +options outlined above, +there are a number of options/actions which degrade performance. These +are, from most expensive to least: +.nf + + REJECT + + pattern sets that require backing up + arbitrary trailing context + + yymore() + '^' beginning-of-line operator + +.fi +with the first three all being quite expensive and the last two +being quite cheap. Note also that +.B unput() +is implemented as a routine call that potentially does quite a bit of +work, while +.B yyless() +is a quite-cheap macro; so if just putting back some excess text you +scanned, use +.B yyless(). +.PP +.B REJECT +should be avoided at all costs when performance is important. +It is a particularly expensive option.
+.PP +Getting rid of backing up is messy and often may be an enormous +amount of work for a complicated scanner. In principle, one begins +by using the +.B \-b +flag to generate a +.I lex.backup +file. For example, on the input +.nf + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + +.fi +the file looks like: +.nf + + State #6 is non-accepting - + associated rule line numbers: + 2 3 + out-transitions: [ o ] + jam-transitions: EOF [ \\001-n p-\\177 ] + + State #8 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ a ] + jam-transitions: EOF [ \\001-` b-\\177 ] + + State #9 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ r ] + jam-transitions: EOF [ \\001-q s-\\177 ] + + Compressed tables always back up. + +.fi +The first few lines tell us that there's a scanner state in +which it can make a transition on an 'o' but not on any other +character, and that in that state the currently scanned text does not match +any rule. The state occurs when trying to match the rules found +at lines 2 and 3 in the input file. +If the scanner is in that state and then reads +something other than an 'o', it will have to back up to find +a rule which is matched. With +a bit of headscratching one can see that this must be the +state it's in when it has seen "fo". When this has happened, +if anything other than another 'o' is seen, the scanner will +have to back up to simply match the 'f' (by the default rule). +.PP +The comment regarding State #8 indicates there's a problem +when "foob" has been scanned. Indeed, on any character other +than an 'a', the scanner will have to back up to accept "foo". +Similarly, the comment for State #9 concerns when "fooba" has +been scanned and an 'r' does not follow.
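The backing up described for these states can be imitated by hand. The following is only a sketch, not code that flex generates: scan_token() and struct scan_result are hypothetical names introduced for illustration, and the function hard-codes the two rules "foo" and "foobar" plus the default single-character rule. It records the last accepting position while it keeps reading, then reports how many characters had to be given back.

```c
/* Hypothetical helper for illustration; not part of flex or its output.
   It mimics longest-match scanning for the two rules "foo" and "foobar"
   plus the default rule, which matches any single character. */
struct scan_result {
    int matched;    /* length of the token finally accepted */
    int backed_up;  /* characters consumed past that token, then given back */
};

struct scan_result scan_token(const char *input)
{
    static const char longest[] = "foobar";
    struct scan_result r = { 0, 0 };
    int pos, last_accept;

    if (input[0] == '\0')
        return r;               /* nothing to scan */

    pos = 1;
    last_accept = 1;            /* default rule accepts any one character */

    if (input[0] == 'f') {
        /* Keep consuming while the text is still a prefix of "foobar". */
        while (pos < 6 && input[pos] == longest[pos]) {
            ++pos;
            if (pos == 3 || pos == 6)   /* "foo" and "foobar" accept here */
                last_accept = pos;
        }
    }

    r.matched = last_accept;
    r.backed_up = pos - last_accept;
    return r;
}
```

Under these assumptions, scan_token("foobaz") accepts "foo" after giving back two characters (the "fooba" situation of State #9), and scan_token("fox") falls back to the single 'f' matched by the default rule.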
+.PP +The final comment reminds us that there's no point going to +all the trouble of removing backing up from the rules unless +we're using +.B \-Cf +or +.B \-CF, +since there's no performance gain doing so with compressed scanners. +.PP +The way to remove the backing up is to add "error" rules: +.nf + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + fooba | + foob | + fo { + /* false alarm, not really a keyword */ + return TOK_ID; + } + +.fi +.PP +Eliminating backing up among a list of keywords can also be +done using a "catch-all" rule: +.nf + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + [a-z]+ return TOK_ID; + +.fi +This is usually the best solution when appropriate. +.PP +Backing up messages tend to cascade. +With a complicated set of rules it's not uncommon to get hundreds +of messages. If one can decipher them, though, it often +only takes a dozen or so rules to eliminate the backing up (though +it's easy to make a mistake and have an error rule accidentally match +a valid token. A possible future +.I flex +feature will be to automatically add rules to eliminate backing up). +.PP +.I Variable +trailing context (where both the leading and trailing parts do not have +a fixed length) entails almost the same performance loss as +.B REJECT +(i.e., substantial). So when possible a rule like: +.nf + + %% + mouse|rat/(cat|dog) run(); + +.fi +is better written: +.nf + + %% + mouse/cat|dog run(); + rat/cat|dog run(); + +.fi +or as +.nf + + %% + mouse|rat/cat run(); + mouse|rat/dog run(); + +.fi +Note that here the special '|' action does +.I not +provide any savings, and can even make things worse (see +.PP +A final note regarding performance: as mentioned above in the section +How the Input is Matched, dynamically resizing +.B yytext +to accommodate huge tokens is a slow process because it presently requires that +the (huge) token be rescanned from the beginning.
Thus if performance is +vital, you should attempt to match "large" quantities of text but not +"huge" quantities, where the cutoff between the two is at about 8K +characters/token. +.PP +Another area where the user can increase a scanner's performance +(and one that's easier to implement) arises from the fact that +the longer the tokens matched, the faster the scanner will run. +This is because with long tokens the processing of most input +characters takes place in the (short) inner scanning loop, and +does not often have to go through the additional work of setting up +the scanning environment (e.g., +.B yytext) +for the action. Recall the scanner for C comments: +.nf + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\\n]* + <comment>"*"+[^*/\\n]* + <comment>\\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); + +.fi +This could be sped up by writing it as: +.nf + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\\n]* + <comment>[^*\\n]*\\n ++line_num; + <comment>"*"+[^*/\\n]* + <comment>"*"+[^*/\\n]*\\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); + +.fi +Now instead of each newline requiring the processing of another +action, recognizing the newlines is "distributed" over the other rules +to keep the matched text as long as possible. Note that +.I adding +rules does +.I not +slow down the scanner! The speed of the scanner is independent +of the number of rules or (modulo the considerations given at the +beginning of this section) how complicated the rules are with +regard to operators such as '*' and '|'. +.PP +A final example in speeding up a scanner: suppose you want to scan +through a file containing identifiers and keywords, one per line +and with no other extraneous characters, and recognize all the +keywords. A natural first approach is: +.nf + + %% + asm | + auto | + break | + ... etc ...
+ volatile | + while /* it's a keyword */ + + .|\\n /* it's not a keyword */ + +.fi +To eliminate the back-tracking, introduce a catch-all rule: +.nf + + %% + asm | + auto | + break | + ... etc ... + volatile | + while /* it's a keyword */ + + [a-z]+ | + .|\\n /* it's not a keyword */ + +.fi +Now, if it's guaranteed that there's exactly one word per line, +then we can reduce the total number of matches by a half by +merging in the recognition of newlines with that of the other +tokens: +.nf + + %% + asm\\n | + auto\\n | + break\\n | + ... etc ... + volatile\\n | + while\\n /* it's a keyword */ + + [a-z]+\\n | + .|\\n /* it's not a keyword */ + +.fi +One has to be careful here, as we have now reintroduced backing up +into the scanner. In particular, while +.I we +know that there will never be any characters in the input stream +other than letters or newlines, +.I flex +can't figure this out, and it will plan for possibly needing to back up +when it has scanned a token like "auto" and then the next character +is something other than a newline or a letter. Previously it would +then just match the "auto" rule and be done, but now it has no "auto" +rule, only an "auto\\n" rule. To eliminate the possibility of backing up, +we could either duplicate all rules but without final newlines, or, +since we never expect to encounter such an input and therefore don't +care how it's classified, we can introduce one more catch-all rule, this +one which doesn't include a newline: +.nf + + %% + asm\\n | + auto\\n | + break\\n | + ... etc ... + volatile\\n | + while\\n /* it's a keyword */ + + [a-z]+\\n | + [a-z]+ | + .|\\n /* it's not a keyword */ + +.fi +Compiled with +.B \-Cf, +this is about as fast as one can get a +.I flex +scanner to go for this particular problem. +.PP +A final note: +.I flex +is slow when matching NUL's, particularly when a token contains +multiple NUL's.
+It's best to write rules which match +.I short +amounts of text if it's anticipated that the text will often include NUL's. +.SH GENERATING C++ SCANNERS +.I flex +provides two different ways to generate scanners for use with C++. The +first way is to simply compile a scanner generated by +.I flex +using a C++ compiler instead of a C compiler. You should not encounter +any compilation errors (please report any you find to the email address +given in the Author section below). You can then use C++ code in your +rule actions instead of C code. Note that the default input source for +your scanner remains +.I yyin, +and default echoing is still done to +.I yyout. +Both of these remain +.I FILE * +variables and not C++ +.I streams. +.PP +You can also use +.I flex +to generate a C++ scanner class, using the +.B \-+ +option, which is automatically specified if the name of the flex +executable ends in a '+', such as +.I flex++. +When using this option, flex defaults to generating the scanner to the file +.B lex.yy.cc +instead of +.B lex.yy.c. +The generated scanner includes the header file +.I FlexLexer.h, +which defines the interface to two C++ classes. +.PP +The first class, +.B FlexLexer, +provides an abstract base class defining the general scanner class +interface. It provides the following member functions: +.TP +.B const char* YYText() +returns the text of the most recently matched token, the equivalent of +.B yytext. +.TP +.B int YYLeng() +returns the length of the most recently matched token, the equivalent of +.B yyleng. +.PP +Also provided are member functions equivalent to +.B yy_switch_to_buffer(), +.B yy_create_buffer() +(though the first argument is an +.B istream* +object pointer and not a +.B FILE*), +.B yy_delete_buffer(), +and +.B yyrestart() +(again, the first argument is an +.B istream* +object pointer). +.PP +The second class defined in +.I FlexLexer.h +is +.B yyFlexLexer, +which is derived from +.B FlexLexer.
+It defines the following additional member functions: +.TP +.B +yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) +constructs a +.B yyFlexLexer +object using the given streams for input and output. If not specified, +the streams default to +.B cin +and +.B cout, +respectively. +.TP +.B virtual int yylex() +performs the same role as +.B yylex() +does for ordinary flex scanners: it scans the input stream, consuming +tokens, until a rule's action returns a value. +.PP +In addition, +.B yyFlexLexer +defines the following protected virtual functions which you can redefine +in derived classes to tailor the scanner: +.TP +.B +virtual int LexerInput( char* buf, int max_size ) +reads up to +.B max_size +characters into +.B buf +and returns the number of characters read. To indicate end-of-input, +return 0 characters. Note that "interactive" scanners (see the +.B \-B +and +.B \-I +flags) define the macro +.B YY_INTERACTIVE. +If you redefine +.B LexerInput() +and need to take different actions depending on whether or not +the scanner might be scanning an interactive input source, you can +test for the presence of this name via +.B #ifdef. +.TP +.B +virtual void LexerOutput( const char* buf, int size ) +writes out +.B size +characters from the buffer +.B buf, +which, while NUL-terminated, may also contain "internal" NUL's if +the scanner's rules can match text with NUL's in them. +.TP +.B +virtual void LexerError( const char* msg ) +reports a fatal error message. The default version of this function +writes the message to the stream +.B cerr +and exits. +.PP +Note that a +.B yyFlexLexer +object contains its +.I entire +scanning state. Thus you can use such objects to create reentrant +scanners. You can instantiate multiple instances of the same +.B yyFlexLexer +class, and you can also combine multiple C++ scanner classes together +in the same program using the +.B \-P +option discussed above.
+.PP +Finally, note that the +.B %array +feature is not available to C++ scanner classes; you must use +.B %pointer +(the default). +.PP +Here is an example of a simple C++ scanner: +.nf + + // An example of using the flex C++ scanner class. + + %{ + int mylineno = 0; + %} + + string \\"[^\\n"]+\\" + + ws [ \\t]+ + + alpha [A-Za-z] + dig [0-9] + name ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])* + num1 [-+]?{dig}+\\.?([eE][-+]?{dig}+)? + num2 [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)? + number {num1}|{num2} + + %% + + {ws} /* skip blanks and tabs */ + + "/*" { + int c; + + while((c = yyinput()) != 0) + { + if(c == '\\n') + ++mylineno; + + else if(c == '*') + { + if((c = yyinput()) == '/') + break; + else + unput(c); + } + } + } + + {number} cout << "number " << YYText() << '\\n'; + + \\n mylineno++; + + {name} cout << "name " << YYText() << '\\n'; + + {string} cout << "string " << YYText() << '\\n'; + + %% + + int main( int /* argc */, char** /* argv */ ) + { + FlexLexer* lexer = new yyFlexLexer; + while(lexer->yylex() != 0) + ; + return 0; + } +.fi +IMPORTANT: the present form of the scanning class is +.I experimental +and may change considerably between major releases. +.SH INCOMPATIBILITIES WITH LEX AND POSIX +.I flex +is a rewrite of the AT&T Unix +.I lex +tool (the two implementations do not share any code, though), +with some extensions and incompatibilities, both of which +are of concern to those who wish to write scanners acceptable +to either implementation. The POSIX +.I lex +specification is closer to +.I flex's +behavior than that of the original +.I lex +implementation, but there also remain some incompatibilities between +.I flex +and POSIX. The intent is that ultimately +.I flex +will be fully POSIX-conformant. In this section we discuss all of +the known areas of incompatibility. 
+.PP +.I flex's +.B \-l +option turns on maximum compatibility with the original AT&T +.I lex +implementation, at the cost of a major loss in the generated scanner's +performance. We note below which incompatibilities can be overcome +using the +.B \-l +option. +.PP +.I flex +is fully compatible with +.I lex +with the following exceptions: +.IP - +The undocumented +.I lex +scanner internal variable +.B yylineno +is not supported unless +.B \-l +is used. +.IP +yylineno is not part of the POSIX specification. +.IP - +The +.B input() +routine is not redefinable, though it may be called to read characters +following whatever has been matched by a rule. If +.B input() +encounters an end-of-file the normal +.B yywrap() +processing is done. A ``real'' end-of-file is returned by +.B input() +as +.I EOF. +.IP +Input is instead controlled by defining the +.B YY_INPUT +macro. +.IP +The +.I flex +restriction that +.B input() +cannot be redefined is in accordance with the POSIX specification, +which simply does not specify any way of controlling the +scanner's input other than by making an initial assignment to +.I yyin. +.IP - +.I flex +scanners are not as reentrant as +.I lex +scanners. In particular, if you have an interactive scanner and +an interrupt handler which long-jumps out of the scanner, and +the scanner is subsequently called again, you may get the following +message: +.nf + + fatal flex scanner internal error--end of buffer missed + +.fi +To reenter the scanner, first use +.nf + + yyrestart( yyin ); + +.fi +Note that this call will throw away any buffered input; usually this +isn't a problem with an interactive scanner. +.IP +Also note that flex C++ scanner classes +.I are +reentrant, so if using C++ is an option for you, you should use +them instead. See "Generating C++ Scanners" above for details. +.IP - +.B output() +is not supported. +Output from the +.B ECHO +macro is done to the file-pointer +.I yyout +(default +.I stdout). 
+.IP +.B output() +is not part of the POSIX specification. +.IP - +.I lex +does not support exclusive start conditions (%x), though they +are in the POSIX specification. +.IP - +When definitions are expanded, +.I flex +encloses them in parentheses. +With lex, the following: +.nf + + NAME [A-Z][A-Z0-9]* + %% + foo{NAME}? printf( "Found it\\n" ); + %% + +.fi +will not match the string "foo" because when the macro +is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" +and the precedence is such that the '?' is associated with +"[A-Z0-9]*". With +.I flex, +the rule will be expanded to +"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. +.IP +Note that if the definition begins with +.B ^ +or ends with +.B $ +then it is +.I not +expanded with parentheses, to allow these operators to appear in +definitions without losing their special meanings. But the +.B <s>, /, +and +.B <<EOF>> +operators cannot be used in a +.I flex +definition. +.IP +Using +.B \-l +results in the +.I lex +behavior of no parentheses around the definition. +.IP +The POSIX specification is that the definition be enclosed in parentheses. +.IP - +The +.I lex +.B %r +(generate a Ratfor scanner) option is not supported. It is not part +of the POSIX specification. +.IP - +After a call to +.B unput(), +.I yytext +and +.I yyleng +are undefined until the next token is matched, unless the scanner +was built using +.B %array. +This is not the case with +.I lex +or the POSIX specification. The +.B \-l +option does away with this incompatibility. +.IP - +The precedence of the +.B {} +(numeric range) operator is different. +.I lex +interprets "abc{1,3}" as "match one, two, or +three occurrences of 'abc'", whereas +.I flex +interprets it as "match 'ab' +followed by one, two, or three occurrences of 'c'". The latter is +in agreement with the POSIX specification. +.IP - +The precedence of the +.B ^ +operator is different.
+.I lex +interprets "^foo|bar" as "match either 'foo' at the beginning of a line, +or 'bar' anywhere", whereas +.I flex +interprets it as "match either 'foo' or 'bar' if they come at the beginning +of a line". The latter is in agreement with the POSIX specification. +.IP - +.I yyin +is +.I initialized +by +.I lex +to be +.I stdin; +.I flex, +on the other hand, +initializes +.I yyin +to NULL +and then +.I assigns +it to +.I stdin +the first time the scanner is called, providing +.I yyin +has not already been assigned to a non-NULL value. The difference is +subtle, but the net effect is that with +.I flex +scanners, +.I yyin +does not have a valid value until the scanner has been called. +.IP +The +.B \-l +option does away with this incompatibility. +.IP - +The special table-size declarations such as +.B %a +supported by +.I lex +are not required by +.I flex +scanners; +.I flex +ignores them. +.IP - +The name +.bd +FLEX_SCANNER +is #define'd so scanners may be written for use with either +.I flex +or +.I lex. +.PP +The following +.I flex +features are not included in +.I lex +or the POSIX specification: +.nf + + yyterminate() + <<EOF>> + <*> + YY_DECL + YY_START + YY_USER_ACTION + #line directives + %{}'s around actions + multiple actions on a line + +.fi +plus almost all of the flex flags. +The last feature in the list refers to the fact that with +.I flex +you can put multiple actions on the same line, separated with +semi-colons, while with +.I lex, +the following +.nf + + foo handle_foo(); ++num_foos_seen; + +.fi +is (rather surprisingly) truncated to +.nf + + foo handle_foo(); + +.fi +.I flex +does not truncate the action. Actions that are not enclosed in +braces are simply terminated at the end of the line. +.SH DIAGNOSTICS +.PP +.I warning, rule cannot be matched +indicates that the given rule +cannot be matched because it follows other rules that will +always match the same text as it.
For +example, in the following "foo" cannot be matched because it comes after +an identifier "catch-all" rule: +.nf + + [a-z]+ got_identifier(); + foo got_foo(); + +.fi +Using +.B REJECT +in a scanner suppresses this warning. +.PP +.I warning, +.B \-s +.I +option given but default rule can be matched +means that it is possible (perhaps only in a particular start condition) +that the default rule (match any single character) is the only one +that will match a particular input. Since +.B \-s +was given, presumably this is not intended. +.PP +.I reject_used_but_not_detected undefined +or +.I yymore_used_but_not_detected undefined - +These errors can occur at compile time. They indicate that the +scanner uses +.B REJECT +or +.B yymore() +but that +.I flex +failed to notice the fact, meaning that +.I flex +scanned the first two sections looking for occurrences of these actions +and failed to find any, but somehow you snuck some in (via a #include +file, for example). Make an explicit reference to the action in your +.I flex +input file. (Note that previously +.I flex +supported a +.B %used/%unused +mechanism for dealing with this problem; this feature is still supported +but now deprecated, and will go away soon unless the author hears from +people who can argue compellingly that they need it.) +.PP +.I flex scanner jammed - +a scanner compiled with +.B \-s +has encountered an input string which wasn't matched by +any of its rules. This error can also occur due to internal problems. +.PP +.I token too large, exceeds YYLMAX - +your scanner uses +.B %array +and one of its rules matched a string longer than the +.B YYLMAX +constant (8K bytes by default). You can increase the value by +#define'ing +.B YYLMAX +in the definitions section of your +.I flex +input. 
+.PP +.I scanner requires \-8 flag to +.I use the character 'x' - +Your scanner specification includes recognizing the 8-bit character +.I 'x' +and you did not specify the \-8 flag, and your scanner defaulted to 7-bit +because you used the +.B \-Cf +or +.B \-CF +table compression options. See the discussion of the +.B \-7 +flag for details. +.PP +.I flex scanner push-back overflow - +you used +.B unput() +to push back so much text that the scanner's buffer could not hold +both the pushed-back text and the current token in +.B yytext. +Ideally the scanner should dynamically resize the buffer in this case, but at +present it does not. +.PP +.I +input buffer overflow, can't enlarge buffer because scanner uses REJECT - +the scanner was working on matching an extremely large token and needed +to expand the input buffer. This doesn't work with scanners that use +.B +REJECT. +.PP +.I +fatal flex scanner internal error--end of buffer missed - +This can occur in a scanner which is reentered after a long-jump +has jumped out (or over) the scanner's activation frame. Before +reentering the scanner, use: +.nf + + yyrestart( yyin ); + +.fi +or, as noted above, switch to using the C++ scanner class. +.PP +.I too many start conditions in <> construct! - +you listed more start conditions in a <> construct than exist (so +you must have listed at least one of them twice). +.SH FILES +See flex(1). +.SH DEFICIENCIES / BUGS +Again, see flex(1). +.SH "SEE ALSO" +.PP +flex(1), lex(1), yacc(1), sed(1), awk(1). +.PP +M. E. Lesk and E. Schmidt, +.I LEX \- Lexical Analyzer Generator +.SH AUTHOR +Vern Paxson, with the help of many ideas and much inspiration from +Van Jacobson. Original version by Jef Poskanzer. The fast table +representation is a partial implementation of a design done by Van +Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
+.PP +Thanks to the many +.I flex +beta-testers, feedbackers, and contributors, especially Francois Pinard, +Casey Leedom, +Nelson H.F. Beebe, benson@odi.com, Peter A. Bigot, Keith Bostic, Frederic +Brehm, Nick Christopher, Jason Coughlin, Bill Cox, Dave Curtis, Scott David +Daniels, Chris G. Demetriou, Mike Donahue, Chuck Doucette, Tom Epperly, Leo +Eskin, Chris Faylor, Jon Forrest, Kaveh R. Ghazi, +Eric Goldman, Ulrich Grepel, Jan Hajic, +Jarkko Hietaniemi, Eric Hughes, John Interrante, +Ceriel Jacobs, Jeffrey R. Jones, Henry +Juengst, Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Marq Kole, Ronald +Lamprecht, Greg Lee, Craig Leres, John Levine, Steve Liddle, +Mohamed el Lozy, Brian Madsen, Chris +Metcalf, Luke Mewburn, Jim Meyering, G.T. Nicol, Landon Noll, Marc Nozell, +Richard Ohnemus, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond +Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Frederic Raimbault, +Rick Richardson, +Kevin Rodgers, Jim Roskind, +Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, +Alex Siegel, Mike Stump, Paul Stuart, Dave Tallman, Chris Thewalt, +Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken +Yap, Nathan Zelle, David Zuhn, and those whose names have slipped my marginal +mail-archiving skills but whose contributions are appreciated all the +same. +.PP +Thanks to Keith Bostic, Jon Forrest, Noah Friedman, +John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. +Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various +distribution headaches. +.PP +Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to +Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom +Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to +Eric Hughes for support of multiple buffers. +.PP +This work was primarily done when I was with the Real Time Systems Group +at the Lawrence Berkeley Laboratory in Berkeley, CA. 
Many thanks to all there +for the support I received. +.PP +Send comments to: +.nf + + Vern Paxson + Systems Engineering + Bldg. 46A, Room 1123 + Lawrence Berkeley Laboratory + University of California + Berkeley, CA 94720 + + vern@ee.lbl.gov + +.fi