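I wanted to write a JSON parser with lex & yacc. JSON strings can contain Unicode, but Lex, the lexer generator, does not directly support matching Unicode characters. So what do you do if you need to match them? I found a very good answer on Stack Overflow: http://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode.
The basic idea is to write patterns that match Unicode characters byte by byte in their UTF-8 encoding: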
ASC [\x00-\x7f]
ASCN [\x00-\t\v-\x7f]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
UANY {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
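The patterns above mean the following:
UANY
: matches any character, both ASCII and non-ASCII (multi-byte UTF-8) Unicode characters
UANYN
: like UANY, except that it does not match the newline character
UONLY
: matches only non-ASCII Unicode characters, not ASCII characters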
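To make the byte ranges concrete, here is a minimal C sketch of what the UANY pattern accepts. The function name uany_len is hypothetical and is not part of lex or of the Stack Overflow answer; it only hand-codes the same ranges as the definitions above.

```c
#include <stddef.h>

/* Byte classes copied from the lex definitions ASC and U. */
static int is_asc(unsigned char b) { return b <= 0x7f; }
static int is_u(unsigned char b)   { return b >= 0x80 && b <= 0xbf; }

/*
 * Hypothetical helper: returns the length in bytes of the UANY match
 * at the start of s[0..n), or 0 if the bytes do not match UANY.
 */
size_t uany_len(const unsigned char *s, size_t n)
{
    size_t need, i;

    if (n == 0)
        return 0;
    if (is_asc(s[0]))
        return 1;                        /* {ASC} */

    if (s[0] >= 0xc2 && s[0] <= 0xdf)
        need = 2;                        /* {U2}{U} */
    else if (s[0] >= 0xe0 && s[0] <= 0xef)
        need = 3;                        /* {U3}{U}{U} */
    else if (s[0] >= 0xf0 && s[0] <= 0xf4)
        need = 4;                        /* {U4}{U}{U}{U} */
    else
        return 0;                        /* not a valid lead byte */

    if (n < need)
        return 0;                        /* truncated sequence */
    for (i = 1; i < need; i++)
        if (!is_u(s[i]))
            return 0;                    /* bad continuation byte */
    return need;
}
```

Dropping the is_asc branch would give the UONLY behaviour, and additionally rejecting the byte 0x0a would give UANYN. Like the lex patterns, this check is purely structural: it still accepts some overlong three-byte forms such as e0 80 80.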
DISCLAIMER: Note that the scanner’s rules use a function called
utf8_dup_from to convert the yytext to wide character strings
containing Unicode codepoints. That function is robust; it detects
problems like overlong sequences and invalid bytes and properly
handles them. I.e. this program is not relying on these lex rules to
do the validation and conversion, just to do the basic lexical
recognition. These rules will recognize an overlong form (like an
ASCII code encoded using several bytes) as valid syntax, but the
conversion function will treat them properly. In any case, I don’t
expect UTF-8 related security issues in the program source code, since
you have to trust source code to be running it anyway (but data
handled by the program may not be trusted!) If you’re writing a
scanner for untrusted UTF-8 data, take care!
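As an illustration of the kind of checking the disclaimer refers to, here is a minimal sketch of a strict decoder in C. It is not the utf8_dup_from function from the answer, whose source is not shown here; utf8_decode_strict is a hypothetical name, and the sketch assumes that rejecting overlong forms, surrogates, and values above U+10FFFF is the validation you want.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decodes one code point from s[0..n) into *out.
 * Returns the number of bytes consumed, or 0 if the input is invalid.
 */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t cp;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7f) {
        *out = s[0];
        return 1;
    }

    if (s[0] >= 0xc0 && s[0] <= 0xdf) {
        need = 2; cp = s[0] & 0x1f;
    } else if (s[0] >= 0xe0 && s[0] <= 0xef) {
        need = 3; cp = s[0] & 0x0f;
    } else if (s[0] >= 0xf0 && s[0] <= 0xf7) {
        need = 4; cp = s[0] & 0x07;
    } else {
        return 0;                        /* stray continuation or bad lead byte */
    }

    if (n < need)
        return 0;                        /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                    /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3f);
    }

    if (cp < min_cp[need])
        return 0;                        /* overlong encoding */
    if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;                        /* UTF-16 surrogate */
    if (cp > 0x10ffff)
        return 0;                        /* beyond the Unicode range */

    *out = cp;
    return need;
}
```

A scanner action could call a function like this on yytext after a UANY match and treat a return value of 0 as a lexical error, so the lex rules handle recognition and the decoder handles validation.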