cs262 Programming Languages (2) Lexical Analysis

The important material in this lecture starts at segment 13, Specifying Tokens. But right at the start, this shows up:

def t_RANGLES(token):
    r'>'
    return token

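Nothing earlier in the course explained where this came from, so it looked confusing, especially the r'>' line: what syntax is that? I gave up on my first attempt. Later I learned that the course is using the PLY library, and its documentation says: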

When a function is used, the regular expression rule is specified in the function documentation string.

Oh, only then did it click: r'>' is just a docstring, written as a raw string literal. It put on a disguise and I failed to recognize it.
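To check that the regular expression really is nothing more than the function's docstring, it can be read back through the standard `__doc__` attribute. This is a minimal sketch in plain Python (no PLY required, written so that it also runs under Python 3); it only shows where the string lives, not how PLY consumes it internally.

```python
import re

def t_RANGLES(token):
    r'>'
    return token

# The raw string is the first statement in the function body,
# so Python stores it as the function's docstring.
pattern = t_RANGLES.__doc__
print(pattern)

# The docstring is an ordinary regular expression string.
print(re.match(pattern, '>') is not None)
```

Per the PLY documentation quoted above, this docstring is where the rule's regular expression is specified when a token is defined by a function.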

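The order in which tokens are defined matters: PLY adds function-defined rules in the order they appear in the file, so when two rules can match at the same position, the one defined first wins.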
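Why the order matters can be seen with a tiny hand-rolled scanner that tries rules in sequence and takes the first match. This is only a sketch of the first-match-wins idea, not PLY itself; `tokenize` and the rule tuples `HEX`, `DEC`, and `ID` are hypothetical names introduced for illustration.

```python
import re

def tokenize(text, rules):
    """Try each (name, regex, convert) rule in order; the first match wins."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern, convert in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                out.append((name, convert(m.group())))
                pos = m.end()
                break
        else:
            raise ValueError("unexpected character %r" % text[pos])
    return out

HEX = ('NUM', r'0x[0-9a-f]+', lambda s: int(s, 16))
DEC = ('NUM', r'[0-9]+', int)
ID = ('ID', r'[a-zA-Z]+', str)

hex_first = tokenize("0x19", [HEX, DEC, ID])
dec_first = tokenize("0x19", [DEC, HEX, ID])
print(hex_first)   # [('NUM', 25)]
print(dec_first)   # [('NUM', 0), ('ID', 'x'), ('NUM', 19)]
```

With the decimal rule first, the leading "0" is consumed as a decimal number before the hexadecimal rule gets a chance, which is why the hex rule has to come first in the homework below.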

Handling comments in HTML.
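These notes do not record how the lecture implements this, so the following is only an assumed illustration of the general idea: the scanner switches into a "comment" mode at `<!--`, discards everything, and switches back at `-->`. It is plain Python rather than PLY, and `words_outside_comments` and `WORD` are hypothetical names.

```python
import re

# One or more non-space characters, stopping before a comment opener.
WORD = re.compile(r'(?:(?!<!--)\S)+')

def words_outside_comments(html):
    """Return whitespace-separated words, skipping text inside <!-- ... -->."""
    out, pos, in_comment = [], 0, False
    while pos < len(html):
        if in_comment:
            if html.startswith('-->', pos):
                in_comment = False      # leave comment mode
                pos += 3
            else:
                pos += 1                # discard comment text
        elif html.startswith('<!--', pos):
            in_comment = True           # enter comment mode
            pos += 4
        elif html[pos].isspace():
            pos += 1
        else:
            m = WORD.match(html, pos)
            out.append(m.group())
            pos = m.end()
    return out

print(words_outside_comments("hello <!-- ignore me --> world"))
```

The point of the two modes is that ordinary token rules never see the comment text at all.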

There isn't really much to summarize for this lecture; it is mostly about PLY.

Homework:

# Hexadecimal Numbers
#
# In this exercise you will write a lexical analyzer that breaks strings up
# into whitespace-separated identifiers and numbers. An identifier is a
# sequence of one or more upper- or lower-case letters. In this exercise,
# however, there are two types of numbers: decimal numbers, and
# _hexadecimal_ numbers.
#
# Humans usually write numbers using "decimal" or "base 10" notation. The
# number 234 means 2*10^2 + 3*10 + 4*1.
#
# It is also possible to write numbers using other "bases", like "base 16"
# or "hexadecimal". Computers often use base 16 because 16 is a convenient
# power of two (i.e., it is a closer fit to the "binary" system that
# computers use internally). A hexadecimal number always starts with the
# two-character prefix "0x" so that you know not to mistake it for a binary
# number. The number 0x234 means
#        2 * 16^2
#     + 3 * 16^1
#     + 4 * 16^0
# = 564 decimal.
#
# Because base 16 is larger than base 10, the letters 'a' through 'f' are
# used to represent the numbers '10' through '15'. So the hexadecimal
# number 0xb is the same as the decimal number 11. When read out loud, the
# "0x" is often pronounced like "hex". "0x" must always be followed by at
# least one hexadecimal digit to count as a hexadecimal number.
#
# Modern programming languages like Python can understand hexadecimal
# numbers natively! Try it:
#
# print 0x234  # uncomment me to see 564 printed
# print 0xb    # uncomment me to see 11 printed
#
# This provides an easy way to test your knowledge of hexadecimal.
#
# For this assignment you must write token definition rules (e.g., t_ID,
# t_NUM_hex) that will break up a string of whitespace-separated
# identifiers and numbers (either decimal or hexadecimal) into ID and NUM
# tokens. If the token is an ID, you should store its text in the
# token.value field. If the token is a NUM, you must store its numerical
# value (NOT a string) in the token.value field. This means that if a
# hexadecimal string is found, you must convert it to a decimal value.
#
# Hint 1: When presented with a hexadecimal string like "0x2b4", you can
# convert it to a decimal number in stages, reading it from left to right:
#       number = 0              # '0x'
#       number = number * 16
#       number = number + 2     # '2'
#       number = number * 16
#       number = number + 11    # 'b'
#       number = number * 16
#       number = number + 4     # '4'
# Of course, since you don't know the number of digits in advance, you'll
# probably want some sort of loop. There are other ways to convert a
# hexadecimal string to a number. You may use any way that works.
#
# Hint 2: The Python function ord() will convert a single letter into
# an ordered internal numerical representation. This allows you to perform
# simple arithmetic on numbers:
#
# print ord('c') - ord('a') == 2

import ply.lex as lex

tokens = ('NUM', 'ID')

####
# Fill in your code here.
####

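# The hex rule must be defined before t_NUM_decimal; otherwise the
# leading "0" of "0x..." is consumed as a decimal number.
def t_NUM_hex(token):
    r'0x[0-9a-f]+'
    token.value = int(token.value, 16)  # int() accepts the "0x" prefix in base 16
    token.type = 'NUM'
    return token

def t_NUM_decimal(token):
    r'[0-9]+'
    token.value = int(token.value)  # won't work on hex numbers!
    token.type = 'NUM'
    return token

# Letters only, as the assignment specifies. The range "A-z" would also
# match the punctuation characters between "Z" and "a".
def t_ID(token):
    r'[a-zA-Z]+'
    return token

t_ignore = ' \t\v\r'

def t_error(t):
    print("Lexer: unexpected character " + t.value[0])
    t.lexer.skip(1)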

# We have included some testing code to help you check your work. You will
# probably want to add your own additional tests.
lexer = lex.lex() 

def test_lexer(input_string):
  lexer.input(input_string)
  result = [ ]
  while True:
    tok = lexer.token()
    if not tok: break
    result = result + [(tok.type, tok.value)]
  return result

question1 = "0x19 equals 25" # 0x19 = (1*16) + 9
answer1 = [('NUM', 25), ('ID', 'equals'), ('NUM', 25)]

print(test_lexer(question1) == answer1)

question2 = "0xfeed MY 0xface"
answer2 = [('NUM', 65261), ('ID', 'MY'), ('NUM', 64206)]

print(test_lexer(question2) == answer2)

question3 = "tricky 0x0x0x"
answer3 = [('ID', 'tricky'), ('NUM', 0), ('ID', 'x'), ('NUM', 0), ('ID', 'x')]
print(test_lexer(question3) == answer3)

question4 = "in 0xdeed"
print(test_lexer(question4))

question5 = "where is the 0xbeef"
print(test_lexer(question5))
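
The solution above leans on `int(token.value, 16)`. For comparison, the left-to-right conversion described in Hint 1, combined with the `ord()` trick from Hint 2, could be sketched as follows. `hex_to_int` is a hypothetical helper name, and the sketch assumes its input has already been validated by the lexer rule (a "0x" prefix followed by lowercase hex digits).

```python
def hex_to_int(text):
    """Convert a string such as '0x2b4' to an integer, one digit at a time."""
    number = 0
    for ch in text[2:]:                      # skip the '0x' prefix
        if '0' <= ch <= '9':
            digit = ord(ch) - ord('0')
        else:
            digit = ord(ch) - ord('a') + 10  # 'a'..'f' stand for 10..15
        number = number * 16 + digit         # shift left one hex digit, then add
    return number

print(hex_to_int('0x2b4'))   # 692
```

Either approach satisfies the assignment, which allows any conversion method that works.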


# Email Addresses & Spam
#
# In this assignment you will write Python code to extract email
# addresses from a string of text. To avoid unsolicited commercial email
# (commonly known as "spam"), users sometimes add the text NOSPAM to an
# otherwise legal email address, trusting that humans will be smart enough
# to remove it but that machines will not. As we shall see, this provides
# only relatively weak protection.
#
# For the purposes of this exercise, an email address consists of a
# word, an '@', and a domain name. A word is a non-empty sequence
# of upper- or lower-case letters. A domain name is a sequence of two or
# more words, separated by periods.
#
# Example: [email protected]
# Example: [email protected]
# Example: [email protected]
#
# If an email address has the text NOSPAM (uppercase only) anywhere in it,
# you should remove all such text. Example:
# '[email protected]' -> '[email protected]'
# '[email protected]' -> '[email protected]'
#
# You should write a procedure addresses() that accepts as input a string.
# Your procedure should return a list of valid email addresses found within
# that string -- each of which should have NOSPAM removed, if applicable.
#
# Hint 1: Just as we can FIND a regular expression in a string using
# re.findall(), we can also REPLACE or SUBSTITUTE a regular expression in a
# string using re.sub(regexp, new_text, haystack). Example:
#
# print re.sub(r"[0-9]+", "NUMBER", "22 + 33 = 55")
# "NUMBER + NUMBER = NUMBER"
#
# Hint 2: Don't forget to escape special characters.
#
# Hint 3: You don't have to write very much code to complete this exercise:
# you just have to put together a few concepts. It is possible to complete
# this exercise without using a lexer at all. You may use any approach that
# works. 

import ply.lex as lex
import re 

# Fill in your answer here. 

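def addresses(haystack):
    # word, '@', then a domain of two or more words separated by periods
    emails = re.findall(r'[a-zA-Z]+@[a-zA-Z]+(?:\.[a-zA-Z]+)+', haystack)
    # strip every NOSPAM marker from each address that was found
    return [re.sub('NOSPAM', '', email) for email in emails]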

# We have provided a single test case for you. You will probably want to
# write your own.
input1 = """[email protected] (1814-1871) was an advocate for
democracy. [email protected] (1905-1982) wrote about
the early nazi era. [email protected] was honored with a 1994
deutsche bundespost stamp. [email protected] is not actually an email address."""

output1 = ['[email protected]', '[email protected]', '[email protected]']

print(addresses(input1) == output1)
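
Because the addresses in the provided test case are obscured on this page, here is a self-contained sketch of the same find-then-substitute approach run on made-up input. The addresses under example.com and example.org are hypothetical, chosen only for illustration, and the pattern follows the assignment's definition of an email address (a word, an '@', and a domain of two or more words separated by periods).

```python
import re

EMAIL = re.compile(r'[a-zA-Z]+@[a-zA-Z]+(?:\.[a-zA-Z]+)+')

def addresses(haystack):
    """Find email addresses, then remove any NOSPAM marker from each."""
    found = EMAIL.findall(haystack)   # non-capturing group, so full matches are returned
    return [re.sub('NOSPAM', '', address) for address in found]

sample = ("Write to aliceNOSPAM@example.com or bob@NOSPAMmail.example.org. "
          "carol@localhost is not an address.")
print(addresses(sample))
```

The last candidate is rejected because its domain has only one word, while the assignment requires at least two words separated by periods.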


Time: 2024-10-03 17:25:47
