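Continuing my review of Python basics:
1. Regular expressions
Here is a very handy table and a common example:
Example: a regular expression for validating email addresses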
    import re

    re_email = re.compile(r'^([\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)+)$')
    re_email.match('[email protected]').groups()
    re_email.match('[email protected]').groups()
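To see what the pattern captures, here is a minimal sketch that reuses the same regular expression. The addresses and the helper name `email_groups` are made up for illustration and are not from the original example. Because `match()` returns `None` when the string does not fit the pattern, the helper checks for that before calling `groups()`.

```python
import re

# Same pattern as in the example; the sample addresses are hypothetical.
RE_EMAIL = re.compile(r'^([\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)+)$')


def email_groups(address):
    """Return the captured groups, or None if the address does not match."""
    m = RE_EMAIL.match(address)
    return m.groups() if m else None


print(email_groups('someone@example.com'))
print(email_groups('bob.smith@mail.example.org'))
print(email_groups('not-an-email'))
```

Group 1 is the whole address; groups 2 and 3 hold the last dotted part before and after the `@`, so group 2 is `None` when the local part contains no dot.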
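2. Next, back to data
Since Python is often used for data processing, understanding the different types of data matters a lot (the content again follows the reference material used earlier).
For strings: get comfortable with encoding and decoding. The struct module (the idea is similar to structs in C or C++) handles binary data such as image files. Another interesting one is the hashlib module, which provides the MD5 algorithm.
Reminder: remember to import the corresponding modules.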
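The three ideas above can be sketched in a few lines. The sample text, the format string `'>IH'` and the input `b'hello'` are arbitrary choices for illustration, not taken from the reference material.

```python
import hashlib
import struct

# Encoding and decoding: str <-> bytes
text = '中文'
data = text.encode('utf-8')        # str -> bytes
restored = data.decode('utf-8')    # bytes -> str

# struct: pack Python values into a fixed binary layout.
# '>IH' means big-endian, one 4-byte unsigned int, one 2-byte unsigned short.
packed = struct.pack('>IH', 1, 2)
number, short = struct.unpack('>IH', packed)

# hashlib: MD5 digest of a bytes object, as a hex string
md5_hex = hashlib.md5(b'hello').hexdigest()

print(data, restored)
print(packed, number, short)
print(md5_hex)
```

Note that `hashlib.md5` takes bytes, not str, so a string has to be encoded first.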
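3. CSV (comma-separated values)
A CSV file stores tabular data (numbers and text) as plain text. Plain text means the file is a sequence of characters and contains no data that has to be interpreted as binary numbers. A CSV file consists of any number of records separated by some kind of line break; each record consists of fields separated by some other character or string, most commonly a comma or a tab. Usually, all records have exactly the same sequence of fields.
The two main operations are reading and writing.
Here is an example:
1) Create a CSV file. Simply change the extension of a .txt file to .csv. For example, with the following data: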
    "Year", "Country","Sex","Display Value","Numeric"
    "1990","Andorra","Both sexes","77","77.00000"
    "2000","Andorra","Both sexes","80","80.00000"
    "2012","Andorra","Female","28","28.00000"
    "2000","Andorra","Both sexes","23","23.00000"
    "2012","United Arab Emirates","Female","78","78.00000"
    "2000","Antigua and Barbuda","Male","72","72.00000"
    "1990","Antigua and Barbuda","Male","17","17.00000"
    "2012","Antigua and Barbuda","Both sexes","22","22.00000"
    "2012","Australia","Male","81","81.00000"
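2) Write the code following the reference material
First, change the path to D:\\python\\test.csv. Note that every separator is written as two backslashes, because a single backslash starts an escape sequence inside a Python string.
My code was as follows: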
    import csv

    csvfile = open('D:\\python\\test.csv', 'rb')
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)
It then raised an error. Here is the traceback:
    Traceback (most recent call last):
      File "<pyshell#62>", line 1, in <module>
        for row in reader:
    _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
After searching on Baidu I found the cause and changed the code as follows:
    import csv

    csvfile = open('D:\\python\\test.csv', 'r')
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)
That is, 'rb' was changed to 'r', and the rows were then printed successfully:
    ['Year', ' "Country"', 'Sex', 'Display Value', 'Numeric']
    ['1990', 'Andorra', 'Both sexes', '77', '77.00000']
    ['2000', 'Andorra', 'Both sexes', '80', '80.00000']
    ['2012', 'Andorra', 'Female', '28', '28.00000']
    ['2000', 'Andorra', 'Both sexes', '23', '23.00000']
    ['2012', 'United Arab Emirates', 'Female', '78', '78.00000']
    ['2000', 'Antigua and Barbuda', 'Male', '72', '72.00000']
    ['1990', 'Antigua and Barbuda', 'Male', '17', '17.00000']
    ['2012', 'Antigua and Barbuda', 'Both sexes', '22', '22.00000']
    ['2012', 'Australia', 'Male', '81', '81.00000']
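As for the cause: as the error message says, in Python 3 csv.reader expects an iterator that yields strings, so the file has to be opened in text mode ('r'); opening it with 'rb' is the Python 2 convention that the reference material follows. I had also created the CSV file by hand in the Notepad++ editor, and on that point I came across the following explanation: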
Sorry, folks, we've got an understanding problem here. CSV files are typically NOT created by text editors. They are created e.g. by "save as csv" from a spreadsheet program, or as an output option by some database query program. They can have just about any character in a field, including \r and \n. Fields containing those characters should be quoted (just like a comma) by the csv file producer. A csv reader should be capable of reproducing the original field division. Here for example is a dump of a little file I just created using Excel 2003: ... This sentence in the documentation is NOT an error: """If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference."""
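Since reading and writing are the two main operations, here is a minimal round-trip sketch for Python 3. It is an illustration under a few assumptions: the file goes into a temporary directory rather than D:\\python, only three of the rows are used, and `newline=''` is passed to `open()` as the Python 3 csv documentation recommends. The last two lines show why the header above came out as ' "Country"': the space after the comma makes the quote part of the field, unless `skipinitialspace=True` is set.

```python
import csv
import os
import tempfile

rows = [
    ['Year', 'Country', 'Sex', 'Display Value', 'Numeric'],
    ['1990', 'Andorra', 'Both sexes', '77', '77.00000'],
    ['2012', 'Australia', 'Male', '81', '81.00000'],
]

# Hypothetical location: a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'test.csv')

# Write: text mode, newline='' so the csv module controls line endings
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

# Read: text mode again, never 'rb' in Python 3
with open(path, 'r', newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

# A space after the comma, as in the hand-written header line
header_line = '"Year", "Country"'
default_parse = next(csv.reader([header_line]))
lenient_parse = next(csv.reader([header_line], skipinitialspace=True))

print(read_back)
print(default_parse, lenient_parse)
```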
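4. XML
Next up, more bug fixing...
Again I typed in the code from a tutorial: first, Liao Xuefeng's parsers.expat example.
I named my Python file xml.py,
then copied the code in... and ended up with an error: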
    No module named 'xml.parsers'; 'xml' is not a package
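I guessed at all sorts of causes, but the three XML parsing approaches all ship with Python and need no extra packages. In the end I Googled it and finally found the problem.
The fix first: rename xml.py to something else, for example test.py.
The reason: when importing, the interpreter searches the script's own directory first, so my xml.py shadowed the standard-library xml package. Since xml.py is a plain module and not a package, it has no parsers submodule, and the import failed.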
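A quick way to check for this kind of shadowing is to ask Python where it loaded a module from. The sketch below is only an illustration; `module_origin` is a made-up helper name. When the standard library is picked up correctly, the path points into the Python installation and the module is a package; if a local xml.py shadows it, the path points at that file and the package flag is False.

```python
import importlib


def module_origin(name):
    """Return (file path, is_package) for an importable module name."""
    module = importlib.import_module(name)
    # Only packages have a __path__ attribute
    return getattr(module, '__file__', None), hasattr(module, '__path__')


origin, is_package = module_origin('xml')
print(origin, is_package)
```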
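Then I tried the code from another tutorial; there, the main problem was in the XML I had written.
Finally, here are the two versions of the code (they process the XML in different ways):
1) SAX streaming mode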
    from xml.parsers.expat import ParserCreate

    class DefaultSaxHandler(object):
        def start_element(self, name, attrs):
            print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

        def end_element(self, name):
            print('sax:end_element: %s' % name)

        def char_data(self, text):
            print('sax:char_data: %s' % text)

    xml = r'''<?xml version="1.0"?>
    <ol>
        <li><a href="/python">Python</a></li>
        <li><a href="/ruby">Ruby</a></li>
    </ol>
    '''

    handler = DefaultSaxHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(xml)
2) ElementTree
The XML:
    <?xml version="1.0"?>
    <menu>
      <breakfast hours="7-11">
        <item price="$6.00">breakfast burritos</item>
        <item price="$4.00">pancakes</item>
      </breakfast>
      <lunch hours="11-3">
        <item price="$5.00">hamburger</item>
      </lunch>
      <dinner hours="3-10">
        <item price="$8.00">spaghetti</item>
      </dinner>

    </menu>
Processing it in Python:
    >>> import xml.etree.ElementTree as et
    >>> tree = et.ElementTree(file='D:\\python\\data.xml')
    >>> root = tree.getroot()
    >>> root.tag
    'menu'
    >>> for child in root:
    ...     print('tag:', child.tag, 'attributes:', child.attrib)
    ...     for grandchild in child:
    ...         print('\ttag:', grandchild.tag, 'attributes:', grandchild.attrib)
    ...
Result:
    tag: breakfast attributes: {'hours': '7-11'}
            tag: item attributes: {'price': '$6.00'}
            tag: item attributes: {'price': '$4.00'}
    tag: lunch attributes: {'hours': '11-3'}
            tag: item attributes: {'price': '$5.00'}
    tag: dinner attributes: {'hours': '3-10'}
            tag: item attributes: {'price': '$8.00'}
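The same menu can also be parsed without a file on disk. The sketch below is a variation for illustration, not the tutorial's code: it embeds the XML in a string (`MENU_XML` is a name chosen here), parses it with `et.fromstring`, and reads element text and attributes with `find`, `get` and `.text`.

```python
import xml.etree.ElementTree as et

MENU_XML = '''<?xml version="1.0"?>
<menu>
  <breakfast hours="7-11">
    <item price="$6.00">breakfast burritos</item>
    <item price="$4.00">pancakes</item>
  </breakfast>
  <lunch hours="11-3">
    <item price="$5.00">hamburger</item>
  </lunch>
  <dinner hours="3-10">
    <item price="$8.00">spaghetti</item>
  </dinner>
</menu>
'''

# fromstring returns the root element directly
root = et.fromstring(MENU_XML)

# One (meal, dish, price) tuple per <item>, in document order
items = [(meal.tag, item.text, item.get('price'))
         for meal in root
         for item in meal]

# find() returns the first matching direct child
lunch_hours = root.find('lunch').get('hours')

for entry in items:
    print(entry)
print(lunch_hours)
```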
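There is also the DOM approach. Because DOM reads the whole XML document into memory and parses it into a tree, it uses a lot of memory and is slow to parse; its advantage is that any node of the tree can be visited freely. It is generally not the first choice.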
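For completeness, a small DOM sketch using the standard library's `xml.dom.minidom`. The XML string is a cut-down version of the menu, chosen here for illustration. It shows the "visit any node" property: from an `item` node one can move to its text child or back up to its parent.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<menu><lunch hours="11-3">'
    '<item price="$5.00">hamburger</item>'
    '</lunch></menu>'
)

# The whole document is in memory, so any node can be reached
item = doc.getElementsByTagName('item')[0]
price = item.getAttribute('price')
name = item.firstChild.data          # the text node inside <item>
parent_tag = item.parentNode.tagName
parent_hours = item.parentNode.getAttribute('hours')

print(name, price, parent_tag, parent_hours)
```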
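5. JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript. JSON uses a text format that is completely language independent, but it follows conventions familiar from the C family of languages (including C, C++, C#, Java, JavaScript, Perl, Python and others). These properties make JSON an ideal data-interchange language. It is easy for humans to read and write, and easy for machines to parse and generate (which is why it is commonly used to speed up data transfer over the network).
Once that is understood, it comes down to two calls:
json.dumps(menu) and json.loads(menu_json)
Note that the json module cannot serialize datetime objects by default; passing one to json.dumps raises a TypeError.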
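A minimal sketch of both calls and of the datetime limitation. The `menu` dictionary, the timestamp and the helper `encode_datetime` are made up for illustration; converting the datetime to an ISO 8601 string through the `default` parameter of `json.dumps` is one common workaround, not the only one.

```python
import json
from datetime import datetime

menu = {'breakfast': {'hours': '7-11', 'items': {'pancakes': '$4.00'}}}

menu_json = json.dumps(menu)        # dict -> JSON string
restored = json.loads(menu_json)    # JSON string -> dict


def encode_datetime(obj):
    """Fallback for json.dumps: turn datetime objects into ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError('Cannot serialize %r' % type(obj))


stamp = {'updated': datetime(2020, 1, 2, 3, 4, 5)}

try:
    json.dumps(stamp)               # no fallback given
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

stamp_json = json.dumps(stamp, default=encode_datetime)

print(menu_json)
print(plain_dump_failed, stamp_json)
```

Loading `stamp_json` back gives a plain string for `updated`; turning it into a datetime again is up to the caller.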
These are the common formats. Others, such as reading Excel and PDF files, I will write up later.