Python has dedicated libraries for image recognition.
The one I use is pytesseract.
About pytesseract
Python-tesseract is a wrapper for google’s Tesseract-OCR
( http://code.google.com/p/tesseract-ocr/ ). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text instead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.
In short:
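1. Python-tesseract is a wrapper around Google's Tesseract-OCR, and it can also be run as a stand-alone script.
2. Python-tesseract recognizes the text in an image file and returns the recognized text as its return value.
3. Tesseract-OCR itself supports only tiff and bmp images by default; because Python-tesseract reads images through PIL, it can also handle jpeg, gif, png and other formats.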
So what is PIL?
PIL (Python Imaging Library) is the de facto standard image processing library for Python. It is very powerful, yet its API is simple and easy to use.
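To give a feel for that API, here is a minimal sketch. It assumes Python 3 with Pillow (the maintained fork that provides the `PIL` module) and builds a blank image in memory, so the sizes and colors are made up for illustration and no file is needed.

```python
from PIL import Image

# Build a small in-memory image so the example needs no file on disk.
image = Image.new("RGB", (120, 40), "white")

# Convert to 8-bit grayscale, a common first step before OCR.
gray = image.convert("L")

# Scale the grayscale image to twice its size.
bigger = gray.resize((gray.width * 2, gray.height * 2))

print(image.mode, image.size)
print(gray.mode, bigger.size)
```

The same `convert` and `resize` calls work on an image loaded with `Image.open`, which is how the scripts in this post obtain their images.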
Installing PIL
On Debian/Ubuntu Linux, install it directly with apt:

    $ sudo apt-get install python-imaging
On Mac and other Linux distributions, install it with easy_install or pip; set up the build environment first:

    $ sudo easy_install PIL
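If the installation fails, install the missing packages (openjpeg, for example) as the error message indicates. On current systems the original PIL often no longer installs; its maintained fork Pillow (pip install Pillow) provides the same PIL module.
On Windows, download the exe installer from the official PIL website.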
Installing pytesseract
pytesseract can be installed directly with pip:

    $ sudo pip install pytesseract
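Image-to-text test
With everything installed on Ubuntu, I ran a quick test: I picked a simple captcha image, test.png, and put it in the same directory as the script.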

    # -*- coding: utf-8 -*-
    import pytesseract
    from PIL import Image

    # Open the captcha image and run OCR on it
    image = Image.open('test.png')
    code = pytesseract.image_to_string(image)
    print(code)

This raised an error:
OSError: [Errno 2] No such file or directory
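Huh???
At first I thought the image file could not be read.
After searching online, it turned out that tesseract-ocr itself was not installed: pytesseract only wraps the tesseract executable, so that has to be installed separately.
So install it:

    $ sudo apt-get install tesseract-ocr

Now it works perfectly.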
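Since the error message does not say which file is missing, a small check before calling pytesseract can make the cause obvious. The sketch below assumes Python 3 (for `shutil.which`); the helper names `find_executable` and `describe_tesseract` are my own and not part of pytesseract.

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def describe_tesseract(name="tesseract"):
    """Return a human-readable message about whether the binary is available."""
    path = find_executable(name)
    if path is None:
        return "%s not found on PATH; install tesseract-ocr first" % name
    return "%s found at %s" % (name, path)


print(describe_tesseract())
```

If the binary is installed somewhere outside PATH, pytesseract lets you point at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.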
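Post-processing the recognized text
Continuing with my script, I noticed that the captchas consist only of digits,
so digits that tend to be misread as letters need to be mapped back: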

    change = {
        'O': '0', 'o': '0',
        'I': '1', 'i': '1', 'L': '1', 'l': '1',
        'Z': '2', 'z': '2',
        'e': '3',
        'a': '4',
        'S': '5', 's': '5',
        'b': '6',
        'T': '7', 't': '7',
        'q': '9',
    }

To apply the replacements:

    for x in change:
        text = text.replace(x, change[x])

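As an alternative to the replace loop, the same mapping can be applied in one pass with a translation table. This is only a sketch, assuming Python 3 (`str.maketrans`); the function name `to_digits` and the extra step of dropping every non-digit character (spaces, newlines, stray punctuation) are my additions, not part of the original script.

```python
# Same letter-to-digit mapping as above, written as a plain dict.
CHANGE = {
    "O": "0", "o": "0",
    "I": "1", "i": "1", "L": "1", "l": "1",
    "Z": "2", "z": "2",
    "e": "3",
    "a": "4",
    "S": "5", "s": "5",
    "b": "6",
    "T": "7", "t": "7",
    "q": "9",
}

# str.maketrans accepts a dict whose keys are single characters.
TABLE = str.maketrans(CHANGE)


def to_digits(text):
    """Map look-alike letters to digits, then drop anything that is not 0-9."""
    mapped = text.translate(TABLE)
    return "".join(ch for ch in mapped if ch in "0123456789")


print(to_digits("1O2l\n"))  # prints 1021
```

Dropping the non-digit characters matters because OCR output often carries trailing whitespace, which would otherwise be submitted as part of the captcha.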
Putting it all together:

    # -*- coding: utf-8 -*-
    import pytesseract
    from PIL import Image

    image = Image.open('test.png')
    code = pytesseract.image_to_string(image)

    # Letters that OCR tends to produce in place of digits
    change = {
        'O': '0', 'o': '0',
        'I': '1', 'i': '1', 'L': '1', 'l': '1',
        'Z': '2', 'z': '2',
        'e': '3',
        'a': '4',
        'S': '5', 's': '5',
        'b': '6',
        'T': '7', 't': '7',
        'q': '9',
    }

    for x in change:
        code = code.replace(x, change[x])

    print(code)

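Downloading an image with Python
The following script fetches an image and writes it to a local file:

    # -*- coding: utf-8 -*-
    import requests

    r = requests.get(url="http://example/test.php")
    data = r.content

    # open() works in both Python 2 and 3 (file() exists only in Python 2);
    # the with block closes the file automatically
    with open("captchatest.png", "wb") as f:
        f.write(data)
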
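The download script writes whatever the server returns, so an HTML error page would silently end up in captchatest.png. A hedged sketch of a sanity check follows; it assumes the captcha really is a PNG, as the file name suggests, and the helper names `looks_like_png` and `save_image` are hypothetical.

```python
# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """True if the bytes start with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


def save_image(data, path):
    """Write data to path only if it looks like a PNG; return True on success."""
    if not looks_like_png(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True
```

In the script above this would be called as `save_image(r.content, "captchatest.png")`; a `False` result means the response was not a PNG and is worth printing for inspection.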
Login script
The final login script:

    # -*- coding: utf-8 -*-
    import requests
    import pytesseract
    from PIL import Image

    # One session, so the captcha request and the login share the same cookies
    s = requests.session()

    def change_to_string():
        image = Image.open('captchatest.png')
        code = pytesseract.image_to_string(image)
        # Letters that OCR tends to produce in place of digits
        change = {
            'O': '0', 'o': '0',
            'I': '1', 'i': '1', 'L': '1', 'l': '1',
            'Z': '2', 'z': '2',
            'e': '3',
            'a': '4',
            'S': '5', 's': '5',
            'b': '6',
            'T': '7', 't': '7',
            'q': '9',
        }
        for x in change:
            code = code.replace(x, change[x])
        return code

    # Fetch the captcha image and save it locally
    r = s.get(url="http://example/login.php")
    with open("captchatest.png", "wb") as f:
        f.write(r.content)

    # print(change_to_string())
    rr = s.post(url="http://example/login.php",
                data={'username': 'c014', 'password': 'c014', 'captcha': change_to_string()})
    print("---------------------")
    print(rr.content)
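
OCR does not get every captcha right, so it can help to reject obviously wrong results and try again with a fresh image before posting. The sketch below assumes Python 3 and a 4-digit captcha; the length, the retry count and both function names are assumptions of mine, and the fetch-and-recognize step is passed in as a callable so the logic can be tried without a server.

```python
def is_plausible_code(code, length=4):
    """True if code is exactly `length` ASCII digits (the length is an assumption)."""
    return len(code) == length and all(ch in "0123456789" for ch in code)


def solve_with_retries(fetch_and_recognize, max_attempts=5, length=4):
    """Call fetch_and_recognize() until it returns a plausible code.

    Returns the code, or None if every attempt produced an implausible result.
    """
    for _ in range(max_attempts):
        code = fetch_and_recognize().strip()
        if is_plausible_code(code, length):
            return code
    return None


# Stand-in for "download a fresh captcha and run OCR on it".
fake_results = iter(["12 4", "", "5678\n"])
print(solve_with_retries(lambda: next(fake_results)))  # prints 5678
```

In the login script, the callable would download a new captcha through the same session `s` and return `change_to_string()`, so that the captcha being solved is the one the server expects for that session.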