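Introduction: Tesseract is a well-known open-source OCR engine that recognizes the text in an image and converts it to plain text. This post briefly covers how to install it and how to run basic recognition with it.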
1. Tesseract-OCR
The Tesseract project is now hosted entirely on GitHub, where all of the main information about it can be found.
URL: https://github.com/tesseract-ocr/tesseract
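2. Installing Tesseract-OCR
Installation on Windows is very simple: just run the installer. This section focuses on CentOS. As the yum transcripts below show, the tesseract packages come from the EPEL repository, so EPEL has to be enabled. Note also that installing language packs involves a short wait while they download.
Environment: CentOS 7, JDK 8
step 1: yum search tesseract-ocr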
# yum search tesseract-ocr
Loaded plugins: langpacks
======================= Matched: tesseract-ocr =======================
tesseract.x86_64 : Raw OCR Engine
tesseract-devel.x86_64 : Development files for tesseract
tesseract-langpack-afr.noarch : Afrikaans language data for tesseract
tesseract-langpack-amh.noarch : Amharic language data for tesseract
tesseract-langpack-ara.noarch : Arabic language data for tesseract
tesseract-langpack-asm.noarch : Assamese language data for tesseract
tesseract-langpack-aze.noarch : Azerbaijani language data for tesseract
tesseract-langpack-aze_cyrl.noarch : "Azerbaijani language data for tesseract
tesseract-langpack-bel.noarch : Belarusian language data for tesseract
tesseract-langpack-ben.noarch : Bengali language data for tesseract
tesseract-langpack-bod.noarch : "Tibetan language data for tesseract
tesseract-langpack-bos.noarch : Bosnian language data for tesseract
tesseract-langpack-bul.noarch : Bulgarian language data for tesseract
tesseract-langpack-cat.noarch : Catalan language data for tesseract
tesseract-langpack-ceb.noarch : Cebuano language data for tesseract
............
step 2: yum install tesseract.x86_64
# yum install tesseract.x86_64
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package tesseract.x86_64 0:3.04.00-3.el7 will be installed
--> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el7.x86_64
--> Processing Dependency: libicuuc.so.50()(64bit) for package: tesseract-3.04.00-3.el7.x86_64
--> Processing Dependency: libicui18n.so.50()(64bit) for package: tesseract-3.04.00-3.el7.x86_64
--> Running transaction check
---> Package leptonica.x86_64 0:1.72-2.el7 will be installed
---> Package libicu.x86_64 0:50.1.2-15.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================
 Package       Arch      Version          Repository     Size
==================================================================
Installing:
 tesseract     x86_64    3.04.00-3.el7    epel           11 M
Installing for dependencies:
 leptonica     x86_64    1.72-2.el7       epel          928 k
 libicu        x86_64    50.1.2-15.el7    base          6.9 M

Transaction Summary
==================================================================
Install  1 Package (+2 Dependent packages)

Total download size: 19 M
Installed size: 67 M
Is this ok [y/d/N]: y
Downloading packages:
(1/3): leptonica-1.72-2.el7.x86_64.rpm       | 928 kB  00:00:00
(2/3): libicu-50.1.2-15.el7.x86_64.rpm       | 6.9 MB  00:00:07
(3/3): tesseract-3.04.00-3.el7.x86_64.rpm    |  11 MB  00:00:11
------------------------------------------------------------------
Total                               1.7 MB/s |  19 MB  00:00:11
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : leptonica-1.72-2.el7.x86_64       1/3
  Installing : libicu-50.1.2-15.el7.x86_64       2/3
  Installing : tesseract-3.04.00-3.el7.x86_64    3/3
  Verifying  : tesseract-3.04.00-3.el7.x86_64    1/3
  Verifying  : libicu-50.1.2-15.el7.x86_64       2/3
  Verifying  : leptonica-1.72-2.el7.x86_64       3/3

Installed:
  tesseract.x86_64 0:3.04.00-3.el7

Dependency Installed:
  leptonica.x86_64 0:1.72-2.el7    libicu.x86_64 0:50.1.2-15.el7

Complete!
step 3: install the devel package
# yum install tesseract-devel.x86_64 tesseract-osd.x86_64
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package tesseract-devel.x86_64 0:3.04.00-3.el7 will be installed
--> Processing Dependency: pkgconfig(lept) for package: tesseract-devel-3.04.00-3.el7.x86_64
--> Running transaction check
---> Package leptonica-devel.x86_64 0:1.72-2.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================
 Package            Arch      Version          Repository    Size
==================================================================
Installing:
 tesseract-devel    x86_64    3.04.00-3.el7    epel          80 k
Installing for dependencies:
 leptonica-devel    x86_64    1.72-2.el7       epel         108 k

Transaction Summary
==================================================================
Install  1 Package (+1 Dependent package)

Total download size: 188 k
Installed size: 1.1 M
Is this ok [y/d/N]: y
Downloading packages:
(1/2): tesseract-devel-3.04.00-3.el7.x86_64.rpm    |  80 kB  00:00:00
(2/2): leptonica-devel-1.72-2.el7.x86_64.rpm       | 108 kB  00:00:00
------------------------------------------------------------------
Total                                     738 kB/s | 188 kB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : leptonica-devel-1.72-2.el7.x86_64    1/2
  Installing : tesseract-devel-3.04.00-3.el7.x86_64 2/2
  Verifying  : leptonica-devel-1.72-2.el7.x86_64    1/2
  Verifying  : tesseract-devel-3.04.00-3.el7.x86_64 2/2

Installed:
  tesseract-devel.x86_64 0:3.04.00-3.el7

Dependency Installed:
  leptonica-devel.x86_64 0:1.72-2.el7

Complete!
step 4: install the language packs tesseract-langpack-chi_sim.noarch and tesseract-langpack-chi_tra.noarch
# yum install tesseract-langpack-chi_sim.noarch
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package tesseract-langpack-chi_sim.noarch 0:3.04.00-3.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================
 Package                       Arch      Version          Repository    Size
==================================================================
Installing:
 tesseract-langpack-chi_sim    noarch    3.04.00-3.el7    epel          15 M

Transaction Summary
==================================================================
Install  1 Package

Total download size: 15 M
Installed size: 40 M
Is this ok [y/d/N]: y
Downloading packages:
tesseract-langpack-chi_sim-3.04.00-3.el7.noarch.rpm    |  15 MB  00:00:15
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : tesseract-langpack-chi_sim-3.04.00-3.el7.noarch    1/1
  Verifying  : tesseract-langpack-chi_sim-3.04.00-3.el7.noarch    1/1

Installed:
  tesseract-langpack-chi_sim.noarch 0:3.04.00-3.el7

Complete!
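The four steps above can also be collapsed into a single non-interactive yum call. The following is only a sketch under a few assumptions: EPEL is already enabled, the package names follow the pattern seen in the search output, and `tess_packages` is a hypothetical helper name introduced here for illustration.

```shell
# tess_packages LANG...: print the yum package names for tesseract, its
# devel files, and one language pack per LANG code (for example chi_sim).
# Hypothetical helper; the naming pattern is taken from the yum output above.
tess_packages() {
    local lang
    printf '%s' "tesseract tesseract-devel"
    for lang in "$@"; do
        printf ' %s' "tesseract-langpack-$lang"
    done
    printf '\n'
}

# Usage (as root; -y answers the "Is this ok" prompt automatically):
#   yum install -y $(tess_packages chi_sim chi_tra)
```

Keeping the package list in a function makes it easy to print and review before anything is installed.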
3. Using Tesseract-OCR
a. Recognizing the text in an image
Command format:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
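Example: tesseract ttest.png out -l lang-type
Here lang-type is a language code such as eng or chi_sim, and the recognized text is written to out.txt.
Two sample images are used below, one Chinese and one English, to see how well the OCR performs.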
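The command format above can be wrapped in a small function so that the output name is derived from the image name. This is a minimal sketch, not part of tesseract itself: `ocr_image` is a hypothetical name, and defaulting to `eng` is an assumption made here.

```shell
# ocr_image IMAGE [LANG]: run tesseract on IMAGE and print the path of the
# resulting text file. Hypothetical wrapper; LANG defaults to eng (assumption).
ocr_image() {
    local img="$1" lang="${2:-eng}" dir name base
    if [ ! -f "$img" ]; then
        echo "ocr_image: no such file: $img" >&2
        return 1
    fi
    dir=$(dirname -- "$img")
    name=$(basename -- "$img")
    base="$dir/${name%.*}"      # chin-ocr.png -> ./chin-ocr
    # tesseract appends .txt to the output base by itself
    tesseract "$img" "$base" -l "$lang" >/dev/null 2>&1 || return 1
    printf '%s\n' "$base.txt"
}

# Usage:
#   cat "$(ocr_image chin-ocr.png chi_sim)"
```

The function only prints the output path, so the caller decides whether to view, move, or post-process the text.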
b. Checking which languages tesseract has available
# tesseract --list-langs
List of available languages (4):
eng
osd
chi_tra
chi_sim
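The output lists four entries. Three are languages (eng, chi_tra, chi_sim); osd is not a language but the data used for orientation and script detection.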
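Before running OCR in a script, it can be useful to check that the required language data is actually installed. A minimal sketch, assuming the one-entry-per-line format of `--list-langs`; `has_lang` is a hypothetical helper name. Some tesseract versions print this list on stderr rather than stdout, so the usage line merges both streams.

```shell
# has_lang LANG: read a language list on stdin and succeed only if LANG
# appears as a complete line (so "chi" does not match "chi_sim").
has_lang() {
    grep -qxF -- "$1"
}

# Usage:
#   tesseract --list-langs 2>&1 | has_lang chi_sim && echo "chi_sim is available"
```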
c. OCR on Chinese text
Original image:
Run OCR with the command: tesseract chin-ocr.png chin-out -l chi_sim
Result:
# tesseract chin-ocr.png chin-out -l chi_sim
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
# cat chin-out.txt
11月17日痿言 ′ 文童发文透露租妻子马伊蜊合作的新剧 (剃刀边缘) 快要刮作完 成) 感慨良多′他自称 ″过街者冒″ 租 ″笨人″ ′直言自己虽然忍不任茌片场发脾气′ 但 ″i人亘″ 二字是心安理才寻她受了′
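As the output shows, the recognition accuracy leaves room for improvement: many characters were misrecognized. Note that the image has a watermark in the background, which caused some interference.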
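When accuracy is poor, one inexpensive experiment is to rerun the same image under several page segmentation modes (the `-psm` option in the command format from section 3a) and compare the outputs. The sketch below does that. `try_psm` is a hypothetical helper name, and nothing guarantees that any mode compensates for a watermark. Mode 3 is the default fully automatic segmentation and mode 6 treats the image as a single uniform block of text. Newer 4.x releases spell the option `--psm`.

```shell
# try_psm IMAGE LANG MODE...: run tesseract once per page segmentation mode
# and print the path of each text file written (IMAGE-psmMODE.txt).
# Hypothetical helper; uses the -psm spelling of tesseract 3.04.
try_psm() {
    local img="$1" lang="$2" base mode status=0
    shift 2
    base="${img%.*}"            # chin-ocr.png -> chin-ocr
    for mode in "$@"; do
        if tesseract "$img" "$base-psm$mode" -l "$lang" -psm "$mode" >/dev/null 2>&1; then
            printf '%s\n' "$base-psm$mode.txt"
        else
            echo "try_psm: mode $mode failed" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   try_psm chin-ocr.png chi_sim 3 6
```

Each mode writes to its own file, so the results can be compared side by side.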
d. OCR on English text
Original image:
Run OCR with the command: tesseract english-ocr.png eng-ocr -l eng
Result:
# tesseract english-ocr.png eng-ocr -l eng
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
# cat eng-ocr.txt
I have lived in China for a long time and we all like it very much. We do have it done. It is very funny in a good lucky state.
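This time the OCR result is very good, although the image contained very little interference.
4. Summary
This post only briefly covered installation and basic usage. For anything further, consult the tesseract project itself and experiment on your own.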