Handbook of Document Image Processing and Recognition

Editors: David Doermann (University of Maryland)
Karl Tombre (University of Lorraine)

Foreword

In the beginning, there was only OCR. After some false starts, OCR became a competitive commercial enterprise in the 1950’s. A decade later there were more than 50 manufacturers in the US alone. With the advent of microprocessors and inexpensive optical scanners, the price of OCR dropped from tens and hundreds of thousands of dollars to that of a bottle of wine. Software displaced the racks of electronics. By 1985 anybody could program and test their ideas on a PC, and then write a paper about it (and perhaps even patent it).

We know, however, very little about current commercial methods or in-house experimental results. Competitive industries have scarce motivation to publish (and their patents may only be part of their legal arsenal). The dearth of industrial authors in our publications is painfully obvious. Herbert Schantz’s book, The History of OCR, was an exception: he traced the growth of REI, which was one of the major success stories of the 1960’s and 1970’s. He also told the story, widely mirrored in sundry wikis and treatises on OCR, of the previous fifty years’ attempts to mechanize reading. Among other manufacturers of the period, IBM may have stood alone in publishing detailed (though often delayed) information about its products.

Of the 4000-8000 articles published since 1900 on character recognition (my estimate), at most a few hundred really bear on OCR (construed as machinery - now software - that converts visible language to a searchable digital format). The rest treat character recognition as a prototypical classification problem. It is, of course, researchers’ universal familiarity with at least some script that turned character recognition into the pre-eminent vehicle for demonstrating and illustrating new ideas in pattern recognition. Even though some of us cannot tell an azalea from a begonia, a sharp sign from a clef, a loop from a tented arch, an erythrocyte from a leukocyte, or an alluvium from an anticline, all of us know how to read.

Until about 30 years ago, OCR meant recognizing mono-spaced OCR fonts and typewritten scripts one character at a time – eventually at the rate of several thousand characters per second. Word recognition followed for reading difficult-to-segment typeset matter. The value of language models more elaborate than letter n-gram frequencies and lexicons without word frequencies gradually became clear. Because more than half of the world population is polyglot, OCR too became multilingual (as Henry Baird predicted that it must). This triggered a movement to post all the cultural relics of the past on the Web. Much of the material awaiting conversion, ancient and modern, stretches the limits of human readability. Like humans, OCR must take full advantage of syntax, style, context, and semantics.

Although many academic researchers are aware that OCR is much more than classification, they have yet to develop a viable, broad-range, end-to-end OCR system (but they may be getting close). A complete OCR system, with language and script recognition, colored print capability, column and line layout analysis, accurate character/word, numeric, symbol and punctuation recognition, language models, document-wide consistency, tuneability and adaptability, graphics subsystems, effectively embedded interactive error correction, and multiple output formats, is far more than the sum of its parts. Furthermore, specialized systems - for postal address reading, check reading, litigation, and bureaucratic forms processing - also require high throughput and different error-reject trade-offs. Real OCR simply isn’t an appropriate PhD dissertation project.

I never know whether to call hand print recognition and handwriting recognition “OCR,” but I abhor “intelligent” as a qualifier for the latest wrinkle. No matter: they are here to stay until tracing glyphs with a stylus goes the way of the quill. Both human and machine legibility of manuscripts depend significantly on the motivation of the writer: a hand-printed income tax return requesting a refund is likely to be more legible than one reporting an underpayment. Immediate feedback, the main advantage of on-line recognition, is a powerful form of motivation. Humans still learn better than machines.

Document Image Analysis (DIA) is a superset of OCR, but many of its other popular subfields require OCR. Almost all line drawings contain text. An E-sized telephone company drawing, for instance, has about 3000 words and numbers (including revision notices). Music scores contain numerals and instructions like pianissimo. A map without place names and elevations would have limited use. Mathematical expressions abound in digits and alphabetic fragments like log, limit, tan or argmin. Good lettering used to be a prime job qualification for the draftsmen who drew the legacy drawings that we are now converting to CAD. Unfortunately, commercial OCR systems, tuned to paragraph-length segments of text, do poorly on the alphanumeric fragments typical of such applications. When Open Source OCR matures, it will provide a fine opportunity for customization to specialized applications that have not yet attracted heavy-weight developers. In the meantime, the conversion of documents containing a mix of text and line art has given rise to distinct sub-disciplines with their own conference sessions and workshops that target graphics techniques like vectorization and complex symbol configurations.

Another subfield of DIA investigates what to do with automatically or manually transcribed books, technical journals, magazines and newspapers. Although Information Retrieval (IR) is not generally considered part of DIA or vice-versa, the overlap between them includes “logical” document segmentation, extraction of tables of contents, linking figures and illustrations to textual references, and word spotting. A recurring topic is assessing the effect of OCR errors on downstream applications. One factor that keeps the two disciplines apart is that IR experiments (e.g., TREC) typically involve orders of magnitude more documents than DIA experiments because the number of characters in any collection is far smaller than the number of pixels.

Computer vision used to be easily distinguished from the image processing aspects of DIA by its emphasis on illumination and camera position. The border is blurring because even cellphone cameras now offer sufficient spatial resolution for document image capture at several hundred dpi as well as for legible text in large scene images. The correction of the contrast and geometric distortions in the resulting images goes well beyond what is required for scanned documents.

This collection suggests that we are still far from a unified theory of DIA or even OCR. The Handbook is all the more useful because we have no choice except to rely on heuristics or algorithms based on questionable assumptions. The most useful methods available to us were all invented rather than derived from prime principles. When the time is ripe, many alternative methods are invented to fill the same need. They all remain entrenched candidates for “best practice”. This Handbook presents them fairly, but generally avoids picking winners and losers.

“Noise” appears to be the principal obstacle to better results. This is all the more irritating because many types of noise (e.g. skew, bleed-through, underscore) barely slow down human readers. We have not yet succeeded in characterizing and quantifying signal and noise to the extent that communications science has. Although OCR and DIA are prime examples of information transfer, information-theoretic concepts are seldom invoked. Are we moving in the right direction by accumulating empirical midstream comparisons – often on synthetic data – from contests organized by individual research groups in conjunction with our conferences?

Be that as it may, as one is getting increasingly forgetful it is reassuring to have most of the elusive information about one’s favorite topics at arm’s reach in a fat tome like this one. Much as on-line resources have improved over the past decade, I like to turn down the corner of the page and scribble a note in the margin. Younger folks, who prefer search-directed saccades to an old-fashioned linear presentation, may want the on-line version.

David Doermann and Karl Tombre were exceptionally well qualified to plan, select, solicit, and edit this compendium. Their contributions to DIA cover a broad swath and, as far as I know, they have never let the song of the sirens divert them from the muddy and winding channels of DIA. Their technical contributions are well referenced by the chapter authors and their voice is heard at the beginning of each section.

Dave is the co-founding-editor of IJDAR, which became our flagship journal when PAMI veered towards computer vision and machine learning. Along with the venerable PR and the high-speed, high-volume PRL, IJDAR has served us well with a mixture of special issues, surveys, experimental reports, and new theories. Even earlier, with the encouragement of Azriel Rosenfeld, Dave organized and directed the Language and Media Processing Laboratory, which has become a major resource of DIA data sets, code, bibliographies, and expertise.

Karl, another IJDAR co-founder, put Nancy on the map as one of the premier global centers of DIA research and development. Beginning with a sustained drive to automate the conversion of legacy drawings to CAD formats (drawings for a bridge or a sewer line may have a lifetime of over a hundred years, and the plans for the still-flying Boeing 747 were drawn by hand), Karl brought together and expanded the horizons of University and INRIA researchers to form a critical mass of DIA.

Dave and Karl have also done more than their share to bring our research community together, find common terminology and data, create benchmarks, and advance the state of the art. These big patient men have long been a familiar sight at our conferences, always ready to resolve a conundrum, provide a missing piece of information, fill in for an absentee session chair or speaker, or introduce folks who should know each other.

The DIA community has every reason to be grateful to the editors and authors of this timely and comprehensive collection. Enjoy, and work hard to make a contribution to the next edition!

Part A Introduction, Background, Fundamentals
1 A Brief History of Documents and Writing Systems
2 Document Creation, Image Acquisition and Document Quality
3 The Evolution of Document Image Analysis
4 Imaging Techniques in Document Analysis Processes

Part B Page Analysis
5 Page Segmentation Techniques in Document Analysis
6 Analysis of the Logical Layout of Documents
7 Page Similarity and Classification

Part C Text Recognition
8 Text Segmentation for Document Recognition
9 Language, Script, and Font Recognition
10 Machine-Printed Character Recognition
11 Handprinted Character and Word Recognition
12 Continuous Handwritten Script Recognition
13 Middle Eastern Character Recognition
14 Asian Character Recognition

Volume 2

Part D Processing of Non-textual Information
15 Graphics Recognition Techniques
16 An Overview of Symbol Recognition
17 Analysis and Interpretation of Graphical Documents
18 Logo and Trademark Recognition
19 Recognition of Tables and Forms
20 Processing Mathematical Notation

Part E Applications
21 Document Analysis in Postal Applications and Check Processing
22 Analysis and Recognition of Music Scores
23 Analysis of Documents Born Digital
24 Image Based Retrieval and Keyword Spotting in Documents
25 Text Localization and Recognition in Images and Video

Part F Analysis of Online Data
26 Online Handwriting Recognition
27 Online Signature Verification
28 Sketching Interfaces

Part G Evaluation and Benchmarking
29 Datasets and Annotations for Document Analysis and Recognition
30 Tools and Metrics for Document Analysis Systems Evaluation

Index

Original article: https://www.cnblogs.com/2008nmj/p/12185468.html

Time: 2024-10-19 13:04:28
