Improving the quality of the output

There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language, retraining Tesseract is unlikely to help.

Image processing

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by setting the configuration variable tessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.
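A minimal sketch of how that variable might be set from a script, assuming the tesseract executable is on the PATH and accepts -c name=value pairs. The function name and file names are hypothetical.

```python
def debug_command(image_path, output_base):
    """Build a tesseract command line that also writes the processed image."""
    return [
        "tesseract", image_path, output_base,
        # Ask Tesseract to save its internally processed input image.
        "-c", "tessedit_write_images=true",
    ]


print(" ".join(debug_command("page.png", "page")))
```

The list can be handed to subprocess.run as-is; the processed image should then appear as tessinput.tif, as described above.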

Rescaling

Tesseract works best on images with a resolution of at least 300 dpi, so it may be beneficial to resize images. For more information, see the FAQ.
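As a rough illustration, the target pixel size can be worked out from the scan resolution. This sketch assumes the current dpi of the image is known; the function name is hypothetical, and the actual resampling would be left to an image library.

```python
def rescaled_size(width, height, current_dpi, target_dpi=300):
    """Return the pixel size an image needs to reach target_dpi.

    Images already at or above the target are left unchanged.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    if current_dpi >= target_dpi:
        return width, height
    factor = target_dpi / current_dpi
    return round(width * factor), round(height * factor)


# A 600x800 scan at 150 dpi needs doubling in each dimension.
print(rescaled_size(600, 800, 150))
```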

Binarisation

Binarisation converts an image to black and white. Tesseract does this internally, but the result can be suboptimal, particularly if the page background is of uneven darkness.
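To make the idea concrete, here is a sketch of binarising an image yourself with a global Otsu threshold. Otsu's method is chosen here only for illustration, and the image is assumed to be a list of rows of grey values from 0 to 255. A single global threshold can itself struggle with uneven backgrounds, where a locally adaptive method may do better.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for row in pixels:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance: larger means the two groups are better separated.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarise(pixels):
    """Map pixels at or below the Otsu threshold to 0 (black), others to 255."""
    threshold = otsu_threshold(pixels)
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]
```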

Noise Removal

Noise is random variation of brightness or colour in an image that can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
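One common way to suppress isolated specks is a small median filter. The sketch below is an illustration rather than anything Tesseract is known to do, and it assumes a greyscale image stored as a list of rows. Note that a median filter can also erase thin strokes or small punctuation, so the result is worth checking by eye.

```python
def median_filter(pixels):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Neighbourhoods are clipped at the image edges; for the resulting
    even-sized windows the upper median is used.
    """
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = sorted(
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            row.append(window[len(window) // 2])
        result.append(row)
    return result
```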

Rotation / Deskewing

A skewed image results when a page is not straight when it is scanned. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this, rotate the page image so that the text lines are horizontal.
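The geometry behind deskewing can be sketched as follows, assuming two points on a text baseline have already been found by some other means (that detection step is not shown). Both function names are hypothetical.

```python
import math


def skew_angle_degrees(baseline_start, baseline_end):
    """Angle of a text baseline relative to horizontal, in degrees.

    Points are (x, y) pixel coordinates with y increasing downwards,
    so a positive result means the line slopes down to the right.
    """
    (x0, y0), (x1, y1) = baseline_start, baseline_end
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def rotate_point(point, centre, degrees):
    """Rotate point about centre by the given angle in the same coordinate frame."""
    angle = math.radians(degrees)
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (
        centre[0] + dx * math.cos(angle) - dy * math.sin(angle),
        centre[1] + dx * math.sin(angle) + dy * math.cos(angle),
    )


start, end = (0, 100), (1000, 135)
angle = skew_angle_degrees(start, end)
# Rotating by the opposite angle brings the end point level with the start.
levelled_end = rotate_point(end, start, -angle)
```

Applying the same rotation to the whole image is what an image library's rotate function would do; sign conventions differ between libraries, so check the direction on a test page.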

Border Removal

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
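A simple way to trim such borders is to drop outer rows and columns that are mostly dark. This is only a sketch: the image is assumed to be a greyscale list of rows, and the function name and both thresholds are arbitrary illustrative choices that would need tuning for real scans.

```python
def crop_dark_border(pixels, dark_below=64, max_dark_fraction=0.5):
    """Trim outer rows and columns that are mostly dark.

    A row or column counts as border when more than max_dark_fraction
    of its pixels are darker than dark_below.
    """
    def mostly_dark(line):
        dark = sum(1 for value in line if value < dark_below)
        return dark > max_dark_fraction * len(line)

    top, bottom = 0, len(pixels)
    while top < bottom and mostly_dark(pixels[top]):
        top += 1
    while bottom > top and mostly_dark(pixels[bottom - 1]):
        bottom -= 1

    rows = pixels[top:bottom]
    if not rows:
        return []

    def column(x):
        return [row[x] for row in rows]

    left, right = 0, len(rows[0])
    while left < right and mostly_dark(column(left)):
        left += 1
    while right > left and mostly_dark(column(right - 1)):
        right -= 1
    return [row[left:right] for row in rows]
```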

Tools / Libraries

Examples

If you need an example of how to improve image quality programmatically, have a look at these examples:

Page segmentation method

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode using the -psm argument. Note that adding a white border to text which is too tightly cropped may also help; see issue 398.
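Adding a white border can be sketched like this for a greyscale image held as a list of rows. The function name and the default margin of 10 pixels are assumptions made for the example, not values recommended by the Tesseract project.

```python
def add_white_border(pixels, margin=10, white=255):
    """Surround a greyscale image with a white margin of the given width."""
    width = len(pixels[0]) + 2 * margin
    top = [[white] * width for _ in range(margin)]
    bottom = [[white] * width for _ in range(margin)]
    middle = [[white] * margin + list(row) + [white] * margin for row in pixels]
    return top + middle + bottom
```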

To see a complete list of supported page segmentation modes, use tesseract -h. Here's the list as of 3.04:

 0   Orientation and script detection (OSD) only.
 1   Automatic page segmentation with OSD.
 2   Automatic page segmentation, but no OSD, or OCR.
 3   Fully automatic page segmentation, but no OSD. (Default)
 4   Assume a single column of text of variable sizes.
 5   Assume a single uniform block of vertically aligned text.
 6   Assume a single uniform block of text.
 7   Treat the image as a single text line.
 8   Treat the image as a single word.
 9   Treat the image as a single word in a circle.
10   Treat the image as a single character.
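Choosing a mode from a script might look like the sketch below. It assumes the tesseract executable is on the PATH and uses the -psm spelling given above for 3.04; the function and file names are hypothetical.

```python
PAGE_SEGMENTATION_MODES = range(0, 11)  # modes 0-10 as listed for 3.04


def psm_command(image_path, output_base, mode):
    """Build a tesseract command line that selects a page segmentation mode."""
    if mode not in PAGE_SEGMENTATION_MODES:
        raise ValueError(f"unsupported page segmentation mode: {mode}")
    # -psm is the spelling given for 3.04; later releases may spell it --psm,
    # so check `tesseract -h` on your build.
    return ["tesseract", image_path, output_base, "-psm", str(mode)]


# Mode 7: treat the image as a single text line.
print(" ".join(psm_command("line.png", "line", 7)))
```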

Dictionaries, word lists, and patterns

By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.

Disabling the dictionaries Tesseract uses should improve recognition accuracy if most of your text isn't dictionary words. They can be disabled by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual.

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.
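The variables mentioned in this section could be passed from a script roughly as follows. This is a sketch under assumptions: it relies on tesseract accepting -c name=value pairs for these variables, which may depend on the release (a config file is the alternative), and the helper and file names are hypothetical.

```python
def config_args(variables):
    """Turn {name: value} into repeated `-c name=value` arguments."""
    args = []
    for name, value in variables.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["-c", f"{name}={value}"]
    return args


# Text that is mostly codes rather than dictionary words.
no_dictionaries = {"load_system_dawg": False, "load_freq_dawg": False}
# Input known to contain only digits.
digits_only = {"tessedit_char_whitelist": "0123456789"}

command = ["tesseract", "receipt.png", "receipt"] + config_args(
    {**no_dictionaries, **digits_only}
)
print(" ".join(command))
```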

Still having problems?

If you've tried the above and are still getting low-accuracy results, ask on the forum for help, ideally posting an example image.
