Optical character recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, to searchable, editable data. Paper documents, such as brochures, invoices, and contracts, are often sent via email. This usually involves a scanner that converts the document into a grid of colored pixels, known as a raster image. To extract the data and repurpose the content of the document, an OCR engine is necessary. The OCR engine detects the characters present in the image, assembles those characters into words and then into sentences, and so enables you to search and edit the content of the document.
Tesseract is an open-source optical character recognition engine, widely regarded as one of the most accurate OCR engines currently available. It is licensed under Apache 2.0, and its development was sponsored by Google starting in 2006.
Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images is converted to a searchable and selectable document.
You can download the OCR processor product setup here.
The following assemblies are required to deploy Essential PDF and the OCR process.
Syncfusion assemblies
Tesseract assemblies
To reference the OCR assemblies in a .NET project:
1. To perform optical character recognition, first create the OCR processor by instantiating the OCRProcessor class. The constructor requires the path of the folder that contains the Tesseract binaries, SyncfusionTesseract.dll and liblept168.dll.
//Initializes the OCR processor by providing Tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
//to the OCR processor overload.
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries");
2. Load the PDF document that needs optical character recognition by using the PdfLoadedDocument class.
//Loads a PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("Input.pdf");
3. Next, set the language for the OCR process and run OCR, passing the folder that contains the language dictionary (tessdata). Tesseract supports a variety of languages. The following code shows the OCR process for English and how to provide the English dictionary as input.
//Sets OCR language to process.
processor.Settings.Language = "eng";
//Processes OCR by providing PDF document, data dictionary, and language.
processor.PerformOCR(loadedDocument, @"Tessdata");
Note: You can get the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) and the language pack (tessdata) by downloading the OCR processor zip file from the following location: https://www.syncfusion.com/downloads/latest-version
4. The final step is to save the PDF document and dispose of the PdfLoadedDocument object. The saved PDF document now contains the contents in a searchable form.
//Saves the OCR-processed PDF document to disk.
loadedDocument.Save("Sample.pdf");
loadedDocument.Close(true);
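Because the OCRProcessor constructor and PerformOCR both take folder paths, a missing binary or dictionary file is an easy mistake to make. The following is a minimal sketch of a pre-flight check that could run before step 1. The names OcrPreflight and FindMissingFiles are hypothetical and not part of Essential PDF. The sketch also assumes that each language dictionary is stored as a file named after its language code with a .traineddata extension, which is the naming used in the tessdata repository.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of Essential PDF: reports which of the files
// needed by the OCR steps above are missing from the given folders.
public static class OcrPreflight
{
    // Binaries named in step 1 of this post.
    private static readonly string[] RequiredBinaries =
    {
        "SyncfusionTesseract.dll",
        "liblept168.dll"
    };

    // Returns the paths of the expected files that do not exist.
    // An empty list means every expected file was found.
    public static List<string> FindMissingFiles(
        string binariesFolder, string tessdataFolder, string languageCode)
    {
        var missing = new List<string>();

        foreach (string binary in RequiredBinaries)
        {
            string path = Path.Combine(binariesFolder, binary);
            if (!File.Exists(path))
            {
                missing.Add(path);
            }
        }

        // Assumption: dictionaries follow the tessdata naming, e.g. eng.traineddata.
        string dictionary = Path.Combine(tessdataFolder, languageCode + ".traineddata");
        if (!File.Exists(dictionary))
        {
            missing.Add(dictionary);
        }

        return missing;
    }
}
```

For example, calling OcrPreflight.FindMissingFiles("TesseractBinaries", "Tessdata", "eng") before creating the OCRProcessor returns an empty list when all three files are in place, and otherwise lists the paths that still need to be copied.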
Optical character recognition can also be performed on a section of a document rather than the complete document. The following documentation link provides a code sample and explanation.
The Tesseract engine, starting from version 3, supports a variety of languages, such as Arabic, English, Bulgarian, Catalan, Czech, Chinese, and German.
Essential PDF also supports all these languages in the OCR processor. By default, Syncfusion ships only the English dictionary in the package. The dictionary packs for the other languages can be downloaded from the following online location:
https://github.com/tesseract-ocr/tessdata
The following table shows the complete set of supported languages and their language codes.
Language | Language code |
Arabic | ara |
Azerbaijani | aze |
Bulgarian | bul |
Catalan | cat |
Czech | ces |
Simplified Chinese | chi_sim |
Traditional Chinese | chi_tra |
Cherokee | chr |
Danish | dan |
Danish (Fraktur) | dan-frak |
German, standard and Fraktur script | deu |
Greek | ell |
English | eng |
Middle English | enm |
Esperanto | epo |
Estonian | est |
Finnish | fin |
French | fra |
Middle French | frm |
Galician | glg |
Hebrew | heb |
Hindi | hin |
Croatian | hrv |
Hungarian | hun |
Indonesian | ind |
Italian | ita |
Japanese | jpn |
Korean | kor |
Latvian | lav |
Lithuanian | lit |
Dutch | nld |
Norwegian | nor |
Polish | pol |
Portuguese | por |
Romanian | ron |
Russian | rus |
Slovakian | slk |
Slovenian | slv |
Albanian | sqi |
Spanish | spa |
Serbian | srp |
Swedish | swe |
Tamil | tam |
Telugu | tel |
Tagalog | tgl |
Thai | tha |
Turkish | tur |
Ukrainian | ukr |
Vietnamese | vie |
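When the OCR language comes from user input or configuration, it can help to resolve a language name to its code before assigning processor.Settings.Language. The sketch below does this for a handful of entries copied from the table. The LanguageCodes class and its TryGetCode method are hypothetical names used only for illustration and are not part of Essential PDF.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup, not part of Essential PDF: maps a few language names
// from the table above to their language codes.
public static class LanguageCodes
{
    // Case-insensitive so that "english" and "English" resolve to the same code.
    private static readonly Dictionary<string, string> Codes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Arabic", "ara" },
            { "Simplified Chinese", "chi_sim" },
            { "Greek", "ell" },
            { "English", "eng" },
            { "French", "fra" },
            { "Spanish", "spa" }
        };

    // Returns true and sets the code when the language is known; otherwise returns false.
    public static bool TryGetCode(string languageName, out string code)
    {
        return Codes.TryGetValue(languageName, out code);
    }
}
```

A caller could then write something like `if (LanguageCodes.TryGetCode("French", out string code)) processor.Settings.Language = code;` and fall back to "eng" or report an error when the name is not recognized. Extending the dictionary to the full table is a matter of adding the remaining rows.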
You can improve the accuracy of the OCR process by choosing the correct compression method when converting the scanned paper to a TIFF image and then to a PDF document.
For more details regarding quality improvement, refer to the following link:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
The sample can be checked out from this GitHub repository. Give it a star if you find it useful.
Take a moment to peruse the documentation, where you’ll find other options and features, all with accompanying code examples.
If you are new to our PDF library, it is highly recommended that you follow our Getting Started guide.
If you have any questions or need clarification about these features, please let us know in the comments below. You can also contact us through our support forum or Direct-Trac. We are happy to assist you!
This post was originally published on February 20, 2015.