Tesseract is an open-source Optical Character Recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract typed, handwritten, or printed text from images. Tesseract OPX makes it easy to use Tesseract with Microsoft .NET. It is also optimized for working with Syncfusion Essential PDF for .NET, so that it can process PDF documents whose images contain text. Together, Tesseract OPX and Essential PDF recognize the text in images within PDF documents and overlay those images with searchable text.
To use the OCR feature in your application, add references to the following assemblies:
Assembly Name | Description |
---|---|
Syncfusion.Pdf.Base | This assembly contains the core features for manipulating and saving PDF documents. |
Syncfusion.Compression.Base | This assembly compresses the internal contents of a PDF document. |
Syncfusion.OCRProcessor.Base | This assembly contains the core features for performing OCR on images and PDF documents. |
The following namespaces should be added in the application:
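Judging by the types used in the snippets on this page (OCRProcessor, PdfLoadedDocument, RectangleF, Bitmap, List), the using directives would likely look like the sketch below. The namespace names are assumptions inferred from the usual Syncfusion and .NET layout, not taken from the original text, so check them against the assemblies you reference.

```csharp
// Assumed namespaces; verify against your installed Syncfusion version.
using System.Collections.Generic;   // List<T>, used for the page regions
using System.Drawing;               // RectangleF and Bitmap
using Syncfusion.OCRProcessor;      // OCRProcessor, Languages, PageRegion
using Syncfusion.Pdf.Parsing;       // PdfLoadedDocument
```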
You can perform OCR on a PDF document with the OCRProcessor class. Place the SyncfusionTesseract.dll and liblept168.dll assemblies (available in the installed location <Installation Location>\Syncfusion\Essential Studio <version number>\OCRProcessor) on the local system and provide the assembly path to the OCR processor.
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
Place the Tesseract language data (for example, eng.traineddata), available in the installed location <Installation Location>\Syncfusion\Essential Studio <version number>\OCRProcessor, on the local system and provide its path to the OCR processor.
OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
processor.PerformOCR(lDoc, @"Tessdata\");
You can also download the language packages from https://github.com/tesseract-ocr/tessdata
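Because the processor depends on files found at two separate paths, it can help to confirm that they are present before running OCR. The following is a minimal sketch that uses only System.IO; the class OcrPathCheck and the method FindMissingFiles are hypothetical helpers written for this illustration, not part of the Syncfusion API, and the default language code "eng" is an assumption based on the eng.traineddata example above.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of the Syncfusion API.
static class OcrPathCheck
{
    // Returns the expected files that are missing.
    // An empty list means both folders contain what the snippets on this page rely on.
    public static List<string> FindMissingFiles(
        string binariesDir, string tessdataDir, string languageCode = "eng")
    {
        var expected = new[]
        {
            Path.Combine(binariesDir, "SyncfusionTesseract.dll"),
            Path.Combine(binariesDir, "liblept168.dll"),
            Path.Combine(tessdataDir, languageCode + ".traineddata"),
        };

        var missing = new List<string>();
        foreach (string file in expected)
        {
            if (!File.Exists(file))
            {
                missing.Add(file);
            }
        }
        return missing;
    }
}
```

Calling OcrPathCheck.FindMissingFiles(@"TesseractBinaries\", @"Tessdata\") before constructing the OCRProcessor and reporting any returned paths gives a clearer message than a failure raised later, during recognition.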
The following code snippet performs OCR on an entire PDF document.
//Initialize the OCR processor by providing the path of tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
//Load a PDF document
PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
//Set OCR language to process
processor.Settings.Language = Languages.English;
//Process OCR by providing the PDF document and Tesseract data
processor.PerformOCR(lDoc, @"Tessdata\");
//Save the OCR-processed PDF document to disk
lDoc.Save("Sample.pdf");
lDoc.Close(true);
}
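You can also restrict OCR to specific regions of a page by assigning PageRegion objects to the processor settings, as the following snippet shows.
//Initialize the OCR processor by providing the path of the tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)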
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
//Load a PDF document
PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
//Set OCR language to process
processor.Settings.Language = Languages.English;
RectangleF rect = new RectangleF(0, 100, 950, 150);
//Assign rectangles to the page
List<PageRegion> pageRegions = new List<PageRegion>();
PageRegion region = new PageRegion();
region.PageIndex = 1;
region.PageRegions = new RectangleF[] { rect };
pageRegions.Add(region);
processor.Settings.Regions = pageRegions;
//Process OCR by providing the PDF document and Tesseract data
processor.PerformOCR(lDoc, @"Tessdata\");
//Save the OCR-processed PDF document to disk
lDoc.Save("Sample.pdf");
lDoc.Close(true);
}
You can also perform OCR on an image. The following code snippet demonstrates this.
//Initialize the OCR processor by providing the path of the tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
{
//Load the input image
Bitmap image = new Bitmap("input.jpeg");
//Set OCR language to process
processor.Settings.Language = Languages.English;
//Process OCR by providing the bitmap image and the Tesseract data path; the recognized text is returned as a string
string ocrText = processor.PerformOCR(image, @"Tessdata\");
image.Dispose();
}