
Tesseract OPX in File Formats

Introduction

Tesseract is an open source Optical Character Recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract typed, handwritten, or printed text from images. Tesseract OPX makes it easy to use Tesseract with Microsoft .NET. It is also optimized for working with Syncfusion Essential PDF for .NET, so that PDF documents containing images of text can be processed. Together, Tesseract OPX and Essential PDF can recognize the text in images within PDF documents and overlay those images with searchable text.

Assemblies Required

To use the OCR feature in your application, add references to the following assemblies:

Assembly Name                   Description
Syncfusion.Pdf.Base             Contains the core features for manipulating and saving PDF documents.
Syncfusion.Compression.Base     Compresses the internal contents of a PDF document.
Syncfusion.OCRProcessor.Base    Contains the core features for performing OCR on images and PDF documents.

The following namespaces should be added in the application:

  • using Syncfusion.OCRProcessor;
  • using Syncfusion.Pdf.Parsing;

Performing OCR on PDF document

You can perform OCR on a PDF document with the help of the OCRProcessor class. Place the SyncfusionTesseract.dll and liblept168.dll assemblies (available in the installed location <Installation Location>\Syncfusion\Essential Studio <version number>\OCRProcessor) on the local system and provide the path of the folder that contains them to the OCR processor.

  • c#
  • OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
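Because the processor is given only a folder path, a misplaced binary may not surface until OCR is attempted. As a minimal sketch, the check below looks for the two files named above before the processor is created. The class `TesseractBinaryCheck` and its method are hypothetical helpers written for illustration and are not part of the Syncfusion API; only standard System.IO calls are used.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of the Syncfusion API.
public static class TesseractBinaryCheck
{
    // File names as listed in this article.
    private static readonly string[] RequiredFiles =
    {
        "SyncfusionTesseract.dll",
        "liblept168.dll"
    };

    // Returns the names of the required files that are not present in the folder.
    // An empty list means both binaries were found.
    public static List<string> FindMissing(string binariesFolder)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(binariesFolder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling `TesseractBinaryCheck.FindMissing(@"TesseractBinaries\")` and stopping when the returned list is not empty gives a clearer error than a failure deep inside the OCR call.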

    Place the Tesseract language data (e.g., eng.traineddata), available in the installed location <Installation Location>\Syncfusion\Essential Studio <version number>\OCRProcessor, on the local system and provide its folder path to the OCR processor.

  • c#
  • OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\");
    //lDoc is a previously loaded PdfLoadedDocument
    processor.PerformOCR(lDoc, @"Tessdata\");

    You can also download the language data packages from https://github.com/tesseract-ocr/tessdata
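    Tesseract language data files follow the naming pattern seen in eng.traineddata, so the languages available in a folder can be read from the file names. The sketch below lists them; `TessdataInspector` is a hypothetical helper for illustration, not a Syncfusion or Tesseract API, and it only inspects file names without validating the file contents.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of the Syncfusion API.
public static class TessdataInspector
{
    // Returns the language codes found in the folder, i.e. the names of all
    // "*.traineddata" files without their extension, in ordinal sort order.
    // Returns an empty array when the folder does not exist.
    public static string[] ListLanguages(string tessdataFolder)
    {
        if (!Directory.Exists(tessdataFolder))
        {
            return new string[0];
        }

        return Directory.GetFiles(tessdataFolder, "*.traineddata")
            .Select(path => Path.GetFileNameWithoutExtension(path))
            .OrderBy(code => code, StringComparer.Ordinal)
            .ToArray();
    }
}
```

    For a folder that contains only eng.traineddata, `TessdataInspector.ListLanguages(@"Tessdata\")` returns a single entry, "eng". Checking this before calling PerformOCR helps confirm that the data for the chosen language is actually in place.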

    Please refer to the code snippet below.

  • c#
  • //Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
    {
    	//Load a PDF document
    	PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
    	//Set OCR language to process
    	processor.Settings.Language = Languages.English;
    	//Process OCR by providing the PDF document and Tesseract data
    	processor.PerformOCR(lDoc, @"Tessdata\");
    	//Save the OCR-processed PDF document to disk
    	lDoc.Save("Sample.pdf");
    	lDoc.Close(true);
    }

    Performing OCR for a region of the document:

  • c#
  • //Initialize the OCR processor by providing the path of the tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
    {
    	//Load a PDF document
    	PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
    	//Set OCR language to process
    	processor.Settings.Language = Languages.English;
    	RectangleF rect = new RectangleF(0, 100, 950, 150);
    	//Assign rectangles to the page
    	List<PageRegion> pageRegions = new List<PageRegion>();
    	PageRegion region = new PageRegion();
    	region.PageIndex = 1;
    	region.PageRegions = new RectangleF[] { rect };
    	pageRegions.Add(region);
    	processor.Settings.Regions = pageRegions;
    	//Process OCR by providing the PDF document and Tesseract data
    	processor.PerformOCR(lDoc, @"Tessdata\");
    	//Save the OCR-processed PDF document to disk
    	lDoc.Save("Sample.pdf");
    	lDoc.Close(true);
    }

    Performing OCR on image

    You can also perform OCR on an image. The following code snippet demonstrates this.

  • c#
  • //Initialize the OCR processor by providing the path of the tesseract binaries(SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries\"))
    {
    	//Load the input image
    	Bitmap image = new Bitmap("input.jpeg");
    	//Set OCR language to process
    	processor.Settings.Language = Languages.English;
    	//Process OCR by providing the bitmap image and the Tesseract data path; the recognized text is returned as a string
    	string ocrText = processor.PerformOCR(image, @"Tessdata\");
    	image.Dispose();
    }
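
    The snippet above obtains the recognized text in ocrText but does not store it anywhere. One possible follow-up, sketched here with a hypothetical `OcrTextWriter` helper that is not part of the Syncfusion API, is to write the text to a .txt file next to the source image.

```csharp
using System.IO;

// Hypothetical helper for illustration; not part of the Syncfusion API.
public static class OcrTextWriter
{
    // Writes the recognized text to a file that has the same path as the
    // image but a ".txt" extension, and returns the path of that file.
    public static string SaveNextToImage(string imagePath, string ocrText)
    {
        string textPath = Path.ChangeExtension(imagePath, ".txt");
        // File.WriteAllText creates or overwrites the file using UTF-8.
        File.WriteAllText(textPath, ocrText ?? string.Empty);
        return textPath;
    }
}
```

    In the image example, this would be called as `OcrTextWriter.SaveNextToImage("input.jpeg", ocrText)` before the bitmap is disposed, producing input.txt in the same folder.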