Optical character recognition (OCR) technology plays a vital role in transforming printed or handwritten text into editable and searchable digital content. With the advancements in OCR algorithms, extracting information from PDFs, images, and scanned documents has become more accurate and efficient.
In this blog, we’ll explore how the Syncfusion PDF Library simplifies the process of implementing OCR in your apps, making it easy to extract text and data from scanned documents, images, and PDF documents.
Optical Character Recognition (OCR) is a technology used to convert scanned paper documents in the form of PDF files or images into searchable and editable data. This technology uses advanced algorithms to recognize characters, symbols, and patterns within an image and then translates them into machine-encoded text.
The Syncfusion OCR processor library performs OCR on scanned PDF documents and images using Google’s Tesseract Optical Character Recognition engine.
Tesseract is an open-source optical character recognition (OCR) engine. It is one of the most widely used OCR engines in the world and is known for its high accuracy and versatility.
Note: In ASP.NET Core, Tesseract is supported starting from version 4.0.
The Syncfusion PDF Library is a robust and feature-rich tool that provides developers with a seamless way to integrate OCR capabilities into their apps.
In this blog, we’ll delve into the capabilities of the Syncfusion OCR processor library. The article covers performing OCR on an entire scanned PDF document, on specific regions of a PDF document, on scanned images, and on rotated pages, as well as retrieving text with its bounds and handling Unicode characters.
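To get started, install the Syncfusion.PDF.OCR.Net.Core NuGet package from the Package Manager Console:

    Install-Package Syncfusion.PDF.OCR.Net.Core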
Leveraging our library, you can transform a complete scanned PDF document into a searchable PDF. This enables quick and efficient access to the extracted textual content.
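To perform OCR on an entire scanned PDF document using our .NET PDF Library, initialize the OCR processor, load the PDF document, set the OCR language, perform OCR on the document, and save the result.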
The following code example shows how to convert an entire scanned PDF into a searchable PDF document.
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Load an existing PDF document.
        using (FileStream inputPDFstream = new FileStream("Input.pdf", FileMode.Open))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(inputPDFstream);
            //Set OCR language (English, as in the image example later in this post).
            processor.Settings.Language = Languages.English;
            //Perform OCR with input document and the Tesseract language data folder.
            processor.PerformOCR(document, "Tessdata/");
            //Create file stream.
            using (FileStream outputFileStream = new FileStream("Output.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                //Save the PDF document to file stream.
                document.Save(outputFileStream);
            }
            //Close the document and release its resources.
            document.Close(true);
        }
    }
By executing this code example, you will get a PDF document like in the following screenshot.
Our .NET PDF Library lets you perform OCR on specific regions or multiple regions of a scanned PDF document.
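To perform OCR for a region of the scanned PDF document, define the region as a rectangle, assign it to a page through a PageRegion, pass the regions to the processor settings, and then perform OCR on the document.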
The following code example shows how to convert a region in a scanned PDF into a searchable PDF document.
    using System.Collections.Generic;
    using System.IO;
    using Syncfusion.Drawing;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Load a PDF document.
        using (FileStream inputPDFStream = new FileStream("Input.pdf", FileMode.Open))
        {
            PdfLoadedDocument loadedDocument = new PdfLoadedDocument(inputPDFStream);
            //Set OCR language to process.
            processor.Settings.Language = Languages.English;
            //Define the region to process (x, y, width, height).
            RectangleF rectangle = new RectangleF(0, 100, 950, 150);
            //Assign rectangles to the page.
            List<PageRegion> pageRegions = new List<PageRegion>();
            PageRegion region = new PageRegion();
            region.PageIndex = 0;
            region.PageRegions = new RectangleF[] { rectangle };
            pageRegions.Add(region);
            processor.Settings.Regions = pageRegions;
            //Process OCR by providing the PDF document.
            processor.PerformOCR(loadedDocument, "Tessdata/");
            //Create file stream.
            using (FileStream outputFileStream = new FileStream("Output.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                //Save the PDF document to file stream.
                loadedDocument.Save(outputFileStream);
            }
            //Close the document and release its resources.
            loadedDocument.Close(true);
        }
    }
By executing this code example, you will get a PDF document like in the following screenshot.
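The region example above assigns a single rectangle to the first page. As a minimal sketch, assuming the same PageRegion properties used there (PageIndex and PageRegions), several regions across more than one page might be configured as follows. RegionHelper and BuildRegions are hypothetical names introduced for illustration, the coordinates are placeholders, and the Syncfusion.Drawing namespace for RectangleF is an assumption about the .NET Core packages.

```csharp
using System.Collections.Generic;
using Syncfusion.Drawing;        // Assumed home of RectangleF in the .NET Core packages.
using Syncfusion.OCRProcessor;

// Hypothetical helper, not part of the Syncfusion API.
static class RegionHelper
{
    // Builds a region list with two areas on the first page and one on the second.
    // The coordinates are placeholders, not measured values.
    public static List<PageRegion> BuildRegions()
    {
        return new List<PageRegion>
        {
            new PageRegion
            {
                PageIndex = 0,
                PageRegions = new[]
                {
                    new RectangleF(0, 100, 950, 150),
                    new RectangleF(0, 400, 950, 150)
                }
            },
            new PageRegion
            {
                PageIndex = 1,
                PageRegions = new[] { new RectangleF(0, 0, 950, 200) }
            }
        };
    }
}
```

The returned list would be assigned to processor.Settings.Regions before calling PerformOCR, in place of the single-region list built in the example above.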
You can transform any scanned image into a searchable and selectable PDF document. To do so, initialize the OCR processor, open the image as a stream, set the OCR language, perform OCR on the stream, and save the resulting PDF document.
The following code example demonstrates how to convert a scanned image into a searchable and selectable PDF document.
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf;

    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Get stream from an image file.
        using (FileStream imageStream = new FileStream(@"Input.jpg", FileMode.Open))
        {
            //Set OCR language to process.
            processor.Settings.Language = Languages.English;
            //Process OCR by providing the image stream.
            PdfDocument document = processor.PerformOCR(imageStream);
            //Create file stream.
            using (FileStream outputFileStream = new FileStream(@"Output.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                //Save the PDF document to file stream.
                document.Save(outputFileStream);
            }
            //Close the document and release its resources.
            document.Close(true);
        }
    }
By executing this code example, you will get a PDF document like in the following screenshot.
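The image example above handles a single file. As a rough sketch, assuming the same PerformOCR(Stream) call shown there and assuming one OCRProcessor instance can be reused across images, a folder of scans could be converted in a loop. BatchOcr, OutputPathFor, and ConvertFolder are hypothetical names, and the "*.jpg" filter is only an example.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf;

// Hypothetical helper, not part of the Syncfusion API.
static class BatchOcr
{
    // Maps an image path such as "scans/page1.jpg" to "<outputDir>/page1.pdf".
    public static string OutputPathFor(string imagePath, string outputDir) =>
        Path.Combine(outputDir, Path.GetFileNameWithoutExtension(imagePath) + ".pdf");

    // Runs OCR on every .jpg file in inputDir and writes one searchable PDF per image.
    public static void ConvertFolder(string inputDir, string outputDir)
    {
        Directory.CreateDirectory(outputDir);

        // Assumption: a single processor instance can be reused for several images.
        using (OCRProcessor processor = new OCRProcessor())
        {
            processor.Settings.Language = Languages.English;

            foreach (string imagePath in Directory.EnumerateFiles(inputDir, "*.jpg"))
            {
                using (FileStream imageStream = new FileStream(imagePath, FileMode.Open))
                using (FileStream outputStream = new FileStream(
                    OutputPathFor(imagePath, outputDir), FileMode.Create, FileAccess.ReadWrite))
                {
                    // Same call as in the single-image example above.
                    PdfDocument document = processor.PerformOCR(imageStream);
                    document.Save(outputStream);
                    document.Close(true);
                }
            }
        }
    }
}
```

A call such as BatchOcr.ConvertFolder("Scans", "Searchable") would then process every matching image in the Scans folder. Both folder names are placeholders.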
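To get the text from a rotated page of a PDF document, load the document, set the page segmentation mode to AutoOsd so that the page orientation is detected automatically, perform OCR, and write the extracted text to a file.
Refer to the following code example to perform OCR on a rotated PDF document.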
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Load an existing PDF document.
        using (FileStream stream = new FileStream("Input.pdf", FileMode.Open))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(stream);
            //Set OCR language.
            processor.Settings.Language = Languages.English;
            //Enable automatic page segmentation with orientation and script detection.
            processor.Settings.PageSegment = PageSegment.AutoOsd;
            //Perform OCR with input document and tessdata (Language packs).
            string extractedText = processor.PerformOCR(document, "Tessdata/");
            //Write the extracted text to a file.
            File.WriteAllText("OCR.txt", extractedText);
            //Close the document and release its resources.
            document.Close(true);
        }
    }
By executing this code example, you will get a text file containing the extracted text, like in the following screenshot.
You can also easily obtain text and its corresponding bounds from a scanned PDF document. To do so, pass an OCRLayoutResult to the PerformOCR method and then read the lines of each page, along with their bounds, from the result.
The following code example explains how to retrieve text and its bounds from a scanned PDF document with OCR.
    using System;
    using System.IO;
    using Syncfusion.Drawing;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Parsing;

    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Load an existing PDF document.
        using (FileStream stream = new FileStream("Input.pdf", FileMode.Open))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(stream);
            //Set OCR language.
            processor.Settings.Language = Languages.English;
            //Declare the layout result; PerformOCR assigns it through the out parameter.
            OCRLayoutResult layoutResult;
            //Perform OCR with input document and tessdata (Language packs).
            processor.PerformOCR(document, @"Tessdata/", out layoutResult);
            //Get line collection from first page.
            OCRLineCollection lines = layoutResult.Pages[0].Lines;
            //Get each line and its bounds.
            foreach (Line line in lines)
            {
                string text = line.Text;
                RectangleF bounds = line.Rectangle;
                //Print the line text with its position and size.
                Console.WriteLine($"{text} at ({bounds.X}, {bounds.Y}), size {bounds.Width} x {bounds.Height}");
            }
            //Close the document.
            document.Close(true);
        }
    }
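Once each line's text and bounds are available, a common next step is to keep only the text that falls inside an area of interest, such as a header or a totals box. The following is a minimal, library-independent sketch of that idea. OcrLine and LineFilter are hypothetical types introduced here for illustration; they are not part of the Syncfusion API, and the values would be copied from line.Text and line.Rectangle in the loop above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one OCR line: its text plus the X, Y, Width, and Height of its bounds.
record OcrLine(string Text, float X, float Y, float Width, float Height);

static class LineFilter
{
    // Returns the text of every line whose bounds lie fully inside the given area,
    // ordered from top to bottom.
    public static List<string> TextInside(
        IEnumerable<OcrLine> lines, float x, float y, float width, float height)
    {
        return lines
            .Where(l => l.X >= x
                     && l.Y >= y
                     && l.X + l.Width <= x + width
                     && l.Y + l.Height <= y + height)
            .OrderBy(l => l.Y)
            .Select(l => l.Text)
            .ToList();
    }
}
```

Inside the foreach loop, each line could be added to a list as new OcrLine(line.Text, bounds.X, bounds.Y, bounds.Width, bounds.Height), and the list then passed to LineFilter.TextInside together with the area to inspect.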
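To perform OCR on a scanned PDF document containing Unicode characters, load the document, assign a Unicode-capable TrueType font to the OCR processor so that the recognized characters are preserved, perform OCR, and save the document.
The following code example demonstrates how to perform OCR on a PDF document that contains Unicode characters.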
    using System.IO;
    using Syncfusion.OCRProcessor;
    using Syncfusion.Pdf.Graphics;
    using Syncfusion.Pdf.Parsing;

    //Initialize the OCR processor.
    using (OCRProcessor processor = new OCRProcessor())
    {
        //Get streams for the existing PDF document and the Unicode font file.
        using (FileStream stream = new FileStream(Path.GetFullPath(@"UnicodePDF.pdf"), FileMode.Open))
        using (FileStream fontStream = new FileStream(Path.GetFullPath(@"ARIALUNI.ttf"), FileMode.Open))
        {
            //Load the PDF document.
            PdfLoadedDocument loadedDocument = new PdfLoadedDocument(stream);
            //Set a Unicode font to preserve the Unicode characters in the PDF document.
            processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);
            //Set OCR language to process.
            processor.Settings.Language = Languages.English;
            //Perform OCR by providing the PDF document.
            string ocrText = processor.PerformOCR(loadedDocument);
            //Create file stream.
            using (FileStream outputFileStream = new FileStream(Path.GetFullPath(@"Output.pdf"), FileMode.Create, FileAccess.ReadWrite))
            {
                //Save the PDF document to file stream.
                loadedDocument.Save(outputFileStream);
            }
            //Close the document and release its resources.
            loadedDocument.Close(true);
        }
    }
By executing this code example, you will get a PDF document like in the following screenshot.
You can find the examples for all these OCR options in this GitHub repository.
Thanks for reading! In this blog, we’ve seen how easy it is to perform OCR on PDF documents with the Syncfusion .NET PDF Library in C#. With this, you can easily extract text from scanned PDF documents and images. The library also supports a variety of OCR languages, so it is not limited to English documents.
Take a moment to peruse our documentation, where you’ll find other options and features, all with accompanying code examples.
For current Syncfusion customers, the newest version of Essential Studio® is available from the license and downloads page. If you are not a customer, try our 30-day free trial to check out these new features.
If you have any questions about these features, please let us know in the comments below. You can also contact us through our support forum, support portal, or feedback portal. We are always happy to assist you!