Optical character recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, into searchable, editable data.
The Syncfusion OCR processor library has extended support to OCR process PDF documents and other scanned images in the .NET Core platform from version Create an ASP.NET Core web application.
Follow these steps to create an ASP.NET Core web application in Visual Studio 2019:
Follow these steps to install the Syncfusion.PDF.OCR.Net.Core NuGet package in the project:
Follow these steps to perform OCR processing on a PDF document in ASP.NET Core:
@{ Html.BeginForm("PerformOCR", "Home", FormMethod.Post); { <input type="submit" value="Perform OCR" class=" btn" /> } }
using Microsoft.AspNetCore.Hosting; using Microsoft.AspNetCore.Mvc; using Syncfusion.OCRProcessor; using Syncfusion.Pdf.Graphics; using Syncfusion.Pdf.Parsing; using System.IO;
public IActionResult PerformOCR() { string binaries = Path.Combine(_hostingEnvironment.ContentRootPath, "TesseractBinaries", "Windows"); //Initialize OCR processor with tesseract binaries. OCRProcessor processor = new OCRProcessor(binaries); //Set language to the OCR processor. processor.Settings.Language = Languages.English; string path = Path.Combine(_hostingEnvironment.ContentRootPath, "Data", "times.ttf"); FileStream fontStream = new FileStream(path, FileMode.Open); //Create a true type font to support unicode characters in PDF. processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8); //Set temporary folder to save intermediate files. processor.Settings.TempFolder = Path.Combine(_hostingEnvironment.ContentRootPath, "Data"); //Load a PDF document. FileStream inputDocument = new FileStream(Path.Combine(_hostingEnvironment.ContentRootPath, "Data", "PDF_succinctly.pdf"), FileMode.Open); PdfLoadedDocument loadedDocument = new PdfLoadedDocument(inputDocument); //Perform OCR with language data. string tessdataPath = Path.Combine(_hostingEnvironment.ContentRootPath, "Tessdata"); processor.PerformOCR(loadedDocument, tessdataPath); //Save the PDF document. MemoryStream outputDocument = new MemoryStream(); loadedDocument.Save(outputDocument); outputDocument.Position = 0; //Dispose OCR processor and PDF document. processor.Dispose(); loadedDocument.Close(true); //Download the PDF document in the browser. FileStreamResult fileStreamResult = new FileStreamResult(outputDocument, "application/pdf"); fileStreamResult.FileDownloadName = "OCRed_PDF_document.pdf"; return fileStreamResult; }
By executing this example, you will get the PDF document shown in the following image.
Follow these steps to publish the OCR application in Azure App Service:
After publishing the application, you can perform OCR processing by navigating to the site URL.
Note: Adobe provides an easy-to-use method for turning scanned files into editable PDF documents instantly, with editable text and custom fonts that look just like the original file. For more details, refer to the article How to edit scanned documents.
You can find examples of performing OCR on ASP.NET Core applications and Azure App Service in the GitHub repository.
In this blog post, we have learned to perform OCR processing on PDF documents in ASP.NET Core web applications and publish the applications in Azure App Service.
Take a moment to peruse our documentation, where you’ll find other options and features, all with accompanying code examples.
If you have any questions about these features, please let us know in the comments below. You can also contact us through our support forum, Direct-Trac, or feedback portal. We are happy to assist you!
If you liked this article, we think you would also like the following articles about our PDF Library: