
PDF OCR gives "Attempted to read or write protected memory. This is often an indication that other memory is corrupt."

I have a very simple WinForms app with a file path text box, an output text box, and a button. I am testing the PDF OCR capabilities, but processor.PerformOCR throws this error:

Unhandled Exception: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Syncfusion.OCRProcessor.Native.OCRApi.InitializeDataPath(IntPtr pt, String path, String lang)
   at Syncfusion.OCRProcessor.OCRProcessor.DoOCR(String[] args)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Program.Main(String[] args)

This is all of the code:
try
{
    SyncfusionLicenseProvider.RegisterLicense("VALID LICENSE KEY HERE");

    //Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll)
    using (OCRProcessor processor = new OCRProcessor(@"C:\Temp\TesseractBinaries\3.02\"))
    {
        //Load a PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument(file.Text);

        //Set OCR language to process
        processor.Settings.Language = Languages.English;
        //Process OCR by providing the PDF document and Tesseract data
        output.Text = processor.PerformOCR(lDoc, @"C:\Temp\TesseractData");
        //Close the document and release its resources
        lDoc.Close(true);
    }
}
catch (Exception ex)
{
    output.Text = ex.Message;
}

TesseractBinaries does contain the required *.dll files, TesseractData contains the *.traineddata files, and the project has NuGet references to the Syncfusion packages. The actual .sln is attached.

I originally tested this on Windows Server 2012 R2 within a SharePoint 2016 event handler, and I now get the same error on my local machine (Windows 10) in a test WinForms application.

Attachment: OCRTester_f653dc74.zip

6 Replies

SL Sowmiya Loganathan Syncfusion Team December 20, 2019 11:56 AM UTC

Hi Jussi, 
 
Thank you for contacting Syncfusion support.  
 
We tried the provided sample on our end, but we regret to let you know that we were unable to reproduce the reported issue. Please find the modified sample below, 
 
 
We suspect the issue is specific to the document. Could you please share the input PDF document so that we can replicate the issue? It will be helpful for further analysis and for providing a better solution.  
 
Regards, 
Sowmiya Loganathan 



JP jpalo December 20, 2019 12:10 PM UTC

I'd be happy to share it via email as docs are not public.

I can share this one sample.pdf here, though. It does not throw the exception, but OCR does not find any text in it.

Attachment: sample_968a931e.zip


PN Preethi Nesakkan Gnanadurai Syncfusion Team December 23, 2019 11:37 AM UTC

Hi Jussi, 
  
We have created an incident under your Direct-Trac account. Kindly share your documents in the ticket. 
  
Regards, 
Preethi 



PV Prakash Viswanathan Syncfusion Team December 23, 2019 12:08 PM UTC

Hi Jussi, 

The Syncfusion OCR processor only recognizes text from the images in a PDF document. The provided document does not have any images and contains only text, so the OCR processor does not find any text in it. Kindly try OCR processing on a PDF document with images.  

If you need to get the text from a PDF document, you can use the text extraction functionality. Please refer to the link below for more information, 
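
A minimal sketch of that text extraction, assuming the ExtractText method on PdfPageBase from the Syncfusion.Pdf package that the project already references. The class and method names (TextExtractionSketch, ExtractAllText) are hypothetical and used for illustration only:

```csharp
using System.Text;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;

static class TextExtractionSketch
{
    // Hypothetical helper: returns the embedded (non-image) text of every page.
    public static string ExtractAllText(string pdfPath)
    {
        var text = new StringBuilder();
        PdfLoadedDocument document = new PdfLoadedDocument(pdfPath);
        try
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                // ExtractText reads the text already stored in the page;
                // it does not run OCR on images.
                PdfPageBase page = document.Pages[i];
                text.AppendLine(page.ExtractText());
            }
        }
        finally
        {
            // Close the document and release its resources.
            document.Close(true);
        }
        return text.ToString();
    }
}
```

For a document that mixes images and text, the result of this call could be combined with the output of processor.PerformOCR.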
 
We have created an incident under your Direct-Trac account. Kindly share your documents in the ticket. 
 
Regards, 
Prakash V 



JP jpalo January 3, 2020 10:06 AM UTC

Thank you. I combined OCR with PDF text extraction to support PDFs with both images and text. However, I have two issues:

1. Why is text extracted with (lots of) random line breaks here and there, like this:

(B)
 
Compress
ed Air (kgf/cm
2
)

whereas opening the PDF in Adobe Acrobat, selecting the text, and copying and pasting it here gives no line breaks:
(B) Compressed Air (kgf/cm2)

The same text as an image copied from the document:

The issue is not related to the text being extracted from a table, as it also occurs with text outside tables:
  
is extracted as:
TEST DATE: J
AN
.
22
.2019~
F
EB
.1.2019

2. Is it possible to OCR multiple languages in one go? Right now it accepts only a single language.


SL Sowmiya Loganathan Syncfusion Team January 6, 2020 12:02 PM UTC

Hi Jussi, 

Why is text extracted with (lots of) random line breaks here and there, like this: 
(B)
 
Compress
ed Air (kgf/cm
2
)
 
 
whereas opening the PDF in Adobe Acrobat, selecting the text, and copying and pasting it here gives no line breaks: 
(B) Compressed Air (kgf/cm2) 

We use the Tesseract engine to perform OCR on PDF documents. The Tesseract engine processes the document word by word, so the result can depend on how the content is preserved in the PDF. For this reason, the extracted text breaks at seemingly random places, and this is the current behavior.  
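
If the breaks fall between fragments of the same logical line, as in the samples quoted above, one possible workaround is to join the fragments after extraction. The following is a minimal sketch, not part of the Syncfusion API: ExtractedTextCleaner and JoinFragments are hypothetical names, and the sample string is an assumption reconstructed from the output quoted above. It removes every line break, so it would also merge genuine line and paragraph breaks and only suits short, single-line values.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractedTextCleaner
{
    // Hypothetical helper: joins fragments that were split across lines.
    public static string JoinFragments(string extracted)
    {
        // Remove CR/LF without inserting a space, because the breaks
        // in the quoted samples fall inside words ("Compress" + "ed").
        string joined = extracted.Replace("\r", "").Replace("\n", "");
        // Collapse runs of spaces or tabs left behind by blank lines.
        return Regex.Replace(joined, @"[ \t]{2,}", " ").Trim();
    }
}

static class Program
{
    static void Main()
    {
        // Assumed shape of the extracted text from the first sample.
        string sample = "(B)\n \nCompress\ned Air (kgf/cm\n2\n)";
        Console.WriteLine(ExtractedTextCleaner.JoinFragments(sample));
        // prints "(B) Compressed Air (kgf/cm2)"
    }
}
```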

Please let us know if you have any concerns on this. 
Is it possible to OCR multiple languages in one go? Right now it accepts only a single language. 
You can process OCR with multiple languages at once using the code snippet below,  

//Initialize the OCR processor by providing the path of the Tesseract binaries (SyncfusionTesseract.dll and liblept168.dll) 
using (OCRProcessor processor = new OCRProcessor(@"Tesseract Binaries/")) 
{ 
    //Set the OCR languages to process, joined with "+" 
    processor.Settings.Language = "eng+deu";                       
    //Load the document and call processor.PerformOCR as in the single-language case 
} 

Note: Make sure to include the language data file for each language in the tessdata folder.  
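
A quick way to verify this before running OCR is to check that each language code has a matching data file. The sketch below is an illustration only: TessdataCheck and MissingLanguages are hypothetical names, and it assumes the usual Tesseract naming convention of one <code>.traineddata file per language (for example eng.traineddata and deu.traineddata). The folder path is the one from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TessdataCheck
{
    // Hypothetical helper: returns the codes from a "+"-separated language
    // string that have no <code>.traineddata file in tessdataDir.
    public static List<string> MissingLanguages(string languages, string tessdataDir)
    {
        var missing = new List<string>();
        string[] codes = languages.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in codes)
        {
            string code = raw.Trim();
            string file = Path.Combine(tessdataDir, code + ".traineddata");
            if (!File.Exists(file))
            {
                missing.Add(code);
            }
        }
        return missing;
    }
}

static class Program
{
    static void Main()
    {
        // Path taken from the original post; adjust it to your setup.
        List<string> missing = TessdataCheck.MissingLanguages("eng+deu", @"C:\Temp\TesseractData");
        Console.WriteLine(missing.Count == 0
            ? "All language data files found."
            : "Missing language data: " + string.Join(", ", missing));
    }
}
```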

Please download the language data files from the link below,  



Regards, 
Sowmiya Loganathan 

