Inconsistencies in extracting text from a PDF file

Hi:

I am performing a series of tests extracting text from a PDF document. I have found a possible problem in the following lines:


    // Create a list to store the text data
    List<TextData> textDataCollection = new List<TextData>();

    // Extract the text and get the text data
    string extractedText = loadedPage.ExtractText(out textDataCollection);


The extractedText string contains all the text on the page; in textDataCollection, however, the text is incomplete, mainly in the central part of the page. The project and the PDF file are attached.

Additionally, I have observed that the letter "y" is duplicated in the extracted text string (highlighted in bold and underlined in the sample below).

Thank you



nosolvidaría en muchos años (es posible que nunca), y encima habríamos disfrutadode todo el tiempo empleado en cada una de las visualizaciones. Esta es la razónpor la que este curso tiene un guión muy progresivo, fácil de seguir, yy está

óptimamente distribuido durante días sucesivos. Su contenido creará sólidas

bases de conocimiento y nos permitirá avanzar con rapidez sin tener ninguna

sensación de dificultad. En vez de ver la misma película varias veces seguidas,vamos a ver una serie cuya trama bien entrelazada nos aportará mucha mayor«cultura cinéfila». En el aprendizaje hay que repartir tareas yy saber dejar cosaspara mañana, pero ojo, también hay que ser muyy constantes si queremos tener

éxito.Como en mis libros anteriores, el lector podrá encontrar aquí las tablas devocabulario completamente traducidas y asociadas. yEn ellas se incluyyen lostérminos en español y alemán, la pronunciación figurada de cada palabra

(my apologies for the Spanish text)

Thank you

Regards,

Jesús


Attachment: Extracion_v4SF_552eec5e.rar


9 Replies

IJ Irfana Jaffer Sadhik Syncfusion Team December 5, 2024 10:48 AM UTC

Hi Jesus,


We are currently validating the reported behavior with the details you provided, and we will share further details on December 9, 2024.


Regards,

Irfana J.



RA Rangarajan Ashokan Syncfusion Team December 9, 2024 04:53 PM UTC

Hi Jesus,


We have confirmed the issue “Text cut down issue occurs while extracting the text from the PDF document” as a defect in our product, and we will include the fix in our weekly release on December 24, 2024.

 

Please use the feedback link below to track the status of the reported bug.

https://www.syncfusion.com/feedback/63827/text-cut-down-issue-occurs-while-extracting-the-text-from-the-pdf-document

 

Note: If you require a patch for the reported issue in any of our Essential Studio Main or SP release versions, please let us know the version so that we can provide a patch for it based on our SLA policy.

 

Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors including but not limited to QA checks and work reprioritization.”


Regards,

Rangarajan.




IJ Irfana Jaffer Sadhik Syncfusion Team December 24, 2024 11:33 AM UTC

Hi Jesus,

We were unable to include the fix for the issue "Text cut down issue occurs while extracting the text from the PDF document" as promised in this weekly release due to stability concerns. The fix will be included in the upcoming weekly release on December 31, 2024.

If you would like to verify the fix before the next release, we can provide you with a custom patch. Please let us know if you are interested.

We apologize for any inconvenience this may have caused and appreciate your understanding.


Regards,

Irfana J.



JE Jesús December 25, 2024 05:51 PM UTC

Hi Irfana,

Please send us the corresponding patch so that we can move forward with development.


Thank you


Regards,


Jesús



IJ Irfana Jaffer Sadhik Syncfusion Team December 26, 2024 03:00 PM UTC

Hi Jesus,


We are currently working on resolving the issue. We will provide a custom patch for the latest version on December 27, 2024.

Regards,

Irfana J.



RA Rangarajan Ashokan Syncfusion Team December 30, 2024 01:25 PM UTC

Hi Jesus,


We were unable to include the fix for the issue "Text cut down issue occurs while extracting the text from the PDF document" as promised in this weekly release due to preservation issues and stability concerns. The fix will be included in the upcoming weekly release on January 7, 2025.

 

We will provide the custom patch for the latest version on December 31, 2024.

 

We apologize for any inconvenience this may have caused and appreciate your understanding.


Regards,

Rangarajan.



SN Sameerkhan NainarAliBadusha Syncfusion Team December 31, 2024 03:37 PM UTC

Hi Jesus,


We apologize for the inconvenience caused.

Due to the complexities involved in resolving the text preservation issue, we are unable to provide the custom patch today. However, we assure you that it will be delivered by January 3, 2025.


Regards,
Sameerkhan N



RA Rangarajan Ashokan Syncfusion Team January 3, 2025 04:59 PM UTC

Hi Jesus,


We have resolved the issue and prepared a custom NuGet package, version 28.1.37, which is attached here. Please check it.


Regards,

Rangarajan.


Attachment: syncfusion.pdf.winforms.28.1.37_1ecd3424.zip


IJ Irfana Jaffer Sadhik Syncfusion Team January 7, 2025 11:00 AM UTC

Hi Jesus,


We have included the fix for the reported issue “Text cut down issue occurs while extracting the text from the PDF document” in our weekly release (v28.1.38). Please use the link below to download our latest NuGet package.

 

https://www.nuget.org/packages/Syncfusion.Pdf.WinForms/28.1.38

 

Root cause:

The input document contains escape sequences in its content. The extraction logic did not skip these escape sequences properly, which caused an invalid index. This led to the reported issue, in which the content is drawn cut down.
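
For readers who want to see what this kind of handling involves, here is a minimal C# sketch of decoding the escape sequences that the PDF specification allows inside a literal string, with a bounds check before every index access. It is only an illustration under stated assumptions and is not Syncfusion's actual fix: the names PdfLiteralString and Unescape are hypothetical, and the sketch maps each byte directly to a character, ignoring the font encoding that a real extractor would apply.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any Syncfusion API.
public static class PdfLiteralString
{
    // Decodes the body of a PDF literal string (the text between the outer parentheses).
    public static string Unescape(string raw)
    {
        var result = new StringBuilder(raw.Length);
        int i = 0;
        while (i < raw.Length)
        {
            char c = raw[i++];
            if (c != '\\')
            {
                result.Append(c);
                continue;
            }

            // A trailing backslash has nothing to escape; stop instead of indexing past the end.
            if (i >= raw.Length)
                break;

            char e = raw[i++];
            switch (e)
            {
                case 'n': result.Append('\n'); break;
                case 'r': result.Append('\r'); break;
                case 't': result.Append('\t'); break;
                case 'b': result.Append('\b'); break;
                case 'f': result.Append('\f'); break;
                case '\r':
                    // Backslash followed by an end-of-line is a line continuation: emit nothing.
                    if (i < raw.Length && raw[i] == '\n') i++;
                    break;
                case '\n':
                    break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Octal escape: one to three octal digits.
                        int value = e - '0';
                        int digits = 1;
                        while (digits < 3 && i < raw.Length && raw[i] >= '0' && raw[i] <= '7')
                        {
                            value = value * 8 + (raw[i++] - '0');
                            digits++;
                        }
                        result.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        // Covers \( \) \\ and unknown escapes: keep the character, drop the backslash.
                        result.Append(e);
                    }
                    break;
            }
        }
        return result.ToString();
    }
}
```

For example, PdfLiteralString.Unescape(@"f\341cil \(de seguir\)") returns "fácil (de seguir)", because octal 341 is character code 225 (á). A decoder that treats the backslash or the octal digits as ordinary characters ends up with indices that no longer line up with the decoded text, which is the general kind of mismatch the root cause above describes.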


Regards,

