Incorrect font size in GlyphFontSize when extracting text from PDF page

Hi:

In some PDF documents, when extracting text, I have found that GlyphFontSize returns an incorrect font size (FontSize=1), as well as a FontName that does not match what Adobe Acrobat reports.

I attach the complete project, including the PDF file, and a screenshot of the data returned by Adobe Acrobat.

Thanks

Best Regards




Attachment: GetTextglyphdetails_720364b.rar

12 Replies

JT Jeyalakshmi Thangamarippandian Syncfusion Team October 22, 2024 12:20 PM UTC

Hi Jesús,


According to the PDF specification, the font size is indicated by the Tf operator. After examining the given PDF, we noticed that the font size is set to 1. We've shared the internal structure of the input document with you for your reference.

[Image: internal structure of the input PDF]

Additionally, we observed that a scaling factor is applied in this document, which makes the text appear larger when viewed. During text extraction, we take this scaling factor into account when calculating the height of each glyph. As a result, the height of each glyph differs from the font size.
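As a rough illustration of how such a scaling factor relates to the Tf value: in the PDF text model, the size at which text is rendered is the Tf size multiplied by the vertical scale of the text matrix (Tm) combined with the current transformation matrix (CTM). The sketch below is plain C# arithmetic, not a Syncfusion API. The function names and the sample matrix values are assumptions chosen for the example; which matrix actually carries the scaling in the attached document is not stated in this thread.

```csharp
using System;

// Multiplies the linear parts [a b c d] of two PDF matrices
// (row-vector convention, as used by the PDF specification).
static double[] Multiply(double[] m, double[] n) =>
    new double[]
    {
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
    };

// Hypothetical helper: scales the Tf size by the length of the
// transformed vertical unit vector, so rotated text is handled too.
static double EffectiveFontSize(double tfSize, double[] textMatrix, double[] ctm)
{
    double[] combined = Multiply(textMatrix, ctm);
    double verticalScale = Math.Sqrt(combined[2] * combined[2] + combined[3] * combined[3]);
    return tfSize * verticalScale;
}

// Assumed sample values: "/F1 1 Tf" with "12 0 0 12 72 700 Tm" and an identity CTM.
double size = EffectiveFontSize(1.0, new double[] { 12, 0, 0, 12 }, new double[] { 1, 0, 0, 1 });
Console.WriteLine(size); // 12
```

With these assumed values, a font size of 1 in the content stream corresponds to 12-point text on the page, which would be consistent with a viewer reporting a larger size than the Tf operand.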


Regards,

Jeyalakshmi T



JE Jesús October 22, 2024 04:22 PM UTC

Hi:


Thanks for the quick response. Based on your answer, I have a question: how can I calculate the real font size of the extracted text?


Best regards.


Jesús



IJ Irfana Jaffer Sadhik Syncfusion Team October 23, 2024 12:25 PM UTC

Hi Jesus,


At present, we don't support measuring the exact font size in a PDF document. However, it is possible to retrieve the glyph size from the extracted text. You can refer to the documentation below for instructions on how to do this:

https://help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-text-extraction#working-with-characters


Please try this on your end and let us know the result.


Regards,

Irfana J.






JE Jesús October 23, 2024 01:37 PM UTC

Hi:

Thanks for the link; however, it is the same code that I used in the project I sent you. Is there any way to get the scale factor so that I can calculate the real font size? Is there no other possible solution?


Thank you very much


Best regards.


Jesús



JE Jesús October 23, 2024 02:20 PM UTC

Hi:

It seems I have found a possible solution.

Instead of using:

 string m_extractedText = page.ExtractText(out TextLines textLines);

which returned (see imagen1) a FontSize=1 and a FontName="Ms San Serif"

If we use:


string extractedText1 = page.ExtractText(out List<TextData> textDataCollection);

It returns a FontSize and FontName (see imagen2) matching Adobe Acrobat (see screenShot).

All images are in images.rar


Best Regards

Jesús



Attachment: images_131e228f.rar


IJ Irfana Jaffer Sadhik Syncfusion Team October 24, 2024 06:39 AM UTC

Hi Jesus,


Please find the details below:

Thanks for the link, however it is the same code that I used in the project that I sent you. Is there any way to get the scale factor to get the real Fontsize? Is there no other solution possible?
Can you please share the complete requirement with us, so that we can analyze this further and share the details?

It seems I have found a possible solution.

Instead of using:

 string m_extractedText = page.ExtractText(out TextLines textLines);

which returned (see imagen1) a FontSize=1 and a FontName="Ms San Serif"

If we use:


string extractedText1 = page.ExtractText(out List<TextData> textDataCollection);

It returns a FontSize and FontName (see imagen2) matching Adobe Acrobat (see screenShot).

All images are in images.rar

Yes, these are the two possible ways to retrieve the size of the glyph from the extracted text.


Regards,

Irfana J.



JE Jesús October 24, 2024 04:27 PM UTC

Hi:

The final objective of the project is to edit the text contained in a PDF file.

1. As an initial step, all the lines are extracted with their corresponding rectangles, text, font name and font size. The extracted data is stored in:

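        // All text lines extracted from the document
        private List<TextElement> ListTextElements = new List<TextElement>();

        // One extracted line of text with its position and font attributes
        public class TextElement
        {
            public int PageNumber { get; set; }
            public string Text { get; set; }
            public RectangleF Bounds { get; set; }
            public string fontName = "Arial";
            // float rather than int, so fractional font sizes are not truncated
            public float fontSize = 8;
            public System.Drawing.FontStyle fontStyle;
            public System.Drawing.Color textColor;
        }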

2. Mouse events are handled in a PdfViewer control:

private void pdfviewer_MouseUp(object sender, MouseButtonEventArgs e)

This event handler checks whether the click falls inside any of the rectangles. If so, it passes the text to a TextBox for editing, keeping the same FontName, FontSize and length, so that the result can be evaluated before it replaces the original text (using redaction). Keeping the same FontFamily, FontSize and length is meant to prevent the edited text from overwriting other text. If the font does not exist, a PDF font with the same FontSize would be applied. Any ideas, criticism or suggestions are welcome.
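The hit test described in step 2 could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the attached project: the HitTest function, the tuple standing in for TextElement, and the sample rectangles are all assumptions. It also assumes the click position has already been converted from viewer coordinates to the same page coordinates as the extracted bounds, a step that depends on zoom and scroll offset and is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical helper: returns the index of the first element whose
// bounds contain the point, or -1 if the click hit no text line.
static int HitTest(IReadOnlyList<(string Text, RectangleF Bounds)> elements, PointF pagePoint)
{
    for (int i = 0; i < elements.Count; i++)
    {
        if (elements[i].Bounds.Contains(pagePoint))
        {
            return i;
        }
    }
    return -1;
}

// Made-up sample data standing in for the extracted text lines.
var elements = new List<(string Text, RectangleF Bounds)>
{
    ("Invoice No. 123", new RectangleF(72f, 90f, 150f, 12f)),
    ("Total: 45.00", new RectangleF(72f, 110f, 100f, 12f)),
};

// A click already mapped to page coordinates (assumed).
int hit = HitTest(elements, new PointF(80f, 115f));
Console.WriteLine(hit >= 0 ? elements[hit].Text : "no text line at this point");
```

A linear scan like this is usually enough for the number of lines on a single page; filtering by PageNumber first would keep the list short for longer documents.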

Thank you so much

Regards,

Jesús



IJ Irfana Jaffer Sadhik Syncfusion Team October 25, 2024 01:22 PM UTC

Hi Jesus,


We are currently analyzing the reported behavior with the details you provided, and we will share further details on October 29th, 2024.


Regards,

Irfana J.



IJ Irfana Jaffer Sadhik Syncfusion Team October 29, 2024 01:04 PM UTC

Hi Jesus,


We have confirmed the issue “FontSize and FontName is not retrieved properly while extracting the text using TextLine API” as a defect in our product, and we will include the fix in our weekly release on November 12th, 2024.

Please use the below feedback link to track the status of the reported bug.

Note: If you require a patch for the reported issue in any of our Essential Studio Main or SP release versions, kindly let us know the version, so that we can provide a patch for that version based on our SLA policy.

Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors including but not limited to QA checks and work reprioritization.”


Regards,

Irfana J.



IJ Irfana Jaffer Sadhik Syncfusion Team November 12, 2024 12:25 PM UTC

Hi Jesus,

Since our 2024 Volume 3 SP1 release is expected to be rolled out shortly, there will be no weekly release this week. We will include the fix for the reported issue in our upcoming weekly NuGet release (November 19th, 2024), once the Volume 3 SP1 release is rolled out, which we expect by the end of this week. We have created a custom NuGet package in the latest version, 27.1.58. Kindly download the NuGet package from the link below.

Please use the below feedback link to track the status of the reported bug.

Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors, including but not limited to QA checks and work reprioritization.”

Regards,

Irfana J.


IJ Irfana Jaffer Sadhik Syncfusion Team November 19, 2024 11:01 AM UTC

Hi Jesus,


Due to the 2024 Volume 3 SP release last week, our weekly release scheduled for today has been postponed to tomorrow, November 20th, 2024. Further details will be provided tomorrow, November 20th, 2024.

Regards,
Irfana J.



IJ Irfana Jaffer Sadhik Syncfusion Team November 22, 2024 06:05 AM UTC

Hi Jesus,


We have included the fix for the issue “FontSize and FontName is not retrieved properly while extracting the text using TextLine API” in our latest weekly release (27.2.3). Please download the NuGet package from the link below.

Root cause: While extracting the text, the font name and font size were not updated properly, which caused FontSize and FontName to be retrieved incorrectly through the TextLine API.

Regards,

Irfana J.

