Hi:
In some PDF documents when extracting text I have found that GlyphFontSize returns an incorrect font size, FontSize=1, as well as a FontName that does not correspond to what is indicated in Adobe Acrobat.
I attach the complete project including the PDF file, and a screenshot of the data returned by Adobe Acrobat
Thanks
Best Regards
Hi Jesús,
According to the PDF specification, the font size is indicated by the Tf operator. After examining the given PDF, we noticed that the font size is set to 1. We've shared the internal structure of the input document with you for your reference.
Additionally, we observed that a scaling factor is applied in this document, which makes the text appear larger when viewed. During text extraction, we take this scaling factor into account when calculating the height of each glyph. As a result, the height of each glyph differs from the font size.
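As an illustration of how a scaling factor can turn a Tf size of 1 into a larger visible size, here is a minimal sketch. It assumes the usual PDF convention that the rendered size is the Tf size multiplied by the vertical scale of the text matrix combined with the current transformation matrix. The class and method names are hypothetical, the matrix values are made up for the example and are not taken from the attached PDF, and this is not how the Syncfusion library necessarily computes glyph height internally.

```csharp
using System;

public static class EffectiveFontSize
{
    // Multiplies two PDF-style matrices given as [a b c d e f]
    // (row-vector convention): result = m1 x m2.
    public static double[] Multiply(double[] m1, double[] m2)
    {
        return new[]
        {
            m1[0] * m2[0] + m1[1] * m2[2],
            m1[0] * m2[1] + m1[1] * m2[3],
            m1[2] * m2[0] + m1[3] * m2[2],
            m1[2] * m2[1] + m1[3] * m2[3],
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5],
        };
    }

    // Tf size times the vertical scale of (text matrix x CTM).
    public static double Compute(double tfSize, double[] textMatrix, double[] ctm)
    {
        double[] trm = Multiply(textMatrix, ctm);
        double verticalScale = Math.Sqrt(trm[2] * trm[2] + trm[3] * trm[3]);
        return tfSize * verticalScale;
    }
}
```

For example, with a Tf size of 1, a hypothetical text matrix of [12 0 0 12 72 700] and an identity CTM, Compute returns 12, which is the size a viewer would report.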
Regards,
Jeyalakshmi T
Hi:
Thanks for the quick response. Based on your answer, I have a question: how can I calculate the real font size of the extracted text?
Best regards.
Jesús
Hi Jesus,
At present, we don't support measuring the exact font size in a PDF document. However, it is possible to retrieve the glyph size from the extracted text. You can refer to the documentation below for instructions on how to do this:
Please try this on your end and let us know the result.
Regards,
Irfana J.
Hi:
Thanks for the link, however it is the same code that I used in the project that I sent you. Is there any way to get the scale factor to get the real Fontsize? Is there no other solution possible?
Thank you very much
Best regards.
Jesús
Hi:
It seems I have found a possible solution.
Instead of using:
string m_extractedText = page.ExtractText(out TextLines textLines);
which returned (see imagen1) a FontSize=1 and a FontName="Ms San Serif",
If we use:
string extractedText1 = page.ExtractText(out List<TextData> textDataCollection);
it returns a FontSize and FontName (see imagen2) matching Adobe Acrobat (see screenShot).
All images are in images.rar
Best Regards
Jesús
Hi Jesus,
Please find the details below:
Query: Thanks for the link, however it is the same code that I used in the project that I sent you. Is there any way to get the scale factor to get the real Fontsize? Is there no other solution possible?
Response: Could you please share the complete requirement with us so that we can analyze this further and share the details?

Query: It seems I have found a possible solution. Instead of using:
string m_extractedText = page.ExtractText(out TextLines textLines);
which returned (see imagen1) a FontSize=1 and a FontName="Ms San Serif", if we use:
string extractedText1 = page.ExtractText(out List<TextData> textDataCollection);
it returns a FontSize and FontName (see imagen2) matching Adobe Acrobat (see screenShot). All images are in images.rar
Response: Yes, these are the two possible ways to retrieve the glyph size from the extracted text.
Regards,
Irfana J.
Hi:
The final objective of the project is to edit the text contained in a pdf file.
1. As an initial step, all the lines are extracted, with their corresponding rectangles, text, font name and font size. The extracted data is stored in:
private List<TextElement> ListTextElements = new List<TextElement>();
public class TextElement
{
    public int PageNumber { get; set; }
    public string Text { get; set; }
    public RectangleF Bounds { get; set; }
    public string fontName = "Arial";
    public int fontSize = 8;
    public System.Drawing.FontStyle fontStyle;
    public System.Drawing.Color textColor;
}
2. Mouse events are programmed in a pdfviewer control
private void pdfviewer_MouseUp(object sender, MouseButtonEventArgs e)
This event handler checks whether the click falls inside any of the rectangles. If so, it passes the text to a TextBox for editing, with the same FontName, FontSize and length, so that the result can be evaluated before replacing the original text (using redaction). Keeping the same FontFamily, FontSize and length helps (or attempts) to prevent the edited text from overwriting other text. If the font does not exist, a PDF font with the same FontSize would be applied. Any idea, criticism or suggestion is welcome.
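The hit test described in step 2 could be sketched roughly as follows. This is only an illustration: TextElement here is a trimmed copy of the class above, TextHitTest and FindAt are hypothetical names, and the sketch assumes the click position has already been converted from viewer coordinates to the same page coordinates as Bounds (that conversion depends on zoom and scroll position and is not shown).

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Trimmed copy of the TextElement class, kept only to make the sketch self-contained.
public class TextElement
{
    public int PageNumber { get; set; }
    public string Text { get; set; }
    public RectangleF Bounds { get; set; }
}

public static class TextHitTest
{
    // Returns the first element on the given page whose bounds contain
    // the point, or null when the click does not hit any extracted line.
    public static TextElement FindAt(IEnumerable<TextElement> elements, int pageNumber, PointF pagePoint)
    {
        foreach (TextElement element in elements)
        {
            if (element.PageNumber == pageNumber && element.Bounds.Contains(pagePoint))
            {
                return element;
            }
        }
        return null;
    }
}
```

In the MouseUp handler, the result of FindAt (if not null) would supply the text and bounds for the editing TextBox.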
Thank you so much
Regards,
Jesús
Hi Jesus,
We are currently analyzing the reported behavior with the details you provided, and we will share further details on October 29th, 2024.
Regards,
Irfana J.
Hi Jesus,
Regards,
Irfana J.
Hi Jesus,
Due to the 2024 Volume 3 SP release last week, our weekly release scheduled for today has been postponed to tomorrow, November 20th, 2024. Further details will be provided tomorrow, November 20th, 2024.
Regards,
Irfana J.
Hi Jesus,
Root cause: The font name and font size were not updated properly during text extraction, so the TextLine API returned incorrect FontSize and FontName values.
Irfana J.