Some characters are missing using PdfTextExtractor.GetTextFromPage

Michael
When I parse text from a PDF, any word containing the character pair fi or fl
(for example fia or fla) loses the second letter:
both words come out as fa when I read them with PdfTextExtractor.GetTextFromPage.

Note: The fi and the fl each appear to be a single character (a ligature) in the PDF.

iTextSharp does not return the i or the l.

I've tried both the LocationTextExtractionStrategy and SimpleTextExtractionStrategy.

...
string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
...
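
A minimal sketch of a check that could be run on the extracted string (currentPageText above) to tell whether the fi and fl ligatures arrive as single Unicode code points (U+FB01 and U+FB02) or are dropped entirely. The class and method names are hypothetical, and it is only an assumption that the PDF maps its ligature glyphs to those code points. If ContainsLigature returns false for a page that visibly contains such words, the characters are being lost before this step.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
static class LigatureCheck
{
    // Unicode presentation forms for the two ligatures in question.
    const char LigatureFi = '\uFB01'; // LATIN SMALL LIGATURE FI
    const char LigatureFl = '\uFB02'; // LATIN SMALL LIGATURE FL

    // True if the text still carries either ligature as a single code point.
    public static bool ContainsLigature(string text)
    {
        return text.IndexOf(LigatureFi) >= 0 || text.IndexOf(LigatureFl) >= 0;
    }

    // Expands both ligatures into their plain two-letter spellings.
    public static string ExpandLigatures(string text)
    {
        return text.Replace("\uFB01", "fi").Replace("\uFB02", "fl");
    }
}
```

For example, LigatureCheck.ExpandLigatures(currentPageText) would turn a single-code-point ligature followed by "a" into the three letters f, i, a, and would leave text without ligatures unchanged.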

Everything else seems to work fine, even on PDFs with more than 300 pages of text.

Can someone please help me figure out what I'm doing wrong?
I really need my fi and fl....

Thanks in advance
Michael