Summary
0:04: Introduction
0:22: The Problem: Large companies have lots of data in PDF format, which is unstructured and not easily extracted.
2:30: The Solution: Python packages that offer a broad feature set, including text extraction and recognition of bullet points, drawings, images, text color, and font size.
3:30: What PDFs Are: An Adobe specification created in 1992, intended to reproduce the same appearance on any device.
5:41: PDFs as Endpoints: They are created at a specific point in time and not meant for downstream processing.
7:39: Why Text Extraction Is Difficult: Some fonts do not contain back translation from glyphs to Unicode.
8:03: Why We Are Interested in Extracting Data: To reproduce reports, analyze shareholder value, and make forecasts.
9:36: The Goal: To access and process PDFs quickly to produce structured information from unstructured text.
11:16: The Tools: PyMuPDF and Tesseract OCR, both open source and freeware.
12:14: PyMuPDF: A binding to a C library, MuPDF, capable of processing PDFs, XPS, EPUB formats, and more.
13:00: Tesseract OCR: An OCR engine that can be used standalone or as a subprogram by applications like PyMuPDF.
18:30: PyMuPDF Performance: Processes 100,000 pages with full text detail in 1.25 minutes.
20:08: Text Extraction Speed: PyMuPDF extracts the full 3,000 pages of the Pandas manual in 1.6 seconds.
20:44: Selectivity with OCR: OCR needs 1,000 times longer than basic text extraction, so it should be avoided whenever possible.
22:27: Q&A: Table detection, creating PDFs from data, confidence levels in OCR, and comparing PyMuPDF to LangChain.
Synced Transcript
[Harald Lieder]
[00:00:04]
Introduction
An amazing number of people are interested in this topic. It doesn't really look, may I use the word, sexy nowadays? Yeah, it doesn't look really cool. But it actually is, if you look at the details.
Presentation Structure
The presentation follows this structure. It's a pyramid-principle structure: situation, complication, solution.
The Problem: Unstructured Data in PDFs
So we have a situation, which is: most large companies have lots and lots of reports, mostly in PDF format. And maybe, if they are less lucky, even just on paper. But these reports contain critical data. And companies only belatedly recognize that they still need these data for downstream data processing.
[00:01:01]
When they then finally realize this, they look at these data and find that the data are unstructured, like reports are. Reports are for human perception, not for downstream processing.
Challenges of PDF Text Extraction
The next problem is that PDF files do not easily lend themselves to extraction. In fact, the very first PDF specification did not even contain any provision for extracting data. And when you can extract data, you are sometimes surprised at how unreadable the text may be. And then, of course, the quantity problem: PDFs represent millions of pages. The complication is, as I mentioned a little bit earlier, that not all reports actually exist in file format.
[00:02:04]
They may be on paper. Or, if they do exist in file format, they may be scanned pages. So you actually have images. And those images, by nature, don't know what they contain. You need something called OCR, I'm sure you all know it, to recreate characters out of the image.
The Solution: Python Packages with Broad Feature Sets
The solution is, if you want to do it with Python, you need Python packages that offer a broad feature set. It's not only text extraction, and in so far the title of the presentation is a little bit misleading. You need more. Namely, you have to recognize things like bullet points. They could be drawings, they could be images. You also need to recognize the color of the text.
[00:03:03]
Is it written in bold or italic? What's the font size? Stuff like this. All this leads to the conclusion that pure Python tools are just not suitable. You need packages that are based on fast, high-speed C or C++ libraries. So this is an overview of what I will be mentioning today.
What is a PDF?
I hope you are not bored if I repeat a few facts about what PDFs actually are. This is a specification created in 1992 by Adobe. It's still in use today, although it's, how much, 30 years old or so? A very old specification. It has an evolution behind it, with lots of versions adding new features. It's a specification, however.
[00:04:04]
Its purpose, its initial intention, is to reproduce the same appearance on any device, any software, any hardware, on a Mac, on a Windows PC, and even when printed out. It should always look the same. And PDF is very good at this. Very good at this, and also very good in its widespread presence on all devices and all sorts of software. Technically, a PDF is a text file, an ASCII text file. In a special structure, obviously, but it is that. And if it contains binary data, and it usually does, these binary data represent its content: either the text and/or the fonts that specify how the text should be presented. It also contains images and vector graphics, and today even multimedia content.
[00:05:00]
So that it is.
PDF Text File Structure
Maybe you can recognize the left-hand side. This is how a PDF file looks in an editor. At the bottom, you see some information on how many objects we have. Just a handful in this case. And I can use this one, right? And here is some binary content representing the text which you can see on the right side. So for creating the usual, inevitable salutation, Hello World, all the stuff on the left-hand side is necessary. So let me reiterate.
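First, though, a quick sketch of how such a minimal "Hello World" file could be produced and inspected with PyMuPDF, roughly reproducing the editor view on the slide. This assumes PyMuPDF is installed and imported as fitz; it is an illustration, not the speaker's demo code.

```python
# Create the inevitable "Hello World" PDF and dump its low-level object
# definitions, which is roughly what an editor shows for such a file.
import fitz  # PyMuPDF

doc = fitz.open()                            # new, empty PDF
page = doc.new_page()                        # add one page
page.insert_text((72, 72), "Hello World")    # place text ~1 inch from top-left
doc.save("hello.pdf")

for xref in range(1, doc.xref_length()):     # walk the cross-reference table
    print(f"object {xref}:")
    print(doc.xref_object(xref))             # the object's PDF source
```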
PDFs as Endpoints of Data Processing
PDF documents are endpoints of data processing. They are created at a certain point in time, based on the production data of a company, be it management information, cost information, employees, et cetera, et cetera.
[00:06:03]
And the reports are created at certain regular intervals, normally monthly, semi-annually, stuff like this. They are used for reporting how my company is doing, if I'm a shareholder-based enterprise. The SEC, for example, is interested in this. I have to inform my shareholders, et cetera. All this goes into the PDFs, and those PDFs are stored somewhere.
What a PDF is Not
Now, what a PDF is not. It is not meant, as I mentioned in the beginning, to be used for downstream processing. It's an endpoint. It is not a database. It contains no meta information about the data that it contains. The content is unstructured; it's text.
[00:07:00]
The text is to be interpreted by a human brain, hopefully capable of doing this. And as I also said, the information may not be extractable at all.
Challenges of Extracting Text from PDFs
This could be because the content consists of images, which first have to be converted via OCR, or because someone tried to prevent text extraction by providing a font that contains no back translation from the glyph, the visual appearance of a character, to the Unicode code point that caused that glyph to appear. Not every font contains this back translation. People knowledgeable about PDF internals would find this information in the ToUnicode dictionary inside the font's description. That's not always present. You can read your PDF, but you cannot extract the text, even with a clever program.
[00:08:03]
Why We Need to Extract Text from PDFs
Why are we interested in doing this? I've listed here in the left column a few of the reasons, and why it may not work. For example, you have forgotten to produce some report, and the SEC is after you and asks where it is. But your production data are not backed up in a state consistent enough to reproduce the report. That chance is lost forever. That's bad, because nobody can help you, not even Python packages. The second problem may be that you have a new CEO, and he says: well, how did we report our shareholder value, the 10-K form for example, from now on backwards? Make me some report on this, a meta report.
[00:09:08]
And now you have to go back to all those 10-K reports that you ever produced and try to extract that information. And of course you are sometimes interested, if your company expands, if you go into different geographical regions, in producing forecasts. How would we do, given how we did in the past? So, look at the right column, please.
The Requirement: Fast Access to PDF Data
All this leads to the requirement: how can I access my PDFs and produce this information fast? Something like this. This is a little side trip to explain what it means to recreate structured information from unstructured text.
[00:10:02]
Recreating Structured Information from Unstructured Text
What can this entail? The top page is actually taken from the Pandas manual, the documentation. And what you see here is a header, the blue box, the big blue box. It's a header and some accompanying text. What you could do is take this and put it in the root segment of some hierarchical database, whatever it is. Then the second blue box here is the header of a section. I have to hurry up. And this goes into this box here, and the two bullet points go into these sub-segments. And then, of course, the other example, which is easier to understand. You have a table and you have to recognize it. Where is it? What is the bounding box of that table?
[00:11:01]
How many columns are there? You have to parse this and extract it and put it, obviously, into an SQL or HDF5 database. All this requires first-class tools.
First-Class Tools for PDF Text Extraction
You have PDFs here. You have to produce structured information fast, here. And if you are unlucky, you have to do OCR on the way. I will show a few numbers, time permitting. All this has to be fast. Text extraction has to be fast. It must deliver all text detail, not just the words and the characters, et cetera, but all meta information: color, font, font size, et cetera. The green tick mark is what I'm talking about. On the next slides, the red bullet points just mention what else is required; I won't cover those here in more detail.
[00:12:05]
PyMuPDF and Tesseract OCR: Open Source Tools
The tools I'm suggesting, which I'm using and have used, are PyMuPDF and Tesseract OCR. Both are open source and freeware tools. PyMuPDF, by the way, I'm the creator, is a binding to a C library, MuPDF. MuPDF is capable of processing not only PDFs, but also XPS, EPUB formats, electronic book formats, and a few more things. Tesseract OCR is an OCR engine, which can either be used standalone, or it can be called as a subprogram in the same process by whatever application, for example by PyMuPDF.
PyMuPDF: A High-Performance Python Package
A few words about PyMuPDF.
[00:13:01]
It is a package which has been downloaded more than 30 million times by now. Its age is approximately eight years. And the goal has always been top performance, and I think we are the top-performance package in that field. And at the same time, easy to use. The main class is the document, and a document is a sequence of its pages, a Python sequence of its pages. So we can do this at the bottom here. You can say the import name is fitz, for whatever reason. And you say fitz.open on a PDF, giving me a document called doc. And then you say I want to access the first page, and this is simply doc indexed by zero. And then you say print me the text of that page, which is a method on the page that can do that.
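In code, the example on the slide looks roughly like this minimal sketch; the file name is an assumption, and recent PyMuPDF versions also accept import pymupdf as the module name.

```python
# Minimal sketch of the slide example (assumes PyMuPDF is installed and a
# file named "pandas.pdf" exists; both are assumptions).
import fitz  # PyMuPDF's traditional import name

doc = fitz.open("pandas.pdf")   # a Document behaves like a sequence of pages
page = doc[0]                   # first page
print(page.get_text())          # plain text of that page
```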
[00:14:01]
And that's what appears here. So the things that PyMuPDF can detect cover the full spectrum of what is required.
PyMuPDF Features: Text Detection, OCR, Images, Vector Graphics, Annotations
You can detect: is there text? Is there OCR text, or text that is invisible by whatever effect, either covered by an image, or written white on white, black on black, this type of thing. Are there images? Are there vector graphics, like lines, curves, circles? Are there annotations or form fields? One glimpse at the details that can be extracted alongside the text: a special format of the getText method delivers a list of dictionaries, nested dictionaries actually, at the lowest level.
[00:15:07]
Extracting Text Details with PyMuPDF
You see things like this here. It is a dictionary containing the font size and the name of the font being used. This is, by the way, the same page that I showed before, the Pandas page number zero. The text color is an sRGB integer giving the RGB color of the text. So zero is black; 255 would be, for example, blue, et cetera. Then two pieces of font information: the ascender, that is, given the baseline, how far do characters rise above that baseline, and the descender, how far do characters like a y or a g go below the baseline. Those two pieces of information.
[00:16:00]
Then comes the text itself. The origin is the starting point of that text, given in page coordinates. So it's 8.43 points from the left page border, and 129.78 points from the top page border. That's where the text starts. And then the bbox, the bounding box of that text. It's a rectangle, and as usual it's given as top-left and bottom-right coordinates, so northwest and southeast. With this information, you're well positioned. What can happen now is, as mentioned: how do I detect whether I have to OCR a page?
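Before moving on, a sketch of walking that nested dictionary with current PyMuPDF method names; the file name is an assumption, and the keys match the span details just described.

```python
# Walk the nested structure (blocks -> lines -> spans) and print span details.
import fitz

page = fitz.open("pandas.pdf")[0]           # assumed file name
info = page.get_text("dict")                # the "dict" variant of getText
for block in info["blocks"]:
    if block["type"] != 0:                  # 0 = text block, 1 = image block
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            srgb = span["color"]            # sRGB integer, 0 = black
            rgb = (srgb >> 16 & 255, srgb >> 8 & 255, srgb & 255)
            print(span["text"], span["font"], span["size"], rgb,
                  span["ascender"], span["descender"],
                  span["origin"], span["bbox"])
```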
Dynamically Invoking OCR with PyMuPDF
That's not really easy. I picked a few situations where we can decide this. For example, you have a page and you can determine that it's not empty, but if I extract the text, I get an empty string.
[00:17:03]
So something must be wrong here. I would then, with a few precautions, it's not quite that simple, invoke Tesseract OCR. Let it determine the text. It could still go wrong, of course, but let's assume it's okay. Then a second execution of get text would deliver the text. A more complicated situation is: you get the text, but you get those black diamonds with a white question mark inside. Probably everybody has seen this. This is the invalid-Unicode replacement character. In this case here, we have a lot of good text, but a few characters just have no back translation, as I mentioned before. What you can do then with PyMuPDF is take the rectangle, more or less the gray thing that you see here, hand it over to Tesseract, and ask Tesseract to please determine the missing characters.
[00:18:08]
And we have a demo program that does exactly this, and I've taken the output of that demo program for here. So the final text comes out, "The base class is ...", instead of what we had before. So this is what you can do to invoke OCR dynamically, based on need.
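Boiled down, such need-based OCR can look like the following sketch. Assumptions: Tesseract is installed and its language data can be found (for example via the TESSDATA_PREFIX environment variable), the file name is made up, and the heuristics are simplified compared with the speaker's demo.

```python
# Only call Tesseract when native extraction fails or yields U+FFFD characters.
import fitz

doc = fitz.open("scan.pdf")                     # assumed file name
for page in doc:
    text = page.get_text()
    looks_nonempty = bool(page.get_images() or page.get_drawings())
    if (not text.strip() and looks_nonempty) or chr(0xFFFD) in text:
        tp = page.get_textpage_ocr(full=True)   # run Tesseract on this page
        text = page.get_text(textpage=tp)       # re-extract from the OCR result
    print(text)
```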
PyMuPDF Performance Comparison
Some information about how fast PyMuPDF is. That's the little blue box on the left side. On the right side are other packages and products. The second column is another C library, which is three times slower. And the two large blue boxes are pure Python packages doing the same thing. So they are 20 or 35 times slower. They are, by the way, the most popular pure Python PDF packages.
[00:19:06]
All in all, PyMuPDF is capable of processing 100,000 pages with full text detail, like I showed before, in 1.25 minutes. Little bit more. I will show you later. So this is more or less wrapping up the whole thing.
Key Requirements for PDF Text Extraction
You need to be able to do full text extraction at the best possible speed, and you need to accompany that text extraction with all the information required to interpret it. You have to be able to react to OCR needs. Of course, you could always pre-process the whole thing with OCR, just to be sure, and work with what comes out. But this would be an immense waste of time and performance, and would lose information.
[00:20:07]
PyMuPDF Performance Example: Pandas Manual
Here's an example. The top box is extracting the full 3,000 pages of the Pandas manual in plain format. That means each line is just followed by the next line. And this requires 1.6 seconds for the full document, 3,000 pages of text. And if you require the full text information, it's only 30% or less slower than that. I hope you are impressed.
Selective OCR Processing
A final remark on why you should be selective with OCR processing. I've done this here. The method, uppercase OCR, I hope it's recognizable, yeah, does the OCR for a given page.
[00:21:02]
These are the instructions required using PyMuPDF. And then I took an interesting page of the Pandas manual, again with a lot of text on it, and determined the text in both cases.
OCR vs. Native Text Extraction: Time Comparison
And let's see how much time I need in either case. The first one: I do it with OCR, and I get 1.6 seconds to process that page on average. If I do it without OCR, extracting native PDF text, I only need 1.58 milliseconds. So a rough figure to memorize: OCR needs 1,000 times longer than basic text extraction. So you should avoid it whenever you can.
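The comparison he describes can be reproduced with something like this sketch; the file name and page number are assumptions, and the absolute timings will differ per machine.

```python
# Time OCR-based extraction versus native text extraction of one page.
import time
import fitz

page = fitz.open("pandas.pdf")[477]        # an arbitrary, assumed page

t0 = time.perf_counter()
tp = page.get_textpage_ocr(full=True)      # force OCR of the whole page
ocr_text = page.get_text(textpage=tp)
t1 = time.perf_counter()

native_text = page.get_text()              # native PDF text extraction
t2 = time.perf_counter()

print(f"OCR:    {t1 - t0:.3f} s")
print(f"native: {(t2 - t1) * 1000:.2f} ms")
```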
[00:22:04]
There we are.
PyMuPDF Capabilities
The green tick marks are what I talked about today, and the other bullet points are what PyMuPDF can do besides that. That's it, ladies and gentlemen.
Q&A Session
So I'm ready to answer any questions.
[Audience member]
Thank you for your talk. Please use the microphone for the questions.
Question: Extracting Tabular Data
Thank you for your presentation. My question is about one of the issues that you've mentioned in your talk, and that's extracting tabular data. So I'm just wondering if PyMuPDF, or maybe some other library, can do this, or how you would approach this problem of extracting tabular data.
[Harald Lieder]
[00:23:03]
Answer: Table Detection Requires AI
Well, thank you. You have been very, well, astute to mention this point, because table detection, you know, I have a page and now determine whether there is a table on it, cannot be done by PyMuPDF today. But if somebody tells me, look, inside this bbox is a table, only one table and nothing else, you can do it, because all the coordinates of the stuff inside the table can easily be used to determine what columns I have, whether the columns are centered, stuff like this. An additional comment on identifying content on a page, and I reiterate, PDF is a format with unstructured text, always. PDF doesn't know what it shows, it doesn't know it is a table, so somebody else has to determine it, and this is something that requires artificial intelligence and machine learning.
[00:24:02]
There are so many complicated cases, you know: you have grid lines, or you don't, or you have only some of the grid lines, or only the verticals. Sometimes you have background shading to separate rows from each other, sometimes you don't. So this is stuff for an AI tool, and the company that owns PyMuPDF today is actually investigating which one to use, but to be frank, it's not a feature contained in PyMuPDF. Okay, thank you.
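For the "somebody tells me where the table is" case, one naive sketch could collect the words inside the known bounding box and group them into columns by their left x coordinates. The file name, the rectangle, and the clustering rule here are all assumptions.

```python
# Extract words inside a known table bbox and bucket them into columns.
import fitz

page = fitz.open("report.pdf")[0]
table_rect = fitz.Rect(72, 200, 520, 400)          # assumed table bounding box

words = page.get_text("words", clip=table_rect)    # (x0, y0, x1, y1, word, ...)
columns = {}
for x0, y0, x1, y1, word, *_ in sorted(words, key=lambda w: (w[1], w[0])):
    key = round(x0 / 10) * 10                      # crude clustering of left edges
    columns.setdefault(key, []).append(word)

for left in sorted(columns):
    print(left, columns[left])
```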
[Audience member]
Question: Creating PDFs with Text Information
Thank you for your impressive talk. In our company, we also work with PDFs. Can you tell us which tools you use to pack text information into PDFs?
[Harald Lieder]
Answer: PyMuPDF Capabilities for PDF Creation
You mean, you have data from somewhere and you want to create a PDF?
[Audience member]
[00:25:03]
Yeah.
[Harald Lieder]
You can do this with PyMuPDF? Of course. There are several ways to do this in PyMuPDF. You can output single lines of text, providing position information, like: start the text here and make sure the text doesn't exceed this right border. Or you can provide a text box, a rectangle which should contain the text; you then provide it with a string, and some automatism distributes the text inside that box and gives you information about how much space was left unused, and it can also tell you, of course, how much space would be required to fit in all the text if you provided more text than would fit. You can then reduce the font size, or enlarge the text box, stuff like this. And a third method is that you can provide HTML source, with all its structuring information, whatever it is, and then ask PyMuPDF to use this HTML and convert it to a PDF.
[00:26:14]
These are the possibilities.
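As an illustration of the first two approaches (a sketch only; the text, coordinates, and file name are made up, and the HTML route is not shown):

```python
# Positioned text and a text box. insert_textbox returns the leftover vertical
# space in the rectangle; a negative value means the text did not fit.
import fitz

doc = fitz.open()                         # new, empty PDF
page = doc.new_page()

page.insert_text((72, 72), "A single positioned line of text", fontsize=11)

rect = fitz.Rect(72, 120, 400, 220)       # the box that should contain the text
leftover = page.insert_textbox(
    rect,
    "A longer paragraph that is wrapped automatically inside the rectangle.",
    fontsize=11,
)
print("unused height:", leftover)

doc.save("created.pdf")
```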
[Audience member]
Question: Confidence Level for OCR
Thank you. Hello, so thank you for this talk. I have a question about the OCR that you mentioned. Every time you use OCR, there's some possibility that the assumption about what you see is wrong. So do you return any coefficient for the confidence level of every parsed word, or sentence, or line?
[Harald Lieder]
Answer: Pyramidal Approach to OCR Decision Making
Okay, actually, it's a pyramidal approach. So first of all, you check, do I have text, or do I have an image covering the whole page?
[00:27:02]
And if I don't have text, but that type of image, then you can assume it needs OCR and just try it. Or you have neither of these, but the page isn't empty either. Then you would check: do I have vector graphics which decompose down to rectangles approximately covering one character? For example, a height of 10 points would be font size 10, something like this. And you think: okay, it could be that someone just provided graphics that imitate text. Again, use OCR in this case. Or, if neither of these is the case, you can look at: do I have annotations on the page? Or do I have form fields on the page? By the way, this is one of the easiest ways to get to structured information, because if you have form fields, you have keys and you have values.
[00:28:02]
Okay, this is more or less the approach. And this sequence, if I remember it right, is how you would go about determining: do I need OCR, or is it better to just leave it, because the page is really blank or not interesting?
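The form-field case he mentions is indeed the simplest route to structured data; a short sketch, with the file name as an assumption:

```python
# Read form fields as key/value pairs.
import fitz

doc = fitz.open("form.pdf")                # assumed file name
fields = {}
for page in doc:
    for widget in page.widgets():          # iterate the form fields on a page
        fields[widget.field_name] = widget.field_value

print(fields)                              # structured data: field name -> value
```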
[Audience member]
Question: Handling Rotated Text in PDFs
Okay, and sometimes in companies, when they create PDFs for files that were initially printed, they just scan the printed pages, and it's just a huge image on the PDF page. That's how they would do it, yes. Yes, and sometimes this image is a little bit rotated because of the printer. So would this library work with slightly rotated text?
Answer: Tesseract OCR and PyMuPDF Capabilities
Yes, yes, yes.
[Harald Lieder]
It depends on the OCR package, of course. Tesseract can do it. It would give you the words tilted by some angle. And if you determine this, and you can determine this in PyMuPDF, you would then simply untilt the whole thing and then go ahead. That's possible.
[Audience member]
[00:29:00]
All right, awesome, thank you.
[Harald Lieder]
You're welcome.
[Audience member]
Question: Comparison with LangChain
Hi, thank you for the presentation. Maybe you answered that question already during the presentation; I was going through the documentation to find out as well. So, maybe you know, LangChain is a pretty recent tool that uses AI to scan through a lot of things and documents and helps automate things with AI. And they have a PDF reader as well. Have you had a chance to compare your tool to theirs in terms of performance and efficiency?
[Harald Lieder]
Answer: PyMuPDF Performance Comparison
I didn't completely understand it acoustically.
[Audience member]
So yeah, I was looking through LangChain. Basically, they have a PyPDF loader, which would open a PDF file, scan through it, and return the characters. And from what I see, they're using something like PyPDF. Have you compared the performance?
[Harald Lieder]
[00:30:02]
One of the large blue columns was PyPDF.
[Audience member]
Yeah, probably. And I just found out that they're actually using this. So, okay. I guess we'll see after the presentation online and we'll be able to see that again.
[Harald Lieder]
Yeah, I will share it, of course.
[Audience member]
End of Q&A Session
Thank you. Unfortunately, we don't have more time for questions. If you have more questions, you can find Harald in Codespace or on Discord.
Conclusion
Thanks again, Harald. Great talk. Thank you very much for your attention. Thank you.