PDF files are often used, especially in these two scenarios:
- Downloading reference materials, such as reports and documents
- Sharing read-only data, which makes distribution easy while the source files stay with the author
Scenarios and modules
As a result, there are two common kinds of requirements for PDF files:
- Processing the file itself. These are page-level operations, such as merging / splitting PDF pages, encrypting / decrypting, and adding / removing watermarks;
- Processing the file content. These are content-level operations, such as extracting text, table data, charts, etc.
At present, Python has three main modules for processing PDF files:
- PyPDF2: a mature module, last updated 2 years ago at the time of writing. It is well suited to page-level operations, but its text extraction is weak.
- PDFMiner: good at text extraction. The original main branch is no longer maintained and has been superseded by pdfminer.six.
- pdfplumber: a content extraction tool built on pdfminer.six. It is easier to use and adds features such as table extraction.
In practice, choose the module according to the task. For page-level operations, use PyPDF2. For content extraction, prefer pdfplumber.
Corresponding module installation:
- pip install pypdf2
- pip install pdfminer.six
- pip install pdfplumber
The following sections demonstrate the three modules, organized by use case.
PyPDF2
PyPDF2 mainly handles page-level operations, such as:
- Get basic information of PDF document
- PDF segmentation and merging
- Rotation and sorting of PDF
- PDF watermarking and de watermarking
- PDF encryption and decryption
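The two core classes of PyPDF2 are PdfFileReader and PdfFileWriter, which handle reading and writing PDF files respectively. These are the class names in PyPDF2 1.x and 2.x; PyPDF2 3.0 removed them in favor of PdfReader and PdfWriter, so the examples below need a release older than 3.0.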
Get basic information of PDF document
import pathlib

from PyPDF2 import PdfFileReader

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes are required because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")

with open(f_path, 'rb') as f:
    pdf = PdfFileReader(f)
    info = pdf.getDocumentInfo()        # document metadata
    cnt_page = pdf.getNumPages()        # number of pages
    is_encrypt = pdf.getIsEncrypted()   # True if the file is encrypted
    print(f'''
    Author: {info.author}
    Creator: {info.creator}
    Producer: {info.producer}
    Subject: {info.subject}
    Title: {info.title}
    Page count: {cnt_page}
    Encrypted: {is_encrypt}
    ''')
PDF segmentation and merging
import pathlib

from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_split_merge.pdf')
out_path_1 = path.joinpath('002pdf_split_half_front.pdf')
out_path_2 = path.joinpath('002pdf_split_half_back.pdf')

# Split the document into two parts
with open(f_path, 'rb') as f, open(out_path_1, 'wb') as f_out1, open(out_path_2, 'wb') as f_out2:
    pdf = PdfFileReader(f)
    pdf_out1 = PdfFileWriter()
    pdf_out2 = PdfFileWriter()
    cnt_pages = pdf.getNumPages()
    print(f'{cnt_pages} pages in total')
    for i in range(cnt_pages):
        if i <= cnt_pages // 2:
            pdf_out1.addPage(pdf.getPage(i))
        else:
            pdf_out2.addPage(pdf.getPage(i))
    pdf_out1.write(f_out1)
    pdf_out2.write(f_out2)

    # Merge the two parts again, this time with the second part placed first.
    # This stays inside the outer "with" so the source file is still open while writing.
    with open(out_path, 'wb') as f_out:
        cnt_f, cnt_b = pdf_out1.getNumPages(), pdf_out2.getNumPages()
        pdf_out = PdfFileWriter()
        for i in range(cnt_b):
            pdf_out.addPage(pdf_out2.getPage(i))
        for i in range(cnt_f):
            pdf_out.addPage(pdf_out1.getPage(i))
        pdf_out.write(f_out)
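The page bookkeeping in the split-and-merge example can be checked without any PDF file. The sketch below is a minimal illustration in plain Python; the function names split_indices and swapped_order are hypothetical and not part of PyPDF2. It mirrors the `i <= cnt_pages // 2` rule from the example, which means that for any non-empty document the front part ends up larger than the back part.

```python
def split_indices(cnt_pages):
    """Return (front, back) page index lists using the i <= cnt_pages // 2 rule."""
    front = [i for i in range(cnt_pages) if i <= cnt_pages // 2]
    back = [i for i in range(cnt_pages) if i > cnt_pages // 2]
    return front, back


def swapped_order(cnt_pages):
    """Return the page order after merging with the back part placed first."""
    front, back = split_indices(cnt_pages)
    return back + front


if __name__ == "__main__":
    print(split_indices(10))
    print(swapped_order(5))
```

If an even split is wanted instead, changing `<=` to `<` in the condition is one possible adjustment.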
Rotation and sorting of PDF
import pathlib

from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_rotate.pdf')

with open(f_path, 'rb') as f, open(out_path, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_out = PdfFileWriter()
    # First page, rotated 90 degrees clockwise
    page = pdf.getPage(0).rotateClockwise(90)
    pdf_out.addPage(page)
    # Reorder: the third page (index 2) is placed before the second page
    pdf_out.addPage(pdf.getPage(2))
    # Second page, rotated 90 degrees counterclockwise
    page = pdf.getPage(1).rotateCounterClockwise(90)
    pdf_out.addPage(page)
    pdf_out.write(f_out)
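A PDF stores rotation per page as an angle that must be a multiple of 90 degrees, which is why the example rotates in steps of 90. The following plain-Python sketch shows only the angle arithmetic; the helper name rotated is hypothetical, and it models the bookkeeping without touching a PDF file.

```python
def rotated(current, angle, clockwise=True):
    """Return the new page rotation in degrees, normalised to the range 0..359."""
    if angle % 90 != 0:
        raise ValueError("rotation angle must be a multiple of 90")
    delta = angle if clockwise else -angle
    return (current + delta) % 360


if __name__ == "__main__":
    print(rotated(0, 90))                   # clockwise quarter turn
    print(rotated(0, 90, clockwise=False))  # counterclockwise quarter turn
```

A counterclockwise quarter turn and a clockwise three-quarter turn give the same stored angle, so the two directions are interchangeable.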
PDF watermarking and de watermarking
Adding a picture watermark essentially overlays a watermark page with a transparent background on each page of the document. This can be done with the page's mergePage method.
import pathlib

from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
wm_path = path.joinpath('watermark.pdf')
en_path = path.joinpath('002pdf_with_watermark_en.pdf')  # not used below
out_path = path.joinpath('002pdf_with_watermark.pdf')

with open(f_path, 'rb') as f, open(wm_path, 'rb') as f_wm, open(out_path, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_wm = PdfFileReader(f_wm)
    pdf_out = PdfFileWriter()
    wm_cn_page = pdf_wm.getPage(0)  # Chinese watermark
    wm_en_page = pdf_wm.getPage(1)  # English watermark (not used below)
    cnt_pages = pdf.getNumPages()
    for i in range(cnt_pages):
        page = pdf.getPage(i)
        page.mergePage(wm_cn_page)  # overlay the watermark on the page
        pdf_out.addPage(page)
    pdf_out.write(f_out)
Removing a watermark is more complex and has to be analyzed case by case. Because the watermark may be text, a picture, or a combination of both, the key is to identify its distinguishing features.
Three common approaches to removing watermarks:
- Find the characteristic words and replace them. This works for English documents, but not for CJK text such as Chinese.
- Convert each PDF page to a picture and remove the watermark with an image algorithm. This destroys the original information structure of the file.
- Find all elements that match the size and position of the watermark and delete them. This is the recommended approach.
The third method works best, but a document with complex watermarks will test your patience.
You have to identify the drawing commands one by one and check the result after each replacement until the watermark is gone.
Even then, the same feature pattern may not work on all remaining pages, because the PDF may have been watermarked by several people using different methods.
Therefore, there is no method of removing watermarks that is 100% safe, effective (with no accidental loss of content), and universal.
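The third idea, matching elements by size and position, can be sketched as a simple filter. The example below is only an illustration under stated assumptions: page elements are modelled as plain dicts with pdfplumber-style keys (x0, top, x1, bottom), the sample data is made up, and the function names, the tolerance, and the watermark box are all hypothetical. Deleting the matched objects from a real PDF content stream is a separate and harder step that this sketch does not cover.

```python
def matches_box(elem, box, tol=2.0):
    """True if elem's bounding box equals box within tol points on every side."""
    return all(abs(elem[k] - box[k]) <= tol for k in ("x0", "top", "x1", "bottom"))


def drop_watermarks(elements, box, tol=2.0):
    """Keep every element whose bounding box does not match the watermark box."""
    return [e for e in elements if not matches_box(e, box, tol)]


# Made-up sample data: two body elements and one watermark-sized element
elements = [
    {"name": "title", "x0": 72, "top": 60, "x1": 520, "bottom": 90},
    {"name": "stamp", "x0": 150.5, "top": 300, "x1": 450, "bottom": 500.8},
    {"name": "body", "x0": 72, "top": 110, "x1": 520, "bottom": 700},
]
# Assumed watermark position, e.g. measured once on a sample page
watermark_box = {"x0": 150, "top": 300, "x1": 450, "bottom": 500}

kept = drop_watermarks(elements, watermark_box)
print([e["name"] for e in kept])
```

The tolerance matters because the same watermark is rarely placed at exactly the same coordinates on every page.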
Adding and removing watermarks is essentially a contest between attack and defense.
For example, once a tool publishes a watermark removal feature, the party adding watermarks can study the removal method and work around it.
Finally, everyone should respect copyright.
Apart from learning purposes, any real use should follow the rules set by the content creator.
PDF encryption and decryption
A PDF can have two passwords: a user password, which is needed to open the document, and an owner password, which controls permissions such as editing.
PyPDF2 provides basic encryption. It keeps honest people honest but will not stop a determined attacker.
Once a PDF has been opened, its pages can be copied into a new file, and that new file is no longer restricted by the owner password and can be modified.
import pathlib

from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path_encrypt = path.joinpath('002pdf_encrypt.pdf')
out_path_decrypt = path.joinpath('002pdf_decrypt.pdf')

# Copy all pages into a new file and encrypt it
with open(f_path, 'rb') as f, open(out_path_encrypt, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_out = PdfFileWriter()
    cnt_pages = pdf.getNumPages()
    for i in range(cnt_pages):
        page = pdf.getPage(i)
        pdf_out.addPage(page)
    # First argument is the user password, owner_pwd is the owner password
    pdf_out.encrypt('123456', owner_pwd='654321')
    pdf_out.write(f_out)

# Reread the encrypted file and generate the decrypted file
with open(out_path_encrypt, 'rb') as f, open(out_path_decrypt, 'wb') as f_out:
    pdf = PdfFileReader(f)
    if not pdf.isEncrypted:
        print('The file is not encrypted')
    else:
        success = pdf.decrypt('123456')
        if not success:
            print('Wrong password')
        else:
            pdf_out = PdfFileWriter()
            pdf_out.appendPagesFromReader(pdf)
            pdf_out.write(f_out)
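In PyPDF2 1.x and 2.x, decrypt() returns an integer rather than a plain boolean: 0 if the password failed, 1 if it matched the user password, and 2 if it matched the owner password. The helper below is a small sketch that makes this explicit; the names DECRYPT_RESULTS and describe_decrypt are hypothetical and not part of PyPDF2.

```python
# Meaning of the integer returned by PdfFileReader.decrypt() in PyPDF2 1.x/2.x
DECRYPT_RESULTS = {
    0: "wrong password",
    1: "matched the user password",
    2: "matched the owner password",
}


def describe_decrypt(code):
    """Translate a decrypt() return value into a readable message."""
    return DECRYPT_RESULTS.get(int(code), "unknown result")


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, describe_decrypt(code))
```

In the example above, the user password '123456' is passed to decrypt(), so a successful call would be expected to fall into the "user password" case.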
pdfminer.six
pdfminer.six has a steep learning curve because it requires an understanding of the structure of PDF documents. It is best suited to working with the PDF document model directly.
pdfminer.six is rarely used directly in everyday work. Here we only demonstrate the basic operation of reading document content:
import pathlib

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBox, LTFigure, LTImage
from pdfminer.converter import PDFPageAggregator

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")

with open(f_path, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()  # layout tree of the current page
        for x in layout:
            # Text object
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())
            # Picture object
            if isinstance(x, LTImage):
                print('Found a picture here')
            # Figure object
            if isinstance(x, LTFigure):
                print('Found a figure object here')
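One detail worth knowing when walking the layout: in pdfminer.six an LTFigure is itself a container, and embedded images usually appear as LTImage children inside it rather than at the top level of the page. A recursive walk handles this. The sketch below uses tiny stand-in classes (Node, Image, and Container are hypothetical and are not pdfminer classes) so that it runs without a PDF file; with a real layout, the same walk function could be applied to the object returned by device.get_result().

```python
class Node:
    """Stand-in for a leaf layout object (hypothetical, not a pdfminer class)."""

    def __init__(self, name):
        self.name = name


class Image(Node):
    """Stand-in for an image object."""


class Container(Node):
    """Stand-in for a layout object that holds child objects."""

    def __init__(self, name, children=()):
        super().__init__(name)
        self.children = list(children)

    def __iter__(self):
        return iter(self.children)


def walk(obj):
    """Yield obj and, depth first, every object nested inside it."""
    yield obj
    if isinstance(obj, (str, bytes)):
        return  # strings are iterable but are not containers
    try:
        children = iter(obj)
    except TypeError:
        return  # leaf object, nothing to descend into
    for child in children:
        yield from walk(child)


# A page with one text node and a figure that nests two images
page = Container("page", [
    Node("text"),
    Container("figure", [Image("img0"), Container("inner", [Image("img1")])]),
])
images = [o.name for o in walk(page) if isinstance(o, Image)]
print(images)
```

With real pdfminer.six objects, the isinstance check would use LTImage instead of the stand-in Image class.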
Although pdfminer.six is hard to learn, complex cases leave no alternative. Among open source modules, its PDF support is probably the most complete at present.
pdfplumber, described next, is built on pdfminer.six and is easier to use.
pdfplumber
Compared with pdfminer.six, pdfplumber provides a more convenient interface for extracting PDF content.
It covers common operations in daily work, such as:
- Extract PDF content and save it to txt file
- Extract tables from PDF to Excel
- Extract pictures from PDF
- Extract charts from PDF
Extract PDF content and save it to txt file
import pathlib

import pdfplumber

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_out.txt')

with pdfplumber.open(f_path) as pdf, open(out_path, 'a', encoding='utf-8') as txt:
    for page in pdf.pages:
        # extract_text() may return None for a page without text
        textdata = page.extract_text() or ''
        txt.write(textdata + '\n')
Extract tables from PDF to Excel
import pathlib

import pdfplumber
from openpyxl import Workbook

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_excel.xlsx')

wb = Workbook()
sheet = wb.active
with pdfplumber.open(f_path) as pdf:
    # Zero-based indexes 19 to 21, i.e. pages 20 to 22
    for i in range(19, 22):
        page = pdf.pages[i]
        table = page.extract_table()
        if not table:  # extract_table() returns None when no table is found
            continue
        for row in table:
            sheet.append(row)
wb.save(out_path)
The example above uses openpyxl to create the Excel file. openpyxl will be introduced in a separate article later.
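Tables returned by extract_table() are lists of rows. Cells that pdfplumber could not fill come back as None, and multi-line cells may contain newline characters, so a small clean-up pass before writing is often useful. The sketch below assumes that shape, uses made-up sample data, and writes CSV with the standard library instead of openpyxl so that it runs without extra packages; clean_row is a hypothetical helper.

```python
import csv
import io


def clean_row(row):
    """Replace None with '', flatten newlines, and trim whitespace in each cell."""
    return ["" if cell is None else str(cell).replace("\n", " ").strip()
            for cell in row]


# Made-up sample in the shape that extract_table() returns
table = [
    ["Region", "Stores\n(2019)", None],
    ["East", " 120 ", "up"],
    [None, "95", None],
]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
for row in table:
    writer.writerow(clean_row(row))
csv_text = buf.getvalue()
print(csv_text)
```

The same clean_row call could be applied to each row before sheet.append(row) when writing to Excel.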
Extract pictures from PDF
import pathlib

import pdfplumber
from PIL import Image  # Pillow; pdfplumber's to_image() builds on it

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-Study on the trend of Chinese community under the influence of epidemic situation-Airy.pdf')
out_path = path.joinpath('002pdf_images.png')

with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[10]
    # Render the whole page as a picture
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Save the data of each image embedded in the page
    imgs = page.images
    for i, img in enumerate(imgs):
        size = img['width'], img['height']
        # The stream data is written as-is; depending on how the image is
        # encoded in the PDF, the result may not be a valid PNG file
        data = img['stream'].get_data()
        out_path = path.joinpath(f'002pdf_images_{i}.png')
        with open(out_path, 'wb') as fimg_out:
            fimg_out.write(data)
The example above relies on PIL (Pillow) to process pictures.
Extract charts from PDF
Unlike images, charts here are graphics generated from data, such as histograms and pie charts.
import pathlib

import pdfplumber
from PIL import Image  # Pillow; pdfplumber's to_image() builds on it

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_figures.png')

with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[7]
    # Render the whole page as a picture
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Crop each figure out of the page and save it separately
    figures = page.figures
    for i, fig in enumerate(figures):
        size = fig['width'], fig['height']
        crop = page.crop((fig['x0'], fig['top'], fig['x1'], fig['bottom']))
        img_crop = crop.to_image()
        out_path = path.joinpath(f'002pdf_figures_{i}.png')
        img_crop.save(out_path, format='png')
    # Outline words, images and figures on the page picture for inspection
    im.draw_rects(page.extract_words(), stroke='yellow')
    im.draw_rects(page.images, stroke='blue')
    im.draw_rects(page.figures)
im  # show in notebook
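Cropping can fail when a figure's bounding box extends slightly past the page edge, which some pdfplumber versions may reject. One way to guard against this is to clamp the box to the page before cropping. The sketch below is plain Python; clamp_bbox is a hypothetical helper, the page size is an assumed A4 page in points, and boxes are (x0, top, x1, bottom) tuples as in the crop() call above.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink bbox so that it lies inside page_bbox; both are (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


if __name__ == "__main__":
    page_bbox = (0, 0, 595, 842)      # assumed A4 page size in points
    figure_bbox = (-5, 10, 700, 900)  # made-up box that overflows the page
    print(clamp_bbox(figure_bbox, page_bbox))
```

With a real page, the clamped tuple would be passed to page.crop() in place of the raw figure coordinates.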
Summary
This article covered the common usage scenarios for PDF files and the three main Python modules for processing them.
The PDF standard itself was led by Adobe.
Usually there is no need to consult the specification. In complex scenarios, though, especially when a module offers no direct support, the only option is to read it. The specification is public and can be found by searching for the keyword: pdf_reference_1-7.pdf.