Practical ways to handle PDF files with Python

If you don't plan your own life, plenty of people will plan it for you around what they need done.

PDF files are everywhere, and two scenarios are especially common:

  • Downloading reference material, such as reports and documents
  • Sharing read-only data, so that it is easy to distribute while the source files stay with the author

Scenarios and modules

These scenarios lead to two common kinds of requirements for PDF files:

  • Page-level operations on the file itself, such as merging or splitting pages, encrypting or decrypting, and adding or removing watermarks;
  • Content-level operations on what the file contains, such as extracting text, table data, and charts.

At the time of writing, Python has three main modules for processing PDF:

  • PyPDF2: mature, although its last update was about two years ago. It suits page-level operations; its text extraction is poor.
  • PDFMiner: good at text extraction. The original project is no longer maintained and has been replaced by pdfminer.six.
  • pdfplumber: a content extraction tool built on pdfminer.six. It is easier to use and adds features such as table extraction.

In practice, choose the module by the type of task: PyPDF2 for page-level operations, pdfplumber for content extraction.

Install the modules with pip:

  • pip install pypdf2
  • pip install pdfminer.six
  • pip install pdfplumber

The following sections demonstrate the three modules, organized by use case.

PyPDF2

PyPDF2 is mainly used for page-level operations, such as:

  • Getting basic information about a PDF document
  • PDF splitting and merging
  • PDF rotation and reordering
  • PDF watermarking and watermark removal
  • PDF encryption and decryption

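The two core classes of PyPDF2 are PdfFileReader and PdfFileWriter, which handle reading and writing PDF files respectively. PyPDF2 3.0 renamed them to PdfReader and PdfWriter, so the examples below, which use the older names, need a PyPDF2 release before 3.0.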

Get basic information about a PDF document

import pathlib
from PyPDF2 import PdfFileReader

# The sample data sits two directory levels above the current working directory
path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
with open(f_path, 'rb') as f:
    pdf = PdfFileReader(f)
    info = pdf.getDocumentInfo()
    cnt_page = pdf.getNumPages()
    is_encrypt = pdf.getIsEncrypted()
print(f'''
Author: {info.author}
Creator: {info.creator}
Producer: {info.producer}
Subject: {info.subject}
Title: {info.title}
Page count: {cnt_page}
Encrypted: {is_encrypt}
''')

PDF splitting and merging

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_split_merge.pdf')
out_path_1 = path.joinpath('002pdf_split_half_front.pdf')
out_path_2 = path.joinpath('002pdf_split_half_back.pdf')
# Split the document into a front half and a back half
with open(f_path, 'rb') as f, open(out_path_1, 'wb') as f_out1, open(out_path_2, 'wb') as f_out2:
    pdf = PdfFileReader(f)
    pdf_out1 = PdfFileWriter()
    pdf_out2 = PdfFileWriter()
    cnt_pages = pdf.getNumPages()
    print(f'{cnt_pages} pages in total')
    for i in range(cnt_pages):
        if i < cnt_pages // 2:
            pdf_out1.addPage(pdf.getPage(i))
        else:
            pdf_out2.addPage(pdf.getPage(i))
    pdf_out1.write(f_out1)
    pdf_out2.write(f_out2)
    # Merge the two halves again with the back half first. This stays inside
    # the outer with block so the source file is still open while pages are written.
    with open(out_path, 'wb') as f_out:
        cnt_f, cnt_b = pdf_out1.getNumPages(), pdf_out2.getNumPages()
        pdf_out = PdfFileWriter()
        for i in range(cnt_b):
            pdf_out.addPage(pdf_out2.getPage(i))
        for i in range(cnt_f):
            pdf_out.addPage(pdf_out1.getPage(i))
        pdf_out.write(f_out)
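
The index arithmetic behind the split and the swap can be checked without any PDF library. The following is only a minimal sketch of one convention, in which the front half gets cnt_pages // 2 pages; split_indices and swapped_order are hypothetical helper names, not part of PyPDF2.

```python
def split_indices(cnt_pages):
    """Return (front, back) lists of zero-based page indices.

    The front half gets cnt_pages // 2 pages, so an odd page count
    leaves the extra page in the back half.
    """
    half = cnt_pages // 2
    return list(range(half)), list(range(half, cnt_pages))


def swapped_order(cnt_pages):
    """Page order after merging the back half in front of the front half."""
    front, back = split_indices(cnt_pages)
    return back + front


print(split_indices(5))   # ([0, 1], [2, 3, 4])
print(swapped_order(5))   # [2, 3, 4, 0, 1]
```

Working out the index lists first makes it easy to see which half receives the middle page before any file is written.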

PDF rotation and reordering

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_rotate.pdf')

with open(f_path, 'rb') as f, open(out_path, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_out = PdfFileWriter()
    # First page, rotated 90 degrees clockwise
    page = pdf.getPage(0).rotateClockwise(90)
    pdf_out.addPage(page)
    # Put the third page (index 2) ahead of the second page
    pdf_out.addPage(pdf.getPage(2))
    # Second page (index 1), rotated 90 degrees counter-clockwise
    page = pdf.getPage(1).rotateCounterClockwise(90)
    pdf_out.addPage(page)
    pdf_out.write(f_out)
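
Page rotation in a PDF is expressed in multiples of 90 degrees, so a counter-clockwise turn of 90 ends up in the same orientation as a clockwise turn of 270. The sketch below illustrates that arithmetic for the page plan used above; effective_rotation is a hypothetical helper, and the source pages are assumed to start unrotated.

```python
def effective_rotation(current, delta):
    """Combine a page's current rotation with a further turn, in degrees.

    A positive delta is clockwise, a negative delta counter-clockwise.
    """
    if delta % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    return (current + delta) % 360


# (source page index, turn in degrees) in output order, as in the example above
plan = [(0, 90), (2, 0), (1, -90)]
# Assumption: every source page starts at rotation 0
result = [(index, effective_rotation(0, turn)) for index, turn in plan]
print(result)  # [(0, 90), (2, 0), (1, 270)]
```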

PDF watermarking and watermark removal

Adding a picture watermark really means overlaying a page that holds an image with a transparent background onto each page of the document. The page object's mergePage method does this.

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
wm_path = path.joinpath('watermark.pdf')
en_path = path.joinpath('002pdf_with_watermark_en.pdf')  # not used in this example
out_path = path.joinpath('002pdf_with_watermark.pdf')

with open(f_path, 'rb') as f, open(wm_path, 'rb') as f_wm, open(out_path, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_wm = PdfFileReader(f_wm)
    pdf_out = PdfFileWriter()
    # watermark.pdf holds one watermark per page: Chinese first, then English
    wm_cn_page = pdf_wm.getPage(0)
    wm_en_page = pdf_wm.getPage(1)  # not used in this example
    cnt_pages = pdf.getNumPages()
    for i in range(cnt_pages):
        page = pdf.getPage(i)
        # Overlay the Chinese watermark on every page
        page.mergePage(wm_cn_page)
        pdf_out.addPage(page)
    pdf_out.write(f_out)

Removing a watermark is more complicated and has to be analyzed case by case. A watermark may be text, an image, or a combination of the two, so the key is to identify its distinguishing features.

Three common approaches to watermark removal:

  1. Find a characteristic word and replace it. This works for English documents, but not for CJK text such as Chinese.
  2. Convert each PDF page to an image and remove the watermark with an image-processing algorithm. This destroys the document's original information structure.
  3. Find every element that matches the watermark's size and position and delete it. This is the recommended approach.

The third approach works best, but a document with complex watermarks will test your patience.

You have to identify the drawing operators one by one, replacing them and checking the result, until the watermark is gone.
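
The selection step of the third approach can be sketched as a filter over page elements. This is only an illustration under stated assumptions: each element is a plain dict with x0, top, width and height keys (loosely modeled on the object dicts pdfplumber exposes), the sample values are made up, and the helper names are hypothetical. It decides which elements to keep; it does not rewrite the PDF itself.

```python
def looks_like_watermark(obj, wm_size, wm_pos, tol=1.0):
    """True if obj matches the watermark's size and position within tol points."""
    width, height = wm_size
    x0, top = wm_pos
    return (abs(obj['width'] - width) <= tol
            and abs(obj['height'] - height) <= tol
            and abs(obj['x0'] - x0) <= tol
            and abs(obj['top'] - top) <= tol)


def drop_watermarks(objects, wm_size, wm_pos, tol=1.0):
    """Keep only the elements that do not match the watermark features."""
    return [o for o in objects
            if not looks_like_watermark(o, wm_size, wm_pos, tol)]


# Made-up sample elements for illustration
page_objects = [
    {'name': 'logo', 'x0': 72.0, 'top': 40.0, 'width': 100.0, 'height': 30.0},
    {'name': 'watermark', 'x0': 150.0, 'top': 300.0, 'width': 300.0, 'height': 200.0},
    {'name': 'chart', 'x0': 72.0, 'top': 500.0, 'width': 300.0, 'height': 200.0},
]
kept = drop_watermarks(page_objects, wm_size=(300.0, 200.0), wm_pos=(150.0, 300.0))
print([o['name'] for o in kept])  # ['logo', 'chart']
```

Matching on both size and position keeps the chart, which happens to share the watermark's size but sits elsewhere on the page.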

Even then, the same feature pattern may not clear every remaining page, because a PDF may have been watermarked by several people using several different methods.

So there is no watermark removal method that is 100% safe, effective (the watermark is deleted cleanly), and universal.

Adding and removing watermarks is essentially a contest between attack and defense.

For example, once a tool publishes a watermark removal feature, whoever applies the watermarks can study the removal method and work around it.

Finally, everyone should respect copyright.

Outside of learning, any real use should follow the rules set by the content creator.

PDF encryption and decryption

A PDF can carry two passwords: a user password and an owner password.

PyPDF2 provides basic encryption which, as the saying goes, "keeps out the honest but not the dishonest".

If someone opens the PDF and copies its content into a new file, the new file is no longer constrained by the owner password and can be modified.

import pathlib
from PyPDF2 import PdfFileReader, PdfFileWriter

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path_encrypt = path.joinpath('002pdf_encrypt.pdf')
out_path_decrypt = path.joinpath('002pdf_decrypt.pdf')

# Copy every page and write an encrypted file
with open(f_path, 'rb') as f, open(out_path_encrypt, 'wb') as f_out:
    pdf = PdfFileReader(f)
    pdf_out = PdfFileWriter()
    cnt_pages = pdf.getNumPages()
    for i in range(cnt_pages):
        page = pdf.getPage(i)
        pdf_out.addPage(page)
    # The first argument is the user password
    pdf_out.encrypt('123456', owner_pwd='654321')
    pdf_out.write(f_out)
# Read the encrypted file back and write a decrypted copy
with open(out_path_encrypt, 'rb') as f, open(out_path_decrypt, 'wb') as f_out:
    pdf = PdfFileReader(f)
    if not pdf.isEncrypted:
        print('The file is not encrypted')
    else:
        # decrypt() returns 0 when the password matches neither password
        success = pdf.decrypt('123456')
        if not success:
            print('Wrong password, the file could not be decrypted')
        else:
            pdf_out = PdfFileWriter()
            pdf_out.appendPagesFromReader(pdf)
            pdf_out.write(f_out)

pdfminer.six

pdfminer.six assumes some understanding of the structure of PDF documents, so its barrier to entry is high. It suits work on the PDF document model itself.

PDFMiner is rarely used directly in everyday work, so only basic content extraction is demonstrated here:

import pathlib
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBox, LTFigure, LTImage
from pdfminer.converter import PDFPageAggregator

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")

with open(f_path, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    # The aggregator collects the layout objects of each processed page
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        for x in layout:
            # Text box
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())
            # Image
            if isinstance(x, LTImage):
                print('Found an image')
            # Figure
            if isinstance(x, LTFigure):
                print('Found a figure object')

Although pdfminer has a high barrier to entry, complex cases eventually call for it. Among the open source modules, it probably has the most complete PDF support at present.

pdfplumber, described next, is built on pdfminer.six and lowers that barrier.

pdfplumber

Compared with pdfminer.six, pdfplumber provides a more convenient interface for extracting PDF content.

Common operations in daily work include:

  • Extract PDF content and save it to a txt file
  • Extract tables from PDF to Excel
  • Extract pictures from PDF
  • Extract charts from PDF

Extract PDF content and save it to a txt file

import pathlib
import pdfplumber

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_out.txt')

# 'w' avoids duplicated text on a second run; UTF-8 keeps non-ASCII text intact
with pdfplumber.open(f_path) as pdf, open(out_path, 'w', encoding='utf-8') as txt:
    for page in pdf.pages:
        # extract_text() can return None for a page without text
        textdata = page.extract_text() or ''
        txt.write(textdata + '\n')

Extract tables from PDF to Excel

import pathlib
import pdfplumber
from openpyxl import Workbook

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_excel.xlsx')

wb = Workbook()
sheet = wb.active
with pdfplumber.open(f_path) as pdf:
    # Indices 19 to 21, that is, pages 20 to 22 of the document
    for i in range(19, 22):
        page = pdf.pages[i]
        table = page.extract_table()
        # extract_table() returns None when no table is found on the page
        if table is None:
            continue
        for row in table:
            sheet.append(row)
wb.save(out_path)

The example above uses openpyxl to create the Excel file; openpyxl will be covered in a separate article.
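
Rows returned by extract_table are lists of cell strings, where an empty cell is None and a cell may contain line breaks. Before appending them to a sheet it can help to normalize them. The following is a minimal sketch with hypothetical helper names and made-up sample data; joining wrapped lines with a space is an assumption that suits English text better than Chinese.

```python
def clean_row(row):
    """Turn None cells into '' and collapse whitespace and line breaks in a cell."""
    return ['' if cell is None else ' '.join(cell.split()) for cell in row]


def clean_table(table):
    """Clean every row and drop rows that end up completely empty."""
    if table is None:  # no table was detected on the page
        return []
    rows = [clean_row(row) for row in table]
    return [row for row in rows if any(row)]


# Made-up sample in the shape extract_table() returns
sample = [
    ['Region', 'Stores\nclosed', None],
    [None, None, None],
    ['East', '12', ' 3.5% '],
]
print(clean_table(sample))
# [['Region', 'Stores closed', ''], ['East', '12', '3.5%']]
```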

Extract pictures from PDF

import pathlib
import pdfplumber
from PIL import Image  # Pillow must be installed for page.to_image()

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-Study on the trend of Chinese community under the influence of epidemic situation-Airy.pdf')
out_path = path.joinpath('002pdf_images.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[10]
    # Render the whole page and save it as a PNG
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Then dump the stream of every image embedded in the page
    for i, img in enumerate(page.images):
        size = img['width'], img['height']
        # Raw stream bytes; they are not necessarily PNG data despite the file name
        data = img['stream'].get_data()
        img_path = path.joinpath(f'002pdf_images_{i}.png')
        with open(img_path, 'wb') as fimg_out:
            fimg_out.write(data)

The example above relies on Pillow (the maintained fork of PIL) for image handling.
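
The example above writes each image stream to a file ending in .png, but the bytes in an embedded image stream may be JPEG data or raw pixel data instead. One hedged way to pick a file extension is to look at the well-known signatures at the start of the data. guess_image_ext is a hypothetical helper, and the '.bin' fallback is just a choice made for this sketch.

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
JPEG_SIGNATURE = b'\xff\xd8\xff'


def guess_image_ext(data):
    """Guess a file extension from the first bytes of image data."""
    if data.startswith(PNG_SIGNATURE):
        return '.png'
    if data.startswith(JPEG_SIGNATURE):
        return '.jpg'
    # Unknown or raw pixel data: keep the bytes, make no claim about the format
    return '.bin'


print(guess_image_ext(b'\xff\xd8\xff\xe0' + b'\x00' * 8))  # .jpg
print(guess_image_ext(PNG_SIGNATURE + b'\x00' * 8))        # .png
print(guess_image_ext(b'\x00' * 12))                       # .bin
```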

Extract charts from PDF

Unlike images, charts here are figures generated from data, such as histograms and pie charts.

import pathlib
import pdfplumber
from PIL import Image  # Pillow must be installed for page.to_image()

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
# Double quotes, because the file name contains an apostrophe
f_path = path.joinpath("2020-Investigation report on the impact of COVID-19 epidemic on China's chain catering industry-China Chain Operation Association.pdf")
out_path = path.joinpath('002pdf_figures.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[7]
    # Render the whole page and save it as a PNG
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Crop each figure by its bounding box and save it separately
    for i, fig in enumerate(page.figures):
        size = fig['width'], fig['height']
        crop = page.crop((fig['x0'], fig['top'], fig['x1'], fig['bottom']))
        img_crop = crop.to_image()
        fig_path = path.joinpath(f'002pdf_figures_{i}.png')
        img_crop.save(fig_path, format='png')
    # Outline words, images and figures on the page image (drawn after the save above)
    im.draw_rects(page.extract_words(), stroke='yellow')
    im.draw_rects(page.images, stroke='blue')
    im.draw_rects(page.figures)
im  # show in notebook
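
A figure's bounding box can extend slightly past the page edge, and depending on the pdfplumber version, page.crop may reject a box that is not fully inside the page. A minimal sketch of clamping the box first follows; clamp_bbox is a hypothetical helper, the boxes use the (x0, top, x1, bottom) order seen in the crop call above, and the A4-sized page box is just sample data.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink bbox so that it lies inside page_bbox.

    Both boxes are (x0, top, x1, bottom) tuples.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


# Sample values: a figure that spills over the left, right and bottom edges
page_box = (0, 0, 595, 842)
print(clamp_bbox((-5, 10, 700, 900), page_box))  # (0, 10, 595, 842)
```

With pdfplumber, the clamped box could then be passed to page.crop, using page.bbox as the page box.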

Summary

This article introduced the common usage scenarios for PDF and the three main Python modules for processing it.

A final note: the PDF specification originates from Adobe.

Usually there is no need to consult the specification, but in complex scenarios, especially when a module has no direct support, the only option is to read it. The document is public; search for the keyword pdf_reference_1-7.pdf.



Posted by eric1235711 on Fri, 20 May 2022 17:28:24 +0300