51工具盒子

[Essay] Easily Extract, Merge, and Compress PDF Files with Python

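Working with PDF files is a routine task in day-to-day work and projects. For example, you may need to extract specific pages from a large PDF, merge pages based on their content, or compress the resulting files to save storage space. In this post I present an automation script written in Python that extracts text from a PDF, merges the pages that share the same content key, and compresses the generated files.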

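This post shows how to batch-process a PDF file in three steps:

  • Extract the second line of text from every page.
  • Merge the pages that share the same content into a single PDF.
  • Compress the final PDF files.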

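Required Tools {#所需工具}

Before starting, we need to install a few Python libraries:

  • pdfplumber: extracts text content from PDF files.
  • PyPDF2: manipulates PDF files (merging, splitting, and so on).
  • PyMuPDF (imported as fitz): compresses PDF files to reduce their size.

Install them with the following command: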

$ pip install pdfplumber PyPDF2 pymupdf
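
Before running the script, it can help to confirm that the three packages are importable. The following is a minimal sketch that uses only the standard library; `missing_modules` is a hypothetical helper name introduced here for illustration, and the import names checked (`pdfplumber`, `PyPDF2`, `fitz`) are the ones the script imports.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the script (PyMuPDF is imported as "fitz").
    missing = missing_modules(["pdfplumber", "PyPDF2", "fitz"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

If anything is reported as missing, rerunning the pip command above in the same Python environment is the usual fix.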

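Script Overview {#脚本概述}

In our example, the script first reads the specified PDF file and extracts the second line of text from every page. It takes the first whitespace-separated word of that line as a key and merges all pages that share the same key into a new PDF file. Finally, each generated PDF is compressed to reduce its size.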

import os
import pdfplumber
import fitz  # PyMuPDF
from PyPDF2 import PdfReader, PdfWriter

def extract_and_merge_pages(pdf_path):
    print("Processing the PDF file, please wait...")

    # Set up the main output folder
    output_dir = f"{os.path.splitext(os.path.basename(pdf_path))[0]}_merged_pages"
    os.makedirs(output_dir, exist_ok=True)

    # Dictionary mapping each content key to its page numbers
    content_page_map = {}

    # Step 1: extract the second line of every page and group the pages by it
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                lines = text.split('\n')
                # Make sure the page has a second line
                if len(lines) > 1:
                    second_line = lines[1]

                    # Take the part before the first space; split() returns an
                    # empty list for a blank line, so skip such pages
                    parts = second_line.split()
                    if not parts:
                        continue
                    content_key = parts[0]
                    # Group page numbers by content key
                    content_page_map.setdefault(content_key, []).append(page_num)

    # Step 2: merge the pages of each content key and save them as one PDF
    reader = PdfReader(pdf_path)
    for content_key, pages in content_page_map.items():
        writer = PdfWriter()
        for page_num in pages:
            writer.add_page(reader.pages[page_num])

        # Path of the temporary (uncompressed) PDF file
        temp_pdf_path = os.path.join(output_dir, f"{content_key}_temp.pdf")
        with open(temp_pdf_path, "wb") as temp_pdf_file:
            writer.write(temp_pdf_file)

        # Compress the PDF file
        compressed_pdf_path = os.path.join(output_dir, f"{content_key}.pdf")
        compress_pdf(temp_pdf_path, compressed_pdf_path)

        # Delete the temporary uncompressed PDF file
        os.remove(temp_pdf_path)
        print(f"Saved and compressed merged file: {compressed_pdf_path}")

    print("Done. All files were saved to:", output_dir)




def compress_pdf(input_path, output_path):
    # Read and compress the PDF with PyMuPDF
    doc = fitz.open(input_path)
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()

# Example usage
if __name__ == "__main__":
    print("Enter the path to the PDF file, using forward slashes (/) as the path separator, e.g. C:/path/to/file.pdf")
    pdf_path = input("Path: ").strip()
    extract_and_merge_pages(pdf_path)
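
A path pasted into the prompt often arrives wrapped in quotes or padded with spaces. The sketch below shows one way to tidy it before handing it to `extract_and_merge_pages`; `normalize_pdf_path` is a hypothetical helper that is not part of the original script, and it deliberately checks only the file extension, not whether the file exists.

```python
from pathlib import Path


def normalize_pdf_path(raw):
    """Clean up a path typed or pasted by the user (hypothetical helper)."""
    # Remove surrounding whitespace and the quotes some shells or file managers add.
    cleaned = raw.strip().strip('"').strip("'")
    path = Path(cleaned).expanduser()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF path: {cleaned!r}")
    return path
```

It could be used as `extract_and_merge_pages(str(normalize_pdf_path(input("Path: "))))`; adding a `path.is_file()` check would be a reasonable next step.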

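Code Walkthrough {#代码解析}

1. Extracting the PDF Content {#1- 提取 -PDF- 内容}

We use pdfplumber to extract the text of the PDF. The loop visits every page, takes the second line of its text, and uses the part of that line before the first space as the page's identifier.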

with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages):
        text = page.extract_text()
        if text:
            lines = text.split('\n')
            if len(lines) > 1:
                second_line = lines[1]
                # split() returns an empty list for a blank line, so guard the index
                parts = second_line.split()
                if parts:
                    content_key = parts[0]
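
The key-extraction rule can be tried on plain strings without opening a PDF. The following is a minimal sketch: `second_line_key` is a hypothetical helper name, and the sample text is made up to stand in for what `page.extract_text()` might return. It returns `None` when the second line is missing or blank.

```python
def second_line_key(text):
    """Return the first whitespace-separated token of the second line, or None."""
    if not text:
        return None
    lines = text.split("\n")
    if len(lines) < 2:
        return None
    tokens = lines[1].split()
    return tokens[0] if tokens else None


# Made-up page text standing in for the result of page.extract_text()
sample = "Monthly Statement\nACME-001 January 2024\nTotal: 42"
print(second_line_key(sample))  # ACME-001
```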

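2. Merging Pages by Content {#2- 根据内容合并页面}

Using content_key (the first word of the extracted second line), we group the numbers of the pages that share the same key. We then use PdfWriter from PyPDF2 to combine each group of pages into a new PDF file.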

reader = PdfReader(pdf_path)
for content_key, pages in content_page_map.items():
    writer = PdfWriter()
    for page_num in pages:
        writer.add_page(reader.pages[page_num])
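
The merge loop relies on a mapping from each key to its page numbers. The sketch below isolates that grouping step on a made-up list of per-page keys; `group_pages_by_key` is a hypothetical name, and `None` is assumed to mark a page for which no key could be extracted. Page numbers are zero-based, which matches both `enumerate` and indexing into `reader.pages`.

```python
def group_pages_by_key(keys):
    """Map each key to the list of zero-based page indices that carry it.

    `keys` holds one entry per page; None marks a page without a usable key.
    """
    groups = {}
    for page_num, key in enumerate(keys):
        if key is None:
            continue
        groups.setdefault(key, []).append(page_num)
    return groups


# Made-up keys for a six-page document
page_keys = ["A100", "B200", "A100", None, "B200", "A100"]
print(group_pages_by_key(page_keys))
# {'A100': [0, 2, 5], 'B200': [1, 4]}
```

Because pages are appended in reading order, each merged file keeps its pages in the same order as the source document.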

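3. Compressing the PDF Files {#3- 压缩 -PDF- 文件}

Compression is handled by PyMuPDF (fitz). We open each temporary PDF that was generated and save it with garbage=4, which removes unused objects and merges duplicate ones, and deflate=True, which compresses uncompressed streams.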

doc = fitz.open(input_path)
doc.save(output_path, garbage=4, deflate=True)
doc.close()
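
To see how much the save options actually help for a given file, the sizes before and after can be compared. This is a minimal sketch using only the standard library; `size_reduction` is a hypothetical helper, not part of the original script.

```python
import os


def size_reduction(original_path, compressed_path):
    """Return (bytes_before, bytes_after, percent_saved) for two files."""
    before = os.path.getsize(original_path)
    after = os.path.getsize(compressed_path)
    # Avoid dividing by zero for an empty original file.
    saved = 0.0 if before == 0 else (before - after) / before * 100
    return before, after, saved
```

In the script it would have to be called before the temporary file is deleted, for example `size_reduction(temp_pdf_path, compressed_pdf_path)`. For PDFs whose streams are already compressed, the saving may be small.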

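Usage Notes {#使用说明}

  1. Path input: enter the full path to the PDF file, using forward slashes (/) or double backslashes (\\) as the separator. Because input() reads the text literally, a single backslash also works at the prompt; doubling is only required for paths written as string literals in code.
  2. Automated processing: the script automatically merges the pages that share the same content key and produces the compressed PDF files.
  3. Output: all merged and compressed files are saved in a single folder named after the input file with the suffix _merged_pages, created in the current working directory. Each file is named after its content key.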
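
Since the output file names come straight from the page text, a key containing characters such as `/` or `:` would produce an invalid or unintended path. The following is a minimal sketch of one way to guard against that; `safe_filename` and its `fallback` parameter are hypothetical and not part of the original script.

```python
import re


def safe_filename(key, fallback="untitled"):
    """Replace characters that are invalid in Windows or POSIX file names."""
    # < > : " / \ | ? * and ASCII control characters are replaced with "_".
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", key).strip(" .")
    return cleaned or fallback


print(safe_filename("INV/2024:001"))  # INV_2024_001
```

In the script, the cleaned key would be used only when building the output paths, for example `f"{safe_filename(content_key)}.pdf"`, while the original key would still be used for grouping. Two different keys can map to the same cleaned name, so this is a starting point rather than a complete solution.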

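Summary {#总结}

This post showed how to automate PDF processing with a Python script: extracting a specific line of text from each page, merging the pages that share the same content key, and compressing the final files. The approach is useful when handling PDFs in bulk, where it can save both time and storage space.

I hope this post was helpful. If you have any questions or suggestions, feel free to leave a comment.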
