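In day-to-day work or projects, a Python automation script can take over repetitive PDF handling: it can extract content from each page, merge related pages, and compress the result.
In this post, I will show how to batch-process a PDF file in the following steps:
- Extract the second line of each page.
- Merge pages that share the same content into a single PDF.
- Compress the resulting PDF files.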
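Required tools {#所需工具}
Before implementing this task, we need to install a few Python libraries:
- pdfplumber: extracts text content from PDF files.
- PyPDF2: manipulates PDF files (merging, splitting, and so on).
- PyMuPDF (also known as fitz): compresses PDF files to reduce their size.
The required libraries can be installed with the following command:
$ pip install pdfplumber PyPDF2 pymupdf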
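Script overview {#脚本概述}
In our example, the script first reads the specified PDF file and extracts the second line of each page. Pages whose second line starts with the same word (the first whitespace-delimited token) are merged into a new PDF file. Finally, each generated PDF file is compressed to reduce its size.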
import os
import pdfplumber
import fitz  # PyMuPDF
from PyPDF2 import PdfReader, PdfWriter


def extract_and_merge_pages(pdf_path):
    print("Processing the PDF file, please wait...")
    # Create the main output folder, named after the input file
    output_dir = f"{os.path.splitext(os.path.basename(pdf_path))[0]}_merged_pages"
    os.makedirs(output_dir, exist_ok=True)
    # Map each content key to the page numbers that share it
    content_page_map = {}
    # Step 1: extract the second line of each page and group the pages
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                lines = text.split('\n')
                # Make sure the page has a non-blank second line
                if len(lines) > 1 and lines[1].strip():
                    second_line = lines[1]
                    # Take the part before the first whitespace
                    content_key = second_line.split()[0]
                    # Group page numbers by content key
                    if content_key not in content_page_map:
                        content_page_map[content_key] = []
                    content_page_map[content_key].append(page_num)
    # Step 2: merge the pages of each content key and save them as one PDF
    reader = PdfReader(pdf_path)
    for content_key, pages in content_page_map.items():
        writer = PdfWriter()
        for page_num in pages:
            writer.add_page(reader.pages[page_num])
        # Path of the temporary, uncompressed PDF file
        temp_pdf_path = os.path.join(output_dir, f"{content_key}_temp.pdf")
        with open(temp_pdf_path, "wb") as temp_pdf_file:
            writer.write(temp_pdf_file)
        # Compress the PDF file
        compressed_pdf_path = os.path.join(output_dir, f"{content_key}.pdf")
        compress_pdf(temp_pdf_path, compressed_pdf_path)
        # Delete the temporary uncompressed PDF file
        os.remove(temp_pdf_path)
        print(f"Saved and compressed merged file: {compressed_pdf_path}")
    print("Done. All files were saved to:", output_dir)


def compress_pdf(input_path, output_path):
    # Read and compress the PDF with PyMuPDF
    doc = fitz.open(input_path)
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()


# Example usage
print("Enter the path to the PDF file, using forward slashes (/) as the path separator. For example: C:/path/to/file.pdf")
pdf_path = input("Path: ")
extract_and_merge_pages(pdf_path)
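The script uses the content key directly as the output file name, so a key containing characters such as `/` or `:` could produce an invalid path on some systems. Below is a minimal sketch of one way to guard against that. The helper name `safe_filename`, the replacement character, and the `fallback` value are all assumptions for illustration and are not part of the original script.

```python
import re


def safe_filename(key: str, fallback: str = "untitled") -> str:
    """Replace characters that are commonly invalid in file names (hypothetical helper)."""
    # Swap path separators, reserved punctuation, and control characters for "_"
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", key)
    # Leading or trailing spaces and dots are problematic on Windows
    cleaned = cleaned.strip(" .")
    # Fall back to a placeholder if nothing usable is left
    return cleaned or fallback
```

With such a helper, the output path could be built as `os.path.join(output_dir, f"{safe_filename(content_key)}.pdf")`. Note that two different keys may map to the same cleaned name, so this is only a starting point.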
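Code walkthrough {#代码解析}
1. Extracting PDF content {#1- 提取 -PDF- 内容}
We use pdfplumber to extract the text content of the PDF. The script iterates over the pages, takes the second line of text on each page, and uses everything before the first space in that line as the page's identifier.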
with pdfplumber.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages):
        text = page.extract_text()
        if text:
            lines = text.split('\n')
            # Skip pages without a non-blank second line
            if len(lines) > 1 and lines[1].strip():
                second_line = lines[1]
                content_key = second_line.split()[0]
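The key extraction can also be factored into a small pure function, which makes it easy to test without opening a real PDF. The sketch below is one possible shape. The function name `extract_content_key` is an assumption, and it treats a missing or whitespace-only second line as "no key" rather than raising an error.

```python
from typing import Optional


def extract_content_key(text: Optional[str]) -> Optional[str]:
    """Return the first whitespace-delimited token of the second line, or None."""
    if not text:
        # Pages without extractable text (for example scanned images) have no key
        return None
    lines = text.split("\n")
    if len(lines) < 2:
        return None
    # split() with no arguments returns [] for a blank line, so check before indexing
    tokens = lines[1].split()
    return tokens[0] if tokens else None
```

Inside the page loop, this would be called as `extract_content_key(page.extract_text())`, and pages for which it returns `None` would simply be skipped.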
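2. Merging pages by content {#2- 根据内容合并页面}
Using content_key (the first word of the extracted second line), we group the page numbers of pages with the same content. PyPDF2's PdfWriter then merges each group into a new PDF file.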
reader = PdfReader(pdf_path)
for content_key, pages in content_page_map.items():
    writer = PdfWriter()
    for page_num in pages:
        writer.add_page(reader.pages[page_num])
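The grouping step that feeds this loop can be illustrated in isolation. The following sketch uses `collections.defaultdict` instead of the explicit membership check. The function name `group_pages_by_key` and the convention that `None` marks a page without a key are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_pages_by_key(keys: List[Optional[str]]) -> Dict[str, List[int]]:
    """Group zero-based page numbers by their content key, skipping pages without one."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for page_num, key in enumerate(keys):
        if key is not None:
            groups[key].append(page_num)
    # Return a plain dict; keys keep the order in which they first appeared
    return dict(groups)


# Pages 0 and 3 share key "A", page 1 has key "B", page 2 has no key
print(group_pages_by_key(["A", "B", None, "A"]))
```

Each list of page numbers can then be passed to the `PdfWriter` loop, so the pages within one output file keep their original order.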
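3. Compressing the PDF files {#3- 压缩 -PDF- 文件}
Compression is handled by PyMuPDF (fitz). We open each generated temporary PDF file and save it with garbage=4, which cleans up redundant data such as unused and duplicate objects, and deflate=True, which compresses uncompressed streams.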
doc = fitz.open(input_path)
doc.save(output_path, garbage=4, deflate=True)
doc.close()
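How much space this saves depends on the input, and files that are already well compressed may barely shrink. A small check like the sketch below can report the effect. It uses only the standard library, and the function name `size_reduction_percent` is an assumption for illustration.

```python
import os


def size_reduction_percent(original_path: str, compressed_path: str) -> float:
    """Return how much smaller the compressed file is, as a percentage of the original."""
    original = os.path.getsize(original_path)
    compressed = os.path.getsize(compressed_path)
    if original == 0:
        # Avoid dividing by zero for an empty input file
        return 0.0
    # A negative result means the output is larger than the input
    return (original - compressed) / original * 100
```

In the script, such a check would have to run after `compress_pdf(...)` and before `os.remove(temp_pdf_path)`, since the temporary file is deleted afterwards.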
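Usage notes {#使用说明}
- Path input: enter the full path to the PDF file, using forward slashes (/) or double backslashes (\\) as the path separator. Because input() does not interpret escape sequences, single backslashes also work when typing the path at the prompt.
- Automated processing: the script merges pages with the same content and writes the compressed PDF files to a folder named after the input file, with the suffix _merged_pages, in the current working directory.
- Output: all merged and compressed files are saved in that single folder, each named after its content identifier.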
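Paths pasted from a file manager often arrive wrapped in quotes or with backslashes. As a rough sketch, the input could be cleaned up before it is passed to the script. The helper name `normalize_input_path` is an assumption, and it treats every backslash as a separator, which would be wrong for the rare POSIX file name that really contains a backslash.

```python
from pathlib import PureWindowsPath


def normalize_input_path(raw: str) -> str:
    """Strip surrounding whitespace and quotes, then convert backslashes to forward slashes."""
    cleaned = raw.strip().strip('"').strip("'")
    # PureWindowsPath accepts both separators and does not touch the file system
    return PureWindowsPath(cleaned).as_posix()


print(normalize_input_path(r'"C:\path\to\file.pdf"'))
```

The result could then be used as `extract_and_merge_pages(normalize_input_path(input("Path: ")))`.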
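Summary {#总结}
In this post, you have seen how to automate PDF processing with a Python script: extracting specific content from each page, merging pages that share the same content, and compressing the final files. This approach is useful when handling large batches of PDF files, where it can save both time and storage space.
I hope this post was helpful! If you have any questions or suggestions, feel free to leave a comment.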