---
name: pdf
description: Process PDF files - extract text, create PDFs, merge documents. Use when user asks to read PDF, create PDF, or work with PDF files.
---

# PDF Processing Skill

You now have expertise in PDF manipulation. Follow these workflows:

## Reading PDFs

**Option 1: Quick text extraction (preferred)**
```bash
# Using pdftotext (poppler-utils)
pdftotext input.pdf -            # Output to stdout
pdftotext input.pdf output.txt   # Output to file
```

**Option 2: Page-by-page with metadata (Apache PDFBox)**
```java
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;

// PDFBox 3.x loads documents via Loader.loadPDF (PDDocument.load is the 2.x API).
// try-with-resources closes the document even if extraction throws.
try (PDDocument doc = Loader.loadPDF(new File("input.pdf"))) {
    System.out.println("Pages: " + doc.getNumberOfPages());
    System.out.println("Title: " + doc.getDocumentInformation().getTitle());

    PDFTextStripper stripper = new PDFTextStripper();
    // Page numbers are 1-based; restrict the stripper to one page per pass.
    for (int i = 1; i <= doc.getNumberOfPages(); i++) {
        stripper.setStartPage(i);
        stripper.setEndPage(i);
        String text = stripper.getText(doc);
        System.out.println("--- Page " + i + " ---");
        System.out.println(text);
    }
}
```

## Creating PDFs

**Option 1: From Markdown (recommended)**
```bash
# Using pandoc
pandoc input.md -o output.pdf

# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
```

**Option 2: Programmatically (Apache PDFBox)**
```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;

try (PDDocument doc = new PDDocument()) {
    PDPage page = new PDPage();
    doc.addPage(page);

    // The content stream must be closed before the document is saved.
    try (PDPageContentStream content = new PDPageContentStream(doc, page)) {
        content.beginText();
        // PDFBox 3.x builds the standard 14 fonts by name
        // (PDType1Font.HELVETICA is the 2.x constant).
        content.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 12);
        content.newLineAtOffset(100, 750);
        content.showText("Hello, PDF!");
        content.endText();
    }
    doc.save("output.pdf");
}
```

**Option 3: From HTML (OpenHTMLToPDF)**
```java
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

Path input = Path.of("input.html");
String html = Files.readString(input);
try (OutputStream os = new FileOutputStream("output.pdf")) {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    // The second argument is the base URI used to resolve relative links,
    // images, and stylesheets referenced by the HTML.
    builder.withHtmlContent(html, input.toUri().toString());
    builder.toStream(os);
    builder.run();
}
```

## Merging PDFs

```java
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import java.io.File;

PDFMergerUtility merger = new PDFMergerUtility();
merger.addSource(new File("file1.pdf"));
merger.addSource(new File("file2.pdf"));
merger.addSource(new File("file3.pdf"));
merger.setDestinationFileName("merged.pdf");
merger.mergeDocuments(null);
```

## Splitting PDFs

```java
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import java.io.File;
import java.util.List;

// PDFBox 3.x loads documents via Loader.loadPDF (PDDocument.load is the 2.x API).
try (PDDocument doc = Loader.loadPDF(new File("input.pdf"))) {
    Splitter splitter = new Splitter();
    splitter.setSplitAtPage(1); // Split every page into its own file
    List<PDDocument> pages = splitter.split(doc);

    for (int i = 0; i < pages.size(); i++) {
        // Close each single-page document once it has been saved.
        try (PDDocument single = pages.get(i)) {
            single.save("page_" + (i + 1) + ".pdf");
        }
    }
}
```

## Key Libraries

| Task | Library | Maven Dependency / Install Command |
|------|---------|------------------------------------|
| Read/Write/Merge/Split | Apache PDFBox | `org.apache.pdfbox:pdfbox:3.0.x` |
| Create from HTML | OpenHTMLToPDF | `com.openhtmltopdf:openhtmltopdf-pdfbox:1.0.x` |
| Advanced layout | iText | `com.itextpdf:itext7-core:8.0.x` |
| Text extraction | pdftotext | `brew install poppler` / `apt install poppler-utils` |

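Because `pdftotext` and `pandoc` are external binaries rather than Maven dependencies, they may be missing on a given machine. The sketch below shows one way to check for them from Java before shelling out. `ToolCheck` and `isOnPath` are hypothetical names introduced here for illustration, and the lookup assumes a Unix-like system where the executable has no file extension.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

// Hypothetical helper; not part of PDFBox, poppler, or pandoc.
final class ToolCheck {

    /** Returns true if an executable file named `tool` exists in a PATH directory. */
    static boolean isOnPath(String tool) {
        String path = System.getenv("PATH");
        if (path == null || path.isEmpty()) {
            return false;
        }
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Path.of(dir, tool);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return true;
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries instead of failing the whole check.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String tool : new String[] {"pdftotext", "pandoc"}) {
            System.out.println(tool + (isOnPath(tool) ? ": found" : ": missing"));
        }
    }
}
```

If a tool is reported missing, fall back to the PDFBox-based options or install it with the commands listed in the table.
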
## Best Practices

1. **Always check if tools are installed** before using them
2. **Handle encoding issues**: PDFs may contain various character encodings
3. **Large PDFs**: Process page by page to avoid memory issues; use try-with-resources to ensure resources are released
4. **OCR for scanned PDFs**: Use Tess4J (`net.sourceforge.tess4j:tess4j`) if text extraction returns empty
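
As a rough illustration of practices 2 and 4, the sketch below cleans up extracted text and flags output that probably needs OCR. It uses only the JDK. `TextCleanup`, `normalize`, and `looksScanned` are hypothetical names, and treating blank output as "scanned" is a simplifying assumption, not a rule from any of the libraries above.

```java
import java.text.Normalizer;

// Hypothetical helper for post-processing text returned by an extractor.
final class TextCleanup {

    /** Folds compatibility characters and strips stray control characters. */
    static String normalize(String extracted) {
        // NFKC maps compatibility forms to plain ones, e.g. the "fi" ligature U+FB01 becomes "fi".
        String folded = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        // Remove control characters but keep tab, newline, and carriage return.
        return folded.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    /** Heuristic: no usable text suggests an image-only (scanned) PDF. */
    static boolean looksScanned(String extracted) {
        return extracted == null || extracted.isBlank();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\uFB01le\u0000")); // prints "file"
        System.out.println(looksScanned("  \n"));         // prints "true"
    }
}
```

When `looksScanned` returns true for a document, that is the point at which an OCR pass is worth trying.
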