| name | description |
|---|---|
| Process PDF files - extract text, create PDFs, merge documents. Use when user asks to read PDF, create PDF, or work with PDF files. |
PDF Processing Skill
You now have expertise in PDF manipulation. Follow these workflows:
Reading PDFs
Option 1: Quick text extraction (preferred)
# Using pdftotext (poppler-utils)
pdftotext input.pdf - # Output to stdout
pdftotext input.pdf output.txt # Output to file
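When the extraction has to be driven from Java rather than a shell, the same tool can be launched as a subprocess. The following is a minimal sketch, assuming `pdftotext` is on the `PATH`; the class name `PdfToTextRunner` and its methods are hypothetical, while `-f`, `-l`, and `-enc` are standard `pdftotext` options for first page, last page, and output encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper (not part of any library): wraps the pdftotext CLI.
final class PdfToTextRunner {

    /** Builds the argument list for extracting pages first..last (1-based, inclusive) to stdout. */
    static List<String> command(String pdfPath, int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException("invalid page range: " + firstPage + "-" + lastPage);
        }
        return List.of("pdftotext",
                "-f", Integer.toString(firstPage),
                "-l", Integer.toString(lastPage),
                "-enc", "UTF-8",
                pdfPath, "-"); // "-" sends the text to stdout
    }

    /** Runs pdftotext and returns its stdout; throws if the tool exits with a non-zero status. */
    static String extract(String pdfPath, int firstPage, int lastPage)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath, firstPage, lastPage))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // surface tool errors on our stderr
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit);
        }
        return text;
    }
}
```

Calling `PdfToTextRunner.extract("input.pdf", 1, 3)` would return the text of the first three pages, which keeps memory use bounded for large files.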
Option 2: Page-by-page with metadata (Apache PDFBox)
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
// PDFBox 3.x: Loader.loadPDF replaces the removed PDDocument.load
try (PDDocument doc = Loader.loadPDF(new File("input.pdf"))) {
    System.out.println("Pages: " + doc.getNumberOfPages());
    System.out.println("Title: " + doc.getDocumentInformation().getTitle());
    PDFTextStripper stripper = new PDFTextStripper();
    for (int i = 1; i <= doc.getNumberOfPages(); i++) {
        // Restrict extraction to a single page (page numbers are 1-based)
        stripper.setStartPage(i);
        stripper.setEndPage(i);
        String text = stripper.getText(doc);
        System.out.println("--- Page " + i + " ---");
        System.out.println(text);
    }
}
Creating PDFs
Option 1: From Markdown (recommended)
# Using pandoc
pandoc input.md -o output.pdf
# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
Option 2: Programmatically (Apache PDFBox)
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts;
try (PDDocument doc = new PDDocument()) {
    PDPage page = new PDPage();
    doc.addPage(page);
    try (PDPageContentStream content = new PDPageContentStream(doc, page)) {
        content.beginText();
        // PDFBox 3.x: the static PDType1Font.HELVETICA constant was removed
        content.setFont(new PDType1Font(Standard14Fonts.FontName.HELVETICA), 12);
        content.newLineAtOffset(100, 750);
        content.showText("Hello, PDF!");
        content.endText();
    }
    doc.save("output.pdf");
}
Option 3: From HTML (OpenHTMLToPDF)
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
// OpenHTMLToPDF expects well-formed XML (XHTML) input
String html = Files.readString(Path.of("input.html"));
// Base URI used to resolve relative links (images, stylesheets) in the HTML
String baseUri = new File("input.html").toURI().toString();
try (OutputStream os = new FileOutputStream("output.pdf")) {
    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.withHtmlContent(html, baseUri);
    builder.toStream(os);
    builder.run();
}
Merging PDFs
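import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import java.io.File;
PDFMergerUtility merger = new PDFMergerUtility();
// Sources are merged in the order they are added
merger.addSource(new File("file1.pdf"));
merger.addSource(new File("file2.pdf"));
merger.addSource(new File("file3.pdf"));
merger.setDestinationFileName("merged.pdf");
// PDFBox 3.x takes a stream-cache factory; this one keeps buffers in memory
merger.mergeDocuments(IOUtils.createMemoryOnlyStreamCache());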
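The merge order follows the order of the `addSource` calls, so when the inputs come from a directory rather than a fixed list it helps to collect them deterministically first. This is a minimal sketch using only the JDK; the class `MergeInputs` and the method `pdfsIn` are hypothetical names, and sorting by file name is an assumption about the desired order.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: gathers the PDFs of one directory in a stable order.
final class MergeInputs {

    /** Returns the regular files in dir whose name ends in ".pdf" (any case), sorted by file name. */
    static List<Path> pdfsIn(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) { // the stream holds a directory handle, so close it
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }
}
```

Each returned path can then be passed to the merger with `merger.addSource(path.toFile())`.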
Splitting PDFs
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import java.io.File;
import java.util.List;
// PDFBox 3.x: Loader.loadPDF replaces the removed PDDocument.load
try (PDDocument doc = Loader.loadPDF(new File("input.pdf"))) {
    Splitter splitter = new Splitter();
    splitter.setSplitAtPage(1); // split into one file per page
    List<PDDocument> pages = splitter.split(doc);
    for (int i = 0; i < pages.size(); i++) {
        pages.get(i).save("page_" + (i + 1) + ".pdf");
        pages.get(i).close();
    }
}
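`setSplitAtPage(n)` produces documents of `n` pages each, with a shorter final document when the page count is not a multiple of `n`. The arithmetic can be sketched in plain Java without PDFBox; the class `SplitPlan` and the output naming scheme are hypothetical and only illustrate which pages land in which file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the page ranges a fixed-size split would produce.
final class SplitPlan {

    /** Returns 1-based inclusive {first, last} page ranges for chunks of pagesPerFile pages. */
    static List<int[]> ranges(int totalPages, int pagesPerFile) {
        if (totalPages < 0 || pagesPerFile < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and pagesPerFile >= 1");
        }
        List<int[]> result = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += pagesPerFile) {
            // The last chunk may be shorter than pagesPerFile
            int last = Math.min(first + pagesPerFile - 1, totalPages);
            result.add(new int[] {first, last});
        }
        return result;
    }

    public static void main(String[] args) {
        // A 5-page document split every 2 pages yields 3 files
        for (int[] range : ranges(5, 2)) {
            System.out.println("pages_" + range[0] + "-" + range[1] + ".pdf");
        }
        // prints:
        // pages_1-2.pdf
        // pages_3-4.pdf
        // pages_5-5.pdf
    }
}
```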
Key Libraries
| Task | Library | Maven Dependency or Install Command |
|---|---|---|
| Read/Write/Merge/Split | Apache PDFBox | org.apache.pdfbox:pdfbox:3.0.x |
| Create from HTML | OpenHTMLToPDF | com.openhtmltopdf:openhtmltopdf-pdfbox:1.0.x |
| Advanced layout | iText | com.itextpdf:itext7-core:8.0.x |
| Text extraction | pdftotext | brew install poppler / apt install poppler-utils |
Best Practices
- Always check if tools are installed before using them
- Handle encoding issues - PDFs may contain various character encodings
- Large PDFs: Process page by page to avoid memory issues; use try-with-resources to ensure resources are released
- OCR for scanned PDFs: Use Tess4J (net.sourceforge.tess4j:tess4j) if text extraction returns empty text
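Checking that a command-line tool is installed, the first practice in the list above, can be done from Java by searching the directories listed in `PATH`. This is a minimal sketch using only the JDK; the class `ToolCheck` and the method `findOnPath` are hypothetical names, and the Windows suffix list is an assumption rather than a full reading of `PATHEXT`.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: looks up an executable the way a shell would.
final class ToolCheck {

    /** Returns the first executable named tool found on PATH, or empty if there is none. */
    static Optional<Path> findOnPath(String tool) {
        String pathVar = System.getenv("PATH");
        if (pathVar == null || pathVar.isEmpty()) {
            return Optional.empty();
        }
        boolean windows = System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT).startsWith("windows");
        // Assumed suffixes; a complete implementation would read PATHEXT on Windows
        String[] suffixes = windows ? new String[] {"", ".exe", ".cmd", ".bat"} : new String[] {""};
        for (String dir : pathVar.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String suffix : suffixes) {
                try {
                    Path candidate = Path.of(dir, tool + suffix);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException ignored) {
                    // Skip malformed PATH entries instead of failing the whole lookup
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String tool : new String[] {"pdftotext", "pandoc"}) {
            System.out.println(tool + ": "
                    + findOnPath(tool).map(Path::toString).orElse("not installed"));
        }
    }
}
```

When a tool is missing, the workflow can fall back to the PDFBox-based options or report the install command from the Key Libraries table.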