Robust PDF parsing
I’ve ported Didier Stevens’s pdf-parser.py script to C++. The problem is that the parser doesn’t handle PDF files that are malformed but that Adobe Reader X still loads. I found a collection of such files here: http://code.google.com/p/corkami/wiki/PDFTricks (although it appears that some of the files there no longer load in Reader X).
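To give one concrete example of the kind of leniency involved: the implementation notes in Adobe's PDF Reference say that Acrobat viewers only require the %PDF- header to appear somewhere within the first 1024 bytes of the file, so a strict parser that expects it at offset 0 will reject files that Reader opens. Below is a minimal C++ sketch of a tolerant header search. The function name find_pdf_header is my own invention for illustration (it is not from pdf-parser.py), and I'm assuming the whole marker has to fit inside the 1024-byte window.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper, not part of pdf-parser.py.
// Returns the byte offset of the "%PDF-" marker if it occurs within the
// first 1024 bytes of the data, or std::nullopt if it does not.
std::optional<std::size_t> find_pdf_header(std::string_view data) {
    constexpr std::string_view marker = "%PDF-";
    constexpr std::size_t window = 1024;  // assumption: marker must fit in here

    // Restrict the search to the leading window of the file.
    std::string_view head = data.substr(0, std::min(data.size(), window));

    std::size_t pos = head.find(marker);
    if (pos == std::string_view::npos) {
        return std::nullopt;
    }
    return pos;
}
```

A parser could call this on the start of the file buffer and begin tokenizing at the returned offset rather than failing when the first bytes are not %PDF-.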
If anyone knows of an open source PDF parser that handles these documents, please let me know. I would like to see how it performs the parsing. So far, neither SumatraPDF nor pdfminer handles them.
(You can also get a lot of PE tricks here: http://code.google.com/p/corkami/downloads/list?can=1&q=Binary+corpus)
You can find Didier’s original code here: http://blog.didierstevens.com/programs/pdf-tools/