Robust PDF parsing

I’ve ported Didier Stevens’s pdf-parser.py script to C++. The problem is that the parser doesn’t handle PDF files that are malformed but still load in Adobe Reader X. I found a collection of such files here: http://code.google.com/p/corkami/wiki/PDFTricks (some of the files there no longer appear to load in Reader X, though).

 

If anyone knows of an open source PDF parser that handles these documents, please let me know. I would like to see how it performs the parsing. So far, SumatraPDF and PDFMiner do not handle these documents.

 

(You can also get a lot of PE tricks here: http://code.google.com/p/corkami/downloads/list?can=1&q=Binary+corpus)

You can find Didier’s original code here: http://blog.didierstevens.com/programs/pdf-tools/

~ by ra1ndog on November 14, 2011.
