Monday, February 16, 2015

Get the metadata of PDF files in Python

Often we would like to retrieve the metadata stored in a given PDF. Here I show you two ways to do that: using pyPdf and using pdfminer.

1) pyPdf
Install pyPdf using pip, then use this code:
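from pyPdf import PdfFileReader

# Open the file in binary mode and hand it to the reader
pdf_toread = PdfFileReader(open("doc2.pdf", "rb"))
# The document information dictionary (title, author, producer, ...)
pdf_info = pdf_toread.getDocumentInfo()
print(str(pdf_info))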
You might not get all the metadata you want this way. In my case, for instance, I was looking for the number of pages. If you check the methods available on pdf_toread, you can find the right one. For example:
print(pdf_toread.getNumPages())
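
Dates in the info dictionary (for example under the /CreationDate key, when the file has one) usually come back as raw PDF date strings such as "D:20150216103000+01'00'". Here is a minimal sketch of turning such a string into a datetime. The function name parse_pdf_date is my own and not part of pyPdf, and the sketch deliberately ignores the time zone suffix.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Turn a PDF date string such as "D:20150216103000+01'00'" into a datetime.

    The time zone suffix is ignored. Returns None if the string does not
    start with a full D:YYYYMMDDHHmmSS timestamp.
    """
    text = str(value)
    # Drop the optional "D:" prefix used by PDF date strings
    if text.startswith("D:"):
        text = text[2:]
    try:
        # Only the first 14 characters (date and time) are parsed
        return datetime.strptime(text[:14], "%Y%m%d%H%M%S")
    except ValueError:
        return None
```

With the pyPdf example above, this could be called as parse_pdf_date(pdf_info.get('/CreationDate', '')), which gives None when the PDF does not store a creation date.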


2) pdfminer 
Install pdfminer using pip and then:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# Open the PDF in binary mode
fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)

print(doc.info)  # The "Info" metadata, as a list of dictionaries
fp.close()
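
Since doc.info is a list of dictionaries, and the values may come back as raw byte strings depending on the pdfminer and Python version, a small helper can make the result easier to read. This is only a sketch: decode_info is a name I made up, and decoding as UTF-8 is an assumption (PDF text strings can use other encodings), so undecodable bytes are replaced rather than raising an error.

```python
def decode_info(info_list, encoding="utf-8"):
    """Merge a list of Info dictionaries into one dict with decoded strings."""
    merged = {}
    for info in info_list:
        for key, value in info.items():
            if isinstance(value, bytes):
                # Assumption: the bytes are UTF-8 compatible; invalid bytes are replaced
                value = value.decode(encoding, errors="replace")
            # Non-bytes values (numbers, object references, ...) are kept as they are
            merged[key] = value
    return merged
```

With the example above it would be used as decode_info(doc.info). If the same key appears in more than one dictionary, the later one wins.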


