Get the metadata of PDF files in Python

Often we would like to retrieve the metadata stored in a given PDF. Here I show you two ways to do that: using pyPdf and pdfminer.

1) pyPdf

Install pyPdf using pip (pip install pyPdf), then use this code:

from pyPdf import PdfFileReader
pdf_toread = PdfFileReader(open("doc2.pdf", "rb"))
pdf_info = pdf_toread.getDocumentInfo()
print str(pdf_info)

You might not get all the metadata you want this way. In my case I was looking for the number of pages, which is not part of the document info. If you check the methods available on pdf_toread, you can find the right one. For example:

print pdf_toread.getNumPages()
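
The dictionary returned by getDocumentInfo() uses raw PDF names such as "/Title" and "/Author" as keys. If you prefer plain keys, something like the following might help. This is only a minimal sketch: clean_info is a hypothetical helper (not part of pyPdf), and the sample dictionary is made up to stand in for real output.

```python
def clean_info(info):
    """Return a plain dict with the leading "/" removed from each key.

    `info` is any mapping shaped like the result of getDocumentInfo(),
    e.g. {"/Title": "...", "/Author": "..."}.
    Hypothetical helper, not part of pyPdf.
    """
    cleaned = {}
    for key, value in info.items():
        # Strip the PDF name marker only when it is present
        name = key[1:] if key.startswith("/") else key
        cleaned[name] = value
    return cleaned


# Made-up sample standing in for pdf_toread.getDocumentInfo()
sample_info = {"/Title": "Example title", "/Author": "Example author"}
cleaned = clean_info(sample_info)
print(cleaned["Title"])
```

With a real file you would pass pdf_info instead of sample_info, and you could store the result of getNumPages() under an extra key of your own choosing.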

2) pdfminer

Install pdfminer using pip (pip install pdfminer), and then:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)

print doc.info  # The "Info" metadata
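
Note that doc.info is a list of dictionaries rather than a single dictionary, and the values can be raw byte strings. The sketch below shows one possible way to merge them into a single dict of text values. merge_info is a hypothetical helper (not part of pdfminer), the sample list is made up, and decoding non-UTF-16 strings as Latin-1 is an assumption that only approximates the encoding PDF uses.

```python
def merge_info(info_list):
    """Merge a list of Info dictionaries into one dict of text values.

    Hypothetical helper, not part of pdfminer. Later dictionaries
    override earlier ones. Byte strings are decoded as UTF-16 when they
    start with a byte order mark, otherwise as Latin-1 (an assumption
    that approximates PDFDocEncoding). Other values are kept unchanged.
    """
    merged = {}
    for info in info_list:
        for key, value in info.items():
            if isinstance(value, bytes):
                if value.startswith(b"\xfe\xff"):
                    # Drop the byte order mark, then decode big-endian UTF-16
                    value = value[2:].decode("utf-16-be")
                else:
                    value = value.decode("latin-1")
            merged[key] = value
    return merged


# Made-up sample shaped like doc.info
sample = [{"Title": b"\xfe\xff\x00P\x00D\x00F", "Producer": b"example"}]
merged = merge_info(sample)
print(merged)
```

With a real file you would call merge_info(doc.info). Values that are not byte strings, such as references to other PDF objects, are passed through untouched.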

Monday, February 16, 2015
