Note: This article is based on my book, ReportLab: PDF Processing with Python. Code can be found on GitHub.

As you might expect, you can install pdfrw using pip. Let's get that done so we can start using pdfrw:

    python -m pip install pdfrw

Now that we have pdfrw installed, let's learn how to extract some information from our PDFs.

The pdfrw package does not extract data in quite the same way that PyPDF2 does. If you have used PyPDF2 in the past, then you may recall that PyPDF2 lets you extract a document information object that you can use to pull out information like author, title, etc. While pdfrw does let you get the Info object, it displays it in a less friendly way.

Note: I am using the standard W9 form from the IRS for this example.

If you run this against the reportlab-sample.pdf file that I also included in the source code for this article, you will find that the author name that is returned ends up being '' instead of "Michael Driscoll". I haven't figured out exactly why that is, but I am assuming that PyPDF2 does some extra data massaging on the PDF trailer information that pdfrw currently does not do.

You can also use pdfrw to split a PDF up. For example, maybe you want to take the cover off of a book for some reason, or you just want to extract the chapters of a book into multiple PDFs instead of storing them in one file. For this example, we will use my ReportLab book's sample chapter PDF that you can download on Leanpub.

The script, splitter.py, defines the function and then calls it like this:

    def split(path, number_of_pages, output):

    split('reportlab-sample.pdf', 10, 'subset.pdf')

Here we create a function called split that takes an input PDF file path, the number of pages that you want to extract and the output path. Then we open up the file using pdfrw's PdfReader class and grab the total number of pages from the input PDF.
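Since the body of split is not shown above, here is a minimal sketch of how it might look. It keeps the same signature and uses pdfrw's PdfReader and PdfWriter classes. The pages_to_extract helper is a hypothetical name that I am introducing just for illustration, and clamping the request to the total page count is my assumption about how a request for too many pages should be handled.

```python
# splitter.py (sketch; the loop body and the helper are assumptions)


def pages_to_extract(total_pages, number_of_pages):
    """Return the zero-based indexes of the pages to copy.

    Hypothetical helper: it clamps the request so that asking for more
    pages than the PDF contains simply copies every page.
    """
    if number_of_pages < 0:
        raise ValueError('number_of_pages must not be negative')
    return range(min(number_of_pages, total_pages))


def split(path, number_of_pages, output):
    # Imported here so the helper above can be used without pdfrw installed
    from pdfrw import PdfReader, PdfWriter

    # Open the input PDF and grab its total number of pages
    pdf_obj = PdfReader(path)
    total_pages = len(pdf_obj.pages)

    # Copy the first number_of_pages pages into a new document
    writer = PdfWriter()
    for index in pages_to_extract(total_pages, number_of_pages):
        writer.addpage(pdf_obj.pages[index])

    writer.write(output)


# Example call, using the file names from the article:
# split('reportlab-sample.pdf', 10, 'subset.pdf')
```

Keeping the page selection in its own small function makes the edge cases easy to check on their own, such as a PDF with only three pages when you ask for ten.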