Extract Raw Text from a PDF
Iterate over every page and join the text into one string
Extracting text with pypdf
pypdf is a third-party Python library for reading PDF files. You installed it in the previous chapter.
Its PdfReader class loads a PDF and exposes a .pages list. Each page has an .extract_text() method that returns the page content as a plain string.
To get all the text, loop over reader.pages, call extract_text() on each page, and join the results with "\n". The newline delimiter keeps the end of one page from running directly into the start of the next, so downstream chunking doesn't fuse words across a page boundary.
import pypdf

reader = pypdf.PdfReader("doc.pdf")
pages = [page.extract_text() for page in reader.pages]  # one string per page
text = "\n".join(pages)
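To see why the delimiter matters, the short sketch below joins two page strings with and without a newline. The strings are made up for illustration and only stand in for what extract_text() might return; they are not taken from a real PDF.

```python
# Hypothetical page texts standing in for extract_text() results.
pages = ["The report ends here", "Appendix A begins"]

merged = "".join(pages)       # end of page 1 fuses with start of page 2
separated = "\n".join(pages)  # a newline sits between consecutive pages

print(merged)                 # The report ends hereAppendix A begins
print(separated.split("\n"))  # ['The report ends here', 'Appendix A begins']
```

Extracted page text usually contains its own line breaks, so a single newline separates pages but does not uniquely mark where one ends. If a later step has to recover the exact pages, keeping the list of page strings around may be the simpler option.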
Instructions
Complete the extract_text function. The starter code provides the signature.
- Create a variable named reader. Pass pdf_path to the PdfReader class to load the PDF.
- Create a list named pages. Loop over reader.pages and extract the text from each page.
- Join pages into a single string. Use a newline as the delimiter to keep page boundaries intact.
import pypdf

def extract_text(pdf_path):
    # Step 1: Create reader from pdf_path using PdfReader
    # Step 2: Extract text from each page into a list
    # Step 3: Join pages with newline and return
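One possible way to complete the starter code is sketched below. It follows the three steps from the instructions, with a few assumptions added here that are not part of the exercise: the helper name join_page_texts, the FakePage stand-in class, and the `or ""` guard for pages that yield no text. The pypdf import is placed inside extract_text only so the helper can be tried without the library installed.

```python
def join_page_texts(pages):
    """Extract text from each page-like object and join with newlines.

    join_page_texts is a hypothetical helper; it only requires that each
    item has an extract_text() method, as pypdf page objects do.
    """
    # `or ""` is a defensive assumption: treat a missing result as empty text.
    texts = [page.extract_text() or "" for page in pages]
    return "\n".join(texts)


def extract_text(pdf_path):
    # Imported here so join_page_texts can be used without pypdf installed.
    import pypdf

    # Step 1: create the reader from pdf_path
    reader = pypdf.PdfReader(pdf_path)
    # Steps 2 and 3: extract each page's text, then join with newlines
    return join_page_texts(reader.pages)


class FakePage:
    """Stand-in for a pypdf page, used only to demonstrate the helper."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo = join_page_texts([FakePage("Page one"), FakePage(None), FakePage("Page three")])
print(repr(demo))  # 'Page one\n\nPage three'
```

With pypdf installed, extract_text("doc.pdf") would return the same kind of string for a real file. The blank line in the demo output comes from the page that produced no text.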