Extract Raw Text from a PDF

Iterate over every page and join the text into one string

Extracting text with pypdf

pypdf is a third-party Python library for reading PDF files. You installed it in the previous chapter.

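Its PdfReader class loads a PDF and exposes a list-like .pages attribute. Each page has an .extract_text() method that returns the page's content as a plain string. A page with no extractable text, such as a scanned image, typically yields an empty string.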

To get all the text, loop over reader.pages, extract each page's text, and join the results with "\n". The newline keeps the last line of one page from running into the first line of the next, so downstream chunking doesn't merge text across a page break.

import pypdf

reader = pypdf.PdfReader("doc.pdf")  # load the PDF from disk
pages = [page.extract_text() for page in reader.pages]  # one string per page
text = "\n".join(pages)  # separate pages with a newline
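
To see what the join step does on its own, here is a minimal sketch that does not need pypdf or a PDF file. The strings in pages are made-up stand-ins for what extract_text() might return, including one empty page.

```python
# Hypothetical per-page strings standing in for extract_text() results.
# The second entry imitates a page with no extractable text.
pages = ["Title page\nAuthor name", "", "Chapter 1\nFirst paragraph."]

# Put exactly one newline between consecutive pages.
text = "\n".join(pages)

print(repr(text))
# 'Title page\nAuthor name\n\nChapter 1\nFirst paragraph.'
```

The empty page shows up as two newlines in a row. Page text can itself contain newlines, so the joined string separates pages but does not label where each one ended.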

Instructions

Complete the extract_text function. The starter code provides the signature.

  1. Create a variable named reader. Pass pdf_path to the PdfReader class to load the PDF.
  2. Create a list named pages. Loop over reader.pages and extract the text from each page.
  3. Join pages into a single string. Use a newline as the delimiter to keep page boundaries intact.