Split Text into Fixed-Size Chunks with Overlap
Write the chunking function that divides text into overlapping pieces
How the chunking loop works
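To split text with overlap, use a for loop whose stride (the distance between consecutive chunk start positions) is smaller than the chunk size, so each chunk begins with the last few characters of the previous one.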
If chunk_size = 500 and overlap = 100, the stride is 500 - 100 = 400. The loop starts at 0, then 400, then 800, and so on.
At each position i, slice text[i : i + chunk_size] to get a chunk.
for i in range(0, len(text), chunk_size - overlap):
    chunk = text[i : i + chunk_size]
    chunks.append(chunk)
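To make the stride arithmetic concrete, here is a small sketch that runs the same loop on a 26-character string. The values chunk_size = 10 and overlap = 3 are chosen only for illustration and are not the lesson's defaults.

```python
import string

# Illustrative values only; the lesson itself uses chunk_size=500, overlap=100.
text = string.ascii_lowercase  # 26 characters: "abc...xyz"
chunk_size = 10
overlap = 3
stride = chunk_size - overlap  # 10 - 3 = 7

chunks = []
for i in range(0, len(text), stride):
    chunks.append(text[i : i + chunk_size])

# Print each start position next to the chunk taken from it.
for start, chunk in zip(range(0, len(text), stride), chunks):
    print(start, repr(chunk))
# 0 'abcdefghij'
# 7 'hijklmnopq'
# 14 'opqrstuvwx'
# 21 'vwxyz'
```

Each chunk starts with the last three characters of the one before it ("hij", then "opq", then "vwx"). The final chunk is shorter because a slice that runs past the end of a string simply stops at the end.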
Instructions
Complete the chunk_text function. The starter code provides the signature.
- Create an empty list named chunks.
- Create a for loop with a variable named i. Start at 0, end at len(text), and increment by chunk_size - overlap.
- Inside the loop, append text[i : i + chunk_size] to chunks.
- After the loop, return chunks.
import pypdf

def extract_text(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)

def chunk_text(text, chunk_size=500, overlap=100):
    # Step 1: Create empty chunks list
    # Step 2: Loop from 0 to len(text) with stride chunk_size - overlap
    # Step 3: Append text[i : i + chunk_size] to chunks
    # Step 4: Return chunks
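One possible way to fill in the four steps is sketched below. The overlap check at the top is an addition that the exercise does not ask for: without it, overlap equal to chunk_size makes the stride 0 and range() raises a ValueError, while a larger overlap makes the stride negative and the loop silently produces no chunks. The sample string is made up for the demonstration.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    # Optional guard (an assumption, not one of the exercise steps):
    # the stride chunk_size - overlap must be positive.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Step 1: Create empty chunks list
    chunks = []
    # Step 2: Loop from 0 to len(text) with stride chunk_size - overlap
    for i in range(0, len(text), chunk_size - overlap):
        # Step 3: Append text[i : i + chunk_size] to chunks
        chunks.append(text[i : i + chunk_size])
    # Step 4: Return chunks
    return chunks


# Hypothetical sample input: 1200 characters with the default settings.
sample = "x" * 1200
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [500, 500, 400]
```

With the defaults the loop starts at 0, 400, and 800, so a 1200-character text yields three chunks, and the last one holds only the 400 characters that remain.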