Chunk into Structured Format

Rewrite the indexer to produce chunks with source filenames attached

The current index_folder returns a list of plain strings. That's enough for embedding, but it means the assistant loses all information about where each piece of text came from the moment it enters the pipeline.

This creates three concrete problems:

  • No attribution: When the assistant answers a question, it can't tell you which file the answer came from.
  • No routing: Later in the course, you'll add @filename targeting — "search only in README.md". Without the source tracked per chunk, that feature is impossible.
  • No filtering: If a user asks about a specific file, there's no way to limit results to chunks from that file.

The fix is to change what index_folder produces. Instead of a list of strings, it returns a list of dicts:

{"text": "chunk content...", "source": "README.md"}

Each dict carries the chunk text and the filename it came from. The source value uses just the filename (not the full path) because that's what gets displayed to the user.
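To see why the source key matters, here is a minimal sketch of filtering by file. The sample chunks and the helper name chunks_from are made up for illustration and are not part of the course code.

```python
# Hypothetical sample data in the new chunk format.
chunks = [
    {"text": "Install with pip.", "source": "README.md"},
    {"text": "Version 1.2 adds search.", "source": "CHANGELOG.md"},
    {"text": "Run the tests with pytest.", "source": "README.md"},
]


def chunks_from(chunks, source):
    """Return only the chunks whose "source" matches the given filename."""
    return [c for c in chunks if c["source"] == source]


readme_chunks = chunks_from(chunks, "README.md")
print([c["text"] for c in readme_chunks])
# → ['Install with pip.', 'Run the tests with pytest.']
```

With plain strings there would be nothing to filter on. With dicts, attribution and filtering both become a lookup on one key.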

Chunking with overlap

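Large files contain more text than a single embedding can capture well, so index_folder slices each file's content into fixed-size segments with a small overlap at the boundaries. Using chunk_size - overlap as the loop step means each chunk starts overlap characters before the previous one ends, so adjacent chunks share overlap characters. A sentence that straddles a boundary therefore appears in full in at least one of the two adjacent chunks, provided it is no longer than overlap characters. Note that overlap must be smaller than chunk_size: a step of zero makes range raise a ValueError, and a negative step produces no chunks at all.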
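The arithmetic is easier to follow with concrete numbers. The sketch below uses a hypothetical helper, chunk_spans, that is not part of the course code; it computes only the start and end offsets that the slicing loop would use for a 1,200-character file with the default settings.

```python
def chunk_spans(length, chunk_size=500, overlap=100):
    """Return (start, end) offsets for each chunk of a text of the given length."""
    step = chunk_size - overlap  # 400 with the defaults
    # min() mirrors how slicing clips the last chunk at the end of the text.
    return [(i, min(i + chunk_size, length)) for i in range(0, length, step)]


spans = chunk_spans(1200)
print(spans)
# → [(0, 500), (400, 900), (800, 1200)]
```

Characters 400 to 499 appear in both the first and second chunk, and characters 800 to 899 appear in both the second and third. The last chunk is shorter than chunk_size because the text runs out.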

Instructions

  1. Update index_folder's signature to index_folder(folder, chunk_size=500, overlap=100).
  2. Replace the texts list with an empty list called chunks.
  3. After reading text and checking it isn't empty, add filename = os.path.basename(path) to get just the filename.
  4. Replace texts.append(text) with a loop that creates one chunk dict per segment:
    • Add for i in range(0, len(text), chunk_size - overlap): as the loop header. It advances through the text in steps of chunk_size - overlap, so adjacent chunks share overlap characters at their boundaries.
    • Inside the loop, append {"text": text[i:i + chunk_size], "source": filename} to chunks. The "source" key records which file the chunk came from.
  5. Return chunks.
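Put together, the steps above might look like the sketch below. The existing body of index_folder is not shown in this chapter, so the folder traversal (os.walk), the UTF-8 read, and the skipping of unreadable files are assumptions; your version may differ. The overlap check at the top is also an addition, not one of the steps.

```python
import os


def index_folder(folder, chunk_size=500, overlap=100):
    """Return a list of {"text": ..., "source": ...} dicts for files in folder."""
    # Extra guard (not in the instructions): a zero or negative step would break the loop.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    # Assumption: the original walks the folder and reads each file as UTF-8 text.
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if not text.strip():
                continue  # skip empty files

            filename = os.path.basename(path)  # just the name, not the full path
            for i in range(0, len(text), chunk_size - overlap):
                chunks.append({"text": text[i:i + chunk_size], "source": filename})
    return chunks
```

Calling index_folder on a folder that holds a single 1,200-character README.md would return three dicts, each with "source" set to "README.md".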