Test the Search
Run the full extract-chunk-embed-search pipeline and print results
Putting it together for the first time
test_search calls every function you have written so far in sequence. It is not part of the final app — it is a sanity check to confirm that search returns meaningful results before you add generation.
Running this locally against a real PDF and a real question is the first time the pipeline feels like magic.
Instructions
Complete the test_search function. The starter code provides the signature.
- Create a variable named `text`. Assign it `extract_text(pdf_path)`.
- Create a variable named `chunks`. Assign it `chunk_text(text)`.
- Create a variable named `embeddings`. Assign it `embed_all_chunks(client, chunks)`.
- Create a variable named `results`. Assign it `search(client, question, chunks, embeddings)`.
- Use a `for` loop with variables `i` and `chunk` over `enumerate(results, 1)`. Inside the loop, print `f"Result {i}:\n{chunk}\n"`.
import os
import time

import numpy as np
import pypdf
from dotenv import load_dotenv
from google import genai
from google.genai import types


def extract_text(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)


def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
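With the defaults of 500 and 100 the chunk boundaries are hard to eyeball. Running the same `chunk_text` definition on a tiny string with small parameters (repeated here so the snippet runs on its own) makes the sliding window visible:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks

# Each chunk starts chunk_size - overlap = 2 characters after the previous one,
# so consecutive chunks share 2 characters.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note that the final chunk may be shorter than `chunk_size` when the text does not divide evenly.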
def preview_chunks(chunks):
    print(f"Total chunks: {len(chunks)}")
    print(f"First chunk:\n{chunks[0]}")


def create_client():
    load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY")
    client = genai.Client(api_key=api_key)
    return client


def embed_text(client, text):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    )
    return result.embeddings[0].values


def embed_all_chunks(client, chunks):
    BATCH_SIZE = 90
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        for chunk in batch:
            embeddings.append(embed_text(client, chunk))
        if i + BATCH_SIZE < len(chunks):
            print("Rate limit pause — waiting 60 seconds...")
            time.sleep(60)
    return embeddings


def cosine_similarity(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm
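As a sanity check, `cosine_similarity` should return 1.0 for vectors pointing the same way, 0.0 for orthogonal vectors, and something in between otherwise. The same definition is repeated here so the snippet is self-contained:

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal → 0.0
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))  # 45° apart → ~0.707
```

Magnitude does not matter, only direction — which is why it ranks embedding vectors well.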
def search(client, query, chunks, embeddings, top_k=3):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=query,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
    )
    query_vector = result.embeddings[0].values
    scores = [(cosine_similarity(query_vector, emb), chunk) for emb, chunk in zip(embeddings, chunks)]
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:top_k]]
def test_search(client, pdf_path, question):
    # Step 1: Extract text from PDF
    # Step 2: Chunk the text
    # Step 3: Embed all chunks
    # Step 4: Search for relevant chunks
    # Step 5: Loop and print each result
    pass  # replace this line with your implementation