Test the Search
Run the full extract-chunk-embed-search pipeline and print results
Putting it together for the first time
test_search calls every function you have written so far in sequence. It is not part of the final app — it is a sanity check to confirm that search returns meaningful results before you add generation.
Running this locally against a real PDF and a real question is the first time the pipeline feels like magic.
Instructions
Complete the test_search function. The starter code provides the signature.
- Create a variable named `text`. Assign it `extract_text(pdf_path)`.
- Create a variable named `chunks`. Assign it `chunk_text(text)`.
- Create a variable named `embeddings`. Assign it `embed_all_chunks(client, chunks)`.
- Create a variable named `results`. Assign it `search(client, question, chunks, embeddings)`.
- Use a `for` loop with variables `i` and `chunk` over `enumerate(results, 1)`. Inside the loop, print `f"Result {i}:\n{chunk}\n"`.
import os
import time

import numpy as np
import pypdf
from dotenv import load_dotenv
from google import genai
from google.genai import types


def extract_text(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)


def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
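With the defaults of 500 and 100 the chunk boundaries are hard to eyeball. Running the same `chunk_text` definition on a tiny string with small parameters (repeated here so the snippet runs on its own) makes the sliding window visible:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks

# Each chunk starts chunk_size - overlap = 2 characters after the previous one,
# so consecutive chunks share 2 characters.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note that the final chunk may be shorter than `chunk_size` when the text does not divide evenly.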
def preview_chunks(chunks):
    print(f"Total chunks: {len(chunks)}")
    print(f"First chunk:\n{chunks[0]}")


def create_client():
    load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY")
    client = genai.Client(api_key=api_key)
    return client


def embed_text(client, text):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    )
    return result.embeddings[0].values


def embed_all_chunks(client, chunks):
    BATCH_SIZE = 90
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        for chunk in batch:
            embeddings.append(embed_text(client, chunk))
        if i + BATCH_SIZE < len(chunks):
            print("Rate limit pause — waiting 60 seconds...")
            time.sleep(60)
    return embeddings


def cosine_similarity(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm
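As a sanity check, `cosine_similarity` should return 1.0 for vectors pointing the same way, 0.0 for orthogonal vectors, and something in between otherwise. The same definition is repeated here so the snippet is self-contained:

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    dot = np.dot(vec_a, vec_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot / norm

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal → 0.0
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))  # 45° apart → ~0.707
```

Magnitude does not matter, only direction — which is why it ranks embedding vectors well.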
def search(client, query, chunks, embeddings, top_k=3):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=query,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
    )
    query_vector = result.embeddings[0].values
    scores = [(cosine_similarity(query_vector, emb), chunk) for emb, chunk in zip(embeddings, chunks)]
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:top_k]]
def test_search(client, pdf_path, question):
    # Step 1: Extract text from PDF
    # Step 2: Chunk the text
    # Step 3: Embed all chunks
    # Step 4: Search for relevant chunks
    # Step 5: Loop and print each result
    pass  # replace this line with your implementation