Implement Cosine Similarity
Write the function that scores how similar two vectors are
From formula to Python
You know the cosine similarity formula from the previous chapter. Now translate it to Python using numpy.
numpy gives you two functions that map directly to the formula:
- np.dot(a, b) — computes the dot product of two vectors
- np.linalg.norm(a) — computes the norm (length) of a vector
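To see what these two building blocks return on their own, here is a small sketch with made-up two-element vectors:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(np.dot(a, b))       # 1*3 + 2*4 = 11
print(np.linalg.norm(a))  # sqrt(1**2 + 2**2) ≈ 2.236
```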
Divide the dot product by the product of the two norms. That gives you the similarity score.
import numpy as np

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
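A quick sanity check of the function above (a sketch with made-up vectors, not part of the exercise): vectors pointing in the same direction score ≈ 1.0, and perpendicular vectors score 0.0.

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

print(cosine_similarity([1, 1], [2, 2]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (perpendicular)
```

Note that cosine similarity ignores magnitude: [1, 1] and [2, 2] score ≈ 1.0 even though the vectors have different lengths.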
Instructions
Complete the cosine_similarity function. The starter code provides the signature.
- Create a variable named dot. Assign it np.dot(vec_a, vec_b).
- Create a variable named norm. Assign it np.linalg.norm(vec_a) * np.linalg.norm(vec_b).
- Return dot / norm.
import os
import time
import numpy as np
import pypdf
from dotenv import load_dotenv
from google import genai
from google.genai import types
def extract_text(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
def preview_chunks(chunks):
    print(f"Total chunks: {len(chunks)}")
    print(f"First chunk:\n{chunks[0]}")
def create_client():
    load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY")
    client = genai.Client(api_key=api_key)
    return client
def embed_text(client, text):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    )
    return result.embeddings[0].values
def embed_all_chunks(client, chunks):
    BATCH_SIZE = 90
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        for chunk in batch:
            embeddings.append(embed_text(client, chunk))
        if i + BATCH_SIZE < len(chunks):
            print("Rate limit pause — waiting 60 seconds...")
            time.sleep(60)
    return embeddings
def cosine_similarity(vec_a, vec_b):
    # Step 1: Compute the dot product
    # Step 2: Compute the product of the norms
    # Step 3: Return dot / norm
    pass  # TODO: complete per the instructions above
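The chunk_text helper in the starter code steps through the text by chunk_size - overlap characters, so consecutive chunks share an overlap-sized window. A small sketch with made-up numbers (chunk_size=4, overlap=2, chosen only for illustration) shows the effect:

```python
# Same chunking logic as the starter code above, shown with tiny
# values so the overlapping windows are easy to see.
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij'] — each chunk shares 2 chars with the next
```

The overlap keeps sentences that straddle a chunk boundary from being split beyond recovery: the trailing characters of one chunk reappear at the start of the next.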