Implement Cosine Similarity
Write the function that scores how similar two vectors are
From formula to Python
You know the cosine similarity formula from the previous chapter. Now translate it to Python using numpy.
numpy gives you two functions that map directly to the formula:
- np.dot(a, b) — computes the dot product of two vectors
- np.linalg.norm(a) — computes the norm (length) of a vector
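To see what these two building blocks return on their own, here is a small sketch with made-up two-element vectors:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

print(np.dot(a, b))       # 1*3 + 2*4 = 11
print(np.linalg.norm(a))  # sqrt(1**2 + 2**2) ≈ 2.236
```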
Divide the dot product by the product of the two norms. That gives you the similarity score.
import numpy as np

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
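A quick sanity check of the function above (a sketch with made-up vectors, not part of the exercise): vectors pointing in the same direction score ≈ 1.0, and perpendicular vectors score 0.0.

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

print(cosine_similarity([1, 1], [2, 2]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (perpendicular)
```

Note that cosine similarity ignores magnitude: [1, 1] and [2, 2] score ≈ 1.0 even though the vectors have different lengths.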
Instructions
Complete the cosine_similarity function. The starter code provides the signature.
- Create a variable named dot. Assign it np.dot(vec_a, vec_b).
- Create a variable named norm. Assign it np.linalg.norm(vec_a) * np.linalg.norm(vec_b).
- Return dot / norm.
import os
import time
import numpy as np
import pypdf
from dotenv import load_dotenv
from google import genai
from google.genai import types
def extract_text(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    pages = [page.extract_text() for page in reader.pages]
    return "\n".join(pages)
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks
def preview_chunks(chunks):
    print(f"Total chunks: {len(chunks)}")
    print(f"First chunk:\n{chunks[0]}")
def create_client():
    load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY")
    client = genai.Client(api_key=api_key)
    return client
def embed_text(client, text):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    )
    return result.embeddings[0].values
def embed_all_chunks(client, chunks):
    BATCH_SIZE = 90
    embeddings = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        for chunk in batch:
            embeddings.append(embed_text(client, chunk))
        if i + BATCH_SIZE < len(chunks):
            print("Rate limit pause — waiting 60 seconds...")
            time.sleep(60)
    return embeddings
def cosine_similarity(vec_a, vec_b):
    # Step 1: Compute the dot product
    # Step 2: Compute the product of the norms
    # Step 3: Return dot / norm
    pass  # TODO: complete per the instructions above
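The chunk_text helper in the starter code steps through the text by chunk_size - overlap characters, so consecutive chunks share an overlap-sized window. A small sketch with made-up numbers (chunk_size=4, overlap=2, chosen only for illustration) shows the effect:

```python
# Same chunking logic as the starter code above, shown with tiny
# values so the overlapping windows are easy to see.
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i : i + chunk_size])
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij'] — each chunk shares 2 chars with the next
```

The overlap keeps sentences that straddle a chunk boundary from being split beyond recovery: the trailing characters of one chunk reappear at the start of the next.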