Load or Embed

Check for a cached embedding file and skip the API if one exists

Avoiding redundant work

Embedding a full PDF can take minutes and costs API quota. If you already embedded this PDF in a previous run, you should not pay that cost again. The main() function checks for a cache file before calling the API.

The decision is a single if/else:

cache exists?
  yes → load chunks and embeddings from disk
  no  → extract text, chunk, embed, save to cache

The branching logic

The load_embeddings function you wrote in the previous chapter returns None when the cache file does not exist. That makes the check simple:

cached = load_embeddings(cache_path)
if cached:
    chunks, embeddings = cached
else:
    # embed from scratch and save
    ...  # placeholder so the snippet parses; the steps below fill this in

None is falsy and a two-item tuple is truthy, so if cached: is true exactly when a cache was loaded. In that case cached holds a tuple of (chunks, embeddings), and the if branch unpacks it into two variables. The else branch runs the full embedding pipeline and saves the result so the next run can skip this step.
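
To see both branches run without touching the API, here is a minimal, self-contained sketch of the same pattern. Everything in it is an assumption made for illustration: the JSON file format, the fake_embed placeholder that stands in for the real API call, and the load_or_embed wrapper are all hypothetical, and the load_embeddings and save_embeddings bodies are simplified stand-ins rather than the versions you wrote in the previous chapter.

```python
import json
import os
import tempfile


def load_embeddings(cache_path):
    """Stand-in loader (assumed JSON format): return (chunks, embeddings) or None."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["chunks"], data["embeddings"]


def save_embeddings(chunks, embeddings, cache_path):
    """Stand-in saver (assumed JSON format): write both lists to one file."""
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"chunks": chunks, "embeddings": embeddings}, f)


def fake_embed(chunks):
    """Hypothetical placeholder for the API call: one tiny vector per chunk."""
    return [[float(len(c)), float(c.count(" "))] for c in chunks]


def load_or_embed(cache_path, raw_chunks):
    """Hypothetical wrapper: use the cache if present, else embed and save."""
    cached = load_embeddings(cache_path)
    if cached:
        chunks, embeddings = cached
        source = "cache"
    else:
        chunks = list(raw_chunks)
        embeddings = fake_embed(chunks)
        save_embeddings(chunks, embeddings, cache_path)
        source = "fresh"
    return chunks, embeddings, source


# Run twice against the same cache path: the first call embeds, the second loads.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "demo_cache.json")
    first = load_or_embed(cache_path, ["hello world", "second chunk"])
    second = load_or_embed(cache_path, ["hello world", "second chunk"])

print(first[2], second[2])  # prints "fresh cache"
```

The second call returns the same chunks and embeddings as the first, but it never reaches fake_embed. That is the saving the cache buys you once the placeholder is replaced by a real API call.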

Instructions

Continue the main function. The starter code has the argument parsing from the previous chapter already filled in.

  1. Print f"Loading {pdf_path}...".
  2. Create a variable named cached. Assign it load_embeddings(cache_path).
  3. Add an if cached: block. Inside it, unpack cached into chunks, embeddings using chunks, embeddings = cached, then print f"Loaded cache from {cache_path}".
  4. Add an else: block. Inside it: create text from extract_text(pdf_path), create chunks from chunk_text(text), print f"No cache found. Embedding {len(chunks)} chunks...", create embeddings from embed_all_chunks(client, chunks), call save_embeddings(chunks, embeddings, cache_path), and print f"Cache saved to {cache_path}".