Update app.py

Remove the file I/O functions from app.py and rewrite main() to accept a folder

💻

Writing code and entering commands is only available on desktop. Open this page on a larger screen to complete this chapter.

Now that files.py owns all file I/O, app.py can be cleaned up. Three things need to change:

Remove the moved functions. list_files, read_file, and index_folder now live in files.py. The SUPPORTED_EXTENSIONS constant moved there too. Keeping these in app.py would create two copies that could drift out of sync.

Remove the PDF-specific code. The original app.py was built around PDFs — extract_text uses pypdf, chunk_text splits raw text, and print_result displays answers. None of these belong in the new folder-based assistant. import pypdf goes away with them.

Rewrite main(). The new version accepts a folder path instead of a PDF path and a question. It calls index_folder from files.py to get structured chunks, then extracts the plain text strings before embedding.

Notice that embed_all_chunks now takes texts (plain strings extracted from the chunk dicts) instead of the dicts themselves. main() extracts them with [chunk["text"] for chunk in chunks] before passing them to embed_all_chunks.

After this chapter, running python3 app.py ./docs will index the folder and cache the result — but won't yet answer questions. The question-answering loop comes in Lesson 2.

Instructions

  1. Delete the extract_text, chunk_text, print_result, list_files, read_file, and index_folder functions from app.py.
  2. Delete import pypdf.
  3. Delete the SUPPORTED_EXTENSIONS constant — it now lives in files.py.
  4. Add from files import index_folder below the existing imports.
  5. Replace main() with this new version:

def main():
       if len(sys.argv) < 2:
           print("Usage: python app.py <folder>")
           sys.exit(1)
       folder = sys.argv[1]
       cache_path = folder.rstrip("/\\") + ".cache.json"
       client = create_client()
       cached = load_embeddings(cache_path)
       if cached:
           chunks, embeddings = cached
           print(f"Loaded cache from {cache_path}")
       else:
           print(f"Indexing {folder}...")
           chunks = index_folder(folder)
           texts = [chunk["text"] for chunk in chunks]
           file_count = len(set(chunk["source"] for chunk in chunks))
           print(f"Indexed {len(chunks)} chunks from {file_count} files.")
           embeddings = embed_all_chunks(client, texts)
           save_embeddings(chunks, embeddings, cache_path)
           print(f"Cache saved to {cache_path}")