Priyanka’s Substack

Creating a Hands-Free YouTube Experience with Computer Vision and AI

Priyanka — Thu, 15 May 2025 15:43:47 GMT

Imagine controlling your YouTube playback without touching your keyboard or mouse - just using simple hand gestures. That's exactly what a fascinating project I recently discovered on GitHub aims to accomplish. Let's dive into how modern computer vision and AI techniques are making this possible.

The Problem: When Your Hands Are Busy

We've all been there - cooking with messy hands, working out, or maybe just eating while watching a tutorial or a favorite video. Reaching for the keyboard to pause, play, or adjust volume becomes an inconvenience. This might seem like a minor issue, but it's precisely these small frictions in user experience that innovative technology seeks to address.

Enter Gesture Control

The CV_YoutubeGestures_GenAI project takes a practical approach to solving this problem by implementing a gesture recognition system that allows users to control YouTube with simple hand movements. Based on the official README, the system supports several intuitive gestures:

👆 Index finger up: Play/Pause
✌️ Index + Middle fingers up: Next video
🤘 Ring + Pinky fingers up: Previous video
🖐️ All fingers up: Toggle fullscreen
🤏 Thumb + Index close: Volume down
🖐️ Thumb + Index apart: Volume up

The application provides a side-by-side display of the YouTube video and webcam feed, making it easy to see both your gestures and the content you're controlling. It even supports multiple camera devices, offering flexibility in your setup.

How It Works: The Technical Breakdown

The project leverages several key technologies:

1. Computer Vision with MediaPipe

At the core of the system is MediaPipe, Google's open-source framework for building multimodal (vision, audio, etc.) applied ML pipelines. MediaPipe Hands is particularly crucial for this project as it:

Detects hand presence in video frames
Tracks 21 3D landmarks on each hand
Works in real-time with minimal latency
Functions across different lighting conditions and backgrounds

The system processes the webcam feed, analyzing each frame to detect hands and interpret specific gestures based on the relative positions of these landmarks.
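To make the landmark idea concrete, here is a minimal sketch of how "finger up" could be read from the 21 landmarks. It is not taken from the project's code: the function name and the tip-above-joint rule are my own assumptions, and only the landmark indices (fingertips 8, 12, 16, 20 and the PIP joints 6, 10, 14, 18) follow MediaPipe's hand landmark numbering.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]  # normalized (x, y); y grows downward in image coordinates

# MediaPipe Hands landmark indices: (fingertip, PIP joint) for each non-thumb finger.
FINGER_JOINTS = {
    "index": (8, 6),
    "middle": (12, 10),
    "ring": (16, 14),
    "pinky": (20, 18),
}


def fingers_up(landmarks: Sequence[Point]) -> Dict[str, bool]:
    """Return {finger_name: raised?} for the four non-thumb fingers."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    # Assumed heuristic: a finger is raised when its tip is above its PIP joint,
    # i.e. the tip has the smaller y value. This only holds for an upright hand.
    return {
        name: landmarks[tip][1] < landmarks[pip][1]
        for name, (tip, pip) in FINGER_JOINTS.items()
    }
```

In a live loop, the `(x, y)` pairs would come from the landmarks MediaPipe returns for each frame. The thumb is left out because it bends sideways and needs a different test.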

2. Machine Learning for Gesture Classification

After extracting hand landmarks, the system uses machine learning to classify different gestures. By training on various hand positions, the model learns to recognize patterns that correspond to specific commands.
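I haven't verified how the project's classifier is trained, so the sketch below swaps in a much simpler rule-based lookup rather than a learned model. It maps finger states to the commands in the gesture list above. The command names, the `close` and `apart` thresholds, and both function names are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

# (index, middle, ring, pinky) raised? -> command name (names are made up here)
GESTURE_TABLE = {
    (True, False, False, False): "play_pause",
    (True, True, False, False): "next_video",
    (False, False, True, True): "previous_video",
    (True, True, True, True): "toggle_fullscreen",
}


def classify_gesture(up: Dict[str, bool]) -> Optional[str]:
    """Look up a command for the given finger states, or None if no rule matches."""
    pattern = (up["index"], up["middle"], up["ring"], up["pinky"])
    return GESTURE_TABLE.get(pattern)


def pinch_volume_step(
    thumb_tip: Tuple[float, float],
    index_tip: Tuple[float, float],
    close: float = 0.05,
    apart: float = 0.20,
) -> int:
    """Return -1 (volume down), +1 (volume up) or 0 from the thumb-index distance."""
    distance = math.dist(thumb_tip, index_tip)  # in normalized image units
    if distance < close:
        return -1
    if distance > apart:
        return 1
    return 0
```

A lookup table like this is easy to debug and needs no training data. A trained classifier should cope better with tilted hands and partially bent fingers.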

3. Browser Automation with Selenium

To actually control YouTube, the project employs Selenium, a powerful browser automation tool. Once a gesture is recognized and classified, Selenium executes the corresponding action on the YouTube player - whether that's clicking the pause button, adjusting the volume slider, or navigating through the video timeline.
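One practical detail is that a held gesture shows up in every frame, so commands need some rate limiting. Below is a rough sketch of a dispatcher with a cooldown. It assumes commands are sent as YouTube's standard keyboard shortcuts (`k`, `f`, `Shift+N`, `Shift+P`); the project may drive the player differently. The class name, command names, and cooldown value are all hypothetical.

```python
import time
from typing import Callable, Optional

# YouTube player keyboard shortcuts, keyed by made-up command names.
YOUTUBE_SHORTCUTS = {
    "play_pause": "k",
    "toggle_fullscreen": "f",
    "next_video": "shift+n",
    "previous_video": "shift+p",
}


class CommandDispatcher:
    """Send at most one shortcut per cooldown window."""

    def __init__(
        self,
        send_keys: Callable[[str], None],
        cooldown_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_keys = send_keys
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_fired: Optional[float] = None

    def dispatch(self, command: Optional[str]) -> bool:
        """Send the shortcut for `command`; return True if something was sent."""
        if command not in YOUTUBE_SHORTCUTS:
            return False
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self._cooldown_s:
            return False  # still cooling down: ignore the repeated gesture
        self._send_keys(YOUTUBE_SHORTCUTS[command])
        self._last_fired = now
        return True
```

The `send_keys` callable is where the real backend plugs in, whether that is a Selenium element's `send_keys` or a system-level keyboard library. Keeping it injectable also means the cooldown logic can be tested without a browser.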

Implementation and System Requirements

According to the README, the project requires:

Python 3.7 or higher
A functioning webcam
Windows OS (specifically for the volume control functionality)
Several Python libraries including OpenCV, MediaPipe, NumPy, yt-dlp, keyboard, and pycaw

The implementation is streamlined for ease of use. After installation, you simply:

Run the application with python main.py
Enter a YouTube video URL (or use the default)
Position your hand in front of the webcam
Make the appropriate gesture for your desired action

The modular design separates concerns between gesture detection and YouTube control, making the code easier to understand and extend.

The Broader Implications

This project demonstrates something greater than just a convenient way to watch YouTube. It represents an important trend in human-computer interaction:

Natural Interfaces: Moving away from traditional input devices toward systems that understand human movement and intention
Accessibility: Creating alternative ways to interact with technology for people with different abilities and needs
Ambient Computing: Enabling technology interaction that blends into our environment rather than demanding explicit attention

Building On This Foundation

What makes this project particularly exciting is its extensibility. The framework established here could be adapted for:

Controlling other streaming platforms
Smart home interactions
Educational applications for children
Accessible computing for individuals with mobility limitations

Getting Started With The Project

If you're interested in trying this out or contributing, the repository includes clear setup instructions:

Clone the repository:

git clone https://github.com/Prikat25/CV_YoutubeGestures_GenAI.git
cd CV_YoutubeGestures_GenAI/

Install the required dependencies:

pip install -r requirements.txt

Run the application:

python main.py

When prompted, enter a YouTube video URL or press Enter to use the default video
Position your webcam to detect hand gestures
Start controlling YouTube with your hands!

If you encounter any issues with webcam access, the README provides helpful troubleshooting steps:

Ensure your webcam is properly connected
Check that no other application is using the webcam
Verify that your webcam is enabled in your system settings
Test your webcam with the Windows Camera app
Try closing other applications that might be using the webcam
Restart your computer if problems persist

The project is open-source under the MIT License, welcoming contributions from developers interested in expanding its capabilities or improving performance.

Future Directions

The potential evolutions of this technology are numerous:

Multi-gesture combinations for more complex controls
Personalized gesture training to accommodate individual preferences
Integration with other applications beyond YouTube
Enhanced recognition accuracy through more sophisticated models

Final Thoughts

Projects like CV_YoutubeGestures_GenAI highlight how the boundaries between humans and computers continue to blur in the most interesting ways. As computer vision and AI technologies become more accessible to developers, we're seeing an explosion of creative applications that make technology more intuitive and human-centered.

Whether you're a developer looking to contribute to an interesting open-source project, or simply someone fascinated by the evolution of human-computer interaction, this gesture control system offers a glimpse into a future where technology responds to our natural movements rather than forcing us to adapt to rigid interfaces.

What everyday friction points do you encounter with technology that could be solved with similar approaches? The possibilities are limited only by our imagination.

Have you experimented with gesture control or computer vision projects? Share your experiences in the comments below!

Building a Podcast Transcript Summarizer with Replit, Spotify API, and Google Gemini

Priyanka — Sat, 10 May 2025 16:32:39 GMT

In today's fast-paced world, we often don't have time to listen to entire podcast episodes. What if there was a tool that could summarize podcast content, allowing you to get the key points in just a few minutes? That's exactly what I built using Replit: a Podcast Transcript Summarizer that leverages the Spotify API to find podcasts and Google's Gemini AI to create concise summaries.

In this blog post, I'll walk you through how I used Replit to build this application from scratch and then deployed it to GitHub.

What We're Building

The Podcast Transcript Summarizer is a web application that allows users to:

Search for podcasts using keywords or specific podcast names
Fetch episodes and their transcripts from Spotify
Generate AI-powered summaries using Google's Gemini
View both the summary and full transcript in a clean interface

All of this is packaged in a user-friendly interface built with Streamlit.

Why Replit?

Replit provided the perfect environment for this project because:

It offers a complete development environment without any local setup
It integrates seamlessly with external APIs like Spotify and Google Gemini
It provides secure secret management for API keys
It has built-in workflows for running Streamlit applications
It makes deployment and sharing incredibly simple

Step 1: Setting Up the Project on Replit

Getting started was as simple as creating a new Replit project and selecting Python as the primary language. Replit automatically set up a Python environment with all the necessary tools.

The first prompt I used to kick off the project was:

python langgrpah streamlit and gemini api to get transcripts from spotify api and then summarize them using llm and display that summary in website. So user enters query -> search that in spotify db using api, then get transcript, then summarize that transcript which is the final output displayed to user.

This clear description of what I wanted to build helped set the project direction from the beginning.

Step 2: Installing Dependencies

Replit made it incredibly easy to install the necessary packages. I needed:

streamlit for the web interface
spotipy for interacting with the Spotify API
google-generativeai for accessing Google's Gemini AI

All dependencies were automatically managed in the pyproject.toml file.

Step 3: Setting Up API Keys

One of the most convenient features of Replit is its secure management of environment variables. I needed three important API keys:

SPOTIFY_CLIENT_ID
SPOTIFY_CLIENT_SECRET
GEMINI_API_KEY

Adding these to Replit was straightforward using the Secrets tab. This kept my API keys secure while still making them accessible to my application.
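Replit exposes secrets to the running app as environment variables, so the app can read them with `os.environ`. Here is a small sketch of a startup check that fails early when a key is missing. The helper name `load_secrets` is my own, not something from the project.

```python
import os
from typing import Dict, Mapping, Sequence

REQUIRED_SECRETS = ("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET", "GEMINI_API_KEY")


def load_secrets(
    names: Sequence[str] = REQUIRED_SECRETS,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return the requested secrets, or raise if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

Calling `load_secrets()` once at startup turns a confusing authentication error deep inside an API call into a clear message about which key is not set.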

Step 4: Building the Core Components

Spotify API Integration

First, I created a module to handle interaction with the Spotify API:

Search for podcasts by keywords
Filter searches by podcast name only
Retrieve episode details
Fetch episode transcripts when available

Gemini AI Integration

Next, I integrated Google's Gemini API to:

Process podcast transcripts
Generate concise, structured summaries
Format the output for easy reading
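As a rough illustration of the prompt side, this sketch builds a summarization prompt and caps very long transcripts. The wording of the prompt, the `max_chars` limit, and the function name are assumptions for illustration, not the prompt the app actually uses.

```python
def build_summary_prompt(title: str, transcript: str, max_chars: int = 20000) -> str:
    """Build a summarization prompt, truncating overly long transcripts."""
    text = transcript.strip()
    if len(text) > max_chars:
        # Cut at the last space before the limit so no word is split in half.
        text = text[:max_chars].rsplit(" ", 1)[0] + " [truncated]"
    return (
        f"Summarize the podcast episode '{title}'.\n"
        "Return a short overview followed by 3 to 5 bullet points with the key takeaways.\n\n"
        f"Transcript:\n{text}"
    )
```

The returned string is what would be handed to the Gemini client, for example through `generate_content` in the `google-generativeai` package.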

Streamlit User Interface

The final piece was building a clean, intuitive interface with Streamlit that included:

A search form with filtering options
A direct episode lookup feature for Spotify URLs
Tabs for viewing summaries and full transcripts
Responsive layout with proper image handling

Step 5: Refining the Application

After building the basic functionality, I made several improvements:

Added a "Podcast Name Only" search filter for more precise results
Implemented direct episode lookup using Spotify URLs
Enhanced the display of podcast results with better image handling
Improved error handling for cases where transcripts aren't available
Optimized the Gemini prompt for better summary quality
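The direct episode lookup comes down to pulling the episode ID out of a Spotify share link. Here is a minimal sketch using only the standard library. The function name is hypothetical, and it only handles `open.spotify.com` links of the form `/episode/<id>`.

```python
from typing import Optional
from urllib.parse import urlparse


def episode_id_from_url(url: str) -> Optional[str]:
    """Extract the episode ID from an open.spotify.com episode link, else None."""
    parsed = urlparse(url.strip())
    if parsed.netloc != "open.spotify.com":
        return None
    parts = [part for part in parsed.path.split("/") if part]
    # Accept /episode/<id> as well as locale-prefixed paths such as /intl-de/episode/<id>.
    if "episode" in parts:
        position = parts.index("episode")
        if position + 1 < len(parts):
            return parts[position + 1]
    return None
```

The query string (such as `?si=...`) is dropped by `urlparse`, so the ID can be passed straight to the Spotify client, for instance spotipy's `episode` method.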

Step 6: Version Control with Git and GitHub

Once the application was complete, I pushed it to GitHub for version control and sharing. While Replit has some limitations with direct Git commands, I was able to set up Git and push my repository to GitHub using the following commands:

git init

git add .

git commit -m "Add Podcast Transcript Summarizer using Spotify and Gemini APIs"

git remote add origin "Replace with Github Project Remote URL"

git push -u origin main

This made my project accessible outside of Replit and allowed for easier collaboration and version tracking.

Challenges and Solutions

Challenge 1: Spotify Transcript Access

The Spotify API doesn't always provide full transcripts for all podcasts. To address this, I implemented a fallback mechanism that uses episode descriptions when transcripts aren't available.
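The fallback can be sketched as a small selection function. The dictionary keys here are assumptions: `description` mirrors the field on Spotify episode objects, while `transcript` is a hypothetical key for wherever the app stores transcript text.

```python
from typing import Mapping, Tuple


def text_for_summary(episode: Mapping[str, object]) -> Tuple[str, str]:
    """Return (text, source), preferring the transcript over the description."""
    transcript = str(episode.get("transcript") or "").strip()
    if transcript:
        return transcript, "transcript"
    description = str(episode.get("description") or "").strip()
    if description:
        return description, "description"
    raise ValueError("episode has neither a transcript nor a description")
```

Returning the source alongside the text lets the UI tell the user when a summary is based only on the episode description.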

Challenge 2: Git Integration with Replit

Initially, I wanted to use Replit's built-in Git integration, but I ran into errors when adding the repository URL. I worked around this by downloading the project folder and pushing it to the repository with Git commands from Git Bash.

Challenge 3: UI Layout and Responsiveness

Getting the UI just right took some iteration, particularly with image sizing and result formatting. Streamlit's container and column features helped create a clean, responsive layout.

The Final Result

The finished Podcast Transcript Summarizer provides a seamless experience:

Users can search for podcasts or input direct Spotify URLs
Search results display with podcast images and relevant metadata
With a single click, users can fetch transcripts and generate summaries
The summary tab presents key points in a structured format
The full transcript is available for those who want more detail

Conclusion

Building this Podcast Transcript Summarizer with Replit demonstrated how quickly modern development tools allow us to create powerful applications. By combining Replit's development environment with Spotify's content API and Google's AI capabilities, I was able to build a useful tool that solves a real problem.

The entire project is open source and available on GitHub at SpotifyPodcastSummarizer. Feel free to check it out, contribute, or use it as inspiration for your own projects!

Next Steps

Some features I'm considering for future development:

Support for more podcast platforms beyond Spotify
Customizable summary length and style options
User accounts to save favorite episodes and summaries
Enhanced topic extraction and tagging

Have you built something interesting with Replit? I'd love to hear about your projects in the comments below!

AI-Powered Travel Planner with Streamlit, FastAPI, and Google's Gemini

Priyanka — Thu, 08 May 2025 13:29:51 GMT

Like many developers, I've spent countless hours planning trips — researching destinations, reading reviews, plotting routes, and trying to create the perfect itinerary. After one particularly frustrating planning session involving 30+ browser tabs and several spreadsheets, I thought: "There has to be a better way."

That's when I decided to build my own solution: an AI-powered travel planner that generates personalized itineraries based on user preferences and Google Maps reviews. In this post, I'll walk you through how I built it using Streamlit, FastAPI, and Google's Generative AI.

The Tech Stack

Before diving into the code, here's the technology stack I chose and why:

Streamlit: For the user-friendly frontend that non-technical users can easily navigate
FastAPI: To create robust, high-performance API endpoints
Google Maps API: To access location data and authentic user reviews
Google Gemini AI: To process reviews and generate customized travel plans
Python: As the glue holding everything together

Starting with a Clear Vision

I wanted to create a tool that would:

Collect user preferences through a simple form
Analyze Google Maps reviews to find highly-rated attractions matching those preferences
Generate a day-by-day itinerary with realistic timing
Display the final itinerary on Google Maps for easy navigation
(Eventually) add Airbnb options, flight recommendations, and photo collaboration

Building the Backend with FastAPI

I started with FastAPI to build the core logic. Here's a simplified version of my main API endpoint:

%%writefile streamlit_app/server.py

import os
from dotenv import load_dotenv
from typing import Annotated, Dict, List, Optional, TypedDict
from pydantic import BaseModel, Field
import googlemaps
from langchain.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from google.genai import types
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import time
from pathlib import Path
from google import genai

load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # Replace with your actual Gemini API key
GOOGLE_MAPS_API_KEY = os.getenv("GOOGLE_MAPS_API_KEY")  # Replace with your actual Google Maps API key
GOOGLE_USER_ID = os.getenv("GOOGLE_USER_ID")  # Replace with your actual Google User ID
STREAMLIT_DATA_PATH = Path("streamlit_app/data")
STREAMLIT_DATA_PATH.mkdir(parents=True, exist_ok=True)

class Place(BaseModel):
    name: str = Field(..., description="Name of the place(hotel, park, restaurants)")
    pid: Optional[str] = Field(None, description="Google Maps Place ID of the place")
    rating: Optional[float] = Field(None, description="Google Maps Rating of the place")
    formatted_address: str = Field(..., description="Formatted address of the place")
    price_level: Optional[int] = Field(None, description="Price level of the place (0-4)")
    notes: List[str] = Field(..., description="List of notes for the place")

class Activities(BaseModel):
    time_of_day: str = Field(..., description="Time of the day")
    activities: List[str] = Field(..., description="List of activities with place names as per the time of the day")
    # places: List[Place] = Field(..., description="List of places in activities to visit, explore with details")
    # notes: List[str] = Field(..., description="List of notes for the time of the day")

class Day_Itinerary(BaseModel):
    name: str = Field(..., description="Name of the day")
    activities: List[Activities] = Field(..., description="List of activities as per the time of the day")

class Itinerary(BaseModel):
    day_itinerary: List[Day_Itinerary] = Field(..., description="List of days with itineraries")

# Initialize Google Maps client
# https://github.com/googlemaps/google-maps-services-python
gmaps = googlemaps.Client(key=GOOGLE_MAPS_API_KEY)

# Define tools
def browse_url(url: str) -> str:
    """Fetch the rendered HTML of the webpage at the provided URL.

    A headless Chrome browser loads the URL, and the rendered page
    source is saved under STREAMLIT_DATA_PATH and returned.

    Args:
        url: The full absolute URL to browse.

    Returns:
        The rendered HTML if successfully captured, or the error message.
    """
    driver = None  # defined up front so the finally block can always check it
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        driver = webdriver.Chrome(options=chrome_options)
        driver.get(url)
        # Wait for the page to fully load.
        time.sleep(5)
        # Grab rendered HTML
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        # Try to use the page title for the filename
        raw_title = soup.title.string if soup.title and soup.title.string else "untitled_page"
        clean_title = re.sub(r'\W+', '_', raw_title.strip())[:100]  # Keep it tidy and not too long
        filename = f"{clean_title}.html"
        # Save to file
        with open(STREAMLIT_DATA_PATH / filename, "w", encoding="utf-8") as f:
            f.write(html)
        print(f"📄 HTML saved as {filename}")
        return html

    except Exception as e:
        print(f"An error occurred: {e}")
        return str(e)

    finally:
        # Close the browser if it was started
        if driver:
            driver.quit()

@tool
def search_hotels(location: str) -> List[Dict]:
    """Search for hotels in a given location."""
    try:
        # First get the geocode for the location
        geocode_result = gmaps.geocode(location)
        if not geocode_result:
            return []

        location = geocode_result[0]['geometry']['location']

        # Search for hotels nearby
        places_result = gmaps.places_nearby(
            location=location,
            radius=15000,  # 15 km radius
            type='lodging',
            rank_by='prominence'
        )

        # Format the results
        results = []
        for place in places_result.get('results', [])[:5]:
            details = gmaps.place(place['place_id'], fields=['name', 'rating', 'formatted_address', 'price_level'])
            result = details.get('result', {})
            results.append({
                'name': result.get('name', 'Unknown'),
                'rating': place.get('rating', 0),
                'user_ratings_total': place.get('user_ratings_total', 0),
                'business_status': place.get('business_status', 'Unknown'),
                'formatted_address': result.get('formatted_address', 'Address not available'),
                'price_level': place.get('price_level', 0)
            })
        return results
    except Exception as e:
        return [{"error": str(e)}]

@tool
def search_parks(location: str) -> List[Dict]:
    """Search for parks in a given location."""
    try:
        # First get the geocode for the location
        geocode_result = gmaps.geocode(location)
        if not geocode_result:
            return []

        location = geocode_result[0]['geometry']['location']

        # Search for parks nearby
        places_result = gmaps.places_nearby(
            location=location,
            radius=15000,  # 15 km radius
            type='park',
            rank_by='prominence'
        )

        # Format the results
        results = []
        for place in places_result.get('results', [])[:5]:
            details = gmaps.place(place['place_id'], fields=['name', 'formatted_address'])
            result = details.get('result', {})
            results.append({
                'name': result.get('name', 'Unknown'),
                'formatted_address': result.get('formatted_address', 'Address not available'),
                'rating': place.get('rating', 0),
                'user_ratings_total': place.get('user_ratings_total', 0),
                'business_status': place.get('business_status', 'Unknown')
            })
        return results
    except Exception as e:
        return [{"error": str(e)}]

@tool
def search_restaurants(location: str) -> List[Dict]:
    """Search for restaurants in a given location."""
    try:
        # First get the geocode for the location
        geocode_result = gmaps.geocode(location)
        if not geocode_result:
            return []

        location = geocode_result[0]['geometry']['location']

        # Search for restaurants nearby
        places_result = gmaps.places_nearby(
            location=location,
            radius=15000,  # 15 km radius
            type='restaurant',
            rank_by='prominence'
        )

        # Format the results
        results = []
        for place in places_result.get('results', [])[:5]:
            details = gmaps.place(place['place_id'], fields=['name', 'rating', 'formatted_address', 'price_level'])
            result = details.get('result', {})
            results.append({
                'name': result.get('name', 'Unknown'),
                'rating': result.get('rating', 0),
                'formatted_address': result.get('formatted_address', 'Address not available'),
                'price_level': result.get('price_level', 0),
                'user_ratings_total': place.get('user_ratings_total', 0),
                'business_status': place.get('business_status', 'Unknown')
            })
        return results
    except Exception as e:
        return [{"error": str(e)}]

# Initialize tools
tools = [search_hotels, search_parks, search_restaurants]

MODEL_NAME = "gemini-1.5-flash"
MODEL_CACHE_NAME = "models/gemini-1.5-flash-001"
# Initialize LLM
llm = ChatGoogleGenerativeAI(
    model=MODEL_NAME,
    google_api_key=GOOGLE_API_KEY,
    temperature=0.7
)
structured_llm = llm.with_structured_output(Itinerary)

class TravelPlan(BaseModel):
    options: List[str]  # ["flights", "hotels", "parks", "food"]
    total_cost: float
    preferences: Dict  # Additional preferences
    opted_preferences: Dict  # User-selected preferences
    user_preferences: Dict  # User-provided preferences

# Build Prompt using user preferences
def build_prompt(plan: TravelPlan):
    prefs = plan.opted_preferences
    user_prefs = plan.user_preferences
    header = (
        f"You are a {prefs['vibe']} Gen-Z travel bard. "
        f"Craft a {prefs['pace']} {prefs['days']}-day adventure in {prefs['region']} featuring "
        f"{', '.join(prefs['activities'])}, global cuisines including {', '.join(prefs['cuisines'])}, "
        f"and a total cost of ${plan.total_cost}. "
        # The profile may be empty if it has not been fetched yet, so use .get()
        f"The user likes: {user_prefs.get('likes', 'none provided')}. "
        f"The user dislikes: {user_prefs.get('dislikes', 'none provided')}. "
        f"Use these POIs for each day as inspiration:"
    )
    for dest in prefs['destinations']:
        header += f"\n {dest}"
    header += (
        "\n\nNow write a time-blocked itinerary with morning, afternoon, and evening sections, "
        f"with detailed notes for all of the following: {', '.join(plan.options)}"
    )
    return header

def process_travel_request(plan: TravelPlan) -> Dict:
    """Process a travel request using the agent workflow."""

    # Run the workflow
    system_prompt = """You are a travel researcher tasked with gathering information about travel destinations.
          Your responsibilities:
          1. For each destination, use the appropriate tools to find:
            - Hotels: Search for lodging options with ratings and price levels
            - Parks: Find nearby parks, national parks and recreational areas
            - Restaurants: Locate dining options with ratings and price levels
          2. For each search:
            - Focus on finding the best-rated options (4+ stars)
            - Consider the location and accessibility
            - Note the price levels and amenities
            - Limit results to top 5 options per category using price levels and ratings
          3. After gathering information:
            - Organize findings by time of day and destination
            - Include ratings, addresses, and price levels
            - Highlight any notable features or reviews
            - Include Time of Day, rating and price_level for each option
            - Format the information clearly as a JSON object containing a list of itineraries for all days."""

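    prompt = build_prompt(plan)

    # Pass the system prompt as a message: the `config` argument of invoke() is a
    # RunnableConfig (callbacks, tags, ...) and does not carry instructions or tools.
    # Note: the tools above are not bound here; using them would need
    # llm.bind_tools(tools) plus a loop that executes the returned tool calls.
    result = llm.invoke([("system", system_prompt), ("human", prompt)])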

    struct_result = structured_llm.invoke(f"Arrange below itinerary into JSON format. Given itinerary: {result.content}")
    # struct_result = result.content
    # struct_result = structured_llm.invoke(prompt, config=generation_config)

    return struct_result

# process_travel_request(destinations=["seattle", "washington", "oregon", "seattle"], options=["hotels", "food"])

class UserProfile(BaseModel):
    vibe: str
    activities: List[str]
    cuisines: List[str]
    pace: str

class UserPreferences(BaseModel):
    likes: UserProfile
    dislikes: UserProfile

client = genai.Client(api_key=GOOGLE_API_KEY)
llm_cached = llm
structured_user_llm = llm.with_structured_output(UserPreferences)

def get_gmaps_profile():
    """Get the user's Google Maps reviews and extract a preference profile."""

    # Get file
    url = f'https://www.google.com/maps/contrib/{GOOGLE_USER_ID}/reviews/'
    cached_page = STREAMLIT_DATA_PATH / "Google_Maps.html"
    if not cached_page.exists():
        html = browse_url(url)  # saved in data path
        # cache_content("streamlit_app/data/Google_Maps.html", "html_cache") # cache data
    else:
        with open(cached_page, 'r', encoding="utf-8") as f:
            html = f.read().strip()
    try:
        # First pass: free-text analysis of the raw HTML
        response = llm_cached.invoke(f"Please Analyse the below html and identify the vibe, pace, activity types, cuisines, places, etc this user likes and dislikes as a structured list. Provided HTML: {html}. No explanations and proceed with given data.") # analyze html
        # Second pass: coerce that analysis into the UserPreferences schema
        response = structured_user_llm.invoke(f"Please Analyse the below text and identify the vibe, pace, activity types, cuisines, places, etc this user likes and dislikes as a structured list. Provided Text: {response.content}. No explanations and proceed with given data.") # analyze user profile
        # print(response)
        return response
    except Exception as e:
        print(f"An error occurred: {e}")
        return str(e)

# %%writefile streamlit_app/src/app.py
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import List

app = FastAPI()

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.post("/api/travel-plan")
async def create_travel_plan(plan: TravelPlan):
    try:
        # Validate destinations
        if not plan.opted_preferences.get("destinations"):  # Access using dictionary key
            raise HTTPException(status_code=400, detail="At least one destination is required")

        print(f"Received travel plan: {plan.dict()}")
        # Process the travel request using the agent workflow
        itinerary = process_travel_request(plan)

        return itinerary

    except HTTPException:
        # Re-raise deliberate HTTP errors (such as the 400 above) unchanged
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/user-profile")
async def get_user_profile():
    try:
        # Build the user profile from the user's Google Maps reviews
        user_profile = get_gmaps_profile()

        return user_profile

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Creating a User-Friendly Frontend with Streamlit

With the API working, I built a simple but effective frontend using Streamlit:

%%writefile streamlit_app/1_travel_planner.py
import streamlit as st
import requests
import json
import os
from typing import List
import time
import urllib.parse

GOOGLE_MAPS_API_KEY = os.environ.get('GOOGLE_MAPS_API_KEY')

# Page configuration
st.set_page_config(
    page_title="Travel Planner",
    page_icon="✈️",
    layout="wide"
)

# Custom CSS
st.markdown("""
    <style>
    .main {
        padding: 2rem;
    }
    .stButton>button {
        width: 100%;
    }
    .sidebar .sidebar-content {
        background-color: #f0f2f6;
    }
    .stProgress > div > div > div > div {
        background-color: #1f77b4;
    }
    .place-card {
        background-color: #f8f9fa;
        padding: 1rem;
        margin: 0.5rem 0;
        border-radius: 0.5rem;
        box-shadow: 0 2px 4px rgba(0,0,0,0.1);
    }
    </style>
    """, unsafe_allow_html=True)

# Title
st.title("✈️ Travel Planner")

# Sidebar for options
with st.sidebar:
    st.header("Travel Options")
    options = st.multiselect(
        "Select what you want to include in your trip:",
        ["Hotels", "Parks", "Food", "Flights", "Car Rentals", "Confirmed"],
        default=["Hotels", "Food"]
    )
    total_cost = st.number_input("Total Cost", min_value=1000, max_value=10000, value=5000)
    duration = st.number_input("Duration (in days)", min_value=1, max_value=30, value=12)

    # Additional preferences
    st.header("Preferences")
    min_rating = st.slider("Minimum Rating", 1.0, 5.0, 4.0, 0.5)
    budget = st.selectbox(
        "Budget Level",
        ["Budget", "Mid-range", "Luxury"],
        index=1
    )

    # Toggle for mock data
    use_mock_data = st.checkbox("Use Mock Data", value=False)

# Main content
col1, col2 = st.columns([2, 1])
# Initialize session state
if "selected_places" not in st.session_state:
    st.session_state.selected_places = []
if "user_profile" not in st.session_state:
    st.session_state.user_profile = {}  # a dict, matching the API's user_preferences field
if "itinerary" not in st.session_state:
    st.session_state.itinerary = None

with col1:
    # Destination input
    st.subheader("Enter Your Travel Preferences")

    with st.form("itinerary_form"):
      region = st.text_input("Which region or trip theme?", value="Pacific Northwest")
      destinations = st.text_input("Destinations (comma-separated)", value="Portland, Seattle, Washington, Oregon")
      # days = st.slider("Trip Length (days)", min_value=1, max_value=21, value=5)
      vibe = st.text_input("Trip Vibe", value="relaxing and adventurous")
      activities = st.text_input("Must-have activities (comma-separated)", value="hiking, stargazing, live music")
      cuisines = st.text_input("Preferred cuisines (comma-separated)", value="Indian, Mexican, Japanese")
      pace = st.radio("Trip pace", options=["relaxed", "balanced", "packed"], index=1)
      submitted = st.form_submit_button("✨ Generate Itinerary")

    if submitted:
        prefs = {
            'region': region,
            'destinations': [d.strip() for d in destinations.split(',') if d.strip()],
            'days': duration,
            'vibe': vibe,
            'activities': [a.strip() for a in activities.split(',') if a.strip()],
            'cuisines': [c.strip() for c in cuisines.split(',') if c.strip()],
            'pace': pace
        }
        if destinations or use_mock_data:
            if use_mock_data:
                st.session_state.itinerary = MOCK_ITINERARY
                st.success("Using mock data for demonstration")
            else:
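                # The form asks for comma-separated destinations, so split on commas
                destinations = [d.strip() for d in destinations.split(",") if d.strip()]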

                # Prepare the request data
                data = {
                    "options": [option.lower() for option in options],
                    "total_cost": total_cost,
                    "preferences": {
                        "min_rating": min_rating,
                        "budget": budget.lower()
                    },
                    "opted_preferences": prefs,
                    "user_preferences": st.session_state.user_profile
                }

                try:
                    # Show progress
                    progress_bar = st.progress(0)
                    status_text = st.empty()

                    # Make API request
                    status_text.text("Planning your trip...")
                    response = requests.post(
                        "http://localhost:8000/api/travel-plan",
                        json=data
                    )

                    if response.status_code == 200:
                        itinerary = response.json()
                        st.session_state.itinerary = itinerary
                        # st.success("Itinerary generated successfully!")
                    else:
                        st.error(f"Error: {response.text}")

                except Exception as e:
                    st.error(f"An error occurred: {str(e)}")
        else:
            st.warning("Please enter at least one destination")
        
    if "itinerary" in st.session_state and st.session_state.itinerary:
      st.subheader("Your Travel Itinerary")
      for day in st.session_state.itinerary['day_itinerary']:
          with st.expander(f"📍 {day['name']}", expanded=True):
              for place in day['activities']:
                  st.subheader(place['time_of_day'])
                  for activity in place['activities']:
                      checkbox_key = f"{day['name']}_{place['time_of_day']}_{activity}"
                      is_selected = checkbox_key in st.session_state.selected_places
                      if st.checkbox(f"✅ {activity}", key=checkbox_key, value=is_selected):
                          if checkbox_key not in st.session_state.selected_places:
                              st.session_state.selected_places.append(checkbox_key)
                      else:
                          if checkbox_key in st.session_state.selected_places:
                              st.session_state.selected_places.remove(checkbox_key)
    else:
      st.warning("Please enter at least one destination to create your itinerary.")


with col2:
    # User Preferences 
    st.subheader("User Travel Preferences")
    if not st.session_state.user_profile:
      try:
          # Show progress
          progress_bar = st.progress(0)
          status_text = st.empty()

          # Make API request
          status_text.text("Personalizing your trip...")
          response = requests.get("http://localhost:8000/api/user-profile")

          if response.status_code == 200:
              st.session_state.user_profile = response.json()
          else:
              st.error(f"Error: {response.text}")

      except Exception as e:
          st.error(f"An error occurred: {str(e)}")

    st.markdown(st.session_state.user_profile)

    st.subheader("Map View")
    # Placeholder for Google Maps integration
    st.info("Map view will be displayed here once destinations are selected")
    # Button to generate a mock map-based itinerary
    if st.button("🗺️ Generate GMap Itinerary with Timestamps"):
      if len(st.session_state.selected_places) >= 2:
            st.success("GMap-based itinerary generation initialized!")
            st.markdown("**Selected Stops:**")
            for place in st.session_state.selected_places:
                st.markdown(f"- {place}")
            # Generate GMap link
            base_url = "https://www.google.com/maps/dir/"
            embed_url = f"https://www.google.com/maps/embed/v1/directions?key={GOOGLE_MAPS_API_KEY}"
            embed_url += f"&origin={urllib.parse.quote_plus(st.session_state.selected_places[0])}&destination={urllib.parse.quote_plus(st.session_state.selected_places[-1])}"  # origin and destination
            # Add waypoints only if there are more than 2 places
            if len(st.session_state.selected_places) > 2:
                waypoints = "|".join([urllib.parse.quote_plus(place) for place in st.session_state.selected_places[1:-1]])
                embed_url += f"&waypoints={waypoints}"
            encoded_stops = [urllib.parse.quote_plus(place) for place in st.session_state.selected_places]
            gmap_url = base_url + "/".join(encoded_stops)
            st.success("Here's your custom route:")
            st.markdown(f"[Click here to open in Google Maps 🚗]({gmap_url})", unsafe_allow_html=True)
            st.components.v1.iframe(embed_url, height=600)
      else:
        st.warning("Fewer than 2 places selected. Tap at least 2 checkboxes to build your journey!")

    # Additional information
    st.subheader("Trip Summary")
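    if st.session_state.itinerary:
        # `destinations` may be the raw comma-separated form string or an already-split list
        stops = destinations.split(",") if isinstance(destinations, str) else destinations
        st.write(f"Total Destinations: {len([d for d in stops if d.strip()])}")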
        st.write(f"Selected Options: {', '.join(options)}")
        st.write(f"Budget Level: {budget}")
        st.write(f"Minimum Rating: {min_rating}")</code></code></pre><h2>Integrating Google's Gemini AI</h2><p>The magic happens when we connect to Google's Gemini API. This generative AI model takes the attractions and reviews we've collected and transforms them into a cohesive travel plan that makes sense logically and geographically.</p><p>What makes this approach powerful is that the AI doesn't just randomly suggest popular spots — it analyzes actual reviews to understand what people enjoy about each place, what to expect, and how much time to allocate.</p><h2>Challenges and Learnings</h2><p>Building this tool wasn't without challenges:</p><ol><li><p><strong>Rate Limits</strong>: Google Maps API has rate limits that required implementing caching strategies</p></li><li><p><strong>Context Handling</strong>: Gemini sometimes needed additional context to create geographically sensible routes</p></li><li><p><strong>Review Quality</strong>: Not all Google reviews are helpful, so I had to implement filtering</p></li><li><p><strong>Time Estimation</strong>: Creating realistic timing was tricky and required fine-tuning</p></li></ol><h2>Deployment in Google Colab</h2><p>For initial testing and to make it accessible to friends without having to set up proper hosting, I deployed the entire application in Google Colab. 
This allowed me to quickly iterate on the design and get feedback.</p><p>The Colab integration required some additional code to handle authentication and session management, but it made sharing and testing much easier during development.</p><h2>Next Steps</h2><p>While the current version is already useful, I have several enhancements planned:</p><ol><li><p><strong>Airbnb Integration</strong>: Pull in accommodation options near the planned activities</p></li><li><p><strong>Flight Options</strong>: Compare flight prices and schedules</p></li><li><p><strong>Photo Collaboration</strong>: Allow travelers to share and collect photos in one place</p></li><li><p><strong>Machine Learning Improvements</strong>: Train models on successful itineraries to improve recommendations</p></li></ol><h2>Lessons for AI Application Developers</h2><p>If you're building your own AI-powered applications, here are a few takeaways from this project:</p><ol><li><p><strong>Start with a Clear User Problem</strong>: My frustration with travel planning gave me a clear focus</p></li><li><p><strong>Combine Existing APIs Creatively</strong>: The power is in connecting Google Maps data with generative AI</p></li><li><p><strong>Simple UI Matters</strong>: Streamlit allowed me to build a user-friendly interface quickly</p></li><li><p><strong>Prompt Engineering is Key</strong>: The quality of AI output depends heavily on how you structure your prompts</p></li><li><p><strong>Test with Real Data</strong>: Real-world testing revealed issues that weren't obvious during development</p></li></ol><h2>Try It Yourself</h2><p>The code for this project is available on my GitHub repository, and you can try a demo version on Google Colab: [<a href="https://github.com/Prikat25/Kaggle/blob/main/Project_Travel_Planner.ipynb">Github Link</a>]</p><p>If you have any questions about the implementation or ideas for improvements, feel free to leave a comment below!</p><div><hr></div><p><em>Do you have an idea for an AI 
application that solves a real problem? What technologies would you combine to make it happen? Share your thoughts in the comments.</em></p>
</article>
<article>
<h1>Beyond the Surface: Deep Document Dives with RAG & Vector Embeddings</h1>
<p>Priyanka — Sun, 20 Apr 2025 13:59:19 GMT</p>
<p>Okay, our AI Content Navigator could intelligently summarize webpages and even let us chat about those summaries. But we knew the real gems – the deep insights, the specific details – were often locked away inside linked documents like PDFs and lengthy video transcripts. How could we unlock <em>that</em> knowledge without forcing the user (or the AI!) to read everything cover-to-cover?</p><blockquote><p><strong>The Challenge:</strong> How do you enable precise Q&A over vast amounts of text when Large Language Models (LLMs) have finite attention spans (context windows)?</p></blockquote><p>Standard LLM prompting often involves stuffing as much text as possible into the input, hoping the answer is in there somewhere. This is inefficient and often fails for long documents. We needed a smarter approach.</p><p><strong>Our GenAI Solution: Retrieval Augmented Generation (RAG)</strong></p><p>We implemented <strong>Retrieval Augmented Generation (RAG)</strong>, a technique that's revolutionizing how AI interacts with large datasets. Think of it like an open-book exam for the AI: instead of relying only on what it memorized during training, it first looks up the relevant information before answering.</p><p>Here’s our RAG pipeline:</p><ol><li><p><strong>Knowledge Prep (Ingestion & Chunking):</strong> We first processed our source documents (PDFs via <code>PyPDF2</code>, transcripts via <code>youtube_transcript_api</code>). Critically, we broke these large documents into smaller, overlapping text <em>chunks</em>. This makes finding specific information much easier.</p></li><li><p><strong>Creating Semantic Fingerprints (Embeddings):</strong> This is the core of RAG's "retrieval" magic. We used a specialized Gemini embedding model (<code>models/text-embedding-004</code>) via a custom <code>GeminiEmbeddingFunction</code>. This model reads each text chunk and converts its meaning into a dense numerical vector – an <strong>Embedding</strong>. 
Think of it as a unique fingerprint capturing the chunk's semantic essence. Similar concepts get similar fingerprints.</p></li><li><p><strong>Building the Semantic Library (Vector Databases):</strong> These embedding vectors, along with their corresponding text chunks, need a home where they can be searched quickly based on similarity. We used <strong>Vector Databases</strong> for this:</p><ul><li><p><strong>ChromaDB:</strong> Great for persistent storage – the embeddings stick around.</p></li><li><p>FAISS: Blazingly fast for in-memory searches.</p><p>These databases excel at Vector Search: finding vectors closest to a given query vector.</p></li></ul></li><li><p><strong>Finding the Clues (Retrieval):</strong> When a user asks a question (e.g., "Explain the RLHF pipeline"), we <em>first</em> convert that question into an embedding vector using the <em>same</em> Gemini model. Then, we query the vector database: "Find the text chunks whose vectors are most similar to this question vector." The database instantly returns the most relevant chunks from our documents.</p></li><li><p><strong>Generating the Informed Answer (Augmentation & Generation):</strong> Now, instead of just sending the question to the LLM, we send the question <em>plus</em> the relevant text chunks we just retrieved. This "augmented" prompt gives the Gemini chat model (<code>ChatGoogleGenerativeAI</code>, managed via LangChain's <code>RetrievalQAWithSourcesChain</code>) the specific context it needs. 
The LLM then generates an answer that is directly <em>grounded</em> in the information found within the source documents, often even citing which chunks it used.</p></li></ol>
(Figure 1: The RAG process – finding relevant information before asking the LLM to answer.)
(Figure 2: Imagine text chunks as points in space; embeddings place similar meanings close together, making retrieval efficient.)
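To make the retrieval step concrete, here is a minimal sketch, assuming hand-made three-dimensional vectors in place of real Gemini embeddings. The names `cosine_similarity` and `top_k_chunks` and the toy index are illustrative and not part of the project; a vector database performs the same kind of ranking with much more efficient indexing.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are closest to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hand-made 3-d vectors standing in for real embeddings (assumption)
toy_index = [
    ("RLHF trains a reward model from human rankings.", [0.9, 0.1, 0.0]),
    ("FAISS performs in-memory similarity search.", [0.1, 0.9, 0.1]),
    ("PPO fine-tunes the policy against the reward model.", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend embedding of "Explain the RLHF pipeline"

top_two = top_k_chunks(toy_query, toy_index, k=2)
print(top_two)
```

The two RLHF-related chunks rank first because their vectors point in nearly the same direction as the query vector, which is the property the embedding model is meant to provide for real text.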
Code Concepts
Custom Gemini Embedding Function (Conceptual):
# Simplified concept from PDF page 36
from chromadb import Documents, EmbeddingFunction, Embeddings
import google.generativeai as genai_api # Alias to avoid conflicts
# ... other imports, retry logic ...

class GeminiEmbeddingFunction(EmbeddingFunction):
    """Uses Gemini API to generate embeddings for text."""
    def __init__(self, model_name="models/text-embedding-004", task_type="retrieval_document"):
        self.model_name = model_name
        self.task_type = task_type # Use 'retrieval_query' for queries

    # Decorate with retry logic for API stability
    # @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        """Generates embeddings for a list of text documents."""
        if not isinstance(input, list): input = [input] # Ensure list input
        try:
            response = genai_api.embed_content(
                model=self.model_name,
                content=input,
                task_type=self.task_type,
            )
            # google.generativeai returns {"embedding": [...]} with one vector per input text
            return response["embedding"]
        except Exception as e:
            print(f"Embedding Error: {e}")
            return [[] for _ in input] # One independent empty list per input on failure
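
The commented-out retry decorator above hints at wrapping the API call so that transient failures are retried. As a rough illustration of that idea using only the standard library, and not the retry helper the project itself uses, a backoff decorator could look like the sketch below. `retry_on` and `flaky_embed` are hypothetical names, and the flaky function merely simulates an embedding request.

```python
import functools
import time

def retry_on(exceptions, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry the wrapped call on the given exceptions with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the error
                    sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
        return wrapper
    return decorator

# Hypothetical flaky call standing in for an embedding request
calls = {"count": 0}

@retry_on(ConnectionError, attempts=3, base_delay=0.0)
def flaky_embed(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return [0.1, 0.2, 0.3]

vector = flaky_embed("hello")
print(vector, calls["count"])
```

Retrying only on specific exception types matters here: a malformed request would fail identically every time, so retrying it only adds latency.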

Chunking and Adding to Vector DB (Conceptual):
# Simplified concept from PDF page 37
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
# ... other imports (PdfReader etc.)

# Initialize ChromaDB client and embedding function for documents
db_client = chromadb.PersistentClient(path=DATABASE_BASE_PATH) # Assume path defined
embed_fn_doc = GeminiEmbeddingFunction(task_type="retrieval_document")
collection = db_client.get_or_create_collection(name="pdf_content", embedding_function=embed_fn_doc)

# Initialize a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Process documents (e.g., PDFs)
# ... (loop through files in PDF_BASE_PATH) ...
#     full_text = extract_text_from_pdf(file_path)
#     if full_text:
#         chunks = text_splitter.split_text(full_text)
#         # Prepare lists for batch addition
#         chunk_texts = []
#         chunk_metadatas = []
#         chunk_ids = []
#         for i, chunk in enumerate(chunks):
#             chunk_id = f"{filename}_chunk_{i}"
#             chunk_texts.append(chunk)
#             chunk_metadatas.append({"source": filename, "chunk_num": i})
#             chunk_ids.append(chunk_id)
#
#         # Add batch to ChromaDB
#         if chunk_texts:
#             collection.add(documents=chunk_texts, metadatas=chunk_metadatas, ids=chunk_ids)
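
For intuition about what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-window chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `chunk_text` is a hypothetical helper that only shows how neighbouring chunks share text.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows whose neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.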


Querying with LangChain RAG Chain (Conceptual):
# Simplified concept from PDF page 40-41
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize LLM for generation
llm = ChatGoogleGenerativeAI(model=MODEL_NAME, temperature=0.3)

# Embedding function for the query
query_embed_fn = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004", task_type="retrieval_query")

# Connect to the vector store
vectorstore = Chroma(
    client=db_client,
    collection_name="pdf_content",
    embedding_function=query_embed_fn # Used for querying
)

# Create the RAG chain
rag_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff", # Simple method to combine context and query
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}), # Get top 5 relevant chunks
    return_source_documents=True, # Include sources in the output
)

# Run a query
user_query = "What are the main challenges in implementing RAG systems?"
result = rag_chain.invoke({"question": user_query})

print(f"Answer: {result['answer']}")
print("\nSources:")
for doc in result.get("source_documents", []):
    print(f" - {doc.metadata.get('source', 'Unknown')} (Chunk {doc.metadata.get('chunk_num', '?')})")
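
The "stuff" strategy simply places every retrieved chunk into a single prompt ahead of the question. The sketch below shows that assembly step on its own, assuming chunks arrive as dictionaries with `source`, `chunk_num`, and `text` keys. The wording of the prompt and the helper name `build_stuffed_prompt` are illustrative and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks, labelled by source, ahead of the question."""
    context_blocks = [
        f"[{i}] (source: {chunk['source']}, chunk {chunk['chunk_num']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed numbers of the passages you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up retrieved chunks for illustration
demo_chunks = [
    {"source": "rag.pdf", "chunk_num": 0, "text": "Retrieval quality bounds answer quality."},
    {"source": "rag.pdf", "chunk_num": 4, "text": "Chunk size affects recall and precision."},
]
prompt = build_stuffed_prompt("What limits RAG answer quality?", demo_chunks)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep `k` small.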

Grounding AI in Reality
RAG was transformative for the Navigator. It allowed users to ask detailed questions and receive answers directly supported by the source material, drastically reducing the chance of the AI making things up (hallucinating). The quality of an answer now depended more on the relevance of the retrieved chunks than on the LLM's internal knowledge alone.
The Emerging Complexity: We now had several powerful AI modules: summarization, database chat, RAG Q&A. How could we make them collaborate effectively? The next post introduces the "AI Conductor" – the LangGraph Supervisor agent designed to manage this growing complexity.



Giving Data a Voice: Interactive Chat with FastAPI & LangGraph Agents
Priyanka — Sun, 20 Apr 2025 13:56:36 GMT
In Part 1, we tackled the first challenge: getting our AI Content Navigator to read and distill webpages into neat, structured summaries (JSON/CSV). Useful? Absolutely. Engaging? Not quite. Information becomes truly powerful when you can interact with it, ask questions, and explore it conversationally.
The Goal: Allow users to query the summarized website data using natural language, without needing to know SQL or complex commands.
How do you build a bridge between a human asking "What videos were mentioned?" and a database holding the answer?
Our GenAI Approach: The AI Database Agent
We engineered an interactive chat layer powered by an AI agent, a database, and a modern web API:
Storing the Summary: The structured CSV data (containing extracted links, types, summaries) was loaded into a simple, efficient SQL database (SQLite). This created a queryable knowledge base specific to the summarized webpage.
Meet the Agent: The star of this show is an AI Agent. Using LangGraph's create_react_agent and the intelligence of a Gemini model, we built a specialized agent – think of it as an AI assistant trained specifically to be a database query expert.
Tools of the Trade: We equipped this agent with the necessary tools (Function Calling) to interact with our SQL database:
list_tables: "What data do we have?"
describe_table: "What's in this specific table?"
execute_query: "Run this specific SQL command (safely, only SELECTs!)."
insert_query: Used initially to load the data.
The Magic of Translation: This is where the agent shines.
A user asks in plain English: "Show me the PDF summaries."
The agent analyzes the request, understanding the intent is to find PDF links and their descriptions.
It determines the execute_query tool is needed and formulates the appropriate SQL: SELECT links_summary FROM genai_csv_summary_table WHERE link_key = 'pdf'.
It calls the tool with the SQL query.
The tool runs the query against the database and returns the results.
The agent receives the raw data and translates it back into a friendly, conversational response for the user.
Connecting to the User (FastAPI): To make this agent accessible to our web application (built with Streamlit), we needed an API. We chose FastAPI, a high-performance Python web framework ideal for building APIs quickly and efficiently. We created an API endpoint (e.g., /generate) that listens for user chat messages. The Streamlit frontend sends messages to this endpoint, FastAPI routes them to the LangGraph agent, and then streams the agent's responses back to the UI in real-time.
(Figure 1: The flow of conversation, mediated by the FastAPI backend and the LangGraph agent.)
Code Concepts
Database Tools (Conceptual):
# Simplified concept from tools.py
import sqlite3
import pandas as pd

# Assume db_file points to the SQLite database path

def list_tables() -> list[str]:
    """Retrieves table names from the database."""
    # ... (connect, execute 'SELECT name FROM sqlite_master...', handle errors)
    pass

def execute_query(sql: str) -> list[list[str]]:
    """Executes a safe SQL SELECT query."""
    if not sql.strip().upper().startswith("SELECT"):
         return [["Error: Only SELECT queries are permitted."]]
    # ... (connect, execute query, fetch results, handle errors)
    pass

def insert_query(table_name: str = 'genai_csv_summary_table') -> str:
    """Loads data from the summary CSV into the database table."""
    # ... (find CSV, read with pandas, connect, df.to_sql, handle errors)
    pass
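
The stubs above leave the database work elided. As one possible way to fill them in, here is a self-contained sketch against an in-memory SQLite database. The table and column names follow the ones quoted in this post, while the sample rows and the `conn` object are made up for illustration and are not the project's data.

```python
import sqlite3

# In-memory database with made-up rows, standing in for the real SQLite file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genai_csv_summary_table "
    "(link_id INTEGER, link_value TEXT, link_key TEXT, links_summary TEXT)"
)
conn.executemany(
    "INSERT INTO genai_csv_summary_table VALUES (?, ?, ?, ?)",
    [
        (1, "https://example.com/paper.pdf", "pdf", "Overview of RAG"),
        (2, "https://example.com/talk", "youtube", "Conference talk"),
    ],
)

def list_tables():
    """Return the names of all tables in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return [name for (name,) in rows]

def describe_table(table_name):
    """Return (column name, declared type) pairs for a known table."""
    if table_name not in list_tables():
        return []  # refuse unknown names instead of interpolating them into SQL
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return [(row[1], row[2]) for row in rows]

def execute_query(sql):
    """Run a SELECT statement and return the rows as lists."""
    if not sql.strip().upper().startswith("SELECT"):
        return [["Error: Only SELECT queries are permitted."]]
    return [list(row) for row in conn.execute(sql).fetchall()]

print(list_tables())
print(execute_query("SELECT links_summary FROM genai_csv_summary_table WHERE link_key = 'pdf'"))
```

The `SELECT` prefix check is a convenience rather than a complete safeguard. Opening the database read-only (for example with a `file:...?mode=ro` URI and `uri=True` in `sqlite3.connect`) is a stronger guard when the agent should never write.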

Creating the Agent (Conceptual):
# Simplified concept from graph_fastapi.py
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

llm = ChatGoogleGenerativeAI(model=MODEL_NAME, temperature=0.5)
# Define the tools the agent can use (excluding dangerous ones like drop_table)
db_tools = [list_tables, describe_table, execute_query]
llm_with_tools = llm.bind_tools(tools=db_tools)

# Craft a clear prompt telling the agent its role and how to use tools
agent_prompt = """
You are an AI assistant querying a database with website link summaries.
Table: genai_csv_summary_table (columns: link_id, link_value, link_key, links_summary).
Use tools list_tables, describe_table, execute_query to answer user questions accurately.
Formulate SQL SELECT queries based on user intent. Explain results clearly.
If data is unavailable, state that clearly. Do not invent answers.
"""

# Create the agent using LangGraph's helpers
agent_executor = create_react_agent(llm_with_tools, tools=db_tools, prompt=agent_prompt)

# Integrate agent_executor into a LangGraph workflow ('graph')
# result = graph.invoke({"messages": [("user", "List PDF links")]}, config=...)

FastAPI Endpoint (Conceptual):
# Simplified concept from server.py
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
import json
import asyncio
import time
# Assume 'graph' is the compiled LangGraph agent graph

app = FastAPI()
# ... (CORS setup) ...

def sse_format(data: dict) -> str:
    """Formats data for Server-Sent Events."""
    return f"data: {json.dumps(data)}\n\n"

@app.post("/generate")
async def generate_content(request: Request):
    """Handles user queries and streams agent responses via SSE."""
    # ... (Error handling for request body) ...
    data = await request.json()
    user_query = data.get("db_input")
    # ... (Error handling for missing input) ...

    thread_id = f"chat_{int(time.time())}" # Simple session ID
    config = {"configurable": {"thread_id": thread_id}}

    async def stream_generator():
        """Streams agent output asynchronously."""
        async for output in graph.astream({"messages": [("user", user_query)]}, config=config):
            # Extract relevant message content based on your graph structure
            agent_messages = output.get("interact_with_db", {}).get("messages", [])
            if agent_messages and hasattr(agent_messages[-1], 'content'):
                yield sse_format({"content": agent_messages[-1].content})
                await asyncio.sleep(0.02) # Throttle streaming slightly
        # Signal end of stream (optional)
        # yield sse_format({"event": "end"})

    return StreamingResponse(stream_generator(), media_type="text/event-stream")

# Run with: uvicorn server:app --reload --port 8000
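
On the receiving side, the frontend has to turn the `data: ...` lines produced by `sse_format` back into dictionaries. A minimal sketch of that parsing step follows, using a simulated stream so it runs without a server. `parse_sse_lines` is a hypothetical helper; with `requests`, the lines could come from a streamed response via `iter_lines()`.

```python
import json

def parse_sse_lines(lines):
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        yield json.loads(line[len("data:"):].strip())

# Simulated stream; a real client would iterate over the HTTP response lines
raw_stream = 'data: {"content": "Found 2 PDF links."}\n\ndata: {"content": "Done."}\n\n'
messages = [event["content"] for event in parse_sse_lines(raw_stream.splitlines())]
print(messages)
```

Since each event is a complete JSON object on a single line, the client can render every message as soon as it arrives instead of waiting for the full response.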

Giving Data Its Voice
It worked beautifully! The agent acted as a seamless translator between human language and SQL. Users could ask questions naturally, and the combination of LangGraph, Gemini, and FastAPI delivered answers conversationally. It felt less like querying and more like collaborating with an AI that understood the data's context.
But what about the real depth? The summaries are helpful, but the richest information often lies within the linked PDFs and video transcripts themselves. How could we query those directly? Get ready for Part 3, where we explore the fascinating technique of Retrieval Augmented Generation (RAG).


Drowning in Data? How We Built an AI Navigator to Find the Signal
Priyanka — Sun, 20 Apr 2025 13:53:33 GMT
Let's face it: we're swimming in an ocean of digital information. Websites, articles, research papers, videos – it's a constant flood. Finding the crucial insights, the actual signal in all that noise, often feels less like research and more like a frantic search for a needle in an infinitely expanding haystack.
The Core Problem: Information overload hinders our ability to learn, analyze, and make decisions effectively.
This feeling of being overwhelmed was the spark for our capstone project. We asked: Could we build an intelligent assistant, an AI-Powered Content Navigator, to not just fetch information, but to understand, summarize, and interact with it?
This series chronicles our adventure in building exactly that, harnessing the incredible power of Generative AI (GenAI), particularly Google's versatile Gemini models, orchestrated with the workflow magic of LangGraph.
(Figure 1: The daily struggle – finding clarity amidst the digital chaos.)
First Things First: What is GenAI and Gemini?
Before we dive deeper, let's clarify our key tools:
Generative AI (GenAI): This isn't your typical analytical AI. GenAI models are creators. Trained on vast datasets, they learn patterns to generate entirely new content – think writing articles, composing music, designing images, or even writing code, like Gemini is doing right now!
Gemini: Developed by Google DeepMind, Gemini represents a leap forward in AI. It's a family of multimodal models. "Multimodal" is key – it means Gemini isn't limited to just text; it can seamlessly understand, combine, and generate information across different formats, including text, code, images, audio, and video. This versatility was crucial for our Navigator.
Challenge #1: Making Sense of the Web
Our first task was fundamental: how could the Navigator read and understand the content of any given webpage? Modern websites are complex beasts, often loading content dynamically long after the initial page load.
The Hurdle: How do you reliably capture the complete picture of a dynamic webpage and extract its essence into a structured, usable format?
Our GenAI Approach: The Smart Scraper
We combined automated browsing with Gemini's analytical prowess:
Simulating the User: We created a tool (browse_url) using Selenium, a browser automation library. This tool acts like a patient user, visiting a URL (checking robots.txt first!), waiting for the page to fully render (including dynamic elements), and then capturing the final HTML code.
Delegating Analysis: We used LangGraph to define a workflow step (generate_summary_webpage). Here, we instructed a Gemini model that it had a new skill: using the browse_url tool (Function Calling).
Extracting Structured Gold: The crucial instruction to Gemini was: Use the tool to get the HTML, analyze its content (Document Understanding), and give me back a structured JSON summary (Structured Output/JSON Mode). Requesting JSON ensures the output is predictable and easy for other parts of our system to use.
Mapping the Links: A subsequent step (generate_summary_links) fed this JSON summary back to Gemini, asking it to identify all hyperlinks (PDFs, YouTube, etc.) and organize them into a neat CSV format.
(Figure 2: The clean, organized data structure Gemini provides.)
Code Sneak Peek
Here’s a glimpse of the concepts in action:
Fetching the Webpage HTML (Simplified):
# Simplified concept from tools.py
from selenium import webdriver
import time
# ... other imports

def browse_url(url: str) -> str:
    """Uses Selenium to capture rendered HTML from a URL."""
    # ... (respect robots.txt, setup webdriver options) ...
    driver = None  # keeps the finally block safe if Chrome fails to start
    try:
        driver = webdriver.Chrome(options=chrome_options)
        driver.get(url)
        time.sleep(5) # Allow time for dynamic content loading
        html_content = driver.page_source
        return html_content
    # ... (error handling) ...
    finally:
        if driver: driver.quit()
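
The robots.txt check mentioned above can be done with Python's standard `urllib.robotparser` module. The sketch below parses an inline robots.txt so it runs offline; `is_allowed` and the sample rules are illustrative and not the project's actual helper. For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file instead.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration
sample_robots = """
User-agent: *
Disallow: /private/
"""

blog_ok = is_allowed(sample_robots, "https://example.com/blog/post")
private_ok = is_allowed(sample_robots, "https://example.com/private/notes")
print(blog_ok, private_ok)
```

Running this check before launching the browser avoids paying the cost of a Selenium session for pages the site owner has asked crawlers to skip.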

Instructing Gemini for JSON Summary (Conceptual):
# Simplified concept from 1_download_website_summary_home.py

# System prompt tells the LLM about its tools and desired output format
sys_instruction = """
You are an expert web content analyzer. Use the `browse_url` tool to fetch webpage HTML.
Analyze the HTML structure and content thoroughly.
Return a comprehensive summary structured as a valid JSON object.
The JSON should capture key sections, headings, and main points.
"""

# Configuration for the Gemini API call
generation_config = {
    'temperature': 0.5, # Adjust for desired creativity/factuality
    'system_instruction': sys_instruction,
    'tools': [browse_url], # Make the tool available
    'response_mime_type': 'application/json' # Explicitly request JSON
}

# Making the call within a LangGraph node
# response = client.models.generate_content(...)
# json_summary = response.text

Initial Success & What We Learned
This approach worked! We could consistently pull structured summaries from diverse websites. The JSON output became a reliable input for the next stages. However, the quality wasn't always perfect. It depended heavily on how well-structured the source webpage was. Messy HTML could sometimes confuse the analysis.
Coming Up Next: Static summaries are just the beginning. How did we enable users to talk to this data? In Part 2, we'll dive into building an interactive chat layer using a database agent and the FastAPI framework. Stay tuned!


The Navigator's Log: MLOps, Lessons Learned & The Road Ahead
Priyanka — Sun, 20 Apr 2025 13:50:31 GMT
Our voyage is nearly complete. We've charted the course of building the AI Content Navigator – an ambitious project tackling information overload through web summarization, interactive chat, deep RAG-based Q&A, sophisticated agent orchestration, AI-powered quizzes, and even on-demand educational video generation. It's been a deep dive into the capabilities of Google's Gemini models and LangGraph.
But building cutting-edge AI isn't just about cool features; it's also about building sustainably and efficiently. This brings us to the crucial, often unglamorous, world of MLOps (Machine Learning Operations).
The Practical Reality: How do you manage the operational aspects – particularly cost and performance – of a complex AI system that makes numerous API calls, often with costs tied directly to data processed (tokens)?
Our MLOps Approach: Monitor, Analyze, Optimize
We integrated MLOps thinking from the start:
Universal Metering (Token Tracking): We instrumented every significant interaction with the Gemini API. Whether generating text, analyzing images, synthesizing audio, evaluating answers, or calling functions, we captured the prompt_token_count and candidates_token_count (output) from the API's metadata. This data, collected across all modules (summarizer, agents, quiz, video pipeline), gave us a granular view of our resource consumption. Think of it as installing detailed usage meters throughout our AI factory.
AI, Analyze Thyself: Why manually sift through logs? We leveraged Gemini's analytical power. We fed the aggregated token_count data back into the model, tasking it with acting as an MLOps analyst. Its job: identify the most token-hungry operations (both input and output) and suggest concrete optimization strategies.
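To make the metering idea concrete, here is a minimal sketch of a per-task token meter. The `TokenMeter` class, the task names, and the numbers are illustrative assumptions; in the real system the counts would come from the API's usage metadata (prompt_token_count and candidates_token_count).

```python
from collections import defaultdict

class TokenMeter:
    """Accumulates prompt and completion token counts per task."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"prompt": 0, "completion": 0, "calls": 0})

    def record(self, task: str, prompt_tokens: int, completion_tokens: int) -> None:
        entry = self.usage[task]
        entry["prompt"] += prompt_tokens
        entry["completion"] += completion_tokens
        entry["calls"] += 1

    def top_tasks(self, key: str = "prompt", n: int = 3) -> list:
        """Return the n task names with the highest total for the given key."""
        ranked = sorted(self.usage, key=lambda task: self.usage[task][key], reverse=True)
        return ranked[:n]

meter = TokenMeter()
# Made-up numbers standing in for values read from each API response
meter.record("summarizer", prompt_tokens=12000, completion_tokens=800)
meter.record("quiz_eval", prompt_tokens=600, completion_tokens=5)
meter.record("summarizer", prompt_tokens=9000, completion_tokens=700)
print(meter.top_tasks("prompt", n=1))
```

A dictionary like `meter.usage` is also the kind of aggregated structure that can be serialized with `json.dumps` and handed back to the model for analysis.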
Code Concept: AI-Powered Token Analysis
(Simplified concept based on our implementation)
import json
from IPython.display import Markdown # Or other display method

# Assume 'token_count_data' is the aggregated dictionary of token usage

analysis_prompt = f"""
**Act as an MLOps Analyst.**

**Input:** JSON data showing prompt and completion token counts for tasks in an AI application.

**Task:**
1. Summarize total token usage per task.
2. Identify tasks with the highest *prompt* token costs and explain why (e.g., large input data).
3. Identify tasks with the highest *completion* token costs and explain why (e.g., verbose output).
4. For each high-cost area, provide 2 actionable strategies for token reduction (e.g., input summarization, prompt optimization, RAG tuning).

**Data:**
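```json
{json.dumps(token_count_data, indent=2)}
```
"""

# Send the prompt to Gemini and render the analysis (call left commented, as in the other snippets)
# response = client.models.generate_content(model=MODEL_NAME, contents=analysis_prompt)
# Markdown(response.text)
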
Key Lessons from the Logbook
The AI's analysis, combined with our own observations during development, highlighted critical MLOps lessons for building practical Generative AI applications:
Input Size Matters:
Why: Processing large raw inputs (like full HTML pages, extensive RAG context, or long transcripts) directly translates to high prompt token costs.
Lesson: Pre-process and condense inputs whenever possible. Summarize, extract key information, or use techniques to reduce the raw data fed to the model.
Output Verbosity Costs:
Why: Generating detailed JSON structures, lengthy explanations, or extensive feedback significantly increases completion tokens.
Lesson: Tailor output length via prompt instructions. Be specific about the desired format and brevity. Consider multi-step generation if a complex final result can be built up from smaller, cheaper steps.
RAG Needs Tuning:
Why: Retrieving too many or irrelevant document chunks for Retrieval Augmented Generation (RAG) inflates prompt costs unnecessarily by increasing the context window size.
Lesson: Optimize retrieval parameters such as the number of chunks (k), similarity thresholds, and filtering. Explore techniques like contextual compression to select only the most relevant sentences from retrieved chunks.
Agents Add Overhead:
Why: Each step in an agent or supervisor workflow often involves additional LLM calls for reasoning, tool selection, or routing decisions, adding token costs beyond the core task.
Lesson: Design efficient agent workflows. Minimize the number of turns or reasoning steps required. Consider techniques like context caching (a planned future step for us) to reduce repetitive prompts.
Effective MLOps isn't merely an afterthought tacked on at the end of a project; it's absolutely essential for building practical, scalable, and cost-effective AI applications from the ground up.
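As an illustration of the "RAG Needs Tuning" lesson, the sketch below caps the number of chunks and applies a similarity threshold before building the prompt context. The function name, the default values, and the sample data are hypothetical, and it assumes higher scores mean closer matches (some vector stores return distances instead, where lower is better).

```python
def select_chunks(scored_chunks, k=3, min_similarity=0.75):
    """Keep at most k chunks whose similarity score meets the threshold."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_similarity]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:k]]

# Hypothetical retriever output: (chunk text, similarity score) pairs
retrieved = [
    ("RAG combines retrieval with generation.", 0.91),
    ("Unrelated footer text.", 0.42),
    ("Chunks are embedded and stored in a vector database.", 0.83),
    ("Cookie policy notice.", 0.30),
]

# Only the chunks that pass both filters reach the prompt
context = "\n".join(select_chunks(retrieved, k=2))
print(context)
```

Every chunk dropped here is prompt tokens saved on each call, which is why k and the threshold are worth tuning against answer quality.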
Capabilities Unleashed: The AI Navigator in Review
This project served as a powerful demonstration of how multiple GenAI capabilities can synergize to create a complex, functional system:
Structured Output: For reliable data extraction and flow between components.
Document Understanding: Handling and processing diverse text formats from the web.
Function Calling & Agents: Enabling the AI to use external tools and orchestrate multi-step processes.
Embeddings, Vector Search & RAG: Providing deep contextual understanding and access to external knowledge.
GenAI Evaluation: Automating assessment tasks like quiz grading.
Multimodality: Generating images and audio alongside text for rich content creation (like videos).
MLOps: Monitoring and analyzing performance and cost for sustainability.
Setting Sail for the Future: The Road Ahead
The AI Content Navigator, as presented in this series, is a robust proof-of-concept. However, the ocean of AI possibilities is vast, and this journey is far from over. Future voyages could include:
True Video Understanding: Moving beyond transcripts to analyze frames and audio directly.
Fine-tuning: Training models on domain-specific data to improve performance in that domain.
Full MLOps Automation: Implementing CI/CD pipelines, automated deployment, and sophisticated alerting/monitoring.
Context Caching: Developing smarter memory systems for agents to reduce latency and token costs in ongoing conversations.
More Sophisticated Agents: Exploring agent architectures with better long-term memory and planning.
Scalable Deployment: Moving the system from a prototype environment to robust, scalable cloud platforms.
Building the AI Content Navigator wasn't merely an exercise in coding; it was about orchestrating intelligence. It showcased how combining different facets of Generative AI can create tools that don't just passively provide information, but actively help us understand, interact with, and learn from our increasingly complex digital world.
The voyage continues!


The AI Tutor: Generating Personalized Educational Videos On Demand
Priyanka — Sun, 20 Apr 2025 13:34:49 GMT
Our AI Content Navigator could find information, let users chat with it, answer deep questions from documents, manage its own workflow, and even evaluate user understanding via interactive quizzes. We were close to realizing our vision of an AI learning companion. The final, most ambitious step: could we automatically create video content to help users solidify concepts they struggled with?
The Ultimate Learning Aid: Can AI generate personalized, multimodal educational content (video!) tailored to a user's specific knowledge gaps?
Creating video manually is slow and expensive. Automating it, especially in a personalized way, is a frontier challenge in AI.
Our GenAI Approach: The Multimodal Assembly Line
We designed a pipeline that combined Gemini's text, image, and audio capabilities with video editing tools:
Identifying the Need: The process kicks off using the output from the AI-evaluated quiz (Part 5). The system knows which topics the user found difficult.
Gathering Raw Material (RAG): Accuracy is paramount. Before generating anything, the system uses RAG to retrieve relevant factual snippets about the weak topics from the original source documents (PDFs, transcripts) stored in our vector databases.
Writing the Script (Structured Output): This retrieved context fuels a specialized Gemini agent (video agent or similar). It's prompted to write a short, clear educational script explaining the difficult topics. Crucially, it uses Structured Output to format the script into segments, each containing:
image_prompt: A description for the visual element of that scene.
audio_text: The narration script for that scene.
character_description: (Optional) Notes for visual consistency.
Creating the Visuals (Image Generation): The image_prompt for each segment is fed to an image generation model (like Imagen, or using Gemini's multimodal capabilities). This leverages Image Generation/Understanding to create a unique visual for each part of the narration.
Adding the Voice (Audio Generation): The audio_text narration for each segment is sent to the Gemini Live API, which uses Audio Generation/Understanding to synthesize speech. This audio is saved as a WAV file for each segment.
Putting it Together (Video Assembly): The Python library MoviePy acts as our automated video editor. It takes each generated image, turns it into a short video clip (ImageClip), sets its duration to match the corresponding generated audio file (AudioFileClip), and then combines the image and audio. Finally, it stitches all these individual segment clips together (concatenate_videoclips) into a finished MP4 video.
(Figure 1: The end product – a custom video ready for the user.)
Code Concepts
Generating the Structured Script (Conceptual):
# Simplified concept from PDF page 47-48, 60
from pydantic import BaseModel, Field
import json
# Assuming 'llm', 'StoryResponse', 'StorySegment' are defined
# Assuming 'weak_topics' and 'retrieved_context' are available

def generate_video_script(topics: str, context: str) -> str:
    """Generates a structured video script using an LLM."""
    prompt = f"""
    Create a short (1-2 min) educational video script explaining: {topics}.
    Use this context for accuracy: {context}. Keep it simple for someone learning.
    Output ONLY a valid JSON object using the provided schema (StoryResponse with StorySegments).
    Each segment needs an 'image_prompt' (visual description) and 'audio_text' (narration).
    """
    try:
        generation_config = {
            'response_mime_type': 'application/json',
            # 'response_schema': StoryResponse.model_json_schema(), # Define schema if needed
            'temperature': 0.7 # Allow some creativity in explanation
        }
        # response = llm.invoke(prompt, config=generation_config) # Or client.generate_content
        # Validate response.content is valid JSON before returning
        # return response.content
        return '{"complete_story": []}' # Placeholder: returns the empty structure until the call above is wired in
    except Exception as e:
        print(f"Script Generation Error: {e}")
        return '{"complete_story": []}' # Return empty structure on error
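
Because the script generator hands back a JSON string, the next stage needs to turn it into segments it can trust. The sketch below uses standard-library dataclasses instead of Pydantic so that it runs without dependencies; the `complete_story` key and the field names come from the snippets above, while `parse_script` and the choice to skip incomplete segments are assumptions.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StorySegment:
    image_prompt: str
    audio_text: str
    character_description: Optional[str] = None

def parse_script(raw_json: str) -> List[StorySegment]:
    """Parse the model's JSON script, skipping segments that lack required fields."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    story = data.get("complete_story") if isinstance(data, dict) else None
    if not isinstance(story, list):
        return []
    segments = []
    for item in story:
        # A segment is usable only if it has both a visual and a narration
        if isinstance(item, dict) and item.get("image_prompt") and item.get("audio_text"):
            segments.append(StorySegment(
                image_prompt=item["image_prompt"],
                audio_text=item["audio_text"],
                character_description=item.get("character_description"),
            ))
    return segments

raw = ('{"complete_story": ['
       '{"image_prompt": "A diagram of a retriever", "audio_text": "RAG retrieves first."}, '
       '{"image_prompt": "A scene with no narration"}]}')
segments = parse_script(raw)
print(len(segments))
```

Skipping an incomplete segment keeps the image and audio steps from running on half a scene; raising an error instead would be the stricter alternative.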

Generating Segment Audio (Conceptual):
# Simplified concept from PDF page 82
import asyncio
import wave
# Assuming 'client' with Live API access is initialized

async def generate_audio_live_async(narration: str, output_wav_path: str) -> bool:
    """Generates WAV audio from text using Gemini Live API."""
    # Add negative prompt to prevent conversational filler from the AI
    prompt = "don't say OK , I will do this or that, just only read the following text: " + narration
    config = {"response_modalities": ["AUDIO"]}
    audio_data = bytearray()
    try:
        async with client.aio.live.connect(model=MODEL, config=config) as session:
            await session.send(input=prompt, end_of_turn=True)
            async for response in session.receive():
                if response.data: audio_data.extend(response.data)
        if not audio_data: return False
        # Save audio data to WAV file
        with wave.open(output_wav_path, "wb") as wf:
            # Mono, 16-bit samples, 24 kHz sample rate
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(bytes(audio_data))
        return True
    except Exception as e:
        print(f"Audio Generation Error for {output_wav_path}: {e}")
        # Implement retry logic if desired
        return False
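
The WAV settings above also determine how long each clip will be, which matters because the image duration is later matched to the audio. The sketch below works through that arithmetic under the same assumption the code makes (raw 16-bit mono PCM at 24 kHz); the helper names and the silent test audio are illustrative.

```python
import os
import tempfile
import wave

SAMPLE_RATE = 24000   # samples per second
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
CHANNELS = 1          # mono

def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration implied by a raw PCM byte count at the settings above."""
    return num_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)

def wav_duration_seconds(path: str) -> float:
    """Read a WAV file's duration back from its header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# One second of silence stands in for audio returned by the API
silence = bytes(SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
path = os.path.join(tempfile.mkdtemp(), "segment.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(SAMPLE_WIDTH)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(silence)

print(pcm_duration_seconds(len(silence)), wav_duration_seconds(path))
```

If the two durations ever disagree, the header settings do not match the bytes received, and the narration will play at the wrong speed or pitch.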

Assembling the Video with MoviePy (Conceptual):
# Simplified concept from PDF page 79, 85
# Uses the MoviePy 1.x API; MoviePy 2.x removed moviepy.editor and renamed the set_* methods to with_*
from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips
import os
# Assuming 'segments' list contains {'image_path': '...', 'audio_path': '...'}

def assemble_video(segments: list, output_path: str) -> bool:
    """Combines image and audio clips into a final video."""
    clips = []
    success = True
    try:
        for i, seg in enumerate(segments):
            if not os.path.exists(seg['image_path']) or not os.path.exists(seg['audio_path']):
                print(f"Skipping segment {i}: Missing files.")
                continue
            audio = img = None # Reset per segment so cleanup never touches an earlier segment's clips
            try:
                audio = AudioFileClip(seg['audio_path'])
                if audio.duration <= 0: # Skip zero-duration audio
                    audio.close()
                    continue
                img = ImageClip(seg['image_path']).set_duration(audio.duration)
                video_segment = img.set_audio(audio)
                clips.append(video_segment)
            except Exception as e_inner:
                print(f"Error processing segment {i}: {e_inner}")
                # Close only the clips opened for this segment
                for clip in (audio, img):
                    if clip is not None: clip.close()

        if not clips:
            print("No valid clips to assemble.")
            return False

        final_video = concatenate_videoclips(clips, method="compose")
        final_video.write_videofile(output_path, fps=24, codec='libx264', audio_codec='aac')
        print(f"Video saved to {output_path}")

    except Exception as e_outer:
        print(f"Video Assembly Error: {e_outer}")
        success = False
    finally:
        # Ensure all clips are closed
        if 'final_video' in locals(): final_video.close()
        for clip in clips:
            if clip: clip.close()
    return success

The AI Video Tutor Emerges
This pipeline, while complex, achieved our goal: automatically generating personalized educational videos. The quality hinges on how well script generation, image relevance, audio clarity, and assembly work together. While automatically creating Hollywood-level productions is still out of reach, the pipeline demonstrated the potential of multimodal AI to create tailored, engaging learning experiences on demand. Grounding the script in RAG-retrieved context proved particularly important for educational value.
Final Thoughts: Building this entire system was a journey through the cutting edge of GenAI. In our concluding post, we'll reflect on the practical MLOps lessons learned, the overall impact, and the exciting future possibilities for AI-powered content navigation and learning.


The AI Grader: Interactive Quizzes with GenAI Evaluation
Priyanka — Sun, 20 Apr 2025 13:26:49 GMT
Our AI Content Navigator could now find, summarize, query, and orchestrate information like a pro. But learning isn't just about access; it's about engagement and feedback. We wanted to push the Navigator beyond being an information source and towards becoming an active learning partner.
The Learning Challenge: How can we automatically test a user's comprehension of complex material (like a technical video) and provide meaningful feedback beyond a simple "correct" or "incorrect"?
Standard quizzes often fall short. We envisioned a system where the AI could not only generate questions but also understand and evaluate the user's answers with nuance.
Our GenAI Approach: The Quiz Master & Evaluator
We integrated an AI-powered quiz module using LangGraph and Gemini's capabilities:
Automated Question Mining: Forget manual quiz creation. We tasked Gemini, using Structured Output (response_schema), to read through our processed YouTube transcripts and automatically extract relevant Question/Answer pairs (extract_qa_from_file). This created a dynamic pool of potential quiz questions, stored and indexed in our ChromaDB vector store.
Tailored Quiz Sessions: The user starts by choosing topics and the number of questions, allowing for focused review sessions.
Contextual Question Selection (RAG): The LangGraph quiz workflow uses the selected topics to retrieve the most relevant Q&A pairs from the vector store (RAG), ensuring the questions align with the user's learning goals.
Enter the AI Grader (GenAI Evaluation): This is the core innovation. When the user answers a question, we don't just check for keywords. We invoke another Gemini model specifically prompted to act as an expert evaluator (evaluate_answer node). This AI receives the question, the correct answer (from our extracted pool), and the user's answer. Following a detailed rubric (EVAL_BOT_SYSINT), it assesses the user's response for accuracy, relevance, and clarity, assigning a score (e.g., 1-5). This is GenAI Evaluation in action – AI assessing the quality of human-generated (or AI-assisted) text against a known good answer.
Adaptive Learning Loop: The system tracks scores. If the user consistently struggles with certain topics (based on low scores), and they've enabled "interactive Q/A," the system subtly adjusts the next query to the vector store (provide_feedback node). It prioritizes questions related to these weaker areas, creating an adaptive learning path focused on reinforcing understanding where it's needed most.
Score & Summary: The quiz concludes when the user chooses or the question pool is exhausted, providing a final performance summary.
(Figure 1: The cycle of question retrieval, user interaction, AI evaluation, and adaptive feedback.)
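The adaptive loop can be pictured as a small scoring step that decides which topics the next retrieval should favor. The sketch below is an assumption about how that selection might work, not the project's provide_feedback node; the threshold, the minimum number of attempts, and the sample history are made up.

```python
from collections import defaultdict

def weak_topics(score_log, threshold=3.0, min_attempts=2):
    """Return topics whose average score falls below the threshold, weakest first."""
    by_topic = defaultdict(list)
    for topic, score in score_log:
        by_topic[topic].append(score)
    # Ignore topics with too few answers to judge
    averages = {
        topic: sum(scores) / len(scores)
        for topic, scores in by_topic.items()
        if len(scores) >= min_attempts
    }
    weak = [topic for topic, avg in averages.items() if avg < threshold]
    return sorted(weak, key=lambda topic: averages[topic])

# Hypothetical (topic, score) history on the 1-5 rubric
history = [("RAG", 2), ("RAG", 1), ("agents", 4), ("agents", 5), ("embeddings", 2)]
print(weak_topics(history))
```

The returned topics could then be appended to the next vector-store query so that related questions rank higher.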
Code Concepts
Extracting Q&A Pairs (Conceptual):
# Simplified concept from PDF page 33-34
from pydantic import BaseModel, Field
import json
# Assuming 'client' is initialized GenAI client

class QAPair(BaseModel):
    question: str
    answer: str

class QAExtractResponse(BaseModel):
    qa_pairs: list[QAPair]

def extract_qa_from_file(transcript: str) -> str:
    """Uses Gemini with structured output to find Q&A in text."""
    prompt = f"Analyze the transcript and extract question/answer pairs. Format as JSON matching the schema.\n\nTranscript:\n{transcript}"
    try:
        generation_config = {
            'response_mime_type': 'application/json',
            'response_schema': QAExtractResponse.model_json_schema(),
        }
        # The google-genai client takes generation settings through the `config` argument
        response = client.models.generate_content(
            model=MODEL_NAME, contents=prompt, config=generation_config
        )
        return response.text
    except Exception as e:
        print(f"Q&A Extraction Error: {e}")
        return '{"qa_pairs": []}'

Evaluating User Answer with AI (Conceptual LangGraph Node):
# Simplified concept from PDF page 68-69
# Within the 'evaluate_answer' LangGraph node function

def evaluate_answer_node(state: QuizState) -> QuizState:
    # ... (get user_answer, question_text, correct_answer from state) ...

    # EVAL_BOT_SYSINT contains the detailed rubric (1-5 score) and instructions
    eval_system_instruction = EVAL_BOT_SYSINT

    eval_generation_config = {
        'temperature': 0.2, # Low temp for consistent scoring
        'system_instruction': eval_system_instruction,
    }

    eval_prompt = f"""
    **Evaluate User Answer**
    **Question:** {question_text}
    **Correct Answer:** {correct_answer}
    **User Answer:** {user_answer}
    ---
    Follow the rubric. Output ONLY the integer score (1-5).
    **Score:**"""

    try:
        # Assuming 'chat' is an initialized chat session/client
        response = chat.send_message(eval_prompt, config=eval_generation_config)
        score = int(response.text.strip())
        if not 1 <= score <= 5: score = 0 # Handle invalid scores
    except Exception as e:
        print(f"Evaluation Error: {e}")
        score = 0 # Default score on error

    # Update state based on score
    is_correct = score > 2 # Define 'correct' threshold
    # ... (update correct/incorrect counts, add feedback messages) ...
    state["evaluation"] = f"AI Score: {score}"
    return state
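
Calling `int()` on the raw response works only when the model returns a bare number. A slightly more forgiving parser is sketched below as one possible hardening step; `parse_score` is a hypothetical helper, and falling back to 0 mirrors the default used in the node above.

```python
import re

def parse_score(text: str, low: int = 1, high: int = 5) -> int:
    """Return the first standalone integer in text if it lies in [low, high], else 0."""
    match = re.search(r"\b(\d+)\b", text)
    if match is None:
        return 0
    score = int(match.group(1))
    return score if low <= score <= high else 0

print(parse_score("Score: 4"))
```

This tolerates replies such as "Score: 4" or "5/5" while still rejecting out-of-range values, so a chatty evaluator response no longer counts as an error.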

Beyond Right or Wrong
This AI-driven quiz system offered a leap forward in interactive learning. The GenAI Evaluation provided instant, nuanced feedback far richer than simple pass/fail. While the quality depends on good initial Q&A extraction and consistent evaluation by the AI, it demonstrated the potential for AI to act not just as an information source, but as a personalized tutor, identifying weaknesses and adapting the learning experience.
What's Next? What happens when textual feedback isn't enough? Could we generate visual learning aids on the fly? In the penultimate post of our series, we explore the exciting, multimodal capability of generating personalized educational videos.


The AI Conductor: Orchestrating Complex Workflows with the LangGraph Supervisor
Priyanka — Sun, 20 Apr 2025 13:20:33 GMT
Our AI Content Navigator was becoming a multi-talented powerhouse. It could summarize, chat about summaries, and perform deep dives into documents using RAG. But with each new skill, a question loomed larger: How do all these pieces work together? If a user asks a complex question, how does the system know whether to query the database, perform RAG, or maybe even simplify the result?
The Orchestration Challenge: Managing a team of specialized AI capabilities requires a conductor – something to direct the flow and ensure the right tool is used at the right time.
A single, giant AI trying to do everything would be a tangled mess. Instead, we turned to a more elegant solution using LangGraph: the Supervisor Agent pattern.
Meet the Conductor and the Orchestra
Think of our AI system as an orchestra:
The Musicians (Worker Agents): Each worker agent is a highly skilled specialist, built using LangGraph and powered by Gemini:
researcher: Scours the vector databases (ChromaDB/FAISS) for information within documents.
editor: Takes raw data from the researcher and crafts well-structured reports.
simplifier: Translates complex technical explanations into plain English.
mindmap: Visualizes information as clear diagrams (using Mermaid syntax).
video: Writes scripts for educational videos.
db_agent: (From Post 2) Handles queries against the summarized data database.
(Figure 2: The Supervisor directs the workflow, deciding which agent acts next.)
The Conductor (Supervisor Agent): This top-level agent, built using LangGraph's state machine capabilities, doesn't perform the tasks itself. Its job is purely orchestration.
It listens to the user's request (e.g., "Explain RAG simply and show me a diagram").
It consults its internal "score" (its programming and the current conversation state).
It intelligently delegates:
"Okay, researcher, find information on RAG." (Researcher runs, returns findings)
"Now, simplifier, explain these findings simply." (Simplifier runs, returns explanation)
"Next, mindmap, create a diagram based on the original findings." (Mindmap runs, returns diagram code)
"Alright, present the simplified explanation and the diagram to the user."
It ensures tasks flow logically and manages the conversation state.
Code Concepts
Creating a Worker Agent (Simplifier Example - Conceptual):
# Simplified concept from PDF page 46
from langgraph.prebuilt import create_react_agent
from langchain_google_genai import ChatGoogleGenerativeAI

# Assuming llm is initialized

# Simplifier usually works on text input, doesn't need external tools
simplifier_tools = []

simplifier = create_react_agent(
    model=llm,
    tools=simplifier_tools,
    name="simplifier",
    prompt=(
        "You are an expert communicator specializing in simplifying complex topics.\n"
        "Your input will be technical text provided by another agent.\n"
        "Your task is to rewrite the input text using:\n"
        "- Plain language and short sentences.\n"
        "- Analogies or simple examples.\n"
        "- Bullet points for key takeaways.\n"
        "Do NOT add new information. Focus *only* on simplifying the provided text.\n"
        "Target audience: General public or beginners.\n"
        "Output only the simplified text."
    )
)

Building the Supervisor (LangGraph State Machine - Conceptual):
# Simplified concept from PDF page 48 - using LangGraph's core StateGraph

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated, List, Sequence
import operator
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage
from langchain_google_genai import ChatGoogleGenerativeAI # For supervisor decisions

# Define the state that flows through the graph
class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    next_agent: str # Which agent should execute next?

# --- Define Agent Nodes ---
# Assume agent runnables like 'researcher_runnable', 'editor_runnable', etc. exist
# def run_researcher(state: AgentState): ... return {"messages": [AIMessage(...)]}
# def run_editor(state: AgentState): ... return {"messages": [AIMessage(...)]}
# ... other agent node functions ...

# --- Define Supervisor Node ---
# Initialize LLM for supervisor decisions
supervisor_llm = ChatGoogleGenerativeAI(model=MODEL_NAME, temperature=0)
members = ["researcher", "editor", "simplifier", "mindmap", "video"] # List of worker agents

def supervisor_router(state: AgentState) -> str:
    """Determines the next agent to call or ends the process."""
    last_message = state['messages'][-1]
    # Simple routing based on last message content (real implementation uses LLM)
    if isinstance(last_message, HumanMessage):
        # Initial request often goes to researcher
        return "researcher"
    elif last_message.name == "researcher":
        # After research, maybe edit or simplify? Depends on initial request.
        # This is where the supervisor LLM call would happen in a real system.
        # Let's assume we route to editor for this example.
        return "editor"
    elif last_message.name == "editor":
        # After editing, finish.
        return END
    else:
        # Default or error case
        return END

# --- Build the Graph ---
workflow = StateGraph(AgentState)

# Add nodes for each worker and the supervisor logic (router)
# workflow.add_node("researcher", run_researcher)
# workflow.add_node("editor", run_editor)
# ... add other worker nodes ...

# The supervisor logic is implicitly in the conditional edges
workflow.set_entry_point("researcher") # Example: start with research

# Define transitions based on the supervisor's decision
# workflow.add_conditional_edges(
#     "supervisor_router_node", # A node that calls supervisor_router
#     supervisor_router,
#     {"researcher": "researcher", "editor": "editor", ..., END: END}
# )
# Edges usually go from worker -> supervisor router -> next worker
# workflow.add_edge("researcher", "supervisor_router_node")
# workflow.add_edge("editor", "supervisor_router_node")
# ...

# Compile the graph
# app = workflow.compile()

# --- Invoke ---
# initial_state = {"messages": [HumanMessage(content="Research RAG and write a report.")]}
# for event in app.stream(initial_state):
#     print(event)
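
Since most of the graph wiring above is commented out, a dependency-free sketch may help show the control flow the supervisor implements. This is plain Python, not LangGraph; the worker functions, the message format, and the rule-based `route` function are stand-ins for the real agents and for the supervisor's LLM decision.

```python
END = "END"

def researcher(state):
    return {"sender": "researcher", "content": f"Findings about: {state['request']}"}

def editor(state):
    findings = state["messages"][-1]["content"]
    return {"sender": "editor", "content": f"Report based on -> {findings}"}

WORKERS = {"researcher": researcher, "editor": editor}

def route(state):
    """Rule-based stand-in for the supervisor's LLM routing decision."""
    if not state["messages"]:
        return "researcher"
    last_sender = state["messages"][-1]["sender"]
    return "editor" if last_sender == "researcher" else END

def run_supervisor(request, max_steps=10):
    state = {"request": request, "messages": []}
    for _ in range(max_steps):  # Hard cap guards against routing loops
        next_agent = route(state)
        if next_agent == END:
            break
        # The worker runs, and its output is appended to the shared state
        state["messages"].append(WORKERS[next_agent](state))
    return state

result = run_supervisor("Research RAG and write a report.")
print([message["sender"] for message in result["messages"]])
```

The loop has the same shape as the worker, router, next worker cycle described by the edges above: each worker returns to the router, which either picks the next worker or ends the run.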


Harmony from Complexity
The Supervisor pattern was key to making our multi-skilled AI Navigator work cohesively.
Modularity: Each agent focuses on its strength.
Extensibility: Adding a new skill means adding a new worker agent, not redesigning the whole system.
Control: We define the high-level workflow and let the supervisor manage the details.
It allowed us to build a sophisticated application by composing simpler, specialized parts – a powerful paradigm for complex AI development.
What's Next? Our Navigator is smart, knowledgeable, and well-orchestrated. But how can we make it a better teacher? Part 5 introduces an interactive quiz system where the AI doesn't just ask questions, it evaluates the user's understanding.


Coming soon
Priyanka — Sun, 20 Apr 2025 13:17:05 GMT
This is Priyanka’s Substack.