The AI Tutor: Generating Personalized Educational Videos On Demand
(Part 6 of 7: The AI Content Navigator Series)
Our AI Content Navigator could find information, let users chat with it, answer deep questions from documents, manage its own workflow, and even evaluate user understanding via interactive quizzes. We were close to realizing our vision of an AI learning companion. The final, most ambitious step: could we automatically create video content to help users solidify concepts they struggled with?
The Ultimate Learning Aid: Can AI generate personalized, multimodal educational content (video!) tailored to a user's specific knowledge gaps?
Creating video manually is slow and expensive. Automating it, especially in a personalized way, is a frontier challenge in AI.
Our GenAI Approach: The Multimodal Assembly Line
We designed a pipeline that combined Gemini's text, image, and audio capabilities with video editing tools:
Identifying the Need: The process kicks off using the output from the AI-evaluated quiz (Part 5). The system knows which topics the user found difficult.
Gathering Raw Material (RAG): Accuracy is paramount. Before generating anything, the system uses RAG to retrieve relevant factual snippets about the weak topics from the original source documents (PDFs, transcripts) stored in our vector databases.
Writing the Script (Structured Output): This retrieved context fuels a specialized Gemini agent (a video agent or similar). It's prompted to write a short, clear educational script explaining the difficult topics. Crucially, it uses Structured Output to format the script into segments, each containing an image_prompt (a description of the visual for that scene), an audio_text (the narration for that scene), and an optional character_description (notes for visual consistency).
Creating the Visuals (Image Generation): The image_prompt for each segment is fed to an image generation model (like Imagen, or using Gemini's multimodal capabilities). This leverages Image Generation/Understanding to create a unique visual for each part of the narration.
Adding the Voice (Audio Generation): The audio_text narration for each segment is sent to the Gemini Live API, which uses Audio Generation/Understanding to synthesize speech. The audio is saved as a WAV file for each segment.
Putting it Together (Video Assembly): The Python library MoviePy acts as our automated video editor. It takes each generated image, turns it into a short video clip (ImageClip), sets its duration to match the corresponding generated audio file (AudioFileClip), and combines the image and audio. Finally, it stitches all these individual segment clips together (concatenate_videoclips) into a finished MP4 video.
(Figure 1: The end product – a custom video ready for the user.)
Code Concepts
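Retrieving Context for Weak Topics (Conceptual): Before the script agent runs, the pipeline pulls grounding snippets for the topics the quiz flagged. Here is a minimal sketch of that step, assuming a LangChain-style vector store that exposes a similarity_search method; the names retrieve_context, vector_store, and weak_topics are illustrative, not taken from the original notebook.
# Hypothetical sketch: gathering grounding snippets for the weak topics
# Assumes a LangChain-style vector store exposing similarity_search(), as built in earlier parts

def retrieve_context(weak_topics: list[str], vector_store, k: int = 3) -> str:
    """Pulls the top-k source snippets for each weak topic to ground the script."""
    snippets = []
    for topic in weak_topics:
        # similarity_search returns the documents closest to the query embedding
        for doc in vector_store.similarity_search(topic, k=k):
            snippets.append(f"[{topic}] {doc.page_content}")
    # Deduplicate while preserving order, then join into one context block
    unique = list(dict.fromkeys(snippets))
    return "\n\n".join(unique)

# retrieved_context = retrieve_context(weak_topics, vector_store)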
Generating the Structured Script (Conceptual):
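The generator below assumes StoryResponse and StorySegment Pydantic models matching the segment fields described in the pipeline above. A minimal sketch of what those models might look like (the field names follow that description; the actual models in the notebook may differ):
# Hypothetical sketch of the structured-output models the script generator expects
from pydantic import BaseModel, Field
from typing import Optional

class StorySegment(BaseModel):
    image_prompt: str = Field(description="Visual description for this scene")
    audio_text: str = Field(description="Narration script for this scene")
    character_description: Optional[str] = Field(default=None, description="Optional notes for visual consistency")

class StoryResponse(BaseModel):
    complete_story: list[StorySegment] = Field(description="Ordered list of video segments")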
# Simplified concept from PDF page 47-48, 60
from pydantic import BaseModel, Field
import json

# Assuming 'llm' (a Gemini client) is defined; 'StoryResponse'/'StorySegment' are structured-output models (see sketch above)
# Assuming 'weak_topics' and 'retrieved_context' are available

def generate_video_script(topics: str, context: str) -> str:
    """Generates a structured video script using an LLM."""
    prompt = f"""
    Create a short (1-2 min) educational video script explaining: {topics}.
    Use this context for accuracy: {context}. Keep it simple for someone learning.
    Output ONLY a valid JSON object using the provided schema (StoryResponse with StorySegments).
    Each segment needs an 'image_prompt' (visual description) and 'audio_text' (narration).
    """
    try:
        generation_config = {
            'response_mime_type': 'application/json',
            # 'response_schema': StoryResponse.model_json_schema(),  # Define schema if needed
            'temperature': 0.7,  # Allow some creativity in the explanation
        }
        # response = llm.invoke(prompt, config=generation_config)  # Or client.generate_content
        # Validate that response.content is valid JSON before returning:
        # return response.content
        return '{"complete_story": []}'  # Placeholder until the actual call is wired in
    except Exception as e:
        print(f"Script Generation Error: {e}")
        return '{"complete_story": []}'  # Return empty structure on error
Generating Segment Audio (Conceptual):
# Simplified concept from PDF page 82
import asyncio
import wave

# Assuming 'client' with Live API access and 'MODEL' (a Live-capable Gemini model name) are defined

async def generate_audio_live_async(narration: str, output_wav_path: str) -> bool:
    """Generates WAV audio from text using the Gemini Live API."""
    # Prepend an instruction so the model reads the text verbatim instead of adding conversational filler
    prompt = "Don't say 'OK, I will do this or that'; just read the following text: " + narration
    config = {"response_modalities": ["AUDIO"]}
    audio_data = bytearray()
    try:
        async with client.aio.live.connect(model=MODEL, config=config) as session:
            await session.send(input=prompt, end_of_turn=True)
            async for response in session.receive():
                if response.data:
                    audio_data.extend(response.data)
        if not audio_data:
            return False
        # Save the raw PCM audio as 16-bit, 24 kHz mono WAV
        with wave.open(output_wav_path, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(bytes(audio_data))
        return True
    except Exception as e:
        print(f"Audio Generation Error for {output_wav_path}: {e}")
        # Implement retry logic if desired
        return False
Assembling the Video with MoviePy (Conceptual):
# Simplified concept from PDF page 79, 85
from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips
import os

# Assuming 'segments' is a list of {'image_path': '...', 'audio_path': '...'} dicts

def assemble_video(segments: list, output_path: str) -> bool:
    """Combines image and audio clips into a final video."""
    clips = []
    final_video = None
    success = True
    try:
        for i, seg in enumerate(segments):
            if not os.path.exists(seg['image_path']) or not os.path.exists(seg['audio_path']):
                print(f"Skipping segment {i}: Missing files.")
                continue
            audio = img = None
            try:
                audio = AudioFileClip(seg['audio_path'])
                if audio.duration <= 0:  # Skip zero-duration audio
                    audio.close()
                    continue
                img = ImageClip(seg['image_path']).set_duration(audio.duration)
                clips.append(img.set_audio(audio))
            except Exception as e_inner:
                print(f"Error processing segment {i}: {e_inner}")
                # Close any clips opened for this segment
                if audio is not None: audio.close()
                if img is not None: img.close()
        if not clips:
            print("No valid clips to assemble.")
            return False
        final_video = concatenate_videoclips(clips, method="compose")
        final_video.write_videofile(output_path, fps=24, codec='libx264', audio_codec='aac')
        print(f"Video saved to {output_path}")
    except Exception as e_outer:
        print(f"Video Assembly Error: {e_outer}")
        success = False
    finally:
        # Ensure all clips are closed
        if final_video is not None: final_video.close()
        for clip in clips:
            clip.close()
    return success
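Tying the Stages Together (Conceptual): A rough sketch of how the pieces above could be driven end to end, assuming the helper functions sketched earlier (including the hypothetical retrieve_context and generate_segment_image); the JSON parsing, file naming, and overall glue are illustrative rather than taken from the original notebook.
# Hypothetical driver loop combining the conceptual functions above
import asyncio
import json

def build_tutor_video(weak_topics: list[str], vector_store, output_path: str = "tutor_video.mp4") -> bool:
    """Script -> images -> audio -> assembled MP4 for the user's weak topics."""
    context = retrieve_context(weak_topics, vector_store)
    script_json = generate_video_script(", ".join(weak_topics), context)
    story = json.loads(script_json).get("complete_story", [])
    if not story:
        print("No script segments generated.")
        return False

    segments = []
    for i, seg in enumerate(story):
        image_path = f"segment_{i}.png"
        audio_path = f"segment_{i}.wav"
        if not generate_segment_image(seg["image_prompt"], image_path):
            continue  # Skip segments whose image failed; assembly tolerates gaps
        # asyncio.run is fine here because this driver is not itself running inside an event loop
        if not asyncio.run(generate_audio_live_async(seg["audio_text"], audio_path)):
            continue
        segments.append({"image_path": image_path, "audio_path": audio_path})

    return assemble_video(segments, output_path)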
The AI Video Tutor Emerges
This pipeline, while complex, achieved our goal: automatically generating personalized educational videos. Output quality hinges on the interplay between script quality, image relevance, audio clarity, and the assembly process. Fully automated Hollywood-level production is still a long way off, but the result demonstrated the real potential of multimodal AI to create tailored, engaging learning experiences on demand. Grounding the script with RAG proved particularly important for educational value.
Final Thoughts: Building this entire system was a journey through the cutting edge of GenAI. In our concluding post, we'll reflect on the practical MLOps lessons learned, the overall impact, and the exciting future possibilities for AI-powered content navigation and learning.