
Stop Building Chatbots! Build Ava: The WhatsApp Agent That Passes the Turing Test


Bright Coding

Author


What if your users couldn't tell whether they were texting a human or an AI? Not in some distant sci-fi future—but right now, in their WhatsApp inbox?

Here's the brutal truth: most "AI agents" today are nothing more than glorified FAQ bots. They choke on context, forget everything you told them five minutes ago, and sound about as natural as a GPS recalculating directions. Developers pour hours into building these Frankenstein monsters, only to watch users abandon them after the first interaction.

But what if I told you two ML engineers just cracked the code on something genuinely unsettling?

Meet Ava—a WhatsApp agent so convincingly human, it feels like texting a person with perfect memory, emotional range, and the ability to see, hear, and speak. Inspired by the haunting film Ex Machina, this isn't another toy project. It's a production-ready, multi-modal AI agent built on LangGraph and Groq that you can actually deploy and use today.

And the kicker? You can build it for free.

The ava-whatsapp-agent-course repository contains everything—step-by-step lessons, working code, video tutorials, and deployment guides. No PhD required. No enterprise budget. Just pure, hands-on engineering that transforms how you think about conversational AI.

Ready to stop building chatbots and start building agents that actually think?


What Is the Ava WhatsApp Agent Course?

Ava is an open-source educational project created by Miguel Otero Pedrido and Jesús Copado—two Senior ML/AI Engineers who merged their love for cinema and artificial intelligence into something extraordinary. The project lives at neural-maze/ava-whatsapp-agent-course and represents one of the most comprehensive free resources for building production-grade AI agents in 2024.

The course isn't a superficial tutorial. It's a six-lesson deep dive that takes you from zero to a deployed WhatsApp agent capable of:

  • Processing text, voice, and images in real-time
  • Maintaining persistent memory across conversations
  • Generating original images of its "daily activities"
  • Speaking back to users with natural-sounding voice
  • Deploying to Google Cloud Run for production traffic

What makes Ava genuinely different from the thousand other "AI agent" tutorials flooding GitHub? Architecture. While most projects bolt a language model onto a simple API and call it an agent, Ava uses LangGraph to implement sophisticated state machines with branching logic, loops, and conditional edges. This isn't prompt engineering—it's agent engineering.

The project has exploded in popularity because it solves a real developer pain point: bridging the gap between "cool AI demo" and "actually works in production." Every lesson includes written documentation, video walkthroughs, and working code. The full course even offers a 2+ hour comprehensive video where the creators dissect every line of code.

And here's what makes this genuinely accessible: the entire stack runs on free tiers. Groq for lightning-fast inference. Qdrant Cloud for vector storage. ElevenLabs for voice synthesis. Together AI for image generation. Google Cloud Run with $300 in starter credits. The barrier to entry isn't money—it's just your willingness to build.


Key Features That Make Ava Insanely Powerful

Let's dissect what makes this agent technically remarkable. These aren't marketing bullet points—they're architectural decisions that separate toy projects from production systems.

Multi-Modal Perception Pipeline

Ava doesn't just read text. It sees (via Llama 3.2 Vision), hears (via Whisper STT), and speaks (via ElevenLabs TTS). The course teaches you to orchestrate these disparate models through a unified LangGraph workflow, handling media conversion, error recovery, and context preservation across modalities.

Persistent Memory Architecture

Short-term memory via graph state persistence. Long-term memory via Qdrant vector database. This dual-memory system means Ava remembers your preferences from last month while maintaining conversation coherence in the current thread. The course implements this with production patterns—no in-memory hacks that vanish on restart.

Lightning-Fast Inference with Groq

Groq's LPU (Language Processing Unit) architecture is built for fast token generation. Ava uses Llama 3.3 for reasoning, Llama 3.2 Vision for image understanding, and Whisper for transcription, all through Groq's API, which keeps per-call latency low enough for conversational use.

Production Deployment Patterns

The course doesn't stop at "it works on my machine." You'll deploy containerized agents to Google Cloud Run, configure webhook integrations with the WhatsApp Business API, and implement health checks, scaling policies, and secrets management. This is how real AI products ship.

Image Generation & Persona Consistency

Perhaps the most unsettling feature: Ava generates images of its "daily activities" using FLUX diffusion models via Together AI, maintaining visual and narrative consistency with its established persona. This isn't random image generation—it's character-driven visual storytelling.

Chainlit Development Interface

Before deploying to WhatsApp, you'll build and debug interactions using Chainlit, a Python framework for creating chat interfaces. This rapid iteration cycle means you can test complex conversational flows without touching your phone.


Real-World Use Cases Where Ava Destroys Traditional Solutions

Still wondering if this is just a cool demo? Here are four concrete scenarios where Ava's architecture solves problems that break conventional approaches.

1. Hyper-Personalized Customer Support

Traditional chatbots ask you to repeat your order number every session. Ava remembers that you complained about shipping delays three months ago, prefers voice messages, and recently moved to a new address. The Qdrant-backed long-term memory means context accumulates rather than resets, creating genuinely personalized support at scale.

2. Accessible Information Services

For visually impaired users or situations where hands-free interaction is essential, Ava's STT/TTS pipeline creates natural voice conversations. A farmer checking crop prices, a driver requesting navigation updates, or an elderly patient confirming medication schedules—all through natural WhatsApp voice messages processed by Whisper and responded to by ElevenLabs.

3. Creative Companion & Storytelling Agent

Ava's image generation capabilities enable interactive creative experiences. Users describe scenarios; Ava responds with generated visuals of its "activities," creating ongoing narrative threads. This isn't a chatbot—it's a collaborative storytelling partner with persistent character continuity.

4. Field Service & Remote Diagnostics

Technicians in the field can photograph equipment issues, send voice descriptions, and receive troubleshooting guidance—all within WhatsApp. The VLM processes images for fault identification, while the agent's memory maintains equipment history and previous interventions. No special apps, no training—just the messaging platform everyone already uses.


Step-by-Step Installation & Setup Guide

Let's get Ava running on your machine. The repository's GETTING STARTED.md contains the authoritative instructions, but here's the complete workflow:

Prerequisites

  • Python 3.11+
  • Git
  • A Groq API account (free tier available)
  • Qdrant Cloud account (free tier available)
  • ElevenLabs account (free tier available)
  • Together AI account (free tier available)
  • WhatsApp Business API access (for production deployment)

Environment Setup

# Clone the repository
git clone https://github.com/neural-maze/ava-whatsapp-agent-course.git
cd ava-whatsapp-agent-course

# Create virtual environment (critical for dependency isolation)
python -m venv venv

# Activate on macOS/Linux
source venv/bin/activate

# Activate on Windows
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root with your API credentials:

# Core LLM inference via Groq
GROQ_API_KEY=your_groq_api_key_here

# Vector database for long-term memory
QDRANT_URL=your_qdrant_cluster_url
QDRANT_API_KEY=your_qdrant_api_key

# Voice synthesis
ELEVENLABS_API_KEY=your_elevenlabs_api_key

# Image generation
TOGETHER_API_KEY=your_together_api_key

# WhatsApp Business API (for production deployment)
WHATSAPP_TOKEN=your_whatsapp_business_token
WHATSAPP_PHONE_NUMBER_ID=your_phone_number_id
VERIFY_TOKEN=your_webhook_verify_token
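
A missing key tends to surface late, as an authentication error from one provider in the middle of a conversation. A minimal sketch of a startup check, assuming the variable names from the .env example above; `missing_settings` and `REQUIRED_KEYS` are hypothetical names for illustration, not part of the repository:

```python
import os

# Variable names copied from the .env example above.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "ELEVENLABS_API_KEY",
    "TOGETHER_API_KEY",
)


def missing_settings(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in required if not str(env.get(key, "")).strip()]
```

Calling `missing_settings()` once at startup and refusing to boot when the list is non-empty turns a confusing runtime failure into a clear configuration error. The WhatsApp variables are left out of the required set here because local Chainlit testing does not need them.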

Verify Installation

# Run the Chainlit development interface for local testing
chainlit run app.py

This launches a local web interface where you can interact with Ava before connecting WhatsApp. The development server typically runs on http://localhost:8000.

Production Deployment Preparation

For Google Cloud Run deployment:

# Build container image (requires Google Cloud SDK)
gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/ava-agent

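# Deploy to Cloud Run
# (inline env vars are fine for a quick test; for real credentials prefer
#  --set-secrets backed by Secret Manager)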
gcloud run deploy ava-agent \
  --image gcr.io/YOUR_PROJECT_ID/ava-agent \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars "GROQ_API_KEY=your_key,QDRANT_URL=your_url"

The course's Lesson 6 specifically covers WhatsApp API webhook configuration and Cloud Run deployment with proper secret management.
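
Before Meta delivers any messages, it verifies the webhook URL with a GET request carrying `hub.mode`, `hub.verify_token`, and `hub.challenge`; the endpoint must echo the challenge back when the token matches. A framework-agnostic sketch of that handshake, assuming the query parameters arrive as a plain dict; `verify_webhook_subscription` is a hypothetical helper, and the course's own handler may be structured differently:

```python
import hmac


def verify_webhook_subscription(params, expected_token):
    """Return (status_code, body) for the webhook verification GET request."""
    mode = params.get("hub.mode")
    token = params.get("hub.verify_token", "")
    challenge = params.get("hub.challenge", "")
    # compare_digest avoids leaking the token through timing differences.
    token_ok = hmac.compare_digest(token.encode("utf-8"), expected_token.encode("utf-8"))
    if mode == "subscribe" and token_ok:
        return 200, challenge
    return 403, "Verification failed"
```

Here `expected_token` would be the `VERIFY_TOKEN` value from the .env file, and the returned pair maps directly onto whatever response object your web framework uses.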


REAL Code Examples from the Repository

The repository's architecture centers on LangGraph state machines. The snippets below illustrate the patterns the six lessons build on; treat them as condensed sketches and check the repository for the exact code.

Example 1: Basic LangGraph Workflow Structure

From Lesson 2, here's how Ava's cognitive architecture begins—a state graph with conditional routing:

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

# Define the agent's cognitive state
class AgentState(TypedDict):
    # Messages accumulate with operator.add for conversation history
    messages: Annotated[list, operator.add]
    # Track which sensory modality triggered this turn
    input_type: str  # "text", "voice", "image"
    # Working memory for current reasoning context
    current_thought: str
    # Flag for whether to generate image response
    should_generate_image: bool

# Initialize the state machine
workflow = StateGraph(AgentState)

# Define processing nodes (actual implementations in course).
# call_groq_chat and generate_with_flux are placeholder helpers not defined here.
def perceive_input(state: AgentState):
    """Route input based on detected modality"""
    if state["input_type"] == "voice":
        return {"current_thought": "Processing audio via Whisper STT..."}
    elif state["input_type"] == "image":
        return {"current_thought": "Analyzing visual input via VLM..."}
    return {"current_thought": "Processing text input directly"}

def generate_response(state: AgentState):
    """Core reasoning with Groq Llama 3.3"""
    # Groq API call with conversation context
    response = call_groq_chat(messages=state["messages"])
    return {"messages": [response]}

def maybe_generate_image(state: AgentState):
    """Conditional image generation branch"""
    if state["should_generate_image"]:
        image_url = generate_with_flux(state["current_thought"])
        return {"messages": [{"type": "image", "content": image_url}]}
    return {}

# Register nodes in the graph
workflow.add_node("perceive", perceive_input)
workflow.add_node("reason", generate_response)
workflow.add_node("visualize", maybe_generate_image)

# Define edges with conditional routing
workflow.set_entry_point("perceive")
workflow.add_edge("perceive", "reason")
workflow.add_conditional_edges(
    "reason",
    lambda state: "visualize" if state["should_generate_image"] else END,
    {"visualize": "visualize", END: END}
)
workflow.add_edge("visualize", END)

# Compile to executable agent
ava_agent = workflow.compile()

What's happening here? This isn't a simple request-response loop. It's a state machine where each cognitive step is explicit and testable. The AgentState TypedDict documents the shape of the state for static type checkers. The Annotated[list, operator.add] pattern tells LangGraph to append each node's returned messages to the history rather than overwrite it. Conditional edges add branching logic (should we generate an image?) that is awkward to express in a linear prompt chain.
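
To make the accumulation rule concrete, here is a simplified model in plain Python of how a partial update from a node is merged into the state. This is an illustration of the behavior, not LangGraph's implementation; `apply_update` and the `reducers` mapping are hypothetical names:

```python
import operator


def apply_update(state, update, reducers):
    """Merge a node's partial update into the state.

    Keys that have a reducer are combined with the existing value;
    every other key is simply overwritten.
    """
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged


# "messages" uses operator.add, mirroring Annotated[list, operator.add].
reducers = {"messages": operator.add}
state = {"messages": ["hi"], "current_thought": ""}

# A node that returns only current_thought overwrites that key.
state = apply_update(state, {"current_thought": "Processing text input directly"}, reducers)
# A node that returns messages has them appended to the history.
state = apply_update(state, {"messages": ["hello!"]}, reducers)
```

This is why `generate_response` returns `{"messages": [response]}` with a one-element list: the reducer concatenates lists, so returning the bare message would fail.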

Example 2: Memory Integration with Qdrant

From Lesson 3, here's the long-term memory retrieval pattern:

import uuid  # needed for uuid.uuid4() point IDs below

from qdrant_client import QdrantClient
from langchain_huggingface import HuggingFaceEmbeddings  # requires the langchain-huggingface package

class MemoryManager:
    def __init__(self, qdrant_url: str, api_key: str):
        # Connect to Qdrant Cloud vector database
        self.client = QdrantClient(url=qdrant_url, api_key=api_key)
        # Local embedding model for query encoding
        self.embedder = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        self.collection_name = "ava_memories"
    
    def store_interaction(self, user_id: str, message: str, 
                          response: str, timestamp: float):
        """Persist conversation turn to vector memory"""
        # Create rich text representation for embedding
        memory_text = f"User said: {message}\nAva responded: {response}"
        
        # Generate embedding vector
        vector = self.embedder.embed_query(memory_text)
        
        # Store with metadata for filtered retrieval
        self.client.upsert(
            collection_name=self.collection_name,
            points=[{
                "id": str(uuid.uuid4()),
                "vector": vector,
                "payload": {
                    "user_id": user_id,
                    "timestamp": timestamp,
                    "memory_text": memory_text,
                    "message_type": "conversation_turn"
                }
            }]
        )
    
    def retrieve_relevant_memories(self, user_id: str, 
                                   current_query: str, 
                                   limit: int = 5) -> list[str]:
        """Semantic search across user's conversation history"""
        query_vector = self.embedder.embed_query(current_query)
        
        # Filter to specific user, sort by semantic similarity
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            query_filter={
                "must": [{"key": "user_id", "match": {"value": user_id}}]
            },
            limit=limit
        )
        
        # Extract memory texts for injection into prompt context
        return [hit.payload["memory_text"] for hit in results]

The critical insight: This isn't just "storing chat history." It's semantic memory—retrieving relevant past interactions based on meaning, not chronology. When a user mentions "that restaurant I liked," Ava finds the specific recommendation from six months ago because the embedding captures conceptual similarity, not keyword matching.
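
The ranking idea behind that retrieval can be shown without a vector database. The sketch below assumes cosine similarity as the distance metric (the collection's actual metric is a configuration choice) and uses hand-made three-dimensional vectors in place of real embeddings; `top_memories` is a hypothetical helper:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_memories(query_vector, memories, user_id, limit=2):
    """Rank one user's memories by similarity to the query vector."""
    own = [m for m in memories if m["user_id"] == user_id]
    ranked = sorted(
        own,
        key=lambda m: cosine_similarity(query_vector, m["vector"]),
        reverse=True,
    )
    return [m["text"] for m in ranked[:limit]]


# Toy vectors: the first axis loosely stands for "food", the last for "travel".
memories = [
    {"user_id": "u1", "text": "Loved the ramen place downtown", "vector": [0.9, 0.1, 0.0]},
    {"user_id": "u1", "text": "Flight to Lisbon was delayed", "vector": [0.0, 0.2, 0.9]},
    {"user_id": "u2", "text": "Prefers sushi over pizza", "vector": [0.8, 0.2, 0.0]},
]
best = top_memories([1.0, 0.0, 0.0], memories, user_id="u1", limit=1)
```

The user filter runs before ranking, which matches the role of `query_filter` in the Qdrant call: another user's memory is never returned, however similar it is.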

Example 3: Voice Pipeline Integration

From Lesson 4, the STT/TTS orchestration:

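import os
import tempfile

import groq
# generate() is the module-level helper from the pre-1.0 elevenlabs SDK;
# newer SDK versions expose a client object instead.
from elevenlabs import generate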

class VoicePipeline:
    def __init__(self, groq_key: str, eleven_key: str):
        self.groq_client = groq.Groq(api_key=groq_key)
        self.eleven_key = eleven_key
    
    async def process_voice_message(self, audio_bytes: bytes, 
                                    user_id: str) -> dict:
        """Full pipeline: audio -> text -> agent response -> audio"""
        
        # Step 1: Speech-to-Text via Groq Whisper
        with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name
        
        try:
            with open(tmp_path, "rb") as audio_file:
                transcription = self.groq_client.audio.transcriptions.create(
                    file=audio_file,
                    model="whisper-large-v3",  # Whisper model hosted on Groq
                    response_format="text"
                )
        finally:
            os.unlink(tmp_path)  # Clean up temp file
        
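        # Step 2: Pass transcribed text to agent
        # (ava_agent is the compiled graph from Example 1)
        agent_response = await ava_agent.ainvoke({
            "messages": [{"role": "user", "content": transcription}],
            "input_type": "voice",
            "should_generate_image": False  # read by the conditional edge
        })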
        
        response_text = agent_response["messages"][-1].content
        
        # Step 3: Text-to-Speech via ElevenLabs
        audio_response = generate(
            text=response_text,
            voice="Bella",  # Consistent persona voice
            model="eleven_multilingual_v2",
            api_key=self.eleven_key
        )
        
        return {
            "text": response_text,
            "audio_bytes": audio_response,
            "transcription": transcription
        }

Why this matters: voice adds production concerns that break naive implementations. Notice the temporary file cleanup in the finally block, the async agent invocation, and the consistent voice selection for persona maintenance. Media format conversion, retries, and audio streaming are left out of this snippet and still need handling in a real deployment. Note also that the Groq transcription call shown here is synchronous, so it holds the event loop until it returns.
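
A synchronous SDK call made inside an `async def` still blocks the event loop while it waits on the network. One common way around that, sketched here with a stand-in function rather than a real API call, is `asyncio.to_thread`; `blocking_transcribe` and the other names are hypothetical:

```python
import asyncio
import time


def blocking_transcribe(audio_bytes):
    """Stand-in for a synchronous SDK call such as a transcription request."""
    time.sleep(0.05)  # simulates network latency
    return f"transcribed {len(audio_bytes)} bytes"


async def transcribe_without_blocking(audio_bytes):
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop stays free to serve other requests meanwhile.
    return await asyncio.to_thread(blocking_transcribe, audio_bytes)


async def handle_many(messages):
    # gather preserves input order in its result list.
    return await asyncio.gather(*(transcribe_without_blocking(m) for m in messages))


results = asyncio.run(handle_many([b"abc", b"hello"]))
```

The two simulated calls overlap instead of running back to back, which is the property a webhook server needs when several users send voice notes at once.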


Advanced Usage & Best Practices

Having built the base system, here's how to make it production-grade:

State Persistence Strategies

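LangGraph's checkpointer persists graph state after each step, with SQLite for development and PostgreSQL for production. Pass a distinct thread_id per user so one user's conversation state never leaks into another's:

import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver

# Development: a file-backed database survives restarts (":memory:" does not).
# Production: use PostgresSaver with connection pooling
conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
memory = SqliteSaver(conn)
ava_agent = workflow.compile(checkpointer=memory)

# Pass as config= on each invoke; user_id comes from the incoming message
config = {"configurable": {"thread_id": user_id}}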

Cost Optimization with Groq

Groq's speed makes it practical to run several candidate reasoning paths in parallel and keep the first acceptable response. Each extra path still consumes tokens, so reserve this for latency-critical turns. For image generation, cache results keyed by prompt so that identical requests to Together AI are not generated twice.
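
A minimal sketch of caching by prompt, assuming that two prompts differing only in case or whitespace should share one generated image. `PromptCache` and `fake_generate` are hypothetical; a real deployment on Cloud Run would need a shared store rather than this in-process dict, since instances are ephemeral:

```python
import hashlib


class PromptCache:
    """Cache generated results keyed by a hash of the normalized prompt."""

    def __init__(self, generate_fn):
        self._generate = generate_fn
        self._store = {}
        self.misses = 0  # counts how often the generator actually ran

    @staticmethod
    def _key(prompt):
        # Lowercase and collapse whitespace so trivial variations hit the cache.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt):
        key = self._key(prompt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(prompt)
        return self._store[key]


def fake_generate(prompt):
    """Stand-in for an image generation call; returns a placeholder URL."""
    return f"https://example.invalid/{len(prompt)}.png"


cache = PromptCache(fake_generate)
first = cache.get("Ava reading in a cafe")
second = cache.get("  ava reading in a   CAFE ")
```

Both lookups return the same URL and the generator runs once. Whether normalization this aggressive is appropriate depends on how sensitive the image model is to prompt casing.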

Security Hardening

Never commit .env files. Use Google Cloud Secret Manager for production credentials. Implement webhook signature verification for WhatsApp API callbacks to prevent spoofing. The course covers HMAC validation patterns in Lesson 6.
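
Meta signs each webhook POST with an `X-Hub-Signature-256` header of the form `sha256=<hex digest>`, computed as an HMAC-SHA256 of the raw request body using your app secret. A sketch of the check, assuming you have the unparsed body bytes available; `is_valid_signature` is a hypothetical helper and not necessarily how Lesson 6 structures it:

```python
import hashlib
import hmac


def is_valid_signature(payload, signature_header, app_secret):
    """Check the X-Hub-Signature-256 header against the raw request body."""
    prefix = "sha256="
    if not signature_header or not signature_header.startswith(prefix):
        return False
    expected = hmac.new(app_secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # Constant-time comparison; comparing bytes also tolerates non-ASCII input.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The digest must be computed over the body exactly as received. Parsing the JSON and re-serializing it first will usually change the bytes and make every signature fail.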

Observability

Add LangSmith tracing to every graph invocation. The branching logic in LangGraph makes traditional logging insufficient—you need visual execution traces to debug why Ava chose a particular reasoning path.
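
LangSmith tracing is switched on through environment variables rather than code changes to the graph. A small sketch, assuming the `LANGCHAIN_TRACING_V2`, `LANGCHAIN_PROJECT`, and `LANGCHAIN_API_KEY` variable names; the project name `ava-dev` and the helper itself are hypothetical:

```python
import os


def enable_langsmith_tracing(project="ava-dev"):
    """Set the tracing variables and report whether an API key is present."""
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = project
    # The key itself should come from the environment or a secret manager,
    # never from source code.
    return "LANGCHAIN_API_KEY" in os.environ
```

Call it before the first graph invocation. A `False` return means traces will not be uploaded until the API key is provided.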


Comparison with Alternatives

| Feature | Ava (LangGraph) | Simple OpenAI API | Bot Frameworks (Dialogflow etc.) | Custom FastAPI + LLM |
|---|---|---|---|---|
| Stateful Conversations | ✅ Native graph persistence | ❌ Manual implementation | ✅ Built-in | ❌ Manual implementation |
| Multi-Modal (Voice/Image) | ✅ Integrated STT/TTS/VLM | ⚠️ Requires external services | ❌ Limited | ⚠️ Fragile custom code |
| Long-Term Memory | ✅ Qdrant vector search | ❌ Context window only | ⚠️ Entity extraction | ❌ Manual vector DB |
| Conditional Logic | ✅ Graph edges & branches | ❌ Linear prompt chains | ✅ Visual builder | ⚠️ Code complexity |
| Production Deployment | ✅ Cloud Run patterns | ⚠️ DIY infrastructure | ✅ Managed hosting | ❌ Full DevOps burden |
| Learning Curve | Medium (structured) | Low (deceptively simple) | Low (limited power) | High (reinventing wheels) |
| Cost at Scale | Low (Groq speed + free tiers) | High (GPT-4 token costs) | Medium (platform fees) | Variable (ops overhead) |

The verdict: Simple API wrappers fail at complexity. Bot frameworks suffocate at sophistication. Custom solutions consume engineering time Ava's creators already invested. LangGraph hits the sweet spot—structured enough for reliability, flexible enough for genuine intelligence.


FAQ

Q: Do I need prior LangGraph experience to follow this course?

No. Lesson 2 builds LangGraph fundamentals from scratch. Python proficiency and basic ML concepts are assumed, but the graph abstractions are taught incrementally.

Q: Can I really run this for free?

Yes, at the time of writing. The free tiers cover: Groq (generous rate limits), Qdrant Cloud (1GB vector storage), ElevenLabs (10k characters/month), Together AI ($5 credit), and Google Cloud ($300 new account credit). Providers revise these limits, so check each pricing page before relying on them. The course explicitly optimizes for these limits.

Q: Is WhatsApp Business API approval required?

For production deployment to your own number, yes. Meta requires business verification. However, you can develop and test entirely through Chainlit without WhatsApp approval.

Q: How does Ava compare to GPT-4o with voice mode?

GPT-4o's native multi-modal is impressive but closed, expensive, and inflexible. Ava's architecture lets you swap models, customize memory behavior, and deploy on infrastructure you control. It's the difference between renting and owning.

Q: Can I use different LLM providers?

Absolutely. The LangGraph abstraction makes model swapping straightforward. Replace Groq with Anthropic, OpenAI, or local models via Ollama—the graph structure remains identical.

Q: What's the latency for voice conversations?

With Groq's LPU, end-to-end voice-to-voice typically completes in about 2-4 seconds, which is quicker than most people take to type a reply. The course includes optimization techniques aimed at sub-2-second responses.

Q: Is this suitable for commercial projects?

MIT licensed—use commercially without restriction. The course itself is educational, but the resulting agent architecture is production-ready for real products.


Conclusion: The Future of Conversational AI Is Agentic

We've reached an inflection point. Prompt engineering is dead; agent engineering is beginning. Projects like Ava demonstrate that the next generation of AI applications won't be single-model API calls—they'll be orchestrated systems of specialized models, memory stores, and conditional logic, wrapped in interfaces users already love.

The ava-whatsapp-agent-course isn't just a tutorial. It's a blueprint for building AI that persists, reasons, and genuinely engages. Miguel and Jesús have done the hard work of architectural exploration, failure, and refinement—now they're handing you the map.

Whether you're an ML Engineer seeking production patterns, a Software Engineer expanding into AI, or an AI Engineer tired of demos that break in reality, this course meets you where you are and pushes you where you need to be.

The Turing Test isn't about fooling judges in controlled rooms anymore. It's about creating something people voluntarily return to, something that remembers, something that feels.

Ava gets closer than anything I've seen in open source.

👉 Clone the repository. Start Lesson 1. Build something that makes you slightly uncomfortable with how well it works.

The future of AI agents isn't coming. It's already in your WhatsApp.


Follow The Neural Maze for weekly AI systems engineering insights, and Jesús Copado's YouTube for hands-on build videos.
