
From Prototype to Production: Building a Scalable Voice AI Architecture
Building a voice AI prototype is one thing; deploying a robust, production-ready system is a different beast entirely. Many teams underestimate the complexity required to handle thousands of concurrent calls, maintain real-time responsiveness, and ensure high availability. A seemingly simple voice interaction involves many interconnected components, from voice streaming and AI processing to database updates and frontend refreshes.
At ConversAI Labs, we've learned these lessons firsthand while processing over 50,000 monthly calls for our clients. This experience has shaped our architecture principles and technology choices. In this post, we'll share our key learnings and insights into building a scalable and reliable voice AI platform.
We'll explore the core architecture stack, real-time call flow, scaling challenges we faced, our approach to monitoring and observability, cost optimization strategies, security best practices, and the crucial lessons we've learned along the way. Our goal is to provide a practical guide for anyone building their own production voice AI system.
Core Architecture Stack
Our voice AI platform is built on a modern and scalable technology stack designed for performance and reliability.
Backend Framework: FastAPI (Python 3.11)
We chose FastAPI as our backend framework for its speed, asynchronous capabilities, and developer-friendly features. Key advantages include:
Async/await support: Enables concurrent call handling, crucial for managing a large volume of simultaneous interactions.
Performance: Up to 300% faster than Flask in our real-time operations due to its asynchronous nature.
OpenAPI documentation: Automatic generation of API documentation, simplifying integration and collaboration.
Type safety: Utilizes Pydantic models for data validation and type checking, reducing errors and improving code maintainability.
Database: PostgreSQL on Cloud SQL
PostgreSQL on Google Cloud SQL provides a robust and scalable database solution.
JSONB support: Allows flexible storage of lead custom fields, accommodating diverse data requirements.
UUID primary keys: Ensures unique identification of records in a distributed system.
Connection pooling: Configured with a maximum of 20 connections to optimize resource utilization.
Read replicas: Enables offloading analytics queries to read replicas, minimizing impact on the primary database.
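The JSONB and UUID choices can be sketched as a SQLAlchemy model; the table and column names here are illustrative assumptions rather than our actual schema.

```python
import uuid

from sqlalchemy import Column, String
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Lead(Base):
    # Illustrative table -- names are assumptions, not our production schema.
    __tablename__ = "leads"

    # UUID primary keys stay unique across distributed writers.
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    phone = Column(String, nullable=False)
    # JSONB holds per-client custom fields without schema migrations.
    custom_fields = Column(JSONB, default=dict)

# The 20-connection pool would be configured on the engine, e.g.:
#   create_engine(DB_URL, pool_size=20, max_overflow=0)
```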
Voice Engine: Retell AI Integration
We leverage Retell AI for its advanced voice processing capabilities and seamless integration.
WebSocket connections: Facilitates real-time event streaming for low-latency communication.
Low-latency voice streaming: Ensures minimal delay in voice data transmission.
Built-in conversation memory: Provides context for AI decision-making during calls.
High Availability: Retell AI provides a 99.9% uptime SLA, crucial for reliable service delivery.
Deployment: Google Cloud Run
Google Cloud Run offers a fully managed, serverless environment for deploying our application.
Auto-scaling: Automatically scales from 0 to 100 instances based on demand.
Unix socket DB connections: Enables secure and efficient communication with the database.
Container-based isolation: Provides isolation and consistency across different environments.
Cost-effective: Cloud Run costs us approximately ₹8,000/month while processing 50,000 calls; full infrastructure (including Cloud SQL) comes to roughly ₹20,000/month.
Real-Time Call Flow Architecture
The real-time call flow orchestrates the interaction between the user, our platform, and external services.
Overview:
User Action → API Request → Call Scheduler → Retell API → Twilio/Plivo → Customer → Real-Time Webhooks → Database Update → Frontend Refresh
Step-by-Step Technical Flow:
Lead Scheduling:
The frontend submits lead data via a POST request to /api/v1/leads.
Data is validated using Pydantic models to ensure data integrity.
The lead is inserted into the PostgreSQL database with custom fields stored as a JSONB object.
A UUID is returned for tracking the lead's progress.
Call Initiation:
Cloud Scheduler triggers a job every minute.
The job queries the database for eligible leads based on business hours and retry logic.
An interaction_attempt record is created to track each call attempt.
The system calls the Retell AI API with the appropriate agent configuration.
Live Call Processing:
Retell AI connects to the customer via Twilio (or Plivo).
Real-time transcription streaming is enabled for live call analysis.
AI decision-making occurs with context from conversation memory.
Webhook updates are triggered to update the call status (new → in_progress → done).
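The status lifecycle above (new → in_progress → done) can be enforced with a small transition guard, which also shrugs off out-of-order webhook deliveries. This is a sketch of the pattern, not our exact production code.

```python
# Allowed transitions for the new -> in_progress -> done lifecycle.
TRANSITIONS: dict[str, set[str]] = {
    "new": {"in_progress"},
    "in_progress": {"done"},
    "done": set(),  # terminal: no further transitions
}

def apply_status(current: str, incoming: str) -> str:
    """Return the updated status, ignoring stale or duplicate events."""
    if incoming in TRANSITIONS.get(current, set()):
        return incoming
    return current  # out-of-order webhook: keep the existing status
```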
Webhook Handling:
A POST request to /api/v1/calls/webhook receives events from Retell AI.
The system updates the call status, transcript, and analysis in the database.
A frontend refresh is triggered via polling to update the user interface.
Call details are logged to analytics tables for reporting.
Performance Metrics:
Average call initiation time: 2.3 seconds
Webhook processing time: <100ms
Database query time: <50ms
Uptime: 99.8% over 6 months
Scaling Challenges & Solutions
As our call volume grew, we encountered several scaling challenges that required innovative solutions.
Challenge 1: Database Connection Exhaustion
Problem: Handling 100+ concurrent calls led to connection pool overflow.
Solution: Implemented PgBouncer connection pooling.
Result: Increased capacity to support 500 concurrent calls.
Challenge 2: Webhook Race Conditions
Problem: Multiple webhooks updating the same call record simultaneously.
Solution: Implemented PostgreSQL row-level locking and idempotency keys.
Result: Eliminated duplicate updates and ensured data consistency.
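The idempotency half of this fix boils down to keying each webhook on a provider event id and applying it at most once. The sketch below uses an in-memory set as a stand-in for the database table; in production the check runs inside a transaction that first locks the call row.

```python
# In production this lives in PostgreSQL: the handler first runs
#   SELECT ... FROM calls WHERE id = %s FOR UPDATE
# so concurrent webhooks for the same call serialize on that row.
from typing import Callable

_seen: set[str] = set()  # stand-in for a processed_events table

def process_once(event_id: str, apply: Callable[[], None]) -> bool:
    """Apply an event exactly once, keyed on the provider's event id."""
    if event_id in _seen:
        return False  # duplicate delivery: skip silently
    _seen.add(event_id)
    apply()
    return True
```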
Challenge 3: Cold Start Latency
Problem: Cloud Run instances took 3-5 seconds to start.
Solution: Configured Cloud Run with a minimum of 1 instance always warm.
Result: Reduced average response time to <500ms.
Challenge 4: Call Scheduling Accuracy
Problem: Timezone mismatches and violations of business hours.
Solution: Used Pytz for robust timezone handling and implemented scheduled task validation.
Result: Achieved 99.9% compliance with scheduling rules.
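The core of the timezone fix is converting to the lead's local wall-clock time before dialing. This sketch uses the stdlib zoneinfo module as a stand-in for the pytz handling described above; the 9:00–18:00 weekday window is an assumed example, not our actual policy.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo  # stdlib stand-in for pytz

# Assumed calling window for illustration.
BUSINESS_START, BUSINESS_END = time(9, 0), time(18, 0)

def within_business_hours(utc_now: datetime, lead_tz: str) -> bool:
    """Check the lead's local wall-clock time before dialing."""
    local = utc_now.astimezone(ZoneInfo(lead_tz))
    is_weekday = local.weekday() < 5  # Monday=0 .. Friday=4
    return is_weekday and BUSINESS_START <= local.time() < BUSINESS_END
```

Doing the conversion lead-side (rather than comparing server-local time) is what eliminated the mismatches.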
Monitoring & Observability
Comprehensive monitoring and observability are essential for maintaining a stable and performant system.
Logging Strategy:
Structured JSON logging with Python's built-in logger.
Log levels: DEBUG (development), INFO (production), ERROR (alerts).
Correlation IDs for request tracing.
Sensitive data redaction (phone numbers, names).
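The logging strategy above can be sketched with the stdlib logger: one JSON object per line, a correlation id carried on the record, and phone numbers redacted before anything is written. The redaction regex is a simplified example.

```python
import json
import logging
import re
import sys

PHONE = re.compile(r"\+?\d{10,15}")  # simplified redaction pattern

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, redacting phone numbers."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": PHONE.sub("[REDACTED]", record.getMessage()),
            # Attached per-request so log lines can be traced end to end.
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("voice")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```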
Metrics Tracked:
API latency percentiles (p50, p95, p99).
Database connection pool utilization.
Call success rate by agent/lead source.
Webhook processing time.
Error rates by endpoint.
Alerting:
PagerDuty for critical errors (>5% failure rate).
Slack notifications for warnings.
Weekly performance reports.
Tools:
Google Cloud Logging for centralized logs.
Custom analytics dashboard in the admin panel.
React Query DevTools for frontend debugging.
Cost Optimization
Optimizing costs is crucial for maintaining a sustainable voice AI platform.
Monthly Cost Breakdown (50K calls):
Cloud Run: ₹8,000 (auto-scaling)
Cloud SQL: ₹12,000 (2 vCPUs, 8GB RAM)
Retell AI: ₹1.50/min × 3 min avg × 50K = ₹2,25,000
Twilio: ₹0.80 per call × 50K calls = ₹40,000
Total: ₹2,85,000/month (₹5.70 per call)
Optimization Techniques:
Aggressive call timeout (max 5 minutes).
Scheduled jobs during low-cost hours.
Database query optimization (40% cost reduction).
Cold storage for old call recordings.
Security Best Practices
Security is paramount in any production system.
JWT authentication with 60-day expiry.
Environment-based secrets (migrating to Secret Manager).
SQL injection prevention via SQLAlchemy ORM.
CORS configuration for the frontend only.
Webhook signature verification.
Rate limiting (100 req/min per user).
Input validation with Pydantic.
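Of the practices above, webhook signature verification is the one most often skipped. The sketch below shows the generic HMAC-SHA256 pattern with a constant-time comparison; the exact header and digest format are provider-specific, so treat this as an illustration rather than Retell AI's documented scheme.

```python
import hashlib
import hmac

def verify_signature(payload: bytes, signature: str, secret: str) -> bool:
    """Verify a hex HMAC-SHA256 digest of the raw request body.

    Generic sketch -- the real header name and encoding depend on the
    webhook provider's documentation.
    """
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(expected, signature)
```

Verification must run against the raw bytes of the body, before any JSON parsing, or re-serialization differences will break the digest.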
Lessons Learned
Design for Failure: Implement retry logic everywhere.
Monitor Early: Instrument your system from day one.
Test at Scale: Load test before launch.
Keep it Simple: Avoid premature optimization.
Document Everything: API docs and architecture diagrams are crucial.
Conclusion
Building a production-ready voice AI system requires a robust and well-architected platform. Our experience processing over 50,000 monthly calls has proven the effectiveness of a stack based on FastAPI, PostgreSQL, and Retell AI. Crucially, a proactive approach to monitoring and observability is non-negotiable for ensuring reliability and performance.
Want to get started quickly? We're working on open-sourcing our architecture template to help you accelerate your development. Stay tuned for more updates!
About ConversAI Labs Team
ConversAI Labs specializes in AI voice agents for customer-facing businesses.