AI Agent Infrastructure: A Practical Guide to Hosting 20-100 Agents in Production
Running a single AI agent on your laptop is trivial. Running 20-100 agents in production — with persistent memory, reliable uptime, GPU-accelerated inference, and cost controls — is an infrastructure problem that most teams underestimate until they hit it.
The question keeps surfacing on Reddit, Discord, and engineering Slack channels: how are you hosting your AI agents? The answers range from “AWS EC2 and prayer” to “I just use the API” to “we built our own orchestration layer and it took three months.” None of these are satisfying for a team that needs agents running reliably in production, talking to external services, and scaling without a dedicated platform engineering team.
This guide covers the infrastructure decisions you face when moving from prototype to production with AI agents. We break down GPU requirements by agent count, walk through the critical infrastructure components, compare self-hosted versus managed approaches with real pricing in INR, and lay out exactly what production-grade agent hosting looks like.

Why AI Agent Infrastructure Is Different
Traditional web applications serve requests: a user hits an endpoint, the server responds, the connection closes. AI agents are fundamentally different. They maintain state across interactions. They run long-lived processes. They make decisions that trigger chains of downstream actions — API calls, database writes, file operations, tool invocations. A single agent workflow might run for minutes or hours, consuming GPU cycles intermittently while holding memory and network connections throughout.
This creates a distinct set of infrastructure requirements:
- GPU compute for inference. If you are running open-source models (Llama, Mistral, DeepSeek) locally rather than calling external APIs, each agent’s reasoning step requires GPU-accelerated inference. Latency matters: an agent waiting 3 seconds per inference call in a 15-step chain is already at 45 seconds of pure inference time.
- Persistent memory and storage. Agents need to remember context across sessions. This means vector databases for semantic retrieval, key-value stores for session state, and persistent disk for logs and artifacts. A typical production agent stack combines ChromaDB or Qdrant for embeddings with Redis for fast state lookups and NVMe storage for everything else.
- API endpoints and networking. Agents talk to external services: LLM APIs, databases, SaaS tools, internal microservices. Each agent needs reliable outbound networking, and your infrastructure needs an API gateway to manage inbound requests, rate limiting, and authentication.
- Monitoring and observability. When an agent fails silently, the downstream effects cascade. You need health checks, structured logging, trace IDs across agent chains, and alerting that catches both crashes and logical failures (agent stuck in a loop, hallucinated tool call, cost runaway).
- Auto-restart and fault tolerance. Agents crash. Models OOM. API rate limits get hit. Production agent infrastructure must handle restarts gracefully, preserve state through failures, and retry with backoff without duplicating side effects.
- Scaling. Your workload is bursty. You might need 5 agents during off-hours and 80 during peak. The infrastructure needs to scale without manual intervention and, critically, scale back down to avoid burning money on idle GPUs.
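The "retry with backoff without duplicating side effects" requirement above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `TransientError`, `call_with_retry`, and the in-memory `completed` store are hypothetical names, and in a real deployment the store would need to live somewhere durable such as Redis.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, rate limits)."""


def call_with_retry(action, idempotency_key, completed, max_attempts=5,
                    base_delay=0.5, sleep=time.sleep):
    """Run action() once per idempotency_key, retrying transient failures.

    `completed` is a dict-like store of finished keys. It is an assumption of
    this sketch; a real system would keep it in durable storage so it
    survives an agent restart.
    """
    if idempotency_key in completed:
        # The side effect already happened before a crash or retry: reuse the result.
        return completed[idempotency_key]
    for attempt in range(max_attempts):
        try:
            result = action()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: 0.5s, 1s, 2s, ... plus up to 10% noise.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
        else:
            completed[idempotency_key] = result
            return result
```

The key only protects against repeats that this process can see. If the downstream service performs the action but the response is lost, the service itself has to recognize the key to avoid a duplicate.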
Infrastructure Requirements by Agent Count
Not every agent deployment needs an 8x H100 cluster. The right infrastructure depends on your agent count, whether you are running local inference or calling external APIs, and how compute-intensive each agent’s workload is.
1-5 Agents: Single Server
At this scale, you are likely running a handful of specialized agents — a customer support bot, a code review agent, a data pipeline monitor. The infrastructure is straightforward.
| Component | Recommendation |
|---|---|
| GPU | 1x L4 (24GB) or 1x RTX 4090 (24GB) |
| RAM | 32-64 GB DDR5 |
| Storage | 500 GB NVMe SSD |
| CPU | 8-16 cores (AMD EPYC or Intel Xeon) |
| Network | 1 Gbps |
| Vector DB | ChromaDB or Qdrant (single instance) |
| Orchestration | PM2, systemd, or Docker Compose |
Monthly cost estimate (ZenoCloud managed): INR 30,000-50,000 for an L4 GPU server with managed support.
This configuration comfortably handles 7B models for local inference at 16-bit precision, and 13B models with 8-bit or 4-bit quantization (13B weights at 16 bits are about 26 GB, more than the card's 24 GB). If your agents are calling OpenAI or Anthropic APIs instead of running local models, you can skip the GPU entirely and use a standard compute instance, but your per-agent costs shift from infrastructure to API tokens.
When to upgrade: When inference latency starts affecting agent chain completion time, when you are queuing agent tasks because the single GPU is saturated, or when your vector database starts hitting memory limits.
5-20 Agents: Multi-GPU Server
This is where most serious agent deployments land. You are running multiple agent types — some doing inference, some orchestrating workflows, some handling retrieval-augmented generation (RAG) pipelines. The agents are talking to each other and to external services.
| Component | Recommendation |
|---|---|
| GPU | 2-4x A100 80GB or 2x H100 SXM |
| RAM | 128-256 GB DDR5 |
| Storage | 2-4 TB NVMe SSD (RAID 0 for throughput; it has no redundancy, so keep agent state backed up elsewhere) |
| CPU | 32-64 cores |
| Network | 10 Gbps with dedicated VLAN |
| Vector DB | Qdrant or Weaviate (clustered) |
| Orchestration | Kubernetes or Docker Swarm |
| Load Balancer | Nginx or Traefik |
Monthly cost estimate (ZenoCloud managed): INR 1,50,000-4,00,000 depending on GPU configuration.
At this scale, you need proper orchestration. Docker Compose is no longer sufficient. Kubernetes gives you pod-level resource limits, automatic restarts, rolling deployments, and horizontal pod autoscaling. Each agent runs in its own container with defined CPU, memory, and GPU resource requests.
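As a rough sketch of what "defined CPU, memory, and GPU resource requests" looks like, the Python below builds a Kubernetes Deployment manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts. The function name, agent name, image URL, and default sizes are placeholders chosen for illustration. The `nvidia.com/gpu` resource assumes the NVIDIA device plugin (or GPU operator) is installed on the cluster.

```python
import json


def agent_deployment(name, image, cpu="2", memory="8Gi", gpus=1, replicas=1):
    """Build a Kubernetes Deployment manifest (as a dict) for one agent type."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    if gpus:
        # GPUs are an extended resource and are specified under limits.
        # Assumes the NVIDIA device plugin exposes nvidia.com/gpu.
        resources["limits"]["nvidia.com/gpu"] = gpus
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Restart the container automatically if the agent crashes.
                    "restartPolicy": "Always",
                    "containers": [
                        {"name": name, "image": image, "resources": resources}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical agent name and image, for illustration only.
    manifest = agent_deployment("rag-agent", "registry.example.com/rag-agent:1.0")
    print(json.dumps(manifest, indent=2))
```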
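The A100 80GB is the sweet spot here. A single card's 80 GB of HBM holds a 30B parameter model at 16-bit precision (about 60 GB of weights); a 70B model (about 140 GB at 16 bits) needs quantization or tensor parallelism across two cards. The same memory also lets you serve multiple smaller models simultaneously using vLLM or text-generation-inference (TGI) with model multiplexing. Two A100s can comfortably serve 10-15 agents doing local inference with sub-second latency.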
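A quick way to sanity-check whether a model fits a given GPU is to estimate the memory its weights need: parameter count times bytes per parameter. The sketch below does only that. The 20% headroom reserved for KV cache and activations is an assumption of this example rather than a measured figure, and real usage depends on context length and batch size.

```python
def weight_memory_gb(params_billion, bytes_per_param=2.0):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param: 2.0 for 16-bit, 1.0 for 8-bit, 0.5 for 4-bit weights.
    """
    return params_billion * bytes_per_param


def fits_on_gpu(params_billion, gpu_memory_gb, bytes_per_param=2.0, headroom=0.2):
    """True if the weights fit while keeping `headroom` of VRAM free.

    The default headroom of 20% is an illustrative assumption.
    """
    usable = gpu_memory_gb * (1.0 - headroom)
    return weight_memory_gb(params_billion, bytes_per_param) <= usable


if __name__ == "__main__":
    for params, gpu_gb, bpp in [(7, 24, 2.0), (13, 24, 2.0), (13, 24, 1.0),
                                (30, 80, 2.0), (70, 80, 2.0), (70, 80, 0.5)]:
        print(params, "B params,", gpu_gb, "GB GPU,", bpp, "bytes/param:",
              fits_on_gpu(params, gpu_gb, bpp))
```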
When to upgrade: When you are running more than 4 GPUs in a single node and hitting PCIe bandwidth limits, when inter-agent communication latency matters, or when you need fault isolation between agent groups.
20-100 Agents: Multi-Node Cluster
This is production at scale. You are running a fleet of agents across multiple servers, handling hundreds or thousands of concurrent agent sessions, and your infrastructure needs to be as reliable as any production SaaS backend.
| Component | Recommendation |
|---|---|
| GPU | 8-32x H100 SXM across 2-4 nodes |
| RAM | 256-512 GB per node |
| Storage | 8-16 TB NVMe per node + shared NFS/S3 |
| CPU | 64-128 cores per node |
| Network | 25-100 Gbps InfiniBand or RoCE |
| Vector DB | Qdrant cluster or Pinecone |
| Orchestration | Kubernetes with GPU operator |
| Load Balancer | Dedicated L4/L7 load balancer |
| Monitoring | Prometheus + Grafana + custom agent metrics |
| Logging | ELK stack or Loki |
Monthly cost estimate (ZenoCloud managed): INR 8,00,000-25,00,000+ depending on cluster size.
At this scale, networking becomes the bottleneck, not compute. Agents sharing context, agents triggering other agents, agents reading from shared vector stores — all of this generates internal network traffic. Dedicated 25Gbps+ networking between nodes is not optional. NVLink between GPUs within a node handles multi-GPU inference, but inter-node communication needs InfiniBand or RoCE to avoid latency spikes.
You also need a proper service mesh for agent-to-agent communication, distributed tracing to debug multi-agent workflows, and cost attribution to understand which agents are consuming the most resources.
Key Infrastructure Components
Regardless of scale, every production AI agent deployment needs these five components.
1. GPU Compute Layer
The GPU layer handles model inference — the core reasoning step where your agent processes input and generates output. Your choices:
- Local inference with open-source models. Run Llama 3, Mistral, DeepSeek, or Qwen on your own GPUs using vLLM, TGI, or Ollama. Lower per-token cost, full data privacy, but you manage the infrastructure.
- API-based inference. Call OpenAI, Anthropic, Google, or other providers. Zero infrastructure for inference, but higher per-token cost and vendor dependency.
- Hybrid. Use local inference for high-volume, latency-sensitive agents and API calls for complex reasoning tasks that benefit from frontier models. This is what most production deployments converge on.
For local inference, use vLLM as your serving engine. It supports continuous batching, PagedAttention for memory efficiency, and tensor parallelism across multiple GPUs. A single H100 running vLLM can serve 50-100 concurrent inference requests for a 7B model with sub-200ms latency.
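The hybrid option described above comes down to a routing decision per task. Here is a minimal sketch of one possible policy: send tasks flagged as needing a frontier model to an external API, keep everything else local, and overflow to the API when the local queue is saturated. The `Task` fields, the `choose_backend` name, and the queue threshold of 32 are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    # Set by the caller for complex reasoning that benefits from a frontier model.
    needs_frontier_model: bool = False


def choose_backend(task, local_queue_depth, max_local_queue=32):
    """Return "local" or "api" for a task.

    max_local_queue is an illustrative threshold; tune it from your own
    queue-depth and latency metrics.
    """
    if task.needs_frontier_model:
        return "api"
    if local_queue_depth >= max_local_queue:
        # Local GPUs are saturated: spill over rather than queue indefinitely.
        return "api"
    return "local"
```

A policy like this also makes cost attribution easier, since every API call has an explicit reason attached to it.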
2. Vector Database
Agents with memory need a vector store for semantic retrieval. The vector database stores embeddings of past conversations, documents, and structured knowledge that agents retrieve during reasoning.
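To make "semantic retrieval" concrete, the sketch below shows the core operation a vector database performs: rank stored embeddings by cosine similarity to a query embedding and return the closest entries. It is a brute-force toy in pure Python with made-up three-dimensional vectors. Real vector stores use approximate nearest-neighbor indexes and embeddings with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, memory, k=2):
    """Return the texts of the k stored items most similar to the query.

    memory is a list of (text, embedding) pairs.
    """
    scored = [(cosine_similarity(query, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    # Toy embeddings, invented for illustration.
    memory = [
        ("refund policy", [1.0, 0.0, 0.0]),
        ("shipping times", [0.0, 1.0, 0.0]),
        ("refund status", [0.9, 0.1, 0.0]),
    ]
    print(top_k([1.0, 0.05, 0.0], memory))
```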
For 1-20 agents: Qdrant or ChromaDB running on the same server. Qdrant handles up to 10 million vectors on a single node with sub-10ms query latency.
For 20-100 agents: Qdrant cluster (3+ nodes) or managed Pinecone. At this scale, you need replication for availability and sharding for throughput. Budget INR 15,000-50,000/month for a dedicated vector database cluster.
3. Persistent Storage
Agents generate and consume data: conversation logs, tool outputs, retrieved documents, intermediate reasoning traces. You need three storage tiers:
- Hot storage (NVMe SSD): Agent state, active session data, vector indices. Fast reads, limited capacity.
- Warm storage (SSD): Recent conversation history, cached embeddings, frequently accessed documents.
- Cold storage (S3-compatible object store): Archived logs, training data, compliance records.
Plan for 50-100 GB per agent per month of total storage across all tiers, depending on how chatty your agents are and how much context they retain.
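The planning figure above turns into a capacity estimate with simple arithmetic. This sketch assumes storage grows linearly and that nothing is deleted or compressed, which overstates long-term needs if you archive or expire old data.

```python
def monthly_storage_gb(agent_count, gb_per_agent_low=50, gb_per_agent_high=100):
    """Return the (low, high) range of new storage per month, in GB."""
    return agent_count * gb_per_agent_low, agent_count * gb_per_agent_high


def cumulative_storage_tb(agent_count, months, gb_per_agent=100):
    """Total storage after `months`, in TB, assuming no data is deleted."""
    return agent_count * gb_per_agent * months / 1000


if __name__ == "__main__":
    print(monthly_storage_gb(20))
    print(cumulative_storage_tb(20, months=12))
```

For 20 agents this gives 1,000 to 2,000 GB of new data per month, and up to 24 TB after a year at the high end.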
4. API Gateway
Every agent needs to receive requests and call external services. An API gateway handles:
- Request routing to specific agents
- Authentication and authorization
- Rate limiting (critical for cost control on API-based inference)
- Request/response logging
- TLS termination
Nginx, Kong, or Traefik work well. For 20+ agents, use a dedicated API gateway with per-agent rate limits and API key management.
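Per-agent rate limiting is commonly implemented with a token bucket. The sketch below is one way to write it in Python, with a separate bucket per agent ID. The class names and the injectable `clock` parameter are choices made for this example; gateways such as Nginx, Kong, and Traefik ship their own rate-limiting mechanisms that you would normally configure instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerAgentLimiter:
    """Keeps an independent token bucket for each agent ID."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, agent_id):
        if agent_id not in self._buckets:
            self._buckets[agent_id] = TokenBucket(
                self._rate, self._capacity, self._clock)
        return self._buckets[agent_id].allow()
```

Because each agent has its own bucket, one agent stuck in a loop exhausts only its own allowance and leaves the others unaffected.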
5. Monitoring and Observability
Standard infrastructure monitoring (CPU, memory, disk, network) is necessary but not sufficient. Agent-specific monitoring includes:
- Agent health checks. Is each agent responding? Are response times within SLA?
- Inference metrics. Tokens per second, queue depth, GPU utilization, model latency percentiles (p50, p95, p99).
- Cost tracking. Per-agent spend on GPU hours and API tokens. This is where budgets blow up — a single misbehaving agent making recursive API calls can burn through thousands of rupees in hours.
- Quality metrics. Task completion rate, tool call success rate, hallucination detection (via automated evaluation).
- Trace IDs. When Agent A triggers Agent B which calls Tool C, you need an end-to-end trace to debug failures.
Use Prometheus for metrics, Grafana for dashboards, and structured JSON logging with a correlation ID propagated across the agent chain. LangSmith and Langfuse are purpose-built for LLM observability and integrate well with LangChain and LlamaIndex agent frameworks.
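Structured JSON logging with a propagated correlation ID can be sketched with the Python standard library alone. In this example `configure_logger` and `start_trace` are hypothetical helper names. How the ID travels between agents (an HTTP header, a message field) is left open, since it depends on your transport.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the workflow currently being processed.
correlation_id = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def configure_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def start_trace(existing_id=None):
    """Start a new trace, or join one whose ID arrived from an upstream agent."""
    cid = existing_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


if __name__ == "__main__":
    log = configure_logger("agent_a")
    start_trace()
    log.info("calling tool %s", "search")
```

Agent A would call `start_trace()` and pass the returned ID along with its request. Agent B would call `start_trace(existing_id=...)` with that value, so both agents log under the same ID.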
Self-Hosted vs Managed: Cost Comparison
The fundamental question every AI team faces: do you build and manage the infrastructure yourself, or pay someone to handle it?
Self-Managed on Cloud (AWS/GCP)
You provision VMs, attach GPUs, install drivers, configure networking, set up monitoring, and handle all maintenance.
| Cost Component (20-agent deployment) | Monthly Cost (INR) |
|---|---|
| 4x A100 GPU instances (AWS p4d.24xlarge, on-demand) | ~12,00,000 |
| Storage (2TB EBS gp3 + S3) | ~15,000 |
| Networking (VPC, NAT, data transfer) | ~20,000 |
| Load balancer (ALB) | ~8,000 |
| Monitoring (CloudWatch + Grafana) | ~10,000 |
| DevOps engineer time (50% allocation) | ~1,25,000 |
| Total | ~13,78,000 |
Reserved instances or spot instances can reduce the GPU cost by 30-60%, but spot instances get reclaimed — not ideal for long-running agent processes.
Self-Hosted on Bare Metal
You lease dedicated servers, install everything from OS upward, and manage hardware-level concerns.
| Cost Component (20-agent deployment) | Monthly Cost (INR) |
|---|---|
| 2x bare metal servers with 2x A100 each (colocation) | ~4,00,000 |
| NVMe storage (4TB) | Included |
| Networking (10 Gbps dedicated) | ~25,000 |
| Colocation fees (power, rack, cooling) | ~40,000 |
| DevOps/SysAdmin time (25% allocation) | ~62,500 |
| Total | ~5,27,500 |
Significantly cheaper than cloud, but you own the hardware risk and need staff who can handle bare metal operations.
Managed GPU Infrastructure (ZenoCloud)
You specify what you need. We provision, configure, secure, monitor, and maintain the infrastructure. You deploy your agents and focus on product.
| Cost Component (20-agent deployment) | Monthly Cost (INR) |
|---|---|
| 4x A100 80GB managed GPU cluster | ~4,00,000 |
| Managed storage (2TB NVMe + object storage) | Included |
| Managed networking (10 Gbps, VLAN, firewall) | Included |
| Load balancing + SSL | Included |
| 24/7 monitoring + alerting | Included |
| OS patching, driver updates, security | Included |
| DevOps engineer time | 0 (ZenoCloud handles it) |
| Total | ~4,00,000 |
The math is straightforward. On the figures above, managed GPU infrastructure costs roughly 30% of the self-managed cloud total (INR 4,00,000 against 13,78,000) and about 75% of the bare metal total (INR 4,00,000 against 5,27,500) once you factor in the DevOps time you are not spending. For teams without a dedicated infrastructure engineer, the managed route eliminates an entire category of operational risk.
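The comparison can be reproduced directly from the three tables. The short script below sums each column and computes the managed total as a fraction of the other two. The figures are copied from the tables above; rows marked "Included" are treated as zero.

```python
# Monthly costs in INR for a 20-agent deployment, taken from the tables above.
CLOUD = {
    "gpu_instances": 1_200_000,
    "storage": 15_000,
    "networking": 20_000,
    "load_balancer": 8_000,
    "monitoring": 10_000,
    "devops_time": 125_000,
}
BARE_METAL = {
    "servers": 400_000,
    "networking": 25_000,
    "colocation": 40_000,
    "devops_time": 62_500,
}
MANAGED = {"gpu_cluster": 400_000}


def total(costs):
    """Sum all line items in one cost table."""
    return sum(costs.values())


def ratio(option, baseline):
    """Cost of `option` as a fraction of `baseline`."""
    return total(option) / total(baseline)


if __name__ == "__main__":
    print("cloud total:", total(CLOUD))
    print("bare metal total:", total(BARE_METAL))
    print("managed vs cloud:", round(ratio(MANAGED, CLOUD), 2))
    print("managed vs bare metal:", round(ratio(MANAGED, BARE_METAL), 2))
```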

Platform Options: Where to Host AI Agents
AWS + Self-Manage
Best for teams with existing AWS infrastructure and strong DevOps capability. Use p4d (A100) or p5 (H100) instances, EKS for Kubernetes, and SageMaker endpoints for model serving. Expensive, but deep ecosystem integration.
Pros: Broadest ecosystem, global availability, mature tooling. Cons: Highest GPU costs, complex networking, easy to over-provision.
GCP Vertex AI
Best for teams already on Google Cloud who want semi-managed ML infrastructure. Vertex AI handles model serving and scaling, but you still manage agent orchestration.
Pros: Integrated ML platform, TPU availability, managed model endpoints. Cons: Vendor lock-in, limited GPU availability in India, steep learning curve.
ZenoCloud Managed GPU
Best for AI startups and engineering teams that want to focus on building agents, not managing infrastructure. We deploy GPU servers on Indian infrastructure with dedicated networking, manage the entire stack from OS to monitoring, and provide 24/7 support.
Pros: India-resident infrastructure, fully managed, transparent pricing, 24/7 support, no DevOps overhead. Cons: Smaller ecosystem than hyperscalers, India-focused (which is a feature if your users are in India).
Self-Hosted on Bare Metal
Best for teams with specific compliance requirements or very high utilization (70%+ GPU usage, 24/7). You get maximum control and lowest per-hour cost, but you own everything from BIOS firmware to CUDA driver updates.
Pros: Lowest per-unit cost at high utilization, full control, no vendor lock-in. Cons: Highest operational burden, slow scaling, hardware procurement lead times.
Production Considerations
Auto-Restart and Health Checks
Every agent process needs a supervisor. At minimum:
- Container-level restart policies (`restart: unless-stopped` in Docker, `restartPolicy: Always` in Kubernetes)
- Application-level health endpoints that verify the agent can process requests, not just that the process is running
- Liveness probes that detect stuck agents (agent process is alive but inference is timing out)
- Graceful shutdown handling that persists agent state before restart
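One way to detect a stuck agent, where the process is alive but no longer making progress, is a heartbeat: the agent records a timestamp after each completed step, and the health endpoint reports failure when the last heartbeat is too old. The sketch below shows the idea. `HeartbeatMonitor` and its method names are hypothetical, and the timeout should be set from your own inference latency data.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time each agent reported progress."""

    def __init__(self, timeout_sec, clock=time.monotonic):
        self._timeout = timeout_sec
        self._clock = clock
        self._last = {}

    def beat(self, agent_id):
        """Call after each completed inference step or tool call."""
        self._last[agent_id] = self._clock()

    def is_healthy(self, agent_id):
        """False if the agent is unknown or has been silent past the timeout."""
        last = self._last.get(agent_id)
        return last is not None and self._clock() - last <= self._timeout

    def stuck_agents(self):
        """IDs of agents whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(a for a, t in self._last.items()
                      if now - t > self._timeout)
```

A liveness probe can then return a failing status whenever `is_healthy` is false, which lets the container supervisor restart the agent.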
Cost Per Agent
Track the fully loaded cost per agent, including GPU time, API tokens, storage, and networking. Typical monthly ranges for production agents on shared infrastructure:
| Agent Type | Monthly Cost (INR) |
|---|---|
| Light agent (API-only, no local inference) | 2,000-5,000 (mostly API tokens) |
| Medium agent (local 7B inference, RAG) | 15,000-30,000 |
| Heavy agent (local 70B inference, multi-step chains) | 50,000-1,00,000 |
These numbers assume shared infrastructure. A dedicated H100 running a single agent would cost INR 1,50,000/month — which is why multi-tenant GPU serving with vLLM is essential for cost efficiency at scale.
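The effect of sharing a GPU on per-agent cost is simple division, sketched below. It assumes the GPU cost is split evenly across agents, which is a simplification: in practice you would weight each agent's share by measured GPU time.

```python
def gpu_share_inr(gpu_monthly_inr, agents_sharing):
    """Even split of one GPU's monthly cost across the agents using it."""
    if agents_sharing < 1:
        raise ValueError("at least one agent must use the GPU")
    return gpu_monthly_inr / agents_sharing


def fully_loaded_cost_inr(gpu_monthly_inr, agents_sharing, api_tokens_inr=0.0,
                          storage_inr=0.0, networking_inr=0.0):
    """GPU share plus the other per-agent line items, in INR per month."""
    return (gpu_share_inr(gpu_monthly_inr, agents_sharing)
            + api_tokens_inr + storage_inr + networking_inr)


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, "agents per GPU:", fully_loaded_cost_inr(150_000, n))
```

At INR 1,50,000 per month for the GPU, five to ten agents per card gives a GPU share of INR 15,000-30,000 each.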
Scaling Strategies
Vertical scaling (bigger GPU): When your model does not fit in VRAM or inference latency is too high. Move from L4 to A100, or from A100 to H100.
Horizontal scaling (more instances): When you need more concurrent agent sessions. Add more inference replicas behind a load balancer.
Hybrid scaling: Run baseline capacity on reserved/managed infrastructure for predictable cost, and burst to on-demand or spot instances for traffic spikes.
Scale-to-zero: For agents that are not always active, use serverless-style cold starts. The trade-off is 10-30 seconds of model loading time on first request, which is acceptable for batch agents but not for real-time chat.
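Horizontal scaling and scale-to-zero both reduce to one calculation: how many inference replicas does the current load need, within fixed bounds? The function below is a minimal sketch of that rule. The parameter names and defaults are illustrative, and a real autoscaler would also smooth the input signal to avoid flapping.

```python
import math


def desired_replicas(concurrent_sessions, sessions_per_replica,
                     min_replicas=0, max_replicas=8):
    """Replicas needed for the current load, clamped to [min, max].

    min_replicas=0 permits scale-to-zero; set it to 1 or more for
    real-time agents that cannot tolerate a cold start.
    """
    if sessions_per_replica <= 0:
        raise ValueError("sessions_per_replica must be positive")
    needed = math.ceil(concurrent_sessions / sessions_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

For example, `desired_replicas(80, 10)` asks for 8 replicas at peak, and `desired_replicas(0, 10)` returns 0 when no sessions are active.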
Getting Started
If you are running AI agents in production — or planning to — the infrastructure decision is one of the first things to get right. Under-provisioning leads to unreliable agents and frustrated users. Over-provisioning burns money on idle GPUs. The right answer depends on your agent count, inference requirements, and team size.
For teams that want to skip the infrastructure engineering and go straight to deploying agents: talk to ZenoCloud. We will scope your GPU and infrastructure requirements, provision a managed cluster, and have you deploying agents within days, not months.
Start with INR 5,000 in free GPU credits — enough to benchmark your agent workloads on L4, A100, or H100 GPUs and figure out exactly what your production deployment needs. Claim your credits here.