AI Agent Infrastructure: What You Need to Run 20-100 Agents in Production

A practical guide to hosting AI agents at scale. GPU requirements, memory, storage, networking, and the managed vs self-hosted decision for production AI agent workloads.

AI Agent Infrastructure: A Practical Guide to Hosting 20-100 Agents in Production

Running a single AI agent on your laptop is trivial. Running 20-100 agents in production — with persistent memory, reliable uptime, GPU-accelerated inference, and cost controls — is an infrastructure problem that most teams underestimate until they hit it.

The question keeps surfacing on Reddit, Discord, and engineering Slack channels: how are you hosting your AI agents? The answers range from “AWS EC2 and prayer” to “I just use the API” to “we built our own orchestration layer and it took three months.” None of these are satisfying for a team that needs agents running reliably in production, talking to external services, and scaling without a dedicated platform engineering team.

This guide covers the infrastructure decisions you face when moving from prototype to production with AI agents. We break down GPU requirements by agent count, walk through the critical infrastructure components, compare self-hosted versus managed approaches with real pricing in INR, and lay out exactly what production-grade agent hosting looks like.

Why AI Agent Infrastructure Is Different

Traditional web applications serve requests: a user hits an endpoint, the server responds, the connection closes. AI agents are fundamentally different. They maintain state across interactions. They run long-lived processes. They make decisions that trigger chains of downstream actions — API calls, database writes, file operations, tool invocations. A single agent workflow might run for minutes or hours, consuming GPU cycles intermittently while holding memory and network connections throughout.

This creates a distinct set of infrastructure requirements:

  • GPU compute for inference. If you are running open-source models (Llama, Mistral, DeepSeek) locally rather than calling external APIs, each agent’s reasoning step requires GPU-accelerated inference. Latency matters: an agent waiting 3 seconds per inference call in a 15-step chain is already at 45 seconds of pure inference time.

  • Persistent memory and storage. Agents need to remember context across sessions. This means vector databases for semantic retrieval, key-value stores for session state, and persistent disk for logs and artifacts. A typical production agent stack combines ChromaDB or Qdrant for embeddings with Redis for fast state lookups and NVMe storage for everything else.

  • API endpoints and networking. Agents talk to external services — LLM APIs, databases, SaaS tools, internal microservices. Each agent needs reliable outbound networking, and your infrastructure needs an API gateway to manage inbound requests, rate limiting, and authentication.

  • Monitoring and observability. When an agent fails silently, the downstream effects cascade. You need health checks, structured logging, trace IDs across agent chains, and alerting that catches both crashes and logical failures (agent stuck in a loop, hallucinated tool call, cost runaway).

  • Auto-restart and fault tolerance. Agents crash. Models OOM. API rate limits get hit. Production agent infrastructure must handle restarts gracefully, preserve state through failures, and retry with backoff without duplicating side effects.

  • Scaling. Your workload is bursty. You might need 5 agents during off-hours and 80 during peak. The infrastructure needs to scale without manual intervention and, critically, scale back down to avoid burning money on idle GPUs.
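The fault-tolerance point above (retry with backoff without duplicating side effects) can be sketched as follows. This is a minimal illustration, not a production implementation: `TransientError`, `retry_with_backoff`, and the in-memory `completed` dictionary are hypothetical names, and a real deployment would keep completed keys in a durable store such as Redis so they survive a restart.

```python
import random
import time
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    action: Callable[[str], str],
    idempotency_key: str,
    completed: Dict[str, str],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `action` once per idempotency key, retrying transient failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # A stored result means the side effect already happened: do not repeat it.
    if idempotency_key in completed:
        return completed[idempotency_key]

    for attempt in range(max_attempts):
        try:
            result = action(idempotency_key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the failure to the supervisor
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
        else:
            completed[idempotency_key] = result
            return result

    raise AssertionError("loop always returns or raises")
```

Passing the idempotency key into `action` lets the downstream service deduplicate as well, which matters when a request succeeded remotely but the response was lost.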

Infrastructure Requirements by Agent Count

Not every agent deployment needs an 8x H100 cluster. The right infrastructure depends on your agent count, whether you are running local inference or calling external APIs, and how compute-intensive each agent’s workload is.

1-5 Agents: Single Server

At this scale, you are likely running a handful of specialized agents — a customer support bot, a code review agent, a data pipeline monitor. The infrastructure is straightforward.

| Component | Recommendation |
|---|---|
| GPU | 1x L4 (24GB) or 1x RTX 4090 (24GB) |
| RAM | 32-64 GB DDR5 |
| Storage | 500 GB NVMe SSD |
| CPU | 8-16 cores (AMD EPYC or Intel Xeon) |
| Network | 1 Gbps |
| Vector DB | ChromaDB or Qdrant (single instance) |
| Orchestration | PM2, systemd, or Docker Compose |

Monthly cost estimate (ZenoCloud managed): INR 30,000-50,000 for an L4 GPU server with managed support.

This configuration comfortably handles 7B parameter models at 16-bit precision and 13B models with 8-bit or 4-bit quantization (13B weights at 16-bit need about 26 GB, more than the 24 GB on these cards). If your agents are calling OpenAI or Anthropic APIs instead of running local models, you can skip the GPU entirely and use a standard compute instance, but your per-agent costs shift from infrastructure to API tokens.

When to upgrade: When inference latency starts affecting agent chain completion time, when you are queuing agent tasks because the single GPU is saturated, or when your vector database starts hitting memory limits.

5-20 Agents: Multi-GPU Server

This is where most serious agent deployments land. You are running multiple agent types — some doing inference, some orchestrating workflows, some handling retrieval-augmented generation (RAG) pipelines. The agents are talking to each other and to external services.

| Component | Recommendation |
|---|---|
| GPU | 2-4x A100 80GB or 2x H100 SXM |
| RAM | 128-256 GB DDR5 |
| Storage | 2-4 TB NVMe SSD (RAID 0 for throughput) |
| CPU | 32-64 cores |
| Network | 10 Gbps with dedicated VLAN |
| Vector DB | Qdrant or Weaviate (clustered) |
| Orchestration | Kubernetes or Docker Swarm |
| Load Balancer | Nginx or Traefik |

Monthly cost estimate (ZenoCloud managed): INR 1,50,000-4,00,000 depending on GPU configuration.

At this scale, you need proper orchestration. Docker Compose is no longer sufficient. Kubernetes gives you pod-level resource limits, automatic restarts, rolling deployments, and horizontal pod autoscaling. Each agent runs in its own container with defined CPU, memory, and GPU resource requests.

The A100 80GB is the sweet spot here. Its 80GB of HBM lets you run 30B-class models at 16-bit precision on a single card, run 70B models when they are quantized or split across two cards with tensor parallelism (70B weights at 16-bit need about 140 GB), or serve multiple smaller models simultaneously using vLLM or text-generation-inference (TGI) with model multiplexing. Two A100s can comfortably serve 10-15 agents doing local inference with sub-second latency.
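A quick way to sanity-check whether a model fits on a given card is to estimate the memory its weights need. The sketch below is a rough rule of thumb: the helper names are hypothetical, and the 20% headroom reserved for KV cache and activations is an assumption that varies with context length and batch size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_on_gpu(
    params_billion: float,
    precision: str,
    vram_gb: float,
    headroom: float = 0.2,  # assumed fraction of VRAM kept free for KV cache
) -> bool:
    """True if the weights fit while leaving `headroom` of the VRAM unused."""
    return weight_memory_gb(params_billion, precision) <= vram_gb * (1 - headroom)


if __name__ == "__main__":
    for size, precision, vram in [
        (7, "fp16", 24),   # 7B on an L4 or RTX 4090
        (13, "fp16", 24),  # 13B at 16-bit on the same card
        (13, "int8", 24),  # 13B quantized to 8-bit
        (70, "fp16", 80),  # 70B at 16-bit on one A100 80GB
        (70, "int4", 80),  # 70B quantized to 4-bit
    ]:
        verdict = "fits" if fits_on_gpu(size, precision, vram) else "does not fit"
        print(f"{size}B {precision} on {vram} GB: {verdict}")
```

Under these assumptions a 7B model at 16-bit fits on a 24 GB card, while a 70B model only fits on a single 80 GB card once it is quantized.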

When to upgrade: When you are running more than 4 GPUs in a single node and hitting PCIe bandwidth limits, when inter-agent communication latency matters, or when you need fault isolation between agent groups.

20-100 Agents: Multi-Node Cluster

This is production at scale. You are running a fleet of agents across multiple servers, handling hundreds or thousands of concurrent agent sessions, and your infrastructure needs to be as reliable as any production SaaS backend.

| Component | Recommendation |
|---|---|
| GPU | 8-32x H100 SXM across 2-4 nodes |
| RAM | 256-512 GB per node |
| Storage | 8-16 TB NVMe per node + shared NFS/S3 |
| CPU | 64-128 cores per node |
| Network | 25-100 Gbps InfiniBand or RoCE |
| Vector DB | Qdrant cluster or Pinecone |
| Orchestration | Kubernetes with GPU operator |
| Load Balancer | Dedicated L4/L7 load balancer |
| Monitoring | Prometheus + Grafana + custom agent metrics |
| Logging | ELK stack or Loki |

Monthly cost estimate (ZenoCloud managed): INR 8,00,000-25,00,000+ depending on cluster size.

At this scale, networking becomes the bottleneck, not compute. Agents sharing context, agents triggering other agents, agents reading from shared vector stores — all of this generates internal network traffic. Dedicated 25Gbps+ networking between nodes is not optional. NVLink between GPUs within a node handles multi-GPU inference, but inter-node communication needs InfiniBand or RoCE to avoid latency spikes.

You also need a proper service mesh for agent-to-agent communication, distributed tracing to debug multi-agent workflows, and cost attribution to understand which agents are consuming the most resources.

Key Infrastructure Components

Regardless of scale, every production AI agent deployment needs these five components.

1. GPU Compute Layer

The GPU layer handles model inference — the core reasoning step where your agent processes input and generates output. Your choices:

  • Local inference with open-source models. Run Llama 3, Mistral, DeepSeek, or Qwen on your own GPUs using vLLM, TGI, or Ollama. Lower per-token cost, full data privacy, but you manage the infrastructure.
  • API-based inference. Call OpenAI, Anthropic, Google, or other providers. Zero infrastructure for inference, but higher per-token cost and vendor dependency.
  • Hybrid. Use local inference for high-volume, latency-sensitive agents and API calls for complex reasoning tasks that benefit from frontier models. This is what most production deployments converge on.
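The hybrid approach comes down to a routing decision per request. The sketch below shows one possible rule set; `InferenceRequest`, its fields, and `choose_backend` are hypothetical names, and the privacy rule is an added assumption based on the data-privacy benefit of local inference noted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRequest:
    """Hypothetical description of one agent reasoning step."""
    agent_id: str
    latency_sensitive: bool
    needs_frontier_reasoning: bool
    contains_private_data: bool = False


def choose_backend(request: InferenceRequest) -> str:
    """Return "local" (own GPUs) or "api" (external provider)."""
    # Assumed policy: private data never leaves our own infrastructure.
    if request.contains_private_data:
        return "local"
    # Complex reasoning goes to a frontier model unless latency is critical.
    if request.needs_frontier_reasoning and not request.latency_sensitive:
        return "api"
    # High-volume and latency-sensitive work stays on local inference.
    return "local"


if __name__ == "__main__":
    chat = InferenceRequest("support-bot", latency_sensitive=True,
                            needs_frontier_reasoning=False)
    planner = InferenceRequest("planner", latency_sensitive=False,
                               needs_frontier_reasoning=True)
    print(chat.agent_id, "->", choose_backend(chat))
    print(planner.agent_id, "->", choose_backend(planner))
```

Keeping the policy in one small function makes it easy to log every routing decision and to revisit the rules as token prices or local model quality change.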

For local inference, use vLLM as your serving engine. It supports continuous batching, PagedAttention for memory efficiency, and tensor parallelism across multiple GPUs. A single H100 running vLLM can serve 50-100 concurrent inference requests for a 7B model with sub-200ms latency.

2. Vector Database

Agents with memory need a vector store for semantic retrieval. The vector database stores embeddings of past conversations, documents, and structured knowledge that agents retrieve during reasoning.
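To make "semantic retrieval" concrete, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It is a brute-force illustration with made-up three-dimensional vectors; production vector databases use approximate nearest-neighbor indexes rather than scanning every vector, and real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    memory: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in memory.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy "memories" with invented embeddings.
    memory = {
        "refund policy": [1.0, 0.0, 0.0],
        "shipping times": [0.0, 1.0, 0.0],
        "refund steps": [0.9, 0.1, 0.0],
    }
    for key, score in top_k([1.0, 0.0, 0.0], memory, k=2):
        print(f"{key}: {score:.3f}")
```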

For 1-20 agents: Qdrant or ChromaDB running on the same server. Qdrant handles up to 10 million vectors on a single node with sub-10ms query latency.

For 20-100 agents: Qdrant cluster (3+ nodes) or managed Pinecone. At this scale, you need replication for availability and sharding for throughput. Budget INR 15,000-50,000/month for a dedicated vector database cluster.

3. Persistent Storage

Agents generate and consume data: conversation logs, tool outputs, retrieved documents, intermediate reasoning traces. You need three storage tiers:

  • Hot storage (NVMe SSD): Agent state, active session data, vector indices. Fast reads, limited capacity.
  • Warm storage (SSD): Recent conversation history, cached embeddings, frequently accessed documents.
  • Cold storage (S3-compatible object store): Archived logs, training data, compliance records.

Plan for 50-100 GB per agent per month of total storage across all tiers, depending on how chatty your agents are and how much context they retain.
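The planning figure above turns into a capacity estimate with simple multiplication. The helper below is a hypothetical sketch that assumes data accumulates for the whole retention window and nothing is deleted or compressed in the meantime.

```python
from typing import Tuple


def storage_plan_gb(
    agent_count: int,
    months_retained: int,
    gb_per_agent_month: Tuple[float, float] = (50.0, 100.0),
) -> Tuple[float, float]:
    """Return (low, high) total storage in GB across all tiers."""
    low, high = gb_per_agent_month
    agent_months = agent_count * months_retained
    return agent_months * low, agent_months * high


if __name__ == "__main__":
    low, high = storage_plan_gb(agent_count=20, months_retained=6)
    print(f"20 agents, 6 months retained: {low / 1000:.0f}-{high / 1000:.0f} TB")
```

For 20 agents with six months of retention, this gives roughly 6-12 TB, most of which can sit in cold object storage.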

4. API Gateway

Every agent needs to receive requests and call external services. An API gateway handles:

  • Request routing to specific agents
  • Authentication and authorization
  • Rate limiting (critical for cost control on API-based inference)
  • Request/response logging
  • TLS termination

Nginx, Kong, or Traefik work well. For 20+ agents, use a dedicated API gateway with per-agent rate limits and API key management.
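Per-agent rate limiting is commonly implemented as a token bucket keyed by agent. The sketch below shows the idea in-process; `TokenBucket` and `PerAgentLimiter` are hypothetical names, and a real gateway would normally use its built-in rate-limiting feature or a shared store so limits hold across gateway instances.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerAgentLimiter:
    """Keeps one independent bucket per agent ID."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, agent_id: str) -> bool:
        bucket = self.buckets.get(agent_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[agent_id] = bucket
        return bucket.allow()
```

Because each agent has its own bucket, one agent stuck in a loop exhausts only its own allowance and the others keep working.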

5. Monitoring and Observability

Standard infrastructure monitoring (CPU, memory, disk, network) is necessary but not sufficient. Agent-specific monitoring includes:

  • Agent health checks. Is each agent responding? Are response times within SLA?
  • Inference metrics. Tokens per second, queue depth, GPU utilization, model latency percentiles (p50, p95, p99).
  • Cost tracking. Per-agent spend on GPU hours and API tokens. This is where budgets blow up — a single misbehaving agent making recursive API calls can burn through thousands of rupees in hours.
  • Quality metrics. Task completion rate, tool call success rate, hallucination detection (via automated evaluation).
  • Trace IDs. When Agent A triggers Agent B which calls Tool C, you need an end-to-end trace to debug failures.

Use Prometheus for metrics, Grafana for dashboards, and structured JSON logging with a correlation ID propagated across the agent chain. LangSmith and Langfuse are purpose-built for LLM observability and integrate well with LangChain and LlamaIndex agent frameworks.
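Structured JSON logging with a propagated correlation ID can be sketched with the Python standard library alone. The names `correlation_id`, `JsonFormatter`, `build_logger`, and `run_chain` are illustrative; across process boundaries the ID has to travel explicitly, for example in a request header or message metadata.

```python
import contextvars
import json
import logging
import sys
import uuid
from typing import TextIO

# Holds the ID of the agent chain currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def run_chain(logger: logging.Logger) -> str:
    """Simulate Agent A handing work to Agent B under one correlation ID."""
    chain_id = uuid.uuid4().hex
    correlation_id.set(chain_id)
    logger.info("agent_a: received task")
    logger.info("agent_b: calling tool_c")
    return chain_id


if __name__ == "__main__":
    run_chain(build_logger("agents", sys.stdout))
```

Filtering logs on a single `correlation_id` value then returns every step of one agent chain in order, regardless of which agent emitted it.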

Self-Hosted vs Managed: Cost Comparison

The fundamental question every AI team faces: do you build and manage the infrastructure yourself, or pay someone to handle it?

Self-Managed on Cloud (AWS/GCP)

You provision VMs, attach GPUs, install drivers, configure networking, set up monitoring, and handle all maintenance.

| Cost Component (20-agent deployment) | Monthly Cost (INR) |
|---|---|
| 4x A100 GPU instances (AWS p4d.24xlarge, on-demand) | ~12,00,000 |
| Storage (2TB EBS gp3 + S3) | ~15,000 |
| Networking (VPC, NAT, data transfer) | ~20,000 |
| Load balancer (ALB) | ~8,000 |
| Monitoring (CloudWatch + Grafana) | ~10,000 |
| DevOps engineer time (50% allocation) | ~1,25,000 |
| Total | ~13,78,000 |

Reserved instances or spot instances can reduce the GPU cost by 30-60%, but spot instances get reclaimed — not ideal for long-running agent processes.

Self-Hosted on Bare Metal

You lease dedicated servers, install everything from OS upward, and manage hardware-level concerns.

| Cost Component (20-agent deployment) | Monthly Cost (INR) |
|---|---|
| 2x bare metal servers with 2x A100 each (colocation) | ~4,00,000 |
| NVMe storage (4TB) | Included |
| Networking (10 Gbps dedicated) | ~25,000 |
| Colocation fees (power, rack, cooling) | ~40,000 |
| DevOps/SysAdmin time (25% allocation) | ~62,500 |
| Total | ~5,27,500 |

Significantly cheaper than cloud, but you own the hardware risk and need staff who can handle bare metal operations.

Managed GPU Infrastructure (ZenoCloud)

You specify what you need. We provision, configure, secure, monitor, and maintain the infrastructure. You deploy your agents and focus on product.

| Cost Component (20-agent deployment) | Monthly Cost (INR) |
|---|---|
| 4x A100 80GB managed GPU cluster | ~4,00,000 |
| Managed storage (2TB NVMe + object storage) | Included |
| Managed networking (10 Gbps, VLAN, firewall) | Included |
| Load balancing + SSL | Included |
| 24/7 monitoring + alerting | Included |
| OS patching, driver updates, security | Included |
| DevOps engineer time | 0 (ZenoCloud handles it) |
| Total | ~4,00,000 |

The math is straightforward. Using the totals above, managed GPU infrastructure comes to roughly 30% of the self-managed hyperscaler bill (about 70% less) and roughly 75% of the bare metal total once you factor in the DevOps time you are not spending. For teams without a dedicated infrastructure engineer, the managed route also eliminates an entire category of operational risk.
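The totals in the three tables can be checked with a few lines of arithmetic. The figures below are copied from the tables (in INR per month); the dictionary keys and helper names are illustrative.

```python
from typing import Dict

CLOUD: Dict[str, int] = {
    "gpu_instances": 1_200_000,
    "storage": 15_000,
    "networking": 20_000,
    "load_balancer": 8_000,
    "monitoring": 10_000,
    "devops_time": 125_000,
}
BARE_METAL: Dict[str, int] = {
    "servers": 400_000,
    "networking": 25_000,
    "colocation": 40_000,
    "sysadmin_time": 62_500,
}
MANAGED: Dict[str, int] = {"managed_cluster": 400_000}


def total(costs: Dict[str, int]) -> int:
    """Sum all monthly line items."""
    return sum(costs.values())


def percent_of(amount: int, baseline: int) -> int:
    """`amount` as a whole-number percentage of `baseline`."""
    return round(100 * amount / baseline)


if __name__ == "__main__":
    print("cloud total:", total(CLOUD))
    print("bare metal total:", total(BARE_METAL))
    print("managed as % of cloud:", percent_of(total(MANAGED), total(CLOUD)))
    print("managed as % of bare metal:", percent_of(total(MANAGED), total(BARE_METAL)))
```

With these inputs the managed total works out to about 29% of the cloud total and about 76% of the bare metal total.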

Platform Options: Where to Host AI Agents

AWS + Self-Manage

Best for teams with existing AWS infrastructure and strong DevOps capability. Use p4d (A100) or p5 (H100) instances, EKS for Kubernetes, and SageMaker endpoints for model serving. Expensive, but deep ecosystem integration.

Pros: Broadest ecosystem, global availability, mature tooling. Cons: Highest GPU costs, complex networking, easy to over-provision.

GCP Vertex AI

Best for teams already on Google Cloud who want semi-managed ML infrastructure. Vertex AI handles model serving and scaling, but you still manage agent orchestration.

Pros: Integrated ML platform, TPU availability, managed model endpoints. Cons: Vendor lock-in, limited GPU availability in India, steep learning curve.

ZenoCloud Managed GPU

Best for AI startups and engineering teams that want to focus on building agents, not managing infrastructure. We deploy GPU servers on Indian infrastructure with dedicated networking, manage the entire stack from OS to monitoring, and provide 24/7 support.

Pros: India-resident infrastructure, fully managed, transparent pricing, 24/7 support, no DevOps overhead. Cons: Smaller ecosystem than hyperscalers, India-focused (which is a feature if your users are in India).

Self-Hosted on Bare Metal

Best for teams with specific compliance requirements or very high utilization (70%+ GPU usage, 24/7). You get maximum control and lowest per-hour cost, but you own everything from BIOS firmware to CUDA driver updates.

Pros: Lowest per-unit cost at high utilization, full control, no vendor lock-in. Cons: Highest operational burden, slow scaling, hardware procurement lead times.

Production Considerations

Auto-Restart and Health Checks

Every agent process needs a supervisor. At minimum:

  • Container-level restart policies (restart: unless-stopped in Docker, restartPolicy: Always in Kubernetes)
  • Application-level health endpoints that verify the agent can process requests, not just that the process is running
  • Liveness probes that detect stuck agents (agent process is alive but inference is timing out)
  • Graceful shutdown handling that persists agent state before restart
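A liveness check for stuck agents needs to know more than whether the process exists. One possible approach, sketched below with the hypothetical class `StepWatchdog`, is to track how long the current inference step has been running; a health endpoint would return a failure status when `is_live()` is false so the supervisor restarts the container. The 60-second limit is an assumed value.

```python
import time
from typing import Callable, Optional


class StepWatchdog:
    """Reports an agent as stuck when one step runs longer than allowed."""

    def __init__(self, max_step_sec: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_step_sec = max_step_sec
        self.clock = clock
        self.started_at: Optional[float] = None

    def step_started(self) -> None:
        """Call right before an inference or tool call begins."""
        self.started_at = self.clock()

    def step_finished(self) -> None:
        """Call when the step returns, successfully or not."""
        self.started_at = None

    def is_live(self) -> bool:
        # An idle agent with no step in flight is healthy.
        if self.started_at is None:
            return True
        return (self.clock() - self.started_at) <= self.max_step_sec
```

Tracking in-flight steps rather than time since the last success avoids restarting agents that are simply idle during quiet periods.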

Cost Per Agent

Track the fully loaded cost per agent, including GPU time, API tokens, storage, and networking. For a typical production agent running local inference on shared GPU infrastructure:

| Agent Type | Monthly Cost (INR) |
|---|---|
| Light agent (API-only, no local inference) | 2,000-5,000 (mostly API tokens) |
| Medium agent (local 7B inference, RAG) | 15,000-30,000 |
| Heavy agent (local 70B inference, multi-step chains) | 50,000-1,00,000 |

These numbers assume shared infrastructure. A dedicated H100 running a single agent would cost INR 1,50,000/month — which is why multi-tenant GPU serving with vLLM is essential for cost efficiency at scale.
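The effect of sharing a GPU is easy to see numerically. The sketch below splits the INR 1,50,000 monthly H100 figure evenly across agents; the even split and the function name are assumptions, and the result covers GPU cost only, not API tokens, storage, or networking.

```python
def gpu_cost_per_agent_inr(gpu_monthly_inr: float, agents_sharing: int) -> float:
    """GPU cost attributed to each agent when the card is shared evenly."""
    if agents_sharing < 1:
        raise ValueError("agents_sharing must be at least 1")
    return gpu_monthly_inr / agents_sharing


if __name__ == "__main__":
    for agents in (1, 5, 10):
        share = gpu_cost_per_agent_inr(150_000, agents)
        print(f"{agents} agent(s) per H100: INR {share:,.0f} per agent")
```

Five to ten agents per H100 brings the GPU share down to INR 15,000-30,000 per agent, in line with the medium-agent row above.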

Scaling Strategies

Vertical scaling (bigger GPU): When your model does not fit in VRAM or inference latency is too high. Move from L4 to A100, or from A100 to H100.

Horizontal scaling (more instances): When you need more concurrent agent sessions. Add more inference replicas behind a load balancer.

Hybrid scaling: Run baseline capacity on reserved/managed infrastructure for predictable cost, and burst to on-demand or spot instances for traffic spikes.

Scale-to-zero: For agents that are not always active, use serverless-style cold starts. The trade-off is 10-30 seconds of model loading time on first request, which is acceptable for batch agents but not for real-time chat.
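For horizontal scaling, the replica count can be derived from a load metric using the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas x current metric / target metric). The sketch below applies that rule with bounds; using queue depth per replica as the metric and the default bounds are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. average queued requests per replica
    target_metric: float,    # the per-replica value we want to hold
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas, each with 30 queued requests, target of 10 per replica.
    print(desired_replicas(4, 30, 10, max_replicas=20))
    # Load drops to 2 queued requests per replica: scale back down.
    print(desired_replicas(4, 2, 10))
```

Setting `min_replicas` to the reserved baseline and `max_replicas` to the burst ceiling maps directly onto the hybrid strategy described above.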

Getting Started

If you are running AI agents in production — or planning to — the infrastructure decision is one of the first things to get right. Under-provisioning leads to unreliable agents and frustrated users. Over-provisioning burns money on idle GPUs. The right answer depends on your agent count, inference requirements, and team size.

For teams that want to skip the infrastructure engineering and go straight to deploying agents: talk to ZenoCloud. We will scope your GPU and infrastructure requirements, provision a managed cluster, and have you deploying agents within days, not months.

Start with INR 5,000 in free GPU credits — enough to benchmark your agent workloads on L4, A100, or H100 GPUs and figure out exactly what your production deployment needs. Claim your credits here.
