Deploy LLaMA 3 in India: Complete Guide with Indian GPU Pricing
Running LLMs in production is no longer optional for Indian startups and enterprises building AI products. But deploying LLaMA 3 on Indian GPU infrastructure comes with questions that US-centric tutorials never answer: which GPUs are actually available here, what does it cost in INR, and how do you set up an inference stack that handles real traffic?
This guide covers everything from GPU selection to a production-ready vLLM deployment, with actual Indian pricing so you can budget before you provision.

Why Deploy LLaMA in India?
Three factors make self-hosted LLaMA on Indian infrastructure the right call for most teams building AI products for the Indian market.
Data residency. DPDP Act compliance and enterprise procurement policies increasingly require that user data stays within Indian borders. Sending prompts to US-hosted APIs creates a compliance gap that grows harder to close as you scale.
Latency. A round trip to us-east-1 from Mumbai adds 200-300ms of network latency on top of inference time. For real-time applications like chatbots, coding assistants, or document processing pipelines, that overhead is unacceptable. A GPU in Mumbai or Noida cuts network latency to single-digit milliseconds.
Cost. Indian GPU pricing from providers like E2E Networks and ZenoCloud runs 40-60% lower than equivalent US instances from Lambda Labs, CoreWeave, or AWS. Over a 12-month deployment, that difference compounds into lakhs saved.
LLaMA 3 Model Variants and GPU Requirements
Meta’s LLaMA 3 family spans three sizes: 8B and 70B from the original LLaMA 3 release, and 405B, which arrived with LLaMA 3.1. Each has fundamentally different GPU requirements.
LLaMA 3 8B: The Workhorse
The 8B parameter model fits comfortably on a single GPU with 24GB VRAM. In FP16, the model weights consume approximately 16GB, leaving headroom for KV cache and batch processing.
- Minimum GPU: NVIDIA L4 (24GB VRAM)
- Recommended GPU: NVIDIA A10G (24GB VRAM) or L40S (48GB VRAM)
- Indian pricing: Starting at approximately 49 INR/hr on E2E/ZenoCloud infrastructure
- Monthly estimate: ~30,000 INR for a dedicated L4 instance
The 8B model handles most production use cases: customer support bots, document summarization, code generation, and RAG pipelines. For teams starting out, this is where you should begin.
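To budget from an hourly rate, multiply it by the hours the instance stays up. The sketch below assumes an always-on instance and an average month of 730 hours; both are assumptions for illustration, and `monthly_cost_inr` is a hypothetical helper name.

```python
# Rough monthly cost of an always-on instance billed by the hour.
# HOURS_PER_MONTH is an assumption (8,760 hours per year / 12).
HOURS_PER_MONTH = 730


def monthly_cost_inr(hourly_rate_inr: float, hours: float = HOURS_PER_MONTH) -> float:
    """Return the cost in INR of running one instance for `hours` hours."""
    return hourly_rate_inr * hours


if __name__ == "__main__":
    # 49 INR/hr is the starting L4 rate quoted above.
    print(f"L4 at 49 INR/hr: ~{monthly_cost_inr(49):,.0f} INR/month")
```

At 49 INR/hr this comes to roughly 35,770 INR for a full month, somewhat above the ~30,000 INR estimate. That gap suggests the monthly figure reflects a dedicated or committed rate rather than hourly billing, so confirm the billing model with the provider before budgeting.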
LLaMA 3 70B: Enterprise Grade
The 70B model requires serious GPU memory. At FP16, the weights alone consume approximately 140GB of VRAM, which means you need either a single A100 80GB with 4-bit quantization or multiple GPUs with tensor parallelism.
- Minimum GPU: 1x A100 80GB (with AWQ/GPTQ 4-bit quantization)
- Recommended GPU: 2x A100 80GB (160GB total, enough for FP16 weights) or 2x L40S (48GB each, 96GB total, which still requires 8-bit or 4-bit weights)
- Indian pricing: Starting at approximately 220 INR/hr for A100 80GB instances
- Monthly estimate: Custom pricing based on configuration
The 70B model delivers noticeably better reasoning, instruction following, and multilingual performance. If your application requires high-quality outputs for complex tasks, the jump from 8B to 70B is significant.
LLaMA 3 405B: Research and Maximum Quality
The 405B model is Meta’s largest release and demands multi-node GPU clusters. At FP16, the weights require approximately 810GB of VRAM.
- Minimum GPU: 4x H100 80GB (320GB total, with quantization)
- Recommended GPU: 8x H100 80GB (640GB total, with FP8 weights; the ~810GB of FP16 weights does not fit on a single 8-GPU node)
- Indian pricing: Custom pricing, typically arranged as dedicated clusters
- Monthly estimate: Contact ZenoCloud for enterprise GPU cluster pricing
Most teams do not need the 405B model. Unless you are running a research lab or building a product where marginal quality improvements justify 10-20x the cost, the 70B model with good prompting will serve you better.
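The weight figures quoted for each variant follow from a simple rule: parameter count times bytes per weight. A minimal sketch of that arithmetic, counting weights only and ignoring KV cache and runtime overhead (`weight_memory_gb` is a hypothetical helper name):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters x bytes per parameter = gigabytes.
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for name, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
        fp16 = weight_memory_gb(params, 16)
        int4 = weight_memory_gb(params, 4)
        print(f"{name}: ~{fp16:.0f} GB at FP16, ~{int4:.1f} GB at 4-bit")
```

Real quantized checkpoints are usually a little larger than the 4-bit figure because they also store scaling factors, so treat these numbers as a lower bound when sizing a GPU.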
Setting Up the Server
Start with a clean Ubuntu 22.04 LTS instance with your chosen GPU attached. The following steps assume you have SSH access to a freshly provisioned GPU server on ZenoCloud or E2E infrastructure.
Step 1: Install NVIDIA Drivers and CUDA
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install NVIDIA driver (check nvidia.com for latest version)
sudo apt install -y nvidia-driver-535
# Reboot to load the driver
sudo reboot
# Verify GPU is detected after reboot
nvidia-smi
You should see your GPU listed with driver version and CUDA version. If nvidia-smi fails, the driver installation did not complete correctly. Check dmesg | grep -i nvidia for errors.
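If you provision servers with scripts, the same check can be automated. The sketch below wraps nvidia-smi's CSV query mode; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the code assumes GPU names contain no commas.

```python
import subprocess


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory_mib, driver = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_mib": int(memory_mib), "driver": driver})
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per detected GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on the server raises `FileNotFoundError` when nvidia-smi is not installed and `subprocess.CalledProcessError` when the driver is not loaded, which makes it usable as a provisioning gate.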
Step 2: Install Docker with NVIDIA Container Toolkit
Docker with the NVIDIA runtime is the cleanest way to run inference workloads. It isolates dependencies and makes deployments reproducible.
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
# Log out and back in (or run: newgrp docker) so the group change takes effect
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list" \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU access from Docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Deploying with vLLM (Recommended)
vLLM is a widely used engine for production LLM inference. It implements PagedAttention for efficient KV cache management, continuous batching for high throughput, and an OpenAI-compatible API server out of the box. If you have used the OpenAI API, your existing client code works with vLLM once you change the base URL and API key.
Basic vLLM Deployment
The fastest path to a running LLaMA 3 endpoint:
# Pull and run vLLM with LLaMA 3 8B
docker run -d \
  --name vllm-llama3 \
  --gpus all \
  --shm-size=1g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=your_hf_token_here \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 8192 \
  --dtype auto \
  --api-key your_api_key_here
Replace your_hf_token_here with your Hugging Face token (you need to accept Meta’s license agreement on the model page first). Replace your_api_key_here with a strong random string that will serve as your API authentication key.
Docker Compose for Production
For production deployments, use Docker Compose to manage the vLLM service alongside monitoring:
# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-llama3
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --max-model-len 8192
      --dtype auto
      --api-key ${VLLM_API_KEY}
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.90
      --max-num-batched-tokens 8192
      --enable-prefix-caching
    shm_size: "1g"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  model-cache:
Create a .env file alongside the compose file:
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=your_strong_random_api_key
Start the service:
docker compose up -d
docker compose logs -f vllm
The model download takes 15-30 minutes on a typical Indian server connection. Once loaded, you will see a log line indicating the server is ready.
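Rather than watching the logs, a deployment script can poll the server until it answers. The sketch below is one way to do that with the standard library, assuming the `/health` route used in the compose healthcheck returns HTTP 200 once the model is loaded; `http_ok` and `wait_until_ready` are hypothetical helper names, and the 30-minute default timeout is an assumption based on the download time above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, timeout_s: float = 1800, interval_s: float = 10,
                     probe: Callable[[str], bool] = http_ok,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready("http://localhost:8000/health")`. The `probe` and `sleep` parameters exist so the loop can be exercised in tests without a running server.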
vLLM Configuration for Optimal Throughput
These flags in the command above deserve explanation:
- --gpu-memory-utilization 0.90: Allocates 90% of GPU VRAM to the model and KV cache. Leave 10% headroom for CUDA overhead.
- --max-num-batched-tokens 8192: Maximum tokens processed in a single batch. Higher values improve throughput but increase latency per request.
- --enable-prefix-caching: Caches common prompt prefixes across requests. Critical for RAG applications where the system prompt is identical across calls.
- --tensor-parallel-size 1: Set this to the number of GPUs for multi-GPU setups (e.g., 2 for 70B on 2x L40S).
Testing the Endpoint
The vLLM server exposes an OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the DPDP Act in two sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
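The same request can be issued from application code. The sketch below mirrors the curl call using only the Python standard library; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list[dict], max_tokens: int = 256,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": messages,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Building the request separately from sending it keeps the payload easy to inspect and log. Teams already using an OpenAI client library can point its base URL at the same `/v1` path instead.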
Deploying with Ollama (Simpler Alternative)
If you want something running in five minutes without Docker Compose files and configuration tuning, Ollama is the fastest path. It handles model downloads, quantization, and serving behind a single binary.
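# Install Ollama (on Linux the installer normally registers and starts a systemd service)
curl -fsSL https://ollama.com/install.sh | sh
# Start the server only if it is not already running
pgrep -x ollama > /dev/null || { ollama serve & sleep 3; }
# Pull LLaMA 3 8B
ollama pull llama3
# Test it
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is vLLM?",
  "stream": false
}'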
Ollama is excellent for development, prototyping, and internal tools. For production workloads with concurrent users, vLLM is the better choice because of its continuous batching, PagedAttention, and higher throughput under load.
When to choose Ollama over vLLM:
- Internal team tools with fewer than 10 concurrent users
- Development and testing environments
- Quick proof-of-concept deployments
- Scenarios where simplicity matters more than throughput
API Endpoint Setup with Authentication
For production, you will likely want a thin API layer in front of vLLM that handles authentication, rate limiting, and request logging. Here is a minimal FastAPI wrapper:
# api_server.py
import logging
import os
import time

import httpx
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llama_api")

app = FastAPI(title="LLaMA 3 API")
security = HTTPBearer()

VLLM_URL = os.getenv("VLLM_URL", "http://localhost:8000")
VLLM_API_KEY = os.getenv("VLLM_API_KEY", "")
# Drop empty entries so an unset variable never yields "" as a valid key
VALID_API_KEYS = {
    key.strip() for key in os.getenv("VALID_API_KEYS", "").split(",") if key.strip()
}


def verify_key(creds: HTTPAuthorizationCredentials = Depends(security)) -> str:
    """Reject requests whose bearer token is not a known API key."""
    if creds.credentials not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return creds.credentials


class ChatRequest(BaseModel):
    messages: list[dict]
    max_tokens: int = 512
    temperature: float = 0.7


@app.post("/v1/chat")
async def chat(request: ChatRequest, api_key: str = Depends(verify_key)):
    start = time.perf_counter()
    # Forward the request to vLLM using the backend's own API key
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{VLLM_URL}/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": request.messages,
                "max_tokens": request.max_tokens,
                "temperature": request.temperature,
            },
            headers={"Authorization": f"Bearer {VLLM_API_KEY}"},
        )
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("chat status=%s latency_ms=%.2f", response.status_code, latency_ms)
    if response.status_code != 200:
        raise HTTPException(
            status_code=response.status_code,
            detail="Inference backend error",
        )
    result = response.json()
    result["latency_ms"] = round(latency_ms, 2)
    return result


@app.get("/health")
async def health():
    return {"status": "ok"}
Run it alongside your vLLM container:
pip install fastapi uvicorn httpx
VALID_API_KEYS="key1,key2,key3" VLLM_API_KEY="your_vllm_key" \
uvicorn api_server:app --host 0.0.0.0 --port 9000
This gives you a separate authentication layer, request logging, and a clean interface for your application to consume.
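Rate limiting is the one piece of that API layer the wrapper does not show. A minimal sketch of a per-key limiter follows, using a sliding window held in process memory. The class name `SlidingWindowLimiter` and its parameters are hypothetical, not part of FastAPI or vLLM.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each key."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

In the wrapper, one option is to create a single limiter at module level, call `limiter.allow(api_key)` at the top of the chat handler, and raise `HTTPException(status_code=429)` when it returns False. Because the state lives in one process, running several uvicorn workers would need a shared store instead.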

Performance Optimization
Quantization: Trade Precision for Speed and Memory
Quantization reduces model weight precision from FP16 (16 bits) to INT8 or INT4, cutting memory usage by 2-4x with minimal quality loss. This is how you run LLaMA 3 70B on a single A100 80GB.
AWQ (Activation-aware Weight Quantization) is the recommended approach for vLLM:
# Use a pre-quantized model from Hugging Face
docker run -d \
--name vllm-llama3-70b-awq \
--gpus all \
--shm-size=2g \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
vllm/vllm-openai:latest \
--model TheBloke/Llama-3-70B-Instruct-AWQ \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.92 \
--dtype auto
4-bit AWQ quantization reduces the 70B model from ~140GB to ~35GB of VRAM, fitting on a single A100 80GB with room for KV cache.
Continuous Batching and Throughput Tuning
vLLM handles batching automatically, but you can tune it:
- Increase --max-num-seqs (default 256): Maximum number of sequences processed concurrently. Increase for high-traffic endpoints.
- Adjust --max-num-batched-tokens: Higher values batch more tokens per iteration, improving GPU utilization at the cost of per-request latency.
- Enable --enable-chunked-prefill: Splits long prompt prefills into chunks, reducing time-to-first-token for concurrent requests.
KV Cache Optimization
The KV cache stores attention key-value pairs for in-flight requests. vLLM’s PagedAttention manages this automatically, but the --gpu-memory-utilization flag directly controls how much VRAM is available for KV cache after model weights are loaded.
For the 8B model on an L4 (24GB):
- Model weights: ~16GB (FP16)
- Available for KV cache: ~5.6GB (at 90% utilization)
- This supports approximately 8-12 concurrent requests at 4096 context length
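The concurrency estimate can be reproduced from the model architecture. The sketch below assumes LLaMA 3 8B's published shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache. It ignores activation memory and assumes every request fills the full context, so treat it as a rough floor. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_full_length_sequences(kv_budget_gb: float, context_len: int,
                              bytes_per_token: int) -> int:
    """How many sequences of `context_len` tokens fit in the KV budget."""
    per_sequence_bytes = context_len * bytes_per_token
    return int(kv_budget_gb * 1e9 // per_sequence_bytes)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    budget_gb = 24 * 0.90 - 16  # L4 VRAM x utilization, minus FP16 weights
    print(per_token, max_full_length_sequences(budget_gb, 4096, per_token))
```

Each token costs 128 KiB of cache under these assumptions, so a 4096-token sequence needs about 0.5 GiB and roughly ten such sequences fit in 5.6GB, in line with the 8-12 range above. Since most requests are shorter than the maximum context, real concurrency is often higher.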
Cost Analysis: India vs US for LLaMA Inference
This is where Indian GPU infrastructure delivers its strongest advantage. Here is a direct comparison for common deployment configurations.
LLaMA 3 8B on NVIDIA L4
| Provider | Region | Monthly Cost | Notes |
|---|---|---|---|
| ZenoCloud / E2E | India (Mumbai/Noida) | ~30,000 INR | Dedicated L4 instance |
| Lambda Labs | US | $800+ (~67,000 INR) | On-demand L4 pricing |
| AWS (g6.xlarge) | US East | $900+ (~75,000 INR) | On-demand, no reserved |
| RunPod | US | $650+ (~54,000 INR) | Community cloud pricing |
Savings with Indian infrastructure: 40-60% lower than US equivalents.
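The savings range can be checked against the table. A short sketch of that calculation follows, using the ~30,000 INR monthly L4 estimate from the 8B section as the Indian price; `savings_percent` is a hypothetical helper name.

```python
def savings_percent(india_inr: float, us_inr: float) -> float:
    """Percentage saved by paying `india_inr` instead of `us_inr`."""
    return (us_inr - india_inr) / us_inr * 100


if __name__ == "__main__":
    india_l4_inr = 30_000  # monthly L4 estimate quoted in the 8B section
    us_prices_inr = {"Lambda Labs": 67_000, "AWS g6.xlarge": 75_000, "RunPod": 54_000}
    for provider, price in us_prices_inr.items():
        print(f"{provider}: {savings_percent(india_l4_inr, price):.0f}% lower")
```

With these inputs the savings work out to about 55%, 60%, and 44% respectively, which matches the 40-60% range. The result is only as reliable as the list prices and exchange rate behind the table, so rerun it with current quotes.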
LLaMA 3 70B on NVIDIA A100 80GB
| Provider | Region | Monthly Cost | Notes |
|---|---|---|---|
| ZenoCloud / E2E | India | Custom pricing | Dedicated A100 instance |
| Lambda Labs | US | $2,500+ (~2,10,000 INR) | On-demand A100 80GB |
| AWS (p4d.24xlarge) | US East | $3,000+ (~2,50,000 INR) | 8x A100 instance |
| CoreWeave | US | $2,200+ (~1,85,000 INR) | A100 80GB on-demand |
Indian GPU pricing benefits from lower data center operating costs, competitive energy pricing, and government incentives for domestic cloud infrastructure. The gap widens further with reserved or long-term commitments.
Total Cost of Ownership Considerations
Beyond raw GPU costs, factor in:
- Bandwidth: Indian data centers charge less for domestic bandwidth, and your latency-sensitive traffic stays local.
- Engineering time: A managed deployment from ZenoCloud eliminates the 2-4 weeks of setup, tuning, and monitoring configuration that self-hosting requires.
- Compliance: Keeping data within Indian borders avoids the legal overhead of cross-border data transfer agreements.
Monitoring and Scaling
Prometheus Metrics from vLLM
vLLM exposes Prometheus metrics at /metrics by default. The key metrics to watch:
- vllm:num_requests_running: Currently processing requests. If this consistently equals your --max-num-seqs, you are at capacity.
- vllm:num_requests_waiting: Queued requests. Sustained values above zero indicate you need more GPU capacity.
- vllm:gpu_cache_usage_perc: KV cache utilization. Above 95% means new requests may be queued or running ones preempted.
- vllm:avg_generation_throughput_toks_per_s: Your actual tokens-per-second output.
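For a quick check without a full Prometheus stack, the metrics text can be parsed directly. The sketch below reads the Prometheus exposition format and applies the capacity rule described above. It assumes one series per metric and no trailing timestamps, and the function names are hypothetical.

```python
def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Extract `vllm:` metrics from Prometheus text format, ignoring labels."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.startswith("vllm:"):
            metrics[name] = float(value)  # last series wins if labels differ
    return metrics


def at_capacity(metrics: dict[str, float], max_num_seqs: int) -> bool:
    """True when the running batch is full or requests are queuing."""
    running = metrics.get("vllm:num_requests_running", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    return running >= max_num_seqs or waiting > 0
```

Feed it the body of a GET to `/metrics` on the vLLM port. For alerting on sustained values rather than a single sample, a Prometheus rule remains the better tool.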
Scaling Strategy
Vertical scaling: Move from L4 to A10G to L40S to A100 as your traffic grows. Each step increases both throughput and maximum context length.
Horizontal scaling: Run multiple vLLM instances behind a load balancer (NGINX or Traefik). Each instance handles independent requests. This is the simplest scaling approach and works well up to 10-20 concurrent users per GPU.
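# nginx.conf - simple load balancing across vLLM instances
upstream vllm_backend {
    least_conn;
    server 10.0.1.10:8000;
    server 10.0.1.11:8000;
}
server {
    listen 443 ssl;
    server_name llama-api.yourdomain.com;
    # Placeholder paths: an ssl listener needs a certificate and key
    ssl_certificate /etc/ssl/certs/llama-api.crt;
    ssl_certificate_key /etc/ssl/private/llama-api.key;
    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_read_timeout 120s;
    }
}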
Auto-scaling: For bursty workloads, use Kubernetes with the NVIDIA GPU Operator and Horizontal Pod Autoscaler. Scale on the vllm:num_requests_waiting metric. This is complex to set up but optimal for production workloads with unpredictable traffic patterns.
Common Deployment Mistakes
A few pitfalls that catch teams deploying LLaMA for the first time:
Skipping quantization for 70B. Running FP16 on insufficient VRAM causes OOM crashes mid-inference. Always quantize 70B+ models unless you have ample VRAM headroom.
Setting max-model-len too high. A 128K context window sounds impressive but consumes massive KV cache memory. Set this to the maximum your application actually needs. Most production use cases work fine with 4096-8192 tokens.
No health checks. vLLM can silently stop accepting requests if the GPU enters a bad state. Always configure health checks and automatic restarts.
Ignoring warm-up time. The first request after model loading is slow (CUDA kernel compilation). Send a few warm-up requests during deployment before routing production traffic.
Skip the Setup: ZenoCloud Managed LLaMA Deployment
Everything in this guide works. But it also takes 2-4 weeks of engineering time to get right: driver compatibility, CUDA version mismatches, Docker networking, vLLM configuration tuning, monitoring setup, security hardening, and ongoing maintenance.
ZenoCloud deploys your LLaMA model on Indian GPU infrastructure in 48 hours, fully managed. You get:
- Pre-configured GPU instances with CUDA, Docker, and vLLM optimized for your chosen model size
- An OpenAI-compatible API endpoint with authentication, rate limiting, and logging built in
- Monitoring and alerting through Zabbix and Prometheus, with 24/7 NOC support
- Data residency in India with servers in Mumbai and Noida data centers
- Scaling support as your traffic grows, with capacity planning and GPU upgrades handled for you
Start with 5,000 INR in free credits. Deploy LLaMA 3 8B or 70B and test it against your workload before committing to a long-term plan.
Get started with ZenoCloud GPU Servers or talk to our AI infrastructure team to scope your deployment.