AI Inference Hosting

Production AI Inference on Dedicated GPUs

vLLM, TGI, and Triton deployed and managed in India. Dedicated GPU per deployment — no shared queues, no cold start spikes. OpenAI-compatible endpoint ready in days.

Dedicated GPU, no shared queues · India datacenter (DPDP) · ₹5,000 free trial credits · 24/7 managed ops
Running production workloads for
Revolt Motors · PC Jeweller · RR Kabel · Impresario · Intentwise · Loom · Bhima · BGauss · Mitutoyo
<500 ms cold start target (vLLM warm)
Zero per-token charges
24/7 NOC monitoring from Bangalore
2–5 days from scoping to live endpoint
3–5x cheaper than US GPU clouds

What ZenoCloud Manages for Inference

You define the model and concurrency target. We handle everything from hardware to the production endpoint.

Hardware + CUDA Stack

GPU server racked and validated. Ubuntu 22.04 LTS, CUDA 12.4, cuDNN 9.0, NCCL 2.19 — installed and benchmarked before handoff.
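
As a quick sanity check after handoff, the stack can be inspected from Python. A minimal sketch, assuming PyTorch is installed on the node; the commented outputs are what we would expect for this stack, not guaranteed values:

    import torch

    # Verify the GPU and CUDA stack visible to Python on the handed-over node.
    print(torch.cuda.is_available())        # True on a healthy GPU node
    print(torch.cuda.get_device_name(0))    # e.g. "NVIDIA A100 80GB PCIe"
    print(torch.version.cuda)               # e.g. "12.4"
    print(torch.backends.cudnn.version())   # e.g. a 9xxxx value for cuDNN 9.x
    print(torch.cuda.nccl.version())        # e.g. (2, 19, x)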

Runtime Configuration

vLLM (production default), TGI, Triton Inference Server, or Ollama — configured per your model family, batch size, and concurrency requirements.

OpenAI-Compatible Endpoint

HTTPS endpoint at your subdomain. Point your OpenAI client's base URL (openai.api_base in the legacy SDK, base_url in v1+) at it: a drop-in replacement, no application code changes required.
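
A minimal sketch with the OpenAI Python SDK (v1+); the endpoint URL, API key, and model name below are placeholders for whatever your deployment uses:

    from openai import OpenAI

    # Placeholder endpoint, key, and model name; substitute your deployment's values.
    client = OpenAI(
        base_url="https://inference.yourcompany.example/v1",
        api_key="YOUR_ENDPOINT_KEY",
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model was deployed
        messages=[{"role": "user", "content": "Summarise this invoice in one line."}],
    )
    print(response.choices[0].message.content)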

Latency & Throughput Monitoring

Grafana dashboards: GPU utilization, p50/p95/p99 request latency, queue depth, token throughput, KV cache hit rate.

Single-Tenant Data Privacy

Your inference traffic stays on your dedicated GPU server. No shared compute, no ZenoCloud logging of prompts or completions.

Dynamic Batching & Scaling

Continuous batching enabled by default in vLLM. Horizontal scaling via load balancer when per-GPU concurrency saturates.
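
To see continuous batching at work, a rough sketch that fires several requests concurrently at the same placeholder endpoint; vLLM interleaves them on the GPU instead of serving them one at a time:

    import asyncio
    from openai import AsyncOpenAI

    # Placeholder endpoint, key, and model name.
    client = AsyncOpenAI(
        base_url="https://inference.yourcompany.example/v1",
        api_key="YOUR_ENDPOINT_KEY",
    )

    async def ask(prompt: str) -> str:
        resp = await client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    async def main() -> None:
        # Concurrent requests are batched together by the runtime, not queued serially.
        prompts = [f"Write product tagline #{i}." for i in range(16)]
        answers = await asyncio.gather(*(ask(p) for p in prompts))
        print(len(answers), "responses received")

    asyncio.run(main())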

Inference Runtimes Supported

ZenoCloud pre-installs your chosen runtime. You interact via REST API — no direct runtime management needed.

vLLM
Best For: Production, high concurrency, PagedAttention for memory efficiency
Endpoint Format: /v1/chat/completions (OpenAI-compatible)
ZenoCloud Default: Yes (production default)
NVIDIA Triton
Best For: Multi-framework (TensorRT, ONNX, PyTorch) for vision, speech, embedding
Endpoint Format: HTTP + gRPC inference endpoints
ZenoCloud Default: Available on request
TGI (Text Generation Inference)
Best For: HuggingFace-native models, PEFT adapters, safetensors format
Endpoint Format: /generate + /generate_stream
ZenoCloud Default: Available on request
Ollama
Best For: Dev/test, low concurrency, quick iteration
Endpoint Format: /api/chat (native) + OpenAI shim
ZenoCloud Default: Available on request

* vLLM recommended for all production deployments sustaining more than 5 concurrent requests. Triton for non-LLM model types (vision, speech, ONNX). Example calls for the two most common endpoint formats are shown below.
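
For illustration, the same prompt against the two endpoint formats you are most likely to use; the host, key, and model are placeholders, and payloads are trimmed to the essentials:

    import requests

    HOST = "https://inference.yourcompany.example"  # placeholder host
    HEADERS = {"Authorization": "Bearer YOUR_ENDPOINT_KEY"}

    # vLLM (OpenAI-compatible): POST /v1/chat/completions
    r = requests.post(
        f"{HOST}/v1/chat/completions",
        headers=HEADERS,
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "Hello"}],
        },
    )
    print(r.json()["choices"][0]["message"]["content"])

    # TGI (native): POST /generate
    r = requests.post(
        f"{HOST}/generate",
        headers=HEADERS,
        json={"inputs": "Hello", "parameters": {"max_new_tokens": 64}},
    )
    print(r.json()["generated_text"])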

Dedicated GPU Inference vs Shared Inference API

The break-even point sits around 150,000–300,000 requests per month, depending on model size. Above that threshold, dedicated inference saves money and gives you full data control.

OpenAI / Anthropic / Together API: variable cost (per token, per request), frontier model capability (GPT-4 / Claude Opus), zero setup time, cost-effective below 50K requests/month.
ZenoCloud Dedicated Inference: fixed cost (₹30K–₹2L/mo per GPU), no rate limits, data residency in India, DPDP Act 2023 compliance, inference on private / fine-tuned models, PHI / PII stays on your infrastructure, and a p99 latency SLA you control.

Frequently Asked Questions

What is the difference between dedicated GPU inference and a shared API?
Shared APIs (OpenAI, Anthropic, Together AI) run your requests on multi-tenant infrastructure — you pay per token, face rate limits, and share GPU time. Dedicated GPU inference gives you your own GPU server. You get fixed monthly cost, no rate limits, predictable latency, and full control over your model and data. Cost crossover occurs at roughly 150,000–300,000 requests per month depending on model size.
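
A back-of-the-envelope sketch of that crossover; the per-token rate and GPU rental below are illustrative assumptions, not quoted prices:

    # All figures are illustrative assumptions, not quoted prices.
    fixed_monthly_inr = 150_000        # assumed dedicated GPU rental per month
    tokens_per_request = 1_500         # assumed prompt + completion tokens
    api_rate_inr_per_1k_tokens = 0.50  # assumed blended shared-API rate

    api_cost_per_request = tokens_per_request / 1_000 * api_rate_inr_per_1k_tokens
    break_even = fixed_monthly_inr / api_cost_per_request
    print(f"Break-even at roughly {break_even:,.0f} requests/month")  # ~200,000
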
Which inference runtime does ZenoCloud use?
vLLM is the default for all LLM inference (Llama, Mistral, DeepSeek, Qwen). It uses PagedAttention for memory-efficient KV cache management and continuous batching for high throughput. For non-LLM workloads (vision models, ONNX exports, speech), we use NVIDIA Triton. TGI and Ollama are available on request for HuggingFace-native models and dev/test environments.
How fast is cold start on a dedicated GPU?
Once deployed, vLLM stays warm — no cold starts on subsequent requests. The initial model load at startup (one-time per deployment) takes 30–120 seconds depending on model size. A 7B model loads in under 30 seconds. A 70B model loads in 60–120 seconds. For high-availability setups, we keep a warm standby node to eliminate restart latency entirely.
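
One way to verify warm behaviour on your own endpoint is to time the first streamed token; a rough sketch, with placeholder URL, key, and model name:

    import time
    from openai import OpenAI

    # Placeholder endpoint, key, and model name.
    client = OpenAI(
        base_url="https://inference.yourcompany.example/v1",
        api_key="YOUR_ENDPOINT_KEY",
    )

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Say hi."}],
        stream=True,
    )
    for chunk in stream:
        # Stop at the first content delta and report time-to-first-token.
        if chunk.choices and chunk.choices[0].delta.content:
            print(f"First token after {time.perf_counter() - start:.3f} s")
            break
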
What throughput can I expect on a 7B model vs a 70B model?
On a single A100 80GB with vLLM at batch size 8: a 7B model (FP16) achieves roughly 2,000–4,000 tokens/second. A 70B model (FP16) achieves roughly 400–800 tokens/second. H100 SXM provides approximately 2x the throughput of A100 for the same model due to higher memory bandwidth. Actual throughput depends on prompt length, context window, and concurrency — we benchmark your specific workload during onboarding.
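
To translate aggregate token throughput into requests served, a quick sketch under an assumed average completion length (measure your own traffic for real numbers):

    tokens_per_second = 3_000     # midpoint of the 2,000-4,000 tok/s range above
    avg_completion_tokens = 200   # assumed average response length

    print(f"~{tokens_per_second / avg_completion_tokens:.0f} responses/second")  # ~15
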
Does ZenoCloud support vision models, embeddings, and speech inference?
Yes. Beyond LLM text inference, we deploy: embedding models (BGE, E5, Nomic-embed) for RAG pipelines, vision models (LLaVA, InternVL, Qwen-VL) for multimodal inference, and Whisper variants for speech-to-text. NVIDIA Triton handles multi-framework workloads (ONNX, TensorRT). Contact us with your model type and we'll scope the right hardware and runtime.
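
For embedding models the call shape is similar; a sketch assuming the deployment exposes an OpenAI-compatible /v1/embeddings route (vLLM can serve this for embedding models), with placeholder URL, key, and model name:

    from openai import OpenAI

    # Placeholder endpoint, key, and embedding model name.
    client = OpenAI(
        base_url="https://inference.yourcompany.example/v1",
        api_key="YOUR_ENDPOINT_KEY",
    )

    emb = client.embeddings.create(
        model="BAAI/bge-large-en-v1.5",
        input=["dedicated GPU inference in India"],
    )
    print(len(emb.data[0].embedding), "dimensions")
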
How does ZenoCloud ensure DPDP Act 2023 compliance for inference?
All inference runs in our Mumbai datacenter within Indian jurisdiction. Your prompts and completions stay on your dedicated GPU server — we collect only infrastructure metrics (GPU utilization, container health). We sign a Data Processing Agreement confirming no prompt data is logged, no data leaves India, and no data is used for model training. Air-gapped on-premise deployments available for highest-sensitivity requirements.
What happens if my inference endpoint crashes at 2 AM?
Our Bangalore NOC monitors all deployments around the clock. systemd restarts vLLM within 10 seconds on process crash. If the restart fails (OOM, CUDA error), the NOC investigates and intervenes. Median time from alert to resolution is under 15 minutes. Scale-tier clients get a dedicated on-call engineer with direct phone access.
Talk to an engineer, not a sales rep

Get Your Inference Endpoint Live in Days

Tell us your model, concurrency target, and compliance requirements. We scope the deployment, confirm hardware, and get you an OpenAI-compatible endpoint.