What is a self-hosted voice AI deployment compared to Vapi or Retell?

A self-hosted voice AI deployment runs speech-to-text, LLM inference, and text-to-speech on dedicated GPU infrastructure in an EU region or a client-owned VPC. Hosted vendors like Vapi or Retell process audio in their US or EU cloud with limited subprocessor control. Self-hosting moves control and supply-chain responsibility into the client's own perimeter.

Which GPU stack is the DACH industry standard for production voice AI?

The DACH standard is a dual-GPU pairing: an NVIDIA L40S for ASR and LLM orchestration (Ada Lovelace Transformer Engines handle regional dialects like Swiss German and Austrian Bavarian) and an NVIDIA L4 for cost-efficient TTS streaming that stays under the 200 ms latency threshold. The combination hits 0.3 seconds warm combined latency across multiple concurrent calls. For higher concurrency, scale horizontally on Kubernetes.

How does data residency work for self-hosted voice AI in the EU?

Audio data never leaves the selected EU region. STT, LLM, and TTS run locally on the same GPU node, the OpenAI-compatible endpoints are internal, backups, logs, and model weights stay inside the same data perimeter. This satisfies NIS2 supply-chain proof obligations without external subprocessors.

Can a self-hosted voice agent be migrated into a client-owned VPC?

Yes, in under 48 hours when a runbook exists. The deployment is documented as infrastructure-as-code. Because the API endpoints are OpenAI-compatible, voice-agent applications consuming them require no rewiring.

What does the business-outcome writeback contract include?

Five REST endpoints: call-session start, event, outcome, handoff-summary, and learning-items. The data model covers caller_profiles, call_sessions, call_events, handoffs, outcomes, and learning_items. Every session produces structured learning items that feed back into CRM, support, and knowledge-base systems.

Parloa vs Cognigy vs Vapi vs Retell — which voice AI vendor fits a European enterprise?

Parloa and Cognigy are EU-anchored hosted vendors with enterprise-grade EU hosting and a strong DACH sales motion. Vapi and Retell are US-centric platforms with faster time-to-first-call but more complex subprocessor supply chains from a GDPR perspective. Self-hosting complements those four as the path when NIS2 Article 21 or contractual data residency is stricter than what hosted contracts can guarantee.

When is self-hosting cheaper than Parloa, Cognigy, Vapi, or Retell?

Hosted vendors scale linearly with call minutes, typically between 0.12 and 0.40 EUR per minute (model-dependent). Self-hosting on a dedicated GPU node costs between 800 and 2,500 EUR per month regardless of volume. Break-even sits around 50,000 minutes per month — under stricter compliance requirements (NIS2, sector-specific data residency), often much earlier.

Which EU sovereign clouds run this stack beyond STACKIT?

The same dual-GPU configuration runs on PlusServer (DE), Hetzner Cloud (DE-Nuremberg, DE-Falkenstein, FI-Helsinki), IONOS Cloud (DE-Frankfurt), OVHcloud (FR-Gravelines, DE-Limburg, PL-Warsaw), Scaleway (FR-Paris, NL-Amsterdam, PL-Warsaw), and Open Telekom Cloud by T-Systems (DE). The runbook calls out provider-specific GPU SKUs and network paths. Migration between providers takes 48 hours of redeployment because the OpenAI-compatible endpoints stay identical.

Can this self-hosted stack overlay an existing CCaaS (Genesys Cloud, Amazon Connect, 8x8)?

Yes, as a programmable-voice extension. The voice agent terminates on the CCaaS-side SIP trunk via a dedicated session border controller, but STT, LLM, and TTS run on the self-hosted GPU node. The CCaaS handles call routing, queue management, and human-agent handoff while the AI conversation, recording, and transcripts stay inside the EU perimeter. This is the most common architecture for enterprises with a multi-year existing CCaaS investment that cannot be replaced.

How is PCI DSS handled when callers read card numbers in the conversation?

PCI DSS scope is removed via DTMF suppression and a separate payment-capture handoff. When the agent reaches a payment step, the call routes to a PCI-scoped IVR service that captures card digits via DTMF tones — the AI agent never hears the audio, never logs it, and the voice transcript explicitly redacts the segment. The agent resumes after the payment service returns a success token. This pattern keeps the voice AI stack outside PCI scope while still supporting card-present transactions over the phone.

Production Self-Hosted Voice AI Platform For Data-Residency-Sensitive Teams

May 21, 2026 · 4 min read · voice-ai, self-hosted, dograh, eu-data-residency, gpu, nlp, speech-to-text, conversational-ai

Ergebnis Outcome

Eine produktionstaugliche, selbst-gehostete Voice-AI-Bereitstellung mit gemessener Warmpfad-Latenz (0,3s kombiniert auf dem Dual-GPU-Stack L40S + L4), persistentem Zustand für schnelle Wiederinbetriebnahme und einem strukturierten Writeback-Vertrag, damit jeder Anruf zurück in Sales, Support, Produkt und Ops fließt — heute in EU-Infrastruktur deploybar, bei Bedarf in eine kundeneigene VPC migrierbar. A production-oriented self-hosted voice AI deployment with measured warm-path latency (0.3s combined on the dual-GPU L40S + L4 stack), persistent state for fast ramp-up, and a structured writeback contract so every call feeds back into sales, support, product, and ops — deployable in EU infrastructure today and migratable into a client-owned VPC when required.

0.3s

Latenz (warm), kombiniert Warm combined latency

L40S (ASR + LLM) · L4 (TTS-Streaming) L40S (ASR + LLM) · L4 (TTS streaming)

L40S + L4

Test-GPU First validation GPU

STACKIT (DE-Frankfurt) STACKIT (DE-Frankfurt)

300 GB

Persistentes Modell-Volume Persistent model volume

Schnelle Wiederinbetriebnahme Fast ramp-up after compute shutdown

Writeback-Endpunkte Writeback endpoints

session · event · outcome · handoff · learning session · event · outcome · handoff · learning

The problem

Voice AI platforms like Parloa, Cognigy, Vapi, and Retell are useful, but enterprise teams often need more control than a hosted voice SaaS gives them. The recurring concerns:

Where does call data go?
Can the system run in a trusted VPC?
Can STT, LLM, and TTS providers be swapped?
Can costs be controlled at scale?
Can learnings from calls feed back into the business?
Can the workflow be reviewed before changes reach production?

The solution

A production-oriented self-hosted voice AI setup with operational control as the design centre:

Dograh for open-source voice-agent orchestration.
GPU-backed local STT, LLM, and TTS inference behind OpenAI-compatible endpoints so individual providers can be swapped without rewiring callers.
Persistent model/runtime volume so compute can be ramped down and back up without long re-download cycles.
Evidence artifacts for machine proof, model preload, health checks, benchmark, and smoke tests.
Runbook for recreating the setup on STACKIT, PlusServer, or in a client-selected VPC.

Client context before the first word

A key design point is pre-call context lookup. Before the caller starts speaking, the system can retrieve known account or client information and use it to adapt: greeting, tone, product context, support tier, next best question, routing decision, handoff threshold. This makes the agent behave less like a generic bot and more like a prepared operator who knows who is calling.

What we measured

The validation environment ran on STACKIT in DE-Frankfurt with a dual-GPU stack — an NVIDIA L40S handling ASR and LLM orchestration, an NVIDIA L4 streaming TTS under the 200 ms latency threshold — and a dedicated 300 GB persistent volume. The pairing is the DACH industry standard for multilingual voice AI handling regional dialects (Swiss German, Austrian Bavarian). The local voice AI layer completed chat response generation, text-to-speech audio generation, speech-to-text transcription, health checks, and a warm benchmark.

Stage	GPU	Warm latency
ASR + LLM orchestration	L40S	combined path
TTS streaming	L4	< 200 ms
Combined warm round-trip		`0.3s`

Business outcome loop

The system is designed to write structured session outcomes back to a backend so calls become measurable business progress, not just “a voice bot answered.”

Per-session metrics tracked: call answered, call completed, successful handoff, resolved without handoff, qualified lead, disqualified lead, reason for disqualification, unresolved question, objection category, follow-up needed, estimated value, cost per completed call, cost per qualified lead.

A successful handoff means the agent identified that a human should take over, the destination was correct, the human received context, and the caller did not need to repeat the whole story. Sample payload:

{
  "session_id": "sess_123",
  "handoff_target": "sales_engineering",
  "caller": {
    "company": "Acme GmbH",
    "support_tier": "priority"
  },
  "reason": "VPC deployment and security review question",
  "summary": "Caller wants self-hosted voice AI in their own VPC and asked about data residency.",
  "recommended_next_action": "Schedule technical architecture call."
}

Each session also creates learning items so calls feed back into sales, support, product, marketing, and operations:

{
  "session_id": "sess_123",
  "type": "knowledge_gap",
  "source": "voice_call",
  "text": "Caller asked whether STT can run fully inside an EU VPC.",
  "recommended_action": "Add VPC-local STT section to security FAQ.",
  "priority": "high"
}

Minimum backend API contract:

POST /call-session/start
POST /call-session/event
POST /call-session/outcome
POST /handoff-summary
POST /learning-items

Data model: caller_profiles, call_sessions, call_events, handoffs, outcomes, learning_items, agent_versions, workflow_versions.

Guardrail

Do not let one session automatically rewrite production behavior. The recommended flow:

Write learning item.
Group similar items.
Review.
Update prompt / workflow / knowledge base.
Version change.
Test.
Publish.

Five capabilities this proves

Ultra-low latency path through local GPU inference.
Provider-agnostic model flexibility.
Human-style interaction handling through context and handoff logic.
Transient specialist agents behind the live voice agent.
Enterprise workflow control through visual guardrails and persistence.

Honest status

The infrastructure and local inference proof are complete. The next production proof is to wire Dograh orchestration into the local inference endpoints, add pre-call client lookup, and add backend learning writeback.

For orientation first: a full vendor comparison — Vapi vs Retell vs Parloa vs Cognigy vs Dograh with NIS2 scoring and cost matrix is available as a separate guide.

If you are running voice AI on a hosted SaaS today and the data-residency, provider-swap, or business-outcome questions above are starting to bite, I am happy to walk through what a self-hosted deployment would look like for your specific stack — thirty minutes, no slide deck.

Stack Stack

Dograh open-source voice-agent orchestration layer
Dual-GPU architecture: NVIDIA L40S (ASR + LLM orchestration) + NVIDIA L4 (TTS streaming under 200 ms), OpenAI-compatible endpoints
Persistent 300 GB model/runtime volume for fast ramp-up after compute shutdown
Pre-call context lookup (account, support tier, product, routing) before the caller speaks
Structured writeback contract: call sessions, outcomes, handoffs, learning items
Validation on STACKIT (DE-Frankfurt, NVIDIA L40S + L4 dual-GPU)

Bereit, ein ähnliches Projekt zu skizzieren? Schriftliches Konzept in 24 Stunden. Ready to scope a similar engagement? Written concept in 24h.

Mein Konzept in 24h → My concept in 24h →

Production Self-Hosted Voice AI Platform For Data-Residency-Sensitive Teams

The problem

The solution

Client context before the first word

What we measured

Business outcome loop

Guardrail

Five capabilities this proves

Honest status

Stack Stack

Before you go —

Almost there

Production Self-Hosted Voice AI Platform For Data-Residency-Sensitive Teams

The problem

The solution

Client context before the first word

What we measured

Business outcome loop

Guardrail

Five capabilities this proves

Honest status

Stack Stack

Scope my automation in 24h

Request received