Real-Time Voice Translation Solutions
Real-time voice translation converts spoken words from one language into spoken output in another language with latency low enough that a natural conversation can flow between people who do not share a common language. It is among the hardest problems in speech AI: it combines automatic speech recognition, machine translation, and text-to-speech synthesis in a pipeline where every millisecond of added delay degrades the conversational experience.
Real-Time Voice Translation Pipeline
Pipeline latency target: ASR 80ms + MT 60ms + TTS 120ms = ~260ms total, well within the 500ms conversational threshold.
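The budget above is simple arithmetic, and it can help to keep it in a form that can be checked against measurements. The following is a minimal sketch, not a production tool: the stage names, the function names, and the sample measurement in the usage note are assumptions made for illustration; only the 80/60/120ms targets and the 500ms threshold come from the text.

```python
# Per-stage latency targets in milliseconds, taken from the figures above.
STAGE_BUDGET_MS = {"asr": 80, "mt": 60, "tts": 120}

# Conversational threshold quoted in the text.
CONVERSATIONAL_THRESHOLD_MS = 500


def total_latency_ms(stages):
    """Sum the per-stage latencies of a pipeline."""
    return sum(stages.values())


def headroom_ms(stages, threshold_ms=CONVERSATIONAL_THRESHOLD_MS):
    """Milliseconds left before the pipeline crosses the threshold."""
    return threshold_ms - total_latency_ms(stages)


def over_budget(measured, budget=STAGE_BUDGET_MS):
    """Return {stage: overshoot_ms} for stages slower than their target.

    `measured` is a hypothetical dict of observed per-stage latencies.
    """
    return {
        name: measured[name] - target
        for name, target in budget.items()
        if measured.get(name, 0) > target
    }
```

Calling `total_latency_ms(STAGE_BUDGET_MS)` gives 260 and `headroom_ms(STAGE_BUDGET_MS)` gives 240. That headroom has to absorb anything the three stage figures do not include, such as network transport and audio buffering.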
The technical pipeline for real-time voice translation has three stages. Automatic speech recognition (ASR) converts the speaker’s audio to text, typically using models such as Whisper (OpenAI), the Conformer-based models in Google Cloud Speech-to-Text, or Mozilla DeepSpeech. Streaming ASR that produces partial transcripts as the speaker talks (rather than waiting for a sentence boundary) is essential for low latency. Machine translation (MT) then converts the transcript into the target language using neural MT models: DeepL and Google Cloud Translation API offer commercial APIs, while Helsinki-NLP and NLLB (Meta) offer open-source models. Finally, text-to-speech (TTS) synthesises the translated text as audio. Adding voice cloning at this stage preserves the speaker’s voice characteristics in the translated output, which is what makes real-time translation feel natural rather than robotic. Products and toolkits such as ElevenLabs, Coqui TTS, and Microsoft Azure Neural Voice can now produce speech that closely matches a speaker’s timbre and cadence across languages.
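The three stages described above can be sketched as a chain in which each audio chunk flows through ASR, MT, and TTS as soon as it arrives. This is an illustrative skeleton only: `stream_translate` and the three `fake_*` stand-ins are hypothetical names, the toy lexicon is invented, and none of this reflects the API of any product named in the text. A real system would replace each stand-in with a call to a streaming ASR, MT, or TTS engine.

```python
from typing import Callable, Iterable, Iterator


def stream_translate(
    audio_chunks: Iterable[bytes],
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield translated audio chunk by chunk instead of waiting for the end."""
    for chunk in audio_chunks:
        partial_text = recognize(chunk)      # stage 1: ASR
        if not partial_text:
            continue                         # silence or noise: nothing to translate
        translated = translate(partial_text)  # stage 2: MT
        yield synthesize(translated)          # stage 3: TTS


# Hypothetical stand-ins so the sketch runs without any external service.
_TOY_LEXICON = {"hello": "hola", "world": "mundo"}


def fake_recognize(chunk: bytes) -> str:
    # Pretend the "audio" bytes are already UTF-8 text.
    return chunk.decode("utf-8").strip()


def fake_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(_TOY_LEXICON.get(word, word) for word in text.split())


def fake_synthesize(text: str) -> bytes:
    # Pretend the encoded text is synthesised audio.
    return text.encode("utf-8")
```

For example, `list(stream_translate([b"hello", b"", b"world"], fake_recognize, fake_translate, fake_synthesize))` produces one output chunk for each non-empty input chunk. Using a generator means downstream playback can begin before the speaker has finished, which is the property the latency budget depends on.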
ICT Innovations is actively developing real-time voice translation as an upcoming feature for its communications products. The capability is currently in development and is not yet available in any released product – but when it ships, it will enable agents and customers to speak different languages on the same call, opening contact centre operations to multilingual markets without requiring agents to be fluent in every customer language. The broader industry trend is towards AI-native communication stacks where voice translation, sentiment analysis, and real-time coaching coexist in the same platform. Google Meet, Microsoft Teams, and Zoom all have early real-time translation features in production, demonstrating that the technology is viable at scale – the challenge for specialist communications platforms is integrating these capabilities at the SFU/media-server layer with acceptable latency for live telephony workloads.
Frequently Asked Questions
How low does latency need to be for real-time voice translation to feel natural?
In natural conversation, the gaps and overlaps between speaker turns are typically in the range of 150-300 milliseconds. A voice translation pipeline needs to add less than 500ms of end-to-end latency to remain conversational. Current state-of-the-art systems can achieve this with streaming ASR, optimised MT inference, and low-latency TTS, but doing so requires careful engineering at each stage and typically costs some translation quality compared with non-real-time systems that can wait for full sentences.
What languages does modern voice translation support?
Commercial ASR and MT systems cover 100+ languages. Whisper Large V3 supports 99 languages for transcription. Meta’s NLLB model covers 200 languages for translation. However, quality varies significantly: high-resource languages (English, Spanish, Mandarin, French, German) have excellent accuracy across all three pipeline stages, while low-resource languages may have noticeably higher error rates. Voice cloning quality also drops for languages that are less well represented in the training data.
What is the difference between real-time voice translation and post-call transcription and translation?
Post-call transcription processes a complete recording after a call ends and can take minutes to hours. It produces high-accuracy results because the full audio context is available and there is no latency constraint. Real-time translation must produce results during the call, within a few hundred milliseconds, which requires streaming models that make do with partial context and therefore have lower accuracy. The two serve different use cases: post-call for compliance, analytics, and coaching; real-time for live conversations between people speaking different languages.
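One way to picture what working with partial context means in practice: a streaming recogniser may revise its latest words as more audio arrives, so a real-time system has to decide which words are stable enough to pass on to translation. The sketch below shows one simple policy, committing only the words on which two consecutive partial hypotheses agree. It is an assumption chosen for illustration, not a description of any particular product, and `PrefixCommitter` is a hypothetical name.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest run of leading words shared by two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class PrefixCommitter:
    """Release words only once two consecutive partial transcripts agree on them."""

    def __init__(self) -> None:
        self._previous: List[str] = []  # words of the last hypothesis seen
        self._committed = 0             # number of words already released

    def update(self, hypothesis: str) -> List[str]:
        """Take the newest partial transcript; return newly stable words."""
        words = hypothesis.split()
        stable = common_prefix(self._previous, words)
        self._previous = words
        new_words = stable[self._committed:]
        # Words already released are never retracted, even if a later
        # hypothesis revises them; that is the accuracy cost of low latency.
        self._committed = max(self._committed, len(stable))
        return new_words
```

Feeding the partial transcripts "the cat", "the cat sat", and "the cat sat on" in order releases nothing, then "the cat", then "sat": each word is held back for one extra update. A post-call system avoids this trade-off entirely because it sees the whole recording before producing any output.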
Is ICT Innovations planning to add voice translation to its products?
Yes. ICT Innovations is currently developing real-time voice translation as an upcoming feature for its communications products, including its contact centre and broadcast platforms. The capability is in active development and has not yet been released. When it ships, it will allow agents and customers to communicate in different languages on live calls without needing a human interpreter. Updates will be announced on the ICT Innovations blog as development progresses.
