Open Source VoIP & ICT Solutions for Businesses Worldwide

Real-Time Voice Translation Solutions

Real-time voice translation is genuinely one of the hardest problems in applied AI. You’re chaining three complex systems – automatic speech recognition, machine translation, and text-to-speech – in a pipeline where every millisecond of added delay degrades the conversational experience. Human conversations have natural gaps of roughly 150-300ms. Add more than 500ms of pipeline latency and the conversation stops feeling natural. So you’re not just solving a quality problem; you’re solving a hard real-time engineering problem at the same time. And the two objectives often pull in opposite directions.

Figure: Real-Time Voice Translation Pipeline (ICT Innovations, in development). Speaker audio input → streaming ASR (Whisper / Conformer, partial transcripts, ~80ms) → neural MT (DeepL / NLLB-200, 200 languages, ~60ms) → TTS with voice cloning (ElevenLabs / Azure Neural, speaker voice preserved, ~120ms) → audio out. Streaming ASR produces partial transcripts, so MT starts before the speaker finishes, cutting total pipeline latency.

Pipeline latency target: ASR 80ms + MT 60ms + TTS 120ms = ~260ms total — well within the 500ms conversational threshold.
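The budget arithmetic in the caption above can be written out as a small check. This is a minimal sketch: the stage figures and the 500ms threshold come from the caption, while the names `STAGE_BUDGET_MS`, `total_latency_ms`, and `headroom_ms` are illustrative, and treating the remaining margin as headroom for network transit and jitter buffering is an assumption rather than something the caption states.

```python
# Per-stage latency targets in milliseconds, taken from the caption above.
STAGE_BUDGET_MS = {"asr": 80, "mt": 60, "tts": 120}

# Threshold beyond which added delay stops feeling conversational.
CONVERSATIONAL_THRESHOLD_MS = 500


def total_latency_ms(budget: dict[str, int]) -> int:
    """Sum the per-stage latencies of a strictly sequential pipeline."""
    return sum(budget.values())


def headroom_ms(budget: dict[str, int],
                threshold_ms: int = CONVERSATIONAL_THRESHOLD_MS) -> int:
    """Margin left under the threshold once all stages have run.

    Assumption: this margin has to absorb everything outside the three
    stages, such as network transit and jitter buffering.
    """
    return threshold_ms - total_latency_ms(budget)


if __name__ == "__main__":
    print(total_latency_ms(STAGE_BUDGET_MS))  # 260
    print(headroom_ms(STAGE_BUDGET_MS))       # 240
```

Keeping the budget as data makes it easy to see how much slack disappears if any one stage overruns its target.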

The pipeline has three stages, each with its own latency budget. ASR (automatic speech recognition) converts audio to text. Streaming ASR – where the model produces partial transcripts as the speaker talks rather than waiting for a sentence boundary – is essential for keeping latency under control. OpenAI’s Whisper Large V3 (released late 2023) covers 99 languages and is among the most widely used open-source ASR models, though Conformer-based models in Google Cloud Speech-to-Text can be faster at similar accuracy for many languages. Neural machine translation (MT) then converts the transcript to the target language: DeepL is widely regarded as a strong choice for European language pairs, while Meta’s open-source NLLB-200 covers 200 languages. Finally, TTS with voice cloning (ElevenLabs, Microsoft Azure Neural Voice) preserves the speaker’s voice characteristics in the translated output – that’s what makes it feel natural rather than robotic, and it’s the part of the pipeline where quality has improved most dramatically over the last two years. A good end-to-end system targets roughly 80ms ASR + 60ms MT + 120ms TTS = about 260ms total, well under the 500ms conversational threshold.
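The three-stage chaining described above can be sketched with Python generators, so that each short segment is translated and synthesised as soon as it is available instead of after the whole utterance. This is only an illustration under stated assumptions: `segment_stream`, `translate_stream`, `fake_translate`, `fake_synthesize`, and the `max_words` cut-off are hypothetical names and values, and the stand-in functions take the place of real ASR, MT, and TTS back ends, which are not called here.

```python
from typing import Callable, Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def segment_stream(words: Iterable[str], max_words: int = 6) -> Iterator[str]:
    """Group incoming ASR words into short segments.

    A segment is flushed at sentence-final punctuation or once it reaches
    max_words, so translation can start before the speaker finishes.
    """
    buffer: list[str] = []
    for word in words:
        buffer.append(word)
        if word.endswith(SENTENCE_END) or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the audio ends
        yield " ".join(buffer)


def translate_stream(
    words: Iterable[str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    max_words: int = 6,
) -> Iterator[bytes]:
    """Chain segmentation -> MT -> TTS lazily, one segment at a time."""
    for segment in segment_stream(words, max_words):
        yield synthesize(translate(segment))


# Hypothetical stand-ins for real MT and TTS back ends.
def fake_translate(text: str) -> str:
    return text.upper()


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    asr_words = "hello how are you today? i am calling about my invoice".split()
    for audio_chunk in translate_stream(asr_words, fake_translate, fake_synthesize):
        print(audio_chunk)
```

A smaller `max_words` lowers the wait before the first translated audio but gives the MT stage less context per segment, which is the latency versus quality tension the article describes.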

Voice Translation Quality by Language Tier

| Metric | High-Resource Languages (English, Spanish, Mandarin, French) | Low-Resource Languages (Swahili, Bengali, Punjabi, Amharic) |
| --- | --- | --- |
| ASR accuracy | 95-98% word accuracy (roughly 2-5% WER) | 75-88% word accuracy (roughly 12-25% WER) |
| MT quality (BLEU) | 40-55 | 20-35 |
| TTS quality | Natural voice cloning | Limited voice variety |
| Pipeline latency (ASR + MT + TTS) | ~220-280ms total | ~280-400ms total |

Low-resource language quality is improving rapidly: Meta’s NLLB-200 and Google Translate each cover 200 or more languages, with quality that is often acceptable for customer service.
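The ASR figures in the table above are easier to interpret with the metric spelled out. Word error rate (WER) is an error measure, so lower is better, and word accuracy is roughly 1 minus WER. The following is a minimal sketch of the standard word-level edit-distance calculation; the function name `word_error_rate` and the sample sentences are illustrative and do not come from any of the products mentioned.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missed)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "please transfer me to the billing department today"
    hypothesis = "please transfer me to a billing department"
    wer = word_error_rate(reference, hypothesis)
    print(wer)      # 0.25 (one substitution and one deletion over 8 words)
    print(1 - wer)  # 0.75 word accuracy
```

Because insertions count as errors, WER can exceed 100% on very poor transcripts, which is one reason accuracy and WER are not perfectly interchangeable.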

ICT Innovations is developing real-time voice translation as an upcoming feature for its communications products. The capability hasn’t been released yet – but when it ships, it will let agents and customers speak different languages on the same call, opening contact centre operations to multilingual markets without requiring multilingual agents. That’s a meaningful operational change for any organisation serving international customers. The broader industry is already moving in this direction: Google Meet, Microsoft Teams, and Zoom all have early real-time translation features in production, which suggests the technology is viable at scale. The challenge for specialist communications platforms is integrating these capabilities at the media-server layer with acceptable latency for live telephony – which is harder than the consumer conferencing implementations, but also where ICT Innovations’ deep FreeSWITCH expertise is directly relevant.

Frequently Asked Questions

How low does latency need to be for real-time voice translation to feel natural?

Human conversations have natural gaps of roughly 150-300ms. A voice translation pipeline needs to add less than 500ms of end-to-end latency to remain conversational. Current state-of-the-art systems achieve this with streaming ASR, optimised MT inference, and low-latency TTS – but it requires careful engineering at each stage and usually means accepting slightly lower translation quality than non-real-time systems that can wait for full sentences.
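One reason streaming adds engineering work is that partial transcripts can be revised as more audio arrives. A possible way to handle this, sometimes described as a local-agreement policy, is to pass a word downstream only once two consecutive partial hypotheses agree on it. The sketch below assumes that policy; `common_prefix`, `stable_words`, and the sample hypotheses are hypothetical, and nothing here is claimed to be how any named product works.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest run of identical words at the start of both lists."""
    prefix: list[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def stable_words(hypotheses: list[str]) -> list[str]:
    """Words committed after seeing a sequence of partial ASR hypotheses.

    A word is committed once two consecutive hypotheses agree on it.
    Committed words are never retracted in this simplified sketch.
    """
    committed: list[str] = []
    previous: list[str] = []
    for hypothesis in hypotheses:
        current = hypothesis.split()
        agreed = common_prefix(previous, current)
        if len(agreed) > len(committed):
            committed = committed + agreed[len(committed):]
        previous = current
    return committed


if __name__ == "__main__":
    partials = [
        "I want",
        "I want to",
        "I want two tickets",
        "I want two tickets to Paris",
    ]
    print(stable_words(partials))  # ['I', 'want', 'two', 'tickets']
```

In this example the early guess "to" is later revised to "two" and is never committed, at the cost of holding each word back until a second hypothesis confirms it.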

What languages does modern voice translation support?

Commercial ASR and MT systems cover 100+ languages. Whisper Large V3 supports 99 languages for transcription. Meta’s NLLB-200 covers 200 languages for translation. Quality varies significantly: high-resource languages (English, Spanish, Mandarin, French, German) have excellent accuracy across all three pipeline stages, while low-resource languages typically have higher error rates. Voice cloning quality also drops for languages less well-represented in training data.

What is the difference between real-time voice translation and post-call transcription and translation?

Post-call transcription processes a complete recording after a call ends, producing high-accuracy results because the full audio context is available with no latency constraint. Real-time translation must produce results within a few hundred milliseconds using partial context, so accuracy is lower. The two serve different use cases: post-call for compliance, analytics, and coaching; real-time for live conversations between people speaking different languages.

Is ICT Innovations planning to add voice translation to its products?

Yes. ICT Innovations is currently developing real-time voice translation as an upcoming feature for its communications products, including its contact centre and broadcast platforms. The capability is in active development and has not yet been released. When it ships, it will allow agents and customers to communicate in different languages on live calls without a human interpreter. Updates will be announced on the ICT Innovations blog as development progresses.