12 May 2022

The Dubber AI Platform unlocks value from meetings by transforming conversations into instantly accessible and actionable business data.


Our process starts with Speech-to-Text (STT) technology to transcribe what was said, but it doesn’t end there. Beyond the convenience of an accurate transcript, complementary AI technologies post-process both the audio and the transcript to deliver substantially greater business value by detecting key moments, summarising the most important information, and discovering trends and insights.


Conversations are complex

Meetings and other conversations are among the most challenging domains for speech technology.


Siri and Alexa are just the beginning…

When people speak to devices, via voice assistants like Alexa or Siri, the interactions are simple. A single person utters a wake word and then issues a clear, succinct command or question. Any ensuing dialogue with the assistant is structured into clear speaking turns.


Real-world conversations are complex

However, when people speak to other people, the interactions are far more complex, because several people are speaking spontaneously rather than issuing prepared commands.

  • Spontaneous conversational speech lacks consistent grammatical structure, such as sentences and paragraphs, and people frequently interrupt or talk over each other. Conversations evolve in threads, head off on tangents, and rarely follow any preset script or agenda.

  • Due to the cognitive load of interacting in spontaneous conversational speech, speakers also display so-called disfluencies. Examples of disfluent speech include repeated words or syllables, false starts to phrases or sentences, corrections to mispronunciations, and the insertion of filler words that contribute no meaning, such as "um" or "like".

  • Other complexities can arise from acoustic conditions, specialised or informal vocabulary, non-native accents or poor pronunciation.


Tackling the complexity

While conversations may be complex, our goal at Dubber is simple: to deliver the most accurate and complete record of the conversation.  

How do we achieve this?

First, we achieve the highest possible speech-to-text accuracy by rigorously and continuously benchmarking various STT systems, and by insisting on delivering only the best output.

To ensure we deliver state-of-the-art STT performance without bias, we use an internal evaluation dataset consisting of: 

  • A balanced mix of languages

  • Regional dialects or accents

  • Male and female speakers

  • Variation in the number of speakers

We also control for different noise conditions and audio codecs.  
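As an illustration of what such benchmarking can look like, the sketch below computes word error rate (WER), a standard STT accuracy metric, and breaks it down by group so that gaps between accents, languages, or speaker types become visible. This is a minimal sketch under assumptions: the function names, the group labels, and the sample format are hypothetical and do not describe Dubber's internal tooling.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) tuples.

    The group label is a hypothetical tag such as an accent or language.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    # Pool errors and words per group before dividing, so long
    # recordings are not under-weighted.
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}
```

Running the same evaluation set through each candidate STT system and comparing the per-group figures is one simple way to check both overall accuracy and whether any group is served noticeably worse.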

Building on this, the transcript is enriched and structured using several complementary technologies, including:  

  • Voice Activity Detection, to identify sections with noise, low-quality audio, or overlapping speech, and mark up these regions in the transcript.

  • Speaker Turn Segmentation, to detect speaker changes and identify voices, organising the transcript according to who was speaking.

  • Fluency Detection, to filter out disfluent speech, which improves readability and helps readers focus on the most fluent and meaningful speech.

  • Topic Segmentation, to break the full conversation down into a series of topics or threads, each with a set of associated keywords.
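To make two of these steps concrete, the sketch below groups transcript segments into speaker turns and strips simple disfluencies such as filler words and immediate word repetitions. It is a deliberately crude, rule-based illustration: it assumes speaker labels are already available, and the `Segment` type, the `FILLERS` list, and both function names are hypothetical. A production system would more likely use trained models on the audio and text rather than word lists.

```python
import re
from dataclasses import dataclass

# Assumed filler list for illustration. "like" is left out on purpose,
# because it is often a meaningful word.
FILLERS = {"um", "uh", "er", "erm"}


@dataclass
class Segment:
    speaker: str
    text: str


def _bare(word: str) -> str:
    """Lower-case a word and strip punctuation for comparison."""
    return re.sub(r"[^\w']", "", word).lower()


def remove_disfluencies(text: str) -> str:
    """Drop filler words and immediate repetitions ("I I think")."""
    kept = []
    for word in text.split():
        bare = _bare(word)
        if bare in FILLERS:
            continue
        if kept and bare and bare == _bare(kept[-1]):
            continue  # repeated word; note this also collapses "very, very"
        kept.append(word)
    return " ".join(kept)


def merge_turns(segments):
    """Clean each segment and merge consecutive ones from the same speaker."""
    turns = []
    for seg in segments:
        cleaned = remove_disfluencies(seg.text)
        if not cleaned:
            continue  # the segment contained only fillers
        if turns and turns[-1].speaker == seg.speaker:
            turns[-1] = Segment(seg.speaker, turns[-1].text + " " + cleaned)
        else:
            turns.append(Segment(seg.speaker, cleaned))
    return turns
```

For example, the segments "Um, I I think we should" and "uh ship on Friday." from speaker A, followed by "Agreed." from speaker B, would come out as two turns: "I think we should ship on Friday." for A and "Agreed." for B.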


The result

By understanding and systematically addressing the complexities of conversational speech, we deliver much more than just a raw transcript of what was said.  

Our target is a rich record of the meeting that can be used to help get real work done.  Users can easily search for moments of interest, navigate to relevant parts for review, ensure efficient follow-up, and communicate decisions to colleagues.  

Ultimately, it’s our mission to ensure everyone gets the most value possible from the time they invest in meetings.

