From Audio Transcripts to Automatic Summaries

We've been talking to machines, websites and programs for a while now, but usually only in simple one-way conversations, at best. Applied correctly, however, AI speech recognition can positively change both how we humans communicate with machines and how we interact with each other. A quick look into the near future.

Game-changer “deep learning”

Automatic speech recognition (ASR), also known as “speech to text” technology, has been around for several decades, but for many years it produced little innovation worth mentioning. Recently, with the rise of modern AI approaches such as “deep learning”, this has changed abruptly. Thanks to them, ASR technology has made great gains in accuracy and efficiency, integrating grammar, syntax and the structure of audio and speech signals to better “understand” and process our language.


ASR works through a system of interacting programs and algorithms: pronunciation and acoustic models “hear” and recognize spoken language, while language models determine the most likely meaning. Using audio samples and transcriptions, the system learns to recognize and interpret more complex speech patterns, vocabulary and meaning. To do this, the ASR system must also be able to account for differences in accents, syntax and local expressions.
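The interplay between acoustic and language models can be sketched with a toy example. All candidate words, scores and bigram probabilities below are made up for illustration; real systems score thousands of hypotheses per second.

```python
# Toy sketch of acoustic/language model interplay (illustrative numbers only):
# the acoustic model proposes candidate words with log-probabilities, and a
# bigram language model helps pick the most plausible full transcript.

import math

# Acoustic model output: per time step, candidate words with (made-up) scores.
acoustic_candidates = [
    [("recognize", -0.4), ("wreck a nice", -0.9)],
    [("speech", -0.3), ("beach", -0.8)],
]

# Toy bigram language model: log-probabilities of word pairs (made-up values).
bigram_logprob = {
    ("<s>", "recognize"): -0.5, ("<s>", "wreck a nice"): -2.0,
    ("recognize", "speech"): -0.2, ("recognize", "beach"): -2.5,
    ("wreck a nice", "speech"): -1.0, ("wreck a nice", "beach"): -0.3,
}

def best_transcript(candidates, lm):
    """Exhaustively score every path: acoustic score + language model score."""
    best, best_score = None, -math.inf

    def expand(prefix, prev, score, step):
        nonlocal best, best_score
        if step == len(candidates):
            if score > best_score:
                best, best_score = list(prefix), score
            return
        for word, acoustic in candidates[step]:
            expand(prefix + [word], word,
                   score + acoustic + lm.get((prev, word), -5.0), step + 1)

    expand([], "<s>", 0.0, 0)
    return " ".join(best)

print(best_transcript(acoustic_candidates, bigram_logprob))  # recognize speech
```

Even though “beach” is acoustically plausible on its own, the language model makes “recognize speech” the most likely sequence overall.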


Conversational data is essential

In every industry today, AI speech recognition can be used to simplify processes, interactions and access. The first prerequisite for this is controlled data collection.


If you gather too much similar data and then switch from a generalist to a specialized model, for example, so-called “overfitting” can occur. “The algorithm then becomes very specialized in individual areas, which leads to a sudden decrease in performance in all others. To avoid such an unwanted scenario, we continuously adapt our entire system to the needs and use cases of our customers,” explains Michael Schramm, CTO and co-founder of Tucan.
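Overfitting is easy to demonstrate outside of speech recognition. The sketch below (not Tucan's actual pipeline) fits two polynomials to the same small, noisy training set: the flexible model matches the training points almost perfectly but does worse on data it has not seen.

```python
# Toy overfitting demo: a high-degree polynomial memorizes a small noisy
# training set and generalizes worse than a simpler model.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 10)

# Held-out test points between the training points (no noise, true curve).
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test)

def rmse(coeffs, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

simple = np.polyfit(x_train, y_train, 3)    # generalist: low-degree model
complex_ = np.polyfit(x_train, y_train, 9)  # overfit: interpolates the noise

print(f"degree 3: train {rmse(simple, x_train, y_train):.4f}, "
      f"test {rmse(simple, x_test, y_test):.4f}")
print(f"degree 9: train {rmse(complex_, x_train, y_train):.4f}, "
      f"test {rmse(complex_, x_test, y_test):.4f}")
```

The degree-9 fit drives its training error to nearly zero while its test error stays clearly above it, which is exactly the “very specialized in individual areas” failure mode described in the quote.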


Tucan’s AI is trained primarily with customer data. The system therefore continuously learns, with increasing usage, to better understand relevant speech flows and patterns. When collecting customer-specific data, we differentiate between two basic types of data sets: monologues and dialogues. As the terms suggest, they differ in the number of recorded speakers. Tucan focuses on conversations with multiple participants. “Our AI recognizes far more than two speakers. So far, we have been able to achieve positive results with up to 24 participants,” says Schramm. “For the best possible results, though, we recommend a maximum of 10 participants at this stage.”



To ensure the highest possible level of privacy and security, all the data is anonymized and processed with an in-house engine and exclusively stored on the company’s own servers in Frankfurt. This means that all customer data remains in Germany and never leaves the EU.


Once the data has been collected, it must first be transcribed. A complete speech dataset contains not only audio files but also transcriptions, which help the model learn to correctly identify words based on their sound. This combination is crucial for successful training.
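One common way to keep each audio file tied to its transcription is a JSON-lines manifest, one record per line. The file names and sentences below are made up, and this is a general ASR convention rather than Tucan's internal format.

```python
# Hypothetical JSON-lines manifest pairing audio files with transcripts,
# a common convention in ASR training (illustrative data, not Tucan's format).

import json

samples = [
    {"audio": "audio/meeting_001.wav", "duration_s": 12.4,
     "text": "good morning everyone let's get started"},
    {"audio": "audio/meeting_002.wav", "duration_s": 8.1,
     "text": "can you share the quarterly figures"},
]

# One JSON object per line keeps the audio/transcript pairing explicit.
manifest = "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)
print(manifest)

# Reading it back line by line recovers each pair for training.
pairs = [json.loads(line) for line in manifest.splitlines()]
```

Because every line is self-contained, such manifests are easy to shuffle, split into train/test sets, or filter by duration.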


All speech recognition systems essentially aim to achieve error rates similar to those of humans. MIT fellow Richard P. Lippmann estimated the human error rate in understanding words at about 4 percent in a 1996 study. So far, however, no computer has been able to replicate that result on a sustained basis.
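Figures like that 4 percent are word error rates (WER): the number of word substitutions, insertions and deletions needed to turn the system's output into the reference transcript, divided by the reference length. A minimal implementation:

```python
# Word error rate (WER): word-level edit distance divided by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("starts" -> "start") out of five reference words.
print(wer("the meeting starts at nine", "the meeting start at nine"))  # 0.2
```

A human-level transcriber would score around 0.04 on this metric over long, realistic recordings.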


“Our goal is to use speech recognition to effectively optimize professional communication, which takes on multiple forms and characteristics in practice,” Michael continues. “You therefore have to approach user behavior step by step and train the AI model under controlled circumstances to meet specific needs.” Thanks to this approach, Tucan already achieves a record-breaking accuracy of up to 96 percent for dialects, accents and idioms.


From sentiment analysis to “smart summaries”

While AI is already very good at taking verbatim minutes, it still struggles when it comes to summarizing content or recording results. Currently, Tucan’s focus lies on automatic content analysis, through which the AI recognizes text flows in team meetings, sales and customer calls, as well as in interviews of all kinds. The goal is to present a marketable solution for automatic summaries as soon as possible.


The development team puts a lot of effort into question-answering (QA) training: recognizing questions in a conversation along with their potential answers. Tucan’s machine learning experts are currently focusing on improving the model’s ability to distinguish between different entities and their types, such as specific days, people and places. Building on this, the next step will be the implementation of a sentiment analysis model, aiming to provide more (automatic) insight into the emotional aspects of meetings, interviews and other conversations.

So stay tuned! It pays off.


