Audio Parser & Editor

The audio parser is a system split into two parts. The first finds the end of every word in an audio file, given a corresponding transcript. The second is an editor for this system, so that a designer can edit the input variables to the parser and correct its output when it produces odd data.

The idea behind the system is to create a karaoke-style effect: while an audio file plays, the corresponding words in the text are highlighted. The system therefore needed to find and return timestamps matched against the transcript. It did not have to run in real time; it was more of an automation step that saves this data to whatever handler ends up using the timestamps.

When trying to find the exact point where a word ends, the most accurate approach would probably be some sort of speech-recognition software. We didn't have the time or means to integrate such software, so we decided to write our own algorithm.

To start things off, the program must find points where a word could end and then sort through them. But first it must identify candidate data points for where a word could end.

My first thought was to convert the sound file into a function and use some math to find its global and local minima. Although I think this would have been the most accurate algorithm, it would have been time-consuming to implement and slower to run, so we decided a faster approach was needed.

The algorithm I came up with instead looks for stretches where the data points stay very level with each other for longer periods of time. Such a plateau indicates a brief pause between when something is spoken and when it isn't. This approach was also easier to implement, which was favorable.



To make plateaus easier to identify, I first took a moving average of the audio data's amplitude to smooth the signal. Then I went through the data to see which data points were close in value. I added a public variable to the algorithm, minPlateauWidth, that dictates the minimum length of a plateau: when comparing how close in value data points are, the algorithm looks for a consecutive run of similar values of at least that size. So if it finds six consecutive data points with similar values, it registers them as a plateau (granted minPlateauWidth <= 6).
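The smoothing and plateau-detection steps above can be sketched as follows. This is in Python for illustration (the actual tool was written in C# for Unity), and the function name, smoothing window, and tolerance value are my own assumptions:

```python
def find_plateaus(samples, window=50, tolerance=0.01, min_plateau_width=6):
    """Smooth the amplitude with a moving average, then collect runs of
    near-equal values that are at least min_plateau_width samples long."""
    # Moving average over the absolute amplitude to smooth the signal.
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - window)
        chunk = samples[lo:i + 1]
        smoothed.append(sum(abs(s) for s in chunk) / len(chunk))

    plateaus = []   # (start_index, end_index) pairs
    run_start = 0
    for i in range(1, len(smoothed) + 1):
        # A run ends when the next value jumps away, or at the end of the data.
        if i == len(smoothed) or abs(smoothed[i] - smoothed[run_start]) > tolerance:
            if i - run_start >= min_plateau_width:
                plateaus.append((run_start, i - 1))
            run_start = i
    return plateaus
```

With `window=0` the smoothing step is a no-op, which makes the run-detection logic easy to check by hand on synthetic data.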

Once all plateaus are identified, the algorithm can begin to guess which plateau is the end of each word. For this I used the transcript, which I knew we would have access to, together with another variable I called 'timePerLetter'. timePerLetter estimates how many seconds it takes to say one letter, so the simple calculation 'timePerLetter * wordLength' gives the algorithm a rough estimate of where the word's end plateau should be located. It then searches for the nearest plateau and returns it, including both its start and end time, in case the consuming system wants to handle the data differently.
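A sketch of the matching step, again in Python. I'm assuming the plateaus are given as (start_time, end_time) pairs in seconds, that the estimate accumulates word by word from the end of the previously chosen plateau, and a default for time_per_letter; all of those details are my guesses:

```python
def estimate_word_ends(words, plateaus, time_per_letter=0.08):
    """For each word in the transcript, estimate where it should end from
    its length, then snap that estimate to the nearest detected plateau.
    plateaus: list of (start_seconds, end_seconds) pairs."""
    results = []
    cursor = 0.0  # running estimate of playback time
    for word in words:
        cursor += time_per_letter * len(word)  # rough estimate of word end
        # Pick the plateau whose midpoint lies closest to the estimate.
        best = min(plateaus, key=lambda p: abs((p[0] + p[1]) / 2 - cursor))
        results.append((word, best))  # keep both start and end of the plateau
        cursor = best[1]  # continue estimating from the chosen plateau's end
    return results
```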

Green - Identified Plateaus

Red - First Estimate (timePerLetter * wordLength)

Blue - Chosen Plateau

Editor

After the audio parsing algorithm was finished, we wanted to make an editor so designers could tweak the parsing values and save the timestamps to an object. Since we didn't know exactly how each implementation of this system would want to handle the output, we made an interface class that can be attached to a script, so that each implementation can handle the data locally.
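The shape of that handler interface could look roughly like this, sketched as a Python abstract base class (the real version was a C# interface attached to a Unity script, and all the names here are mine):

```python
from abc import ABC, abstractmethod

class TimestampHandler(ABC):
    """Each implementation decides locally what to do with the parsed data."""

    @abstractmethod
    def handle_timestamps(self, word_times):
        """word_times: list of (word, (start_seconds, end_seconds)) pairs."""

class CollectingHandler(TimestampHandler):
    """Example implementation that simply stores the timestamps."""

    def __init__(self):
        self.collected = []

    def handle_timestamps(self, word_times):
        self.collected.extend(word_times)

# Usage: the parser hands its output to whatever handler is attached.
handler = CollectingHandler()
handler.handle_timestamps([("hello", (0.0, 0.4))])
```

The point of the indirection is that the parser never needs to know whether the timestamps end up in a subtitle system, a karaoke display, or a saved asset.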



The editor was made from scratch using Unity's own IMGUI system. We wanted the display of the audio file to look similar to Unity's animation window. Here I exposed all the public variables that are used in the parsing algorithm.

Making a custom tool window in the Unity editor was a lot of fun and a new experience. The most interesting thing I got to use was reflection in C#: essentially finding a method by searching for its name, parameters, and so on. With reflection you can find and call a method even if it is private. It isn't something to use at run time, since it is slow, but it was essential for reaching some of Unity's behind-the-scenes audio functions. Handling the zooming and scrolling of the audio waveform was also a fun challenge.
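In C# this kind of lookup goes through System.Reflection (for example Type.GetMethod with BindingFlags.NonPublic). As a rough illustration of the general idea only, here is an analogous lookup-by-name in Python; the class and method are made up for the example:

```python
class AudioUtility:
    # Underscore-prefixed: "private" by convention, standing in for the
    # internal Unity methods the tool reached via C# reflection.
    def _get_waveform(self, path):
        return [0.1, 0.2, 0.1]  # placeholder for decoded sample data

# Look the method up by its name at run time instead of calling it directly.
method = getattr(AudioUtility, "_get_waveform")
samples = method(AudioUtility(), "clip.wav")
```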

The editor also features a 'DataView' where you can inspect the algorithm's data and see how it decides on the timestamps. This is useful when designers are changing the variables and want to see how they affect the output.