By Larry Magid
(scroll down for podcast audio & transcription)
Larry is now using Otter.ai to transcribe his CBS News Eye on Tech segments.
I don’t usually transcribe my podcasts, but this one is with Sam Liang, the CEO and co-founder of AISense, the company behind Otter, a free app for iOS, Android and the web that automatically transcribes conversations, panel discussions or any other audio that your phone or computer hears. It’s a pretty amazing piece of technology that could, someday, put stenographers and maybe even court reporters out of business. It even has the potential to be used for simultaneous translations if integrated with products like Google Translate. The app can also be used as a note taker and as a way to index and publish the transcripts of podcasts and broadcasts so that they can be picked up by search engines.
What follows (after the embedded podcast) is an edited transcript of my interview. It’s similar, but not identical, to the machine transcription. One reason is to make it more reader friendly; another is that, as impressive as it is, Otter’s automatic transcription isn’t perfect. It did make mistakes, but it transcribed most of the interview correctly. You can click here for the unedited automatic machine transcription on Otter’s site, which also allows you to listen as you read to see exactly what mistakes it made. The transcript was created by importing the MP3 file of our interview into Otter’s web app, but I was also able to make an identical transcript, in real time, by recording the conversation on my smartphone.
Edited transcript
Hi, I’m Larry Magid of CBS News. I had an opportunity to sit down with Sam Liang, who’s the co-founder and CEO of AISense, a company which has a free app for Android, iOS, and the web that will transcribe audio into text automatically, and in real time. In fact, if you want to follow along, or just read the transcript of this interview, you can do so at LarrysWorld.com, because I used Otter to transcribe both the raw interview in real time and the edited interview that you’re listening to.
We began by having him tell me what Otter is, and how it’s different from other speech recognition engines that you might know from the likes of Google, Amazon and Apple.
Sam Liang: Otter is a mobile app and a web application that transcribes human voice conversations. So this is very different from Siri, Alexa and Google Home. They handle a conversation between a human being and a robot. You can ask a short question like, what’s the weather tomorrow? And the robot will answer that question. However, Otter is doing something totally different. It listens to human-to-human conversations and transcribes the conversation in real time. You can also upload an old recording to Otter, and it will transcribe that as well. In addition, it’s able to recognize different speakers’ voices and separate them properly. It uses new technology called diarization and speaker ID. Diarization is a technology to separate one person’s voice from another speaker’s voice.
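For readers curious what diarization looks like in code, here is a minimal, hypothetical sketch using the open-source pyannote.audio library. It illustrates the general technique Sam describes, not Otter’s own engine, and the audio file name and access token are placeholders.

```python
# Hypothetical illustration of speaker diarization ("who spoke when"),
# using the open-source pyannote.audio library -- not Otter's engine.
from pyannote.audio import Pipeline

# "YOUR_HF_TOKEN" and "interview.wav" are placeholders.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN"
)

diarization = pipeline("interview.wav")

# Each turn gets an anonymous label (SPEAKER_00, SPEAKER_01, ...).
# Tagging one segment with a real name, as Sam describes, lets a system
# match that speaker's remaining segments to the same person.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```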
Larry: I would imagine, for example, that it knows your voice but has never met me before. How will it know, when it transcribes this podcast, whether you’re speaking or I’m speaking?
Sam: When it hears a new person, it doesn’t know the identity of the person, but it does know it’s a new person. So, at the end of the recording, you can tag a segment of the speech and tell it this is Larry, and the technology in the cloud will create a voice profile, or voiceprint, similar to your fingerprint, which can be used to match the rest of the recording.
Larry: Okay, so it’s similar to the way, for example, Google and Apple recognize faces.
Sam: Absolutely, it’s conceptually very similar to face recognition. Once you label a few faces, it’s able to remember them.
Larry: So if I were to record, let’s say, a debate between Hillary Clinton and Donald Trump, would it, because they’re famous, know their voices, for example?
Sam: It could. Because they have a lot of public speech recordings, we can easily download their speeches in advance, label their voices and create their voiceprints ahead of time. So, when they are engaging in a debate, the system is able to recognize their voices in real time as well. Although, right now, real-time speaker ID recognition is not available in the current product yet.
Larry: And what are some of the practical applications of the technology?
Sam: We see a very broad range of applications. Obviously, reporters do a lot of interviews, so this definitely helps them a lot. Traditionally, you have to spend $1 per minute for a human being to transcribe it for you, and the turnaround time can be slow. With Otter, you can get it instantly.
Larry: So we’re recording this in real time as we’re conducting this podcast, and I can actually see our words on the screen. But I’ve got hundreds of podcasts and broadcasts on my hard drive. Can I go back and retroactively transcribe those?
Sam: Absolutely. It’s very easy to upload an old podcast to the website. It’s called Otter.AI. Once you log in, there is an import button. You can use that button to import all your old podcasts.
Larry: As I recall, Dragon Systems (AKA Nuance), which does very good speech recognition, does well when you’re talking into it. But it hasn’t done that well, for example, when you load an old recording. Is that your understanding?
Sam: I think they can do old recordings too, but the accuracy is very different. Nuance has been doing this for 20 years, but they are actually lagging behind in the last few years because their technology is pretty old. The new technologies are all based on deep learning. That’s what we have created in the last few years.
Larry: Could you use it, for example, if you wanted to just write a letter and have it transcribed?
Sam: Yes, for that purpose it’s called dictation. Dictation is one person using voice to write a letter or email. Otter can definitely support that very well. However, Otter actually does something even more difficult than dictation. When people dictate they usually speak a little slower and more clearly. But when people are engaged in a conversation with another human being or several other speakers, they speak much faster.
Larry: Let’s say I’m riding in a car or I’m on an airplane and I have this great idea that I want to write down. I could just go ahead and load Otter, use it for that purpose and speak my idea, and I’d have a transcription? That’s not the main purpose, but you could use it that way.
Sam: We actually just saw a YouTube video yesterday. Somebody said, I’m using Otter to write a book. When they’re walking their dog, they’re actually writing a book using Otter.
Larry: Does it transcribe barking, if the dog is getting into the conversation?
Sam: Unfortunately, we’re not able to understand dogs barking yet. But eventually, with deep learning, you could figure that one out. When the dog barks, what does it mean? Is he hungry?
Transcribe meetings
Larry: There actually are people working on that very problem. But seriously, what’s unique about this, I think, is the fact that it does allow for a group conversation. So, for example, if I were at a conference and there was a panel and a number of people were speaking, I could essentially transcribe every one of those speakers on the panel. And ultimately, we would know who each speaker was; it would actually label their names once I trained it to do so.
Sam: Absolutely. We have done that many times already ourselves. In addition, we use the product in our own company. We record all our project meetings and marketing interviews, so we eat our own dog food. All our company meetings are in the Otter system.
Larry: And so you have a transcript of who said, what at all of your meetings?
Sam: Yes.
Larry: If somebody has a great idea and someone else takes credit for it, it’s not going to work.
Sam: Right. Sometimes people have different interpretations of somebody’s opinion, and we can always go back and listen to it again.
Larry: In other words, if for whatever reason you don’t trust the transcript, you can go back and get the actual audio?
Sam: Yes. Both the audio and the transcript are available, and they are synchronized word by word.
Larry: This operates on iOS and Android, and then there’s a web app, which I presume operates on all web browsers, Mac, Windows, whatever. And I did notice that, on the web, I could play back the audio while reading the transcript, which was really great. I actually uploaded an audio portion of one of my Eye on Tech broadcasts, and I found one or two mistakes, but when I listened to it, I could see exactly what I had said. Is that true if you listen to it on the phone as well?
Sam: It does the same thing on the phone. It works on iPhone and Android and, as you mentioned, in any web browser on a PC or Macintosh. We built the speech recognition engine ourselves; we’re not using Google technology.
Larry: So just to review, this is a free app that runs on iOS and Android; anybody can get it. Search for Otter in the App Store on iPhone or in the Google Play Store on Android, or go to Otter.AI. You can use it on the web as well. So, for example, if there’s a YouTube video, a television program or a Netflix show that you would like a transcription of, and you had the MP3 file, you could load it in. If you didn’t, I suppose you could run the Otter app and simply let it listen to the speaker and get a transcription.
Sam: Yes.
Privacy implications
Larry: Now, I have to ask you about privacy. You used to work for Google, so you understand the complexity. In fact, you used to work on Google Maps, and that’s an example of a wonderful product that I use every day, which has enormous privacy implications. It strikes me that this product is also quite useful, but it brings up some interesting privacy implications as well.
Sam: Yes, we definitely take privacy extremely seriously. We see this as a personal tool, and the user owns the data. Whenever the user wants to delete the data, we erase everything. We are absolutely not going to sell the data for advertisements. We have a freemium model, so we make money from user subscriptions and from enterprises, so we don’t need to sell the data, and we don’t want to.
Larry: Speaking of the law, state law varies as to whether you’re allowed to record without telling the other person. I think in New York it is legal (I know this because of the Michael Cohen case), but in California you’re required to disclose it. Either way, it might be useful to have a transcript of your phone calls. Is that a possibility with this technology?
Sam: It can. You can record phone calls, as long as you tell the other guy you’re doing it.
Larry: Does the app allow that on both iOS and Android?
Sam: On iOS, you cannot record a phone call on the same iPhone. You could use another phone or use a PC or MacBook to record the phone call when you have it on speakerphone. It’s technically possible to record a phone call on the same Android phone.
Larry: You and I have different accents, but of course there are many other accents. Is that an issue for some speakers? Is it harder for Otter to understand their voices?
Sam: It does make the speech recognition engine more difficult to build. However, with the deep learning technology we are building, we collected a lot of speech with different accents and trained the engine to map different accents to English words. Specifically, here in Silicon Valley, there are a lot of Indian engineers and Chinese engineers, so we did a lot of enhancement and training for Chinese and Indian accents. But in addition to that, there are accents from Britain, from Australia, even in southern America. People have different accents. There is a lot of public speech on the internet, and we use that to train the speech recognition.
Larry: And of course, also people with speech impediments. I imagine you can eventually get to the point where you could transcribe virtually any speech. Right?
Sam: Right. But if somebody has a very different pronunciation, pace or style, they could do personal training as well. Right now, it’s for the general public.
Larry: And the other thing that excites me about this technology is simultaneous translation. I go to a lot of conferences that are sponsored by the UN, and they have very highly skilled people who not only translate in real time but also put it up on the screen in real time. I’m always amazed how they can do that. But I presume we could get to a point where I could be speaking in English and my words could be in French or Chinese or Spanish or Arabic on a screen in real time as I’m speaking.
Sam: Yes, absolutely. We already transcribe the sound into English words, and we can easily use another API to translate it into Spanish or Mandarin or Japanese and show that in real time as well.
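As an illustration of the kind of chaining Sam is describing, here is a minimal, hypothetical sketch that sends an already-transcribed English sentence to a separate translation service, in this case the Google Cloud Translation API. It is not Otter’s implementation, and it assumes Google Cloud credentials are already configured.

```python
# Hypothetical sketch: translate an English transcript line with a separate
# translation API (Google Cloud Translation v2). Not Otter's implementation.
from google.cloud import translate_v2 as translate

client = translate.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set

transcript_line = "Otter transcribes conversations in real time."
result = client.translate(transcript_line, target_language="es")

# "translatedText" holds the Spanish rendering of the English line.
print(result["translatedText"])
```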
Larry: Well, Sam, speaking of time, we’ve run out of it. This is great. And, as I mentioned, there’s a transcript of this entire conversation at LarrysWorld.com, so if you go to LarrysWorld.com, you’ll find both the audio and a transcript of the conversation.
Larry: Sam, thank you very much.
Sam: Thank you, Larry.
Note: The preceding transcript has been edited for transcription errors and to improve readability. Click here for the actual machine transcription from Otter.AI.