The Future is Speaking: Unveiling the Power of Voice Technology with Tobias Dengel
Nov. 22, 2023


In this episode of This Anthro Life, Tobias Dengel, a leading expert in digital transformation, explores the transformative power of voice technology. Dengel discusses how integrating voice tech into existing platforms offers a safer approach, emphasizing the familiarity and trust already established. The conversation extends to the increasing presence of voice assistants like Alexa and Siri in our daily lives. Dengel also highlights the role of voice technology in enhancing security measures in banking applications. Tune in for captivating insights into how voice technology is revolutionizing our interactions with devices and services, envisioning a future where our voices effortlessly transform the way we navigate our surroundings.


What new possibilities do you see emerging with voice technology, and how might it influence our interactions with businesses and services in the future? What if your voice could transform the way we interact with technology? Imagine a world where you can effortlessly speak to your devices and transform the way you navigate your surroundings.

In this episode of This Anthro Life, we explore the world of future technology with guest Tobias Dengel, a leading expert in digital transformation, and discuss the power of voice technology and its potential to transform how we interact with devices and the world around us. Dengel explains why integrating voice technology into existing platforms is perceived as a safer approach than building entirely new platforms from scratch, emphasizing the importance of leveraging the familiarity and trust users have already established with those platforms to enable a smoother transition. He also delves into the widespread adoption of voice assistants such as Alexa and Siri, highlighting their increasing presence in our daily lives. The discussion extends to the role of voice technology in banking applications, where it plays a crucial role in enhancing security and making our lives safer. Tune in to discover Dengel's insights as we envision the transformative power of voice technology.


Key takeaways:

  • Voice technology is evolving and becoming increasingly sophisticated, with the adoption of voice assistants like Alexa and Siri skyrocketing.
  • Adding voice to existing platforms feels safer than creating new ones altogether, as users are already familiar with the platform and trust it.
  • Voice technology solves the problem of faster communication, as humans speak three times faster than they type.
  • The interface of voice technology needs to be redesigned to be more efficient, as listening to machines is slower than reading or interacting with visuals.
  • The more human-like voice assistants become, the less users trust them, as they feel like they are being tricked.
  • Multimodality is important in voice technology, as it allows for a combination of voice, visuals, and other forms of communication to enhance the user experience.
  • Voice technology has applications in various industries, such as law enforcement, warehouses, retail, and safety in industrial settings.
  • The combination of generative AI and conversational AI is where the magic happens in voice technology, allowing for more accurate interpretation and response.
  • Conversational designers will play a crucial role in designing effective voice experiences, considering factors like speed, efficiency, and user preferences.
  • Voice technology has the potential to reshape business processes and models, such as centralized restaurants, telemedicine, and global healthcare access.


Timestamps:

00:00:07 Voice technology is evolving.
00:05:14 Design voice experiences in multimodal.
00:09:37 Voice is a powerful interface.
00:18:38 Conversational AI and generative AI.
00:21:08 Context is crucial for conversation.
00:28:31 The blend of generative AI and conversational AI is creating a user experience breakthrough.
00:29:15 Voice experiences are becoming multimodal.
00:34:06 Voice technology revolutionizes business processes.
00:39:03 The future of technology is voice-based.
00:43:30 Spread anthropological thinking to audiences.


Tobias Dengel is a seasoned technology executive with over 20 years of experience in mobility, digital media, and interactive marketing. He currently holds the position of President at WillowTree, a TELUS International Company, a global leader in digital product design and development. Dengel's expertise and leadership have contributed to WillowTree's continuous growth and recognition as one of America's fastest-growing companies, as listed by Inc. magazine for 11 consecutive years. He is also the author of the book "The Sound of the Future: The Coming Age of AI-Enabled Voice Technology," where he explores the transformative potential of voice technology in various aspects of business and society.


About This Anthro Life
This Anthro Life is a thought-provoking podcast that explores the human side of technology, culture, and business. Hosted by Adam Gamwell, we unravel fascinating narratives and connect them to the wider context of our lives. Tune in to https://thisanthrolife.org and subscribe to our Substack at https://thisanthrolife.substack.com for more captivating episodes and engaging content.

Connect with Tobias Dengel
Linkedin: https://www.linkedin.com/in/tobiasdengel/
Twitter: https://x.com/TobiasDengel?s=20
Website: https://www.tobiasdengel.com/
Facebook: https://www.facebook.com/tobias.denge.7/

Connect with This Anthro Life:
Instagram: https://www.instagram.com/thisanthrolife/
Facebook: https://www.facebook.com/thisanthrolife
LinkedIn: https://www.linkedin.com/company/this-anthro-life-podcast/
This Anthro Life website: https://www.thisanthrolife.org/
Substack blog: https://thisanthrolife.substack.com

Transcript

Adam:

Welcome to This Anthro Life, the podcast that explores the fascinating and sometimes unexpected ways that humans navigate the world. I'm your host Adam Gamwell. Today we're diving into the world of future technology. You know, the kind of stuff that makes us dream about flying cars and robot butlers. It's a world that's constantly evolving and sometimes can be hard to keep up. So imagine this. We're in the future where your voice has the power to transform the way that we interact with all of our devices and the world around us. It's a world where you can simply speak your thoughts and have technology respond effortlessly, carrying out your requests. Sounds incredible and probably also somewhat familiar, right? Well, today's guest knows a thing or two about this fascinating world. Tobias Dengel is a leading expert in digital transformation and he's here to shed light on the sound of the future. And trust me, it's not what you might expect. Tobias has spent his career helping businesses seamlessly transition into the digital age. From the early days of the internet to the mobile revolution, he's witnessed firsthand how technology can change the game. But what really caught my attention was his bold claim that adding voice to existing platforms feels safer than creating new ones altogether. That got me thinking, how can voice tech shape our lives? You know, with voice technology becoming increasingly sophisticated, the adoption of voice assistants like Alexa and Siri is skyrocketing. But what about the voice behind your banking app? Ever talked to that? Does it feel safer to trust a voice that you're already familiar with? Or what technologies actually make our lives safer and how and in what ways should we be adopting them? In other words, adding voice technologies to existing platforms raises a key question. How do we ensure that our interactions with voice technology are both beneficial and secure? So Tobias joins us today to explore these questions and so much more. So buckle up and get ready. It's time to dive into the fascinating world of voice technology with Tobias Dengel.

Tobias:

Mobile was this huge wave, 2008, 9, 10, 11. And then since then, there've been three or four false positives, let's say. If you were at the Consumer Electronics Show in Vegas in 2013, everything was about 3D TV. And now none of us have a 3D TV, or very few of us anyway. Then it was about AR and VR and Google Glass, and that hasn't really taken off in any kind of meaningful way except very specific industrial uses. And it might change now with the Apple Vision Pro, and I think it will, but it's taken a very long time. Then we had self-driving cars, 2016, 2017. The prediction was that by 2023, today, half of the cars in the United States would be self-driving, and it's basically zero. And so there've been all these false positives of what's next. We started thinking hard about voice because there's this huge proliferation of tools like Siri and Alexa. And we wanted to take a step back and say, why do people want voice? What problems does this technology solve? Right? And then how do you design the user interface to take care of that? As we studied it, what became apparent real quick is the reason users want voice is because we speak so much faster than we type. We speak three times as fast. And so that's the baseline use case. And then there's all these ancillary use cases around how it helps with illiteracy and there's all these other things, but that's the baseline use case. The problem with that use case is this: whenever a new technology gets implemented, we tend to view it through the same lens that we viewed the old technology. And so early TV shows were basically videos of radio shows or videos of plays. I mean, it took a long time to evolve to reality television as the highest and best use of that medium. What we found about voice is we are so used to programming things in a single interface that when voice came out, we as the technology community did the same thing. But it turns out voice is a really crappy way for machines to communicate with us because it's so slow. We can read so much faster than we can listen, or we can see a graphic, or we can interact with something. And so the whole interface was wrong. We want to speak to machines because it's faster, but we don't want to listen to machines. And you study this stuff, but there's so many ways that this manifests itself. But one of the most obvious is that we call these things smart speakers, which implies that we're listening to them. They should really be smart mics, because we don't want to listen to Siri respond to us. What we want Siri to do is actually do something or give us a visual back. So the example I always use to cut to the chase here is that it really is a crappy experience to ask Siri or Alexa what movies are playing tonight at Regal Cinemas and listen to them respond with four showtimes each. That's the phone tree going back 20 years. What we want is to say what movies are playing, see those movies on your app, and then say, awesome, get me three tickets for Star Wars at 7 p.m., and we're already authenticated and it's all this seamless multimodal experience. Once you get there, voice will be a massive breakthrough in conversational AI, but it will be in a multimodal world. It generally won't be in a voice-to-voice world. And that's where we've kind of gone wrong.
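To make the "smart mic" pattern concrete, here is a minimal sketch of the movie-ticket example Dengel describes: the user speaks, but the system answers by rendering a screen or just doing the thing rather than talking back. Every type and function name here (Intent, MultimodalResponse, parse, respond) is a hypothetical illustration, not any real assistant or ticketing API.

```typescript
// Hypothetical multimodal turn: voice is the input, the screen (or an action) is the output.
type Intent =
  | { kind: "listShowtimes"; theater: string }
  | { kind: "buyTickets"; movie: string; time: string; count: number };

interface MultimodalResponse {
  speech?: string;     // kept short or omitted: we don't want to listen to machines
  screen?: string[];   // what to render visually
  action?: () => void; // something the app just does
}

// Pretend intent parser; in practice this is where speech-to-text plus a language model would sit.
function parse(utterance: string): Intent {
  if (/what movies/i.test(utterance)) {
    return { kind: "listShowtimes", theater: "Regal" };
  }
  return { kind: "buyTickets", movie: "Star Wars", time: "7 p.m.", count: 3 };
}

function respond(intent: Intent): MultimodalResponse {
  switch (intent.kind) {
    case "listShowtimes":
      // A voice-to-voice assistant would read four showtimes aloud; multimodal just shows them.
      return { screen: ["Star Wars 7:00", "Star Wars 9:30", "Dune 6:45", "Dune 9:15"] };
    case "buyTickets":
      return {
        screen: [`Confirmed: ${intent.count} tickets, ${intent.movie}, ${intent.time}`],
        action: () => console.log("charging stored payment method (already authenticated)"),
      };
  }
}

// "What movies are playing tonight?" -> a list appears on screen, nothing is read aloud.
const first = respond(parse("what movies are playing tonight at Regal"));
console.log(first.screen);
// "Get me three tickets for Star Wars at 7 p.m." -> the app just does it and confirms visually.
const second = respond(parse("get me three tickets for Star Wars at 7 p.m."));
console.log(second.screen);
second.action?.();
```

The design choice the sketch highlights is that speech never appears on the response side unless there is a reason for it; the "answer" is a screen update or an action.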

Adam:

That's really interesting. So it's kind of like we've approached, even to your point, like we've kind of approached the idea of voice with Alexa or Google as backwards, where it's the smart speaker that we're selling. But really it's, we don't want something that's smart talking to us. We want something that's smart at listening to us. Oh, so that's really interesting, right? So it's almost like we flip it and say, we're the input, we're the interface, I guess, right? We're the voice that we're trying to get through to the machine to have it accomplish tasks for us versus it's really smart at telling me the weather and that's it, right?

Tobias:

A hundred percent. That's accurate. And then there's all these ancillary pieces, right, as we're studying this. We started really getting into trust, and do we trust Alexa or Siri or any of these voices? And what we started to figure out is that the more human-like these voices are, the less we trust them, provided that we understand that they're not actually human. And I don't know if you've ever come across this theory of the uncanny valley. It was developed in the 1970s in Japan, originally about robots and humanoids. But basically the thesis was, and it's been proven out in multiple experiments, that the more human-like you make something that isn't human, the more freaked out and less trusting we become. And so, as we've gone down this road of trying to make these assistants seem like humans, we've actually, as designers, shot ourselves in the foot from a trust perspective, because the user intuitively knows they're not human and feels like we're trying to trick them. And when that trust level is broken, it's really, really hard to recover. Whereas if we're designing these things in a multimodal way, then the end user never thinks it's human, just judges it as a machine, and trust is actually much higher. And so that's another, from a design perspective and a human behavioral perspective, I think, really interesting path that we've discovered over the last three, four years as we've gone down this voice journey.

 

Adam:

There's something that's interesting too, as we think about the consumer level kind of voice assistance or tools that we see so far, and we've seen a proliferation of them, obviously with smart speakers, but then more and more apps have been adopting voice elements into their UIs and user interfaces and just parts of how we can interact with them. And it's one of the pieces that you write about in the book, I thought that was really interesting too, is that on one level, we don't want the voice technology that can do everything because we see huge problems. We can kind of dive into chat GPT and one of the issues around trying to be able to do everything. But then to your point about trust too, is this, this issue that I don't, you know, I, I feel a little bit safer, I guess, as a user, if I know there's some parameters around what, what the voice, what my voice could do with it, I think it is part of it. And so I wonder if you can break that down too, because I think there is this interesting piece about having the right kinds of constraints when we're designing new technologies and new products, right? Of like, how do I know what I can do with, with an app? You know, if I'm using my voice versus typing versus reading with it. But then also there's parameters on what I expect it to do. I don't think my bank app is going to book me a movie ticket, at least not yet anyway. But that's an interesting question too of like, do we feel safer when there's these kinds of parameters also when using new tech?

Tobias:

Yeah, so I think you've hit on a couple of things that are really important, and another constraint that's existed in the voice-only ecosystem. So this concept of discoverability and knowing what's out there and what those things can do. Like one of the problems we've really had with Siri and Alexa is we don't know which of our apps actually are interfacing with it or how it's set up or what it's doing. It just is this amorphous giant assistant. And so our thesis is that the entry point for most of these voice experiences is actually going to be the app related to a brand that you're already working with, because then you know what the context is, right? You're not going to ask your airline app for your bank balance. And when you open your airline app, which is one tap, you're already pre-authenticated. You can start your voice interaction, whereas if you start with Siri or Alexa and you're like, hey, open my American Airlines voice experience, and then you're trying to interface with that, it's very complicated. Whereas if you open your American Airlines app and you say, hey, I need to change my flight, and the app starts doing things for you, it's just so much more powerful an interface, and it's related to that exact concept of discoverability. Inherently, if you open an app and you start talking to it, you kind of know roughly what it might or might not do. The big advantage, you might ask, well, what the hell, I can do some of this stuff by tapping and swiping and blah, blah, blah. A, that's much more time-consuming, but B, as soon as you get into the longer-tail use cases, that's where voice also really shows its power. The average banking app, as an example right now, has about 300 functions that it does for you, but you can't organize 300 functions on a tiny little smartphone screen in any kind of findable way. It's just an impossible task. So if I ask you to do something, Adam, that you don't do very often with your bank, like reorder checks in the app right now, your heart rate will probably go up because you're like, this is going to be a giant pain in my ass. Whereas if I could just say, reorder 100 checks, and then you get confirmation on the screen, that's like a perfect long-tail use case: something you don't do very often, but that a voice-powered app is going to be perfect for. And you're probably not going to ask Siri to do that, right? Because you don't even know what's happening if you say, hey, Siri, ask Bank of America to order me a hundred checks. That just is not a use case that any of us trust as secure.
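A rough sketch of the long-tail routing idea: hundreds of app functions that could never all fit in a menu can still be reachable by name through a voice intent, with the confirmation shown on screen. The registry, the naive keyword matcher, and the function names below are invented for illustration; a real system would sit an NLU model or LLM where the matcher is.

```typescript
// Hypothetical registry standing in for the ~300 functions a banking app exposes.
type BankAction = (args: Record<string, string>) => string;

const registry = new Map<string, BankAction>([
  ["check balance", () => "Checking balance: $1,240.18"],
  ["transfer money", (a) => `Transferring ${a.quantity} dollars`],
  // ...long tail of rarely used functions that would otherwise be buried in menus:
  ["reorder checks", (a) => `Order placed for ${a.quantity} checks`],
  ["dispute charge", () => "Dispute opened for your latest transaction"],
]);

// Deliberately naive matcher: every word of the registry key must appear in the utterance.
function route(utterance: string): string {
  const text = utterance.toLowerCase();
  for (const [name, action] of registry) {
    if (name.split(" ").every((word) => text.includes(word))) {
      const quantity = utterance.match(/\d+/)?.[0] ?? "1";
      // Respond on screen rather than by voice: show the confirmation, don't read it aloud.
      return `ON SCREEN: ${action({ quantity })}`;
    }
  }
  return "ON SCREEN: Sorry, I didn't find that. Here are some things you can ask for...";
}

console.log(route("Reorder 100 checks please"));
// -> ON SCREEN: Order placed for 100 checks
```

The point of the sketch is discoverability by speech: the user names the long-tail task instead of hunting for it, and the app's own context (already authenticated, already scoped to banking) keeps the interaction bounded.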

Adam:

That's interesting too, because I mean, that gets into the question of third party apps and third party kind of crossovers in terms of, you know, what am I using Apple products to then go to my bank to then order checks, right? And it has my address, it has my bank information, it has, you know, blah, blah, blah. I mean, to your trust point above too, I think that's really interesting where, especially as we think about long tail use cases, which I think is an interesting idea too, of things that we don't do very often, right? But that will still have some use at some point and they're going to have a use for a long time. As much as we don't use checks, they're not gone yet for whatever reason, right? We still need them. I don't remember the last time I ordered a check through my bank, but you know, we need them every once in a while, right? I know plenty of folks that have to like still kind of check for a landlord or something, you know, and it's not, some of them use Venmo now, but right, you know, but typically it's like, you still need these kinds of cases that haven't gotten yet. But an important point that like, they still exist, right? And knowing that I can trust my banking app to be able to do that, and I would rather use Bank of America or Santander or whatever, you know, banking app versus going through this third party, you know, through Alexa or Google or something else that feels like this kind of, you know, what pieces am I connecting there? And it's funny because even today, I still, you can sign in through a service with Apple or Google or something, right? You know, kind of single sign on. Still always, I'm like, do I click that box that you can select all my emails or my, my calendar to do whatever, still kind of gives me that pause question in terms of trust, even though everything else is like on Google and Apple. So there is this interesting kind of point of whenever we're trying to adopt something new, do I feel safe? And so I kind of already feel safe if I know the banking app, right? And so if you add voice to it, versus a whole new kind of third way of connecting things like a Zapier, or like a make kind of way of connecting training together different apps. It feels safer to kind of go in one ecosystem. And I think that's an interesting question I'm trying to think about in terms of how we both build, and you write about this a little too, where there's like the consumer adoption side of voice tech. There's also the business application processes and modeling that we can do with it too. And As we think about if we're adding more voice, you noted up top too that your clients are kind of asking always, what's next? And so we're seeing voice become more and more part of that conversation in more sophisticated ways. And so as we think about that, is this happening, I mean, do you see this happening more on the consumer side, more on the business side? Is it kind of happening at the same time in terms of pace of change?

Tobias:

It's happening at the same time, but the use cases and the approach is quite different. I think on the consumer side, it's all about speed and convenience for the most part. On the industrial side or the commercial side, there are more use cases, but they're varied. So one use case is anyone who's working, there's this concept, this industry now of Heads up, hands on. So, if you're doing something that requires your eyes, your attention, and your hands, voice is the perfect interface because today you have to take your hands off and interface with the screen, whatever. And so, law enforcement is an obvious one that is easy for us to imagine where voice commands and it's already happening. In fact, across emergency services, the book actually ends with an example of a fire department that has adopted voice because they're coordinating with each other real-time using voice tech in the middle of events and it just massively improves their communication in ways that historically have been very difficult to do. And so there's that concept and that also applies to warehouses, right? People can be stocking stuff, et cetera, and then using voice commands or retail. Then there's a whole category of implementations around safety where historically, right, most major industrial accidents from Deepwater Horizon to the 737 MAX Boeing incidents ultimately were driven by the fact that humans and machines couldn't communicate effectively with each other. those 737 MAX planes were crashing because the pilots didn't know what the plane was doing. And they couldn't unlock it. And if they have voice commands that say, turn off the autopilot, they would have been able to stop that instantly. And instead, they're working with all kinds of controls that are typing on screens, et cetera. And it's interesting that both the US Air Force and The Russian Air Force actually are pretty advanced right now in the voice controlled cockpit for all those reasons. And so, the industrial applications are varied. And then ultimately, anyone who's using a keyboard at work, if you were to use voice, you are 50 to 70% faster than typing. A lot of us still spend big chunks of our day on a keyboard. And so those are kind of the categories on the industrial side.

Adam:

It's interesting too, and such an important piece. That's something that I enjoyed as a thread throughout the book too, this idea of heads up, hands on, where there are moments in which split seconds can make a difference. And the fact of, what am I able to do with my physical self? Can I put my hands on something? Is my head looking up or down? What can I pay attention to in order to make a split-second decision? It's such an interesting and important use case for us to think about, especially when it comes to emergency services. And you had a couple of examples in there too, like when there's a car accident and someone can call out and say, Siri, call 911 or whatever it is. And that was an example of how far the tech needs to go, but then also it was great that that worked, that it was able to pick that up and eventually call 911 while someone was trying to call some friends, I think, was the example that you talked about in an early chapter. And this interesting idea that when accuracy counts, it really makes a difference. And that kind of gives an interesting direction for how we're pushing forward with the technology as these use cases emerge. And I guess emergency feels like it can be in between business and consumer, obviously, because we can all have emergencies and need to be able to deal with those split-second changes. And then part of that too is, how important is accuracy in that? And so, I mean, again, I know it feels like the lame example, but since many of us have used an Alexa or a Google, we understand when you ask it a question and it just answers the wrong thing, or has no idea what you're talking about and says it can't find that exact Spotify song and artist that you said, and you have no idea why it doesn't do that. So there's a question of accuracy in terms of how well it understands you. And so I'm curious to get your thoughts on this, and how we're moving towards higher than 95% accuracy, but that's not good enough for an emergency scenario. For a regular use case of trying to order ice cream or something, it might be okay. So let's think about that idea in terms of accuracy and development. How have you seen those changes happening over time?

Tobias:

So, what's interesting is it takes a long time to write a book and get it published. So we started working on this in about late 2020, early 21. And then in November of 22, all of a sudden, ChatGPT 3.5 comes out, and we're kind of watching, over the next month, the world explode with all the ChatGPT stuff. And I was like, huh, did we kind of completely miss the boat on conversational? Because this book is primarily about conversational AI, not generative AI. But in the book already, we'd been talking about how generative AI is the engine that powers conversational AI. And I think that's ultimately gotten us super excited, because the marriage of these two pieces of the AI ecosystem is where all the magic is going to happen. And the way all these conversational AI tools get really, really good at accuracy and interpreting what you're saying is by the use of generative AI. So if you think about how the technology works, there's the transcription piece and then there's the analysis piece. The transcription basically translates what you say into words on the screen. And that's, I don't know, 98 to 99% accurate. Again, if you're more strongly accented or from certain regions, it might be less so. This is in English, obviously. But it's getting better all the time, and we intuitively know that it's going to get really, really good. What's also interesting there is that I think really good voice experiences show you what you're saying while you're saying it. Even if you're trying to ask Spotify for a song, it's much more effective if you can see the words you're saying, so you know, oh, wait a second, they misinterpreted the song I asked for, and then you know what went wrong. What's frustrating is what you just described: you said it right, it transcribed it right, but it still can't find the song. That's going to get fixed by generative AI and these large language models, because that's where the breakthrough is for them: they're so much better at interpreting and responding than any technology we've had before. So the marriage of these two is going to be the sweet spot, and why I'm so excited about the timing of this book, because I think now everyone can see it. Whereas we were talking about this two years ago and people were like, I don't know what you're talking about. Now they're seeing voice show up in the apps, obviously playing with ChatGPT and a bunch of other generative AI. People can see all this coming together and are saying, hey, when these generative AI experiences are voice-powered, that's, I mean, that's when we get to the Jetsons, basically. Oh yeah, that's super cool.
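Here is a schematic of the two-stage pipeline described above: a transcription stage that turns audio into text and echoes it to the screen so the user can see what was heard, followed by an interpretation stage where a generative model maps the text to something actionable. Both stages are stubs under stated assumptions; no real speech or LLM API is being invoked, and the function names are made up.

```typescript
// Stage 1: transcription (speech-to-text). Stubbed; a real system would stream audio
// to an ASR engine and get back transcripts that are roughly 98 to 99% accurate.
async function transcribe(audioChunk: Uint8Array): Promise<string> {
  void audioChunk; // unused in the stub
  return "play the new song by boygenius"; // pretend recognized text
}

// Stage 2: interpretation. Stubbed; this is where a large language model would turn
// free-form text into a structured request, which is where older systems fell down.
interface MusicRequest {
  action: "play";
  query: string;
}
async function interpret(transcript: string): Promise<MusicRequest> {
  return { action: "play", query: transcript.replace(/^play /i, "") };
}

async function handleUtterance(audio: Uint8Array): Promise<void> {
  const transcript = await transcribe(audio);
  // Show the words as they are recognized, so the user can see *what* went wrong
  // if anything does: was it misheard (stage 1) or misunderstood (stage 2)?
  console.log(`ON SCREEN: "${transcript}"`);
  const request = await interpret(transcript);
  console.log(`ACTION: ${request.action} -> ${request.query}`);
}

handleUtterance(new Uint8Array());
```

Separating the two stages is what makes the "show the transcript" design useful: if the text on screen is right but the result is wrong, the failure is in interpretation, which is exactly the part large language models are improving.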

Adam:

And I agree too. It's interesting to see that; to me, what you're writing also felt right on the spot, because it is interesting that you wrote it before ChatGPT blew up on the scene. But yeah, you can very much see how we're getting this interesting marriage of the two, as you write. I mean, one of the things that was interesting about that too, that sat with me, and this may just be because I'm an anthropologist nerd, is at one point you noted that, especially with generative AI, there's a power in having it remember your conversations over time, right? Or at least remembering some of the context that you're talking about. And there's a piece, I don't know if this is a stat that you put in there or something that you found through your research, but basically that social context, we need that if we're going to have more than three sentences of ongoing conversation. After that point, if AI is just responding to you, it feels very weird: what's the weather, what should I wear outside today, and then you ask something else about food you might get, and it has no relationship to the weather, and it says go eat something very spicy and hot, and you're like, wait a minute, shouldn't I eat something cool? Or whatever it is. So there's this interesting point that you noted, that social context becomes really important the longer our conversations go. We know that intuitively as humans, right? But it's taken a ton of research and work to be able to do that in a gen AI space. And so I'd love to get your thoughts about this idea: how can we think about the importance of the role of social context in our conversations with gen AI, and also voice, in terms of how we can feel like, I don't know if it knows us, but just having this idea that it has a wider context to be able to respond to us in appropriate ways that make us want to use it.

Tobias:

That's still in its infancy, right? But one of the breakthroughs of ChatGPT 3.5 and especially 4 is how much context it can remember. Now, if you use these tools heavily, I think you realize that it starts to erode. Historically it was two or three turns of conversation prior to GPT-3.5, but now it might be 15 or 20 turns before it starts to kind of lose track of the full history, right? And I think that's the place where there's going to be so much investment: how to keep these contexts for much longer, and how to keep the concept of Adam alive so that you're not starting over, because right now it's really just tracking your most recent conversation. It might have some parameters around preferences, et cetera, but it doesn't really, really know your full history. Now, again, what's the main application for this? The main application is these big all-serving assistants or tools. I still think that's more of a future state where those are super helpful. What's more important in the near term is the practical. I mean, what we joke about a lot is ChatGPT can do everything, but it can't even change my mailing address for American Express. And so it's about switching these LLMs from being knowledge help and content creation help to actually doing things. In that space, Alexa and Siri are a little ahead, right? Because you can actually get Alexa and Siri to do something, whereas ChatGPT can't do very much at all because it hasn't been wired into anything. And so again, these two things are going to come together. And then all of a sudden, ChatGPT will be able to change your Amex mailing address. The all-knowing bots or the all-knowing assistants will get more and more.

 

Adam:

 

Yeah, that's a really interesting point. This is something that I've talked with Rob Wilson at OneReach and UXMag about this too. And the idea of a conversational UI on the one level is one that we can speak to and have it work through voice. But then the other side of it is actually, it might push us to think more functionally about what we want software to do instead of saying, I'm going to go jump on my Alexa, or I'm going to hop into chat GPT to ask you to do X, Y, Z. It's more like, I need to change my address. I'm going to use whatever devices around me to then, you know, it's going to then auto route itself through these, the pathways to do it. I mean, I think it comes to the question of trust that we said before too, but it's an interesting question too, in terms of the multi modality of this, right? There's something else that I think is really important. And you mentioned this a few times too. When we say that, if folks are not familiar with what that means, how do we think about the multimodality? Because voice, we can think of it as one modality of how we're interacting with an interface, but say a bit more about what multimodality might be when we bring voice into it.

 

Tobias: 

 

Yeah. Modality in this context is the mode in which we are communicating back and forth with devices. And that is super complex. Historically, just using computers and iPhones or smartphones, you're typically not using your voice; you're using your fingers via the keyboard, and you are swiping, tapping, et cetera, et cetera. Touchscreen was a big modality advance. And then you're getting responses via a screen, et cetera. But it already is multimodal. I'll give you a great example: when you're typing a password and it's an incorrect password, your phone buzzes, right? And so these are multimodal ways of communicating, and human beings have been doing that forever. Research, depending on who you believe, says that 70 to 90% of our communication with each other isn't actually purely through the voice, in the sense of the words we're saying; it's through your tone, it's through your bearing, your eye contact, your blah, blah, blah. That's how we as humans communicate. And all that is just part of multimodal. But specifically to this voice ecosystem, I think it takes on a couple of very practical manifestations. One is, how can the device be communicating back to us while we're speaking, in things like "I'm listening" or a visual cue, which Siri already does, right? It has a spinning orb while you're talking. So that's a multimodality, but getting more advanced about that and having these communications go back and forth. But the core of it is the call and response. Human conversations are call and response: I say something, you say something. The point is that the response from the machine does not have to be voice. It can be visual, it can be text, it can be a buzz, it can be the machine just doing something, it can be a thermostat turning down the temperature. The core concept here is that this call and response isn't voice to voice. And that's, I think, the big improvement in design that we have to make. And as I'm talking about this, especially folks that are in design start to appreciate the complexity of this and the opportunity, right? I think this concept of conversational designers is going to be a huge deal over the coming years, decades, et cetera, because it's so complex. You can do this so many different ways. And when things like that happen and they're new, incredible opportunities start presenting themselves for entrepreneurs, for designers. It's a super exciting time. I feel like I've lived through it twice before in my life. I lived through it when I first started messing with the internet and then when I first saw the iPhone, and this is kind of the third time that all of a sudden the world's being potentially recast.
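To make the "response doesn't have to be voice" point concrete, here is a toy modality-selection policy: given what the system wants to convey and the user's context, it picks a channel (screen, haptic buzz, or speech). The context model and the rules are invented for illustration, not drawn from any shipping assistant.

```typescript
type Modality = "screen" | "haptic" | "speech";

interface UserContext {
  eyesFree: boolean;    // e.g. driving, or eyes busy on a physical task
  quietNeeded: boolean; // e.g. sitting in a meeting
}

interface OutboundMessage {
  urgency: "low" | "high";
  isSimpleAck: boolean; // "got it" / "done" style confirmations
}

// Toy policy: spoken output is the channel of last resort, because listening is slow.
function pickModality(msg: OutboundMessage, ctx: UserContext): Modality {
  if (msg.isSimpleAck) return "haptic";                            // a buzz beats hearing "okay, done"
  if (!ctx.eyesFree) return "screen";                              // reading beats listening
  if (msg.urgency === "high" && !ctx.quietNeeded) return "speech"; // eyes busy and it matters right now
  return "screen";
}

// Wrong password while typing -> a buzz, not a spoken explanation.
console.log(pickModality({ urgency: "low", isSimpleAck: true }, { eyesFree: false, quietNeeded: false }));
// Driver needs the next turn right now -> speech earns its keep.
console.log(pickModality({ urgency: "high", isSimpleAck: false }, { eyesFree: true, quietNeeded: false }));
```

The sketch is only meant to show that "conversational design" here is a routing problem across channels, not a script of spoken replies.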

 

Adam: 

 

That's a powerful way to think about the seismic shifts that have really happened in the past 20 years, right? I guess the internet's a little older than that, but the rise is like the broad consumer adoption of that space. So when you're saying this, am I understanding correctly that the current way you're thinking about it is the blend of generative AI and conversational AI? Is it one of the two of them? Or how are you thinking about this third wave that's catching your attention?

Tobias:

I think the blend unleashes it, right? And that's where we are right now: generative AI, as of last November, has gotten to a point where it's so good that it can truly emulate certain human responses and experiences. And when you combine that with conversational AI, you're going to get to this user experience breakthrough that, I mean, is happening in front of our eyes, right? Just over the last 30 to 45 days, most of the major ordering apps like DoorDash have launched voice experiences where you can basically speak your order in, and in real time the screen changes and puts your order in, and then you're like, confirm it, boom, go. But it's fully multimodal. And that's super exciting, because that's what we predicted a while ago, and to see it actually happening is exciting and gratifying. But it all got unleashed because of the generative AI tools. So these two things are working hand in hand.

 

Adam: 

 

That's really cool. I think something I'm curious about too is, as we put that in conversation with conversation design, both as a practice and as we may see more jobs kind of coming up in this space, when we think about that as a role that folks might take on, Traditionally, or at least what I'm thinking of is like conversation trees we might think about. But as you noted too, that so much of human conversation is actually also non-verbal. Does that come into play as folks are thinking about not just a conversation tree, but here's a great moment in which the phone should ding or it should show the text that it's collecting as I'm talking to show me the keywords it's highlighting. How do we think about that? Does conversation design all those pieces? Is it tree branches? How do I think about that?

 

Tobias: 

 

No, I think it has to be all those pieces. And that's why, you know, when it was these trees, and it's been that way for a while, it was kind of, you know, rote might be an exaggeration, but it was very systematic and it was time-consuming, et cetera. So, A, I think generative AI itself is going to help us design these conversations, but they're going to be so much more complex than they have been. And the most interesting thing I've seen in the last six or 12 months is this concept of, I don't think the industry has developed the right word for it yet, we're calling it concurrent communication, where it's not call and response anymore. As I'm speaking to an app and putting my order in, as an example, and that's just an example I use because it's so easy to understand, like talking about a McDonald's app, blah, blah, blah, as I'm speaking, my words are showing up on the screen and the app behind it is changing and populating my order, so that by the time I'm done speaking, my whole order is in the app and I can just say approve. And that's just a massive breakthrough in terms of time and efficiency. And we all know how impatient we are as users of tech. One of my favorite stats is that for every second that a website is slower in loading, you lose about 7% of the audience. So if one experience is five seconds slower than another, you've lost more than a third of your audience. As designers, that's gotta be the key. How do you make it so, so efficient versus whatever you're competing against? And that will always win: the speed.
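A sketch of the "concurrent communication" pattern as described: partial transcripts arrive while the user is still talking, and each one updates the on-screen order immediately, so by the time they stop speaking there is nothing left to do but approve. The menu, the parsing, and the stream of partials are all made up for the example; a real system would receive these events from a streaming speech recognizer.

```typescript
// Hypothetical stream of partial ASR results arriving while the user speaks.
const partials = [
  "two cheese",
  "two cheeseburgers",
  "two cheeseburgers and a large",
  "two cheeseburgers and a large fries",
  "two cheeseburgers and a large fries, approve",
];

const menu = ["cheeseburger", "fries", "shake"];

interface Order {
  items: Record<string, number>;
  approved: boolean;
}

// Re-derive the order from the latest partial transcript; the UI re-renders each time.
function parseOrder(transcript: string): Order {
  const items: Record<string, number> = {};
  for (const item of menu) {
    if (transcript.includes(item)) {
      const qty = transcript.match(new RegExp(`(\\w+)\\s+${item}`))?.[1];
      items[item] = qty === "two" ? 2 : 1; // toy quantity handling for the example
    }
  }
  return { items, approved: /approve/.test(transcript) };
}

for (const partial of partials) {
  const order = parseOrder(partial);
  // The screen updates while the user is still mid-sentence, not after they finish.
  console.log(`heard: "${partial}" -> on screen:`, order);
  if (order.approved) {
    console.log("Order submitted.");
    break;
  }
}
```

The efficiency argument in the passage is what this buys: the understanding happens during the utterance rather than after it, so the perceived wait drops to nearly zero.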

 

Adam: 

 

That's really, that's really interesting too. I've heard some people have talked how this has also made us pathologically impatient. But I think you're onto something in terms of that. When speed matters, how do we think about that? And I say that both tongue-in-cheek, but at the same time, it's worth thinking about because there are moments in which speed does make a difference, like in an emergency context where you need it to work right away. ordering a hamburger, it'd be nice, you know, but, but at the same time, either way, we might lose users, you know, for different reasons. And that's important for us to recognize that like speed doesn't have to be, I guess I'm trying to say it doesn't have to be a negative, right? I mean, it's something that we have to, we have to work with. So it is an interesting kind of question back and forth of like, how do we take the impulses that technology gives us, you know, to move quicker and, and to foster good habits out of it at the same time is also like. Helping life-saving moments when it needs to take place too. It's always interesting kind of back and forth in conversation with those parts.

 

Tobias: 

 

Well, I mean, if you want to get into it, that's a philosophical question for another time. If we're freeing up all this time for human beings, what are we going to do with it? Unfortunately, being the father of teenagers, I can tell you it's going to get sucked up by more screen time rather than other things that might be philosophically more valuable. But it doesn't change the fact that, from a competitive perspective, if you're in a business environment, delivering experiences in the most efficient way you can for the end user is a huge competitive advantage.

 

Adam: 

 

And it seems like, I mean, obviously in terms of adding voice input and recognition and kind of multimodality to existing apps is definitely one of the main pieces. I think you talk about three waves of like how this might get adopted over time. And that's kind of the first one is that we're seeing integration of voice into existing apps, UIs, and applications. But then I think what's interesting is that you kind of then do a bit of speculation of where this might go and how this might rearrange. I think it's business processes and then business models, right? In terms of where voice might change how things work. So I'd love to kind of get a little prognosticatory. That's not a word, but it should be, you know. Where are we headed with all this in terms of how is voice going to help us rethink and redesign some business processes and some business models in the future that we may not even recognize a restaurant or kind of healthcare as we do today.

 

Tobias: 

 

Yeah, and the example I love to use is that when the iPhone in its current form came out in 2008, I don't think anyone predicted that the taxi industry would be one of the victims of that innovation, through Uber and Lyft. So it's really hard to predict where these kinds of things are going. But I would say this, again using restaurants because it's something we use and are all familiar with. Step one is just making the apps and the experiences better. Step two is then redesigning the process, and again, this is already happening to some extent, especially in fast food: if you go in, there are places to pick up your mobile order, your voice order. If you go into fast food chains, they've got kiosks now that you're ordering at. I would argue most of those kiosks today are slower than just talking to a human being. But also full-service restaurants are going to change. Think about this intellectually for a second. When I make an order to a waiter or someone standing behind the counter at a quick-serve restaurant, their primary job is to take what I've just said and translate it, usually into an input screen, right? And it's just super inefficient. The second I've said it, the information is out in the universe. It could get picked up and processed in real time, right? So that's just an obvious thing. And as that happens, the companies will have to reorganize. They need humans doing different things than they might be doing today. They might need a different physical setup. But step three is someone coming in and saying, you know what? This whole model of how we do things doesn't really optimize for the current technology. And all innovation has ultimately been about using a new technology and applying it to an existing industry. If we talk about restaurants as an example, why do we need all these distributed restaurants that are making food locally? Seemingly, as you do more digital ordering, voice-powered ordering makes it incredibly efficient to have more centralized restaurants. And then you might have eating areas, or not even eating areas, because we're not actually going there anymore for the most part. And this whole model, the original founder of Uber, that's his whole business right now, trying to take advantage of that. And I think that's just one example. You start applying this logically and thinking two, three clicks ahead in any industry, and you can quickly start getting to places where the industry is going to completely change from how it's organized today because of what conversational AI combined with gen AI offers. And healthcare is another one, right? I think the most basic example is that when we go visit a doctor right now, for most of us, our doctor is spending most of the visit there on their keyboard typing in our responses, which is an incredibly inefficient way for someone as highly educated as a doctor to spend their time when the information is just out there. Really what's going to happen is that information is going to get, should be, interpreted in real time, blah, blah, blah. And then you're like, well, if that's going to happen, why do I need to be next to the doctor? Everything should be telemedicine. If that's going to happen, what does that mean for the whole concept of the physical plant of primary care offices, et cetera? And what does that mean for global healthcare? And how much of that care should be delivered by intelligent, conversational, and gen AI platforms versus human beings?
This is an awesome time, right? Because all this is going to get reinvented.

 

Adam: 

 

Even to that point too, as we've seen with telemedicine and the giant rise of it during the pandemic. The models and tools developed during that time can totally change how we're able to deliver care, to your point, globally, in underserved populations and areas that we can't get to easily. If you can get a smartphone, you can then maybe have access to doctors or different kinds of healthcare that you wouldn't otherwise. So even this idea that the technology is not only scalable but, I don't want to say exportable, shareable in different places, that it can add access in areas where we typically have not had things like healthcare or banking, these are interesting applications that we could think about. And I agree, it's very exciting that voice is helping us rethink so much of this. And it's funny, because voice is the most fundamental thing that we as people do, right? We talk, we communicate through this weird vibration in our throat, and yet it's the newest thing that's helping us rethink technology beyond a touchscreen, which blew our minds when that came out earlier this century, which sounds weird to say, but also true.

 

Tobias: 

 

Well, right. As an anthropologist, I mean, think about this. We've only been typing. Typing seems so ubiquitous. We've only really been doing it for a hundred years and most of this only for 30 or 40 since the advent of the computer and the ubiquitous screens. There may be a world in 20, 30 years where kids, they'd never even learn how to type or really write. I could argue a lot of people don't learn how to write now, but, or at least handwrite, that's what I mean. You know, that, that we all go back to what we were doing for all of human existence was speaking to each other. Yeah.

 

Adam: 

 

I mean, I think that's, it's the crazy part is that's not crazy, right? That, that's, that's, that seems like an entirely plausible future scenario. It reminds me of the movie, Her, you know, if you remember with the Spike Jonze film with, with Joaquin Phoenix, like nobody typed, they just spoke to the computer. I agree, a really interesting time that we're headed into. Tobias, I want to say thank you so much for joining me on the pod today. This has been a great conversation. I appreciate you ping-ponging with me across the conversational universe here. I agree, a really, really fascinating time. I'm excited to get folks to check out the book. and check out WillowTree and your work and what's happening there. So I guess if anybody's kind of interested in either finding their way into this kind of field or this area of working with voice tech, or if folks have been in there and they're trying to figure out like, how do I digitally transform my organization? What advice do you have for folks that are trying to make sense of how to kind of walk into this space?

 

Tobias: 

 

I think a lot of the work that needs to be done, if you're thinking about an organization, is more organizational than it is technical. And this is always the case around technology: it's how you organize the teams that work on this thing. And what's interesting is it really needs to be multifunctional. It needs to be people that understand the users, so user research, design, obviously engineering and technology, and then clearly solving business problems. But the adoption of new technology in organizations is almost always primarily constrained by how quickly you put the right cross-functional teams together and then give them the autonomy to come up with ideas and solutions and implement those. I remember in the late 1990s, every company set up an internet department and said, all right, we need to do something with the internet. And now it seems crazy, because the internet impacts everything, and it's going to be the same kind of thing. I think getting these cross-functional teams organized very quickly is what every organization should be doing.

 

Adam: 

 

Right on. Well, hopefully we'll have some folks calling in to check in and help us do that, because we need to figure out that idea. It's not just one element, but it's actually building the right kinds of teams. So awesome. Well, thank you once again, Tobias. It's been great to talk with you. I appreciate you sharing your wisdom and perspective with the listeners, and I'm excited to get the book in their hands. So thanks so much. Awesome. Thanks, Adam. And that wraps up another episode of This Anthro Life. I hope you enjoyed exploring the ever-evolving world of voice technology with me. Tobias Dengel offered some incredible insights into the power and potential of voice interfaces, as well as the impact they might have on our lives and industries. I want to extend a huge thank you to Tobias once again for joining me today and sharing your expertise. Now, as we reflect on today's episode, I encourage you to consider how voice technology is already playing a role in your own life and in the world around you. What new possibilities do you see emerging with this technology, and how might it influence how we interact with businesses and services in the future? I'd love to hear your thoughts and experiences. Remember, our pod is driven entirely by your curiosity and engagement, and I can't do it without you. So I'm grateful for your continued support and for being a part of this vibrant community. And if you're thirsty for more of what we've dived into today, I recommend you check out Tobias Dengel's work, including The Sound of the Future. You can find additional resources and info on the website, thisanthrolife.org, as well as on the Anthro Curious Substack blog. As always, your feedback and suggestions for future episodes and topics are welcome with open arms. So get in contact with me on the This Anthro Life website or on social media. And of course, if you haven't already, go ahead and subscribe to the pod, leave us a review, check out the Substack or other episodes of This Anthro Life. And, you know, if you love it, share it with someone who you think will find it valuable. It's one of the best ways to help spread anthropological thinking to bigger audiences. Thank you once again for joining me, and until next time, stay curious and keep exploring the fascinating corners of our shared human experience. I'm Adam Gamwell, and this is This Anthro Life.

 


Tobias Dengel

President

Tobias Dengel is the President of WillowTree, a TELUS International Company. With over 20 years of technology expertise, Tobias has been recognized by Glassdoor as a Top CEO and featured as a VOICE Summit speaker at the Consumer Electronics Show in 2022 and 2023. Under his leadership, WillowTree has become a leading provider of premium digital products and experiences for iconic global brands, including HBO, PepsiCo, Johnson & Johnson, and many more. The digital product consultancy was acquired by TELUS International in 2023 for over $1.2 billion USD.

Prior to WillowTree, Tobias helped steward strategic partnerships and acquisitions and drive innovation on behalf of companies including AOL and Kearney. Tobias is passionate about using technology to create meaningful, human-centric experiences for people and drive business success.