Best Practices for Conversational Design for Multimodal Dialogue Skills

Conversational interfaces come in several forms, and each one shapes how the user interacts. The presence of a screen, for example, opens up possibilities for multimodal dialogue. Even if a device doesn’t have a screen, the user may have another connected device that does (say, a mobile phone), and your Alexa skill could hand the conversation off to it. This is why it’s important to consider every possibility for multimodal dialogue in your conversation design. The device categories to think about include:

Screen-first devices, like smartphones and smartwatches
Voice-only devices, like smart speakers
Voice-first devices that pair voice with a screen, like the Echo Show or Google Home Hub
Smart appliances. These are unique in that they aren’t screen-first, but they aren’t really voice-first, either; rather, the presence of voice augments the primary use case of the appliance.

Best Practices for Conversation Design of Multimodal Experiences

When designing your Alexa skill or Google action, don’t think about voice alone. Always consider the full range of features available on the devices your user might invoke your skill or action from. Look for opportunities to add visual information that is relevant to the use case. For example, a taxi-hailing Alexa skill might use the Alexa Presentation Language (APL) to render a map of where the car is.
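As a rough illustration of the idea, here is a minimal Python sketch that adds a visual only when the requesting device reports screen support. It works on the raw request and response JSON rather than any SDK. The `supportedInterfaces` lookup and the `Alexa.Presentation.APL.RenderDocument` directive type follow the Alexa request and response format; the function names, the token, and the placeholder document are assumptions made up for this example, and a real taxi skill would supply its own map layout.

```python
from typing import Any, Dict

APL_INTERFACE = "Alexa.Presentation.APL"

# Placeholder APL document (an assumption for this sketch); a real skill
# would define a layout that actually shows the map.
PLACEHOLDER_DOCUMENT: Dict[str, Any] = {
    "type": "APL",
    "version": "1.8",
    "mainTemplate": {
        "items": [{"type": "Text", "text": "Your car is on the way."}]
    },
}


def supports_apl(request_envelope: Dict[str, Any]) -> bool:
    """Return True if the requesting device lists the APL interface."""
    interfaces = (
        request_envelope.get("context", {})
        .get("System", {})
        .get("device", {})
        .get("supportedInterfaces", {})
    )
    return APL_INTERFACE in interfaces


def build_response(
    request_envelope: Dict[str, Any],
    speech: str,
    apl_document: Dict[str, Any],
) -> Dict[str, Any]:
    """Always speak; attach the visual only when the device can render it."""
    response: Dict[str, Any] = {
        "outputSpeech": {"type": "PlainText", "text": speech},
    }
    if supports_apl(request_envelope):
        response["directives"] = [
            {
                "type": "Alexa.Presentation.APL.RenderDocument",
                "token": "carLocationMap",  # hypothetical token name
                "document": apl_document,
            }
        ]
    return {"version": "1.0", "response": response}
```

The voice response is built first and the visual is layered on top, so a smart speaker without a screen still gets a complete answer.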

Researching how users interact with such devices is especially important now that Amazon is integrating Alexa into so many different devices. If your skill is already published, take stock of voice analytics to look for oversights in your conversation design and how you might better support multimodal dialogue.

Consider Distance for Multimodal Dialogue

Okay, so you’ve considered how visuals might integrate with the skill’s use case. But don’t rest on your laurels yet! If you plan on implementing visuals through the Alexa Presentation Language, also consider how far users might be from the device. Is the user sitting right in front of it, or looking from across the room? Again, consider how you envision the use case: are the visuals legible at that distance? When should you support or require touch?

Divide and Conquer

When supporting screens, remember that on-screen text doesn’t have to match the speech output exactly. Consider putting more detailed or contextual responses on the screen, while reading out only the most important information to the user. Alternatively, you might provide a simple response via voice and push options or a detailed answer to the user’s mobile phone.
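One way to picture this split is a small helper that builds two versions of the same answer: a short one to speak and a fuller one to display. The sketch below is plain Python with no platform API involved; `MultimodalReply`, `split_reply`, and the `spoken_limit` parameter are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultimodalReply:
    speech: str       # short version, read aloud
    screen_text: str  # fuller version, shown on a screen


def split_reply(title: str, items: List[str], spoken_limit: int = 1) -> MultimodalReply:
    """Speak only the first few items; list all of them on screen."""
    if not items:
        message = f"No results for {title}."
        return MultimodalReply(speech=message, screen_text=message)

    spoken = items[:spoken_limit]
    remaining = len(items) - len(spoken)

    speech = f"{title}: " + ", ".join(spoken) + "."
    if remaining > 0:
        # Tell the user that more detail is waiting on the screen.
        speech += f" I've put {remaining} more on the screen."

    numbered = [f"{n}. {item}" for n, item in enumerate(items, start=1)]
    screen_text = "\n".join([title] + numbered)
    return MultimodalReply(speech=speech, screen_text=screen_text)


reply = split_reply("Nearby cafes", ["Blue Door", "Kettle", "Roast"])
print(reply.speech)
print(reply.screen_text)
```

The spoken line stays short enough to follow by ear, while the screen carries the full list for anyone who wants to browse.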

Provide a Consistent Experience Across Multimodal Dialogue

Navigation within your skill should be predictable and intuitive no matter the interface. This might mean offering spoken suggestions in voice-only interactions, and using suggestion chips or wayfinding (like headers) to help the user navigate visually on a screen.

Always look for opportunities where one type of interface might provide a better experience. For example, if a user is shopping via an Alexa skill but needs to enter shipping or payment information, the process might be better handled on a screen than through a long series of voice inputs. Remember, every aspect of your conversation design should seek to eliminate friction in the user experience. By keeping abreast of new device features that open up more multimodal dialogue possibilities, you can provide a much better and more consistent experience to users.