Build Effective Speech Apps

Use Microsoft Speech Server 2004 and its new Speech Application SDK to improve the user experience in your speech apps.

Technology Toolbox: ASP.NET, XML, Microsoft Speech Server 2004

Enterprise speech applications are gaining in acceptance and importance. Unfortunately, this powerful new technology is pervaded by bad design, undermining many of the benefits of moving to speech systems. But Microsoft's Speech Application SDK (SASDK) enables you to design effective speech apps that conserve manpower and improve end-user experiences.

Star Trek is the bane of speech application developers. Many users, developers, and their managers have come to believe they can talk to a computer as if it were another person. The truth is that conversational speech apps are still years away—however, simple but polished apps are possible today.

Speech recognition was introduced prematurely in the 1980s. Today, processor speeds and open standards are up to the task, and Microsoft Speech Server 2004 opens up speech technology to .NET developers. It's the first platform to integrate all the necessary technologies into Visual Studio and the Windows platform.

I'll cover the telephony side of speech apps here. Speech Server also supports multimodal applications (think of a voice-driven PDA application), but the full-fledged arrival of multimodal apps is still years away.

Speech Server is based on Speech Application Language Tags (SALT), which extends existing markup languages such as HTML and XML with new tags for speech input, output, and call control. Speech Server has a scalable, Web-based architecture. You build apps using ASP.NET; they serve up markup text to a browser. The markup text here is SALT, and the browser runs on Speech Server (since telephones lack browsers). The browser renders the SALT tags into speech on behalf of the caller. When the page finishes, it does a postback to the ASP.NET app and the process repeats until the call ends.

You create apps in VS.NET using the SASDK. It comprises ASP.NET speech controls and plug-ins for editing grammars and prompts. The speech debugging console and built-in speech recognition and text-to-speech engines allow you to develop and test on a laptop before deploying to Speech Server.

You can create a variety of apps. Those requiring hands-free access are good candidates, but most popular are self-service apps such as travel booking or bill payment. Try 1-800-USA-RAIL if you want to experience a great speech app. I'll outline some tips for creating a good user experience in your own speech application.

It's hard to create user interfaces in any medium. The GUI code for desktop apps often consumes 60 to 80 percent of overall development effort. It's no different for a voice interface, which is a new medium with its own challenges. The audio-only nature of the medium differs fundamentally from a visual medium.

GUIs are parallel; a screen can show a menu, title bar, sidebars, and main text simultaneously. Humans are adept at scanning and jumping around to points of interest quickly. Readers can skim text or read it carefully. Also, a GUI can create a strong sense of context using sidebars, menus, and other widgets that give users an indication of where they can go next or, with overlapping windows, where they've been.

Keep It Simple—and Serial
Voice user interfaces (VUIs) are serial; they present information one word at a time. Users can't scan, so try to avoid long blocks of text. Content should be trim and concise. However, the most important difference in a voice app is the need for a simple structure. It's difficult to convey a sense of context, so try not to require much. Avoid deeply nested menus. Have a single "home" or top menu for all points in the app.

Consider a hotel reservation app. It needs to gather several pieces of information. The simplest dialog structure is called "directed dialog." The app asks the caller for one piece of information at a time:

Computer: What type of room would you like?
Human: Single.

Error and help prompts relate to the current question. The active grammar is solely about room type, and the app won't proceed until the user specifies a room type. Your context is clear, and the caller is led through the app by answering the questions. You can augment directed dialog, but it's still the best overall application structure.

Apps used regularly can let callers supply several pieces of information at once. This is called a "mixed initiative" dialog, and it involves having several grammars active at the same time:

C: What type of room would you like?
H: Single non-smoking for two nights.

Novice callers can still answer the questions one at a time:

C: What type of room would you like?
H: Single.
C: Smoking or non-smoking?
H: Non-smoking.
C: For how many nights?
H: Two.
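
One way to picture how a mixed-initiative dialog falls back to directed dialog is as slot filling: each answer fills whatever it can, and the app asks about the first piece still missing. The sketch below is plain JavaScript rather than SASDK code; the slot names, the QUESTIONS table, and the nextPrompt helper are all hypothetical.

```javascript
// Hypothetical slot table for the hotel example; not part of the SASDK.
var QUESTIONS = [
   { slot: "roomType", prompt: "What type of room would you like?" },
   { slot: "smoking",  prompt: "Smoking or non-smoking?" },
   { slot: "nights",   prompt: "For how many nights?" }
];

// Merge whatever the caller supplied in one utterance into the slots,
// then return the prompt for the first slot that is still empty.
// Returns null when every slot is filled.
function nextPrompt(slots, answer) {
   for (var name in answer) {
      if (answer.hasOwnProperty(name)) {
         slots[name] = answer[name];
      }
   }
   for (var i = 0; i < QUESTIONS.length; i++) {
      if (slots[QUESTIONS[i].slot] === undefined) {
         return QUESTIONS[i].prompt;
      }
   }
   return null;
}
```

An experienced caller who supplies room type, smoking preference, and number of nights in one utterance fills all three slots in a single turn. A novice who answers only "single" is simply asked the next question.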

Good speech apps have a conversational tone, yet guide callers to say the right things. You can do so with unambiguous prompts and grammars that anticipate the various ways callers might say something. Confirmation, feedback, and error handling help avoid state errors (where the app is in a different place than the caller expects).

You can avoid such problems by testing your app on users frequently. Use either actual end users as part of a pilot test or colleagues acting as stand-ins. I find it effective to have a weekly "call fest" where colleagues are invited to call the app, then submit comments about problems or areas for improvement. Speech Server provides Speech Application log analysis tools to generate call activity reports. The tools also let you drill down into individual QA controls on a per-call basis to see how callers fared, and they make recordings of what callers said available for review.

User testing provides a guide to development effort. Plan to do several cycles of development and testing. This will also serve you well when you're developing grammars.

Speech Server uses the W3C GRXML format for grammars. The SASDK provides an excellent GUI-based Grammar Editor. It displays a grammar rule as a tree, and it lets you type in sample phrases, then test them against the grammar (see Figure 1).

A grammar defines everything a caller might say at a given point in the app. It's structured as a decision tree, with each path representing a possible utterance. For example, "single," "single room," and "suite" represent three possible paths in the RoomType rule. The speech recognizer selects the path through the active grammar with the best match. Callers using words not in the grammar trigger a NoReco event.

Don't Kill Users With Kindness
You'll be tempted to try to think of every way every caller might say something, but this approach can increase recognition errors. Many possible paths through the grammar provide many possibilities for the recognizer to pick the wrong one. Instead, start with a minimal grammar and add to it as user testing indicates.

Look for ways to use the ruleref tag in your grammars. Ruleref specifies a sub-rule; it lets you chop a large grammar into bite-sized chunks. In fact, try to arrange the grammar for even a simple question such as "What type of room?" as two rules. First, use a low-level rule to define your room type phrases: single, double, and suite (see Figure 1). Then use a high-level rule, AskRoomType, which defines the full utterance a caller might say (see Figure 2).

The application could use the high-level grammar for the "What type of room?" question, then reuse the low-level rule for various types of confirmation or for adding mixed-initiative later.
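
You can mimic the two-rule arrangement in plain JavaScript to see why the split pays off. This is only an analogy, with regular expressions standing in for GRXML rules. The phrase lists and the matchRoomType helper are assumptions for illustration, not the contents of Figures 1 and 2.

```javascript
// Low-level "rule": just the room type phrases.
var ROOM_TYPE = "(single|double|suite)";

// High-level "rule": the full utterance, which refers to ROOM_TYPE
// much as a ruleref refers to a sub-rule.
var ASK_ROOM_TYPE = new RegExp(
   "^(?:i would like a |i want a |a )?" + ROOM_TYPE +
   "(?: room)?(?: please)?$");

// A correction rule reuses the same low-level rule unchanged.
var CORRECT_ROOM_TYPE = new RegExp(
   "^no,? i said (?:a )?" + ROOM_TYPE + "$");

// Returns the room type, or null to stand in for a NoReco.
function matchRoomType(rule, utterance) {
   var m = rule.exec(utterance.toLowerCase());
   return m === null ? null : m[1];
}
```

Adding a fourth room type means touching ROOM_TYPE only. Both higher-level rules pick up the change, which is the same benefit ruleref gives you in a real grammar.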

The SASDK provides several hundred prebuilt grammars for numbers, dates, yes-no, and other common scenarios. Microsoft put a lot of effort into creating and tuning these grammars, so make them your first choice wherever you can. You can even use them when you need to create your own grammar. They provide a useful resource to see how the various tags are used. Get the prebuilt grammars from the library.grxml file in each speech app's Grammar folder.

Errors will happen no matter how well you build your grammar, so make sure your error prompts provide help. A caller might be distracted or unsure of what to say, causing the speech app to time out with a Silence error. Or a caller might say something the recognizer doesn't understand, causing a NoReco error. Either way, you need to play an error prompt indicating the failure. Don't blame the caller; apologize instead. Then provide suggestions for callers who are unsure or are phrasing their responses in unexpected ways:

C: What type of room would you like?
H: Um ... a large.
C (NoReco): I'm sorry. I didn't get that. 
    Please say "single," "double," or "suite."
H: Double.

A QA control in an SASDK application represents a single question. Its default prompt here is, "What type of room would you like?" Change this when recognition errors occur. Use a prompt function, which is a client-side JScript function that returns the prompt text. Associate a prompt function with the QA so the control can present appropriate error prompts:

// Most recent recognition result, or "" on the first attempt
var prev = History.length == 0 ? "" : 
   History[History.length - 1];
if (prev == "Silence") {
   return "I'm sorry, I didn't hear " +
      "you. Please say single, double, or suite.";
} else if (prev == "NoReco") {
   return "I'm sorry, I didn't get " +
      "that. Please say single, double, or suite.";
} else {
   return "What type of room would you like?";
}

The variable History is a system-supplied list of the results of previous recognitions. A length value of 0 indicates this is the first recognition being done on the QA control.

Don't Sound Like a Broken Record
You also need to deal with repeated errors. Apps that simply repeat the same prompt sound like they're broken:

C: What type of room would you like?
H (mumbling): Um ... with ... ah ... two beds ...
C (NoReco): I'm sorry. I didn't get that. 
    Please say "single," "double," or "suite."
H: A double room.
C (NoReco): I'm sorry. I didn't get that. 
    Please say "single," "double," or "suite."
H (too loud): DOUBLE!
C (NoReco): I'm sorry. I didn't get that. 
    Please say "single," "double," or "suite."

Instead, have your speech app play a different error prompt each time. This is called "escalating prompts." The difference can be minor, or more commonly, it can provide additional help information on repeated errors (see Listing 1).
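
As a rough sketch of the idea, an escalating prompt function can count how many errors have occurred in a row and pick its wording accordingly. The code below is an illustration rather than a reproduction of Listing 1: it takes the history as an ordinary array parameter so it can run outside Speech Server, and the trailingErrors helper and the prompt wording are assumptions.

```javascript
// Count consecutive "Silence" or "NoReco" entries at the end of the history.
function trailingErrors(history) {
   var count = 0;
   for (var i = history.length - 1; i >= 0; i--) {
      if (history[i] != "Silence" && history[i] != "NoReco") break;
      count++;
   }
   return count;
}

// Hypothetical escalating prompt function; the wording is illustrative.
function roomTypePrompt(history) {
   var errors = trailingErrors(history);
   if (errors == 0) {
      return "What type of room would you like?";
   }
   var last = history[history.length - 1];
   var apology = (last == "Silence")
      ? "I'm sorry, I didn't hear you. "
      : "I'm sorry, I didn't get that. ";
   if (errors == 1) {
      // First error: a short reminder of the choices.
      return apology + "Please say single, double, or suite.";
   }
   // Repeated errors: give more detailed help.
   return apology + "If you'd like one bed, say single. " +
      "For two beds, say double. For a suite, say suite.";
}
```

The first error gets a brief reminder, while later errors get fuller help, so the caller never hears the same error prompt twice in a row.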

This is another reason to keep apps fairly simple. Error prompts describing a myriad of possible actions can be daunting: "You can say a city name, or an airline, or a date, or 'main menu' or ..."

Sometimes you will get repeated errors, especially when there's high background noise, such as when the caller is driving or in a busy public place. You should transfer callers who fail to answer a question more than three times or so to an operator or customer service representative. Use a bailout QA control on each page that watches for three consecutive errors. This QA has two JScript functions. First, it calls a client activation function every time one of the other QA controls on the page finishes. It returns a bool indicating whether the bailout QA should be activated:

// No QA has run yet, so there is nothing to bail out of
if (RunSpeech.ActiveQA == null) 
   return false;
var h = RunSpeech.ActiveQA.History;
var len = h.length;
return ((len > 2) && 
   ((h[len-1] == "Silence") || 
      (h[len-1] == "NoReco")) &&
   ((h[len-2] == "Silence") || 
      (h[len-2] == "NoReco")) &&
   ((h[len-3] == "Silence") || 
      (h[len-3] == "NoReco")));

The bailout QA is configured as an output-only QA. It has a prompt but no grammar. When activated, it plays a prompt saying, "You seem to be having some difficulty. Transferring you to a customer service representative." When the bailout QA finishes playing its prompt, the OnClientComplete function is called. This function navigates to another page where the Transfer control is used to transfer the call:

function BailOut_OnClientComplete()

You also need to deal with errors in your speech recognition system itself. Speech recognition engines are improving as CPUs improve. Still, background noise, speaking style, and similar words all affect recognition accuracy. Short words, such as "six," that contain the consonants h, s, x, or f are especially hard, because their sound resembles noise. Speech apps deal with this by confirming uncertain results with callers. The speech recognizer returns a confidence value with the recognized phrase, indicating how confident the recognizer is of a match.

The SASDK defines two thresholds. The reject threshold has a value between 0 and 1. Answers with confidence values less than the reject threshold are ignored and cause a NoReco event. The confirm threshold also has a value between 0 and 1. Answers with confidence values greater than this value don't need confirmation.

Answers with confidence values between the two thresholds get marked as needing confirmation by a separate QA control, whose prompt reflects what the app thought it heard. Say something like "do you want" instead of "did you say," because the grammar may allow several ways to say something:

C: What city?
H: L.A.
C: Do you want Los Angeles?
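
The way the two thresholds divide up recognition results can be sketched as a small function. This is a plain JavaScript illustration of the rule described above, not SASDK code; the classify name and the threshold values in the usage note are assumptions.

```javascript
// Decide what to do with a recognition result, given its confidence
// and the two thresholds (all values between 0 and 1).
function classify(confidence, rejectThreshold, confirmThreshold) {
   if (confidence < rejectThreshold) {
      return "reject";    // ignored; treated as a NoReco
   }
   if (confidence > confirmThreshold) {
      return "accept";    // used without confirmation
   }
   return "confirm";      // needs a confirmation question
}
```

With a reject threshold of 0.2 and a confirm threshold of 0.8, a result at 0.5 confidence is confirmed with the caller, one at 0.9 is accepted outright, and one at 0.1 is treated as a NoReco.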

Hold Results With SemanticItems
The SASDK uses objects called SemanticItems to hold results, with one semantic item for each piece of information being gathered. You can use semantic items both server-side (ASP.NET) and client-side (JScript). To have a prompt speak the current value of a semantic item siCity, add a parameter City to the prompt function and set its value to siCity.value:

function ConfirmCityPrompt(City) {
   return "Do you want " + City + "?";
}

Use the prebuilt YesNo grammar for this type of simple confirmation. The semantic item is cleared if a caller says "No," and the original QA runs again.

The next type of confirmation is confirm-and-correct. The confirmation QA requires a grammar that allows Yes, No, or No I said <city>. It lets callers provide corrections:

C: What city?
H: Boston.
C: Do you want Austin?
H: No, I said Boston.

Confirmation can be tricky. When the user answers the confirmation question with "No, I said Boston," the recognizer returns a new confidence value for "Boston." If that value exceeds the confirm threshold, the app uses "Boston" and proceeds to the next question. Otherwise, the confirmation QA runs again:

C: Do you want Boston?
H: Yes.
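
The confirm-and-correct logic can be summarized in a few lines. The sketch below is an assumption-laden illustration in plain JavaScript: the handleConfirmation name and the shape of the answer object (yesNo, city, confidence) are hypothetical, not how the SASDK represents results.

```javascript
// pending: the city the app thinks it heard.
// answer:  { yesNo: "yes" or "no", city: corrected city (optional),
//            confidence: recognizer confidence for the correction }
// Returns the city to keep (or null) and whether it is now confirmed.
function handleConfirmation(pending, answer, confirmThreshold) {
   if (answer.yesNo == "yes") {
      return { city: pending, confirmed: true };
   }
   if (answer.city === undefined) {
      // Plain "no": clear the item so the original question runs again.
      return { city: null, confirmed: false };
   }
   // "No, I said <city>": take the correction, but ask again unless
   // the recognizer is confident about the new value.
   return {
      city: answer.city,
      confirmed: answer.confidence > confirmThreshold
   };
}
```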

More advanced forms of confirmation allow confirmation of a previous question as part of the next question.

Speech apps should allow either voice or Dual Tone Multi Frequency (DTMF) digits (telephone keypad) when asking for numeric data. DTMF isn't as convenient as speaking, and it requires callers to free up a hand for typing digits. However, it's accurate. It also ensures privacy when callers don't want to say account numbers or PINs aloud.

QA controls permit multiple active grammars. You can enable DTMF input for a QA by adding a DTMF grammar in addition to the speech grammar. Both grammars must output the same result. Results are encoded as Semantic Markup Language (SML) and defined in the grammar using the Script Tag item. A grammar including the prebuilt grammar Digit4 followed by a script tag containing "$.Account = $$" means the SML will contain its results in a node called Account. Here's the SML output for a caller saying "three two four oh":

<SML confidence="1.000" 
   text="three two four oh">
   <Account confidence="1.000" 
      text="three two four oh">3240</Account>
</SML>

The QA control binds the SML to a semantic item (see Figure 3). Binding SML/Account to siAccount means the previous SML fills semantic item siAccount. Either of two grammars can fill the semantic item if both output the same SML.
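
To see why both grammars must produce the same result, consider this plain JavaScript sketch. It stands in for the grammars and the binding step. The DIGIT_WORDS table and the fromSpeech, fromDtmf, and bind helpers are hypothetical, and the prebuilt digit grammars handle far more than this.

```javascript
// Hypothetical word table for spoken digits.
var DIGIT_WORDS = {
   zero: "0", oh: "0", one: "1", two: "2", three: "3", four: "4",
   five: "5", six: "6", seven: "7", eight: "8", nine: "9"
};

// "Speech grammar": turn spoken digit words into an Account result.
function fromSpeech(text) {
   var words = text.toLowerCase().split(" ");
   var digits = "";
   for (var i = 0; i < words.length; i++) {
      if (!DIGIT_WORDS.hasOwnProperty(words[i])) return null;
      digits += DIGIT_WORDS[words[i]];
   }
   return { Account: digits };
}

// "DTMF grammar": keypad digits yield the same Account result.
function fromDtmf(keys) {
   return /^[0-9]+$/.test(keys) ? { Account: keys } : null;
}

// Binding: copy the Account node into the semantic item.
function bind(semanticItems, result) {
   if (result !== null) semanticItems.siAccount = result.Account;
   return semanticItems;
}
```

Because both functions return an object with the same Account field, the binding step doesn't need to know whether the caller spoke or used the keypad.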

Consistency matters in any user interface—it reduces training time. Global commands in speech apps let callers say certain "hot words" at any point in the app. These command words control the dialog. Keep their number small, because you must describe them all to callers. Consider using not much more than these four standard commands: Help, Repeat, MainMenu, and Operator.

Callers who say "help" should get context-sensitive help about the current question, followed by more general help. Remind callers that they can start speaking whenever they've heard enough. Callers saying "repeat" should hear the most recent prompt again. This helps in noisy environments where a caller misses the prompt. Callers saying "main menu" get the main menu, but confirm this first, so you don't make them start over again inadvertently. Finally, "operator" gets them an operator or customer service representative—again, after confirmation.

Define commands using the SASDK's Command control, which is active across all QAs on the page (or fewer if desired). Set a high AcceptCommandThreshold on the Command control. This way, the command is triggered only when the recognizer matches the command grammar with high confidence—especially when QA controls have large grammars (such as city or employee names) where some names might sound like command words.
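
The effect of a high command threshold can be sketched as follows. This is a plain JavaScript illustration, not the Command control's implementation; the isCommand helper, the shape of the result object, and the choice of "greater than or equal to" for the comparison are assumptions.

```javascript
// Decide whether a recognition result should trigger a global command.
// result: { text: recognized phrase, confidence: value between 0 and 1 }
function isCommand(result, commandWords, acceptCommandThreshold) {
   for (var i = 0; i < commandWords.length; i++) {
      if (result.text == commandWords[i]) {
         // Matching the command word is not enough; the recognizer
         // must also be confident about the match.
         return result.confidence >= acceptCommandThreshold;
      }
   }
   return false;
}
```

With a threshold of 0.8, a shaky match on "help" at 0.5 confidence doesn't interrupt a caller who was actually saying a similar-sounding name.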

Use Prebuilt Components
Collect standard data types with SASDK prebuilt components called application controls, including ListSelector, DataTableNavigator, AlphaDigit, CreditCardDate, CreditCardNumber, Currency, Date, NaturalNumber, Phone, SocialSecurityNumber, YesNo, and ZipCode.

Application controls shrink speech application development effort. Think of the complexity of collecting a date, which might include "tomorrow," "the 12th," "12 of June," and "June 12th." In addition, you need DTMF input and confirmation, and the Date control has some advanced fallback strategies. It first tries to collect a whole date, because users speak whole dates more naturally. If problems such as background noise cause repeated recognition failures, the control falls back to collecting the day, month, and year separately, which is slower but more reliable.

Application controls come with their own prompts. You can change them and replace the default text-to-speech (TTS) output with your app's own prerecorded prompts. You can wrap a lot of functionality into one control. Hopefully, third-party vendors will start selling additional controls for a wide variety of data types.

Having good-quality prerecorded prompts does a lot for an app's perceived quality. TTS output is good, but callers still prefer the natural sound of real speech. Many professional prompt recording firms have a stable of voice talent for hire. Trim prompts aggressively, though. Nothing slows down an app more than dead air surrounding prompts.

I recommend TTS for dynamic output, such as stock reports, where thousands of possible company names may be spoken. You can mix TTS with prerecorded prompts, but try not to use TTS for isolated words.

The SASDK provides a prompt editor. A prompt comprises its transcription, such as "Welcome to Joe's Travel," and a recording (a WAV file). Sets of prompts are compiled into prompt databases. At run time, Speech Server first looks in the prompt database when it has a prompt to play, such as "Welcome to Joe's Travel." The server plays the associated recording if it finds a match. Otherwise, it synthesizes the prompt using TTS.
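
The lookup-then-fallback behavior can be sketched in plain JavaScript. This is a simplification that assumes the whole prompt must match a transcription exactly; the resolvePrompt helper, the promptDb object, and the WAV file name are hypothetical.

```javascript
// promptDb maps a transcription to the file name of its recording.
// Returns which kind of output to use and what to feed it.
function resolvePrompt(promptDb, text) {
   if (promptDb.hasOwnProperty(text)) {
      // A recording exists for this transcription: play it.
      return { kind: "recording", source: promptDb[text] };
   }
   // No match: fall back to synthesizing the text.
   return { kind: "tts", source: text };
}

// Hypothetical prompt database with one recorded prompt.
var promptDb = { "Welcome to Joe's Travel": "welcome.wav" };
```

Calling resolvePrompt(promptDb, "Welcome to Joe's Travel") selects the recording, while any text without a recording is passed through for TTS.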

Now you have a good starting point for creating an effective speech UI. Download the free SASDK from MSDN; it has an excellent tutorial about creating a pizza-ordering app. Get all the sample code for this article here.
