(disclaimer: this is based on demonstrations I have seen - I don't have access to JB and thus I can only speak of Siri):
In many of these demonstrations I noticed one thing that was bugging me: Even though the voice recognition in Android seems really cool (I don't have JB yet), it doesn't give definite audible confirmation of the command in many cases and sometimes it even requires user interaction with the screen.
Now, personally, I already believe that speech input is kind of a gimmick in itself (try using English voice recognition with my address book full of German names...), but for it to even have a chance of moving from gimmick to useful feature, it must work without user interaction on the screen.
"Play <whatever band name>" followed by "beep" and the a button to press on the screen doesn't help me. A useful response to "play <insert band name here>" is "playing <insert band name here>" followed by actually playing it.
Or "call <some name>" - if you just get back "calling" or even just a beep - how would you know whether the recognition was successful or not and the correct name has been recognized?
Some commands on Android seem to be doing fine (the weather example), but others fail in one way (the play example seems to require user interaction on the screen) or another (the "turn on wifi" command doesn't produce any audible confirmation or error message - just the same beep sound as if it worked).
Siri, while it might not have as good recognition as the Android solution, is much better in that regard: it always confirms your command. As such, Siri seems moderately more useful as an additional input method, whereas Android, by forcing you to look at the screen when inputting a voice command, reduces this to a gimmick and nothing else.
> ... it doesn't give definite audible confirmation of the command ...
I don't have this either, but I can tell you that Google's voice recognition does have a confidence rating on a per-word basis. (See Google Voice messages. Also, the recognition API provides alternatives.)
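For what it's worth, the public recognition intent API returns multiple alternative transcriptions, and since API 14 it can also hand back a confidence score per result (per-word scores aren't exposed there, as far as I know). A minimal Kotlin sketch, assuming a plain Activity and made-up class/request names:

    // Sketch only: ask Android's recognizer for several alternatives plus
    // per-result confidence scores (EXTRA_CONFIDENCE_SCORES, API 14+).
    import android.app.Activity
    import android.content.Intent
    import android.speech.RecognizerIntent
    import android.util.Log

    class VoiceDemoActivity : Activity() {
        private val requestSpeech = 1

        fun startListening() {
            val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
                .putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                          RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
                .putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 5)  // ask for alternatives
            startActivityForResult(intent, requestSpeech)
        }

        override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) {
            super.onActivityResult(requestCode, resultCode, data)
            if (requestCode == requestSpeech && resultCode == RESULT_OK && data != null) {
                val texts = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
                val scores = data.getFloatArrayExtra(RecognizerIntent.EXTRA_CONFIDENCE_SCORES)
                texts?.forEachIndexed { i, text ->
                    val score = scores?.getOrNull(i) ?: -1f  // -1f when no score is reported
                    Log.d("VoiceDemo", "candidate #$i: \"$text\" (confidence $score)")
                }
            }
        }
    }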
In their navigation product they directly do whatever was said if there is high confidence. If not, it shows the recognized speech with a "pie"-style countdown before acting on the displayed recognition. You can press OK to go ahead (or just wait), or cancel and try again.
They could obviously do something similar with this.
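Something along these lines (a hypothetical sketch: the 0.9 threshold, the 5-second window, and the callback shapes are all invented):

    // Hypothetical sketch of the "act immediately vs. confirm with a countdown"
    // flow described above; threshold and timings are made-up values.
    import android.os.CountDownTimer

    fun handleRecognition(
        text: String,
        confidence: Float,
        act: (String) -> Unit,                               // e.g. start navigating / dialing
        showConfirmation: (String, CountDownTimer) -> Unit   // draw the "pie" countdown UI
    ) {
        if (confidence >= 0.9f) {
            act(text)  // high confidence: just do it
        } else {
            // Low confidence: show what was heard and count down; the user can
            // press OK to act right away or Cancel to abort before onFinish fires.
            val timer = object : CountDownTimer(5_000, 1_000) {
                override fun onTick(msLeft: Long) { /* update the pie/progress UI */ }
                override fun onFinish() = act(text)          // no objection: go ahead
            }
            showConfirmation(text, timer.start())
        }
    }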
They also use context for their voice recognition. I grew up in a town named Piggs Peak (note the two 'g's). If you say "piggs peak" you will get that spelling, but saying, for example, "peak pigs" gets you the spelling with only one g. This explains why Wooster/Worcester doesn't confuse them. I don't have a Siri-capable device, so I don't know what they do.
It appears that for queries that don't have a direct "answer"-type response but will perform an action, it displays a progress bar with the query it understood. Presumably this is so that you can cancel the action if the voice recognition was wrong.
It doesn't require any input though - I just tested it. Once the progress bar reaches the end (it seems to take ~7 seconds) it will complete the action.
Are you trolling? I understand you said in your parent that you don't have JB and are strictly going by the video and demo... but how could you miss this? It's in the first minute, multiple times.
You do not have to "wait ~7 seconds before you notice that it was wrong" so you can "re-issue the change". The result is displayed immediately. You can cancel the auto-action (which you first said didn't even exist), or force it through before the ~5 seconds (not ~7) elapses.
Go re-watch the entire video in the foreground, please.
I think pilif meant that you'd have to wait for the progress bar (about seven seconds) to finish before you noticed something was wrong. For example:
"Call the Drake Hotel in Toronto."
*bling* "Calling..."
(wait seven seconds)
"Hey, this is Drake. What's up?"
Versus what Siri does:
"Call the Drake Hotel in Toronto."
*bling* "Calling Drake Smith..."
"No, wait! Stop!"
Think about using the voice commands when you can't see the device. Like when you're driving or running. It's useful to have the audible feedback in addition to whatever's displayed on the screen.
It works the same way Voice Search has always worked (at least since Froyo): the found action is shown to you along with a timed progress meter (about 3 seconds) and buttons to proceed or cancel. When the timer ends, the action proceeds.
This is the correct solution, IMO. It would be quite frustrating to have the wrong phone number instantly begin to dial, for instance. One time when I said "call <name of restaurant>", it came up with "Call <name of restaurant>" with the address of the location I didn't want shown beneath. This gave me time to tap Cancel, which then showed me a list of the alternative results/locations.
So let's say you have your phone in your pocket, not looking at it. Then you enable the voice command and say "Call Foo Burgers".
Your phone understands this as "Call Bar Burgers" and shows "Calling Bar Burgers" on the screen. The phone makes a "beep" sound and then proceeds to show a progress bar, which you don't see because your phone is in your pocket.
Then the phone connects and you learn of your mistake as the person at the other end answers with "This is Bar Burgers, Mr. Foobar speaking".
The only way around this is to enable the voice command, take the phone out of your pocket and then check what it says above the progress bar.
With Siri, if you say "Call Foo Burgers", Siri would respond (in audio over your headphones) with "Calling Bar Burgers", giving you a chance to cancel before you annoy the person at the other end and without forcing you to take the phone out of your pocket to check (which is the point of voice commands).
I already have to hit a button to enter a voice command; I don't mind at all hitting another one to accept it and complete the deal.
I use Voice commands for pretty much all input to Google Maps and Navigation (and nowhere else). That it works flawlessly in my experience even with my mumbling is plenty good enough.
The point of voice commands, when I use them, isn't to avoid any visual interaction, it's just so I don't have to type something I don't want to type.
If you look closely, after it understands the "play" command, there's a progress bar. If the user doesn't respond, it goes to the music player automatically. He cut it short and clicked "Play", probably for TL;DW people.
You are exactly correct; it does this for anything that may require an action. The action can be cancelled before the progress bar completes, but non-interaction will result in the progress bar completing and the music playing, alarm being set, etc.
I don't understand how the parent can write such a lengthy comment without watching the entire video and understanding what they're talking about. The parent's disclaimer even states they are only going by the video, but it clearly shows they didn't watch it fully, since what they missed appears within the very first minute of the video, multiple times.
I was impressed that most of these searches already work on regular Google search (I just tried them). I had no idea the Knowledge Graph could do those already. This is not only a great demo of Android voice search; it's a great demo of Google search.
Google Now is awesome, but it's way worse at calling my friends than Voice Search was. It's like it doesn't index my address book.
The other day, I told my phone "Call Nico Thornley." Instead, it searched for "call me-so-lonely." Not the first time Google Now has completely botched a friend search either.
Voice search seemed to quit indexing contacts with the 2.1 upgrade. Which is too bad, since putting addresses in my contacts made it remarkably efficient for driving to a friend's house on a lark.
This is what pisses me off. We clearly have the capability for really obvious stuff like this to just work ("navigate to wendy's house"), yet it still often doesn't.
I think these are great times to be a gadget consumer. Apple made the Siri interface 'mainstream', and Google is great at throwing brainpower at their own version of the interface. (I hope I'm not mistaken that Apple integrated the 'vertical list' interface first.) The fact that JB's dictation works offline whereas Siri needs a connection is the icing on the cake.
Same for the race to have better maps or a better browser.
I agree, though: it's a great time to be a gadget consumer. I love Siri's conversational style. It doesn't seem too far off before I can actually start having a conversation with a computer through something like Siri, which is both awesome and terrifying at the same time.
I meant that the interaction with Siri looks basically like a chatlog (conversational).
Here's what I misunderstood - I thought Jelly Bean was the same, but he would always just tap away the last item in the video so we don't see it. Maybe there is no scrollable backlog in Jelly Bean.
Anyone have a hypothesis on how Google's offline voice recognition works? It definitely seems like they have moved more voice recognition work onto the client, even in online mode. My understanding of Google's approach to voice recognition was that it was big-data dependent, which would make it hard to move to client devices...
"One of the biggest issues with Siri is that it requires access to Apple’s servers in order to work. In Jelly Bean, however, Google will provide full offline voice dictation to users. Granted, that’s not a full Siri competitor, but the fact that the search company has been able to take it offline in a mobile setting is very important."
I wonder whether the offline performance is comparable?
If it were, then surely you'd do all the voice parsing locally, so I'm guessing it's not. Unless anyone can think of another reason you'd push it through the servers?
- Online should have more training data and can be improved much more easily. I don't see that advantage going away.
- Power efficiency and/or speed. Sending 5 seconds of audio across the net can be less strain on the device (especially older ones) than parsing the audio locally.
Pure speculation here, but some of the transcription looked like it relied on pretty heavy statistical inference (like knowing that "Worcester" comes before "Mass" but "Wooster" comes before "College", even though they're both pronounced the same). I would be really surprised if the client-side recognition was that smart, so I'd guess that they do it server-side if they can.
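To make the speculation concrete, here's a toy example with invented bigram probabilities (a real system would use a far larger language model trained on query/web data):

    // Toy illustration only: made-up bigram probabilities used to pick the
    // spelling of a homophone based on the word that follows it.
    val bigramProb = mapOf(
        "worcester" to mapOf("mass" to 0.30, "college" to 0.01),
        "wooster"   to mapOf("mass" to 0.01, "college" to 0.25)
    )

    // Choose whichever candidate spelling makes the following word most likely.
    fun disambiguate(candidates: List<String>, nextWord: String): String =
        candidates.maxByOrNull { bigramProb[it]?.get(nextWord) ?: 0.0 } ?: candidates.first()

    fun main() {
        println(disambiguate(listOf("worcester", "wooster"), "mass"))     // -> worcester
        println(disambiguate(listOf("worcester", "wooster"), "college"))  // -> wooster
    }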
It'll be interesting to see if the response changes based on localization (once this is available in more languages than US English, of course). Having the response to that query change if the locale is es_VE would be pretty slick!
Thinking about this, do systems like Android typically offer localization that specific?
Indeed; do questions about the Cardinals give different answers during football season? And if it's the time of year that people are playing both baseball and football, does it distinguish by whether you live in St. Louis or Arizona?
What's up with it showing directions to Moscone Center and then, when he closes it, there's something about Wooster College? I don't wanna call fake, but that's definitely strange... Then later on he asks a question about Wooster College.
Edit: Also, he says "Where is that museum with Egyptian stuff in San Jose?" and after he closes it, it shows "where is the tallest building in the world".
Disclaimer: the only edits I made were to cut time between each of my queries, as well as re-order some of the demos from the original order I recorded them in, so they would fit into categories. None of the queries themselves have been edited or cut down, and the sequences are intact. The processing time happened exactly as you see. This demo is made on the early build of Android 4.1 (JRN84D, takju build for Galaxy Nexus I/O edition), on a wifi connection. Consider this beta.
So what you're seeing are places where the queries were reordered.
If you read the page, he says he cut the video down a bit and also reordered some of the clips to fit them into categories rather than the order he recorded them in.
That could explain why some of the results have other things shown when he closes them.