Real Time Communications Featured Article

Microsoft Building Key Value-added Services Applicable to WebRTC Developers

November 12, 2015

I know everyone loves WebRTC for adding voice and video to everything under the sun, but the real value and profits will come from adding intelligence to the processing of media streams. Microsoft is rapidly building and rolling out APIs to process voice, video, and language that are likely to be "killer apps" for WebRTC developers who recognize where things are headed.

Project Oxford is the name of Microsoft's machine learning and artificial intelligence (AI) project. Developers can currently work with beta APIs in three categories: vision, speech, and language. The public beta tools are available for a limited free trial.

The latest beta vision API performs emotion recognition on an image, first recognizing faces via a Face API and then using an Emotion API to provide a weighted scale of eight emotional states showing on each face: Neutral, Happiness, Surprise, Sadness, Anger, Disgust, Fear, and Contempt.
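To make the "weighted scale" idea concrete, here is a minimal Python sketch of how an app might pick the strongest emotion for one face. It makes no call to Project Oxford. The dictionary shape, the lowercase key names, the sample values, and the `dominant_emotion` helper are all assumptions made for illustration, not the documented response format.

```python
# The eight emotional states named in the article, as hypothetical dictionary keys.
EMOTIONS = (
    "neutral", "happiness", "surprise", "sadness",
    "anger", "disgust", "fear", "contempt",
)


def dominant_emotion(scores):
    """Return (label, weight) for the highest-weighted emotion on one face."""
    missing = [name for name in EMOTIONS if name not in scores]
    if missing:
        raise ValueError(f"missing emotion scores: {missing}")
    label = max(EMOTIONS, key=lambda name: scores[name])
    return label, scores[label]


# Made-up weights for a single detected face (assumed to sum to roughly 1.0).
sample_face = {
    "neutral": 0.10, "happiness": 0.82, "surprise": 0.04, "sadness": 0.01,
    "anger": 0.01, "disgust": 0.01, "fear": 0.005, "contempt": 0.005,
}

label, weight = dominant_emotion(sample_face)
print(f"{label}: {weight:.2f}")
```

A real integration would first call the face detection service, then feed each detected face's scores into a helper like this one.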

Microsoft suggests developers will be able to use the tools to help marketers gauge people's reactions to things such as a store display, movie, or food. A messaging or photo app might offer different options based upon the emotion it recognizes in a photo; think about how you'd like to sort out "Happy moments" when building family albums. And I can see call centers using it to rate customer satisfaction and to escalate a call from a stock agent up to a specialist or supervisor.
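The call center idea could be sketched along these lines in Python. This is purely illustrative: the choice of which emotions count as negative, the window size, the threshold, and the `EscalationMonitor` class are assumptions, and the scores are assumed to arrive as one dictionary per sampled video frame.

```python
from collections import deque

# Hypothetical choice of which emotions count against customer satisfaction.
NEGATIVE = ("anger", "disgust", "contempt", "sadness")


class EscalationMonitor:
    """Flag a call for escalation when negative emotion stays high."""

    def __init__(self, window=5, threshold=0.6):
        self.samples = deque(maxlen=window)  # most recent negative weights
        self.threshold = threshold

    def add(self, scores):
        """Record one frame's scores; return True if the call should escalate."""
        negative = sum(scores.get(name, 0.0) for name in NEGATIVE)
        self.samples.append(negative)
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average >= self.threshold


monitor = EscalationMonitor(window=3, threshold=0.5)
for frame in ({"anger": 0.7}, {"anger": 0.6}, {"anger": 0.8}):
    escalate = monitor.add(frame)
print("escalate to supervisor:", escalate)
```

Averaging over a window rather than reacting to a single frame keeps one stray frown from pulling a supervisor into the call.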

Other tools in the Microsoft Project Oxford toolbox today include an enhanced spell check that recognizes slang words, brand names, and expressions (available today, and I really wish it were available as an upgrade or plug-in for my somewhat dated version of Microsoft Word) and the aforementioned Face API. The Face API is being updated to include facial hair and smile prediction tools, with improved visual age estimation and gender identification; you may recall the social media buzz when Microsoft rolled out a camera-based "guess my age" teaser earlier.

By the end of the year, Microsoft is rolling out several heavy-lifting APIs in speech and video. The video tool lets customers easily analyze and automatically edit videos by doing things such as tracking faces, detecting motion, and stabilizing shaky video. It is based on some of the same technology as Microsoft's Hyperlapse video app, and it's easy to imagine these services applied to alarm systems, marketing tools, and videoconferencing systems.

Speech tools will be equally important. Speaker recognition, a high-value feature for voice apps, will be available to apply as a security measure by the end of the year. Also coming out as an invite-only beta by the end of the year is Custom Recognition Intelligent Services (CRIS), which lets developers customize speech recognition for challenging environments. It could help in a noisy public space, such as a shopping center or shop floor. It could also be applied to better understand people who have trouble with "traditional" voice recognition, such as those with disabilities or non-native speakers of a language.

If you are working with WebRTC apps and want to add value or intelligence to the baseline app, I'd suggest taking a look at the Project Oxford APIs. Keep in mind they are currently in a free beta. Microsoft hasn't said when it will start charging for them or what the business model for using them will look like. Hopefully, the company will take a page from the WebRTC services industry and offer reasonable usage and pay-as-you-grow terms.

Edited by Stefania Viscusi
