Joy, pain and unknown manatees: a Microsoft Video Indexer test

Sven Krumrey

Recently, Microsoft's Build 2017, an annual developer conference, took place in Seattle and the company once again displayed their crown jewels along with big plans for the future (as always). With a Colgate smile, attendees were shown what smart services can do with cloud data - and data protection officials were probably reaching for their suicide pills right away. Video Indexer is a great example of how to leverage these new technologies and you can give it a try - if you dare. After all, Microsoft didn't lie when they spoke of a "democratization of surveillance tools".

A bright outlook into the future at Build 2017

Artificial intelligence is slowly but noticeably making its way into our everyday lives. With Video Indexer, Microsoft is giving the public a tool that analyzes videos on multiple levels, and they didn't cut corners. Aside from processing the usual length, file format and file name attributes, the tool also scans for speech and emotions. Will the Azure server know when I am happy? Reason enough to upload a few videos of my own and let Video Indexer have a go at them.

Since Microsoft Video Indexer is an online service, you'll first need to register to upload your movies. I went with a colorful mix of different videos (including some related to Ashampoo) and transferred them into the cloud. After just a few minutes I was stunned. While the results for the nature video were empty (the tool probably knows nothing about manatees!), there was a considerable depth to the analysis of the interview I uploaded. Central issues were recognized and appeared as keywords in the list to the right of the video. Once clicked, playback instantly navigated to the corresponding time index. It's also possible to filter entire video collections based on keywords. That's impressive and, once again, likely a wet dream of all secret service agencies.
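For the curious, the keyword navigation described above boils down to a lookup table from keywords to time positions. The following is only a minimal sketch of that idea and not Video Indexer's API or output format; the sample data and the names `KEYWORD_HITS`, `build_index` and `find` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (video file, keyword, position in seconds).
# This is NOT real Video Indexer output, just an illustration.
KEYWORD_HITS = [
    ("interview.mp4", "data security", 42.0),
    ("interview.mp4", "cloud", 95.5),
    ("panel.mp4", "cloud", 12.0),
]

def build_index(hits):
    """Map each lower-cased keyword to a list of (video, seconds) positions."""
    index = defaultdict(list)
    for video, keyword, seconds in hits:
        index[keyword.lower()].append((video, seconds))
    return index

def find(index, keyword):
    """Return all positions for a keyword, sorted by video name, then time."""
    return sorted(index.get(keyword.lower(), []))

index = build_index(KEYWORD_HITS)
for video, seconds in find(index, "Cloud"):
    print(f"{video}: jump to {seconds:.1f}s")
```

Clicking a keyword then simply means seeking the player to one of the stored positions, and filtering a collection means listing every video that appears under that keyword.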

Does everything it's supposed to do - and maybe more

Transcripts are also worth mentioning. Video Indexer analyzes and transcribes all spoken words. Even instant translations are possible, though their quality naturally depends on the audio quality of the source material. Amateur recordings yield fragmented texts, but professional interviews recorded through a good microphone result in almost flawless transcriptions, though a few funny bloopers are bound to pop up here and there. My boss certainly didn't reply with "Kidney. Life, never" when asked about data security. The technology is still in development (engineers are currently working on body language recognition) but the current state is already astonishing. Text that appears in the picture also gets recognized. If you ever wanted to see how states manage to capture and analyze license plates, here's your chance.

Naturally, face detection is included as well. Celebrities are instantly recognized through Bing while the identification of private individuals is left to the user and will be permanently stored, once provided. Again, video quality matters. Blurry or badly lit source material will quickly turn your colleagues into Hollywood stars or politicians in the eyes of the tool. This may be flattering but it's still an error. Video Indexer furthermore tries to determine the situation and relevance of each character to a scene. For my panel show, the tool concluded the loudmouth had about 40% of on-screen time and beat all other participants in that respect. That's a life lesson learned!
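How might a figure like "40% of on-screen time" come about? One plausible way is to add up the intervals in which a face is visible and divide by the video length. The sketch below assumes exactly that; it is my own guess at the arithmetic, not Microsoft's documented method, and the function name and sample numbers are invented.

```python
def screen_time_share(appearances, video_length):
    """Return each person's share of on-screen time.

    appearances: dict mapping a name to a list of (start, end) intervals
                 in seconds, assumed not to overlap for the same person.
    video_length: total length of the video in seconds.
    """
    shares = {}
    for name, intervals in appearances.items():
        visible = sum(end - start for start, end in intervals)
        shares[name] = visible / video_length
    return shares

# Hypothetical panel show of 100 seconds.
panel = {
    "loudmouth": [(0, 30), (50, 60)],
    "guest": [(30, 50)],
}
for name, share in screen_time_share(panel, 100).items():
    print(f"{name}: {share:.0%}")
```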

Speech sentiment is where it got really interesting. Video Indexer tries to gauge how speakers feel about what they are saying and classifies each passage as neutral, positive or negative on a neatly colored timeline. The hit ratio is high. A video with a slightly tipsy friend of mine talking about her vacation trip received a deep green (positive) while a scientific lecture was mainly rated gray (neutral). When an elderly person started talking about the government, everything turned red. Microsoft definitely detected the anger! Roaring cheers, however, were misinterpreted as aggression. There's an option to scan for "explicit content". Since I had no such videos, I skipped this feature. I swear!
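To make the colored timeline a little more tangible, here is a toy version of the last step: turning a sentiment score per passage into one of the three colors. Everything in it is an assumption for illustration, including the score range from -1 to 1, the threshold of 0.25 and the function names. How Video Indexer actually derives its ratings is not something I can see from the outside.

```python
# Colors as seen on the timeline: green, gray and red.
COLORS = {"positive": "green", "neutral": "gray", "negative": "red"}

def classify(score, threshold=0.25):
    """Label a score between -1 and 1 (the threshold is an arbitrary assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def timeline(segments):
    """Turn (start, end, score) segments into (start, end, color) entries."""
    return [(start, end, COLORS[classify(score)]) for start, end, score in segments]

# Hypothetical scores: vacation story, lecture, rant about the government.
sample = [(0, 20, 0.8), (20, 45, 0.05), (45, 60, -0.7)]
for start, end, color in timeline(sample):
    print(f"{start:>3}s to {end:>3}s: {color}")
```

A fixed threshold like this also hints at why roaring cheers can end up on the wrong side: loud and intense does not necessarily mean negative.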

A quick analysis for starters

Also still in the works is advanced object and gesture detection among other things. Though the current state is labeled "preview" by Microsoft, it is easy to see where this already powerful tool is going. If you own a ton of videos (that contain people and lots of talk), you'll save hours processing them with Video Indexer as long as you can warm up to the whole cloud concept. That's what it'll ultimately come down to since the product won't be available offline.

It's difficult to draw a final conclusion because, aside from the technical brilliance, the product raises various questions. How will private consumers, employers or government institutions use these new abilities? Microsoft provides a freely accessible API (application programming interface) so other developers can integrate the technology into their own products. Where will they draw the line, and who will enforce their usage policy, especially in view of potentially unethical use? Once widely available, third parties may use the technology to foster surveillance, censorship and tracking. It's the old story: anything that can be done will be done. So is it worth it?

Pictures: Microsoft (Azure)