Semantic video indexing app

The newest version of Mac OS, called “Mountain Lion,” includes “Dictation,” which is a piece of system software that takes speech and converts it to text. This is nothing new, of course. I remember that I had a piece of dictation software for my old Windows 98 PC. You had to “train” the software to understand what you said, and even then it was wildly inaccurate, but in principle, this sort of software has existed for a long time. Dictation on Mac OS is much better than the software I had back in 1998, but of course it is not perfect.

That particular piece of software I had on my PC was not built in to the operating system. I had to pay for it. Not only that, but because it didn’t work very well, I never got another dictation programme again. But now that this one is built into the OS, I think I’m going to try an experiment.

Here’s my inspiration: In Star Trek, every character keeps a “log,” and because it’s the future, it’s an audio log. In The Next Generation, they were often shown as video (b)logs. Sometimes, in order to advance the plot, a character would be shown searching through his own (or another person’s) logs. What was interesting was that the search would usually be a semantic keyword search. Something like, “Computer, show me all log entries relating to the warp core” (or whatever they were interested in at the time). With dictation software now a standard feature in OS X, we’re at a point where we could write an app that does exactly what the computer did in Star Trek.

The workflow will be as follows: Take a video (or a set of videos) that you’re interested in, and extract the audio. Divide the one big audio file into hundreds of smaller (say, ten-second-long), overlapping audio files, each annotated with its start time in the original video. Pass each of these smaller files through the system’s speech-to-text dictation software and save the resulting text, along with the start time, to a text file. And voilà, you have generated a time-encoded text index for your video, just like the one on YouTube, but without having to upload the file.
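To make the "overlapping audio files" step a little more concrete, here is a minimal sketch in Python of how the start and end times of each chunk might be worked out. The function name `segment_windows` and the two-second overlap are my own assumptions for illustration; the only number that comes from the plan above is the ten-second chunk length. It only computes the time windows, and leaves the actual audio cutting and dictation to whatever tools you have on hand.

```python
def segment_windows(duration, window=10.0, overlap=2.0):
    """Return (start, end) times in seconds for overlapping chunks.

    `window` is the chunk length; `overlap` is how much each chunk
    shares with the previous one (an assumed value, not a tested one).
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")

    step = window - overlap  # how far each new chunk starts after the last
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)  # last chunk may be shorter
        windows.append((start, end))
        if start + window >= duration:
            break  # this chunk already reaches the end of the audio
        start += step
    return windows


# A 25-second clip is split into three chunks that overlap by two seconds.
for start, end in segment_windows(25.0):
    print(f"{start:6.1f}s to {end:6.1f}s")
```

The overlap is there so that a phrase spoken across a chunk boundary still lands whole in at least one chunk, at the cost of transcribing some audio twice.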

Wrap this all up in a shiny OS X app and put it on the App Store. Sell it for $0.99.

Then, if you had a bunch of videos (say, seasons 5–6 of Doctor Who) and you wanted to find all references to “the Silence,” you could install the app, have it index your iTunes library, and then search through your videos for certain keywords or phrases.
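Once you have a time-encoded index, the search itself could be very simple. The sketch below assumes the index is just a list of (start time in seconds, transcribed text) pairs; that layout, the function name `search_index`, and the sample lines are all made up for illustration and are not output from any real dictation run.

```python
def search_index(index, phrase):
    """Return the start times of every chunk whose text contains `phrase`.

    `index` is assumed to be a list of (start_seconds, text) pairs.
    Matching is a plain case-insensitive substring test.
    """
    needle = phrase.casefold()
    return [start for start, text in index if needle in text.casefold()]


# Invented sample index, standing in for real transcriptions.
sample_index = [
    (0.0, "Hello, Doctor"),
    (8.0, "the Silence will fall"),
    (16.0, "silence, please"),
]

for start in search_index(sample_index, "the silence"):
    print(f"match at {start:.1f}s")
```

Because the chunks overlap, the same spoken phrase can show up in two neighbouring chunks, so a real app would probably want to merge hits that are only a few seconds apart.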

Actually, this might work. If anyone wants to collaborate with me on this one, hit me up in the comments.

Edit: I take it back. A quick experiment with Dictation indicates that we are nowhere near having the technology to be able to do this.

Published by

The Grey Literature

This is the personal blog of Benjamin Gregory Carlisle PhD. Queer; Academic; Queer academic. "I'm the research fairy, here to make your academic problems disappear!"
