Now here’s a little something to blow your mind: Google has developed new image recognition software that doesn’t just label an object in an image with a single word (e.g., “motorcyclist”) but actually tells you, in a sentence, what’s going on in any image it processes. Actual example: “A person riding a motorcycle on a dirt road” to describe an image of … you guessed it, a person riding a motorcycle along a dirt road. It’s added verbs to the standard (and not especially useful) repertoire of nouns.
Google research scientists Oriol Vinyals, Samy Bengio, Alexander Toshev and Dumitru Erhan have just blogged about Google’s new machine learning technology – and it goes well beyond the most advanced image recognition capabilities to date (things like object detection, classification and labelling, all of which have recently seen great improvements). Google has been using huge simulated neural networks to process visual content – networks which “learn” not by being programmed with rules by a human engineer but by consuming data.
In a nutshell, this is how it works: two neural networks, each trained on its own data to perform a different task, are plugged together. One is trained to process images into a mathematical representation, while the other is trained to generate full sentences (in English only, at present) – the latter borrowed from automated translation software. The first network “looks” at an image, then passes its mathematical account of what it has “seen” on to the second, which sets about processing that data into a humanly intelligible statement.
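The encode-then-describe pipeline can be sketched in miniature. The toy below is not Google’s system – the vocabulary, dimensions and untrained random weights are all illustrative assumptions – but it shows the same shape: one network condenses an image into a feature vector, and a second, recurrent network unrolls that vector into a word sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real system uses tens of thousands of words.
VOCAB = ["<start>", "<end>", "a", "person", "riding", "motorcycle", "on", "dirt", "road"]
V = len(VOCAB)
D = 8  # size of the image-feature / hidden vector (illustrative)

# --- Network 1: "vision" encoder (stand-in for a deep CNN) ---
W_enc = rng.normal(size=(16, D))

def encode(image):
    """Map raw pixel values to a fixed-length feature vector."""
    return np.tanh(image @ W_enc)            # (16,) -> (D,)

# --- Network 2: language decoder (stand-in for an RNN) ---
W_h = rng.normal(size=(D, D))                # recurrent weights
W_x = rng.normal(size=(V, D))                # word embeddings
W_out = rng.normal(size=(D, V))              # hidden state -> word scores

def caption(image, max_len=10):
    """Greedily generate words, seeded by the image features."""
    h = encode(image)                        # image features start the state
    word = VOCAB.index("<start>")
    out = []
    for _ in range(max_len):
        h = np.tanh(h @ W_h + W_x[word])     # recurrent state update
        word = int(np.argmax(h @ W_out))     # greedy: pick the top-scoring word
        if VOCAB[word] == "<end>":
            break
        out.append(VOCAB[word])
    return out

print(caption(rng.normal(size=16)))
```

With random weights the output is of course gibberish drawn from the toy vocabulary; training (below) is what makes the sequence match the picture. Note also that production systems typically use beam search rather than the greedy word-by-word choice shown here.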
When shown tens of thousands of images accompanied by descriptions written by humans, the combined network actively learned to generate increasingly accurate descriptions. On a 100-point scale, the Google combined network managed unprecedented scores in the 60s (humans typically reach the 70s).
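What “learning from human-written descriptions” typically boils down to in models of this kind (a standard practice, not a detail confirmed by the blog post) is minimising the average negative log-probability the decoder assigns to each correct next word of a human caption. A minimal sketch of that objective:

```python
import numpy as np

def softmax(scores):
    """Turn raw word scores into a probability distribution."""
    z = scores - scores.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def caption_loss(scores_per_step, target_ids):
    """Average negative log-probability of each correct next word.

    scores_per_step: one score vector (length = vocabulary size) per position
    target_ids:      the human-written caption as word indices
    """
    return -np.mean([np.log(softmax(s)[t])
                     for s, t in zip(scores_per_step, target_ids)])

# A decoder that puts high probability on the right words scores a low loss;
# an undecided (uniform) decoder scores a higher one.
confident = [np.array([0.0, 10.0, 0.0])]     # strongly favours word 1
undecided = [np.zeros(3)]                    # uniform over 3 words
print(caption_loss(confident, [1]), caption_loss(undecided, [1]))
```

Pushing this loss down across tens of thousands of image-caption pairs is what nudges the combined network toward captions that read like the human ones; the 100-point evaluation scale the post mentions is a separate measure of how closely the generated sentences match human references.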
The software, Vinyals says, is still at an early research stage, and it has some way to go (it does make mistakes, for example describing an image of a sticker-covered road sign on a white background as “A refrigerator filled with lots of food and drinks”). But it has enormous potential and could, as Vinyals noted, eventually help visually impaired people to access and navigate online content (it could also revolutionise image search for search engines).
We await further developments with bated breath.