Google has once again stepped up the game with their announcement that users can now search by image. My first reaction when I read this was simply, “Wow”. Reading on, however, I came to the part about the “nascent nature of computer vision”. Anecdote puts the start of research into computer vision back to 1966, when an undergraduate student was directed to “solve the vision problem” over the course of the summer. Such is the degree to which we underestimate the complexity of the problem. About one third of our cortex is dedicated to processing information from our eyes in probably hundreds of distinct ways, and to integrating all of those results into what we consciously experience as a seamless whole. Nothing shows just how galactically complicated human vision is better than one of the top technology companies on the planet, more than 40 years later, calling our understanding of it “nascent”.
To be sure, some truly inspired work has been done in that time. Advances in some areas such as OCR and specialized machine vision have revolutionized productivity in certain industries. And depending on who you ask there are vision technologies peeking (no pun intended) over the horizon that will transform the world.
But did we start off on the wrong foot altogether? When someone decides to get into artificial intelligence, I’d say there’s probably more than a 50% chance that they’ll go into computer vision first. And almost without fail, the first stop on the long, long vision path is image recognition. And why not? It seems simple enough to begin with a static image and process it until it concedes some kind of understanding of the scene. Certainly our computing technology most readily lends itself to this approach.
The problem is that this is not how human vision works, and naturally it is human-level vision that everyone is really after (just as it is a human-level intelligence that all AI researchers are after). Everyone has seen pictures of animals that are masters of camouflage, how they appear to be just another leaf or twig or rock or bump of sand until they suddenly move, and they are revealed. And those who read Jurassic Park remember how, if you just stayed still, the dinosaurs wouldn’t see you because they could only see motion; they couldn’t decipher a static scene. (Although the characters later – and unhappily – learned that the dinosaurs were in fact more cerebrally advanced.)
Could Michael Crichton have been right? Could it be that motion recognition is simpler than image recognition? It certainly seems plausible, especially when an undeniable natural defense against predators is blending into your environment by being appropriately coloured and not moving. And if you consider how much simpler it would be to build a neuron circuit that detects visual change than one that detects arbitrary static forms, the argument becomes very convincing.
If we accept, then, that motion recognition came before image recognition, we can apply the old rule of evolutionary conservation and assume that the latter is an advanced form of the former. Again, in practice we can see (again, no pun intended) how this may be the case. Detecting motion is better than detecting nothing, but responding only to relevant motion is better than responding to everything. And the better our assessments of relevancy, the better our survival.
But let us return now to how to build computer vision. Perhaps it is wrong to start off with image recognition. Perhaps the right approach is to build a computer vision system that sees motion first, and through this builds up a repository of objects with a complete set of visual perspectives. Such a repository could then be used to decompose an image into the objects the scene contains. Given that the edges and gradients of an object move more or less together in a moving scene, it should be easier to associate features across the frames of a moving scene than within a single static one.
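To make the “see motion first” idea concrete, here is a minimal sketch of the simplest possible motion detector: differencing two consecutive frames and flagging pixels that changed. The function names and the grayscale-grid frame format are my own illustrative choices, not taken from any real vision library.

```python
# Minimal frame-difference motion detection. Frames are assumed to be
# 2D lists of grayscale intensities (0-255); names are illustrative.

def detect_motion(prev_frame, curr_frame, threshold=25):
    """Return a binary mask marking pixels that changed between frames."""
    mask = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        mask.append([1 if abs(c - p) > threshold else 0
                     for p, c in zip(prev_row, curr_row)])
    return mask

def motion_fraction(mask):
    """Fraction of pixels flagged as moving -- a crude 'something moved' signal."""
    total = sum(len(row) for row in mask)
    moving = sum(sum(row) for row in mask)
    return moving / total if total else 0.0
```

Note how the camouflage argument falls out of the code for free: a perfectly still animal, however strangely coloured, produces an all-zero mask, while the changed pixels of a moving one cluster together, giving exactly the kind of coherent blob from which one could begin carving objects out of the scene.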
A significant problem with this approach is that our computer technology is not designed for it. I personally know of no computer languages in which time – even in an abstract sense – is a key feature. For example, if you wanted to model the potential of a neuron – how it changes based upon interactions with neurotransmitters and either tends back toward its resting state or fires an action potential – it would be entirely up to you to manage the changes that happen only due to time. Likewise, the release of adrenaline in an appropriate organism causes distinct changes, the effects of which diminish over time. Modeling such effects is completely up to the programmer. I believe this is why vision researchers tend to prefer static images. Our technology provides a simple way to work with them. But very quickly the same simplicity ends up being a roadblock.
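As a sketch of the bookkeeping that paragraph describes, here is a bare-bones leaky integrate-and-fire neuron. The class, parameter names, and values are illustrative assumptions, not drawn from any existing neural-simulation library; the point is that the decay toward rest – the change that happens “only due to time” – must be threaded through every update by hand, via an explicit `dt` argument.

```python
# A leaky integrate-and-fire neuron sketched by hand. Nothing in the
# language tracks time for us; every call must pass dt explicitly.

class LeakyNeuron:
    def __init__(self, rest=-70.0, threshold=-55.0, tau=20.0):
        self.rest = rest            # resting potential (mV)
        self.threshold = threshold  # firing threshold (mV)
        self.tau = tau              # decay time constant (ms)
        self.v = rest               # current membrane potential (mV)

    def step(self, dt, input_mv=0.0):
        """Advance the neuron by dt milliseconds; return True if it fires."""
        # The change driven purely by the passage of time: decay toward rest.
        self.v += (self.rest - self.v) * (dt / self.tau)
        # The change driven by input (e.g. neurotransmitter interactions).
        self.v += input_mv
        if self.v >= self.threshold:
            self.v = self.rest      # fire an action potential and reset
            return True
        return False
```

If time were a first-class feature of the language, the decay line would be declared once (“v relaxes toward rest with time constant tau”) rather than re-derived inside every update loop the programmer writes.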
In GoiD I plan to introduce scripting features that will make time-based change simple to implement. Hopefully players will find this useful, and such concepts will spread beyond the game.