ICT Projects - Giving images a voice

Images dominate our world, but most software still can’t accurately read them. Through Deep Learning, BRANDON BIRMINGHAM’s PhD is set to change that.

The majority of us experience reality in terms of images: what we see shapes the way we interact with the world. Even so, we rarely stop to consider that our brains interpret what we’re looking at with little conscious effort on our part. But what happens when computers ‘look’ at images? At this point in time, computers cannot truly understand what an image shows, nor can they accurately describe it. Nevertheless, as Brandon Birmingham’s PhD in Communications & Computer Engineering takes shape, that is all set to change.

Brandon is working on creating Artificial Intelligence (AI) software that can not only recognise what is in an image, but also understand the setting, the ambiance and the relationships between the elements within that image, just as a human would.

“The idea here isn’t just to get the software to say that Picture X is a picture of a cat, but to get the software to interpret the photo in the same way a human being would,” Brandon explains. “Where is the cat? What does it look like? Is it asleep, yawning, lounging, eating? And what’s surrounding it? Is it all just backdrop or is there anything related to it that is of interest to the viewer?”

One of the main challenges is getting the software to see two-dimensional images in a three-dimensional manner, similar to how we, as humans, see the world. And, surprisingly, one of the hardest things to teach a machine is the difference between spatial prepositions, which it can then use to determine where one object sits in relation to another.

“We are using neural networks to get the machine to learn spatial prepositions,” Brandon continues. “These neural networks are basically algorithms inspired by the architecture of the human brain that can be trained to identify underlying relationships between things. In this case, two or more objects in a photo.”
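To make the idea of learning spatial prepositions concrete, here is a minimal sketch rather than Brandon’s actual model. It assumes each object is given as a bounding box (x_min, y_min, x_max, y_max) with y growing downward, uses a hypothetical set of four prepositions, and trains a single-layer softmax classifier (about the simplest neural network there is) on synthetic examples. All names and parameters are illustrative assumptions.

```python
import math
import random

# Hypothetical label set; the article does not list the prepositions used.
PREPOSITIONS = ["above", "below", "left of", "right of"]

def box_features(subject, obj):
    """Offset between box centres, scaled by the object's size, plus a bias term.
    Boxes are (x_min, y_min, x_max, y_max); y grows downward as in image coordinates."""
    sx, sy = (subject[0] + subject[2]) / 2, (subject[1] + subject[3]) / 2
    ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    scale = max(obj[2] - obj[0], obj[3] - obj[1], 1e-6)
    return [(sx - ox) / scale, (sy - oy) / scale, 1.0]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, features):
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def synthetic_example(rng):
    """Assumed training data: label a random offset by its dominant direction."""
    dx, dy = rng.uniform(-2, 2), rng.uniform(-2, 2)
    if abs(dy) >= abs(dx):
        label = 0 if dy < 0 else 1   # above / below
    else:
        label = 2 if dx < 0 else 3   # left of / right of
    return [dx, dy, 1.0], label

def train(examples, epochs=200, learning_rate=0.5):
    """Full-batch gradient descent on the cross-entropy loss."""
    weights = [[0.0, 0.0, 0.0] for _ in PREPOSITIONS]
    for _ in range(epochs):
        grads = [[0.0, 0.0, 0.0] for _ in PREPOSITIONS]
        for features, label in examples:
            probs = softmax(class_scores(weights, features))
            for c, row in enumerate(grads):
                error = probs[c] - (1.0 if c == label else 0.0)
                for i, f in enumerate(features):
                    row[i] += error * f
        for c, row in enumerate(weights):
            for i in range(len(row)):
                row[i] -= learning_rate * grads[c][i] / len(examples)
    return weights

def predict(weights, subject, obj):
    scores = class_scores(weights, box_features(subject, obj))
    return PREPOSITIONS[scores.index(max(scores))]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
WEIGHTS = train([synthetic_example(rng) for _ in range(200)])

cat_box, sofa_box = (40, 10, 60, 30), (20, 50, 80, 90)
print("The cat is", predict(WEIGHTS, cat_box, sofa_box), "the sofa")
```

A real system would learn from annotated photographs and richer features (depth cues, object categories, overlap) instead of hand-generated offsets, and would cover far more prepositions than these four.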

Predicting spatial prepositions between objects in an image, however, is only the beginning: Brandon’s plan is to create software that can autonomously generate rich and detailed descriptions of images. To begin with, he took a number of images and used a mathematical model, designed to automatically find relevant captions, to scour the web for the most detailed descriptions of them.
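The article does not say which mathematical model was used to find relevant captions. As a rough illustration only, the sketch below assumes the objects in an image have already been detected and named, and ranks candidate captions by the cosine similarity between their words and those labels. The function names and the sample captions are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_captions(detected_labels, candidate_captions):
    """Return (score, caption) pairs, best match first."""
    query = Counter(tokenize(" ".join(detected_labels)))
    scored = [(cosine_similarity(query, Counter(tokenize(c))), c) for c in candidate_captions]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Assumed inputs: labels from an object detector and captions gathered from the web.
labels = ["cat", "sofa", "window"]
captions = [
    "A dog running in the park",
    "A cat",
    "A cat sleeping on a sofa by the window",
]
ranked = rank_captions(labels, captions)
for score, caption in ranked:
    print(f"{score:.2f}  {caption}")
```

Plain cosine similarity rewards word overlap but not detail, so a model aiming for the richest descriptions would need an additional term that favours informative captions.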

The plan is now to extend this web-retrieval-based architecture by constantly accumulating knowledge extracted from the vision and language domains through the use of Deep Learning. In contrast to the majority of current state-of-the-art captioning models, which are supervised, the proposed self-learning model will be trained using an unsupervised, lifelong-learning approach. This autonomy and independence from humans is essential to making Brandon’s vision a success.

“I am envisaging three main uses for this software,” he explains. “Firstly, through the detailed descriptions, we will be able to retrieve images more accurately off the web or personal image collections – it won’t be just about keywords anymore, but about rich explanations. Secondly, visual content will become more accessible to the visually impaired as richer descriptions will bring the images to life. It will also add the possibility to help them understand and navigate in the visual world through the use of Artificial Intelligence-based smart-glasses. Thirdly, this is a further step in the continuation of human-to-robot interaction: in the future, we may be able to simply tell our autonomous car to ‘park in front of that green door’ and it would be able to understand us in the same way a human would.”

Of course, bridging the divide between visuals and language is a challenging task, but it’s one that could address numerous shortcomings and problems. And that is the spirit of ICT in our world, because it’s the promise of a solution that makes the work worthwhile.

As for Brandon, he’s never let a challenge get in his way. In fact, while he began this project for his Master’s degree, he realised straight away that it would be quite a feat to get the software to work, which is why he’s persevered and turned it into a PhD study.