MARTINA MELI is a master’s student in computer information systems whose work seeks to create an algorithm that can automatically replace complicated words with simpler ones, helping non-native speakers and those with learning difficulties read in Maltese. Here, we see how she’s using artificial intelligence to turn this vision into reality.
Language is one of the most powerful communication tools we have, as it defines the way we interact with other human beings, learn new things, and receive information. Even so, as anyone who’s ever learnt a second or third language would know, reading in that new language can sometimes feel tedious, as complicated words force us to spend more time researching their meaning than understanding the context of the piece. But what if software could make life easier by automatically replacing these pesky words with simpler ones?
“Lexical Simplification (LS) is the process of replacing complex words in a sentence with simpler substitutes while ensuring that meaning remains unchanged,” says Martina, a full-time master’s student at the Faculty of ICT. “Yet, despite the existence of such LS systems in other languages, there is no similar technology for content written in Maltese.”
This is why Martina’s research proposes an LS system for Maltese content. Called Supporting Maltese Content Accessibility Through Lexical Simplification, this project is, in theory, straightforward: the system would discover complex words in Maltese, such as ‘ħolistiku’ (holistic), and replace them with simpler terms, like ‘sħiħ’ or ‘komplut’. But, as one can imagine, creating a system that can do this is not child’s play.
“We were required to develop an algorithm for potential complex word identification, substitute generation, substitute selection, and substitute ranking. So, to start off with, we got permission to compile a database of sentences from four Maltese news websites, namely TVM News, Newsbook, One, and Illum. These were then split into two sets of 20: the dev set, which is used to tune the system and find its optimal parameters, and the test set, used for testing and evaluating the system’s output.”
Martina then had to create a number of filters, since whether a word counts as complicated is subjective. She started by giving each word a part-of-speech tag that specifies whether it’s a verb, noun, adjective, or adverb. She then moved on to exclude certain words from ever being replaced, such as people’s titles (Ms, Dr, Professor, etc.), words that are spelt the same way in both English and Maltese, and titles of books, films, organisations, and so on.
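To make the filtering step concrete, here is a minimal sketch in Python of how such exclusion rules could be chained. It is not Martina’s implementation: the `Token` type, the tag names, the `in_title` flag, and the tiny word lists are all assumptions made for illustration, and the tags in the sample are assigned by hand rather than by a real tagger.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """A word with a hand-assigned part-of-speech tag (hypothetical structure)."""
    text: str
    pos: str
    in_title: bool = False  # True if the word is part of a book/film/organisation name


# Only these word classes are considered for simplification.
CONTENT_POS = {"VERB", "NOUN", "ADJ", "ADV"}
# Toy stand-ins for the real exclusion lists, which the article does not give.
HONORIFICS = {"ms", "mr", "dr", "professor"}
SAME_IN_ENGLISH = {"film", "internet"}


def candidate_words(tokens):
    """Return the words that survive every exclusion filter."""
    candidates = []
    for tok in tokens:
        word = tok.text.lower().rstrip(".")
        if tok.pos not in CONTENT_POS:
            continue  # not a verb, noun, adjective, or adverb
        if word in HONORIFICS:
            continue  # people's titles are never replaced
        if word in SAME_IN_ENGLISH:
            continue  # spelt the same in English and Maltese
        if tok.in_title:
            continue  # part of a named work or organisation
        candidates.append(tok.text)
    return candidates


SAMPLE = [
    Token("Dr", "NOUN"),
    Token("Borg", "PROPN"),
    Token("film", "NOUN"),
    # Tagged NOUN on purpose, so that only the in_title flag filters it out.
    Token("Newsbook", "NOUN", in_title=True),
    Token("ħolistiku", "ADJ"),
]

print(candidate_words(SAMPLE))
```

Only the words that pass every filter would move on to the later stages, which keeps names and shared vocabulary out of the simplifier’s reach.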
Then, to help with generating substitute words, Martina used BERTu, a monolingual Maltese language model trained on a Maltese corpus (a dataset of raw Maltese sentences). The model also breaks words down into tokens: the word ‘jien’ (I), for example, would be assigned one token, but ‘edukazzjoni’ (education) would have two or three.
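The idea of a word being worth one token or several can be illustrated with a greedy longest-match subword splitter, the scheme commonly used by BERT-style models. The sketch below is an assumption-laden toy: the three-entry vocabulary is invented for this example and is not BERTu’s real vocabulary, so the actual splits the model produces may differ.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece covers this position
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only to mirror the article's two examples.
TOY_VOCAB = {"jien", "eduka", "##zzjoni"}

print(split_into_subwords("jien", TOY_VOCAB))
print(split_into_subwords("edukazzjoni", TOY_VOCAB))
```

With this toy vocabulary the short word stays whole while the longer one is split into two pieces, which is one rough signal a system could use when judging how common or complex a word is.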
The final piece in the jigsaw puzzle was substitute ranking, in which both the target word and each substitute are given a score based on several factors, including character count, frequency, and the probability of the substitute replacing the target word.
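A ranking of this kind could be sketched as a weighted score over the three factors named above. Everything specific here is an assumption: the article does not give the actual formula, so the linear combination, the weights, and the frequency and probability figures below are invented purely to show the mechanics.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    frequency: int      # corpus count (made-up figure)
    probability: float  # model probability of fitting the context (made-up figure)


def simplicity_score(c, w_len=1.0, w_freq=1.0, w_prob=1.0):
    """Higher is simpler: short, frequent, contextually likely words score best."""
    length_penalty = w_len * len(c.word)
    frequency_bonus = w_freq * math.log1p(c.frequency)
    # Clamp to avoid log(0) for candidates the model considers impossible.
    probability_bonus = w_prob * math.log(max(c.probability, 1e-12))
    return frequency_bonus + probability_bonus - length_penalty


def rank(candidates):
    """Order the target word and its substitutes from simplest to most complex."""
    return sorted(candidates, key=simplicity_score, reverse=True)


# The target word is scored alongside its substitutes, as the article describes.
CANDIDATES = [
    Candidate("ħolistiku", frequency=12, probability=0.40),
    Candidate("sħiħ", frequency=900, probability=0.30),
    Candidate("komplut", frequency=400, probability=0.20),
]

for c in rank(CANDIDATES):
    print(c.word, round(simplicity_score(c), 2))
```

Scoring the target word together with its substitutes means a replacement is only worthwhile when some substitute outranks the original; changing the weights would be one way to reflect different readers’ preferences.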
When all this is completed, the system can use Assistive Technology (AT) to help readers understand text with uncommon words or phrases. But this can do more than just help non-native speakers.
“The system will also allow users to pick the substitute that they feel works best for them through a drop-down menu. Older generations, for example, may prefer to have a substitute that’s of Semitic origins, while readers with autism may prefer shorter or longer substitutes.”
Now that the algorithm has been completed, Martina has packaged it as a browser-based extension, turning something that is innately complex into something that is extremely easy to use. Indeed, what we particularly like about this project is how it takes people’s differences and needs into consideration, which is an integral part of any ICT project.