How Gulf-developed large language models like Jais are bringing Arabic into the AI mainstream
DUBAI: When ChatGPT made its debut last year, the artificial intelligence program caused a global sensation, as users found themselves communicating with a machine that could pass as another human being.
However, the enthusiasm among techies in the Arab world was somewhat diminished by ChatGPT’s limited grasp of Arabic, in part the result of the language’s complexity, diacritical markings, inflection system and regional dialects.
Although ChatGPT, which is based on a large language model, or LLM, can understand inquiries in Arabic and is able to translate, especially when using Modern Standard Arabic, answers can come across as unnatural, while literal translations do not always convey the right message.
That is why Jais, an LLM designed to support Arabic, was unveiled in July, bringing one of the world’s most widely spoken, though occasionally overlooked, languages into the AI mainstream.
Jais, a name that recalls the UAE’s highest peak in Ras Al-Khaimah, is the brainchild of a team of academics and engineers who embarked on the project because they felt too few LLMs were credibly multilingual.
Downloadable on the machine learning platform Hugging Face, Jais is the result of a collaboration between Cerebras Systems, Mohamed bin Zayed University of Artificial Intelligence, or MBZUAI, and a subsidiary of the Abu Dhabi-based G42 called Inception.
“It is vital that large language models are developed for languages other than English to ensure that innovation is accessible to everyone,” Andy Jackson, CEO of Inception, told Arab News.
“A quality Arabic LLM is critical for all sectors, businesses and organizations, as well as individuals. Innovation thrives when we collaborate, and Jais sets a new standard for AI advancement in the Middle East, ensuring that the Arabic language, with its depth and heritage, finds its voice within the AI landscape.
“Jais demonstrates our commitment to excellence, and our dedication to democratizing AI and promoting innovation.”
LLMs are machine learning models that use deep learning algorithms to process and understand natural human language. They are trained on large amounts of text data to learn patterns in the language.
These programs, which are rapidly proliferating in the wake of ChatGPT’s success, are capable of generating text on a seemingly endless array of subjects, producing everything from academic papers to poetry.
What is especially impressive about them is their ability to produce convincingly human-like responses to questions in almost any language, including programming languages.
But in order to make those languages sound convincing, native-speaking human programmers are often required to provide a critical layer of context and understanding that can enhance accuracy and reliability.
“Jais is purpose-built for the Arabic language and excels in capturing its intricacies and nuances, ensuring highly accurate and contextually relevant responses — a distinct advantage over general-purpose models,” said Jackson.
“This specialization is a pivotal development, opening up opportunities for governments, industries, and individuals across the Arab world to tap into the potential of generative AI.”
Currently considered among the foremost Arabic LLMs, Jais, a 13-billion-parameter model, was trained on a newly developed 395-billion-token dataset made up of 116 billion Arabic tokens and 279 billion English tokens. Training ran on Condor Galaxy, one of the largest cloud AI supercomputers in the world, launched by G42 and Cerebras in July.
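For readers who want to check the figures, the reported token counts can be turned into language shares with a few lines of Python. This is only an illustrative sketch: the two token counts come from the article, while the function and variable names are hypothetical and are not part of any Jais tooling.

```python
# Token counts in billions, as reported in the article.
ARABIC_TOKENS_B = 116
ENGLISH_TOKENS_B = 279


def dataset_shares(arabic_b, english_b):
    """Return the total size and each language's share of the dataset."""
    total = arabic_b + english_b
    return total, arabic_b / total, english_b / total


total_b, arabic_share, english_share = dataset_shares(ARABIC_TOKENS_B, ENGLISH_TOKENS_B)
print(f"Total: {total_b} billion tokens")          # Total: 395 billion tokens
print(f"Arabic share: {arabic_share:.1%}")         # Arabic share: 29.4%
print(f"English share: {english_share:.1%}")       # English share: 70.6%
```

On these numbers, Arabic makes up a little under a third of the training data, and the two counts add up to the stated 395 billion tokens.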
“Jais was born in Abu Dhabi and offers more than 400 million Arabic speakers the opportunity to harness the potential of generative AI,” Preslav Nakov, professor and deputy department chair of Natural Language Processing at MBZUAI, told Arab News.
“It will facilitate and expedite innovation, highlighting Abu Dhabi’s leading position as a hub for AI, innovation, culture preservation and international collaboration.”
As an open-source model, Jais is expected to engage scientists, academics and developers to accelerate the growth of an Arabic-language AI ecosystem. It could also serve as a model for other languages now underrepresented in mainstream AI.
“Jais outperforms existing Arabic models by a sizable margin,” said Nakov. “It is also competitive with English models of similar size despite being trained on significantly less English data.
“This exciting result shows that the model’s English component learned from the Arabic data and vice versa, opening a new era in LLM development and training.”
In Jais’s development, significant attention was devoted to pre-processing Arabic text, enhancing support for the language’s unique features, including its writing style and word order.
Jais also maintains a balanced Arabic-English dataset for optimal performance, offering a marked improvement over models trained on only a limited amount of Arabic text.
Its developers say Jais, unlike other models, captures linguistic nuances and even comprehends various Arabic dialects and cultural references.
“Jais facilitates faster customization for specific Arabic-focused use cases and addresses data ownership concerns by being based in the UAE, offering a reassuring solution for local enterprises,” said Inception CEO Jackson.
The UAE’s Ministry of Foreign Affairs and Ministry of Industry and Advanced Technology, Abu Dhabi’s National Oil Company and Department of Health, Etihad Airways, First Abu Dhabi Bank, and global technology group e& are planning to use Jais and to offer insights that could enhance the model and its applications across their industries.
Given the strong digital transformation efforts by several of the Arab Gulf governments, accompanied by huge investments in high-tech industries and homegrown tech startups, AI programs that are responsive to the Arabic language could widen access to a transformational new technology and challenge the dominance of a clutch of Silicon Valley companies.
Last month, the Technology Innovation Institute, an Emirati research center in Abu Dhabi, released Falcon 180B, an open-source AI model. Established in 2020, TII released Falcon 40B, the first version of its flagship open-source AI model, in May this year, after unveiling Noor, an Arabic-language AI model, last year.
According to a report in The Economist magazine, TII is the applied-research arm of the Advanced Technology Research Council, a government agency that employs an 800-strong multinational staff working on subjects from biotechnology and robotics to quantum computing.
“We are entering the game to disrupt the core players,” Faisal Al-Bannai, secretary-general of the ATRC, told The Economist, adding that TII will build new proprietary models and applications catering for specific fields such as medicine and law.
For its part, Saudi Arabia launched its National Strategy for Data and Artificial Intelligence in October 2020, aiming to become a global leader in the field as it seeks to attract $20 billion in foreign and local investments by 2030.
The Kingdom is also determined to future-proof its workforce, initially by training and developing a pool of 20,000 AI and data specialists. In May this year, Deloitte’s AI Institute was officially launched at the Experience Analytics conference in Riyadh.
Just last week Saudi Arabia launched a National Olympiad for Programming and Artificial Intelligence open to all middle- and high-school pupils. An estimated 300,000 students will be selected from 3 million participants for training in programming and AI, according to media reports.
The initiative is a collaboration between the Saudi Data and Artificial Intelligence Authority, the Ministry of Education and the King Abdulaziz and His Companions Foundation for Giftedness and Creativity (Mawhiba).
Saudi Arabia’s adoption of digitalization and emerging technologies is forecast to contribute about 2.4 percent to its gross domestic product by 2030, according to a recent report by global consultancy firm PwC.
The contribution of AI to Saudi Arabia’s economy is expected to grow at an average annual rate of 31.3 percent between 2018 and 2030, the PwC report added.
“AI is developing rapidly, and its impact will be felt more and more across all sectors and areas of life,” said MBZUAI’s Nakov. “In this context, it is vital that the Arab world has access to an advanced LLM that can be adapted and utilized across all sectors.
“The rapid advancement of AI means that organizations that fail to adapt and start using AI sooner rather than later will be left behind, which makes it even more essential for the Arab world to have access to quality LLMs.”
Beyond its business applications, however, a crucial aspect of a program such as Jais is its ability to champion neglected languages, preserve them in a fast-changing economy, and promote digital inclusivity.
Although Arabic is an official language in 22 countries and is partly spoken in 11 others, it accounts for just 1 percent of total global online content, according to Jais’s creators. The hope is that the advent of AI and the automation of rapid translation will be a game changer.
By placing the language at the forefront of the AI revolution, Jais and its successors could help to maintain Arabic’s global prominence and its distinctive cultural significance in the digital age.