Enabling a sovereign digital future for indigenous languages

Papa Reo is a multilingual language platform grounded in indigenous knowledge and ways of thinking and powered by cutting-edge data science.

Kaupapa — Our Mission

Papa Reo will enable smaller indigenous language communities to develop their own speech recognition and natural language processing capabilities, ensuring that the sovereignty of the data remains with them and the benefits derived from these technologies go directly to their communities.

The Papa Reo project is the culmination of work undertaken by Te Reo Irirangi o Te Hiku o Te Ika (Te Hiku Media) over the last 30 years to instil, nurture and proliferate the Māori language unique to haukāinga of Te Hiku o Te Ika.

Minority languages such as te reo Māori, and their communities, are largely invisible and unheard in the digital world. Everyday tasks can now be completed by speaking to your devices, but because the large datasets required for machine learning do not exist for minority languages, their speakers cannot engage with this technology. This further marginalises their language and reduces their ability to fully participate in society.

Our vision is for a multilingual language platform that will develop cutting-edge natural language processing methods and tools. The programme will begin with te reo Māori, ensuring intergenerational transmission of, and accessibility to, the language alongside the rapid development of technologies. New Zealand English will also be covered to support multilingual language use and, drawing on international collaborators, the programme will advance the Hawaiian and Samoan languages. This is pioneering data science research that will ultimately support minority languages worldwide.

Existing machine learning techniques require large data sets to support the development of speech-to-text, text-to-speech and speech synthesis tools. Papa Reo aims to make these tools and methods available for languages with smaller data sets, so that these languages are present in the digital age.

Papa Reo aims to support smaller language communities to gather their own data and grow capabilities within their own communities to use the tools from the platform. Big tech companies such as Google have failed to provide quality natural language processing tools for indigenous languages because they have not engaged with these communities or grown their capability. Central to our approach is a staunch belief that each community must maintain control and sovereignty of their data.

Whakapapa — Our Story

'He marangai, tū ana te pahukahuka, he iti pioke, no Rangaunu, he au tōna...'

Established on 10 December 1990, Te Reo Irirangi Māori o Te Hiku o Te Ika (Te Hiku Media), by virtue of its whakapapa, is an organisation committed to the revitalisation of tikanga and te reo Māori. The 1989 report of the Waitangi Tribunal on WAI 11, the Te Reo Māori claim, and the outcome of the 1994 Privy Council case regarding the sale of state-owned broadcasting assets had far-reaching consequences for Māori language in broadcasting.

Born of struggle and as a response to intergenerational Crown policies of systemic and targeted racism, Te Hiku Media is an iwi radio station and media hub representing the collective rights and Māori language broadcasting interests of Ngāti Kuri, Te Aupōuri, Ngāi Takoto, Te Rarawa and Ngāti Kahu. Te Hiku Media is a symbol of courage and tenacity in the recognition of mana Māori motuhake.

The purpose of Te Hiku Media is to encourage and proliferate te reo rangatira me ngā tikanga Māori o ngā haukāinga through iwi media and iwi radio, innovation and development. The Te Hiku iwi media hub functions to revitalise te reo Māori and, as an entity, serves the needs of the haukāinga communities, marae, whānau, hapū and iwi.

On Thursday 30 May 2013, a hui was held with kaumātua, kuia and other native speakers of te reo Māori at Mahimaru Marae. It was at this hui that two significant resolutions were passed by the kaumātua, setting Te Hiku Media on the pathway of digital innovation. 2013 was also the year that Te Hiku TV (started in 2010) was forced to go 100% digital. The following year, the Whare Kōrero, tehiku.nz, was launched. The online platform became the repository for all Te Hiku Media content.

Later that year, Sir Hekenukumai Pūhipi, the master navigator and waka builder, called upon Te Hiku Media to deliver the organisation’s first live-streamed event: the return of the Hōkūle’a to Aotearoa. Te Hiku Media then went on to pioneer live streaming at Waitangi Day celebrations in 2015 and have since built a reputation as the premier provider of live streaming in Northland.

As Te Hiku Media continued to grow their capability in digital media, they sought out opportunities to innovate with the te reo Māori data they held within their archives. In 2017, following a project to digitise the thousands of hours of archived audio, Te Hiku Media developed Te Reo o Te Kāinga, supported by Te Mātāwai. This seemingly small project sought to transcribe and present video interviews in te reo Māori with kaumātua, tagging phrases unique to the region to share with the audience. Transcribing te reo Māori from native speakers proved laborious, and a solution was required if the large audio archive was ever to be transcribed.

Te Hiku Media received funding from the Ka Hao Fund in 2018. The work began as the Kōrero Māori project, whose aim was "to teach computers te reo Māori". It was at this time that Te Hiku Media and Dragonfly Data Science began their relationship. Dragonfly had experience working with Māori language data and with the iwi radio network. The relationship has grown into a strategic partnership with shared goals of growing data science capability in Aotearoa and in Māori and Pasifika communities.

The Kōrero Māori programme of work led to the development of kaituhi.nz, an automatic transcription tool that uses our speech-to-text API, and of the first-ever synthesised Māori voice. It also provides the foundational data set for the Papa Reo project.

Papa Reo has a long, rich whakapapa and recognises and celebrates the work of the many te reo activists and innovators who have contributed to the important work of revitalising the language. Te Hiku Media is a product of that grit and determination and of the vision of the kaumātua who continue to guide their activities.

Having demonstrated their capability, Te Hiku Media successfully applied for Data Science Platform funding in 2019 for Papa Reo, the only non-university-based project to do so. Papa Reo is funded for seven years, starting in 2020, by the Strategic Science Investment Fund held by the Ministry of Business, Innovation and Employment. This long-term funding will allow us to ensure maximum impact for our work and benefit back to the language communities.

tehiku.nz

The Whare Kōrero where our digital journey began.

https://tehiku.nz

Kaitiakitanga — Guardianship

'Kia u ki te whakapono, kia aroha tetahi ki tetahi.'

Te Hiku Media have developed a Kaitiakitanga licence, which states that data is not owned but is cared for under the principle of kaitiakitanga, and that any benefit derived from data flows to the source of the data. Kaitiakitanga is a principle that expresses guardianship rather than ownership of data. Te Hiku Media are merely caretakers of the data and seek to ensure that all decisions made about the use of that data respect its mana and that of the people from whom it descends.

Māori data will not be openly released, but requests for access to the data, or for the use of the tools developed under the platform, will be managed using tikanga Māori. Te Hiku Media have been invited to speak on their kaitiakitanga licence and it has been adopted by a government department and a social enterprise.

Research on other indigenous languages that is carried out under this platform will be for the primary benefit of those peoples. The language platform will also collect data on NZ English and, where appropriate, this will be released under an open-access licence. Machine learning software that is independent of the language communities will be made openly accessible where appropriate.

Kaitiakitanga License

Our Kaitiakitanga License is like open source but with affirmative action. It's a work in progress, and it's already making an impact in how Aotearoa works with Māori data.

View the License

Mātauranga – Research

'I ētahi rā, i te haere kē mātou, hoki rawa mai kua pau ngā hua te kai i te mahi a te tamariki.'

Central to the mission of Papa Reo is that language communities benefit directly from the research we conduct.

We are the first group to achieve transcription of te reo Māori using modern deep learning methods and have made initial progress on the synthesis of Māori speech from text. There is an active linguistics research community in New Zealand, working on aspects of both New Zealand English and te reo Māori. This work tends to use more traditional research methods, so the application of machine learning methods to this research will be novel.

The language platform will support the academic research that is carried out in New Zealand, rather than competing with it. We anticipate that the tools provided by this platform will be used not only by the linguistics research community but also by the wider research community.

Scoring pronunciation accuracy via close introspection of a speech recognition recurrent neural network

Poster on te reo Māori pronunciation presented at NeurIPS 2020.

View the Poster

Te Reo Māori Speech Recognition: A story of community, trust, and sovereignty

Presented at the 2020 Natives in Tech conference.

Mahi – Our Work

Kaituhi

Automatic te reo Māori transcriptions.

Visit kaituhi.nz

Kōrero Māori

Our platform used to collect and protect indigenous data.

Visit kōreromāori.com to use the app. If you're interested in using this platform, check out the repository.