2022
DOI: 10.1007/s10579-022-09580-w

Harnessing Indigenous Tweets: The Reo Māori Twitter corpus

Abstract: Te reo Māori, the Indigenous language of Aotearoa New Zealand, is a distinctive feature of the nation’s cultural heritage. This paper documents our efforts to build a corpus of 79,000 Māori-language tweets using computational methods. The Reo Māori Twitter (RMT) Corpus was created by targeting Māori-language users identified by the Indigenous Tweets website, pre-processing their data and filtering out non-Māori tweets, together with other sources of noise. Our motivation for creating such a resource is three-f…

Cited by 5 publications (4 citation statements)
References 20 publications
“…The language detection model was also extended using the labelled Hansard corpus to detect Māori-English code-switching points in the text. This was also tested on the labelled Hansard database, and the Reo Māori Tweets corpus [16]. The labelled Hansard corpus is currently being used to improve language detection and code-switch detection models.…”
Section: Discussion
mentioning
confidence: 99%
“…Since code-switching appears more in informal contexts such as social media, future research should focus on developing labelled corpora containing social media text. Some attempts in this regard have been reported [16,17], but more rigorous testing of these corpora in language processing applications is still to be done.…”
Section: Discussion
mentioning
confidence: 99%
“…And in sheer self-protection, foreign traders faced with Roman customs-men must have learned to keep records" (2004, p. 1157). 6 Te Reo Māori, for example, is thriving both in broadcast media (Middleton, 2020) and online (Trye et al., 2022). 7 Japan has its own complex history of orthographic resistance (e.g., Heinrich, 2012; Takayama, 1995; Unger, 1996). 8 New writing systems do not necessarily lead to extinction of old, but script shifts and disappearances have often happened via biscriptalism (Houston, 2004, pp.…”
mentioning
confidence: 99%