
Runyankore-Rukiga Corpus

Latest revision as of 15:22, 15 March 2020

The Runyankore-Rukiga [1] Corpus is an interlinear text corpus designed to support linguistic studies of the language. This page describes the corpus and provides downloads of sentence collections that were created to inform the study of specific construction types in the language.

Go to the descriptive section to learn more about the TypeCraft Runyankore-Rukiga corpus. To download the locative expressions sentence collection, go to the Download section.



Data collections

The material available for download from this site consists of small linguistic data collections of several hundred sentences each, created from the TypeCraft Runyankore-Rukiga corpus. [2] In this way users can prepare data sets by choosing examples from the corpus and applying customized additional labeling, adding attributes and extracting word lists that suit their research.


Locative expressions in Runyankore-Rukiga - A data collection

The Locative expressions data collection consists of around 600 sentences. It is a collection of examples selected from the Runyankore-Rukiga corpus, extracted through a query for the locative words "omu" and "aha" and their long forms "omuri" and "ahari". The examples reflect work carried out between 2010 and approximately 2015 by different graduate students in linguistics. The collection was created in 2018 by Allen Asiimwe and Dorothee Beermann.
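The extraction step described above can be sketched roughly as follows. This is only an illustration and not the query that was actually run in TypeCraft: the function names are hypothetical, the matching is on whole word forms (whether the original query also matched longer tokens is not stated here), and the toy strings are placeholders rather than real Runyankore-Rukiga sentences.

```python
import re

# Locative word forms named in the text above.
LOCATIVE_FORMS = {"omu", "aha", "omuri", "ahari"}


def contains_locative(sentence, forms=LOCATIVE_FORMS):
    """Return True if any word token of the sentence is one of the forms."""
    # \w+ splits on whitespace and punctuation; matching is case-insensitive.
    tokens = re.findall(r"\w+", sentence.lower())
    return any(token in forms for token in tokens)


def select_locative_sentences(sentences):
    """Keep only the sentences that contain at least one locative form."""
    return [s for s in sentences if contains_locative(s)]


# Placeholder strings (w1, w2, ...), NOT real corpus sentences.
toy_sentences = [
    "w1 omu w2",
    "w3 w4",
    "Aha w5 w6.",
    "w7 omuriw8",  # 'omuri' only inside a longer token: not matched
]

selected = select_locative_sentences(toy_sentences)
print(selected)
```

Matching on whole tokens keeps the sketch simple; a real query would have to follow the orthographic conventions of the corpus.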


Download

File: RR loc15-03-2020 300.zip

The zip file contains 300 sentences in TC-XML format. The TC locative subcorpus is an aggregation of 664 sentences extracted from the TC-Runyankore-Rukiga corpus described below.
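As a rough sketch of how a downloaded archive of this kind might be inspected, the following counts XML elements with a given name across all .xml members of a zip file. The element name "phrase" and the demo archive are assumptions made for illustration; the actual TC-XML element names should be checked against the downloaded file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def count_elements_in_zip(zip_source, element_name):
    """Count elements with the given local name in all .xml members."""
    total = 0
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".xml"):
                continue  # skip non-XML members
            with archive.open(member) as handle:
                root = ET.parse(handle).getroot()
            total += sum(
                1 for el in root.iter() if local_name(el.tag) == element_name
            )
    return total


# Tiny in-memory archive standing in for the real download.
# The element name 'phrase' is an assumption, not taken from this page.
demo_xml = b"<texts><text><phrase/><phrase/></text><text><phrase/></text></texts>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.xml", demo_xml)
    archive.writestr("readme.txt", "not xml")
buffer.seek(0)

demo_count = count_elements_in_zip(buffer, "phrase")
print(demo_count)
```

With the real archive, `zip_source` would be the path to the downloaded zip file, and the count could be compared with the 300 sentences stated above.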


Description of the TypeCraft Runyankore-Rukiga corpus

Creation

The TypeCraft Runyankore-Rukiga corpus, of which the data presented here is a part, consists of narratives and short stories as well as elicited data. Texts are either transcriptions of oral narratives or fragments of newspaper texts from the Runyankore-Rukiga weekly newspaper Orumuri. [3] We also digitised sections of the novel Abagyenda Bareeba ‘Adventures of travelers' by Mubangizi (1997) [4].

The data was created by native-speaker linguistics graduates as part of their class work or in the context of their master's theses between 2006 and 2013. The creation process was a collaborative effort coordinated by the principal investigators Dr. Allen Asiimwe (Makerere University, Uganda) and Prof. Dorothee Beermann (NTNU, Trondheim). The main student contributors were Justus Turamyomwe, Misah Natumanya and Allen Asiimwe. The collection has been extended continuously. For a closer look at the entire corpus, please go to the TypeCraft database. [5]

Size and Format

The TypeCraft Runyankore-Rukiga corpus consists of 143 426 words, corresponding to 28 057 sentences. Gries & Berez (2017) [6] mention that corpora that are documentary-linguistic in nature, as this corpus is, tend to be small compared with standard corpora. Data collection is slow and depends on the individual effort of linguists working together with local communities. Creating a balanced or representative corpus is often difficult (Gries & Berez 2017, p. 381).

Most corpus analyses are based on frequency lists. Such word lists typically show a frequency profile in which function words are the most frequent, followed by content words. Of the 20 most frequent word forms in the TypeCraft Runyankore-Rukiga corpus, most do in fact belong to functional word classes.
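A frequency list of the kind described above can be built in a few lines. The sketch below is a minimal illustration, not the procedure used to produce Table 1: it assumes whitespace tokenisation and lower-casing, and the toy input uses the locative forms mentioned earlier plus a placeholder token instead of real corpus sentences.

```python
from collections import Counter


def frequency_list(sentences, top_n=20):
    """Return the top_n most frequent word forms as (form, count) pairs."""
    counts = Counter(
        token.lower()
        for sentence in sentences
        for token in sentence.split()  # assumption: whitespace tokenisation
    )
    return counts.most_common(top_n)


# Toy input, NOT real corpus data; 'w1' is a placeholder token.
toy_corpus = ["omu w1 aha", "aha omu", "omu"]

top = frequency_list(toy_corpus, top_n=2)
print(top)
```

On a real corpus, the resulting list can then be checked by hand for the share of function words among the top-ranked forms.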


Table 1. Most frequent 20 words in the TypeCraft Runyankore-Rukiga corpus (image: RR most-frequent-words.png)

Annotations and Standards

We have used two layers of annotation for the labeling of the RR-corpus: part-of-speech (POS) tags and glosses. Traditionally, linguists have not consistently annotated examples for word class, but with the rise of the Digital Humanities and the closer cooperation between linguists and computer scientists that it brought, POS-tagged corpora from linguistic work have become more common. Short definitions of the POS symbols can be found here: TypeCraft POS tags (https://typecraft.org/tc2wiki/Special:TypeCraft/POSTags/)

Table 2. Part of Speech tags used for the annotation of Runyankore-Rukiga (image: RR pos 080120.png)


The TypeCraft editor supports in-depth word-by-word annotation, for which the TypeCraft platform provides a list of over 300 glosses. Projects working with TypeCraft can ask for customised glossing lists. For the annotation of Runyankore-Rukiga we worked with TypeCraft's standard glossing list, using 74 different tags. Thirteen different noun class tags were used, and the two most frequently used glosses are Initial-Vowel and Final-Vowel. The legend of the pie chart in Figure 1 lists the glosses in order of frequency, from left to right, starting from the top.

Short definitions of the gloss symbols can be found here: TypeCraft GLOSS tags (https://typecraft.org/tc2wiki/Special:TypeCraft/GlossTags/).
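To make the two annotation layers concrete, here is a small sketch of how a word with a POS tag and morpheme-level glosses might be represented, and how gloss and POS frequencies like those summarised in Figure 1 and Table 2 could be counted. The class names, word forms and tag labels ("IV", "FV", "CL1", "N", "V") are placeholders chosen for illustration; they are not the TC-XML data model, and the actual tag inventory is given in the TypeCraft tag lists.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    glosses: List[str] = field(default_factory=list)  # gloss layer


@dataclass
class Word:
    form: str
    pos: str  # part-of-speech layer
    morphemes: List[Morpheme] = field(default_factory=list)


def gloss_frequencies(words):
    """Count how often each gloss tag occurs across all morphemes."""
    return Counter(g for w in words for m in w.morphemes for g in m.glosses)


def pos_frequencies(words):
    """Count how often each POS tag occurs across all words."""
    return Counter(w.pos for w in words)


# Placeholder forms and tags, NOT real annotations from the corpus.
toy_words = [
    Word("w1", "N", [Morpheme("m1", ["IV"]), Morpheme("m2", ["CL1"]), Morpheme("m3")]),
    Word("w2", "V", [Morpheme("m4", ["CL1"]), Morpheme("m5"), Morpheme("m6", ["FV"])]),
    Word("w3", "N", [Morpheme("m7", ["IV"]), Morpheme("m8")]),
]

print(gloss_frequencies(toy_words).most_common())
print(pos_frequencies(toy_words).most_common())
```

Keeping the POS tag on the word and the glosses on the morphemes mirrors the distinction between the two layers drawn in this section.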


Figure 1. Glosses used for the annotation of Runyankore-Rukiga (image: RR glosses-09-2018.png)

  1. Runyakitara is a standard language based on four closely related languages of western Uganda: (Ru)nyoro, (Ru)tooro, (Ru)nyankore and (Ru)kiga. These languages are spoken in south-western Uganda by approximately 6 million people according to the Uganda National Population and Housing Census report (2014). (Ru)nyankore (ISO 639-3 nyn) and (Ru)kiga (ISO 639-3 cgg) are spoken in the Ankole and Kigezi regions respectively. Here we refer to (Ru)nyankore and (Ru)kiga as Runyankore-Rukiga.
  2. TypeCraft allows users to create their own sentence collections from an existing corpus, which makes it possible for them to apply new annotations to already annotated data.
  3. Today Orumuri can still be found on Facebook (https://www.facebook.com/orumuri/), but most of the articles presented are now in English.
  4. Mubangizi, B.K. (1997). Abagyenda Bareeba. Memorial Single Volume. Kisubi: Marianum Press.
  5. You can search the TypeCraft database from the navigation bar on the left side of your browser window. From the TypeCraft Tools menu, select Search Texts, then specify the language and press ENTER.
  6. Gries, Stefan Th., and Berez, Andrea L. (2017). Linguistic Annotation in/for Corpus Linguistics. In: Ide, Nancy, and Pustejovsky, James (eds), Handbook of Linguistic Annotation. Springer.