Difference between revisions of "Runyankore-Rukiga Corpus"

Latest revision as of 11:31, 23 June 2021

The TypeCraft Runyankore-Rukiga ^[1] Corpus consists of interlinear glossed examples; its size and structure are described below.

A collection of locative expressions in this language can be downloaded from Dataverse NO (https://doi.org/10.18710/YPHCNA), where you will also find instructions on how to use the data in your own project.

The data package is a supplement to the publication: Beermann, Dorothee and Allen Asiimwe. (to appear 2021) Locatives in Runyankore-Rukiga. In: Marten, Lutz, Hannah Gibson and Rozenn Guérois (eds). Current Approaches to morphosyntactic variation in Bantu. Oxford University Press.

To view the interlinear glossed examples in that package online, please go to text 4510 in the TypeCraft database: https://typecraft.org/tc2/ntceditor.html#4510

The entire collection of locative expressions in Runyankore-Rukiga consists of around 600 sentences selected from the TypeCraft Runyankore-Rukiga corpus. The material was collected and annotated between 2010 and 2015 by graduate students in linguistics at the Norwegian University of Science and Technology (NTNU).

Annotations and Leipzig Glossing Rules

The Leipzig Glossing Rules are a widely accepted standard for interlinear morpheme glossing. A few of these glosses have been adapted to the specifics of Bantu morphology. In addition, some 'older glosses' have been kept since they had a higher acceptance among local Bantuist scholars.

Notable differences are:

TC-gloss	LGR-gloss	Category
IV (initial vowel)	AUG (augment)	The initial vowel, augment or pre-prefix, is a morpheme that is prefixed to the noun class prefix of nouns in Bantu.
3SG.SM (third-person subject marker)	SM (subject marker)	The SM agrees with the noun class features of the preverbal subject. In cases of pro-drop, the person/number features are added to the SM.
PAST (PASTim = immediate past, PASThst = historical past)	PST	The immediate or recent past refers to a time range from yesterday to days ago. What counts as recent past is determined culturally.
Note		TC annotations distinguish between different clitic types, such as locative (CLITloc), temporal (CLITtemp), or pronominal clitics (CLITpron). Our digital data does not make use of the "=" sign to indicate clitics.
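
The correspondences listed above can be sketched in code. The following Python snippet is only an illustration and not part of the TypeCraft tooling: the hyphen-separated gloss line, the noun class tag CL1 in the usage example, the extension of the subject-marker pattern to other persons and numbers, and all function names are assumptions made here. Note that mapping PASTim and PASThst to PST discards the distinction between immediate and historical past.

```python
import re

# TC glosses from the table above and their LGR counterparts.
# Collapsing PASTim/PASThst to PST is lossy (assumption: acceptable for export).
TC_TO_LGR = {
    "IV": "AUG",
    "PAST": "PST",
    "PASTim": "PST",
    "PASThst": "PST",
}

# The table only names 3SG.SM; treating 1/2/3 + SG/PL alike is an assumption.
PERSON_SM = re.compile(r"^[123](SG|PL)\.SM$")


def to_lgr(gloss):
    """Return the LGR-style label for one TC gloss; unknown glosses pass through."""
    if PERSON_SM.match(gloss):
        return "SM"
    return TC_TO_LGR.get(gloss, gloss)


def convert_gloss_line(line, sep="-"):
    """Convert a gloss line whose glosses are joined by `sep` (assumed format)."""
    return sep.join(to_lgr(g) for g in line.split(sep))


if __name__ == "__main__":
    print(convert_gloss_line("IV-CL1"))         # AUG-CL1
    print(convert_gloss_line("3SG.SM-PASTim"))  # SM-PST
```

Glosses that have no entry in the mapping, such as the clitic tags CLITloc, CLITtemp and CLITpron, are returned unchanged.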

Description of the TypeCraft Runyankore-Rukiga corpus

Creation

The TypeCraft Runyankore-Rukiga corpus, of which the data presented here is a part, consists of narratives and short stories, as well as elicited data. Texts are either transcriptions of oral narratives or fragments of newspaper texts from the Runyankore-Rukiga weekly newspaper Orumuri ^[2]. We also digitised sections taken from the novel Abagyenda Bareeba ‘Adventures of travelers’ by Mubangizi (1997) ^[3]. The data was created between 2006 and 2013 by native-speaker linguistics graduates, either as part of their class work or in the context of their master’s theses. The creation process was a collaborative effort coordinated by the principal investigators Dr. Allen Asiimwe (Makerere University, Uganda) and Prof. Dorothee Beermann (NTNU, Trondheim). The main student contributors were Justus Turamyomwe, Misah Natumanya and Allen Asiimwe. The collection has been extended continuously. For a closer look at the entire corpus, please go to the TypeCraft database ^[4].

Size and Format

The TypeCraft Runyankore-Rukiga corpus consists of 143 426 words, corresponding to 28 057 sentences. Gries & Berez (2017) ^[5] mention that corpora that are documentary-linguistic in nature, as this corpus is, tend to be small compared with standard corpora. Data collection is slow and depends on the individual effort of linguists working together with local communities. Creating a balanced or representative corpus is often difficult (Gries & Berez 2017, p. 381).

Most corpus analyses are based on frequency lists. Such word lists typically show a frequency profile in which function words are most frequent, followed by content words. The RR TC-corpus follows this pattern: most of its 20 most frequent word forms belong to the functional word classes.
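
A frequency list of the kind discussed above can be produced in a few lines. The sketch below assumes whitespace-tokenised sentences given as plain strings, which may not match how the corpus data is actually stored; the function name and the placeholder tokens are invented for illustration and are not corpus data.

```python
from collections import Counter


def frequency_list(sentences, top_n=20):
    """Return the `top_n` most frequent word forms as (form, count) pairs.

    Assumes each sentence is a string whose words are separated by whitespace;
    forms are lower-cased before counting.
    """
    counts = Counter(
        token.lower() for sentence in sentences for token in sentence.split()
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Placeholder tokens, not actual Runyankore-Rukiga words.
    sample = ["a b a c", "b a d"]
    print(frequency_list(sample, top_n=2))  # [('a', 3), ('b', 2)]
```

Applied to the full corpus with the default top_n=20, a function like this would yield the kind of list summarised in Table 1.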

Table 1. Most frequent 20 words in the TypeCraft Runyankore-Rukiga corpus

Annotations and Standards

We have used two layers of annotation for the labelling of the RR-corpus: part-of-speech (POS) tags and glosses. Traditionally, linguists do not consistently annotate examples for word class, but with the rise of the Digital Humanities and the closer cooperation between linguists and computer scientists that it brought about, POS-tagged corpora from linguistic work have become more common. Short definitions of the POS symbols can be found here: TypeCraft POS tags (https://typecraft.org/tc2wiki/Special:TypeCraft/POSTags/)

Table 2. Part of Speech tags used for the annotation of Runyankore-Rukiga

The TypeCraft editor supports in-depth word-by-word annotation, for which the TypeCraft platform provides a list of over 300 glosses. Projects working with TypeCraft can ask for customised glossing lists. For the annotation of Runyankore-Rukiga we worked with TypeCraft's standard glossing list, using 74 different tags. 13 different noun class tags were used, and the two most frequently used glosses are Initial Vowel and Final Vowel. The legend of the pie chart in Figure 1 lists the glosses in order of frequency, from left to right, starting from the top.

Short definitions of the gloss symbols can be found here: TypeCraft GLOSS tags (https://typecraft.org/tc2wiki/Special:TypeCraft/GlossTags/).

Figure 1. Glosses used for the annotation of Runyankore-Rukiga
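
The POS and gloss annotation described in this section can be pictured as a small data structure in which each word carries one POS tag and each of its morphemes carries zero or more glosses. The sketch below is an assumption about how such data might be modelled in Python, not the TypeCraft data format; the class names, the placeholder forms, and the tags N, V, CL1, CL9 and FV are hypothetical, while IV and 3SG.SM are taken from the table of TC glosses.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    glosses: List[str] = field(default_factory=list)  # gloss layer


@dataclass
class Word:
    form: str
    pos: str  # POS layer
    morphemes: List[Morpheme] = field(default_factory=list)


def pos_frequencies(words):
    """Count how often each POS tag occurs."""
    return Counter(word.pos for word in words)


def gloss_frequencies(words):
    """Count how often each gloss occurs across all morphemes."""
    return Counter(
        gloss for word in words for m in word.morphemes for gloss in m.glosses
    )


if __name__ == "__main__":
    # Placeholder forms (w1, m1, ...) stand in for real words and morphemes.
    sample = [
        Word("w1", "N", [Morpheme("m1", ["IV"]), Morpheme("m2", ["CL1"]), Morpheme("m3")]),
        Word("w2", "V", [Morpheme("m4", ["3SG.SM"]), Morpheme("m5"), Morpheme("m6", ["FV"])]),
        Word("w3", "N", [Morpheme("m1", ["IV"]), Morpheme("m7", ["CL9"]), Morpheme("m8")]),
    ]
    print(pos_frequencies(sample).most_common(1))    # [('N', 2)]
    print(gloss_frequencies(sample).most_common(1))  # [('IV', 2)]
    print(len(gloss_frequencies(sample)))            # 5 distinct glosses
```

Counting the distinct keys of the gloss counter is one way to arrive at figures such as the 74 tags reported above.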

[1] Runyakitara is a standard language based on four closely related languages of western Uganda. These four languages are (Ru)nyoro, (Ru)tooro, (Ru)nyankore, and (Ru)kiga. These languages are spoken in south-western Uganda by approximately 6 million people, according to the Uganda National Population and Housing Census report (2014). (Ru)nyankore (ISO 639-3 nyn) and (Ru)kiga (ISO 639-3 cgg) are spoken in the Ankole and the Kigezi region, respectively. Here we refer to (Ru)nyankore and (Ru)kiga as Runyankore-Rukiga.
[2] Today, Orumuri can still be found on Facebook (https://www.facebook.com/orumuri/), but most of the articles presented there are now in English.
[3] Mubangizi, B.K. (1997) Abagyenda Bareeba. Memorial Single Volume. Kisubi: Marianum Press.
[4] You can search the TypeCraft database from the navigation bar on the left side of your browser window. From the TypeCraft Tools menu, select Search Texts, then specify the language and press ENTER.
[5] Gries, Stefan Th. and Andrea L. Berez (2017) Linguistic Annotation in/for Corpus Linguistics. In: Ide, Nancy and James Pustejovsky (eds) Handbook of Linguistic Annotation. Springer.


@@ Line 1: / Line 1: @@
-===Purpose of the corpus===
-The Corpus of Locative Expressions in Runyankore Rukiga  was created to allow  the study of Locative Expression as they present themselves in this Bantu Languages spoken in south-western Uganda by ... poeple. .....
-===Description of the corpus===
+The TypeCraft Runyankore-Rukiga <ref>Runyakitara is a standard language based on four closely related languages of western Uganda. These four languages are (Ru)nyoro, (Ru)tooro, (Ru)nyankore, and (Ru)kiga. These languages are spoken in south-western Uganda by approximately 6 million people, according to the Uganda National Population and Housing Census report (2014). (Ru)nyankore (ISO 639-3 nyn) and (Ru)kiga (ISO 639-3 cgg) are spoken in the Ankole and the Kigezi region, respectively.
-This corpus is a sub-corpus of the TypeCraft Runyankore Rukiga corpus -- packaged -- time stamped ---  It consists of naturally occurring data, narratives and short stories, as well as elicitated data which we found relevant for our study of locative expressions in this language.
+Here we refer to (Ru)nyankore and (Ru)kiga as Runyankore-Rukiga. </ref> Corpus consists of interlinear glossed examples; its size and structure are described below.
-It contains .....  sentences.
-===Creation of the TypeCraft Ruynankore Rukiga corpus===
+'''A  collection of locative expressions in this language can be downloaded from  [https://doi.org/10.18710/YPHCNA  Dataverse NO] where you also find instructions about how to use the data in your own project.'''
-The TypeCraft  Ruynankore Rukiga corpusis based on data which consists of naturally occurring data, narratives and short stories, as well as elicitated data. Texts are either transcriptions of oral narratives or fragments of newspaper texts from the Runyankore-Rukiga weekly newspaper Orumuri. We also digitalised sections taken from the novel Abagyenda Bareeba ‘Adventures of travelers' by Mubangizi (1997). The data was created by native speaker graduates as class work or as part of their  master’s thesis. The collection has been extended by the authors continuously. At the time of writing our open end  Runyankore Rukiga corpus contains 126 texts  consisting of  XX 000 sentences.
+'''The data package is a supplement to the publication: Beermann, Dorothee and Allen Asiimwe. (to appear 2021) Locatives in Runyankore-Rukiga. In: Marten, Lutz, Hannah Gibson and Rozenn Guérois (eds). ''Current Approaches to morphosyntactic variation in Bantu. '' Oxford University Press.'''
+'''To view the interlinear glossed examples in that package  online, please go to text 4510 in the TypeCraft database: https://typecraft.org/tc2/ntceditor.html#4510'''
+====Locative expressions in Runyankore-Rukiga - A data collection====
+The entire collection of locative expressions in Runyankore-Rukiga consists of around 600 sentences selected from the TypeCraft Runyankore-Rukiga corpus. The material was collected and annotated between 2010 and 2015 by graduate students in linguistics at the Norwegian University of Science and Technology (NTNU).
+==== Annotations and Leipzig Glossing Rules ====
+The Leipzig Glossing Rules are a widely accepted standard for interlinear morpheme glossing. A few of these glosses have been adapted to the specifics of Bantu morphology. In addition, some 'older glosses' have been kept since they had a higher acceptance among local Bantuist scholars.
+Notable differences are:
+{|
+! style="text-align:left;"| TC-gloss
+! LGR-gloss
+! Category
+|-
+|IV (initial vowel)
+|AUG (augment)
+| The initial vowel, augment or pre-prefix, is a morpheme that is prefixed to the noun class prefix of nouns in Bantu.
+|-
+|3SG.SM (third-person subject marker)
+|SM (subject marker)
+| The SM agrees with the noun class features of the preverbal subject. In cases of pro-drop, the person/number features are added to the SM.
+|-
+|PAST (PASTim = immediate past, PASThst = historical past)
+|PST
+|The immediate or recent past refers to a time range from yesterday to days ago. What counts as recent past is determined culturally.
+|-
+!Note
+|
+|TC annotations distinguish between different clitic types, such as locative (CLITloc), temporal (CLITtemp), or pronominal clitics (CLITpron). Our digital data does not make use of the "=" sign to indicate clitics.
+|}
+===Description of the TypeCraft Runyankore-Rukiga corpus===
+====Creation ====
+The TypeCraft  Runyankore-Rukiga corpus, of which the data presented here is a part, consists of narratives and short stories, as well as elicited data. Texts are either transcriptions of oral narratives or fragments of newspaper texts from the Runyankore-Rukiga weekly newspaper ''Orumuri''. <ref>Today, Orumuri can still be found on Facebook [https://www.facebook.com/orumuri/ Orumuri], but most of the articles presented are now in English. </ref> We also digitised sections taken from the novel Abagyenda Bareeba ‘Adventures of travelers' by Mubangizi (1997) <ref>Mubangizi, B.K.(1997) Abagyenda Bareeba. Memorial Single Volume. Kisubi: Marianum Press.</ref>.
+The data was created between 2006 and 2013 by native-speaker linguistics graduates, either as part of their class work or in the context of their master’s theses. The creation process was a collaborative effort coordinated by the principal investigators Dr. Allen Asiimwe (Makerere University, Uganda) and Prof. Dorothee Beermann (NTNU, Trondheim). The main student contributors were Justus Turamyomwe, Misah Natumanya and Allen Asiimwe. The collection has been extended continuously. For a closer look at the entire corpus, please go to the TypeCraft database <ref>You can search the TypeCraft database from the navigation bar on the left side of your browser window. From the TypeCraft Tools menu, select Search Texts, then specify the language and press ENTER.</ref>.
+==== Size and Format====
+The TypeCraft Runyankore-Rukiga corpus consists of 143 426 words, corresponding to 28 057 sentences. Gries & Berez (2017) <ref>Gries, Stefan Th. and Andrea L. Berez (2017) Linguistic Annotation in/for Corpus Linguistics. In: Ide, Nancy and James Pustejovsky (eds) ''Handbook of Linguistic Annotation''. Springer.</ref> mention that corpora that are documentary-linguistic in nature, as this corpus is, tend to be small compared with standard corpora. Data collection is slow and depends on the individual effort of linguists working together with local communities. Creating a balanced or representative corpus is often difficult (Gries & Berez 2017, p. 381).
+Most corpus analyses are based on creating frequency lists. Typical for such word lists is a frequency profile where function words are most frequent followed by content words. Looking at the 20 most frequent word forms in the RR TC-corpus, most of these words in fact belong to the functional word classes.
+'''Table 1.'''  '''Most frequent 20 words in the TypeCraft Runyankore-Rukiga corpus'''
+[[File:RR most-frequent-words.png ]]
+====Annotations and Standards====
+We have used two layers of annotation for the labelling of the RR-corpus: part-of-speech (POS) tags and glosses. Traditionally, linguists do not consistently annotate examples for word class, but with the rise of the Digital Humanities and the closer cooperation between linguists and computer scientists that it brought about, POS-tagged corpora from linguistic work have become more common.
+Short definitions of the POS symbols can be found here: [https://typecraft.org/tc2wiki/Special:TypeCraft/POSTags/ TypeCraft POS tags]
+'''Table 2. Part of Speech tags used for the annotation of Runyankore-Rukiga'''
+[[File:RR pos 080120.png]]
+The TypeCraft editor supports in-depth word-by-word annotation, for which the TypeCraft platform provides a list of over 300 glosses. Projects working with TypeCraft can ask for customised glossing lists. For the annotation of Runyankore-Rukiga we worked with TypeCraft's standard glossing list, using 74 different tags. 13 different noun class tags were used, and the two most frequently used glosses are ''Initial Vowel'' and ''Final Vowel''. The legend of the pie chart in Figure 1 lists the glosses in order of frequency, from left to right, starting from the top.
+Short definitions of the Gloss symbols can be found here: [https://typecraft.org/tc2wiki/Special:TypeCraft/GlossTags/ TypeCraft GLOSS tags].
+'''Figure 1.  Glosses used for the annotation of Runyankore-Rukiga'''
+[[File:RR glosses-09-2018.png]]