Typecraft v2.5

Difference between revisions of "Runyankore-Rukiga Corpus"

Revision as of 12:40, 8 January 2020


Purpose of the data collection

The download is a linguistic data collection [1] for the study of locative expressions in Runyankore-Rukiga. It was created from the TypeCraft Runyankore-Rukiga corpus and is meant to inform the study of locative formation in Runyankore-Rukiga.

More about Runyankore and Rukiga:

  Runyakitara is a standardized language based on four closely related languages of western Uganda: (Ru)nyoro, (Ru)tooro, (Ru)nyankore, and (Ru)kiga.
  These languages are spoken in south-western Uganda by approximately 6 million people, according to the Uganda National Population and Housing Census report
  (2014). (Ru)nyankore (ISO 639-3 nyn) and (Ru)kiga (ISO 639-3 cgg) are spoken in the Ankole and Kigezi regions, respectively.

Here we refer to (Ru)nyankore and (Ru)kiga together as Runyankore-Rukiga.

Download

(The Download is under preparation --Dorothee (talk) 11:14, 8 January 2020 (UTC))

The material consists of 298 sentences [ -- packaged -- time stamped ---], which were taken from naturally occurring data (narration, conversations) as well as from previously annotated linguistic sentence collections. The data is provided in XML format.

Description of the TypeCraft Runyankore-Rukiga corpus

Creation

The TypeCraft Runyankore-Rukiga corpus, of which the data presented here is a part, consists of narratives and short stories as well as elicited data. Texts are either transcriptions of oral narratives or fragments of newspaper texts from the Runyankore-Rukiga weekly newspaper Orumuri. We also digitized sections of the novel Abagyenda Bareeba ‘Adventures of travelers’ by Mubangizi (1997). The data was created by native-speaker graduate students as part of their class work or in the context of their master’s theses between 2006 and 20xx. The creation process was a collaborative effort coordinated by the principal investigators, Dr. Allen Asiimwe (Makerere University, Uganda) and Prof. Dorothee Beermann (NTNU, Trondheim). The main student contributors were Justus Turamyomwe, Misah Natumanya, and Allen Asiimwe. The collection has been extended continuously.

For a closer look at the entire corpus, go to TypeCraft.org and select Search Texts from the TypeCraft Tools menu. The search interface lets you choose a language; to see all texts in Runyankore-Rukiga, type the language name into the field provided.

Size and Format

The TypeCraft Runyankore-Rukiga corpus consists of 143 426 words, corresponding to 28 057 sentences. A table of the most frequent word forms in the corpus shows that the corpus is biased. Although it was created by several TypeCraft users working on graduate projects that address different topics, the corpus seems to contain more sentences with locative words than one would expect to find in naturally occurring text. Most of the 20 most frequent word forms belong to the functional word classes, which is expected.
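
As a rough illustration of how such a frequency table can be derived, the following Python sketch counts word forms in a list of sentences and computes the average sentence length from the totals given above (about 5.1 words per sentence). It is not the procedure used by TypeCraft: the function names, the whitespace tokenization, and the sample sentences (placeholder tokens, not real corpus data) are assumptions made for this example.

```python
from collections import Counter


def most_frequent_word_forms(sentences, n=20):
    """Return the n most frequent lowercased word forms with their counts.

    Assumption: tokens are separated by whitespace, and punctuation
    attached to a token is not part of the word form.
    """
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            form = token.strip(".,;:!?\"'()").lower()
            if form:
                counts[form] += 1
    return counts.most_common(n)


def words_per_sentence(word_count, sentence_count):
    """Average number of words per sentence."""
    return word_count / sentence_count


# Placeholder sentences, NOT taken from the corpus.
sample = ["aa bb aa.", "bb aa cc"]
top_two = most_frequent_word_forms(sample, n=2)

# Corpus totals as stated in the text above.
average_length = words_per_sentence(143426, 28057)
```

A frequency list produced this way makes an over-representation of locative forms directly visible, because such forms will rank higher than they would in a balanced sample.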


Table 1. The 20 most frequent words in the TypeCraft Runyankore-Rukiga corpus

[Image: RR most-frequent-words.png]

Annotations and Standards

  1. TypeCraft allows users to create their own sentence collections from an existing corpus and to apply new annotations to them within the bounds set by the system.