
TypeCraft Akan Data Collection Release 1.0

Revision as of 15:41, 25 March 2018 by Typecraft

Release 1.0

1. License and Legal Issues

This corpus is distributed solely for non-commercial, non-profit educational and research use. Release 1.0 is a derivative compilation of multiple annotated texts created by graduate students of linguistics as part of their class work. The original texts were created between 2007 and 2013 at the Department of Linguistics, NTNU, Trondheim, Norway.


2. Download Release 1.0

(DOI: 10.13140/RG.2.2.14614.86088)

Media:Release 1.0.zip.

The zip file contains the Akan data as XML and the release notes. To extract the contents, save the file to your computer and use "Extract To". Then enter a destination path and press "OK".
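The same extraction can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_release` and the file and folder names in the usage comment are assumptions made for illustration, not part of the release.

```python
import zipfile
from pathlib import Path


def extract_release(zip_path, dest_dir):
    """Extract all members of the archive and return the paths of its XML files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Keep only the XML members; the release notes are extracted but not listed.
    return sorted(dest / name for name in names if name.lower().endswith(".xml"))


# Usage (assumed local names for the downloaded archive and the target folder):
# for xml_file in extract_release("Release 1.0.zip", "typecraft_akan"):
#     print(xml_file)
```

The function returns the XML paths rather than printing them, so the result can be passed on to whatever tool reads the data next.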

The TypeCraft Importer, which you will find on the navigation bar to the left of this browser window, allows you to import the corpus into the TypeCraft Editor for further annotation or for export to other formats.
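Readers who want a first look at the XML outside the editor can count its element names before committing to a particular parser. The XML schema is not described on this page, so the sketch below makes no assumption about tag names; the function name `count_elements` is hypothetical, and the path passed to it is whatever XML file was extracted from the archive.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_path):
    """Count how often each element name occurs, ignoring any XML namespace."""
    counts = Counter()
    for _, elem in ET.iterparse(xml_path, events=("start",)):
        # ElementTree writes namespaced tags as "{uri}name"; keep only "name".
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts
```

Calling `count_elements(path).most_common(10)` then gives a rough outline of how the file is organised.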


3. Description of the TypeCraft Akan Corpus

Release 1.0 of the TC Akan Corpus consists of 41 short texts, mostly linguistic sentence collections, amounting to 669 sentences. Two of the released texts are transcribed recordings of students narrating a video. The students who did the original work were native speakers of Akan. The material was curated over a period of one and a half years, starting in 2016. For the curation, expert linguists worked together with student annotators, both native and non-native speakers, to make the original data more consistent. We say more about the type and depth of annotation below. On a four-point scale running from green (high quality) through yellow and orange to red (should not be used for research), we characterise Release 1.0 as a yellow corpus, by which we mean that it can be used for research with some care.

4. Data structure

All texts carry morpheme-based annotations as well as POS tags. The original texts have been translated into English. The whole corpus has been privacy masked, and metadata is provided. When loaded into the TypeCraft Editor, the data looks as shown in the example below:


Aba yii kuruwa no firi ɛpono no so
“Aba removed the cup from the table”
Word     Morpheme  Gloss           POS
Aba      aba       Aba.SBJ         N
yii      yii       remove.PAST     V
kuruwa   kuruwa    cup.OBJ         N
nó       nó        DEF             DET
firii    firii     away_from.PAST  V
ɛpono    ɛpono     table.OBJ       N
nò       nò        DEF             DET
so       so        on.LOC          Nrel
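To make the tier structure of the example concrete, here is a small sketch that stores each token with its word, morpheme, gloss and POS value and then tallies the POS tags, which is the kind of count that underlies a tag-frequency chart. The `Token` class, its field names and `pos_frequencies` are illustrative assumptions and do not reflect the TypeCraft XML format; the period between a translational gloss and its tag (as in `remove.PAST`) is added here for readability.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str      # surface form
    morpheme: str  # morpheme tier
    gloss: str     # translational gloss plus gloss tag
    pos: str       # part-of-speech tag


# The example sentence "Aba removed the cup from the table", token by token.
SENTENCE = [
    Token("Aba", "aba", "Aba.SBJ", "N"),
    Token("yii", "yii", "remove.PAST", "V"),
    Token("kuruwa", "kuruwa", "cup.OBJ", "N"),
    Token("nó", "nó", "DEF", "DET"),
    Token("firii", "firii", "away_from.PAST", "V"),
    Token("ɛpono", "ɛpono", "table.OBJ", "N"),
    Token("nò", "nò", "DEF", "DET"),
    Token("so", "so", "on.LOC", "Nrel"),
]


def pos_frequencies(tokens):
    """Return a Counter mapping each POS tag to its number of occurrences."""
    return Counter(token.pos for token in tokens)


print(pos_frequencies(SENTENCE).most_common())
# → [('N', 3), ('V', 2), ('DET', 2), ('Nrel', 1)]
```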


For the annotation of the TC Akan corpus we used an annotation set consisting of 60 POS tags and 123 gloss tags. Chart 1 shows the most frequently assigned POS tags, and Chart 2 shows the most frequently assigned gloss tags. A link to the complete TypeCraft POS and Gloss annotation sets can be found on the navigation bar (under TypeCraft help, at the left of your browser window).

Chart 1 The most frequent POS tags in the TypeCraft Akan Corpus
Chart 2 The most frequent Gloss tags in the TypeCraft Akan Corpus












5. Authors, Citation and Contact Information

The TC Akan corpus was created by Dorothee Beermann. Joana Awua Ahadofo assisted with the manual annotations. Special thanks for their support go to Associate Professor James Essegbey, University of Florida, Gainesville, and the Ghanaian Student Association at the Norwegian University of Science and Technology. The corpus should be cited as follows:

Dorothee Beermann (2018). TypeCraft Project – The TypeCraft Akan corpus, Release 1.0. TypeCraft – The Interlinear Text Repository.

Please address all questions, comments and suggestions to Dorothee Beermann (dorothee.beermann@ntnu.no) --Typecraft (talk) 18:07, 24 March 2018 (CET)