TypeCraft Akan Data Collection Release 1.0
Revision as of 12:34, 30 September 2018
1. License and Legal Issues
This corpus is distributed solely for non-commercial, non-profit educational and research use. Release 1.0 is a derivative compilation of multiple annotated texts created by graduate students of linguistics as part of their class work. The original texts were created between 2007 and 2013 at the Department of Linguistics, NTNU, Trondheim, Norway.
2. Download Release 1.0
The zip file contains the Akan data as XML and the release notes. To extract the files, save the zip file to your computer and use "Extract To". Then enter a destination path and press "OK".
The TypeCraft Importer, which you find on the navigation bar to the left of this browser window, allows you to import the corpus into the TypeCraft Editor for further annotation or for export to other formats.
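The extraction step described above can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module. The function name `extract_release` and both of its parameters are illustrative assumptions, not part of TypeCraft, and the archive's actual file name is not given here, so it has to be supplied by the caller.

```python
import zipfile
from pathlib import Path


def extract_release(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir.

    Returns the list of extracted paths. Both arguments are supplied
    by the caller; no file names from the release are assumed here.
    """
    dest = Path(dest_dir)
    # Create the destination folder if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]
```

Calling `extract_release(path_to_downloaded_zip, "akan_release")` would place the XML data and the release notes in a folder named `akan_release`, where that folder name is just a placeholder.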
3. Description of the TypeCraft Akan Corpus
Release 1.0 of the TC Akan Corpus consists of 41 short texts, mostly linguistic sentence collections, comprising 669 sentences in total. Two of the released texts are transcribed recordings of students narrating a video. The students who did the original work were native speakers of Akan. Curation of the material began in 2016 and took about 1½ years. During curation, expert linguists worked together with student annotators, both native and non-native speakers, to improve the consistency of the original data. We say more about the type and depth of annotation below. On a four-point quality scale of green (high quality), yellow, orange, and red (should not be used for research), we characterise Release 1.0 as a yellow corpus, by which we mean that it can be used for research with some care.
4. Data structure
All texts carry morpheme-based annotations as well as POS tags. The original texts have been translated into English. The whole corpus has been privacy masked, and metadata is provided. When loaded into the TypeCraft Editor, the data looks as shown in the example below:
For the annotation of the TC Akan corpora we used an annotation set consisting of 60 POS tags and 123 gloss tags. Chart 1 shows the most frequently assigned POS tags, and Chart 2 shows the corresponding distribution for gloss tags. A link to the complete TypeCraft POS and gloss annotation sets can be found on the navigation bar (under TypeCraft help, at the left of your browser window).
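Tag frequencies of the kind summarised in the charts can be recomputed from the XML export. The sketch below is a hedged illustration only: the schema of the TypeCraft XML is not described in these notes, so the element name passed in (here `pos`) and the sample document are invented assumptions that should be replaced with the names actually found in the downloaded file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def tag_frequencies(xml_source, element_name):
    """Count the text values of all elements with the given local name.

    xml_source may be a file path or a binary file object.
    element_name is an assumption about the schema, not a documented
    TypeCraft element name.
    """
    counts = Counter()
    for _, elem in ET.iterparse(xml_source):
        if local_name(elem.tag) == element_name and elem.text and elem.text.strip():
            counts[elem.text.strip()] += 1
    return counts


# Invented sample data, used only to show the call; not taken from the corpus.
sample = b"<text><word><pos>N</pos></word><word><pos>V</pos></word><word><pos>N</pos></word></text>"
frequencies = tag_frequencies(io.BytesIO(sample), "pos")
print(frequencies.most_common(1))  # prints [('N', 2)]
```

Matching on the local name keeps the sketch independent of whatever XML namespace the export declares.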
5. Authors, Citation and Contact Information
The TC Akan corpus was created by Dorothee Beermann. Joana Awua Ahadofo assisted with the manual annotations. Special thanks for their support go to Associate Professor James Essegbey, University of Florida, Gainesville, and the Ghanaian Student Association at the Norwegian University of Science and Technology. The corpus should be cited as follows:
Dorothee Beermann (2018). TypeCraft Project – The TypeCraft Akan corpus, Release 1.0. TypeCraft – The Interlinear Text Repository.