Typecraft v2.5

TypeCraft Akan Data Collection Release 1.0

Revision as of 16:24, 24 March 2018 by Typecraft

Release 1.0

1. License and Legal Issues

This corpus is distributed solely for non-commercial, non-profit educational and research use. Release 1.0 is a derivative compilation of multiple annotated texts created by linguistics graduate students as part of their class work. The original texts were created between 2007 and 2013 at the Department of Linguistics, NTNU, Trondheim, Norway.

2. Download Release 1.0

You can download Release 1.0 as a zip file (Media:Release 1.0.zip). Save the zip file to your computer, then use "Extract To", enter a destination path, and press "OK". The archive contains the Akan data as XML and the release notes.
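For scripted workflows, the manual extraction step can also be done programmatically. The following is a minimal sketch using only the Python standard library; the function name `extract_release`, the archive file name and the destination folder are assumptions for illustration, and the internal layout of the archive is not specified in these notes.

```python
import zipfile
from pathlib import Path


def extract_release(zip_path, dest_dir):
    """Extract the release archive and return the XML files it contains.

    Hypothetical helper: the archive layout is not documented here, so the
    function simply searches the extracted tree for files ending in .xml.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("*.xml"))


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever you saved the download.
    archive_path = Path("Release 1.0.zip")
    if archive_path.exists():
        for xml_file in extract_release(archive_path, "akan_release"):
            print(xml_file)
```

The release notes are extracted along with the data; only the XML files are returned because they are what later processing steps would read.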

The release represents a subcorpus of the TC Akan corpus, which was created by the TypeCraft project. The project maintains TypeCraft - The Interlinear Glossed Text Repository, a linguistic online service. Release 1.0 is an XML data set. The TypeCraft XSD schema can be found HERE.

Poio API converts TypeCraft XML and other file formats, such as ELAN's EAF and Toolbox files, into annotation graphs as defined in ISO 24612. It is a free and open source Python library for accessing and searching language documentation data in your linguistic analysis workflow.

The TypeCraft Importer allows you to import the corpus into the TypeCraft Editor for further annotation, or for export to other formats.

3. Description of the TypeCraft Akan Corpus

Release 1.0 of the TC Akan Corpus consists of 41 short texts, mostly linguistic sentence collections, comprising 669 sentences in total. Two of the released texts are transcribed recordings of students narrating a video. The students who did the original work were native speakers of Akan. Curation of the material began in 2016 and extended over a period of one and a half years. During curation, expert linguists worked together with student annotators, both native and non-native speakers, to improve the consistency of the original data. We say more about the type and depth of annotation below. On a four-point scale of green (high quality), yellow, orange and red (should not be used for research), we characterise Release 1.0 as a yellow corpus, by which we mean that it can be used for research with some care.

4. Data structure

Release 1.0 is an XML data set. All texts are translated into English and have morpheme-based annotations. Words are tagged for part of speech. The whole corpus has been privacy-masked and provided with metadata.
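To give a rough idea of how such data might be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`phrase`, `original`, `translation`, `word`, `pos`, `text`) are assumptions made for illustration and should be checked against the TypeCraft XSD schema; the sample contains placeholder tokens, not real Akan data.

```python
import xml.etree.ElementTree as ET

# Placeholder sample. All element and attribute names are assumptions,
# not taken from the TypeCraft XSD; tokens and tags are dummies.
SAMPLE = """
<corpus>
  <text>
    <phrase>
      <original>token1 token2</original>
      <translation>an English translation</translation>
      <word text="token1"><pos>N</pos></word>
      <word text="token2"><pos>V</pos></word>
    </phrase>
  </text>
</corpus>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def read_phrases(xml_text):
    """Return one dict per phrase with its text, translation and tagged words."""
    root = ET.fromstring(xml_text.strip())
    phrases = []
    for element in root.iter():
        if local_name(element.tag) != "phrase":
            continue
        record = {"original": None, "translation": None, "words": []}
        for child in element:
            name = local_name(child.tag)
            if name in ("original", "translation"):
                record[name] = (child.text or "").strip()
            elif name == "word":
                # First <pos> child of the word, or None if there is none.
                pos = next(
                    (c.text for c in child if local_name(c.tag) == "pos"), None
                )
                record["words"].append((child.get("text"), pos))
        phrases.append(record)
    return phrases


if __name__ == "__main__":
    for phrase in read_phrases(SAMPLE):
        print(phrase["original"], "|", phrase["translation"], "|", phrase["words"])
```

Matching on local element names keeps the sketch independent of whatever XML namespace the real files declare.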

For the annotation of the TC Akan corpora we use an annotation set consisting of 60 POS tags and 123 gloss tags. Chart 1 shows the most frequently assigned POS tags and Chart 2 the most frequently assigned gloss tags for our entire Akan corpus. The TC POS and GLOSS tag sets can be found here: Tag descriptions can be found at:
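Tag frequencies like those summarised in the charts could be recomputed from the XML along the following lines. This is a sketch under assumptions: it presumes that POS and gloss tags are stored as the text content of elements named `pos` and `gloss`, which should be verified against the TypeCraft XSD schema, and `tag_frequencies` is a hypothetical helper name.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_frequencies(xml_text, element_name):
    """Count the text values of all elements with the given local name.

    Assumes tags appear as element text (e.g. <pos>N</pos>); the element
    names are illustrative and not confirmed by the release notes.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Ignore any '{namespace}' prefix on the element tag.
        if element.tag.rsplit("}", 1)[-1] == element_name and element.text:
            counts[element.text.strip()] += 1
    return counts
```

Calling `tag_frequencies(xml_text, "pos").most_common(10)` would then list the ten most frequent POS tags in a file, and the same call with `"gloss"` would do so for gloss tags.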

5. Authors, Citation and Contact Information

The TC Akan corpus was created by Dorothee Beermann. Joana Awua Ahadofo assisted with the manual annotations. Special thanks for their support go to Associate Professor James Essegbey, University of Florida, Gainesville, and the Ghanaian Student Association at the Norwegian University of Science and Technology. The corpus should be cited as follows:

Dorothee Beermann (2018). TypeCraft Project – The TypeCraft Akan corpus, Release 1.0. TypeCraft – The Interlinear Text Repository.

Please address all questions, comments and suggestions to Dorothee Beermann (dorothee.beermann@ntnu.no).