
Difference between revisions of "TypeCraft Akan Data Collection Release 1.0"


Revision as of 17:01, 24 March 2018

Release 1.0

1. License and Legal Issues

This corpus is distributed solely for non-commercial, non-profit educational and research use. Release 1.0 is a derivative compilation of multiple annotated texts created by graduate students of linguistics as part of their class work. The original texts were created between 2007 and 2013 at the Department of Linguistics, NTNU, Trondheim, Norway.


2. Download Release 1.0
Media:Release 1.0.zip.

The zip file contains the Akan data as XML and the release notes. To extract the contents, save the file to your computer and use "Extract To". Then enter a destination path and press "OK".
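If you prefer to unpack the archive from a script rather than through the "Extract To" dialog, the following is a minimal Python sketch using only the standard library. The function name `extract_release` and the destination folder name in the usage comment are illustrative assumptions, as is the location of the downloaded file.

```python
import zipfile
from pathlib import Path


def extract_release(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    # Create the destination folder if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage (assumes the download was saved as "Release 1.0.zip" in the current folder):
# for name in extract_release("Release 1.0.zip", "akan_release_1.0"):
#     print(name)
```

The returned list shows which files the archive contained, so you can check that both the XML data and the release notes were unpacked.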

The TypeCraft Importer (https://typecraft.org/tc2/jsp/converter.jsp), which you find on the navigation bar on the left of the page, lets you import the corpus into the TypeCraft Editor for further annotation or for export to other formats.


3. Description of the TypeCraft Akan Corpus

Release 1.0 of the TC Akan Corpus consists of 41 short texts, mostly linguistic sentence collections, amounting to 669 sentences. Two of the released texts are transcribed recordings of students narrating a video. The students who did the original work were native speakers of Akan. Curation of the material started in 2016 and took one and a half years. During curation, expert linguists worked together with student annotators, both native and non-native speakers, to make the original data more consistent. We say more about the type and depth of annotation below. On a four-point scale running from green (high quality) through yellow and orange to red (should not be used for research), we characterise Release 1.0 as a yellow corpus, by which we mean that it can be used for research with some care.

4. Data structure

All texts carry morpheme-based annotations as well as POS tags. The original texts have been translated into English. The whole corpus has been privacy masked, and metadata is provided. When loaded into the TypeCraft Editor, the data looks as shown in the example below:

Aba yii kuruwa no firi ɛpono no so
“Aba removed the cup from the table”

Word      Aba      yii          kuruwa   nó   firii           ɛpono      nò   so
Morpheme  aba      yii          kuruwa   nó   firii           ɛpono      nò   so
Gloss     Aba.SBJ  remove.PAST  cup.OBJ  DEF  away_from.PAST  table.OBJ  DEF  on.LOC
POS       N        V            N        DET  V               N          DET  Nrel
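To work with the annotation tiers of a phrase programmatically (word, morpheme, gloss and POS), something along the following lines could be used. This is only a sketch: the element names `phrase`, `original`, `translation`, `word`, `pos`, `morpheme` and `gloss`, and the attributes `text` and `meaning`, are assumptions made for illustration, and the inline sample covers only the first two words of the example. Check the actual names against the TypeCraft XSD schema before applying this to the release data. Meaning and gloss tags are joined with a dot here, giving for instance `remove.PAST`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumed for illustration
# and must be checked against the TypeCraft XSD schema.
SAMPLE = """
<phrase>
  <original>Aba yii kuruwa no firi ɛpono no so</original>
  <translation>Aba removed the cup from the table</translation>
  <word text="Aba">
    <pos>N</pos>
    <morpheme text="aba" meaning="Aba"><gloss>SBJ</gloss></morpheme>
  </word>
  <word text="yii">
    <pos>V</pos>
    <morpheme text="yii" meaning="remove"><gloss>PAST</gloss></morpheme>
  </word>
</phrase>
"""


def read_phrase(xml_text):
    """Return the phrase as a dict with its original text, translation and word tiers."""
    root = ET.fromstring(xml_text)
    words = []
    for word in root.findall("word"):
        morphemes = []
        for morpheme in word.findall("morpheme"):
            # Combine the meaning with any gloss tags, e.g. "remove" + "PAST".
            parts = [morpheme.get("meaning", "")]
            parts += [gloss.text or "" for gloss in morpheme.findall("gloss")]
            label = ".".join(part for part in parts if part)
            morphemes.append((morpheme.get("text"), label))
        words.append({
            "word": word.get("text"),
            "pos": word.findtext("pos"),
            "morphemes": morphemes,
        })
    return {
        "original": root.findtext("original"),
        "translation": root.findtext("translation"),
        "words": words,
    }


phrase = read_phrase(SAMPLE)
```

Each entry in `phrase["words"]` then holds one word with its POS tag and a list of (morpheme, gloss) pairs, which mirrors the tiers shown in the Editor view.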


For the annotation of the TC Akan corpora we used an annotation set consisting of 60 POS tags and 123 gloss tags. Chart 1 shows the most frequently assigned POS tags and Chart 2 shows the most frequently assigned gloss tags.

Chart 1: The most frequent POS tags in the TypeCraft Akan Corpus


Chart 2: The most frequent gloss tags in the TypeCraft Akan Corpus
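Tag frequencies of the kind summarised in the charts can be counted with a few lines of Python. The sketch below is an illustration only and not necessarily how the charts were produced. It counts the POS tags of the single example sentence above, and the function name `tag_frequencies` is hypothetical.

```python
from collections import Counter

# POS tags of the example sentence above, in word order.
pos_tags = ["N", "V", "N", "DET", "V", "N", "DET", "Nrel"]


def tag_frequencies(tags):
    """Return (tag, count) pairs, most frequent first."""
    return Counter(tags).most_common()


frequencies = tag_frequencies(pos_tags)
```

For this sentence, `N` is the most frequent tag with three occurrences. Running the same count over the POS tags of all 669 sentences would give corpus-wide frequencies.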


A link to the complete TypeCraft POS and gloss annotation sets is available on the navigation bar, under TypeCraft help, on the left of the page.


5. Authors, Citation and Contact Information

The TC Akan corpus was created by Dorothee Beermann. Joana Awua Ahadofo assisted with the manual annotations. Special thanks for their support go to Associate Professor James Essegbey, University of Florida, Gainesville, and the Ghanaian Student Association at the Norwegian University of Science and Technology. The corpus should be cited as follows:

Dorothee Beermann (2018). TypeCraft Project – The TypeCraft Akan corpus, Release 1.0. TypeCraft – The Interlinear Text Repository.

Please address all questions, comments and suggestions to Dorothee Beermann (dorothee.beermann@ntnu.no).