Typecraft v2.5
Difference between revisions of "TypeGram"


Revision as of 16:13, 3 February 2015

--Lars Hellan 17:26, 2 February 2015 (UTC)


TypeGram Contributors: Lars Hellan, Tore Bruland, Dorothee Beermann (all NTNU)


Downloads from: http://regdili.hf.ntnu.no:8081/typegramusers/menu.


TypeGram is an application for converting Interlinear Glossed Text (IGT) into grammar specifications. The IGT comes from TypeCraft (cf. Beermann and Mihaylov 2014), and the grammar formalism is that of HPSG (cf. Pollard and Sag 1994), using the LKB platform (cf. Copestake 2002). The application converts information contained in the IGT of a set of sentences of a language L into material suited for the files of a grammar of L, namely its content word lexicon, its function word lexicon, and its file for inflectional rules. The insertion of this material is incremental, so the conversion can be rerun for any new or enlarged set of IGT available for L. In addition, the grammar comes with a ready-defined inventory of grammatical types and rules hypothesized to accommodate structures from most types of languages of the world. Together these items constitute the downloadable item 'Grammar' at the download site http://regdili.hf.ntnu.no:8081/typegramusers/menu. The IGT from TypeCraft is downloaded as XML directly from the TypeCraft editor (see below).

The conversion processor is called 'TypeGramUtil2', and is the item called 'TypeGram Software' on the download site. We now describe its functionality (cf. Bruland 2011).


Functionality

Install
 We assume that java version "1.7.0_25" or higher is installed on your computer.
 Copy the TypeGramUtil2 folder to a place on your hard disk.
 In Linux
   select the TypeGramUtil2.jar in the file browser and right-click;
   select "properties", then "permissions" and check "Execute" (allow executing file as program);
  or
   open a terminal application;
   in the terminal: change directory to the folder which contains TypeGramUtil2.jar;
   run: chmod u=rx TypeGramUtil2.jar.
Start application
 In Windows: 
   double-click on the TypeGramUtil2.jar file (called 'start_win').
 In Linux:
   right-click on the TypeGramUtil2.jar file, Open with -> your java version
  or
   open a terminal application;
   in the terminal: change directory to the folder which contains TypeGramUtil2.jar;
   make the run_unix bash script runnable with: chmod u=rx run_unix;
   start the script with: ./run_unix.
Use of the application
Starting the TypeGramUtil2.jar file produces a graphical interface (GI).
 The application reads a downloaded XML file from TypeCraft and it creates/reads/updates the LKB files:
  prefix + FuncWord.tdl
  prefix + Infl.tdl
  prefix + Lex.tdl
  prefix + MetaFuncWord.tdl
  prefix + MetaInfl.tdl
  prefix + MetaLex.tdl
  prefix + Gloss.txt
  
 The downloaded XML file from TypeCraft is called tc2_export.xml or tc2_export(num).xml (a new number for
each new download). The downloaded file is stored in the folder:
 In Windows: "My Documents/Downloads"
 In Unix: Home/Downloads
 We recommend moving the file to another folder and renaming it; see below.
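
 The move-and-rename step can also be scripted. The following is only a sketch under assumptions: the function name move_latest_export, the choice of the most recently modified export, and the target file name are illustrative and not part of TypeGram; only the tc2_export naming pattern comes from the description above.

```python
from pathlib import Path
import shutil


def move_latest_export(downloads, target_dir, new_name):
    """Move the most recently modified tc2_export*.xml out of `downloads`.

    Hypothetical helper: it matches tc2_export.xml and tc2_export(num).xml,
    picks the newest one by modification time, and stores it in `target_dir`
    under `new_name`.
    """
    downloads = Path(downloads)
    candidates = sorted(
        downloads.glob("tc2_export*.xml"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no tc2_export*.xml found in {downloads}")

    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / new_name
    shutil.move(str(candidates[-1]), str(destination))
    return destination
```

 A call such as move_latest_export(Path.home() / "Downloads", "GaG", "ga_export.xml") would then leave the renamed export in the folder that "Input" is pointed at.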
 
 The GI has a button named "Input" that selects the downloaded XML file. Click on the
button, and you can define exactly in which folder this file will be found.
 
 The GI has a button named "Output" that selects the destination for the LKB files. In the default case,
it will be the same as specified for "Input" (but see below).
 
 The GI has a text field named "File Prefix" that sets the prefix for the LKB files. For example, 
the prefix "nor" gives the following files (given a selection of Norwegian IGT from TypeCraft):
     norFuncWord.tdl
     norInfl.tdl
     norLex.tdl
     norMetaFuncWord.tdl
     norMetaInfl.tdl
     norMetaLex.tdl
     norGloss.txt
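
 The naming scheme amounts to prepending the prefix to seven fixed suffixes. As a small illustration (the function name lkb_file_names is hypothetical and not part of the application; the suffixes are those listed above):

```python
# Suffixes of the seven files the application creates/reads/updates,
# as listed in the text.
LKB_SUFFIXES = [
    "FuncWord.tdl",
    "Infl.tdl",
    "Lex.tdl",
    "MetaFuncWord.tdl",
    "MetaInfl.tdl",
    "MetaLex.tdl",
    "Gloss.txt",
]


def lkb_file_names(prefix):
    """Return the file names that result from a given "File Prefix" value."""
    return [prefix + suffix for suffix in LKB_SUFFIXES]


if __name__ == "__main__":
    for name in lkb_file_names("nor"):
        print(name)
```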

 The button "Transfer" reads the XML file, reads the LKB files, updates the new entries, and writes
the result back to the LKB files.
 An item is a duplicate if it is already stored in a file; duplicates are not written to the
LKB files.
 After each transfer, a set of counters is updated for duplicate items and new items for each file.
 When all the "new" counters are zero, it means that no new information is found in the XML file. 
 A message is displayed with the number of items written for each LKB file.
 For example: "Numbers saved. lex: 248, infl: 21, funk: 48, gloss:58".
 When the application is closed, the "Input", "Output" and "File Prefix" values are saved in the 
TypeGram.ini file.
 Next time you start the application you get the previous values for "Input", "Output" and "File Prefix".
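
 The duplicate handling and the counters can be pictured with the following sketch. It is not the code of TypeGramUtil2: the function name transfer and the representation of items as plain strings are assumptions made for illustration, as is the choice to count an item repeated within one XML file as a duplicate.

```python
def transfer(stored, incoming):
    """Add unseen items from `incoming` to `stored` (modified in place).

    Returns a pair (new_count, duplicate_count), mirroring the "new" and
    "duplicate" counters described above for a single file.
    """
    seen = set(stored)
    new_count = 0
    duplicate_count = 0
    for item in incoming:
        if item in seen:
            # Already stored: counted, but not written again.
            duplicate_count += 1
        else:
            stored.append(item)
            seen.add(item)
            new_count += 1
    return new_count, duplicate_count


if __name__ == "__main__":
    lexicon = ["face", "black"]
    print(transfer(lexicon, ["black", "house"]))  # one new item, one duplicate
    print(transfer(lexicon, ["black", "house"]))  # nothing new on a second run
```

 Running the same input twice gives a "new" count of zero the second time, which is the situation the text describes as no new information being found in the XML file.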


PRACTICAL EXAMPLE

(This set of steps refers to the use of the Windows version.)


1. Make a folder which will be the habitat for a grammar of Ga, and name it GaG.


2. Download 'Grammar.zip' from 'TypeGram for Users'. It gets downloaded to 'Downloads'.

Extract all the zipped parts, resulting in a folder Grammar. Move this folder into GaG.


3. Download 'Example TypeCraft XML ga_export.xml' from 'TypeGram for Users'. It too is downloaded to 'Downloads', and it can be moved directly into GaG. (This is for demonstration; in the normal case you will bring this XML export directly from TypeCraft to this folder.)


4. Download 'TypeGram Software Java Jar file for Unix and Windows typegram.zip' from 'TypeGram for Users'. As in step 2, unzip it inside 'Downloads', and move the unzipped folder 'typeGram' to GaG.


5. Open 'typegram', and you find three items inside. Click on 'start_win', and the GI opens.


6. The line to the right of the button 'Input' describes the item to be used as input. The path corresponds to the folder chosen, but the item 'default.xml' is only a placeholder name. Click on "Input", and navigate through the folder system to select the item 'ga_export.xml'.


7. We now want to define the place where the created lkb-files will end up. We could let this be Grammar, since this is where they will be used, but as an intermediate step, we create a folder Converted tdl-files, from which we in subsequent steps can make selections of files. This is now selected as output area, under "Output".


8. In the line to the right of 'Prefix', replace 'typegram' with 'ga'.


9. Click "Transfer". The folder Converted tdl-files now contains seven files: 'gaFuncWord', 'gaInfl' and 'gaLex' represent the relevant 'object language' items, whereas 'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex' represent the gloss words and gloss tags of the IGTs. The file 'gaGloss' contains the full gloss specification of each sentence, assembled as word-size units in a string, as in "1SGPOSSface AORblack 1SG" ('my face blackens me' = 'I get angry'); the items of this string thereby match the items in the files 'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex'. The GI provides counts for each file, as explained above. (The 'meta-level' grammar is described in Hellan 2010, Hellan and Beermann 2014.)


10. Depending on whether you want to create an 'object grammar' of Ga or a 'metagrammar', you import either 'gaFuncWord', 'gaInfl' and 'gaLex', or 'gaMetaFuncWord', 'gaMetaInfl' and 'gaMetaLex', into Grammar. For one caveat, see next point.


11. As Grammar is already defined, it contains test lexicon and inflection files which match a test suite called 'test', which is a set of meta-strings based on Norwegian and Ga. Before running with the new files, the old ones have to be renamed with a prefix such as 'old', and the new files have to take over the names used for the previous ones (this is so that the grammar's loading definitions apply as normal).
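
The renaming in step 11 can be done by hand, or along the lines of the sketch below. The helper swap_in and the file names in the usage note are hypothetical; which files in Grammar have to be replaced depends on the grammar's loading definitions, which are not reproduced here.

```python
from pathlib import Path
import shutil


def swap_in(grammar_dir, new_file, target_name, old_prefix="old"):
    """Install `new_file` in `grammar_dir` under the name `target_name`.

    If a file called `target_name` already exists, it is first renamed to
    old_prefix + target_name, so that the previous version is kept.
    """
    grammar_dir = Path(grammar_dir)
    target = grammar_dir / target_name
    if target.exists():
        backup = grammar_dir / (old_prefix + target_name)
        if backup.exists():
            # Refuse to overwrite an earlier backup silently.
            raise FileExistsError(f"{backup} already exists")
        target.rename(backup)
    shutil.copy2(new_file, target)
    return target
```

For instance, swap_in("GaG/Grammar", "GaG/Converted tdl-files/gaInfl.tdl", "inflr.tdl") would keep the earlier file as 'oldinflr.tdl', assuming that 'inflr.tdl' is the name the loading definitions expect for the inflection file.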


The grammar

The idea is to collect in one grammar all the structures (types, rules) needed to produce the patterns found in any grammar/language. Specific grammars can then be defined around this 'core', with lexicons, and subsets of the defined rules/types, specific to the languages in question. So far, the grammar essentially contains verb construction specifications for active constructions in Norwegian (more than 200), Ga (more than 100) and, as a start, Kistaninya. Both syntax and semantics are included. The classification of construction types is done according to the 'Construction Labeling System' (CLS); see 'Verbconstructions cross-linguistically - Introduction' and 'The Construction Labeling system'.

At present, the grammar contains the following files ('tdl' for 'type description language', a code suited for the computational system in question):

- 'types.tdl' -- the core assembly of types

- 'labeltypes.tdl' -- types defined for all labels in CLS, based on types.tdl

- 'gatemplates.tdl' -- construction types in Ga defined in terms of the types defined in labeltypes.tdl

- 'nortemplates.tdl' -- construction types in Norwegian defined in terms of the types defined in labeltypes.tdl

- 'kistanetemplates.tdl' -- construction types in Kistaninya defined in terms of the types defined in labeltypes.tdl

- 'rules.tdl' -- a small number of syntactic rules sufficient for the construction array in question

- 'lrules.tdl' -- a small number of lexical rules sufficient for the construction array and lexical types in question

- 'inflr.tdl' -- a small number of inflection rules sufficient for the construction array and lexical types in question

- 'lexicon' -- an assembly of all the English-like stems found in test; in the case of verbs, with lexical types reflecting the construction type they head.

- 'test' -- currently, quasi-sentences instantiating all the construction types represented, each sentence consisting of English stems combined with abstract symbols for functional and inflectional categories, somewhat like what the (English) GLOSS-line of given sentences would look like. These sentences are thus the 'meta-strings' mentioned above.


Batch parsing shows how many parses are provided for each 'sentence'.