
Converting a Toolbox lexical database to LKB format


Latest revision as of 19:21, 31 January 2013

Summary

The LKB system (Linguistic Knowledge Builder) is a grammar and lexicon development environment for unification-based linguistic formalisms. LKB focuses on HPSG. This page describes and provides the scripts that convert a lexicon database made with Toolbox into the lexicon format needed by LKB. The scripts were developed by Hannes Hirzel.

A presentation given in Trondheim in 2005 (File:Toolbox-LKB-Link-slides - version 4.pdf) shows how this may be applied to a lexicon file of the Ga language. The dictionary file was created by Mary E. Kropp Dakubu.

The scripting language used for the conversion is called 'Consistent changes' and is built into the Toolbox program.

A working portable setup is available in the Download section of this page. It may need to be adapted for other languages. All the files involved are plain text files that you may change; see the License section.


Status

Updated 9th November 2012 by Hannes Hirzel. In entries with multiple senses, only the last sense is converted. This needs to be fixed. Contact: dictionaries_gillbt@gillbt.org

Download

The folder File:Toolbox Project Ga.zip contains the standard files produced by the utility program 'Toolbox New Project Package 1.5.8' from http://www.sil.org/computing/toolbox/downloads.htm .

The file 'Dictionary.txt' has been replaced by the Ga lexicon created by Mary Esther Kropp Dakubu (MED), University of Ghana.

This folder has been posted to this web site [www.typecraft.org] with permission.

NOTE 5th Dec 2012: The folder does not contain the correct lexicon file. The conversion requires a lexicon file with only one sense per entry.
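Because the conversion expects one sense per entry, it can help to find the entries that violate this before exporting. The following is a minimal Python sketch, not part of the Toolbox setup or of the author's scripts. It assumes that each record starts with a `\lx` line and that each sense is introduced by a `\sn` line, as in the examples on this page. The function names are hypothetical.

```python
from typing import List, Tuple


def split_records(sfm_text: str) -> List[List[str]]:
    """Split Toolbox SFM text into records; each record starts at a \\lx line."""
    records: List[List[str]] = []
    for raw in sfm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\lx" or line.startswith("\\lx "):
            records.append([line])
        elif records:
            # Lines before the first \lx (file header) are ignored.
            records[-1].append(line)
    return records


def multi_sense_entries(sfm_text: str) -> List[Tuple[str, int]]:
    """Return (headword, sense count) for every record with more than one \\sn."""
    found: List[Tuple[str, int]] = []
    for record in split_records(sfm_text):
        headword = record[0][len("\\lx"):].strip()
        senses = sum(1 for line in record if line.split()[0] == "\\sn")
        if senses > 1:
            found.append((headword, senses))
    return found
```

Calling `multi_sense_entries` on the text of 'Dictionary.txt' would list the entries that still need to be split or reduced to a single sense by hand.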

How to start Toolbox

The folder 'Settings' contains the Toolbox exe file. Double-click it to start Toolbox.

Screenshot: File:HowToStartThePortableToolboxSetup.png

How to create the LKB tdl file

You may run the conversion program from within Toolbox. To do so, follow these steps:

  1. Make the dictionary window the 'active window' by clicking on its title bar.
  2. Choose menu 'File' / 'Export'.
  3. Select 'TBox-LKB Step1'.
  4. Click 'OK'.
  5. Toolbox creates a new file, 'LKBlexicon.tdl'.

Screenshot: File:HowToStartTheLKBconversion.png

Examples

The examples need to be updated to reflect the restriction on the lexicon input file (only one sense per entry).

lɔ / sneak

Screenshot of the dictionary entry: File:Ga-Dictionary-Lexical-Entry-Sneak.png

is converted to

   lO_2 := verb-lexeme &
   [STEM <"lO">,
   PHON <"lO">,
   ENGL-GLOSS <"sneak", "">,
   SYNSEM.LKEYS.KEYREL.PRED "_lO_v_rel"].

lɔŋ / raffia_palm

   \lx lɔŋ
   \ph lɔ̀ŋ̀
   \ps n
   \sn 1
   \ge raffia_palm
   \xv lɔŋ tso
   \sn 2 
   \ge fibre,_raffia
   \de the fibre of the raffia palm, used for sewing sacks and weaving mats. 
   \xv lɔŋ kɛ abui 
   \xe thread and needle; close association (fig.).
   \et PGD *lɔ-
   \dt 12/Apr/2007


is converted to

   lOG := noun-lexeme &
   [STEM <"lOG">,
   PHON <"lOG">,
   ENGL-GLOSS <"fibre,_raffia", "">,
   SYNSEM.LKEYS.KEYREL.PRED "_lOG_n_rel"].

Only the second sense is given. This needs to be fixed.
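The mapping from an SFM record to a TDL entry can be illustrated with a short Python sketch. This is not the author's 'Consistent changes' code; it is a hypothetical re-creation of the pattern visible in the examples above. It rests on several assumptions: PHON repeats the ASCII form of the headword, `\ps` values `n`, `v` and `verb` map to `noun-lexeme` and `verb-lexeme`, and a homonym number (`\hm`) is appended to the entry name, as the name `lO_2` suggests. Like the current scripts, it keeps only the last `\ge` of a record.

```python
# Assumed character mapping (the same three characters the page lists).
ASCII_MAP = str.maketrans({"ɛ": "E", "ŋ": "G", "ɔ": "O"})

# Assumed mapping from \ps values to (TDL type, part-of-speech tag in PRED).
LEXEME_TYPES = {
    "n": ("noun-lexeme", "n"),
    "v": ("verb-lexeme", "v"),
    "verb": ("verb-lexeme", "v"),
}


def sfm_to_tdl(record: str) -> str:
    """Turn one SFM record (text) into one TDL lexical entry (text)."""
    fields = {}
    for raw in record.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue
        marker, _, value = line[1:].partition(" ")
        # A repeated marker overwrites the earlier value, so the last
        # sense wins; this mirrors the limitation described on this page.
        fields[marker] = value.strip()

    stem = fields["lx"].translate(ASCII_MAP)
    lexeme_type, pos = LEXEME_TYPES[fields["ps"]]
    name = f"{stem}_{fields['hm']}" if "hm" in fields else stem
    return (
        f"{name} := {lexeme_type} &\n"
        f'[STEM <"{stem}">,\n'
        f'PHON <"{stem}">,\n'
        f'ENGL-GLOSS <"{fields["ge"]}", "">,\n'
        f'SYNSEM.LKEYS.KEYREL.PRED "_{stem}_{pos}_rel"].'
    )
```

Applied to the lɔŋ record above, the sketch yields an entry named `lOG` with the gloss `fibre,_raffia`, because the second `\ge` replaces the first. Emitting one TDL entry per `\sn` block instead would be one way to address the multiple-sense problem.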

fee / make

   \lx fee
   \hm 2 
   \ph fèê, fèé, !fé 
   \ps verb
   \sn 1 
   \ge make
   \de make, do, perform
   \sl1 v
   \sl2 tr
   \sl4 suAg_obTh
   \sl6 CREATION
   \xv E-fee flɔɔ, samala
   \xg 3S.AOR-make stew
   \xe she made stew, soap.

is converted to

  .....

Implementation of the conversion

The folder 'Tbox2LKB-conv-scripts' contains a copy of the cct files from the folder 2005-05-31Ga-for-LKB-Uni-Trondheim-11a mentioned in the 2005 presentation.

These cct files convert the Ga lexicon, which is in SFM (the Toolbox format), to the format that LKB (Linguistic Knowledge Builder) needs.


The Ga alphabet contains the following additional characters, which are converted as shown:

  • ɛ is converted to E
  • ŋ is converted to G
  • ɔ is converted to O

This conversion is defined in the file 'Step1-Unicode.cct', which converts Unicode characters to plain ASCII combinations. If the LKB processor in use can handle some of these Unicode characters directly, adapt the file by deleting the corresponding conversions.
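The effect of this character conversion can be shown with a small Python sketch. It is only an illustration of the three replacements listed above, not the content of 'Step1-Unicode.cct'. The names `GA_TO_ASCII` and `to_ascii` are hypothetical, and capital letters and tone marks are not handled here.

```python
from typing import Dict

# The three replacements listed on this page.
GA_TO_ASCII: Dict[str, str] = {"ɛ": "E", "ŋ": "G", "ɔ": "O"}


def to_ascii(text: str, mapping: Dict[str, str] = GA_TO_ASCII) -> str:
    """Replace each character found in `mapping`; leave all others unchanged."""
    return text.translate(str.maketrans(mapping))
```

With the full table, `to_ascii("lɔŋ")` gives `lOG`. Passing a reduced table, for example `{"ŋ": "G"}`, corresponds to deleting conversions for characters that the LKB processor can read directly.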

License

The presentation and this wiki page are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. The script code (program code) is under the MIT license.

License for data (dictionary file): to be determined; contact medakubu@gmail.com