
Internal:System description

TypeCraft System description

Pavel Mihaylov et al.


Overview

TypeCraft is a multilingual online database of linguistically annotated natural language text, embedded in a collaboration and information tool. This setup allows users (projects as well as individuals) to create their own domains, to invite others, and to share their data with the public. The kernel of TypeCraft is morphological word-level annotation in a relational database setting, wrapped into a communication system, not unlike popular online community sites. TypeCraft allows you to import raw text for annotation and export annotated data to MS Word, OpenOffice.org, LaTeX or XML for further use.

Note: this section is intended to describe TypeCraft from a non-technical point of view so people who are new to the system have some idea of what it does.

Modules

  • Server side modules: TypeCraft Java server, TypeCraft database, MediaWiki (PHP files + database), Apache

  • Client side modules: TypeCraft editor interface + MediaWiki (content produced by the server side) = TypeCraft application from the user’s point of view.

Server side

All files and modules reside in the home directory of the user typecraft, /home/typecraft. The present 2.0 version of the system is in the subdirectory typecraft2.

TypeCraft storage and data model

TypeCraft uses an SQL database (at present PostgreSQL) for data storage. The data mapping between Java objects and database tables is managed by Hibernate ( http://www.hibernate.org/ ), so the system isn’t bound to any specific SQL database. TypeCraft data can be divided into two groups:

  • Common data: pos tags, gloss tags, global tags, ISO 639-3 languages ( http://www.sil.org/ISO639-3/ ). This is data shared between all annotated tokens and users.

  • Individual data: texts, phrases, words and morphemes, together with their annotation. This is data specific to each user.

Individual data items reference common data items. E.g. everyone who uses the pos tag n shares a reference to a single common tag n, so if you change n to something else, say nn, then everyone will have their n’s changed to nn.


SQL schemas and tables

There are three schemas: common, individual and iso639. The schemas common and individual reflect the common and the individual data from the data model. The schema iso639 contains all of the tables describing languages according to ISO 639-3 and is thus common too. It is defined as a separate schema because the data is a direct import of the tables provided by SIL.

Server proper

The TypeCraft server proper is a Java application running inside a Java application server. Currently we use JBoss ( http://www.jboss.org/ ), but a lighter alternative, such as Tomcat, can be used in future. JBoss is located in /home/typecraft/typecraft2/jboss. The TypeCraft application is contained entirely in server/default/deploy/TypeCraftApp.ear relative to that path.

By default, JBoss makes Java applications accessible via http://localhost:8080/ , but since something else was (and probably still is) using port 8080, we changed it to 18080 in the file server/default/deploy/jboss-web.deployer/server.xml:

    <Connector port="18080" address="${jboss.bind.address}"
        maxThreads="250" maxHttpHeaderSize="8192"
        emptySessionPath="true" protocol="HTTP/1.1"
        enableLookups="false" redirectPort="8443" acceptCount="100"
        connectionTimeout="20000" disableUploadTimeout="true" />

Apart from that, JBoss is a bare-bones install that was simply untarred. To start JBoss, go to its directory and run bin/run.sh. It must be run as user typecraft. It is convenient to run it via screen or nohup so that its output goes somewhere other than the console.

Important: TypeCraft needs Java 6 (aka 1.6.x) to run. The default Java in /usr/bin on ginnungagap is version 5, so you should make sure the correct Java comes first in your path (taken from the .bash_profile of user typecraft):

    export PATH=/usr/jdk1.6.0_02/bin:"${PATH}"
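The start-up steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names is_java6, java_version and start_jboss and the log file name jboss.out are hypothetical, and the version check simply mirrors the statement that TypeCraft needs Java 1.6.x.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the TypeCraft sources.

# is_java6 VERSION: succeed only for version strings of the form 1.6.x
is_java6() {
    case "$1" in
        1.6.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# java_version: print the quoted version from the first line of `java -version`
# (the JVM writes this banner to stderr, hence the 2>&1).
java_version() {
    java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p'
}

# start_jboss JBOSS_HOME: refuse to start unless Java 6 is first in PATH,
# then run bin/run.sh detached from the console.
start_jboss() {
    jboss_home="$1"
    if ! is_java6 "$(java_version)"; then
        echo "Java 6 is not the first java in PATH" >&2
        return 1
    fi
    # Subshell: the caller's working directory stays unchanged.
    # jboss.out is an assumed log file name.
    ( cd "$jboss_home" && nohup bin/run.sh > jboss.out 2>&1 & )
}
```

Run as user typecraft, a call such as start_jboss /home/typecraft/typecraft2/jboss would then leave the server output in jboss.out inside the JBoss directory.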


Wiki

TypeCraft uses MediaWiki ( http://www.mediawiki.org/ ) with a couple of custom-written extensions. Apart from the one-line session hook described under MediaWiki ⇔ TypeCraft server interaction, no modifications of the MediaWiki code are necessary. The extensions are:

  • TypeCraft: the TypeCraft editor + a builder for static lists of pos and gloss tags to show as a wiki page.

  • PhraseRender: used to embed annotated phrases in wiki pages, i.e. this extension handles the corresponding wiki tag and converts it to HTML, which MediaWiki in turn shows as part of the page.

  • Other extensions, not part of the TypeCraft project, used to enhance MediaWiki’s functionality: flashmp3, StubManager, GroupManager, accesscontrol, GoogleMaps. These do not affect or depend on the TypeCraft editor.

MediaWiki is presently installed in /home/typecraft/typecraft2/mediawiki on ginnungagap. See the MediaWiki documentation on how to install a standard MediaWiki. All configuration is in LocalSettings.php, the standard MediaWiki config file. Extensions live in the subdirectory extensions.

Another customisation is the TypeCraft skin. It is defined in the following files:

    skins/tc2/*
    skins/Tc2.php


MediaWiki ⇔ TypeCraft server interaction

MediaWiki is a PHP application and the TypeCraft server is a Java one. Since they both need to know some common information (such as which user is logged in), they need a means of communication. We use MediaWiki to do the actual login, and then the server opens the PHP session file MediaWiki produced to find out who logged in.

As MediaWiki and JBoss run as different users, the session files must be accessible to both of them. PHP (through Apache) normally makes its session files accessible only by www-data, so we have a custom session handler that changes the group of those files to typecraft. JBoss runs under the user typecraft (which is a member of the group typecraft), so it gains read access to those files.

The custom session handler is defined in tc_custom_session.php in MediaWiki’s home, and we have to tell MediaWiki to use it by including it in includes/GlobalFunctions.php, function wfSetupSession() (the four lines after require_once are context from the original file):

    # TypeCraft: Next line added to share the session with java
    require_once("tc_custom_session.php");
    session_cache_limiter('private, must-revalidate');
    wfSuppressWarnings();
    session_start();
    wfRestoreWarnings();

Important: make sure you add that line again if you upgrade MediaWiki.
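To make the idea concrete, here is a rough shell sketch of what reading a shared session file involves. It is not the server's actual code (the server does this in Java). It assumes PHP's default session serialisation, in which each entry is stored as key|serialized-value, and that MediaWiki keeps the logged-in name under the key wsUserName. The function names session_readable and session_user are hypothetical, and so is the file name in the usage note.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the session hand-over; not TypeCraft code.

# session_readable FILE: succeed if the current user may read the session file,
# which is what the group change to typecraft is meant to achieve for JBoss.
session_readable() {
    [ -r "$1" ]
}

# session_user FILE: print the user name stored in a PHP session file.
# Assumed entry format (PHP default handler): wsUserName|s:5:"Pavel";
# Names containing a double quote are not handled by this simple pattern.
session_user() {
    sed -n 's/.*wsUserName|s:[0-9]*:"\([^"]*\)";.*/\1/p' "$1"
}
```

For example, session_readable sess_abc123 && session_user sess_abc123 would print the logged-in name when run as user typecraft against a session file named sess_abc123 (an invented name), and print nothing if nobody is logged in.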

Client side

The user interacts with TypeCraft through a web-based interface. The interface consists of the customised wiki (MediaWiki) and a text-and-phrase editor. The editor is used to edit texts and phrases and to annotate phrases.

Editor proper

The editor interface is written entirely in JavaScript and builds a GUI using HTML elements. The present GUI uses the YUI library ( http://developer.yahoo.com/yui/ ) and plenty of homebrew code that evolved over time. A new GUI based on the Google Web Toolkit ( http://code.google.com/intl/bg/webtoolkit/ ) is in progress. It will be written entirely in Java, which is then translated into portable JavaScript, thus avoiding the mess of maintaining JavaScript by hand and adjusting code for browser idiosyncrasies.

Wiki

On the client side the wiki looks and behaves like any other MediaWiki, and the editor proper appears to be an integral part of it. The user logs in and interacts with the wiki and the editor as if they were a single system. This is important to know, as users often complain that something “doesn’t work” when the behaviour they describe is caused not by a bug but by the design (e.g. if you search for wiki pages you won’t find any annotated phrases, as they are stored separately and MediaWiki doesn’t know about them). Other times only one part of the system will be broken, e.g. if JBoss is malfunctioning but MediaWiki runs fine. When debugging a possible problem, it’s best to ask the bug reporter to be as detailed as possible.

Building the system

The easiest way to build the system is to use Eclipse, as that is what we use to develop it (though command line tools like javac and ant should work too). The source code for the stable system is in the SVN repository at http://www.typecraft.org/svn/tc/ . You need the following components checked out as three separate Eclipse projects:

  • trunk/lib (Eclipse project lib)

  • branches/tc2/common (Eclipse project tc2-common)

  • branches/tc2/server (Eclipse project tc2-server)

Once you have all the projects, Eclipse will compile the Java files automatically. Then right-click packaging-build.xml in tc2-server and choose Run as Ant Build. This produces three files in tc2-server: TypeCraftApp.ear, TypeCraftEJB.jar and TypeCraftWeb.war. The .ear file is the end product and needs to be dropped into JBoss’s deployment directory, while the other two files are intermediate.

Important: tc2-server has a directory docroot/js with a bunch of JavaScript files. Those need to be concatenated and gzipped into a single file docroot/tc2tabbed.jsgz before you start packaging. This is normally done automatically by Eclipse, but if needed, it can be achieved with the following shell commands:

    cd {eclipse-workspace-folder}/tc2-server/docroot
    cat js/*.js | gzip > tc2tabbed.jsgz
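If the bundling step has to be repeated by hand, the two commands can be wrapped in a slightly more defensive helper. This is a minimal sketch, not part of the TypeCraft build: the function name build_jsgz and the temporary file name are hypothetical, and the files are concatenated in the shell's glob order, exactly as in the commands above.

```shell
#!/bin/sh
# build_jsgz DOCROOT: concatenate DOCROOT/js/*.js and gzip the result into
# DOCROOT/tc2tabbed.jsgz. Hypothetical helper, not part of the TypeCraft build.
build_jsgz() {
    docroot="$1"
    # Fail early if there is nothing to bundle.
    if ! ls "$docroot"/js/*.js > /dev/null 2>&1; then
        echo "no .js files found in $docroot/js" >&2
        return 1
    fi
    # Write to a temporary name first so a failed run never leaves a
    # truncated bundle behind for the packaging step to pick up.
    cat "$docroot"/js/*.js | gzip -c > "$docroot/tc2tabbed.jsgz.tmp" &&
        mv "$docroot/tc2tabbed.jsgz.tmp" "$docroot/tc2tabbed.jsgz"
}
```

A quick sanity check after running it is gzip -t tc2tabbed.jsgz, which verifies that the archive is intact before packaging.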

Apache configuration

The Apache configuration resides in /etc/apache2/sites-enabled/typecraft.org.conf. The file describes everything Apache has to do in order to show MediaWiki pages and proxy some requests to JBoss. In addition, it sets up compression to speed up the system on slow connections.

    <VirtualHost *:80>
        ServerName www.typecraft.org
        ServerAlias typecraft.org
        UseCanonicalName On
        DocumentRoot /var/www/
        CustomLog /var/log/apache2/typecraft-access.log combined
        ErrorLog /var/log/apache2/typecraft-error.log

        # enable the rewriting engine and set some basic redirects
        RewriteEngine On
        RewriteRule ^/Icons -
        RewriteRule ^/$ /tc2wiki/Main_Page [R,L]

        # rules that proxy TC client requests to the jboss server at port 18080
        RewriteRule ^/TCEditor/?$ http://localhost:18080/tc2/jsp/tceditor.jsp [P]
        RewriteRule ^/TCEditor/(\d+)/?$ http://localhost:18080/tc2/jsp/tceditor.jsp?id=$1 [P]
        RewriteRule ^/TCEditor/(\d+)/(\d+)/?$ \
            http://localhost:18080/tc2/jsp/tceditor.jsp?id=$1&pid=$2 [P]
        ProxyVia On
        ProxyPass /tc2 http://localhost:18080/tc2
        ProxyPassReverse /tc2 http://localhost:18080/tc2

        # a hack to allow older versions of the offline client to login, must come before next line
        RewriteRule ^/tc2wiki/api.php(.*)$ /w/api.php$1 [PT]

        # make api.php available so the offline client can login
        Alias /w/api.php /home/typecraft/typecraft2/mediawiki/api.php

        # MediaWiki rules to get nice MediaWiki urls (taken from the web)
        Alias /w/skins/ /home/typecraft/typecraft2/mediawiki/skins/
        Alias /w/images/ /home/typecraft/typecraft2/mediawiki/images/
        Alias /w/index.php /home/typecraft/typecraft2/mediawiki/index.php
        Alias /tc2wiki /home/typecraft/typecraft2/mediawiki/index.php

        # setup compression for text-based files to speed up things on slow connections
        DeflateFilterNote ratio
        AddOutputFilterByType DEFLATE text/html text/plain text/xml text/javascript \
            text/css application/x-javascript
        LogFormat '"%r" %b (%{ratio}n) "%{User-agent}i"' deflate
        CustomLog /var/log/apache2/deflate_log deflate
        LogLevel warn
        AddType "text/javascript;charset=UTF-8" .jsgz
        AddEncoding gzip .jsgz

        <Location />
            SetOutputFilter DEFLATE

            # Netscape 4.x has some problems...
            BrowserMatch ^Mozilla/4 gzip-only-text/html

            # Netscape 4.06-4.08 have some more problems
            BrowserMatch ^Mozilla/4\.0[678] no-gzip

            # MSIE masquerades as Netscape, but it is fine
            BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html

            # Don't compress images
            SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary

            # Make sure proxies don't deliver the wrong content
            Header append Vary User-Agent env=!dont-vary
            AddOutputFilter DEFLATE .js
        </Location>

        <Directory /home/typecraft/typecraft2/mediawiki>
            Options Includes Indexes FollowSymLinks MultiViews
            AllowOverride All
            Order allow,deny
            allow from all
        </Directory>
    </VirtualHost>
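To see at a glance which editor URLs the three TCEditor rules accept and where they are proxied, the mapping can be mimicked outside Apache. The sketch below is a rough approximation in shell with sed, not a substitute for mod_rewrite: the function name tceditor_target is hypothetical, and [0-9] stands in for \d.

```shell
#!/bin/sh
# tceditor_target PATH: print the JBoss URL that a /TCEditor request path
# would be proxied to, or nothing if none of the three rules match.
# Hypothetical helper that imitates the RewriteRule patterns with sed -E.
tceditor_target() {
    base='http://localhost:18080/tc2/jsp/tceditor.jsp'
    printf '%s\n' "$1" | sed -n -E \
        -e "s#^/TCEditor/?\$#${base}#p" \
        -e "s#^/TCEditor/([0-9]+)/?\$#${base}?id=\\1#p" \
        -e "s#^/TCEditor/([0-9]+)/([0-9]+)/?\$#${base}?id=\\1\\&pid=\\2#p"
}
```

For instance, tceditor_target /TCEditor/12/34 prints the tceditor.jsp URL with id=12 and pid=34, while a non-numeric path such as /TCEditor/abc prints nothing, which corresponds to Apache not proxying the request through these rules.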