Difference between revisions of "Norwegian HPSG grammar NorSource"

Revision as of 16:30, 14 April 2014

For the web demo of the Norwegian HPSG grammar Norsource, the sentences below can be used for illustration.

More complete test suites for basic verbal constructions are found in

License

Lesser General Public License For Linguistic Resources

History and Purpose of the grammar

NorSource is a so-called ‘deep’ computational grammar (‘DG’) of Norwegian, developed throughout the last 12 years. The grammar has been developed with a view to the following overall desiderata:

Desideratum 1. Encoding of Linguistic Meaning

As a ‘generic’ information repository, the DG should have a semantic component from which a reasoning capacity can be derived for any domain of discourse – possibly with the addition of concepts for the specific domains. It should be like a Fregean ‘Sinn’, in acting as a function from domains of use to models of interpretation. However, contrary to most artificial ‘reasoning’ devices, a DG must span the full complexity of a natural language, reflecting the size of its vocabulary and the complexity of its grammar. In this respect, the DG can also be seen as the materialization of a Generative Grammar, in the original sense of that notion.

Desideratum 2. Cross-grammar Generality

The content of the DG should to as large an extent as possible be phrased in terms used, or alignable with terms used, in other grammars and for other languages, thereby enabling linguistic comparison using the DG. By ‘content of the DG’ we mean both the content of the grammar files (formalism, notions used) and the content of its parse productions.

Desideratum 3. Interoperability

The DG should attain as much interoperability with other applications as possible. In general, what a digital ubiquitous research environment for linguistics should enable is an interconnectivity of data, researchers and processing facilities whereby from any point in an overall structure of components, a contribution can have its ramifications immediately implemented throughout the entire structure. Such interconnectivity will have to be manifested both on an ‘outer’ level enabling data flow and easy access, and on an ‘inner’ level ensuring information exchange from one system component to another. For a DG, thus, its files and productions (parses, etc.) should be transportable to other applications, and the codes in which its files are written should be readable by other applications, or able to be mapped into other codes.

Desideratum 4. Sustainability

The DG should be in such a format, and be situated in such an overall environment, that as much as possible of its capacity can be retained, independently of the particular persons maintaining it or the particular physical environments hosting it.

The first of the desiderata reflects a central concern throughout modern logic and philosophy of language, and in turn linguistics and Artificial Intelligence. Semantics being inevitably the basis for significant progress in cross-linguistic modelling, the desideratum has relevance also for desideratum 2. The grammar to be discussed belongs to a family of DGs whose design quite explicitly caters for this concern. This family of DGs has as its formal and theoretical framework HPSG (Pollard and Sag 1994, Sag et al. 2003), and started as a computational project through the LinGO initiative at CSLI, Stanford, using the LKB platform (Copestake 2002), which is a general platform with the format of typed feature-structures (TFS), and has integrated in it a format of semantic representation called Minimal Recursion Semantics (‘MRS’; cf. Copestake et al. 2005). Before 2000 there were three grammars in this framework: the English Resource Grammar (ERG), the Japanese grammar Jacy, and a German grammar. Essential to the development of further grammars in the family was the ‘HPSG Grammar Matrix’ (‘the Matrix’; see Bender et al. 2002, 2010), which was mainly based on the ERG, and had its first phase of deployment during the EU-project DeepThought (2002-4). The grammar family is currently developed within the frame of the DELPH-IN consortium, and will in the following be referred to as the ‘DELPH-IN grammars’.

The DG to be discussed was started in 2001, by linguists versed in Generative Grammar since the late 1960s, and in formal semantics (‘Montague Grammar’) since the mid 1970s. From the mid 1980s the group developed a computational lexicon (under the acronym ‘TROLL’, see Hellan et al. 1989), mainly associated with research within ‘consolidated GB’. In the late 1990s the group reoriented itself towards HPSG, and started the DG as part of the LinGO initiative with the LKB platform. The DG was the first grammar to be built on the Matrix, during the EU-project DeepThought (2002-4), and despite never receiving very substantial funding, it has retained a place among the medium-large DELPH-IN grammars. We can distinguish four main phases in its development:

 Phase 1, the Grounding phase (2001-04), 
 Phase 2, the Semantic Expansion phase (2005-07), 
 Phase 3, the Cross-Linguistic Coding phase (2008-10), 
 Phase 4, the Interoperability phase (2010-14).

Phase 1 consisted in building a basic core grammar around the Matrix skeleton (using the Matrix versions 0.1 – 0.6, as they developed; this included the MRS system). This stage included the accommodation of an 80,000-entry lexicon imported from the previously established resources TROLL and NorKompLex, where a verb valence code and a code for inflectional paradigms constituted major parts. Main publications from this period are: Hellan 2001, Hellan and Haugereid 2002, Hellan 2003.

Phase 2 consisted in the development of a fine-grained ontology and computing system of spatial and temporal relations, amenable to grammatical systems across languages and typologies, and a detailed semantics of comparative constructions. The grammar was also used as a part of a small Norwegian-Japanese MT system. In this period, the inflectional system was thoroughly revised. Main publications from this period are: Hellan and Beermann (2004), Beermann et al. (2004), Beermann and Hellan (2005).

Phase 3 was devoted to a thorough revision of the valence code, to accommodate a cross-linguistically defined classification system of valence and construction types. Main publications from this period are: Hellan (2008), Hellan and Dakubu (2009, 2010).

Phase 4 can be divided into the following themes:

A. Deploying the grammar in ‘external’ applications: a ‘Grammar Sparrer’, as described in Hellan et al. 2013, accessible at A Norwegian Grammar Sparrer. This is a construct along the lines of Bender et al. 2004, and Suppes et al. 2014, falling within the overall initiatives described in Heift and Schultze 2007, where specific types of grammatical mistakes are accommodated by ‘mal-rules’ in an extended ‘mal’-version of the grammar, and parses involving such mal-phenomena are reported to the user as tutoring instructions. This system has been running as a web demo since 2011.

B. Exporting information from the grammar to independent resources:

1. A valence bank, which, with the same exporting strategy as for Norwegian, also contains two other languages, constituting the first instance of an in-depth Multilingual Valence repository. In essence, the valence code used in verbal lexical types (cf. 3.2 below) is expanded to alternative and more easily inspectable formats, and the verb lexicons of the languages involved are imported into a database organized according to the newer codes, and searchable in terms of these codes. (See Hellan and Bruland 2013, and a web access at Multilingual Verb Valence Lexicon.)

2. A POS-tagger reflecting the lexical inventory of the grammar, useful for lexical acquisition from new text (http://regdili.idi.ntnu.no:8080/webtagger/tagger).

3. A simple Reasoner over movement and spatial information exported from the MRS. (See Bruland 2013.)
