India 2011 - Digital Linguistics

From 1 to 9 October 2011, NTNU will arrange a week-long event called India 2011, with India as its theme. The focus will be on broad cooperation in culture, research, higher education and business.

The present arrangement between the University of Hyderabad and NTNU's Department of Language and Communication Studies and Department of Modern Languages has Indian languages as its focus.

The talks and workshops announced here are part of NTNU's India week.

India is a subcontinent of many languages. Ethnologue [1] lists 452 languages of India. The nation is rich not only in languages: grounded in work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.

1 Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages
2 Keynote Talks
3 Workshop
- 3.1 Collaborative corpus creation - qualitative and quantitative linguistic methods
4 WORKSHOP PROGRAM
- 4.1 Multilingual text processing, interlinear annotation and formalisation of Indic languages
- 4.2 Grammatical construction types across Indian languages
5 References

Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages

October 1 - 9, 2011

In a workshop on Digital Language Description, Knowledge Representation and Formal Linguistics, linguists from Hyderabad and Trondheim will work together on the representation and formalisation of some of the salient aspects of selected languages from the Dravidian, the Indo-Aryan, the Tibeto-Burman and the Austro-Asiatic language families of India.

The workshop will take place in a digital communication environment. A group of linguists will work on qualitative language description and linguistic formalisation of Indian languages. Keynote talks addressing central issues in the digitisation and formalisation of Indic languages will be combined with group sessions dedicated to the documentation and formalisation of central Indic construction types. Legacy data will be digitised and enriched with further layers of annotation. Results of the workshop will be made accessible online using software developed at NTNU.

The arrangement situates modern approaches to language description and documentation in the setting where the linguistic sciences arose, namely the tradition of formal description of Sanskrit, which dates back nearly 3000 years. Vibrant communities in Hyderabad and Trondheim will jointly develop and refine methods of digitised formal language research, with staff and students from both universities informing each other on formal, computational and empirical issues. Where the Sanskrit grammarian Pāṇini made the first systematic symbolic approach to language description, the present arrangement focuses on symbolic approaches relative to current technologies and formal frameworks.

Keynote Talks

Several keynote talks will address central issues in the digitisation and formalisation of Indic languages.

The Architecture and Processing of Brahmi-Derived Scripts

Professor Gautam Sengupta, University of Hyderabad

The earliest material evidence of writing in India appears in the Ashokan inscriptions at Girnar, dating back to the 3rd century B.C. All the major writing systems of India, with the sole exception of Urdu, derive from this early Ashokan script, known as Brāhmī. The Brāhmī-derived scripts are often called alpha-syllabaries because they are based upon the notion of the orthographic syllable, or akṣara. This talk will be about the basic architecture of the Brāhmī-derived scripts of India and how they are processed in reading.
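The notion of the akṣara can be illustrated with a minimal Python sketch. It is not taken from the talk; it assumes Devanāgarī input in Unicode and a deliberately simplified rule: a unit begins at a consonant or independent vowel, combining marks (vowel signs, nukta, anusvāra, virāma) attach to the current unit, and a consonant that follows a virāma continues a conjunct. The function names are hypothetical, and cases such as joiner characters or word-final virāma conventions are not handled.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA; suppresses the inherent vowel


def is_devanagari_consonant(ch):
    """True for the basic consonants (U+0915-U+0939) and the nukta forms (U+0958-U+095F)."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"


def split_aksharas(text):
    """Split text into orthographic syllables under the simplified rule above."""
    aksharas = []
    current = ""
    for ch in text:
        # Unicode general categories Mn/Mc cover vowel signs, virama, nukta, anusvara.
        is_mark = unicodedata.category(ch).startswith("M")
        # A consonant directly after a virama continues the same conjunct cluster.
        joins_conjunct = current.endswith(VIRAMA) and is_devanagari_consonant(ch)
        if current and (is_mark or joins_conjunct):
            current += ch
        else:
            if current:
                aksharas.append(current)
            current = ch
    if current:
        aksharas.append(current)
    return aksharas


if __name__ == "__main__":
    # The word "namaste", written as explicit code points: na, ma, sa, virama, ta, vowel sign e.
    word = "\u0928\u092e\u0938\u094d\u0924\u0947"
    for unit in split_aksharas(word):
        print(unit, [f"U+{ord(c):04X}" for c in unit])
```

Under these assumptions the six code points of the example word group into three units (na, ma, ste), which shows why such scripts are processed in syllable-sized blocks rather than letter by letter. Characters outside Devanāgarī, such as spaces, simply become units of their own.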

A unitary system for formal multilingual classification and a digital platform for cross-level representation

Professor Lars Hellan

This talk first shows the feasibility of designing a cross-linguistically valid system of syntactic-semantic representation, emphasizing both the content of the categories used and the function of a grammar formalism as representing structure and not just processing structure. The talk then shows a strategy for making standard formats of sentence annotation (such as Interlinear Glossing) communicate with a level of representation satisfying the above desiderata.

The Syntax and Semantics of Non-nominative Subject in South Asian languages

Professor K.V. Subbarao

I discuss the nature of case marking (lexical/inherent vs. structural), the choice of case on the subject and object in non-nominative subject (hereafter, NNS) constructions, general trends in South Asian languages (SALs), and the variation by genetic affiliation or subregion. I first provide a brief description of NNSs in SALs, keeping in view the notion of subject. I shall then discuss some subject properties of NNSs. I argue that (i) the predicate in a dative subject construction (DSC) is [-transitive] and unaccusative; (ii) all NNSs except the ergative are inherently case-marked; (iii) such inherent case marking cannot be done by an intransitive verb alone, but by the whole predicate compositionally consisting of a theme or an adjective along with the [-transitive] verb; and (iv) information concerning agreement should be available vP-internally (in the lower thematic S) for proper assignment of inherent case to the NNS. I shall show that the accusative/dative case marking of the theme in dative/genitive subject constructions in Bangla, Tamil and Malayalam does not count as counter-evidence to treating the predicate in NNS constructions as [-transitive].

Workshop

The workshop will be introduced by a talk on

Collaborative corpus creation - qualitative and quantitative linguistic methods

by

Associate Professor Dorothee Beermann, NTNU.

In this workshop we will explore the possibilities that e-Research offers for linguists working on Indic languages. In my talk I will discuss the possibilities that open access to scientific data offers for linguists working in the Humanities. Work with data, from its creation to its integration into a publication, is often perceived as a chore. Given the right tools, however, it can become a meaningful part of the linguistic investigation. The standard format for linguistic data in the Humanities is Interlinear Glossed Text. Such data represent a valuable resource, even though linguists tend to disagree about the role data should play in linguistic exploration and the methods by which they should influence it. In describing the components of the TypeCraft system, this talk focuses on the potential that an online linguistic data management system offers for the description and documentation of Indic languages, real-time data sharing, and the continuous dissemination of research results.
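As an illustration of what Interlinear Glossed Text looks like as structured data, the following Python sketch stores each word as a list of morphemes paired with glosses and prints the aligned form, gloss and translation lines. It is only a sketch: the class and function names are hypothetical, the layout is not the TypeCraft data model, and the Hindi sentence with its glosses is an illustrative example rather than workshop data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GlossedWord:
    """One word: its morphemes and one gloss per morpheme."""
    morphemes: List[str]
    glosses: List[str]

    def __post_init__(self):
        if len(self.morphemes) != len(self.glosses):
            raise ValueError("each morpheme needs exactly one gloss")


def render_igt(words, translation):
    """Return three lines: forms, glosses (column-aligned), and a free translation."""
    forms = ["-".join(w.morphemes) for w in words]
    glosses = ["-".join(w.glosses) for w in words]
    # Each column is as wide as the longer of the form and its gloss.
    widths = [max(len(f), len(g)) for f, g in zip(forms, glosses)]
    form_line = "  ".join(f.ljust(n) for f, n in zip(forms, widths)).rstrip()
    gloss_line = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return "\n".join([form_line, gloss_line, f"'{translation}'"])


if __name__ == "__main__":
    # Illustrative Hindi sentence in romanisation (an assumption, not workshop data).
    sentence = [
        GlossedWord(["laṛkā"], ["boy"]),
        GlossedWord(["kitāb"], ["book"]),
        GlossedWord(["paṛh", "tā"], ["read", "IPFV.M.SG"]),
        GlossedWord(["hai"], ["be.PRS.3SG"]),
    ]
    print(render_igt(sentence, "The boy reads a book."))
```

Keeping morphemes and glosses as parallel lists, rather than as one pre-formatted string, is what allows further annotation layers to be added later and makes the data searchable across languages. Column alignment here counts code points, so it assumes precomposed characters in the romanisation.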

WORKSHOP PROGRAM

The Workshop is supported by India 2011

October 3rd-7th

Meals during the day are provided at university facilities.

Monday
- 9:15-10:30 Language Documentation, Dorothee Beermann (LinLab, Building 4, Dragvoll)
- Tea
- 10:45-12:00 Keynote, Gautam Sengupta (LinLab)
- Lunch
- 13:15-14:30 Hands-on: Introduction to TypeCraft (LinLab and CompLab)
- Afternoon tea
- 15:00-16:45 Hands-on: Creation of research corpora, verb annotation and discussion (LinLab and CompLab)

Tuesday
- 9:15-10:30 Keynote, K.V. Subbarao (LinLab)
- Tea
- 10:45-12:00 Discussion: Indian construction types (LinLab)
- Lunch
- 13:30-14:30 Hands-on: Classifying and annotating construction types (CompLab and offices)
- Afternoon tea
- 15:00-16:45 Second afternoon session

Wednesday
- 9:15-10:30 Keynote, Lars Hellan (LinLab)
- Tea
- 10:45-12:00 Discussion: AVMs (LinLab)
- Lunch
- 13:15-14:30 Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices)
- Afternoon tea
- 15:00-16:45 Second afternoon session

Thursday
- All day: Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices), with meals as usual
- Evening session: public talk at Dokkhuset, 19:00-19:30, Gautam Sengupta: The Aksara-Based Script Systems of India

Friday
- All day at Gløshaugen Campus, the IT Building, room 054: Interconnecting Digital Linguistics and NLP
- Contributors: Tore Bruland, Anil Kumar Singh
- Lunch at Gløshaugen
- Discussion of digital tools and sustainability of distributive research using community platforms
- Note: COLING 2012

The workshop features two sections:

Multilingual text processing, interlinear annotation and formalisation of Indic languages

Using natural language processing tools and linguistic web technology developed at the University of Hyderabad and at NTNU, we will create small research corpora, which we will annotate for salient linguistic properties with the goal of deriving Attribute Value Matrix notations from these annotations.

**List of Workshop Languages**
Language name	Language Family	Script
Banglā (Bengali)	Indo-Aryan	Banglā
Hindi	Indo-Aryan	Devanāgarī
Punjabi	Indo-Aryan	Gurmukhi
Malayālam	Dravidian	Malayāḷalipi
Khasi	Austro-Asiatic	Roman
Angami	Tibeto-Burman	Roman

Grammatical construction types across Indian languages

Using methods of formal linguistic representation such as 'attribute value matrices' (AVMs), a systematic comparison of representatives of each of the major language families spoken in India will be conducted, focusing on a limited set of sentential construction types. The languages and their families are those listed above.
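For readers unfamiliar with AVMs, the following Python sketch shows one common way to model them: as nested dictionaries of attributes and values, combined by unification, the standard operation in feature-structure formalisms. The attribute names (PRED, SUBJ, CASE, PERS) and the function name are hypothetical and are not the labels used in the workshop. Structure sharing (reentrancy) is not modelled, and `None` is reserved as the failure signal, so it cannot be used as a value.

```python
def unify(a, b):
    """Unify two AVMs given as nested dicts; return the merged AVM or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are left unchanged
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values somewhere below this attribute
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


if __name__ == "__main__":
    # Hypothetical partial descriptions of a dative subject construction.
    from_case_annotation = {"SUBJ": {"CASE": "dat"}}
    from_agreement_annotation = {"PRED": "like", "SUBJ": {"PERS": "3"}}
    print(unify(from_case_annotation, from_agreement_annotation))

    # A clash: the same attribute carries two different atomic values.
    print(unify({"SUBJ": {"CASE": "dat"}}, {"SUBJ": {"CASE": "nom"}}))
```

The first call merges the two partial descriptions into a single AVM, while the second returns `None` because the case values conflict. Comparing construction types across languages then amounts to asking which attribute-value specifications the languages share and where they clash.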

References

Soma Paul Tross paper

[1] Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.
