India 2011 - Digital Linguistics

From 1 to 9 October 2011, NTNU will arrange a week-long event called India 2011, with India as its theme. The focus will be on broad cooperation in culture, research, higher education and business.

The present event, organised jointly by the University of Hyderabad and by the Department of Language and Communication Studies and the Department of Modern Languages at NTNU, has Indian languages as its focus.

The programme of talks and workshops announced here is part of NTNU's India week.

India is a subcontinent of many languages. Ethnologue [1] lists 452 languages for India. The nation is rich not only in languages but also in linguistic scholarship: building on work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.

1 Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages
2 Keynote Talks
- 2.1 The Architecture and Processing of Brahmi-Derived Scripts
- 2.2 A unitary system for formal multilingual classification and a digital platform for cross-level representation
3 Workshop -- Workshop preparations
4 References

Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages

October 1 - 9, 2011

In a workshop on Digital Language Description, Knowledge Representation and Formal Linguistics, linguists from Hyderabad and Trondheim will work together on the representation and formalisation of some of the salient aspects of selected languages from the Dravidian, the Indo-Aryan, the Tibeto-Burman and the Austro-Asiatic language families of India.

The workshop will take place in a digital communication environment. A group of linguists will work on qualitative language description and linguistic formalisation of Indian languages. Keynote talks addressing central issues in the digitisation and formalisation of Indic languages will be combined with group sessions dedicated to the documentation and formalisation of central Indic construction types. Legacy data will be digitised and enriched with further layers of annotation. Results of the workshop will be made accessible online using software developed at NTNU.

The event situates modern approaches to language description and documentation in the setting in which the linguistic sciences arose, namely the tradition of formal description of Sanskrit, which dates back nearly 3000 years. Vibrant communities in Hyderabad and Trondheim will develop and refine methods of digitised formal language research together, with staff and students from both universities informing each other on formal, computational and empirical issues. Where the Sanskrit grammarian Pāṇini made the first systematic symbolic approach to language description, the present event focuses on symbolic approaches relative to current technologies and formal frameworks.

Keynote Talks

Several keynote talks will address central issues in the digitisation and formalisation of Indic languages.

The Architecture and Processing of Brahmi-Derived Scripts

Professor Sengupta

The earliest material evidence of writing in India appears in the Ashokan inscriptions at Girnar, dating back to the 3rd century B.C. All the major writing systems of India, with the sole exception of Urdu, derive from this early Ashokan script known as Brāhmī. The Brāhmī-derived scripts are often called alpha-syllabaries on account of the fact that they are based upon the notion of orthographic syllable or akṣara. This talk will be about the basic architecture of the Brahmic scripts of India and how they are processed in reading.
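As a rough illustration of the akṣara as an orthographic unit, the sketch below groups Devanagari code points into orthographic syllables. It is a simplification assumed for this page, not the model presented in the talk: it only applies the rule that a consonant following a virama (U+094D) joins the preceding conjunct, and that combining marks attach to their base. The function name `split_aksharas` is made up for the example, and cases such as zero-width joiners are not handled.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama, which suppresses the inherent vowel


def split_aksharas(text):
    """Group code points into approximate orthographic syllables."""
    units = []
    for ch in text:
        category = unicodedata.category(ch)
        # Vowel signs, virama, anusvara etc. are combining marks (Mn/Mc).
        is_mark = category.startswith("M")
        # A letter directly after a virama continues the same conjunct.
        joins_conjunct = (
            bool(units) and units[-1].endswith(VIRAMA) and category == "Lo"
        )
        if units and (is_mark or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# "namaste" in Devanagari: six code points, three orthographic syllables
print(" | ".join(split_aksharas("नमस्ते")))  # न | म | स्ते
```

The word is stored as six code points but read as three units, which is the mismatch between encoding and akṣara that any processing of Brāhmī-derived scripts has to bridge.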

A unitary system for formal multilingual classification and a digital platform for cross-level representation

Professor Hellan

This talk first shows the feasibility of designing a cross-linguistically valid system of syntactic-semantic representation, emphasizing both the content of the categories used and the function of a grammar formalism as representing structure and not just processing structure. The talk then shows a strategy for how standard formats of sentence annotation (such as Interlinear Glossing) can be made to communicate with a level of representation satisfying the above desiderata. Examples will be from non-Indic languages, but with suggestions for applying the strategy to certain Indic phenomena.

Workshop -- Workshop preparations

The workshop will be introduced by a talk on "Collaborative corpus creation - qualitative and quantitative linguistic methods" by Associate Professor Dorothee Beermann, NTNU.

In this workshop we will explore the possibilities that e-Research offers for linguists working on Indic languages. In my talk I will discuss the possibilities that open access to scientific data offers for linguistic work in the Humanities. Work with data, from its creation to its integration into a publication, is often perceived as a chore. Given the right tools, however, it can become a meaningful part of the linguistic investigation. The standard format for linguistic data in the Humanities is Interlinear Glossed Text. Data in this format represent a valuable resource, even though linguists tend to disagree about the role data should play in linguistic exploration and the methods by which they should influence it. In describing the components of the TypeCraft system, we focus in this talk on the potential that an online linguistic data management system offers for the description and documentation of Indic languages, real-time data sharing, and the continuous dissemination of research results.

The workshop features two sections:

Multilingual text processing, interlinear annotation and formalisation of Indic languages

Using natural language processing tools and linguistic web technology developed at the University of Hyderabad and at NTNU, we will create small research corpora, which we will annotate for salient linguistic properties with the goal of deriving Attribute Value Matrix notations from these annotations.
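To make the step from annotation to Attribute Value Matrix more concrete, here is a minimal sketch of how glossed tokens might be turned into nested attribute-value structures. Everything in it is an illustrative assumption rather than the workshop's actual tooling or the TypeCraft data format: the Hindi example sentence, its simplified glosses, the tag tables `AGR` and `TAM`, and the function `token_to_avm`.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # transliterated word form
    gloss: str  # simplified gloss: lemma, then tags separated by "." or "-"
    pos: str    # part-of-speech label


# Hypothetical mappings from gloss tags to (attribute, value) pairs.
AGR = {
    "M": ("GEND", "masc"), "F": ("GEND", "fem"),
    "SG": ("NUM", "sg"), "PL": ("NUM", "pl"),
    "3": ("PERS", "3"),
}
TAM = {"PRS": ("TENSE", "pres"), "IPFV": ("ASPECT", "ipfv")}


def token_to_avm(tok):
    """Build a nested dict that stands in for an AVM of one token."""
    lemma, *tags = re.split(r"[.-]", tok.gloss)
    avm = {"FORM": tok.form, "PRED": lemma, "CAT": tok.pos}
    for tag in tags:
        if tag in AGR:
            attr, val = AGR[tag]
            avm.setdefault("AGR", {})[attr] = val
        elif tag in TAM:
            attr, val = TAM[tag]
            avm.setdefault("TAM", {})[attr] = val
    return avm


# Hindi: "The boy reads a book." (glosses simplified for the example)
sentence = [
    Token("laṛkā", "boy.M.SG", "N"),
    Token("kitāb", "book.F.SG", "N"),
    Token("paṛhtā", "read-IPFV.M.SG", "V"),
    Token("hai", "be.PRS.3.SG", "V"),
]

for tok in sentence:
    print(token_to_avm(tok))
```

Nested dictionaries are used here only because they are the simplest structure that mirrors the attribute-value nesting; a real system would also need typed features and structure sharing.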

**List of Workshop Languages**
Language name	Language Family	Script
Banglā (Bengali)	Indo-Aryan	Banglā
Hindi	Indo-Aryan	Devanāgarī
Punjabi	Indo-Aryan	Gurmukhi
Malayāḷam	Dravidian	Malayāḷalipi
Khasi	Austro-Asiatic	Roman
Angami	Tibeto-Burman	Roman

Grammatical construction types across Indian languages

Using methods of formal linguistic representation such as attribute value matrices (AVMs), a systematic comparison of representatives of each of the major language families spoken in India will be conducted, focusing on a limited set of sentential construction types. The languages and their families are those listed above.

References

[1] Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.
