Typecraft v2.5

India2011- Digital Linguistics

Latest revision as of 12:03, 20 October 2011

From 1 to 9 October 2011, NTNU will arrange a week-long event called India 2011, with India as its theme. The focus will be on broad cooperation in culture, research, higher education and business.

The present arrangement between the University of Hyderabad and, at NTNU, the Department of Language and Communication Studies and the Department of Modern Languages has Indian languages as its focus.

The arrangement of several talks and workshops, announced here, is part of NTNU's India week.


India is a continent of many languages. Ethnologue [1] lists 452 languages of India. The nation is not only rich in languages: grounded in work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.


Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages

October 1 - 9, 2011

In a workshop on Digital Language Description, Knowledge Representation and Formal Linguistics, linguists from Hyderabad and Trondheim will work together on the representation and formalisation of some of the salient aspects of selected languages from the Dravidian, the Indo-Aryan, the Tibeto-Burman and the Austro-Asiatic language families of India.


The workshop will take place in a digital communication environment. A group of linguists will work on qualitative language description and linguistic formalisation of Indian languages. Keynote talks addressing central issues in the digitisation and formalisation of Indic languages will be combined with group sessions dedicated to the documentation and formalisation of central Indic construction types. Legacy data will be digitised and enriched with further layers of annotation. Results of the workshop will be made accessible online using software developed at NTNU.

The arrangement situates modern approaches to language description and documentation in the environment in which the linguistic sciences arose, namely among the languages in the tradition of the formal description of Sanskrit, which dates back nearly 3000 years. Vibrant communities in Hyderabad and Trondheim will develop and refine methods of digitised formal language research together, with staff and students from both universities informing each other on formal, computational and empirical issues. Where the Sanskrit grammarian Pāṇini developed the first systematic symbolic approach to language description, the present arrangement focuses on symbolic approaches relative to current technologies and formal frameworks.



Keynote Talks

Several keynote talks will address central issues in the digitisation and formalisation of Indic languages.

The Architecture and Processing of Brahmi-Derived Scripts


Professor Gautam Sengupta, University of Hyderabad

The earliest material evidence of writing in India appears in the Ashokan inscriptions at Girnar, dating back to the 3rd century B.C. All the major writing systems of India, with the sole exception of Urdu, derive from this early Ashokan script known as Brāhmī. The Brāhmī-derived scripts are often called alpha-syllabaries on account of the fact that they are based upon the notion of orthographic syllable or akṣara. This talk will be about the basic architecture of the Brahmic scripts of India and how they are processed in reading.
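
The notion of the orthographic syllable can be made concrete with a small sketch. The Python snippet below splits Devanāgarī text into akṣaras using a deliberately simplified pattern: any number of consonant-plus-virāma pairs followed by a consonant with an optional vowel sign, or an independent vowel, each optionally followed by signs such as the anusvāra. The function name `split_aksaras`, the restriction to Devanāgarī and the Unicode ranges chosen are assumptions made for illustration only. This is not the model presented in the talk, and it ignores joiners, Vedic signs and all other Brāhmī-derived scripts.

```python
import re

# Simplified Devanagari character classes (assumption: basic block only).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                  # sign suppressing the inherent vowel
VOWEL_SIGN = "[\u093E-\u094C]"                     # dependent vowel signs
MODIFIER = "[\u0900-\u0903]"                       # candrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

# One aksara: (C + virama)* C (vowel sign | virama)? modifiers*,
# or an independent vowel followed by optional modifiers.
AKSARA = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VOWEL_SIGN}|{VIRAMA})?{MODIFIER}*"
    f"|{INDEPENDENT_VOWEL}{MODIFIER}*"
)


def split_aksaras(text: str) -> list[str]:
    """Return the orthographic syllables found in `text` (hypothetical helper)."""
    return AKSARA.findall(text)


if __name__ == "__main__":
    # "hindī" has six code points but only two aksaras.
    print(" | ".join(split_aksaras("\u0939\u093F\u0928\u094D\u0926\u0940")))
    # "kṣatriya": the conjuncts kṣa and tri each form a single aksara.
    print(" | ".join(split_aksaras("\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F")))
```

The point of the sketch is that the unit a reader sees (the akṣara) does not coincide with either a single code point or a phonological syllable: a consonant cluster is written as one unit together with the following vowel.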

A unitary system for formal multilingual classification and a digital platform for cross-level representation

Professor Lars Hellan


This talk first shows the feasibility of designing a cross-linguistically valid system of syntactic-semantic representation, emphasizing both the content of the categories used and the function of a grammar formalism as representing structure and not just processing structure. The talk then shows a strategy for how standard formats of sentence annotation (such as Interlinear Glossing) can be made to communicate with a level of representation satisfying the above desiderata.

The Syntax and Semantics of Non-nominative Subject in South Asian languages


Professor K.V. Subbarao

I discuss the nature of case marking (lexical/inherent vs. structural), the choice of case on the subject and object in non-nominative subject (hereafter, NNS) constructions, general trends in South Asian languages (SALs), and the variation by genetic affiliation or subregion. I first provide a brief description of NNSs in SALs, keeping in view the notion of subject. I shall then discuss some subject properties of NNSs. I argue that (i) the predicate in a dative subject construction (DSC) is [-transitive] and unaccusative; (ii) all NNSs except the ergative are inherently case-marked; (iii) such inherent case marking cannot be done by an intransitive verb alone, but only by the whole predicate, which compositionally consists of a theme or an adjective along with the [-transitive] verb; and (iv) information concerning agreement should be available vP-internally (in the lower thematic S) for proper assignment of inherent case to the NNS. I shall show that the accusative/dative case marking of the theme in dative/genitive subject constructions in Bangla, Tamil and Malayalam does not count as counter-evidence to treating the predicate in NNS constructions as [-transitive].




Workshop

The workshop will be introduced by a talk on

Collaborative corpus creation - qualitative and quantitative linguistic methods

by

Associate Professor Dorothee Beermann, NTNU.

In this workshop we will explore the possibilities that e-Research offers for linguists working on Indic languages. In my talk I will discuss the possibilities that open access to scientific data offers for linguists working in the Humanities. Work with data, from its creation to its integration into a publication, is often perceived as a chore. Given the right tools, however, it can become a meaningful part of the linguistic investigation. The standard format for linguistic data in the Humanities is Interlinear Glossed Text. Such texts represent a valuable resource, even though linguists tend to disagree about the role of data and the methods by which data should influence linguistic exploration. In describing the components of the TypeCraft system, we focus in this talk on the potential that an online linguistic data management system offers for the description and documentation of Indic languages, real-time data sharing, and the continuous dissemination of research results.
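
To illustrate what Interlinear Glossed Text looks like as structured data, here is a minimal Python sketch. The class names `Word` and `Phrase`, their fields, and the Hindi example sentence with its glosses are assumptions chosen for this illustration; they do not describe the actual TypeCraft data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    form: str   # word form, morphemes separated by "-"
    gloss: str  # one gloss per morpheme, also separated by "-"

    def is_aligned(self) -> bool:
        """True if form and gloss contain the same number of morphemes."""
        return len(self.form.split("-")) == len(self.gloss.split("-"))


@dataclass
class Phrase:
    words: List[Word]
    translation: str

    def render(self) -> str:
        """Lay out forms and glosses in vertically aligned columns."""
        widths = [max(len(w.form), len(w.gloss)) for w in self.words]
        forms = "  ".join(w.form.ljust(n) for w, n in zip(self.words, widths))
        glosses = "  ".join(w.gloss.ljust(n) for w, n in zip(self.words, widths))
        return f"{forms.rstrip()}\n{glosses.rstrip()}\n'{self.translation}'"


# Illustrative Hindi sentence, glossed by hand for this sketch.
phrase = Phrase(
    words=[
        Word("laṛk-ā", "boy-M.SG"),
        Word("kitāb", "book"),
        Word("paṛh-t-ā", "read-IPFV-M.SG"),
        Word("hai", "be.PRS.3SG"),
    ],
    translation="The boy reads a book.",
)

if __name__ == "__main__":
    # Every word should have as many glosses as morphemes.
    assert all(w.is_aligned() for w in phrase.words)
    print(phrase.render())
```

Keeping form and gloss as parallel, morpheme-segmented strings makes simple consistency checks such as `is_aligned` possible before data are shared or published.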


WORKSHOP PROGRAM

The Workshop is supported by India 2011 (http://www.ntnu.no/india2011)

October 3rd-7th

Meals during the day are provided at university facilities.

Monday

9:15-10:30 Language Documentation, Dorothee Beermann (LinLab, Building 4, Dragvoll)
Tea
10:45-12:00 Keynote, Gautam Sengupta (LinLab)
Lunch
13:15-14:30 Hands-on: Introduction to TypeCraft (LinLab and CompLab)
Afternoon Tea
15:00-16:45 Hands-on: Creation of research corpora, verb annotation and discussion (LinLab and CompLab)

Tuesday

9:15-10:30 Keynote, K.V. Subbarao (LinLab)
Tea
10:45-12:00 Discussion: Indian construction types (LinLab)
Lunch
13:30-14:30 Hands-on: Classifying and annotating construction types (CompLab and offices)
Afternoon Tea
15:00-16:45 Second afternoon session

Wednesday

9:15-10:30 Keynote, Lars Hellan (LinLab)
Tea
10:45-12:00 Discussion: AVMs (LinLab)
Lunch
13:15-14:30 Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices)
Afternoon Tea
15:00-16:45 Second afternoon session

Thursday

All day: Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices), with meals as usual
Evening session: public talk at Dokkhuset, 19:00-19:30, Gautam Sengupta: The Aksara-Based Script Systems of India

Friday

All day at Gløshaugen Campus, the IT Building, room 054: Interconnecting Digital Linguistics and NLP
Contributors: Tore Bruland, Anil Kumar Singh
Lunch at Gløshaugen
Discussion of digital tools and sustainability of distributive research using community platforms.
Note: COLING 2012 (http://www.cicling.org/2012)




The workshop features two sections:

Multilingual text processing, interlinear annotation and formalisation of Indic languages

Using natural language processing tools and linguistic web technology developed at the University of Hyderabad and at NTNU, we will create small research corpora, which we will annotate for salient linguistic properties with the goal of deriving Attribute Value Matrix notations from these annotations.


List of Workshop Languages
Language name | Language family | Script
Banglā (Bengali) | Indo-Aryan | Banglā
Hindi | Indo-Aryan | Devanāgarī
Punjabi | Indo-Aryan | Gurmukhi
Malayālam | Dravidian | Malayāḷalipi
Khasi | Austro-Asiatic | Roman
Angami | Tibeto-Burman | Roman

Grammatical construction types across Indian languages

Using methods of formal linguistic representation such as 'attribute value matrices' (AVMs), a systematic comparison of representatives of each of the major language families spoken in India will be conducted, focusing on a limited set of sentential construction types. The languages and their families are those listed above.
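
As a rough illustration of how AVMs can support such a comparison, the Python sketch below models an AVM as a nested dictionary and merges two of them by unification, failing when their values clash. The attribute names (`HEAD`, `SUBJ`, `CASE`, `PERS`), the example structures and the function `unify` are hypothetical and chosen only for this sketch. The workshop's own formalism is not specified here, and the sketch leaves out structure sharing (reentrancy), which full AVM formalisms provide.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Merge two feature structures; return None if any values clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy, so the inputs are left unchanged
        for attr, value in b.items():
            if attr in result:
                merged = unify(result[attr], value)
                if merged is None:
                    return None  # conflicting values under the same attribute
                result[attr] = merged
            else:
                result[attr] = value
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


# Hypothetical construction description: a predicate taking a dative subject.
dative_subject_cx = {"HEAD": {"POS": "verb"}, "SUBJ": {"CASE": "dat"}}
# Hypothetical information derived from sentence annotations.
annotated_subject = {"SUBJ": {"CASE": "dat", "PERS": "3"}}
nominative_subject = {"SUBJ": {"CASE": "nom"}}

if __name__ == "__main__":
    # Compatible information is combined into one structure.
    print(unify(dative_subject_cx, annotated_subject))
    # Incompatible case values make unification fail.
    print(unify(dative_subject_cx, nominative_subject))  # None
```

In such a setup, a construction type could be stated once as a partial AVM and then checked against annotated sentences from each language.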

References

Soma Paul Tross paper

  1. Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.
