
India2011- Digital Linguistics

Latest revision as of 12:03, 20 October 2011

From 1 to 9 October 2011, NTNU will arrange a week-long event called India 2011, with India as its theme. The focus will be on broad cooperation in culture, research, higher education and business.

The present arrangement between the University of Hyderabad and the Department of Language and Communication Studies and the Department of Modern Languages at NTNU has Indian languages as its focus.

The arrangement of several talks and workshops, announced here, is part of NTNU's India week.


India is a continent of many languages. Ethnologue [1] lists 452 languages of India. The nation is not only rich in languages. Grounded in work dating back to Pāṇini, Indian linguistics has had a significant influence on the development of linguistics as we know it today.


Digital Language Description, Knowledge Representation and Formal Linguistics for Indic Languages

October 1 - 9, 2011

In a workshop on Digital Language Description, Knowledge Representation and Formal Linguistics, linguists from Hyderabad and Trondheim will work together on the representation and formalisation of some of the salient aspects of selected languages from the Dravidian, the Indo-Aryan, the Tibeto-Burman and the Austro-Asiatic language families of India.


The workshop will take place in a digital communication environment. A group of linguists will work on qualitative language description and linguistic formalisation of Indian languages. Keynote talks addressing central issues in the digitisation and formalisation of Indic languages will be combined with group sessions dedicated to the documentation and formalisation of central Indic construction types. Legacy data will be digitised and enriched with further layers of annotation. Results of the workshop will be made accessible online using software developed at NTNU.

The arrangement situates modern approaches to language description and documentation in the environment where the linguistic sciences arose, namely the tradition of formal description of Sanskrit, which dates back nearly 3000 years. Vibrant communities in Hyderabad and Trondheim will develop and refine methods of digitised formal language research together, with staff and students from both universities informing each other on formal, computational and empirical issues. Where the Sanskrit grammarian Pāṇini made the first systematic symbolic approach to language description, the present arrangement focuses on symbolic approaches relative to current technologies and formal frameworks.



Keynote Talks

Several keynote talks will address central issues in the digitisation and formalisation of Indic languages.

The Architecture and Processing of Brahmi-Derived Scripts


Professor Gautam Sengupta, University of Hyderabad

The earliest material evidence of writing in India appears in the Ashokan inscriptions at Girnar, dating back to the 3rd century B.C. All the major writing systems of India, with the sole exception of Urdu, derive from this early Ashokan script known as Brāhmī. The Brāhmī-derived scripts are often called alpha-syllabaries on account of the fact that they are based upon the notion of orthographic syllable or akṣara. This talk will be about the basic architecture of the Brahmic scripts of India and how they are processed in reading.
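To make the notion of an orthographic syllable concrete, here is a minimal sketch that splits Devanagari text into akṣara-like units. It is not taken from the talk: the function name `split_aksharas` and the simplified rule (a consonant cluster joined by virama, followed by an optional vowel sign and modifiers, or an independent vowel with modifiers) are assumptions made for illustration, and the pattern ignores joiners, Vedic signs and other scripts.

```python
import re
from typing import List

# Simplified Devanagari character classes (an assumption, not a full model).
_CONS = "[\u0915-\u0939\u0958-\u095f]\u093c?"  # consonant, optional nukta
_VIRAMA = "\u094d"                              # vowel-killer sign
_MATRA = "[\u093e-\u094c]"                      # dependent vowel signs
_MOD = "[\u0900-\u0903]"                        # candrabindu, anusvara, visarga
_IVOWEL = "[\u0904-\u0914]"                     # independent vowels

# An akshara is taken to be either
#   (consonant + virama)* consonant [vowel sign | virama] modifiers*
# or an independent vowel plus modifiers; anything else is one unit.
_AKSHARA = re.compile(
    f"(?:{_CONS}{_VIRAMA})*{_CONS}(?:{_MATRA}|{_VIRAMA})?{_MOD}*"
    f"|{_IVOWEL}{_MOD}*"
    "|.",
    re.DOTALL,
)


def split_aksharas(text: str) -> List[str]:
    """Split text into akshara-like units under the simplified rule above."""
    return _AKSHARA.findall(text)


if __name__ == "__main__":
    # The word spelled na, ma, sa + virama + ta + vowel sign e
    print(split_aksharas("\u0928\u092e\u0938\u094d\u0924\u0947"))
```

Under this rule a conjunct such as sa + virama + ta + vowel sign stays together as one unit, which is the sense in which these scripts are organised around the akṣara and not around individual letters.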

A unitary system for formal multilingual classification and a digital platform for cross-level representation

Professor Hellan


This talk first shows the feasibility of designing a cross-linguistically valid system of syntactic-semantic representation, emphasizing both the content of the categories used and the function of a grammar formalism as representing structure and not just processing structure. The talk then shows a strategy for how standard formats of sentence annotation (such as Interlinear Glossing) can be made to communicate with a level of representation satisfying the above desiderata.

The Syntax and Semantics of Non-nominative Subject in South Asian languages


Professor K.V. Subbarao

I discuss the nature of case marking (lexical/inherent vs. structural), the choice of case on the subject and object in non-nominative subject (hereafter, NNS) constructions, general trends in South Asian languages (SALs), and the variation by genetic affiliation or subregion. I provide a brief description of NNSs in SALs first, keeping in view the notion of subject. I shall then discuss some subject properties of NNSs. I argue that (i) the predicate in a dative subject construction (DSC) is [-transitive] and unaccusative; (ii) all NNSs except the ergative are inherently case-marked; (iii) such inherent case marking cannot be done by an intransitive verb alone, but by the whole predicate compositionally consisting of a theme or an adjective along with the [-transitive] verb; and (iv) information concerning agreement should be available vP-internally (in the lower thematic S) for proper assignment of inherent case to the NNS. I shall show that the accusative/dative case marking of the theme in dative/genitive subject constructions in Bangla, Tamil and Malayalam does not count as counter-evidence to treating the predicate in NNS constructions as [-transitive].




Workshop

The workshop will be introduced by a talk on

Collaborative corpus creation - qualitative and quantitative linguistic methods

by

Associate Professor Dorothee Beermann, NTNU.

In this workshop we will explore the possibilities that e-Research offers for linguists working on Indic languages. In my talk I will discuss the possibilities that open access to scientific data offers for linguists working in the Humanities. Work with data, from its creation to its integration into a publication, is often perceived as a chore. Given the right tools, however, it can become a meaningful part of the linguistic investigation. The standard format for linguistic data in the Humanities is Interlinear Glossed Text. Such texts represent a valuable resource, even though linguists tend to disagree about the role of data in linguistic exploration and the methods by which it should influence that exploration. In describing the components of the TypeCraft system, we focus in this talk on the potential that an online linguistic data management system offers for the description and documentation of Indic languages, real-time data sharing, and the continuous dissemination of research results.
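As an illustration of what Interlinear Glossed Text looks like as data, the sketch below models a glossed word as a list of morphemes paired with gloss tags and renders the usual aligned lines. The class `GlossedWord`, the function `render_igt` and the English toy example are hypothetical; this is not TypeCraft's data model or export format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GlossedWord:
    """One word of an interlinear glossed text (hypothetical structure)."""
    form: str             # the word as written
    morphemes: List[str]  # morpheme segmentation of the word
    glosses: List[str]    # one gloss tag per morpheme


def render_igt(words: List[GlossedWord], translation: str) -> str:
    """Render a morpheme line, a gloss line and a free translation."""
    for w in words:
        if len(w.morphemes) != len(w.glosses):
            raise ValueError(f"morphemes and glosses do not align in {w.form!r}")
    morph_cells = ["-".join(w.morphemes) for w in words]
    gloss_cells = ["-".join(w.glosses) for w in words]
    # Pad each column to the wider of its two cells so the lines align.
    widths = [max(len(m), len(g)) for m, g in zip(morph_cells, gloss_cells)]
    morph_line = "  ".join(m.ljust(n) for m, n in zip(morph_cells, widths)).rstrip()
    gloss_line = "  ".join(g.ljust(n) for g, n in zip(gloss_cells, widths)).rstrip()
    return "\n".join([morph_line, gloss_line, f"'{translation}'"])


if __name__ == "__main__":
    sentence = [
        GlossedWord("dogs", ["dog", "s"], ["dog", "PL"]),
        GlossedWord("barked", ["bark", "ed"], ["bark", "PST"]),
    ]
    print(render_igt(sentence, "Dogs barked."))
```

Keeping morphemes and glosses as parallel lists makes the one-to-one alignment checkable, which is what lets further annotation layers be added to the same tokens.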


WORKSHOP PROGRAM

The Workshop is supported by India 2011

October 3rd-7th, Monday to Friday

Meals during the day are provided at university facilities.

Monday

9:15-10:30 Language Documentation, Dorothee Beermann (LinLab, Building 4, Dragvoll)
Tea
10:45-12:00 Keynote, Gautam Sengupta (LinLab)
Lunch
13:15-14:30 Hands-on: Introduction to TypeCraft (LinLab and CompLab)
Afternoon Tea
15:00-16:45 Hands-on: Creation of research corpora, verb annotation and discussion (LinLab and CompLab)

Tuesday

9:15-10:30 Keynote, K.V. Subbarao (LinLab)
Tea
10:45-12:00 Discussion: Indian construction types (LinLab)
Lunch
13:30-14:30 Hands-on: Classifying and annotating construction types (CompLab and offices)
Afternoon Tea
15:00-16:45 Second afternoon session

Wednesday

9:15-10:30 Keynote, Lars Hellan (LinLab)
Tea
10:45-12:00 Discussion: AVMs (LinLab)
Lunch
13:15-14:30 Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices)
Afternoon Tea
15:00-16:45 Second afternoon session

Thursday

All day: Hands-on: AVM construction, construction labeling, TC annotation (CompLab and offices), with meals as usual
EVENING SESSION
19:00-19:30 Public talk at Dokkhuset: Gautam Sengupta, The Aksara-Based Script Systems of India

Friday

All day at Gløshaugen Campus, the IT Building, room 054: Interconnecting Digital Linguistics and NLP
Contributors: Tore Bruland, Anil Kumar Singh
Lunch at Gløshaugen
Discussion of digital tools and sustainability of distributive research using community platforms.
Note: COLING 2012




The workshop features two sections:

Multilingual text processing, interlinear annotation and formalisation of Indic languages

Using natural language processing tools and linguistic web technology developed at the University of Hyderabad and at NTNU, we will create small research corpora, which we will annotate for salient linguistic properties with the goal of deriving Attribute Value Matrix notations from these annotations.


List of Workshop Languages
Language name       Language family    Script
Banglā (Bengali)    Indo-Aryan         Banglā
Hindi               Indo-Aryan         Devanāgarī
Punjabi             Indo-Aryan         Gurmukhi
Malayālam           Dravidian          Malayāḷalipi
Khasi               Austro-Asiatic     Roman
Angami              Tibeto-Burman      Roman

Grammatical construction types across Indian languages

Using methods of formal linguistic representation such as 'attribute value matrices' (AVMs), a systematic comparison of representatives of each of the major language families spoken in India will be conducted, focusing on a limited set of sentential construction types. The languages and their families are those listed above.
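For readers unfamiliar with AVMs, the following sketch represents an attribute value matrix as a nested dictionary and combines two of them by unification, the operation commonly used on such structures in grammar formalisms. The attribute names (`HEAD`, `SUBJ`, `CASE`, `PER`) and the function `unify` are illustrative assumptions and not the workshop's notation; structure sharing (reentrancy) is deliberately left out.

```python
from typing import Any, Dict, Optional

AVM = Dict[str, Any]  # attribute -> atomic value or nested AVM


def unify(a: AVM, b: AVM) -> Optional[AVM]:
    """Return the unification of two AVMs, or None if they conflict."""
    result = dict(a)
    for attr, b_val in b.items():
        if attr not in result:
            # Information present only in b is simply added.
            result[attr] = b_val
            continue
        a_val = result[attr]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            # Nested AVMs are unified recursively.
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[attr] = sub
        elif a_val != b_val:
            # Two different values for the same attribute cannot unify.
            return None
    return result


if __name__ == "__main__":
    verb = {"HEAD": "verb", "SUBJ": {"CASE": "dat"}}
    extra = {"SUBJ": {"PER": "3"}}
    print(unify(verb, extra))
    print(unify(verb, {"SUBJ": {"CASE": "nom"}}))
```

Because unification only ever adds compatible information, two partial descriptions of a construction can be merged or shown to be incompatible, which is one way a feature-by-feature comparison across languages could be made explicit.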

References

Soma Paul Tross paper

  1. Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.
