iDAT (Indic Data Annotation in Typecraft) is the name of a cooperation initiative between the University of Hyderabad and the Norwegian University of Science and Technology.

The goal of the initiative is to create a pool of in-depth annotated data to support online knowledge representation, e-research (that is, the free exchange of research data), and natural language processing initiatives that depend on richly annotated linguistic data.

The letters of iDAT also stand for the four major language families found in India: 'I' for Indo-Aryan, 'D' for Dravidian, 'A' for Austro-Asiatic and 'T' for Tibeto-Burman.

Language Families of India

The languages of India belong to several language families. The two major ones are Indo-Aryan, a branch of Indo-European (spoken by 72% of Indians), and Dravidian (spoken by 25% of Indians). Other languages spoken in India belong to the Austro-Asiatic and Tibeto-Burman families, and to a few minor language families and isolates. Individual mother tongues in India number several hundred; the 1961 census recognised 1,652 (SIL Ethnologue lists 415). According to the 2001 Census of India, 30 languages are spoken by more than a million native speakers and 122 by more than 10,000.

Three millennia of language contact have led to significant mutual influence among the four language families in India and South Asia. Two contact languages have played an important role in the history of India: Persian and English.

The northern Indian languages of the Indo-European family evolved from Old Indo-Aryan languages such as Sanskrit, by way of the Middle Indo-Aryan Prakrit languages and the Apabhraṃśa of the Middle Ages. There is no consensus on a specific time when the modern north Indian languages such as Hindi-Urdu, Assamese, Bengali, Gujarati, Marathi, Punjabi, Saraiki, Sindhi and Oriya emerged, but AD 1000 is commonly accepted. Each language was subject to different influences; Hindi-Urdu (Hindustani), for example, was strongly influenced by Persian. The Dravidian languages of South India have a history independent of Sanskrit. The major Dravidian languages are Tamil, Telugu, Kannada and Malayalam. The Austro-Asiatic and Tibeto-Burman languages of North-East India also have long independent histories.

Language | Family | Speakers (2001, millions) | State(s)
Assamese | Indo-Aryan, Eastern | 13 | Assam, Arunachal Pradesh
Bengali | Indo-Aryan, Eastern | 83 | West Bengal, Tripura, Andaman & Nicobar Islands, and a few regions of Assam
Bodo | Tibeto-Burman | 1.4 | Assam
Dogri | Indo-Aryan, Northwestern | 2.3 | Jammu and Kashmir
Gujarati | Indo-Aryan, Western | 46 | Dadra and Nagar Haveli, Daman and Diu, Gujarat
Standard Hindi | Indo-Aryan, Central | 258-422 | Andaman and Nicobar Islands, Arunachal Pradesh, Bihar, Chandigarh, Chhattisgarh, the national capital territory of Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh and Uttarakhand
Kannada | Dravidian | 38 | Karnataka
Kashmiri | Indo-Aryan, Dardic | 5.5 | Jammu and Kashmir
Konkani | Indo-Aryan, Southern | 2.5 (7.6 per Ethnologue) | Goa, Karnataka, Maharashtra, Kerala
Maithili | Indo-Aryan, Eastern | 12 (32 in India in 2000 per Ethnologue) | Bihar
Malayalam | Dravidian | 33 | Kerala, Andaman and Nicobar Islands, Lakshadweep, Pondicherry
Manipuri (also Meitei or Meithei) | Tibeto-Burman | 1.5 | Manipur
Nepali | Indo-Aryan, Northern | 2.9 | Sikkim, West Bengal, Assam
Oriya | Indo-Aryan, Eastern | 33 | Orissa
Punjabi | Indo-Aryan, Northwestern | 29 | Chandigarh, Delhi, Haryana, Punjab
Sanskrit | Indo-Aryan | 0.01 | non-regional
Santhali | Munda | 6.5 | Santhal tribals of the Chota Nagpur Plateau (comprising the states of Bihar, Chhattisgarh, Jharkhand, Orissa)
Sindhi | Indo-Aryan, Northwestern | 2.5 | non-regional
Tamil | Dravidian | 61 | Tamil Nadu, Andaman & Nicobar Islands, Puducherry
Telugu | Dravidian | 74 | Andaman & Nicobar Islands, Andhra Pradesh, Puducherry
Urdu | Indo-Aryan, Central | 52 | Jammu and Kashmir, Andhra Pradesh, Delhi, Bihar, Uttar Pradesh and Uttarakhand