
Difference between revisions of "Classroom:LING2208 - Annotating Norwegian Bokmål/Agreement statistics"

Revision as of 16:58, 25 February 2014

Tagging of gender in Norwegian Bokmål

The conventions for glossing adjectives vary across the Norwegian Bokmål corpora of TypeCraft: adjectives are sometimes glossed with grammatical gender tags and sometimes not.

One possible explanation for why some adjectives are tagged with gender lies in the neuter form. In Norwegian, the neuter form of an adjective is usually distinct from the masculine and feminine forms, whereas the masculine and feminine forms are indistinguishable from each other and from the base form. One could therefore expect NEUT to be overrepresented among the adjectives tagged for gender.

Using TypeCraft's Phrase Search (for Norwegian Bokmål), three searches, [("POS:ADJ", "gloss:FEM"), ("POS:ADJ", "gloss:MASC"), ("POS:ADJ", "gloss:NEUT")], yield three values: the number of adjectives tagged with each gender. Their sum is the total number of gender-tagged adjectives.

For comparison, three searches for the gloss tags alone, with no POS restriction, [("gloss:FEM"), ("gloss:MASC"), ("gloss:NEUT")], yield three further values: the total number of words in TypeCraft tagged with each gender (for Bokmål).

{| class="wikitable"
! Gender !! Adjectives !! Total for all tags in TypeCraft
|-
| FEM || 0 (0%) || 33 (6.33%)
|-
| MASC || 13 (21%) || 302 (58%)
|-
| NEUT || 49 (79%) || 186 (35.7%)
|-
| Total || 62 (100%) || 521 (100%)
|}

The results of these searches support the hypothesis above: NEUT is overrepresented as a gloss tag for adjectives (79%) compared with NEUT as a tag across all parts of speech (35.7%).

These results can be compared to the distribution of genders among nouns in the NoWaC corpus.[1]


The NoWaC counts were obtained with three queries, one for each of mask, fem and nøyt, each run as follows:

Start with a sum of 0.

For each record in the file[1]:

If the fourth column contains the substring subst AND the fourth column contains the substring "<gender>", where <gender> is the tag being queried (mask, fem, or nøyt[2]), add the number in the first column to the sum.
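The procedure above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the script actually used: it assumes each record is a whitespace-separated line whose first column is a frequency and whose fourth column (taken here as the rest of the line) holds the tag string. The function name `sum_noun_frequencies` and the sample records are invented for the example and are not real NoWaC data.

```python
def sum_noun_frequencies(lines, gender):
    """Sum the first-column frequencies of records whose tag column
    contains both 'subst' and the queried gender tag."""
    total = 0  # start with a sum of 0
    for line in lines:
        # Assumption: columns are whitespace-separated and the fourth
        # column is everything after the third separator.
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip malformed records
        count, tags = int(fields[0]), fields[3]
        # Plain substring matching, as in the description above.
        if "subst" in tags and gender in tags:
            total += count
    return total


# Invented sample records (frequency, word form, lemma, tags).
sample = [
    "120 huset hus subst nøyt appell ent be",
    "80 boka bok subst fem appell ent be",
    "200 bilen bil subst mask appell ent be",
    "50 stort stor adj nøyt ub ent pos",
]

totals = {g: sum_noun_frequencies(sample, g) for g in ("mask", "fem", "nøyt")}
```

With a real file, the lines would be read using whatever encoding the file actually has. A mismatch there is the kind of problem mentioned in note 2.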


{| class="wikitable"
! Gender !! Adjectives !! Total for all tags in TypeCraft !! Total for nouns in NoWaC !! Ratio for ADJ to NoWaC
|-
| FEM || 0 (0%) || 33 (6.33%) || 20358360 (16.47%) || 0%
|-
| MASC || 13 (21%) || 302 (58%) || 69209955 (56%) || 37.5%
|-
| NEUT || 49 (79%) || 186 (35.7%) || 34026414 (27.53%) || 286.96%
|-
| Total || 62 (100%) || 521 (100%) || 123594729 (100%) || N/A
|}

The percentages in the first three data columns give each tag's share of the total for that count (e.g., 56% of all nouns in NoWaC are tagged as masculine). The final column divides each gender's share among entries tagged with ADJ in TypeCraft by its share among entries tagged as nouns in NoWaC. This indicates whether a gender is glossed on adjectives more often than its natural frequency would predict.

This comparison likewise suggests that NEUT is overrepresented as a tag for adjectives in TypeCraft: its share among gender-tagged adjectives is nearly three times its share among NoWaC nouns.
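As a rough check on the final column, the ratio can be recomputed from the raw counts in the table. The sketch below uses hypothetical helper names (`shares`, `representation_ratio`) and works with unrounded shares, so its results are close to, but not identical with, the table's figures, which appear to be derived from the rounded percentages (21/56 and 79/27.53).

```python
# Raw counts taken from the table above.
adjectives = {"FEM": 0, "MASC": 13, "NEUT": 49}
nowac_nouns = {"FEM": 20358360, "MASC": 69209955, "NEUT": 34026414}


def shares(counts):
    """Return each tag's share of the total count."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def representation_ratio(observed, reference):
    """Divide each tag's observed share by its reference share.
    A value above 1 means the tag is overrepresented."""
    obs, ref = shares(observed), shares(reference)
    return {tag: obs[tag] / ref[tag] for tag in observed}


ratios = representation_ratio(adjectives, nowac_nouns)
for tag, ratio in ratios.items():
    print(f"{tag}: {ratio:.2%}")
```

A ratio above 100% marks a gender that is glossed on adjectives more often than its share among NoWaC nouns would predict. Only NEUT meets that condition here.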


Notes

  1. UiO, Frequency lists from NoWaC (Frequency list of analyzed word forms), http://www.hf.uio.no/iln/om/organisasjon/tekstlab/tjenester/nowac-frequency.html
  2. Some problems with the file's encoding resulted in this being queried as nøyt.