Tagset

This corpus uses the Universal Dependencies (UD) annotation scheme as implemented in the UD Russian SynTagRus treebank.

One sentence in the corpus

The corpus is stored as a vertical file. Each sentence (<s>) contains one line per token with a fixed set of columns. The example below illustrates the exact token-level annotation format used in this corpus.

Example (sentence <s>)
id	word	lemma	tag	feat	head	deprel
1	Ведь	ведь	PART	_	4	advmod
2	когда	когда	SCONJ	_	4	mark
3	человек	человек	NOUN	Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	4	nsubj
4	приходит	приходить	VERB	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	12	advcl
5	ко	ко	ADP	_	6	case
6	мне	я	PRON	Case=Dat|Number=Sing|Person=1|PronType=Prs	4	obl
7	со	с	ADP	_	9	case
8	своей	свой	DET	Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes	9	det
9	болью	боль	NOUN	Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing	4	obl
10	,	,	PUNCT	_	4	punct
11	я	я	PRON	Case=Nom|Number=Sing|Person=1|PronType=Prs	12	nsubj
12	должен	должен	ADJ	Degree=Pos|Gender=Masc|Number=Sing|Variant=Short	0	root
13	войти	войти	VERB	Aspect=Perf|VerbForm=Inf|Voice=Act	12	xcomp
14	в	в	ADP	_	16	case
15	эту	этот	DET	Case=Acc|Gender=Fem|Number=Sing|PronType=Dem	16	det
16	боль	боль	NOUN	Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing	13	obl
17	,	,	PUNCT	_	19	punct
18	чтобы	чтобы	SCONJ	Mood=Cnd	19	mark
19	понять	понять	VERB	Aspect=Perf|VerbForm=Inf|Voice=Act	13	advcl
20	,	,	PUNCT	_	23	punct
21	что	что	PRON	Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel	23	obj
22	он	он	PRON	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	23	nsubj
23	чувствует	чувствовать	VERB	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	19	ccomp
24	.	.	PUNCT	_	12	punct

The columns shown above correspond to the token attributes used for search in the corpus. Detailed definitions of these attributes are provided below.
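As a rough illustration of how one token line of the format above could be read, the Python sketch below splits a line into the seven attributes. It assumes that the columns are separated by tab characters, which this page does not state explicitly; the names COLUMNS and parse_token_line are hypothetical and not part of the corpus tooling.

```python
# Attribute names in the order shown in the example header.
COLUMNS = ("id", "word", "lemma", "tag", "feat", "head", "deprel")


def parse_token_line(line):
    """Split one token line into a dict keyed by attribute name.

    Assumption: columns are tab-separated (not confirmed by the corpus docs).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    token = dict(zip(COLUMNS, fields))
    # id and head are numeric references within the sentence.
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    return token


token = parse_token_line(
    "3\tчеловек\tчеловек\tNOUN\t"
    "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing\t4\tnsubj"
)
print(token["lemma"], token["tag"], token["head"], token["deprel"])
```

Keeping id and head as integers makes it straightforward to follow head references later; all other attributes stay as plain strings.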

Truncated CoNLL-U format

UD treebanks are normally stored in CoNLL-U format, where each token is represented by ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).

In this corpus, each <s> sentence uses a truncated CoNLL-U representation: only the seven fields required for search are retained, and the unused columns (XPOS, DEPS, MISC) are omitted.

Mapping to corpus attributes
Corpus attribute	Meaning	CoNLL-U analogue
id	Token number within the sentence	ID
word	Surface word form	FORM
lemma	Lemma	LEMMA
tag	POS tag (UPOS)	UPOS
feat	Morphological feature bundle	FEATS
head	Head token ID (0 = root)	HEAD
deprel	Dependency relation to head	DEPREL
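The mapping above can be used to expand a truncated row back into a full ten-column CoNLL-U line. The sketch below is one possible way to do this, filling the three columns that the corpus omits (XPOS, DEPS, MISC) with the CoNLL-U underscore placeholder. The function name to_conllu_line and the dict layout are assumptions made for illustration.

```python
def to_conllu_line(token):
    """Build a 10-column CoNLL-U line from the seven corpus attributes.

    Columns not stored in the corpus (XPOS, DEPS, MISC) are set to "_".
    """
    fields = [
        str(token["id"]),    # ID
        token["word"],       # FORM
        token["lemma"],      # LEMMA
        token["tag"],        # UPOS
        "_",                 # XPOS (omitted in the corpus)
        token["feat"],       # FEATS
        str(token["head"]),  # HEAD
        token["deprel"],     # DEPREL
        "_",                 # DEPS (omitted in the corpus)
        "_",                 # MISC (omitted in the corpus)
    ]
    return "\t".join(fields)


sample = {
    "id": 12, "word": "должен", "lemma": "должен", "tag": "ADJ",
    "feat": "Degree=Pos|Gender=Masc|Number=Sing|Variant=Short",
    "head": 0, "deprel": "root",
}
conllu_line = to_conllu_line(sample)
print(conllu_line)
```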

Annotation

<tag>: POS tags

The corpus uses 17 POS tags: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
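When building queries or post-processing results, it can help to check tag values against this closed set. The following is a minimal sketch; the names UPOS_TAGS and is_valid_tag are hypothetical.

```python
# The 17 POS tags listed above.
UPOS_TAGS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN "
    "PUNCT SCONJ SYM VERB X".split()
)


def is_valid_tag(tag):
    """Return True if tag is one of the 17 POS tags (case-sensitive)."""
    return tag in UPOS_TAGS


print(len(UPOS_TAGS))
print(is_valid_tag("SCONJ"), is_valid_tag("CONJ"))
```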

<feat>: Morphological categories and subcategories

The corpus uses UD-style feature bundles: Feature=Value pairs separated by |, for example: Case=Nom|Gender=Masc|Number=Sing.
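A feature bundle in this notation can be unpacked with a few lines of Python. The sketch below is illustrative only (parse_feats is a hypothetical helper): it treats "_" as an empty bundle and splits comma-separated values such as PronType=Int,Rel into a list.

```python
from typing import Dict, List


def parse_feats(feat: str) -> Dict[str, List[str]]:
    """Turn 'Case=Nom|Gender=Masc' into {'Case': ['Nom'], 'Gender': ['Masc']}.

    An underscore means that the token has no morphological features.
    """
    if feat == "_":
        return {}
    bundle = {}
    for pair in feat.split("|"):
        name, _, value = pair.partition("=")
        # A feature may carry several values separated by commas.
        bundle[name] = value.split(",")
    return bundle


print(parse_feats("Case=Nom|Gender=Masc|Number=Sing"))
print(parse_feats("PronType=Int,Rel"))
print(parse_feats("_"))
```

Storing every value as a list keeps single-valued and multi-valued features uniform, at the cost of one extra indexing step for the common single-valued case.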

In UD Russian SynTagRus, the feature inventory includes, among others, Animacy, Aspect, Case, Degree, Gender, Mood, Number, Person, PronType, Tense, Variant, VerbForm, and Voice.

Feature bundles by POS (full lists)


NOUN

See also: Case statistics, Animacy statistics.

VERB
PRON
ADJ
ADP

Adpositions (mostly prepositions) are typically not inflected in UD Russian SynTagRus, so they usually have no morphological features (feat=_).

ADV
AUX
CCONJ

Coordinating conjunctions typically have no morphological features (feat=_).

DET
INTJ

Interjections typically have no morphological features (feat=_).

NUM
PART

Particles typically have no morphological features (feat=_).

PROPN
PUNCT

Punctuation tokens have no morphological features (feat=_).

SCONJ

Subordinating conjunctions typically have no morphological features (feat=_), although exceptions occur: in the sample sentence, чтобы carries Mood=Cnd.

SYM

Symbols typically have no morphological features (feat=_).

X

The tag X is used for other or unknown tokens; morphological features are typically empty (feat=_).

<head>: Head index

The head column stores the ID of the syntactic head (governor) of the current token within the same sentence. The dependency structure forms a single rooted tree over the tokens of the sentence.

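Example from the sample sentence: token 3 (человек) has head=4, so its head is token 4 (приходит); token 12 (должен) has head=0, which marks it as the root of the sentence.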
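The single-rooted-tree property described above can be checked mechanically. The sketch below uses the id and head values of the sample sentence and verifies that exactly one token has head 0 and that every token reaches the root without running into a cycle. The function check_tree is a hypothetical helper written for this illustration.

```python
from typing import Dict


def check_tree(heads: Dict[int, int]) -> int:
    """Return the root token id if heads (id -> head) form one rooted tree.

    Raises ValueError if there is not exactly one root, if a head refers to
    a token that does not exist, or if following heads leads into a cycle.
    """
    roots = [tid for tid, head in heads.items() if head == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    for tid in heads:
        seen = set()
        current = tid
        while current != 0:
            if current not in heads:
                raise ValueError(f"head {current} is not a token id")
            if current in seen:
                raise ValueError(f"cycle detected at token {current}")
            seen.add(current)
            current = heads[current]
    return roots[0]


# id -> head for the 24 tokens of the sample sentence.
SAMPLE_HEADS = {
    1: 4, 2: 4, 3: 4, 4: 12, 5: 6, 6: 4, 7: 9, 8: 9, 9: 4, 10: 4,
    11: 12, 12: 0, 13: 12, 14: 16, 15: 16, 16: 13, 17: 19, 18: 19,
    19: 13, 20: 23, 21: 23, 22: 23, 23: 19, 24: 12,
}
root_id = check_tree(SAMPLE_HEADS)
print(root_id)
```

Walking up from every token is quadratic in the worst case, which is acceptable for sentence-sized inputs.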

<deprel>: Dependency relations

Syntactic dependencies are encoded as one dependency tree per sentence using UD relation labels. The examples below are illustrative and use the direction HEAD → DEPENDENT.

See UD documentation: https://universaldependencies.org/u/dep/
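To make the HEAD → DEPENDENT direction concrete, the sketch below prints a few relations taken from the sample sentence, one per line, in the form "deprel: head → dependent". The tuple layout, the function name arcs, and the label ROOT for head 0 are assumptions chosen for this illustration.

```python
# (id, word, head, deprel) for a few tokens of the sample sentence.
SAMPLE = [
    (3, "человек", 4, "nsubj"),
    (4, "приходит", 12, "advcl"),
    (11, "я", 12, "nsubj"),
    (12, "должен", 0, "root"),
    (13, "войти", 12, "xcomp"),
]


def arcs(tokens):
    """Format each dependency as 'deprel: HEAD → DEPENDENT'."""
    words = {tid: word for tid, word, _head, _rel in tokens}
    lines = []
    for _tid, word, head, rel in tokens:
        # head 0 has no word form; ROOT is a placeholder label.
        head_word = "ROOT" if head == 0 else words.get(head, f"#{head}")
        lines.append(f"{rel}: {head_word} → {word}")
    return lines


for line in arcs(SAMPLE):
    print(line)
```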

Source documents


Contact

name.surname [at] ur.de