Tagset

This corpus uses the Universal Dependencies (UD) annotation scheme as implemented in the UD Russian SynTagRus treebank.

One sentence in the corpus

The corpus is stored as a vertical file. Each sentence (<s>) contains one line per token with a fixed set of columns. The example below shows the exact token-level annotation format used in this corpus.

Example (sentence <s>)
id	word	lemma	tag	feat	head	deprel
1	Ведь	ведь	PART	_	4	advmod
2	когда	когда	SCONJ	_	4	mark
3	человек	человек	NOUN	Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	4	nsubj
4	приходит	приходить	VERB	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	12	advcl
5	ко	ко	ADP	_	6	case
6	мне	я	PRON	Case=Dat|Number=Sing|Person=1|PronType=Prs	4	obl
7	со	с	ADP	_	9	case
8	своей	свой	DET	Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes	9	det
9	болью	боль	NOUN	Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing	4	obl
10	,	,	PUNCT	_	4	punct
11	я	я	PRON	Case=Nom|Number=Sing|Person=1|PronType=Prs	12	nsubj
12	должен	должен	ADJ	Degree=Pos|Gender=Masc|Number=Sing|Variant=Short	0	root
13	войти	войти	VERB	Aspect=Perf|VerbForm=Inf|Voice=Act	12	xcomp
14	в	в	ADP	_	16	case
15	эту	этот	DET	Case=Acc|Gender=Fem|Number=Sing|PronType=Dem	16	det
16	боль	боль	NOUN	Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing	13	obl
17	,	,	PUNCT	_	19	punct
18	чтобы	чтобы	SCONJ	Mood=Cnd	19	mark
19	понять	понять	VERB	Aspect=Perf|VerbForm=Inf|Voice=Act	13	advcl
20	,	,	PUNCT	_	23	punct
21	что	что	PRON	Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel	23	obj
22	он	он	PRON	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	23	nsubj
23	чувствует	чувствовать	VERB	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	19	ccomp
24	.	.	PUNCT	_	12	punct

The columns above correspond to token attributes used for search in the corpus. Detailed definitions of each attribute are given below.

Truncated CoNLL‑U format

UD treebanks are normally stored in the CoNLL‑U format, where each token line has ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).

In this corpus, a truncated CoNLL‑U representation is used inside each <s> sentence: only the seven fields needed for search are kept, and the unused columns (XPOS, DEPS, MISC) are omitted.
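To make the truncated format concrete, the following is a minimal sketch of reading one token line into the seven corpus attributes. It assumes the columns are tab-separated, as is usual for vertical files; the names Token and parse_token_line are hypothetical and are not part of any corpus tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One token line of the truncated CoNLL-U format (seven columns)."""
    id: int
    word: str
    lemma: str
    tag: str
    feat: str
    head: int
    deprel: str


def parse_token_line(line: str) -> Token:
    """Split one token line into its seven attributes.

    Assumption: columns are separated by a single tab character.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    tok_id, word, lemma, tag, feat, head, deprel = fields
    # id and head are numeric; the remaining columns stay as strings.
    return Token(int(tok_id), word, lemma, tag, feat, int(head), deprel)


# Token 3 of the sample sentence.
sample = "3\tчеловек\tчеловек\tNOUN\tAnimacy=Anim|Case=Nom|Gender=Masc|Number=Sing\t4\tnsubj"
token = parse_token_line(sample)
print(token.word, token.tag, token.head, token.deprel)
```

Keeping feat as a raw string here mirrors the stored column; it can be split further when individual features are needed.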

Mapping to corpus attributes
Corpus attribute	Meaning	CoNLL‑U analogue
id	Token number in the sentence	ID
word	Surface wordform	FORM
lemma	Lemma	LEMMA
tag	POS tag (UPOS)	UPOS
feat	Morphological feature bundle	FEATS
head	Head token id (0 = root)	HEAD
deprel	Dependency relation to head	DEPREL
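The feat attribute follows the UD FEATS convention: features are written as Name=Value pairs joined by "|", several values of one feature are joined by ",", and "_" stands for an empty bundle. As an illustration only, a bundle could be unpacked as sketched below; parse_feats is a hypothetical helper name.

```python
def parse_feats(feat: str) -> dict[str, list[str]]:
    """Turn a feature bundle such as 'Case=Nom|Number=Sing' into a dict.

    Each feature maps to a list of values, because UD allows several
    comma-separated values for one feature (e.g. PronType=Int,Rel).
    """
    if feat == "_":  # "_" marks a token without morphological features
        return {}
    result: dict[str, list[str]] = {}
    for pair in feat.split("|"):
        name, _, values = pair.partition("=")
        result[name] = values.split(",")
    return result


# Feature bundle of token 21 (что) in the sample sentence.
feats = parse_feats("Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel")
print(feats["Case"])      # ['Acc']
print(feats["PronType"])  # ['Int', 'Rel']
```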

Annotation

<head>: Head index

The head column stores the ID of the syntactic head (governor) of the current token within the same sentence. The dependency structure is a single rooted tree over tokens.

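Example from the sample sentence: token 3 (человек) has head 4, so it depends on token 4 (приходит); token 12 (должен) has head 0, which marks it as the root of the sentence.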
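The single-rooted-tree property can be checked mechanically. The sketch below uses the head values of the sample sentence and verifies that exactly one token has head 0 and that every token reaches the root without a cycle. The function name find_root_and_check_tree is hypothetical, and this is one possible check, not a description of how the corpus was validated.

```python
# Head values of the sample sentence: token id -> head id (0 = root).
HEADS = {
    1: 4, 2: 4, 3: 4, 4: 12, 5: 6, 6: 4, 7: 9, 8: 9,
    9: 4, 10: 4, 11: 12, 12: 0, 13: 12, 14: 16, 15: 16, 16: 13,
    17: 19, 18: 19, 19: 13, 20: 23, 21: 23, 22: 23, 23: 19, 24: 12,
}


def find_root_and_check_tree(heads: dict[int, int]) -> int:
    """Return the id of the root token, or raise ValueError if the
    head column does not describe a single rooted tree."""
    roots = [tok for tok, head in heads.items() if head == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    for tok in heads:
        seen = set()
        current = tok
        # Follow head links upwards until the artificial root (0) is reached.
        while current != 0:
            if current in seen:
                raise ValueError(f"cycle detected starting from token {tok}")
            if current not in heads:
                raise ValueError(f"head {current} is not a token of this sentence")
            seen.add(current)
            current = heads[current]
    return roots[0]


print(find_root_and_check_tree(HEADS))  # 12
```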

<deprel>: Dependency relations

Syntactic dependencies are encoded as a dependency tree per sentence using UD relation labels. Examples are illustrative and use the direction HEAD → DEPENDENT.
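As a small illustration of the HEAD → DEPENDENT direction, the sketch below prints the arcs for tokens 11 to 13 of the sample sentence. The helper format_arcs and the ROOT label for head 0 are assumptions made for this example, not corpus notation.

```python
# (id, word, head, deprel) for tokens 11-13 of the sample sentence.
tokens = [
    (11, "я", 12, "nsubj"),
    (12, "должен", 0, "root"),
    (13, "войти", 12, "xcomp"),
]


def format_arcs(tokens):
    """Render each dependency as 'deprel: HEAD → DEPENDENT'."""
    words = {tok_id: word for tok_id, word, _, _ in tokens}
    arcs = []
    for tok_id, word, head, deprel in tokens:
        # Head 0 has no wordform; show it with a placeholder label.
        head_word = "ROOT" if head == 0 else words[head]
        arcs.append(f"{deprel}: {head_word} → {word}")
    return arcs


for arc in format_arcs(tokens):
    print(arc)
# nsubj: должен → я
# root: ROOT → должен
# xcomp: должен → войти
```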

See UD documentation: https://universaldependencies.org/u/dep/

Source documents
