This corpus uses the Universal Dependencies (UD) annotation scheme as implemented in the UD Russian SynTagRus treebank.
The corpus is stored as a vertical file. Each sentence (<s>) contains one line per token with a fixed set of columns.
The example below illustrates the exact token-level annotation format used in this corpus.
| id | word | lemma | tag | feat | head | deprel |
|---|---|---|---|---|---|---|
| 1 | Ведь | ведь | PART | _ | 4 | advmod |
| 2 | когда | когда | SCONJ | _ | 4 | mark |
| 3 | человек | человек | NOUN | Animacy=Anim\|Case=Nom\|Gender=Masc\|Number=Sing | 4 | nsubj |
| 4 | приходит | приходить | VERB | Aspect=Imp\|Mood=Ind\|Number=Sing\|Person=3\|Tense=Pres\|VerbForm=Fin\|Voice=Act | 12 | advcl |
| 5 | ко | ко | ADP | _ | 6 | case |
| 6 | мне | я | PRON | Case=Dat\|Number=Sing\|Person=1\|PronType=Prs | 4 | obl |
| 7 | со | с | ADP | _ | 9 | case |
| 8 | своей | свой | DET | Case=Ins\|Gender=Fem\|Number=Sing\|Poss=Yes\|PronType=Prs\|Reflex=Yes | 9 | det |
| 9 | болью | боль | NOUN | Animacy=Inan\|Case=Ins\|Gender=Fem\|Number=Sing | 4 | obl |
| 10 | , | , | PUNCT | _ | 4 | punct |
| 11 | я | я | PRON | Case=Nom\|Number=Sing\|Person=1\|PronType=Prs | 12 | nsubj |
| 12 | должен | должен | ADJ | Degree=Pos\|Gender=Masc\|Number=Sing\|Variant=Short | 0 | root |
| 13 | войти | войти | VERB | Aspect=Perf\|VerbForm=Inf\|Voice=Act | 12 | xcomp |
| 14 | в | в | ADP | _ | 16 | case |
| 15 | эту | этот | DET | Case=Acc\|Gender=Fem\|Number=Sing\|PronType=Dem | 16 | det |
| 16 | боль | боль | NOUN | Animacy=Inan\|Case=Acc\|Gender=Fem\|Number=Sing | 13 | obl |
| 17 | , | , | PUNCT | _ | 19 | punct |
| 18 | чтобы | чтобы | SCONJ | Mood=Cnd | 19 | mark |
| 19 | понять | понять | VERB | Aspect=Perf\|VerbForm=Inf\|Voice=Act | 13 | advcl |
| 20 | , | , | PUNCT | _ | 23 | punct |
| 21 | что | что | PRON | Animacy=Inan\|Case=Acc\|Gender=Neut\|Number=Sing\|PronType=Int,Rel | 23 | obj |
| 22 | он | он | PRON | Case=Nom\|Gender=Masc\|Number=Sing\|Person=3\|PronType=Prs | 23 | nsubj |
| 23 | чувствует | чувствовать | VERB | Aspect=Imp\|Mood=Ind\|Number=Sing\|Person=3\|Tense=Pres\|VerbForm=Fin\|Voice=Act | 19 | ccomp |
| 24 | . | . | PUNCT | _ | 12 | punct |
The columns shown above correspond to the token attributes used for search in the corpus. Detailed definitions of these attributes are provided below.
UD treebanks are normally stored in CoNLL-U format, where each token is represented by multiple columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, etc.).
In this corpus, each <s> sentence uses a truncated CoNLL-U representation:
only the fields required for search are retained, and unused columns are omitted.
| Corpus attribute | Meaning | CoNLL-U analogue |
|---|---|---|
| id | Token number within the sentence | ID |
| word | Surface word form | FORM |
| lemma | Lemma | LEMMA |
| tag | POS tag (UPOS) | UPOS |
| feat | Morphological feature bundle | FEATS |
| head | Head token ID (0 = root) | HEAD |
| deprel | Dependency relation to head | DEPREL |
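The column mapping can be sketched as a small conversion from a standard 10-column CoNLL-U token line to the seven attributes kept in this corpus. This is only an illustration: the function name `truncate_conllu_line` is hypothetical, and the sketch assumes tab-separated columns as in CoNLL-U, which this documentation does not state for the vertical file. The XPOS, DEPS and MISC values in the example line are placeholders.

```python
# Column order of a standard CoNLL-U token line.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Corpus attribute -> CoNLL-U analogue, following the attribute table.
CORPUS_ATTRIBUTES = {
    "id": "ID", "word": "FORM", "lemma": "LEMMA", "tag": "UPOS",
    "feat": "FEATS", "head": "HEAD", "deprel": "DEPREL",
}


def truncate_conllu_line(line: str) -> dict:
    """Keep only the seven attributes used in this corpus.

    Assumption: columns are tab-separated, as in CoNLL-U.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    by_name = dict(zip(CONLLU_COLUMNS, fields))
    return {attr: by_name[col] for attr, col in CORPUS_ATTRIBUTES.items()}


# Token 3 of the sample sentence; XPOS, DEPS and MISC are placeholders ("_").
example = ("3\tчеловек\tчеловек\tNOUN\t_\t"
           "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing\t4\tnsubj\t_\t_")
token = truncate_conllu_line(example)
print(token["word"], token["tag"], token["head"], token["deprel"])
# prints: человек NOUN 4 nsubj
```

All values are kept as strings here; converting `id` and `head` to integers is left to the caller.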
<tag> — POS tag (UD UPOS; 17 universal POS categories), e.g.
NOUN (человек), VERB (приходит), ADP (в).
<feat> — morphological features (UD FEATS) encoded as
Feature=Value pairs separated by |. For example:
Case=Nom|Gender=Masc|Number=Sing (typical of nouns, adjectives, and pronouns),
or Aspect=Imp|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act (typical of finite verbs).
If no features are available for a token, the value is _.
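As a minimal sketch of how such a bundle could be read programmatically, the function below splits a feat value into a dictionary. The name `parse_feat` is hypothetical and not part of any corpus tool. Values are returned as lists because a feature can carry several comma-separated values, as in PronType=Int,Rel in the sample sentence.

```python
def parse_feat(feat: str) -> dict:
    """Split a feature bundle into a {feature: [values]} mapping."""
    if feat == "_":
        # "_" means that the token has no morphological features.
        return {}
    features = {}
    for pair in feat.split("|"):
        name, _, value = pair.partition("=")
        # Several values for one feature are separated by commas.
        features[name] = value.split(",")
    return features


print(parse_feat("Case=Nom|Gender=Masc|Number=Sing"))
# prints: {'Case': ['Nom'], 'Gender': ['Masc'], 'Number': ['Sing']}
print(parse_feat("_"))
# prints: {}
```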
The corpus uses 17 POS tags:
ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN,
NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB,
X.
- ADJ: adjective, e.g. большой.
- ADP: adposition (preposition or postposition), e.g. в, с.
- ADV: adverb, e.g. быстро.
- AUX: auxiliary, e.g. бы, быть (when used as an auxiliary).
- CCONJ: coordinating conjunction, e.g. и, но.
- DET: determiner (pronominal modifier), e.g. этот, свой.
- INTJ: interjection, e.g. увы.
- NOUN: common noun, e.g. человек.
- NUM: numeral, e.g. три, 10.
- PART: particle, e.g. ведь, же, не.
- PRON: pronoun, e.g. я, он, что.
- PROPN: proper noun, e.g. Москва.
- PUNCT: punctuation, e.g. the comma (,) and the full stop (.).
- SCONJ: subordinating conjunction, e.g. что, чтобы, когда.
- SYM: symbol, e.g. %, §.
- VERB: lexical verb, e.g. приходить, понять.
- X: other or unknown, e.g. foreign fragments.
The corpus uses UD-style feature bundles: Feature=Value pairs separated by |, for example:
Case=Nom|Gender=Masc|Number=Sing.
In UD Russian SynTagRus, the feature inventory includes, among others,
Animacy, Aspect, Case, Degree, Gender, Mood,
Number, Person, PronType, Tense, Variant,
VerbForm, and Voice.
The morphological categories and values attested in UD Russian SynTagRus are listed below for each POS tag.
NOUN:
- Animacy: Anim, Inan.
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.

See also: Case statistics, Animacy statistics.

VERB:
- Aspect: Imp, Perf.
- Mood: Ind, Imp, Cnd.
  - Ind: indicative. Example: он читает.
  - Imp: imperative. Example: читай!
  - Cnd: conditional. Example: он бы читал.
- Tense: Past, Pres, Fut.
- VerbForm: Fin, Inf, Part, Conv.
  - Fin: finite verb. Example: читает.
  - Inf: infinitive. Example: читать.
  - Part: participle. Examples: читающий, прочитанный.
  - Conv: converb. Examples: читая, прочитав.
- Voice: Act, Pass, Mid.
  - Act: active. Example: он читает книгу.
  - Pass: passive. Examples: книга прочитана, книга была прочитана.
  - Mid: middle. Examples: дверь открывается, книга читается легко.
- Number: Sing, Plur.
- Person: 1, 2, 3.
- Gender: Masc, Fem, Neut.

PRON:
- PronType:
  - Prs: personal (e.g. я, он).
  - Dem: demonstrative (e.g. это, тот).
  - Int: interrogative (e.g. кто?, что?).
  - Rel: relative.
  - Ind: indefinite.
  - Neg: negative.
  - Tot: total or universal.
  - Emp: emphatic.
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Number: Sing, Plur.
- Gender: Masc, Fem, Neut.
- Person: 1, 2, 3.
- Animacy: Anim, Inan.
- Reflex: Yes.

ADJ:
- Degree: Pos, Cmp, Sup.
- Number: Sing, Plur.
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.

Adpositions (mostly prepositions) are typically not inflected in UD Russian SynTagRus, so they usually have no morphological features
(feat=_).

ADV:
- Degree: Pos, Cmp, Sup.

AUX:
- VerbForm: Fin, Part.
- Mood: Ind.
- Tense: Past.
- Number: Sing, Plur.
- Gender: Masc, Fem, Neut.
- Voice: Act.

Coordinating conjunctions typically have no morphological features (feat=_).

DET:
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.
- PronType: Dem, Ind, Int, Rel, Tot, etc.
- Poss: Yes (possessive determiners).
- Reflex: Yes (e.g. reflexive possessives such as свой).

Interjections typically have no morphological features (feat=_).

NUM:
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.
- NumType: Card, Ord, Frac, Mult, Dist, Range, Sets.
- NumForm: Word, Digit, Roman.

Particles typically have no morphological features (feat=_).

PROPN:
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.
- Animacy: Anim, Inan.
- NameType: Geo, Prs, Giv, Sur, Com, Pro, Nat, Oth.

Punctuation tokens have no morphological features (feat=_).
Subordinating conjunctions typically have no morphological features (feat=_).
Symbols typically have no morphological features (feat=_).
The tag X is used for other or unknown tokens; morphological features are typically empty (feat=_).
<head> — head token ID, i.e. the syntactic governor of the current token within the same sentence.
Heads form a single dependency tree for each <s>; the root token has head=0.
For example, in the sample sentence, должен is the root (head=0), while
войти depends on it (head=12).
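The single-tree property can be checked with a short sketch such as the one below, which uses the head values of the sample sentence. The function names `find_root` and `is_single_tree` are hypothetical; the check simply follows head links from every token and requires that each path ends at the one token with head=0.

```python
# head value for each token id (1..24) of the sample sentence.
HEADS = {1: 4, 2: 4, 3: 4, 4: 12, 5: 6, 6: 4, 7: 9, 8: 9, 9: 4, 10: 4,
         11: 12, 12: 0, 13: 12, 14: 16, 15: 16, 16: 13, 17: 19, 18: 19,
         19: 13, 20: 23, 21: 23, 22: 23, 23: 19, 24: 12}


def find_root(heads: dict) -> int:
    """Return the id of the only token whose head is 0."""
    roots = [tok for tok, head in heads.items() if head == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]


def is_single_tree(heads: dict) -> bool:
    """True if every token reaches the single root by following head links."""
    try:
        root = find_root(heads)
    except ValueError:
        return False
    for tok in heads:
        seen = set()
        current = tok
        while current != root:
            # A repeated token means a cycle; an unknown id means a dangling head.
            if current in seen or current not in heads:
                return False
            seen.add(current)
            current = heads[current]
    return True


print(find_root(HEADS))       # prints: 12
print(is_single_tree(HEADS))  # prints: True
```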
<deprel> — dependency relation label to head, describing the syntactic function of the token.
For example, nsubj marks subjects, case marks adpositions attached to a noun or pronoun,
det marks determiners, and punct attaches punctuation.
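To illustrate how head and deprel combine, the sketch below groups a few tokens of the sample sentence by relation label and prints each pair as head → dependent. The function name `relations` and the placeholder `ROOT` for head=0 are assumptions made for this example only.

```python
from collections import defaultdict

# (id, word, head, deprel) for some tokens of the sample sentence.
TOKENS = [
    (3, "человек", 4, "nsubj"),
    (4, "приходит", 12, "advcl"),
    (5, "ко", 6, "case"),
    (6, "мне", 4, "obl"),
    (11, "я", 12, "nsubj"),
    (12, "должен", 0, "root"),
    (13, "войти", 12, "xcomp"),
]


def relations(tokens):
    """Yield (deprel, head_word, dependent_word) for each token."""
    words = {tok_id: word for tok_id, word, _, _ in tokens}
    for _, word, head, deprel in tokens:
        # head=0 has no governing token, so a placeholder label is used.
        head_word = "ROOT" if head == 0 else words[head]
        yield deprel, head_word, word


by_relation = defaultdict(list)
for deprel, head_word, dependent in relations(TOKENS):
    by_relation[deprel].append(f"{head_word} → {dependent}")

print(by_relation["nsubj"])
# prints: ['приходит → человек', 'должен → я']
```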
The head column stores the ID of the syntactic head (governor) of the current token within the same sentence.
The dependency structure forms a single rooted tree over the tokens of the sentence.
- head is an integer pointing to another token’s id.
- head=0 means that the token is the sentence root.
- head and deprel must be interpreted together: deprel specifies the relation between the token and its head.

Example from the sample sentence:
- должен has head=0 and is therefore the root of the sentence.
- войти has head=12, so it depends on token 12 (должен) with relation xcomp.
- человек has head=4, so it depends on token 4 (приходит) with relation nsubj.

Syntactic dependencies are encoded as one dependency tree per sentence using UD relation labels. The examples below are illustrative and use the direction HEAD → DEPENDENT.
- root: sentence root (typically the main predicate).
- nsubj: nominal subject.
- obj: object.
- iobj: indirect object.
- obl: oblique nominal.
- advmod: adverbial modifier.
- advcl: adverbial clause modifier.
- ccomp: clausal complement.
- xcomp: open clausal complement.
- mark: marker of a subordinate clause.
- case: case marker (adposition).
- det: determiner.
- amod: adjectival modifier.
- nmod: nominal modifier.
- conj: conjunct in coordination.
- cc: coordinating conjunction.
- punct: punctuation attachment.

See UD documentation: https://universaldependencies.org/u/dep/