Tagset
This corpus uses the Universal Dependencies (UD) annotation scheme as implemented in the UD Russian SynTagRus treebank.
One sentence in the corpus
The corpus is stored as a vertical file. Each sentence (<s>) contains one line per token with a fixed set of columns.
The example below shows the exact token-level annotation format used in this corpus.
Example (sentence <s>)
| id | word | lemma | tag | feat | head | deprel |
|---|---|---|---|---|---|---|
| 1 | Ведь | ведь | PART | _ | 4 | advmod |
| 2 | когда | когда | SCONJ | _ | 4 | mark |
| 3 | человек | человек | NOUN | Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | 4 | nsubj |
| 4 | приходит | приходить | VERB | Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 12 | advcl |
| 5 | ко | ко | ADP | _ | 6 | case |
| 6 | мне | я | PRON | Case=Dat|Number=Sing|Person=1|PronType=Prs | 4 | obl |
| 7 | со | с | ADP | _ | 9 | case |
| 8 | своей | свой | DET | Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes | 9 | det |
| 9 | болью | боль | NOUN | Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing | 4 | obl |
| 10 | , | , | PUNCT | _ | 4 | punct |
| 11 | я | я | PRON | Case=Nom|Number=Sing|Person=1|PronType=Prs | 12 | nsubj |
| 12 | должен | должен | ADJ | Degree=Pos|Gender=Masc|Number=Sing|Variant=Short | 0 | root |
| 13 | войти | войти | VERB | Aspect=Perf|VerbForm=Inf|Voice=Act | 12 | xcomp |
| 14 | в | в | ADP | _ | 16 | case |
| 15 | эту | этот | DET | Case=Acc|Gender=Fem|Number=Sing|PronType=Dem | 16 | det |
| 16 | боль | боль | NOUN | Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing | 13 | obl |
| 17 | , | , | PUNCT | _ | 19 | punct |
| 18 | чтобы | чтобы | SCONJ | Mood=Cnd | 19 | mark |
| 19 | понять | понять | VERB | Aspect=Perf|VerbForm=Inf|Voice=Act | 13 | advcl |
| 20 | , | , | PUNCT | _ | 23 | punct |
| 21 | что | что | PRON | Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel | 23 | obj |
| 22 | он | он | PRON | Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs | 23 | nsubj |
| 23 | чувствует | чувствовать | VERB | Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 19 | ccomp |
| 24 | . | . | PUNCT | _ | 12 | punct |
The columns above correspond to token attributes used for search in the corpus. Detailed definitions of each attribute are given below.
Truncated CoNLL‑U format
UD treebanks are normally stored in the CoNLL‑U format, where each token line has ten columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC.
In this corpus, a truncated CoNLL‑U representation is used inside each <s> sentence: only the seven fields needed for search are kept, and the remaining columns (XPOS, DEPS, MISC) are omitted.
Mapping to corpus attributes
| Corpus attribute | Meaning | CoNLL‑U analogue |
|---|---|---|
| id | Token number in the sentence | ID |
| word | Surface wordform | FORM |
| lemma | Lemma | LEMMA |
| tag | POS tag (UPOS) | UPOS |
| feat | Morphological feature bundle | FEATS |
| head | Head token id (0 = root) | HEAD |
| deprel | Dependency relation to head | DEPREL |
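The mapping above can be illustrated with a minimal parsing sketch. It assumes that the seven attributes appear in the order shown in the sample table and are separated by tab characters; the separator is an assumption, since this page does not state it. The names `Token` and `parse_token_line` are hypothetical helpers, not part of any corpus tool.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One token line with the seven corpus attributes."""
    id: int
    word: str
    lemma: str
    tag: str
    feat: str
    head: int
    deprel: str


def parse_token_line(line: str) -> Token:
    """Parse one token line (assumed: seven tab-separated columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 7:
        raise ValueError(f"expected 7 columns, got {len(cols)}: {line!r}")
    tid, word, lemma, tag, feat, head, deprel = cols
    # id and head are numeric; the other attributes stay as strings.
    return Token(int(tid), word, lemma, tag, feat, int(head), deprel)


# Token 3 of the sample sentence, written as a tab-separated line.
line = "3\tчеловек\tчеловек\tNOUN\tAnimacy=Anim|Case=Nom|Gender=Masc|Number=Sing\t4\tnsubj"
token = parse_token_line(line)
print(token.lemma, token.tag, token.head, token.deprel)  # человек NOUN 4 nsubj
```

Keeping `feat` as a raw string at this stage leaves the choice of how to split the feature bundle to the caller.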
Annotation
- Morphological annotation:
  - <tag>: POS tag (UD UPOS; 17 universal POS categories), e.g. NOUN (человек), VERB (приходит), ADP (в).
  - <feat>: morphological features (UD FEATS) encoded as Feature=Value pairs separated by |. For example: Case=Nom|Gender=Masc|Number=Sing (typical for nouns, adjectives, and pronouns) or Aspect=Imp|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act (typical for finite verbs). If no features are available for a token, the value is _.
- Syntactic annotation:
  - <head>: head token id (the syntactic "governor" of the current token within the same sentence). Heads form a single dependency tree for each <s>; the root token has head=0. Example: in the sample sentence, должен is the root (head=0), while войти depends on it (head=12).
  - <deprel>: dependency relation label to head, describing the syntactic function of the token. Examples: nsubj marks subjects (приходит → человек, должен → я); case marks prepositions attached to a noun/pronoun (мне → ко, боль → в); det marks determiners (боль → эту); punct attaches punctuation.
<tag>: POS tags
The corpus uses 17 POS tags:
- ADJ: adjective, e.g. большой.
- ADP: adposition (preposition/postposition), e.g. в, с.
- ADV: adverb, e.g. быстро.
- AUX: auxiliary, e.g. бы, быть (when used as an auxiliary).
- CCONJ: coordinating conjunction, e.g. и, но.
- DET: determiner (pronominal modifier), e.g. этот, свой.
- INTJ: interjection, e.g. увы.
- NOUN: common noun, e.g. человек.
- NUM: numeral, e.g. три, 10.
- PART: particle, e.g. ведь, же, не.
- PRON: pronoun, e.g. я, он, что.
- PROPN: proper noun, e.g. Москва.
- PUNCT: punctuation, e.g. the comma (,) and the full stop (.).
- SCONJ: subordinating conjunction, e.g. что, чтобы, когда.
- SYM: symbol, e.g. %, §.
- VERB: lexical verb, e.g. приходить, понять.
- X: other/unknown, foreign fragments, etc.
<feat>: Morphological categories and subcategories
The corpus uses UD-style feature bundles: Feature=Value pairs separated by |, for example Case=Nom|Gender=Masc|Number=Sing.
In UD Russian SynTagRus, the feature inventory includes, among others, Animacy, Aspect, Case, Degree, Gender, Mood, Number, Person, PronType, Tense, Variant, VerbForm, and Voice.
Feature bundles by POS (full lists)
Each POS tag below lists the morphological categories and values attested in UD Russian SynTagRus.
NOUN
- Animacy: Anim, Inan.
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.
See also: Case statistics, Animacy statistics.
VERB
- Aspect: Imp, Perf.
- Mood: Ind, Imp, Cnd.
  - Ind: indicative (default/realis). Example: он читает.
  - Imp: imperative (command/request). Example: читай!
  - Cnd: conditional (hypothetical "would"). Example: он бы читал.
- Tense: Past, Pres, Fut.
- VerbForm: Fin, Inf, Part, Conv.
  - Fin: finite verb. Example: читает.
  - Inf: infinitive. Example: читать.
  - Part: participle. Examples: читающий, прочитанный.
  - Conv: converb. Examples: читая, прочитав.
- Voice: Act, Pass, Mid.
  - Act: active. Example: он читает книгу.
  - Pass: passive. Examples: книга прочитана, книга была прочитана.
  - Mid: middle. Examples: дверь открывается, книга читается легко.
- Number: Sing, Plur.
- Person: 1, 2, 3.
- Gender (past/participles): Masc, Fem, Neut.
PRON
- PronType:
  - Prs: personal (e.g. я, он).
  - Dem: demonstrative (e.g. это, тот).
  - Int: interrogative (e.g. кто?, что?).
  - Rel: relative (e.g. кто / что in "тот, кто…").
  - Ind: indefinite (e.g. кто-то, что-нибудь).
  - Neg: negative (e.g. никто, ничто).
  - Tot: total/universal (e.g. все, оба).
  - Emp: emphatic.
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Number: Sing, Plur.
- Gender (where applicable): Masc, Fem, Neut.
- Person (personal pronouns): 1, 2, 3.
- Animacy (some pronouns): Anim, Inan.
- Reflex (reflexive): Yes.
ADJ
- Degree: Pos, Cmp, Sup.
- Number: Sing, Plur.
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Variant: attested (e.g. short and other variants; see the statistics page for the values used in this treebank).
See also: ADJ statistics (ru_syntagrus), Degree, Variant.
ADP
Adpositions (mostly prepositions) are typically not inflected in UD Russian SynTagRus, so they usually have no morphological features (feat=_).
See also: ADP statistics (ru_syntagrus).
ADV
- Degree: Pos, Cmp, Sup.
See also: ADV statistics (ru_syntagrus).
AUX
- VerbForm: Fin, Part.
- Mood: Ind.
- Tense: Past.
- Number: Sing, Plur.
- Gender: Masc, Fem, Neut.
- Voice: Act.
See also: AUX statistics (ru_syntagrus).
CCONJ
Coordinating conjunctions typically have no morphological features (feat=_).
DET
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.
- PronType: e.g. Dem, Ind, Int, Rel, Tot, etc.
- Poss: Yes (possessive determiners).
- Reflex: Yes (e.g. reflexive possessives like "свой").
See also: DET statistics (ru_syntagrus).
INTJ
Interjections typically have no morphological features (feat=_).
NUM
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender (where applicable): Masc, Fem, Neut.
- Number (where applicable): Sing, Plur.
- NumType (numeral type): Card, Ord, Frac, Mult, Dist, Range, Sets. Examples: Card (три), Ord (третий), Frac (полтора). See UD: NumType.
- NumForm (numeral form): Word, Digit, Roman. Examples: Word (три), Digit (3), Roman (III). See UD: NumForm.
See also: NUM statistics (ru_syntagrus).
PART
Particles typically have no morphological features (feat=_).
See also: PART statistics (ru_syntagrus).
PROPN
- Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
- Gender: Masc, Fem, Neut.
- Number: Sing, Plur.
- Animacy (where applicable): Anim, Inan.
- NameType (named-entity subtype for proper names): Geo, Prs, Giv, Sur, Com, Pro, Nat, Oth. Examples: Geo (Москва), Prs (Иван), Com (Газпром). See UD: NameType.
See also: PROPN statistics (ru_syntagrus).
PUNCT
Punctuation tokens have no morphological features (feat=_).
See also: PUNCT statistics (ru_syntagrus).
SCONJ
Subordinating conjunctions typically have no morphological features (feat=_).
See also: SCONJ statistics (ru_syntagrus).
SYM
Symbols typically have no morphological features (feat=_).
X
The tag X is used for other/unknown tokens; morphological features are typically empty (feat=_).
See also: X statistics (ru_syntagrus).
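The Feature=Value bundles described in the <feat> section can be split into a lookup structure before filtering tokens. The sketch below is illustrative only: `parse_feat` and `has_features` are hypothetical helper names, and the handling of comma-separated values follows the sample sentence, where что carries PronType=Int,Rel.

```python
from typing import Dict, Set


def parse_feat(feat: str) -> Dict[str, Set[str]]:
    """Split a feature bundle into {feature: {values}}; "_" means no features."""
    if feat == "_":
        return {}
    parsed: Dict[str, Set[str]] = {}
    for pair in feat.split("|"):
        name, _, values = pair.partition("=")
        # A feature may carry several comma-separated values, e.g. PronType=Int,Rel.
        parsed[name] = set(values.split(","))
    return parsed


def has_features(feat: str, **wanted: str) -> bool:
    """Return True if every requested Feature=Value pair is present in the bundle."""
    parsed = parse_feat(feat)
    return all(value in parsed.get(name, set()) for name, value in wanted.items())


# Feature bundle of token 21 (что) in the sample sentence.
chto = "Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel"
print(has_features(chto, Case="Acc", PronType="Rel"))  # True
print(has_features(chto, Case="Nom"))                  # False
print(parse_feat("_"))                                 # {}
```

Storing values as sets makes the match independent of the order in which multiple values are written.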
<head>: Head index
The head column stores the ID of the syntactic head (governor) of the current token within the same sentence.
The dependency structure is a single rooted tree over tokens.
- head is an integer pointing to another token's id.
- head=0 means that the token is the sentence root (the main predicate of the dependency tree).
- head and deprel must be interpreted together: deprel tells what relation holds between the token and its head.
Example from the sample sentence:
- должен has head=0 and therefore is the root of the sentence.
- войти has head=12, so it depends on token 12 (должен) with relation xcomp.
- человек has head=4, so it depends on token 4 (приходит) with relation nsubj.
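The single-rooted-tree property described in this section can be checked mechanically. The following is a minimal sketch, not a tool shipped with the corpus: `check_tree` is a hypothetical helper, and the id-to-head mapping is copied from the sample sentence above.

```python
from typing import Dict


def check_tree(heads: Dict[int, int]) -> int:
    """Return the root id if `heads` (id -> head) is a single rooted tree.

    Raises ValueError when there is not exactly one root, when a head
    points to a missing token, or when following heads leads to a cycle.
    """
    roots = [tid for tid, head in heads.items() if head == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    for tid, head in heads.items():
        if head != 0 and head not in heads:
            raise ValueError(f"token {tid} points to missing head {head}")
    for tid in heads:
        seen = set()
        node = tid
        # Walk upwards; every token must reach the artificial head 0.
        while node != 0:
            if node in seen:
                raise ValueError(f"cycle detected starting from token {tid}")
            seen.add(node)
            node = heads[node]
    return roots[0]


# id -> head for the 24 tokens of the sample sentence.
sample_heads = {
    1: 4, 2: 4, 3: 4, 4: 12, 5: 6, 6: 4, 7: 9, 8: 9,
    9: 4, 10: 4, 11: 12, 12: 0, 13: 12, 14: 16, 15: 16, 16: 13,
    17: 19, 18: 19, 19: 13, 20: 23, 21: 23, 22: 23, 23: 19, 24: 12,
}
print(check_tree(sample_heads))  # 12
```

The returned id 12 corresponds to должен, the root of the sample sentence.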
<deprel>: Dependency relations
Syntactic dependencies are encoded as a dependency tree per sentence using UD relation labels. Examples are illustrative and use the direction HEAD → DEPENDENT.
- root: sentence root (typically the main predicate). Example: должен.
- nsubj: nominal subject. Example: приходит → человек.
- obj: object. Example: чувствует → что.
- iobj: indirect object. Example: дать → мне.
- obl: oblique nominal. Example: приходит → мне.
- advmod: adverbial modifier / particle-like modifier. Example: приходит → ведь.
- advcl: adverbial clause modifier. Example: должен → приходит.
- ccomp: clausal complement. Example: понять → чувствует.
- xcomp: open clausal complement. Example: должен → войти.
- mark: marker of a subordinate clause. Examples: приходит → когда; понять → чтобы.
- case: case marker (adposition). Examples: мне → ко; боль → в.
- det: determiner. Examples: боль → эту; болью → своей.
- amod: adjectival modifier. Example: дом → большой.
- nmod: nominal modifier. Example: книга → брата.
- conj: conjunct in coordination. Example: читал → писал.
- cc: coordinating conjunction. Example: читал → и.
- punct: punctuation attachment. Examples: приходит → , (token 10); должен → . (token 24).
See UD documentation: https://universaldependencies.org/u/dep/
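To reproduce HEAD → DEPENDENT pairs like the ones listed above, the head and deprel columns can be combined as in the sketch below. The helper name `pairs_by_deprel` and the "ROOT" placeholder for head=0 are assumptions made for illustration; the rows are a subset of the sample sentence.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (id, word, head, deprel) for a few tokens of the sample sentence.
rows = [
    (3, "человек", 4, "nsubj"),
    (4, "приходит", 12, "advcl"),
    (11, "я", 12, "nsubj"),
    (12, "должен", 0, "root"),
    (13, "войти", 12, "xcomp"),
]


def pairs_by_deprel(
    rows: List[Tuple[int, str, int, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (head word, dependent word) pairs by dependency relation."""
    words = {tid: word for tid, word, _, _ in rows}
    pairs = defaultdict(list)
    for tid, word, head, deprel in rows:
        # head=0 has no token; "ROOT" is a placeholder chosen for this sketch.
        head_word = "ROOT" if head == 0 else words[head]
        pairs[deprel].append((head_word, word))
    return dict(pairs)


pairs = pairs_by_deprel(rows)
for head_word, dep_word in pairs["nsubj"]:
    print(f"{head_word} → {dep_word}")
# приходит → человек
# должен → я
```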
Source documents
- UD Russian SynTagRus treebank description: https://universaldependencies.org/treebanks/ru_syntagrus/index.html
- Lyashevskaya, O. et al. (2016), Universal Dependencies for Russian: A New Syntactic Dependencies Tagset (working paper; UD for Russian and CoNLL‑U fields): (attached PDF).