Tagset

This corpus uses the Universal Dependencies (UD) annotation scheme as implemented in the UD Russian SynTagRus treebank.

One sentence in the corpus

The corpus is stored as a vertical file. Each sentence (<s>) contains one line per token with a fixed set of columns. The example below shows the exact token-level annotation format used in this corpus.

Example (sentence <s>)

id	word	lemma	tag	feat	head	deprel
1	Ведь	ведь	PART	_	4	advmod
2	когда	когда	SCONJ	_	4	mark
3	человек	человек	NOUN	Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	4	nsubj
4	приходит	приходить	VERB	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	12	advcl
5	ко	ко	ADP	_	6	case
6	мне	я	PRON	Case=Dat|Number=Sing|Person=1|PronType=Prs	4	obl
7	со	с	ADP	_	9	case
8	своей	свой	DET	Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes	9	det
9	болью	боль	NOUN	Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing	4	obl
10	,	,	PUNCT	_	4	punct
11	я	я	PRON	Case=Nom|Number=Sing|Person=1|PronType=Prs	12	nsubj
12	должен	должен	ADJ	Degree=Pos|Gender=Masc|Number=Sing|Variant=Short	0	root
13	войти	войти	VERB	Aspect=Perf|VerbForm=Inf|Voice=Act	12	xcomp
14	в	в	ADP	_	16	case
15	эту	этот	DET	Case=Acc|Gender=Fem|Number=Sing|PronType=Dem	16	det
16	боль	боль	NOUN	Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing	13	obl
17	,	,	PUNCT	_	19	punct
18	чтобы	чтобы	SCONJ	Mood=Cnd	19	mark
19	понять	понять	VERB	Aspect=Perf|VerbForm=Inf|Voice=Act	13	advcl
20	,	,	PUNCT	_	23	punct
21	что	что	PRON	Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel	23	obj
22	он	он	PRON	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	23	nsubj
23	чувствует	чувствовать	VERB	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	19	ccomp
24	.	.	PUNCT	_	12	punct

The columns above correspond to token attributes used for search in the corpus. Detailed definitions of each attribute are given below.
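As an illustration of how one token line could be read programmatically, the sketch below splits a line into the seven attributes. It assumes the columns are separated by tab characters and that features inside `feat` are separated by a plain `|`; the names `Token` and `parse_token_line` are hypothetical and are not part of any corpus tool.

```python
from typing import NamedTuple

# Column order as shown in the header row of the example.
COLUMNS = ("id", "word", "lemma", "tag", "feat", "head", "deprel")


class Token(NamedTuple):
    id: int
    word: str
    lemma: str
    tag: str
    feat: str
    head: int
    deprel: str


def parse_token_line(line: str) -> Token:
    """Split one tab-separated token line into its seven attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    id_, word, lemma, tag, feat, head, deprel = fields
    # id and head are numeric; everything else stays a string.
    return Token(int(id_), word, lemma, tag, feat, int(head), deprel)


# Token 6 of the example sentence.
token = parse_token_line(
    "6\tмне\tя\tPRON\tCase=Dat|Number=Sing|Person=1|PronType=Prs\t4\tobl"
)
print(token.lemma, token.tag, token.head)
```

Keeping `feat` as a raw string here leaves the decision of how to decode the feature bundle to the caller.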

Truncated CoNLL‑U format

UD treebanks are normally stored in the CoNLL‑U format, where each token line has ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC.

In this corpus, a truncated CoNLL‑U representation is used inside each <s> sentence: only the fields needed for search are kept, and unused columns are omitted.
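The truncation can be pictured as dropping the columns that have no corpus attribute. The sketch below is only an illustration: it assumes the kept fields are exactly the seven listed in the mapping table in this section, so XPOS, DEPS, and MISC are dropped, and it ignores CoNLL‑U comment lines and multiword-token lines. The function name and the placeholder `_` values in the sample line are hypothetical.

```python
# Zero-based positions of the kept fields in a ten-column CoNLL-U line:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
KEPT = (0, 1, 2, 3, 5, 6, 7)


def truncate_conllu_line(line: str) -> str:
    """Reduce a full CoNLL-U token line to the seven corpus columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 CoNLL-U columns, got {len(fields)}")
    return "\t".join(fields[i] for i in KEPT)


# Hypothetical full line for token 3; XPOS, DEPS and MISC are placeholders.
full = (
    "3\tчеловек\tчеловек\tNOUN\t_\t"
    "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing\t4\tnsubj\t_\t_"
)
short = truncate_conllu_line(full)
print(short.split("\t"))
```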

Mapping to corpus attributes

Corpus attribute	Meaning	CoNLL‑U analogue
`id`	Token number in the sentence	ID
`word`	Surface wordform	FORM
`lemma`	Lemma	LEMMA
`tag`	POS tag (UPOS)	UPOS
`feat`	Morphological feature bundle	FEATS
`head`	Head token id (0 = root)	HEAD
`deprel`	Dependency relation to head	DEPREL
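Because `head` stores the id of the governing token and 0 marks the root, the dependency tree of a sentence can be rebuilt from the `id` and `head` columns alone. The following minimal sketch uses a handful of tokens from the example sentence; the tuple layout and the name `build_children` are assumptions made for illustration.

```python
from collections import defaultdict

# (id, word, head, deprel) for a few tokens of the example sentence.
tokens = [
    (11, "я", 12, "nsubj"),
    (12, "должен", 0, "root"),
    (13, "войти", 12, "xcomp"),
    (16, "боль", 13, "obl"),
    (24, ".", 12, "punct"),
]


def build_children(tokens):
    """Map each head id to the ids of its direct dependents."""
    children = defaultdict(list)
    for tok_id, _word, head, _deprel in tokens:
        children[head].append(tok_id)
    return children


children = build_children(tokens)
root_id = children[0][0]      # the token whose head is 0
print(root_id)                # 12
print(children[root_id])      # [11, 13, 24]
```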

Annotation

Morphological annotation:
- <tag> — POS tag (UD UPOS; 17 universal POS categories), e.g. NOUN (человек), VERB (приходит), ADP (в).
- <feat> — morphological features (UD FEATS) encoded as Feature=Value pairs separated by |. For example: Case=Nom|Gender=Masc|Number=Sing (typical for nouns/adjectives/pronouns), or Aspect=Imp|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act (typical for finite verbs). If no features are available for a token, the value is _.
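A feature bundle can be decoded into a dictionary with a few lines of code. The sketch below assumes the plain `|` separator described above and treats a comma inside a value (as in `PronType=Int,Rel` in the example sentence) as a list of alternative values; the function name `parse_feats` is hypothetical.

```python
from typing import Dict, List


def parse_feats(feat: str) -> Dict[str, List[str]]:
    """Decode 'Feature=Value|Feature=Value' into {feature: [values]}."""
    if feat == "_":          # no features available for this token
        return {}
    feats: Dict[str, List[str]] = {}
    for pair in feat.split("|"):
        name, _, value = pair.partition("=")
        feats[name] = value.split(",")   # a comma separates multiple values
    return feats


print(parse_feats("Case=Nom|Gender=Masc|Number=Sing"))
print(parse_feats("Case=Acc|PronType=Int,Rel"))
print(parse_feats("_"))
```

Returning a list for every feature keeps single-valued and multi-valued features uniform for the caller.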

<tag>: POS tags

The corpus uses 17 POS tags: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
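When processing the corpus outside the search interface, the closed tag inventory makes simple sanity checks possible. The sketch below is one hypothetical way to do this: it counts tags and collects any value that is not one of the 17 listed above. The names `UPOS_TAGS` and `tag_report` are illustrative only.

```python
from collections import Counter

UPOS_TAGS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN "
    "PUNCT SCONJ SYM VERB X".split()
)


def tag_report(tags):
    """Return (frequency of known tags, sorted list of unknown tags)."""
    counts = Counter(t for t in tags if t in UPOS_TAGS)
    unknown = sorted({t for t in tags if t not in UPOS_TAGS})
    return counts, unknown


# Tags of tokens 1-6 of the example sentence, plus one deliberate typo.
counts, unknown = tag_report(
    ["PART", "SCONJ", "NOUN", "VERB", "ADP", "PRON", "NOUNN"]
)
print(len(UPOS_TAGS))   # 17
print(counts["NOUN"])   # 1
print(unknown)          # ['NOUNN']
```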

ADJ — adjective (e.g. большой).
ADP — adposition (preposition/postposition), e.g. в, с.
ADV — adverb, e.g. быстро.
AUX — auxiliary, e.g. бы, быть (when used as auxiliary).
CCONJ — coordinating conjunction, e.g. и, но.
DET — determiner (pronominal modifier), e.g. этот, свой.
INTJ — interjection, e.g. увы.
NOUN — common noun, e.g. человек.
NUM — numeral, e.g. три, 10.
PART — particle, e.g. ведь, же, не.
PRON — pronoun, e.g. я, он, что.
PROPN — proper noun, e.g. Москва.
PUNCT — punctuation, e.g. the comma (,) and the full stop (.).
SCONJ — subordinating conjunction, e.g. что, чтобы, когда.
SYM — symbol, e.g. %, §.
VERB — lexical verb, e.g. приходить, понять.
X — other/unknown, foreign fragments, etc.

<feat>: Morphological categories and subcategories

The corpus uses UD-style feature bundles: Feature=Value pairs separated by |, for example: Case=Nom|Gender=Masc|Number=Sing.

In UD Russian SynTagRus, the feature inventory includes, among others, Animacy, Aspect, Case, Degree, Gender, Mood, Number, Person, PronType, Tense, Variant, VerbForm, and Voice.
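Feature bundles lend themselves to simple filtering once the corpus lines are loaded into a program. The sketch below is plain Python and is not the query syntax of the corpus search interface; the helper `has_features` and the small `rows` sample (taken from the example sentence) are assumptions for illustration.

```python
def has_features(feat: str, **wanted: str) -> bool:
    """True if the bundle contains every requested Feature=Value pair."""
    pairs = {} if feat == "_" else dict(p.split("=", 1) for p in feat.split("|"))
    # A stored value such as 'Int,Rel' matches either 'Int' or 'Rel'.
    return all(
        value in pairs.get(name, "").split(",")
        for name, value in wanted.items()
    )


# (word, feat) pairs from the example sentence.
rows = [
    ("ко", "_"),
    ("своей", "Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"),
    ("болью", "Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing"),
    ("что", "Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel"),
]

instrumental = [w for w, f in rows if has_features(f, Case="Ins", Gender="Fem")]
relatives = [w for w, f in rows if has_features(f, PronType="Rel")]
print(instrumental)   # ['своей', 'болью']
print(relatives)      # ['что']
```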

Feature bundles by POS (full lists)

The lists below give, for each POS tag, the morphological categories and values attested in UD Russian SynTagRus.

NOUN

Animacy: Anim, Inan.
Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
Gender: Masc, Fem, Neut.
Number: Sing, Plur.

VERB

Aspect: Imp, Perf.
Mood: Ind, Imp, Cnd.
- Ind — indicative (default/realis). Example: он читает.
- Imp — imperative (command/request). Example: читай!
- Cnd — conditional (hypothetical “would”). Example: он бы читал.
UD: Mood
Tense: Past, Pres, Fut.
VerbForm: Fin, Inf, Part, Conv.
- Fin — finite verb. Example: читает.
- Inf — infinitive. Example: читать.
- Part — participle. Examples: читающий, прочитанный.
- Conv — converb. Examples: читая, прочитав.
UD: VerbForm
Voice: Act, Pass, Mid.
- Act — active. Example: он читает книгу.
- Pass — passive. Examples: книга прочитана, книга была прочитана.
- Mid — middle. Examples: дверь открывается, книга читается легко.
UD: Voice
Number: Sing, Plur.
Person: 1, 2, 3.
Gender (past-tense forms and participles): Masc, Fem, Neut.

PRON

PronType:
- Prs — personal (e.g. я, он).
- Dem — demonstrative (e.g. это, тот).
- Int — interrogative (e.g. кто?, что?).
- Rel — relative (e.g. кто / что in “тот, кто…”).
- Ind — indefinite (e.g. кто-то, что-нибудь).
- Neg — negative (e.g. никто, ничто).
- Tot — total/universal (e.g. все, оба).
- Emp — emphatic.
UD: PronType
Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
Number: Sing, Plur.
Gender (where applicable): Masc, Fem, Neut.
Person (personal pronouns): 1, 2, 3.
Animacy (some pronouns): Anim, Inan.
Reflex (reflexive): Yes.

ADJ

Degree: Pos, Cmp, Sup.
Number: Sing, Plur.
Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
Gender: Masc, Fem, Neut.
Variant: Short (short-form adjectives, e.g. должен); see the statistics page for the values used in this treebank.
See also: ADJ statistics (ru_syntagrus), Degree, Variant.

ADP

Adpositions (mostly prepositions) are typically not inflected in UD Russian SynTagRus, so they usually have no morphological features (feat=_).

ADV

Degree: Pos, Cmp, Sup.
See also: ADV statistics (ru_syntagrus).

AUX

VerbForm: Fin, Part.
Mood: Ind.
Tense: Past.
Number: Sing, Plur.
Gender: Masc, Fem, Neut.
Voice: Act.

CCONJ

Coordinating conjunctions typically have no morphological features (feat=_).

DET

Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
Gender: Masc, Fem, Neut.
Number: Sing, Plur.
PronType: Dem, Ind, Int, Rel, Tot, Prs, among others.
Poss: Yes (possessive determiners).
Reflex: Yes (e.g. reflexive possessives like “свой”).

INTJ

Interjections typically have no morphological features (feat=_).

NUM

Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
Gender (where applicable): Masc, Fem, Neut.
Number (where applicable): Sing, Plur.
NumType — numeral type. Values: Card, Ord, Frac, Mult, Dist, Range, Sets. Examples: Card (три), Ord (третий), Frac (полтора). UD: NumType
NumForm — numeral form. Values: Word, Digit, Roman. Examples: Word (три), Digit (3), Roman (III). UD: NumForm

PART

Particles typically have no morphological features (feat=_).

PROPN

Case: Nom, Acc, Gen, Dat, Ins, Loc, Par, Voc.
Gender: Masc, Fem, Neut.
Number: Sing, Plur.
Animacy (where applicable): Anim, Inan.
NameType — named-entity subtype for proper names. Values: Geo, Prs, Giv, Sur, Com, Pro, Nat, Oth. Examples: Geo (Москва), Prs (Иван), Com (Газпром). UD: NameType

PUNCT

Punctuation tokens have no morphological features (feat=_).

SCONJ

Subordinating conjunctions typically have no morphological features (feat=_).

SYM

Symbols typically have no morphological features (feat=_).

X

The tag X is used for other or unknown tokens; morphological features are typically empty (feat=_).

Source documents

UD Russian SynTagRus treebank description: https://universaldependencies.org/treebanks/ru_syntagrus/index.html
Lyashevskaya, O. et al. (2016). Universal Dependencies for Russian: A New Syntactic Dependencies Tagset. Working paper (covers UD for Russian and the CoNLL‑U fields).