OrthRus

OrthRus is a large, multi-genre Russian corpus of Orthodox religious language compiled from multiple portals and content sections. The data were collected in December 2025 and January 2026; the current index was compiled in January 2026. It covers news, topical articles, and Q&A materials, with public subcorpora by genre (core/news/Q&A) and by portal (source website).

Corpus query interface (NoSketch Engine): https://noske.fisun.org/#dashboard?corpname=orthrus

Tagset and annotation notes: https://corpora.fisun.org/corpus-pages/tagset.html

Sources

Portal coverage

Portal coverage is intentionally uneven: the index is built around high-volume portals and complemented with smaller sources to increase genre and register diversity.

Portal              Tokens         Share (%)
azbyka.ru           143,317,221    61.592
patriarchia.ru       47,080,130    20.233
foma.ru              35,067,921    15.071
dialog.elitsy.ru      6,764,501     2.907
pravmir.ru              457,795     0.197
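The share column can be recomputed from the token counts. The following is a minimal sketch, not part of the corpus tooling; the names PORTAL_TOKENS and token_shares are made up for illustration, and the counts are copied from the table above.

```python
# Token counts copied from the portal coverage table.
PORTAL_TOKENS = {
    "azbyka.ru": 143_317_221,
    "patriarchia.ru": 47_080_130,
    "foma.ru": 35_067_921,
    "dialog.elitsy.ru": 6_764_501,
    "pravmir.ru": 457_795,
}


def token_shares(counts):
    """Return each entry's share of the total in percent, rounded to 3 decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 3) for name, n in counts.items()}


shares = token_shares(PORTAL_TOKENS)
for portal, share in shares.items():
    # Left-align the portal name, right-align the percentage.
    print(f"{portal:<18}{share:>8.3f}")
```

The five portal counts add up to the total index size reported under "Size (current index)", so the portals partition the corpus.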

Reach (where available)

Traffic and audience figures are heterogeneous across portals and are provided only as contextual information.

azbyka.ru

Azbyka.ru is a large Orthodox portal with multiple sections (library, catechesis, Q&A, thematic guides, calendars, and media). OrthRus includes selected Azbyka sections that match the corpus scope and the current compilation and annotation pipeline.

patriarchia.ru

Patriarchia.ru is presented as the official website of the Russian Orthodox Church (Moscow Patriarchate). The OrthRus selection from patriarchia.ru includes: News, Patriarch (news and messages related to the Patriarch), and official Patriarchate documents.

foma.ru

The foma.ru component is compiled from the Orthodox media project “Foma”. In OrthRus, foma.ru contributes a large share of editorial writing in a public media register and complements the institutional register of patriarchia.ru.

dialog.elitsy.ru

The dialog.elitsy.ru component represents Q&A-style interaction hosted within the “Elitsy” Orthodox social network.

pravmir.ru

In the current OrthRus index, pravmir.ru has limited token coverage because only the Q&A segment is included at this stage.

Coverage notes (azbyka.ru)

Azbyka.ru contains many sections. OrthRus includes only the parts that contain Russian-language Orthodox discourse and are compatible with the current compilation and annotation scope.

Included (current)

Excluded (current)

Public subcorpora

OrthRus provides public subcorpora along two dimensions: genre and portal. Genre subcorpora reflect the main site-section types used across portals.

By genre

By portal

Document structure

Units and segmentation

<doc> metadata fields

Metadata is extracted from the source websites and normalized where possible. Field availability varies by portal and content type; some values are noisy due to source heterogeneity.

Field           Meaning
doc.text_id     Internal unique identifier for the document in the corpus.
doc.url         Source URL of the document (when available).
doc.portal      Source portal label (azbyka / patriarchia / foma / elitsy / pravmir).
doc.title       Document title (headline) as provided by the source page.
doc.author      Author name string when present on the source page.
doc.status      Author status / role descriptor, often reflecting church hierarchy or institutional role (e.g., “иерей”, “священник”, “архиепископ”, “патриарх”). Values are automatically extracted and may be imperfect.
doc.genre       High-level genre label used for public subcorpora: core, news, qa.
doc.rubric      Rubric/section label as defined by the source site (site category or editorial rubric).
doc.pubdate     Full publication date (day-month-year) when available.
doc.pubyear     Publication year only.
doc.language    Language label when available (the OrthRus index targets Russian-language content).
doc.source      Source/bibliographic string when present (mainly for library-style materials with edition metadata).
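For readers who work with exported data, the sketch below shows one way to read such fields from a <doc> opening tag. It assumes the metadata is stored as quoted attributes on the tag, which is the usual layout for vertical files in NoSketch Engine corpora but is not confirmed here for OrthRus. The function name parse_doc_tag and the sample values are hypothetical.

```python
import re

# Matches name="value" pairs inside a tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# Matches a <doc> opening tag with optional attributes.
DOC_TAG_RE = re.compile(r"<doc(\s[^>]*)?>$")


def parse_doc_tag(line):
    """Parse a <doc ...> opening tag into a dict of attribute names and values."""
    line = line.strip()
    if not DOC_TAG_RE.match(line):
        raise ValueError("not a <doc> opening tag")
    return dict(ATTR_RE.findall(line))


# Hypothetical header line; the values are invented for illustration.
sample = '<doc text_id="example-0001" portal="pravmir" genre="qa" pubyear="2020">'
meta = parse_doc_tag(sample)
print(meta["portal"], meta["genre"])
```

Because field availability varies by portal, code that consumes these dictionaries should look fields up with meta.get("author") rather than assume every key is present.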

Linguistic annotation

The corpus is annotated in the Universal Dependencies (UD) framework using UDPipe (lemmatization and UPOS tagging; morphological features in UD FEATS format).

Tagset and annotation notes: https://corpora.fisun.org/corpus-pages/tagset.html
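UD FEATS values are strings of Name=Value pairs joined by a vertical bar, with an underscore for an empty feature set. The sketch below splits such a string into a dictionary. The function name parse_feats is hypothetical, the example string is a typical UD analysis of a Russian masculine noun and is not taken from the corpus, and the way OrthRus exposes FEATS in the query interface is described in the tagset notes rather than here.

```python
def parse_feats(feats):
    """Split a UD FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats in ("", "_"):
        # UD uses an underscore when a token has no morphological features.
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative FEATS value, not copied from OrthRus.
example = "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"
features = parse_feats(example)
print(features["Case"], features["Number"])
```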

Size (current index)

Tokens                            232,687,568
Documents (<doc>)                     223,666
Sentences (<s>)                    13,571,482
Q&A answer segments (<answer>)         92,741
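Rough averages follow directly from these counts. The sketch below derives mean tokens per document and per sentence. These are simple ratios computed from the figures above, not statistics reported by the maintainer, and the names used are hypothetical.

```python
# Counts copied from the size table.
SIZE = {
    "tokens": 232_687_568,
    "documents": 223_666,
    "sentences": 13_571_482,
    "answers": 92_741,
}

# Simple means; they say nothing about the spread across portals or genres.
tokens_per_document = SIZE["tokens"] / SIZE["documents"]
tokens_per_sentence = SIZE["tokens"] / SIZE["sentences"]

print(f"mean tokens per document: {tokens_per_document:.1f}")
print(f"mean tokens per sentence: {tokens_per_sentence:.2f}")
```

Since the corpus mixes library-style texts with short Q&A items, these means are best read as orientation values only.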

Minimal corpus profile

OrthRus is dominated by core article sections (80.7% of tokens). News accounts for 12.5% of tokens, while Q&A accounts for 6.8% and provides a distinct dialogic register. The portal contribution is intentionally uneven and reflects both the underlying portal sizes and the selective coverage policy documented above.

Subcorpus sizes (tokens)

Subcorpus     Tokens         Share (%)
genre_core    187,745,213    80.686
genre_news     29,148,085    12.527
genre_qa       15,794,270     6.788
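The one-decimal percentages quoted in the corpus profile can be checked against this table. The sketch below is an illustrative consistency check with hypothetical names. It confirms that the three genre subcorpora add up to the full index and rounds each share to one decimal place.

```python
# Token counts copied from the subcorpus table; total from the size table.
GENRE_TOKENS = {
    "genre_core": 187_745_213,
    "genre_news": 29_148_085,
    "genre_qa": 15_794_270,
}
TOTAL_TOKENS = 232_687_568

# The genre subcorpora should partition the index with no overlap or remainder.
genres_cover_index = sum(GENRE_TOKENS.values()) == TOTAL_TOKENS

# Shares in percent, rounded to one decimal as in the profile paragraph.
profile_shares = {
    name: round(100 * n / TOTAL_TOKENS, 1) for name, n in GENRE_TOKENS.items()
}

print(genres_cover_index)
print(profile_shares)
```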

How to cite

Fisun, Roman. 2026. OrthRus: Russian Orthodox Web Corpus. Compiled from azbyka.ru, patriarchia.ru, foma.ru, dialog.elitsy.ru, pravmir.ru (multi-genre; news, core articles, and Q&A; selective coverage as documented; data collected in December 2025 and January 2026). Available at: https://corpora.fisun.org/ (corpus name: orthrus). Accessed: <YYYY-MM-DD>.

Software

Terms of use

Access to this corpus is restricted (password-protected) and provided on an “as is” basis for research and educational use only. This service does not grant any license or other rights to the underlying texts.

The corpus (including any excerpts, downloads, or derived copies of the original texts) is not freely distributable. Reproduction, redistribution, republication, mirroring, or making the content publicly available is prohibited unless you have explicit permission from the respective rights holders and/or the source websites.

Copyright and any other rights in the original texts remain with the respective source websites and/or their authors. Users are solely responsible for ensuring that any use complies with applicable law and the terms of the original sources.

The maintainer makes no warranties regarding completeness, accuracy, fitness for a particular purpose, or continued availability of the service.

Contact

Maintainer: roman.fisun@ur.de