OrthRus
OrthRus is a large, multi-genre Russian corpus of Orthodox religious language compiled from multiple portals and content sections.
The data were collected in December 2025 and January 2026; the current index was compiled in January 2026.
It covers news, topical articles, and Q&A materials, with public subcorpora by genre (core/news/Q&A) and by portal (source website).
Sources
Portal coverage
Portal coverage is intentionally uneven: the index is built around high-volume portals and complemented with smaller sources to increase genre and register diversity.
| Portal | Tokens | Share (%) |
| azbyka.ru | 143,317,221 | 61.592 |
| patriarchia.ru | 47,080,130 | 20.233 |
| foma.ru | 35,067,921 | 15.071 |
| dialog.elitsy.ru | 6,764,501 | 2.907 |
| pravmir.ru | 457,795 | 0.197 |
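The share column can be reproduced from the token counts. The following is a minimal Python sketch, assuming each share is computed against the sum of the five portal counts; the variable names are illustrative and not part of any OrthRus tooling.

```python
# Token counts per portal, copied from the table above.
portal_tokens = {
    "azbyka.ru": 143_317_221,
    "patriarchia.ru": 47_080_130,
    "foma.ru": 35_067_921,
    "dialog.elitsy.ru": 6_764_501,
    "pravmir.ru": 457_795,
}

# Assumption: shares are relative to the sum of all portal counts.
total_tokens = sum(portal_tokens.values())

# Share of each portal in percent, rounded to three decimals as in the table.
portal_shares = {
    portal: round(100 * count / total_tokens, 3)
    for portal, count in portal_tokens.items()
}

for portal, share in portal_shares.items():
    print(f"{portal:<18} {portal_tokens[portal]:>12,} {share:>7.3f}")
```

The portal counts add up to the corpus total reported under "Size (current index)", so the five portals partition the index without overlap.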
Reach (where available)
Traffic and audience figures are heterogeneous across portals and are provided only as contextual information.
- azbyka.ru: 27.8M total visits (Similarweb estimate; snapshot shown for February 2025).
- patriarchia.ru: 783.3K total visits (Similarweb estimate; snapshot shown for February 2025).
- foma.ru: 3.4M total visits (Similarweb estimate; snapshot shown for February 2025); the project itself reports ~2.5M monthly unique visitors to foma.ru and ~17M monthly total reach across platforms (December 2025).
- pravmir.ru: 4M total visits (Similarweb estimate; snapshot shown for February 2025).
azbyka.ru
Azbyka.ru is a large Orthodox portal with multiple sections (library, catechesis, Q&A, thematic guides, calendars, and media).
OrthRus includes selected Azbyka sections that match the corpus scope and the current compilation and annotation pipeline.
patriarchia.ru
Patriarchia.ru is presented as the official website of the Russian Orthodox Church (Moscow Patriarchate).
The OrthRus selection from patriarchia.ru includes: News, Patriarch (news and messages related to the Patriarch), and official Patriarchate documents.
foma.ru
The foma.ru component is compiled from the Orthodox media project “Foma”.
In OrthRus, foma.ru contributes a large share of editorial writing in a public media register and complements the institutional register of patriarchia.ru.
dialog.elitsy.ru
The dialog.elitsy.ru component represents Q&A-style interaction hosted within the “Elitsy” Orthodox social network.
pravmir.ru
In the current OrthRus index, pravmir.ru has limited token coverage because only the Q&A segment is included at this stage.
Coverage notes (azbyka.ru)
Azbyka.ru contains many sections. OrthRus includes only the parts that contain Russian-language Orthodox discourse and are compatible with the current compilation and annotation scope.
Included (current)
- Otechnik (library): only authors born in the 20th century are included in the current index.
- Church law (Pravo).
- Way to God (Way) and Catechesis (Katekhizatsiya).
- Questions (Vopros): Q&A materials.
- Guides (Shemy), Sermons (Propovedi), Icons (Ikona).
- Sueveriyam.net, Art, Apokalipsis.
- Church calendar (Days): only saint entries that include icon images and descriptive text.
- Family sections: Marriage (Semya), Parenting (Deti), Health (Zdorovie).
Excluded (current)
- Azbyka Vernosti (dating): excluded because it belongs to a non-religious service genre and is outside the corpus scope.
- Azbyka Sadovoda (gardening): excluded because it is topic-wise unrelated to Orthodox religious discourse and would distort the thematic profile of the corpus.
- Prayers and liturgical texts: excluded because the content is predominantly in Church Slavonic; OrthRus targets Russian-language discourse in the current index.
- Forum: not included yet; planned for future integration.
Public subcorpora
OrthRus exposes public subcorpora along two dimensions: genre and portal. Genre subcorpora reflect the main site-section types used across portals.
By genre
genre_core: core editorial sections, mainly articles and explanatory texts with an Orthodox focus.
genre_news: news sections and chronicle-style updates.
genre_qa: Q&A discourse (questions and published answers), including pastoral-style consultations and expert replies.
By portal
portal_azbyka: selected sections from azbyka.ru (see coverage notes).
portal_patriarchia: patriarchia.ru News, Patriarch, and official documents.
portal_foma: foma.ru.
portal_elitsy: dialog.elitsy.ru.
portal_pravmir: pravmir.ru (currently Q&A only).
Document structure
Units and segmentation
<doc> is the main document unit: one page or item from a source portal (article, news item, document page, or Q&A entry).
<s> marks sentence boundaries and is used for sentence-level querying and statistics.
<question> and <answer> are used only when a document is Q&A-like; non-Q&A genres do not contain these segments.
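To make the nesting of these units concrete, here is a small Python sketch. It assumes a well-formed XML-like export in which <question> and <answer> sit directly under <doc> and contain <s> elements; the sample document, its attribute names, and the function `summarize` are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; attribute names, nesting, and text are
# assumptions made for illustration only.
SAMPLE = """
<doc text_id="demo-1" portal="pravmir" genre="qa">
  <question><s>Можно ли поститься в путешествии?</s></question>
  <answer><s>Да, можно.</s><s>Меру поста лучше обсудить со священником.</s></answer>
</doc>
"""


def summarize(doc_xml: str) -> dict:
    """Count sentence units and detect Q&A segments in one <doc>."""
    root = ET.fromstring(doc_xml.strip())
    has_qa = root.find("question") is not None or root.find("answer") is not None
    return {
        "text_id": root.get("text_id"),
        "is_qa": has_qa,
        # All <s> elements anywhere inside the document.
        "sentences": len(list(root.iter("s"))),
        # Only <s> elements nested inside <answer> segments.
        "answer_sentences": sum(len(list(a.iter("s"))) for a in root.iter("answer")),
    }


summary = summarize(SAMPLE)
print(summary)
```

For a non-Q&A document the same function would report `is_qa` as false and zero answer sentences, which matches the rule that these segments appear only in Q&A-like documents.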
<doc> metadata fields
Metadata is extracted from the source websites and normalized where possible. Field availability varies by portal and content type; some values are noisy due to source heterogeneity.
| Field | Meaning |
| doc.text_id | Internal unique identifier for the document in the corpus. |
| doc.url | Source URL of the document (when available). |
| doc.portal | Source portal label (azbyka / patriarchia / foma / elitsy / pravmir). |
| doc.title | Document title (headline) as provided by the source page. |
| doc.author | Author name string when present on the source page. |
| doc.status | Author status / role descriptor, often reflecting church hierarchy or institutional role (e.g., “иерей”, “священник”, “архиепископ”, “патриарх”). Values are automatically extracted and may be imperfect. |
| doc.genre | High-level genre label used for public subcorpora: core, news, qa. |
| doc.rubric | Rubric/section label as defined by the source site (site category or editorial rubric). |
| doc.pubdate | Full publication date (day-month-year) when available. |
| doc.pubyear | Publication year only. |
| doc.language | Language label when available (the OrthRus index targets Russian-language content). |
| doc.source | Source/bibliographic string when present (mainly for library-style materials with edition metadata). |
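Because field availability varies, code that consumes this metadata should treat most fields as optional. The sketch below models the fields as a Python dataclass and filters records by genre, portal, and year. The class `DocMeta`, the function `select`, and the sample records are hypothetical; the corpus itself does not ship this code, and the choice of which fields are always present is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class DocMeta:
    # Assumption: only these three fields are always present.
    text_id: str
    portal: str
    genre: str
    # Everything else may be missing, depending on portal and content type.
    url: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    status: Optional[str] = None
    rubric: Optional[str] = None
    pubdate: Optional[str] = None
    pubyear: Optional[int] = None
    language: Optional[str] = None
    source: Optional[str] = None


def select(
    docs: Iterable[DocMeta],
    *,
    genre: Optional[str] = None,
    portal: Optional[str] = None,
    min_year: Optional[int] = None,
) -> Iterator[DocMeta]:
    """Yield documents matching every filter that was actually given."""
    for doc in docs:
        if genre is not None and doc.genre != genre:
            continue
        if portal is not None and doc.portal != portal:
            continue
        # A document without a known year cannot satisfy a year filter.
        if min_year is not None and (doc.pubyear is None or doc.pubyear < min_year):
            continue
        yield doc


# Invented records for demonstration only.
docs = [
    DocMeta("demo-001", "pravmir", "qa", pubyear=2019, status="священник"),
    DocMeta("demo-002", "foma", "core", pubyear=2021),
    DocMeta("demo-003", "azbyka", "qa"),  # publication year missing
]

recent_qa = [d.text_id for d in select(docs, genre="qa", min_year=2015)]
print(recent_qa)
```

Dropping documents with a missing year when a year filter is set is a design choice in this sketch; an analysis that must not lose data could instead report them separately.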
Linguistic annotation
The corpus is annotated in the Universal Dependencies (UD) framework using UDPipe (lemmatization and UPOS tagging; morphological features in UD FEATS format).
Tagset and annotation notes:
https://corpora.fisun.org/corpus-pages/tagset.html
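UD FEATS values are strings of `Name=Value` pairs joined by `|`, with `_` standing for an empty feature set. The following Python sketch splits such a string into a dictionary; the helper `parse_feats` and the example value are illustrative, and how the corpus interface exposes the FEATS attribute is not specified here.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a UD FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    # "_" is the CoNLL-U placeholder for "no features".
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative feature string for a masculine singular nominative noun.
example = parse_feats("Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing")
print(example["Case"], example["Number"])
```

Keeping the features as a dictionary makes it straightforward to test a single category (for example, case) without matching against the full string.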
Size (current index)
| Unit | Count |
| Tokens | 232,687,568 |
| Documents (<doc>) | 223,666 |
| Sentences (<s>) | 13,571,482 |
| Q&A answer segments (<answer>) | 92,741 |
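A few rough averages follow directly from these counts. The Python sketch below derives them; they are corpus-wide means only and presumably hide large differences between genres and portals, which the totals alone cannot show.

```python
# Counts copied from the size figures above.
tokens = 232_687_568
documents = 223_666
sentences = 13_571_482

# Corpus-wide means; per-genre values are not derivable from these totals.
tokens_per_doc = tokens / documents
tokens_per_sentence = tokens / sentences
sentences_per_doc = sentences / documents

print(f"tokens per document:    {tokens_per_doc:.1f}")
print(f"tokens per sentence:    {tokens_per_sentence:.2f}")
print(f"sentences per document: {sentences_per_doc:.1f}")
```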
Minimal corpus profile
OrthRus is dominated by core article sections (80.7% of tokens). News accounts for 12.5% of tokens, while Q&A accounts for 6.8% and provides a distinct dialogic register.
The portal contribution is intentionally uneven and reflects both the underlying portal sizes and the selective coverage policy documented above.
Subcorpus sizes (tokens)
| Subcorpus | Tokens | Share (%) |
| genre_core | 187,745,213 | 80.686 |
| genre_news | 29,148,085 | 12.527 |
| genre_qa | 15,794,270 | 6.788 |
How to cite
Fisun, Roman. 2026. OrthRus: Russian Orthodox Web Corpus.
Compiled from azbyka.ru, patriarchia.ru, foma.ru, dialog.elitsy.ru, pravmir.ru (multi-genre; news, core articles, and Q&A; selective coverage as documented; data collected in December 2025 and January 2026).
Available at: https://corpora.fisun.org/ (corpus name: orthrus). Accessed: <YYYY-MM-DD>.
Software
Terms of use
Access to this corpus is restricted (password-protected) and provided on an “as is” basis for research and educational use only. This service does not grant any license or other rights to the underlying texts.
The corpus (including any excerpts, downloads, or derived copies of the original texts) is not freely distributable. Reproduction, redistribution, republication, mirroring, or making the content publicly available is prohibited unless you have explicit permission from the respective rights holders and/or the source websites.
Copyright and any other rights in the original texts remain with the respective source websites and/or their authors. Users are solely responsible for ensuring that any use complies with applicable law and the terms of the original sources.
The maintainer makes no warranties regarding completeness, accuracy, fitness for a particular purpose, or continued availability of the service.
Contact
Maintainer: roman.fisun@ur.de