Overview
The ParlText CEE dataset, developed as part of the V-Shift Momentum project at the HUN-REN Centre for Social Sciences, is designed to address the significant gaps in the availability and accessibility of political research data for Central-Eastern Europe (CEE). Many existing datasets, while valuable, are limited in their metadata scope, geographical coverage, or time frames, making comprehensive comparative research across CEE countries challenging. ParlText CEE seeks to overcome these limitations by offering a more extensive, standardized collection of data covering four CEE countries: Czechia, Hungary, Poland, and Slovakia. The dataset includes nearly 1.3 million text vectors and metadata on parliamentary speeches, bills, and laws, with a time frame extending from the democratic transitions of 1990-1991 to 2022-2024.
The ParlText CEE repository is available at (…). The ParlText CEE database was built on an open-science framework. All data are published in public repositories under the CC BY-NC license (Attribution-NonCommercial 4.0 International); the version published there is the only official version of the dataset.
Coverage
Despite the perception that CEE countries form a relatively homogenous group, they differ significantly in aspects such as social diversity, party systems, institutional settings or policy-making. These differences are critical for comparative political research, but existing legislative datasets for the region have often lacked the scope or structure needed to support such studies. Additionally, while public legislative archives are available for the region, they are often difficult for researchers to navigate due to inconsistent metadata structures, lack of user-friendly APIs, and challenges in web scraping and data cleaning.
ParlText CEE bridges this gap by providing a deployment-ready database with data on the unicameral legislatures of Hungary and Slovakia and the lower chambers of the bicameral legislatures in Czechia (Chamber of Deputies) and Poland (Sejm). This dataset offers broader coverage in both time and metadata than previous databases. It includes a relational database structure that integrates distinct subcorpora for speeches, bills, and laws. By linking bills to the relevant speeches and laws through a unique identifier based on the ParlLawSpeech standard, the dataset allows users to trace legislative processes from inputs to their final legal outcomes.
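The tracing idea described above can be illustrated with a minimal Python sketch. It uses toy in-memory records rather than the real .RDS files; the identifier values, the `law_ids` list field, and the `trace` helper are all hypothetical and chosen for illustration only. The one element taken from the dataset is the shared `law_id` key that connects bills, laws, and speeches. How the real files store several bill identifiers for one speech is not shown here, so the sketch simply assumes a list.

```python
from collections import defaultdict

# Toy records; all identifiers and titles below are made up for illustration.
bills = [
    {"law_id": "HU_2010_T1", "title": "Bill on example matters"},
    {"law_id": "HU_2010_T2", "title": "Bill that was never adopted"},
]
laws = [
    {"law_id": "HU_2010_T1", "title": "Act on example matters"},
]
# Assumption: a speech may relate to several bills, stored here as a list.
speeches = [
    {"id": "s1", "speaker": "Speaker A", "law_ids": ["HU_2010_T1"]},
    {"id": "s2", "speaker": "Speaker B", "law_ids": ["HU_2010_T1", "HU_2010_T2"]},
    {"id": "s3", "speaker": "Speaker C", "law_ids": []},
]


def trace(bills, laws, speeches):
    """For each bill, collect the related speech ids and whether a law exists."""
    adopted_ids = {law["law_id"] for law in laws}
    speeches_by_bill = defaultdict(list)
    for speech in speeches:
        for law_id in speech["law_ids"]:
            speeches_by_bill[law_id].append(speech["id"])
    return {
        bill["law_id"]: {
            "speeches": list(speeches_by_bill.get(bill["law_id"], [])),
            "adopted": bill["law_id"] in adopted_ids,
        }
        for bill in bills
    }


result = trace(bills, laws, speeches)
for law_id, info in result.items():
    print(law_id, info)
```

In this toy example the first bill is linked to two speeches and one adopted law, while the second bill is linked to one speech and has no corresponding law.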
The dataset was compiled through a labor-intensive web scraping process tailored specifically to each country’s parliamentary archives and legal databases. The data collection process ensured that the ParlText CEE dataset is not only comprehensive but also structured to meet the specific needs of empirical researchers, providing machine-readable data suitable for large-scale text analysis.
Country | Code | Parliament | Time period |
Czechia | CZ | Poslanecká sněmovna Parlamentu České republiky | 1990-2023 |
Hungary | HU | Országgyűlés | 1990-2022 |
Poland | PL | Sejm | 1991-2023 |
Slovakia | SK | Národná rada Slovenskej republiky | 1990-2023 |
Data Structure
For each of the included polities, ParlText provides three separate data files: one for bills, one for laws, and one for plenary speeches.
The files share the following core columns; individual files may include additional metadata:
- Corpus_bills_[country].RDS: Data on legislative bills tabled in the respective parliament.
  - law_id: Identification number of the bill document, following the conventions of the respective parliament; for adopted bills it is identical to the law ID. It can be used for linking bills, laws and speeches.
  - bill_link: Link to the bill’s data sheet on the official parliamentary website.
  - electoral_cycle: A string containing the electoral cycle during which the bill was introduced.
  - title: The title of the bill.
  - no_document: The number under which the bill was introduced.
  - date_introduced: The date of the bill’s introduction.
  - text: Full text of the bill.
- Corpus_laws_[country].RDS: Data on laws adopted by the respective parliament.
  - law_id: Identification number of the law document, following the conventions of the respective parliament; it is identical to the bill ID. It can be used for linking bills, laws and speeches.
  - law_link: Link to the law’s data sheet on the official parliamentary website.
  - electoral_cycle: A string containing the electoral cycle during which the bill was introduced.
  - title: The title of the law.
  - year_published: The year of the law’s publication.
  - no_published: The number under which the law was published.
  - date_published: The date of the law’s publication.
  - law_text: Full text of the law.
- Corpus_speeches_[country].RDS: Data on plenary speeches in the respective parliament (in chronological order across and within sessions).
  - id: The unique identifier of each individual speech.
  - link: The link to the given speech on the official parliamentary website.
  - agenda: The agenda item under which the speech was given.
  - electoral_cycle: A string containing the electoral cycle during which the speech was given.
  - speechnumber: The number of the speech within the given session.
  - speaker: The name of the speaker as it appears on the official parliamentary website.
  - date: The date of the speech.
  - chair: A dummy variable indicating whether the speech was given by the chair.
  - law_id: Identification number of the bill document discussed in the speech (based on the agenda item). If multiple bills are related to the speech, all of them are included.
Unless otherwise indicated, all variables are provided as UTF-8 encoded strings. The lists above show the minimal data structure that is available across all countries and periods. In a growing number of cases, individual ParlText files include additional metadata, for example on the bill introducer’s position or the speaker’s party.
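As a hedged illustration, the minimal column lists above can be written down as a small lookup and used to check whether a file exposes the expected columns. The Python sketch below assumes the column names of a file have already been obtained by some other means; the `MINIMAL_COLUMNS` mapping and the `missing_columns` helper are hypothetical names introduced here, not part of the dataset.

```python
# Minimal columns per corpus type, copied from the lists above.
MINIMAL_COLUMNS = {
    "bills": [
        "law_id", "bill_link", "electoral_cycle", "title",
        "no_document", "date_introduced", "text",
    ],
    "laws": [
        "law_id", "law_link", "electoral_cycle", "title",
        "year_published", "no_published", "date_published", "law_text",
    ],
    "speeches": [
        "id", "link", "agenda", "electoral_cycle", "speechnumber",
        "speaker", "date", "chair", "law_id",
    ],
}


def missing_columns(corpus, columns):
    """Return the minimal columns of `corpus` that are absent from `columns`."""
    present = set(columns)
    return [name for name in MINIMAL_COLUMNS[corpus] if name not in present]


# Extra metadata columns (here a hypothetical "party") do not count as errors.
print(missing_columns("bills", MINIMAL_COLUMNS["bills"] + ["party"]))
# A file lacking some of the minimal columns is reported column by column.
print(missing_columns("laws", ["law_id", "title", "law_text"]))
```

Because files may carry additional metadata, the check only reports missing minimal columns and ignores any extra ones.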
Overview of the Available Datasets
The data were initially compiled as .rds files for use in the free and open-source R environment. Users working in other environments can export them from R to other formats, using either base R’s export functions or add-on packages such as haven or feather. The table below summarizes the individual data files provided in ParlText CEE, indicating the number of variables offered and the number of documents (bills, laws, speeches) therein.
Country | File | Variables | Count |
Czechia | parltext_bills_czechia.RDS | 7 | 12,471 |
Czechia | parltext_laws_czechia.RDS | 8 | 3,179 |
Czechia | parltext_speeches_czechia.RDS | 11 | 493,872 |
Hungary | parltext_bills_hungary.RDS | 7 | 8,220 |
Hungary | parltext_laws_hungary.RDS | 8 | 4,733 |
Hungary | parltext_speeches_hungary.RDS | 11 | 564,655 |