About Debate a Base

The database

Parliamentary data

The debates originate from the ParlaMint dataset created by the CLARIN European Research Infrastructure Consortium, which is funded by the countries that have joined the CLARIN initiative. This dataset makes it possible to compare parliamentary debates between countries.

It consists of debate data from 25 countries and can be browsed on the debates page. The entire dataset consists of 7.5 million speeches, which are stored in a single index.

The data for every country is available in both the original language and English. In total, 27 languages have been translated, and the source code can be found on Taja Kuzman's GitHub.

Ngrams data

Every Ngram has been precalculated from the parliamentary data to ensure a swift user experience. Ngrams are generated with a maximum length of five words.
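As a rough illustration of what precalculating involves, the sketch below counts every Ngram of one to five words in a piece of text. The function name extract_ngrams and the lowercase, whitespace-based tokenization are assumptions made for this example; the actual pipeline may tokenize and count differently.

```python
from collections import Counter

MAX_NGRAM_LENGTH = 5  # maximum Ngram length, as stated above


def extract_ngrams(text, max_length=MAX_NGRAM_LENGTH):
    """Count every Ngram of 1..max_length words in `text`.

    Assumption: words are lowercased and split on whitespace.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_length + 1):
        # Slide a window of n words over the text.
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts


counts = extract_ngrams("the right to vote is the right")
```

A seven-word speech already yields 25 Ngrams (7 + 6 + 5 + 4 + 3), which gives a sense of why the total runs into the billions across millions of speeches.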

The search engine consists of 5.8 billion Ngrams spread over five different indexes, one for each Ngram length in words. Ngrams of different lengths are stored separately so that a search has to go through as few Ngrams as possible.
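The idea of one index per Ngram length can be sketched as follows: the number of words in a search term decides which single index is consulted, so the other four are never touched. The index names (ngrams_1 to ngrams_5) and the function index_for_query are hypothetical and only serve this example.

```python
MAX_NGRAM_LENGTH = 5  # one index per Ngram length, 1 to 5 words


def index_for_query(query, max_length=MAX_NGRAM_LENGTH):
    """Return the (hypothetical) index name that holds Ngrams of the query's length."""
    word_count = len(query.split())
    if not 1 <= word_count <= max_length:
        raise ValueError(
            f"query must contain 1 to {max_length} words, got {word_count}"
        )
    # Hypothetical naming scheme: ngrams_1 ... ngrams_5
    return f"ngrams_{word_count}"


single_word_index = index_for_query("democracy")
three_word_index = index_for_query("rule of law")
```

Under this scheme, a search for "democracy" only scans the single-word index, and a search for "rule of law" only scans the three-word index.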

Because of the sheer amount of data, only the English translations for each country are indexed. Even so, this amounts to 246 gigabytes and requires a lot of processing power to stay fast.

Debates

The debates page gives access to a large variety of debates that can be searched using a simple interface. More info on the debates page.
Don't know where to start? Click the button below for an example.

Debates: "elections"

Ngrams

The Ngrams page gives an overview of the relevance of a specified term for a specific country on a specific date. More info on the Ngrams page.
Don't know where to start? Click the button below for an example.

Ngrams: "democracy"