Part I: Who wrote the last coalition agreement in Germany?
Using embeddings and cosine similarity, comparing the coalition agreement with party programs, I try to determine who wrote the largest part of the last coalition agreement.
In the following blog entry, I will try to find out which of the parties taking part in the recently broken up coalition wrote which part of the agreement that now lies in shambles. For that I will read out the party programs of each party, embed them together with the coalition agreement in a common vector space and calculate a sentence-wise cosine-similarity. This allows us to find for each sentence in the coalition agreement the most similar sentence in each party program.
Part II will look at whether the content of the sentences that we can attribute to single parties lies in what we would determine to be their core-issues, such as ecology for the greens. For that we will us a pretrained transformer model of the manifesto project to classify all sentences of the agreement.
Why are coalition agreements important?
In Germany, two factors make it a frequent phenomenon that coalitions of different parties govern the country. Firstly, Germany has a proportional voting system, following the decades old Duverger´s law making it a candidate for a more or less fragmented multi-party system (Chapman 1955), hence seldomly does one party gain a majority to govern with sovereignty, and secondly, Germany is a parlamentarian government system, which makes functioning governments without a majority in the legislative chamber unlikely (Steffani 1979). Hence, multiple parties need to agree on a coalition to establish a majority to form a government.
The coalition agreement is a legally non-binding contract between the participating parties as to which legislative projects they are going to develop in the years they will govern together. It also establishes the division of portfolios, which parties gets which ministries. We will focus on the content of the coalition agreement and try to see, which party might have “dominated” the coalition agreement in terms of textual contributions and similarity to their party program.
There is obvious limits to this, as we can establish textual similarity, but we cannot establish the importance of given sentences. Having introduced a sentence that promises to “make state actions more efficient and faster” (Wir wollen staatliches Handeln schneller und effektiver machen und besser auf künftige Krisen vorbereiten., p.5) is clearly not as significant as a sentence that reads: “We want 30 percent organic farming by 2030” (Wir wollen 30 Prozent Ökolandbau bis zum Jahr 2030 erreichen., p. 46), which includes a very tangible goal.
Reading in the PDFs
We are going to read in the texts, using the tabulapdf package, because it does a really good job at detecting and correctly reading in two-column pdfs. We do some basic cleaning, taking out some PDF footnotes, headers and repeating titles that are not very interesting to us.
Show the code
## get a list of pdf filespdfs<-list.files('raw_data', full.names =T, pattern =".pdf")## read in each of the pdfstexts<-map_dfr(pdfs, ~{tabulapdf::extract_text(.x)%>%tibble(text =., party =.x)})clean_text<-function(text){# First, replace only the problematic "- " cases with a placeholdertext<-str_replace_all(text, "- (?=[a-z])", "PLACEHOLDER")# Restore legitimate cases where it should be retainedtext<-str_replace_all(text, "PLACEHOLDER(?=als\\b|auch\\b|sondern\\b|oder\\b|noch\\b|wie\\b|und\\b)", "- ")# Remove the placeholder from other positionstext<-str_replace_all(text, "PLACEHOLDER", "")return(text)}texts_clean<-texts%>%mutate(text =str_squish(text)|>str_trim()|>str_remove_all('Bereit, weil Ihr es seid.')|>str_remove_all('Bundestagswahlprogramm 2021')|>str_remove_all('Das Zukunftsprogramm der SPD')|>str_remove_all('BÜNDNIS 90 / DIE GRÜNEN')|>str_remove_all('SPD-Parteivorstand 2021')|>clean_text())%>%mutate(party =str_remove_all(party, '^raw_data/')%>%str_remove(.,'\\.pdf'))texts_collapsed<-texts_clean%>%group_by(party)%>%summarise(text =paste0(text, collapse =' '))
Frequent words
Having these party programs in theory allows for a myriad of different interesting metrics, like how diverse is their vocabulary, how complicated is their language, how is their readibility rated, which topics are most prominently present and uncountable possibilities more. We are just going to look at the most frequent words to get a feel for the documents and some obvious differences in language between the parties.
Gendering: the first thing that is very evident, is the fact that “innen” is such a prominent word for SPD and GRUENE, but not for the FDP. This is not related to some form of domestic politics or inner-something, but the fact that SPD and GRUENE use gender-sensitive language, accounting for all genders in the word “citizen” in german reads as follow: “Bürger:innen” or “Bürger*innen”.
Fordern vs Fördern Words that seem similar, but are very different. The FDP puts an emphasis on “fordern”, which means “demanding”, whereas SPD and GRUENE put a stronger emphasis on “Fördern”, so “promote” or “foster”.
Also tipical and something we would expect from the small government promoters FDP is the focus on companies, or “Unternehmen”.
EU vs europäisch vs. Europa The difference in the most frequent reference to the European Union versus the idea of a european identity also seems worth mentioning.
All these observations need to be interpreted with care though, just looking at the (relative) frequency of words, can bear little meaning, but it gives us a feeling for the policy documents we are looking at.
library(ggplot2)library(tidytext)corpus<-corpus(texts_collapsed)docnames(corpus)<-texts_collapsed$partytokens<-tokens(corpus, remove_punct =T)%>%tokens_remove(stopwords::stopwords('de'))dfm<-dfm(tokens)excluded_words<-c("ka", "pi","te","l", ">", "2021", "f", "ff", "seite", "kapitel", "dass","ab", "demokraten", "freie")freq<-quanteda.textstats::textstat_frequency(dfm, group=party)|>filter(!feature%in%excluded_words)|>slice_max(order_by =frequency, n =15,by =group)# Prepare the data for plottingplot_data<-freq%>%arrange(group, desc(frequency))|>filter(!group=="koalition")# Plotggplot(plot_data, aes(frequency, reorder_within(feature, frequency, group), fill =group))+geom_col(show.legend =FALSE)+geom_label(aes(label =feature), label.size =NA, color ="white", fontface ="bold", size =5, position =position_stack(vjust =0.5), show.legend =FALSE)+scale_y_reordered()+facet_wrap(~toupper(group), scales ="free", ncol =3)+theme_minimal(base_size =14)+theme( strip.text =element_text(face ="bold"), panel.spacing =unit(1, "lines"), axis.text.y =element_blank(), axis.ticks =element_blank(), axis.title =element_blank(), plot.title =element_text(hjust =0.5, face ="bold"), panel.grid.major.y =element_blank(), # Remove vertical grid lines panel.grid.minor.y =element_blank()# Remove minor grid lines)+scale_fill_manual(values =list("fdp"=paletteer::paletteer_d("LaCroixColoR::Pamplemousse")[3],"gruene"=paletteer::paletteer_d("LaCroixColoR::Pamplemousse")[4],"spd"=paletteer::paletteer_d("LaCroixColoR::Pamplemousse")[1]))+labs(caption =glue("Excluded the words: {paste0(excluded_words, collapse = ', ')}")|>str_wrap(50))
Splitting sentences
In the next step, since we want to determine authorship on the sentence-level, we need to split the entire text into single sentences. The tidytext package has a handy function for this, it is as simple as the following:
sentences<-texts_clean|>tidytext::unnest_sentences(output ="text", ## specify the name of the output column input ="text")## specify the name of the input column
We can make a quick descriptive table, comparing the length of each single document in terms of sentences and see that the program of the greens is by far the longest (in terms of sentences).
Extracted with tabulapdf and tokenized into sentences with tidytext
Table 2: Sentence Counts
sentences|>count(party)|>arrange(desc(n))|>mutate(party =if_else(party=="koalition","Koalitionsvertrag",str_c(toupper(party), " Wahlprogramm")))|>gt()|>gt::tab_header(title =md("**Sentence Count**"), subtitle ="by document")|>gt::cols_label(party ="Document", n ="Sentence Count")|>gt::tab_footnote(footnote =md("*Extracted with tabulapdf and tokenized into sentences with tidytext*"))|>gtExtras::gt_plt_bar( column =n, keep_column =T, color =paletteer::paletteer_d("LaCroixColoR::Pamplemousse")[5])
Embedding the sentences
At this point, we are going to leave the r-universe and move over to python, because most state-of-the-art natural language processing libraries are being developed in python. Since this document is written in quarto, this is not an issue, as we can easily move back and forth between the different languages and even transfer objects and data.frames with ease!
Sentence transformers are a model architecture that excels at representing semantic information of short text chunks (versus single-words). They are frequently used to find similar sentences among large collections of text, e.g. for search engines. They can also be used for paraphrase-mining or evaluating the textual similarity of two strings, taking into account the contextualized meaning of words (e.g. a theatre play being something different than a word play).
For a documentation on the library we will use, see the docs. For a more in-depth explanation of the SBERT architecture see here and here
We will embed each sentence from the documents and then find the most similiar sentences for each party.
Loading the model
To load the model, we use the SentenceTransformer Class and a model that is optimized for multilingual analysis, since our texts are in German, not in English.
from sentence_transformers import SentenceTransformer, utilimport pandas as pdfrom nltk.tokenize import sent_tokenizemodel = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
Encoding text
Now we encode the coalition text, using the .encode method.
We transform these to a tensor. It then has a rectangular shape, where each sentence of the coalition agreement receives a numeric score on the 384 dimensions of the embedding vector from the model.
embedding_coal.shape## torch.Size([3257, 768])
On dimension 1, sentence 106 (für einen echten innovationsschub müssen wir ausgründungen vorantreiben.) receives a value of 0.091, whereas sentence 200 (wir werden hochschulen mittel des bundes zur schaffung einer gründungsinfrastruktur für technologisches wie soziales unternehmertum bereitstellen.) receives a value of 0.0505. Since these embeddings and dimensions are fully computationally determined, we have no simple way of mapping this back to meaningful semantic dimensions, although there have been attempts at doing so.
When comparing the angle between these vectors in their multidimensional embedding space, the larger the value the more similar they are.
We can do this using the simple util.pytorch_cos_sim function.
für einen echten innovationsschub müssen wir ausgründungen vorantreiben.
and
wir werden hochschulen mittel des bundes zur schaffung einer gründungsinfrastruktur für technologisches wie soziales unternehmertum bereitstellen.
are evaluated as relatively similar (cosine similarity of 0.7142), even though they are textually quite different. A simple metric, like the Jaccard difference might have underestimated the shared common semantic meaning between “Ausgründungen” and “Gründungsinfrastruktur”.
für einen echten innovationsschub müssen wir ausgründungen vorantreiben.
and
7 eine starke demokratie lebt von den menschen, die sie tragen.
yield a mnuch lower cosine similarity of 0.2328, as these are semantically quite different.
Encoding all texts and comparing them
In the following code-chunk, we will repeat the embedding process for each collection of party sentences, then create a matrix with a pair-wise cosine similarity calculation of each coalition sentence with each party sentence. We will keep only the most similar party sentence for each coalition agreement sentence.
from tqdm import tqdmimport pandas as pdfrom sentence_transformers import utildata = pd.DataFrame()# Assuming `coalition_text` and `embedding_coal` are precomputedfor party in tqdm(['fdp', 'spd', 'gruene'], desc="Processing Parties"): party_text = r.sentences[r.sentences['party'] == party].reset_index(drop=True).text embedding_party = model.encode(party_text, convert_to_tensor=True, show_progress_bar=True)# Compute cosine similarity matrix cosine_scores = util.pytorch_cos_sim(embedding_party, embedding_coal).cpu().numpy()# Create a DataFrame for all combinations using vectorized operations rows, cols = cosine_scores.shape results = pd.DataFrame({'party': party,'sentence': party_text.repeat(cols),'sentence_coal': list(coalition_text) * rows,'score': cosine_scores.flatten() })# Keep top match for each sentence in party_text top_pair = results.sort_values('score', ascending=False).groupby(['party', 'sentence_coal']).head(1) data = pd.concat([data, top_pair], ignore_index=True)
alleinerziehende, die heute am stärksten von armut betroffen sind, entlasten wir mit einer steuergutschrift.
alleinerziehende, die heute am stärksten von armut betroffen sind, entlasten wir mit einer steuergutschrift.
1.0000
GRUENE
sichere und leistungsfähige datenverarbeitung, kombiniert mit mobiler it und klar geregelten kompetenzen, ist dabei eine grundvoraussetzung moderner polizeiarbeit.
sichere und leistungsfähige datenverarbeitung, kombiniert mit mobiler it und klar geregelten kompetenzen, sind grundvoraussetzung moderner polizeiarbeit.
0.9984
GRUENE
wir setzen uns dafür ein, die rechte von minderheiten auf internationaler ebene zu stärken – auch innerhalb der eu.
wir wollen die rechte von minderheiten auf internationaler ebene und insbesondere innerhalb der eu stärken.
0.9902
SPD
die prinzipien offenen regierungshandelns - transparenz, partizipation und zusammenarbeit sind für uns handlungsleitend.
uns leiten die prinzipien offenen regierungshandelns – transparenz, partizipation und zusammenarbeit.
0.9855
GRUENE
die anhaltende bedrohung des staates israel und seiner souveränität in seiner nachbarschaft und den terror gegen seine bevölkerung verurteilen wir.
die anhaltende bedrohung des staates israel und den terror gegen seine bevölkerung verurteilen wir.
0.9851
SPD
einseitige schritte von allen seiten erschweren die friedensbemühungen und müssen unterbleiben.
einseitige schritte erschweren die friedensbemühungen und müssen unterbleiben.
0.9842
FDP
im mittelpunkt stehen dabei der abbau von bürokratiepflichten, die prüfung von investitionsbezuschussungen für produktionsstätten, sowie die prüfung von zuschüssen zur gewährung der versorgungssicherheit.
dazu gehören der abbau von bürokratie, die prüfung von investitionsbezuschussungen für produktionsstätten, sowie die prüfung von zuschüssen zur gewährung der versorgungssicherheit.
0.9807
SPD
wir werden die nationalen aktionspläne im rahmen der open government partnership deutschlands umsetzen und weiterentwickeln.
wir wollen die nationalen aktionspläne im rahmen der open-government- partnership (ogp) deutschlands umsetzen und weiterentwickeln.
0.9765
FDP
wir setzen uns für einen leistungsstarken europäischen bankenmarkt ein, der durch wettbewerb und vielfalt der geschäftsmodelle geprägt ist.
wir setzen uns für einen leistungsstarken europäischen bankensowie kapitalmarkt ein, der durch wettbewerb und vielfalt der geschäftsmodelle geprägt ist.
0.9759
GRUENE
ziel sind 30 prozent ökolandbau bis 2030.
wir wollen 30 prozent ökolandbau bis zum jahr 2030 erreichen.
0.9732
Excluding sentences shorter than 20 characters.
Table 4: Most similar Sentences
Looking at the most similar sentences, this has worked remarkably well, the model has identified almost identical sentences that seem to have come from individual party programs.
What is clearly visible from this extract is also one large caveat about what we can infer from these metrics. Just because a sentence was copied almost identically from a party program does not mean that the sentence has a high impact on policies.
For example the sentence: “die prinzipien offenen regierungshandelns - transparenz, partizipation und zusammenarbeit sind für uns handlungsleitend.” - the principles of open government action - transparency, participation and cooperation are our guiding principles - is not policy related. This related to the above mentioned caveats. So we need to take care in the conclusions we draw from the following results.
Measuring party contributions
To systematically evaluate the impact of different party programs on the coalition agreement, I will take two simple metrics.
how many sentences reach an arbitrary high threshold of similarity (0.8 and 0.9 respectively)?
how often does a party have the most similar sentence in a party program in a pairwise comparison with the coalition agreement
In the following table, we can see that for very similar sentences with a similarity of higher than .9, all parties contributed an almost equal amount of sentences, about 30. If we lower this threshold, we can see that the greens contribute a significantly higher share of similar sentences, almost twice as many as the other coalition partners. This might be due to the party program of the greens being much longer and hence possible accounting for many more variations of similar policy areas or even covering policy areas much more exhaustively.
If we look at the amount of times the Greens contributed the most fitting sentence for a given sentence in the coalition agreement, the difference is even more staggering. For more than 53% of sentences, the greens are the program that is most similar, with SPD and FDP contributing an almost equal share of the remaining ~ 46%.
Of a total of 3193 sentences in the coalition agreement. Excluding sentences shorter than 20 characters.
Table 6: Highscoring Party Contributions
clean_data|>group_by(sentence_coal)|>mutate(score_over_90=score>=.9, score_over_80=score>=.8, highest_score_sentence =score==max(score))|>group_by(party)|>summarise(score_over_90 =sum(score_over_90), score_over_80 =sum(score_over_80), highest_score_sentence =sum(highest_score_sentence))|>gt::gt()|>gt::cols_label( party =md("**Party**"), score_over_80 =md("*.8*"), score_over_90 =md("*.9*"), highest_score_sentence =md("**Highest score across parties**"))|>gt::tab_header(title =md("**Party High Scores**"), subtitle ="by cosine similarity")|>tab_footnote(md(glue("*Of a total of {nrow(clean_data)/3} sentences in the coalition agreement. Excluding sentences shorter than 20 characters.*")))|>tab_spanner(label =md("**Similarity >=**"), columns =c("score_over_80", "score_over_90"))
If we look at this through the lens of the distribution of similarity scores this picture becomes more nuanced. It is clearly visible that the greens have a higher median score, in 50% of the cases between 0.7 and 0.8 round about, but it becomes clear that the most similar sentence might not always be the most similar by far or the decision might be far from clear.
If we want to validate this a bit further it makes sense to look at two groups of sentences.
Firstly, a group where the variance of scores in between the most similar sentence for each party is high, hence we would explain a clear winner of one party program that contributed - a sort of best case scenario.
Secondly, a group where the variance is very little, to see how well our sentence similarity scores for cases where all three party programs are all related to the coalition agreement to a similar degree.
Best-case scenario validation
For the first case, we look at three sentences from the coalition agreement, where the variance is high. In two, the greens win by far, because they simply contributed this specific sentence which does not even appear remotely related to any sentence in the other party programs (in dark green). In the third case, related to a minimum wage, the SPD takes the crown by a small margin (in light red). This is actually not really true. Qualitatively the sentence of the greens is more similar, as it simply states the raising of a minimum wage to 12€ (although it mentions to so immediately, which is absent from the coalition agreement), the SPD goes much further to imply that they want to raise it beyond 12€ eventually, which in the wording of the coalition agreement is clearly ruled out (one-time raise). Hence I would qualitatively say the greens are closer, however this is an edge case where both parties have clearly contributed to the agreement.
wir treiben die un-dekade für menschen afrikanischer herkunft voran (z.
machen wir den weg frei für diese entwicklung!
0.6156
0.04478
SPD
wir treiben die un-dekade für menschen afrikanischer herkunft voran (z.
die partnerschaft zwischen europa und afrika wollen wir politisch und wirtschaftlich deutlich ausbauen und auf ein neues level der zusammenarbeit heben.
0.5927
0.04478
GRUENE
wir treiben die un-dekade für menschen afrikanischer herkunft voran (z.
deshalb wollen wir die un-dekade für menschen afrikanischer herkunft vorantreiben.
0.9701
0.04478
FDP
wir wollen 30 prozent ökolandbau bis zum jahr 2030 erreichen.
wir brauchen land- und forstwirtschaft, die nachhaltig ist und flächen, die zusätzliche beiträge zum naturschutz leisten.
0.5922
0.04361
SPD
wir wollen 30 prozent ökolandbau bis zum jahr 2030 erreichen.
dementsprechend werden wir im einklang mit den europäischen klimazielen unser minderungsziel für 2030 deutlich (auf 65 %) anheben; auch für 2040 werden wir ein minderungsziel festschreiben (88 %).
0.6347
0.04361
GRUENE
wir wollen 30 prozent ökolandbau bis zum jahr 2030 erreichen.
ziel sind 30 prozent ökolandbau bis 2030.
0.9732
0.04361
FDP
mindestlohn wir werden den gesetzlichen mindestlohn in einer einmaligen anpassung auf zwölf euro pro stunde erhöhen.
die mindest- und maximalbeträge wollen wir erhöhen, auch als inflationsausgleich.
0.5519
0.03752
SPD
mindestlohn wir werden den gesetzlichen mindestlohn in einer einmaligen anpassung auf zwölf euro pro stunde erhöhen.
wir werden den gesetzlichen mindestlohn zunächst auf mindestens zwölf euro erhöhen und die spielräume der mindestlohnkommission für künftige erhöhungen ausweiten.
0.8980
0.03752
GRUENE
mindestlohn wir werden den gesetzlichen mindestlohn in einer einmaligen anpassung auf zwölf euro pro stunde erhöhen.
den gesetzlichen mindestlohn werden wir sofort auf 12 euro anheben.
0.8756
0.03752
Excluding sentences shorter than 20 characters.
Table 8: Highest variance in scores
clean_data|>group_by(sentence_coal)|>relocate(sentence_coal, .after =party)|>mutate(variance =var(score))|>ungroup()|>slice_max(order_by =variance, n =7)|>gt::gt()|>gt::cols_label( party =md("**Party**"), sentence_coal =md("*Coalition \nAgreement*"), sentence =md("*Party \nProgram*"), score =md("**Cosine \nSimilarity**"), variance =md("**Variance**"),)|>gt::tab_header(title =md("**Highest variance in scores**"), subtitle ="by variance of cosine similarity")|>tab_footnote(md("*Excluding sentences shorter than 20 characters.*"))|>tab_spanner(label =md("**Texts**"), columns =c("sentence_coal", "sentence"))|>gt_highlight_rows(rows =c(3,6), fill =paletteer::paletteer_d("lisa::FridaKahlo")[2])|>gt_highlight_rows(rows =c(8), fill =paletteer::paletteer_d("lisa::FridaKahlo")[5])|>data_color(columns ="sentence_coal", palette =paletteer_d("lisa::C_M_Coolidge"))
Worst-case scenario validation
If we look at the sentences with the lowest variance, we would expect the estimation to perform worse. When there is three sentences that are equally similar, it might not be possibly to predict a perfect match, the resulting sentence might be a mix of different sentences or be completely void of content or unrelated to any of the party programs. This is the case in the below three cases.
entsprechend wird die befüllung eines sondervermögens als abfluss aus dem kernhaushalt den verschuldungsspielraum reduzieren.
ebenso kann es sinnvoll sein, künftig stärker mit von der steuerschuld abzuziehenden - steuergutschriften zu arbeiten.
0.7276
4.171e-07
SPD
entsprechend wird die befüllung eines sondervermögens als abfluss aus dem kernhaushalt den verschuldungsspielraum reduzieren.
durch ausgeweitete vorsorgende beratungsmöglichkeiten soll der weg in die überschuldung am besten von vornherein vermieden werden.
0.7269
4.171e-07
GRUENE
entsprechend wird die befüllung eines sondervermögens als abfluss aus dem kernhaushalt den verschuldungsspielraum reduzieren.
überhöhte dispozinsen und gebühren, insbesondere für das basiskonto, werden wir begrenzen.
0.7263
4.171e-07
FDP
ihre leitung wird vom bundestag gewählt.
das parlament kann ihr oder ihm durch die mehrheit seiner mitglieder das misstrauen aussprechen und eine andere person zum kommissionspräsidenten wählen.
0.6308
2.785e-06
SPD
ihre leitung wird vom bundestag gewählt.
für bestehende ehen werden wir zudem ein wahlrecht einführen.
0.6336
2.785e-06
GRUENE
ihre leitung wird vom bundestag gewählt.
das gilt erst recht für diese bundestagswahl am 26.
0.6337
2.785e-06
FDP
die großen herausforderungen unserer zeit lassen sich nur in internationaler kooperation und gemeinsam in einer starken europäischen union bewältigen.
zukunftsfähige und starke europäische union europa muss bereit sein, die großen herausforderungen unserer zeit zu bewältigen – die folgen der coronapandemie, den klimawandel, terrorismus und migration.
0.8151
3.400e-06
SPD
die großen herausforderungen unserer zeit lassen sich nur in internationaler kooperation und gemeinsam in einer starken europäischen union bewältigen.
die europäische zusammenarbeit werden wir ausbauen.
0.8145
3.400e-06
GRUENE
die großen herausforderungen unserer zeit lassen sich nur in internationaler kooperation und gemeinsam in einer starken europäischen union bewältigen.
soweit europäische einigungen nicht gelingen, gehen wir voran, in verstärkter zusammenarbeit oder gemeinsam mit einzelnen staaten.
0.8179
3.400e-06
Excluding sentences shorter than 20 characters.
Table 10: Lowest variance in scores
clean_data|>group_by(sentence_coal)|>mutate(variance =var(score))|>ungroup()|>slice_min(order_by =variance, n =7)|>gt::gt()|>gt::cols_label( party =md("**Party**"), sentence_coal =md("*Coalition \nAgreement*"), sentence =md("*Party \nProgram*"), score =md("**Cosine \nSimilarity**"), variance =md("**Variance**"),)|>gt::tab_header(title =md("**Lowest variance in scores**"), subtitle ="by variance of cosine similarity")|>tab_footnote(md("*Excluding sentences shorter than 20 characters.*"))|>tab_spanner(label =md("**Texts**"), columns =c("sentence_coal", "sentence"))|>data_color(columns ="sentence_coal", palette =paletteer_d("lisa::C_M_Coolidge"))
Hence we need to draw the conclusion that our metric of looking at a winner across the three parties probably overestimates the contribution of the greens by far, by artificially introducing a sort of winner takes it all mechanism that benefits longer party programs and larger lexical diversity.
Conclusions
We have succesfully read out the coalition agreement and party programs. We also managed to properly embed the sentence with an appropriate library and detect very similar sentences that were certainly or highly likely copied from one of the party programs.
Our initial metric for establishing which party wrote how much of the coalition agreement probably over-estimates the share of the greens.
Nonetheless, we can say that the greens have, even though they were only the second largest coalition partner, greatly contributed to the wording of the coalition agreement.
This raises a number of interesting follow-up questions about the determinants of this contribution to the coalition agreement. Do we expect parties with a larger vote-share to contribute more to the agreement? How could we measure the importance of different wordings in terms of their policy relevance? Can parties with a more narrow issue focus on specific areas dominate specific topics? What about king-maker parties that make or break a coalition?
In a follow-up blog entry, I will look at the thematic issues at stake and try to extend this analysis with an automated classification based on pre-trained models.
References
Chapman, Brian. 1955. “Political Parties: Their Organization and Activity in the Modern State.”International Affairs 31 (2): 208–8. https://doi.org/10.2307/2604342.
Steffani, Winfried. 1979. Parlamentarische Und Präsidentielle Demokratie: Strukturelle Aspekte Westlicher Demokratien. VS Verlag für Sozialwissenschaften. https://doi.org/10.1007/978-3-663-14351-2.
Citation
BibTeX citation:
@online{bochtler2024,
author = {Bochtler, Paul},
title = {Part {I:} {Who} Wrote the Last Coalition Agreement in
{Germany?}},
date = {2024-11-30},
url = {https://www.paulbochtler.de/blog/2024/03/},
langid = {en}
}