It was raining buckets when Jeremy Bowers and Nicholas Diakopoulos met at Big Bear Cafe in Washington, D.C., for lunch to discuss a potential partnership between The Washington Post and Northwestern University. Initially, Diakopoulos, a professor at Northwestern and the director of its Computational Journalism Lab, had reached out in hopes of getting a tour of The Post’s newsroom during his time in D.C., but when he revealed that he did not have a project lined up yet for his upcoming sabbatical, the two began to plan.
“We went and had lunch, and over an hour and a half, we napkin-sketched out what it would look like for Nick to spend a sabbatical at The Post,” Bowers, an Engineering Director at The Post, says. “The thing that we came up with was this computational journalism lab that was [focused on] politics, because we knew there was going to be a ton of work in that space in preparation for the election.”
What they cooked up was a four-month project to make complex election data more accessible and useful to reporters, to begin in September 2019 with a deadline of December 2019 — when Diakopoulos would need to return to Illinois to teach.
Helping Reporters Spot Shifts in the Electorate
At the start of the project, Diakopoulos set up a temporary workspace at The Post, along with engineering intern Madison Dong, data scientist Lenny Bronner, and Bowers.
“After doing a bunch of interviews in the newsroom, [Diakopoulos] figured out that reporters wanted more information about…where voters had newly registered, and all these intricacies of the electorate,” Bowers explains. “But they also are really turned off by charts, maps, and things like that. They’re pretty complex, and they take time to sit and read. … [Diakopoulos] figured out that what reporters really wanted was a bite-sized snippet of text that summarized something for them into two or three sentences.”
This need was filled by the Lead Locator — a tool Diakopoulos, Bowers, and their team built using a national voter file, which Bowers describes as “a gigantic spreadsheet that has a single row for every voter in America, and then 600 columns [of] demographic information [per row].”
Every 30 days, the Lead Locator takes in new data from every county in the US, tracking changes in voter registration. Then, using “lightweight machine learning” and natural-language generation software written in Python, the program produces short blurbs on what it has determined to be the most interesting counties. For example, Bowers says, it may create a blurb that says: “This county has seen a 35 percent increase in Hispanic voter registration over the last 30 days. This is higher than the state as a whole, and much higher than the national average.”
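The idea can be illustrated with a rough sketch, which is not The Post’s actual code. It compares a county’s 30-day change in registrations with state and national figures, then fills in a sentence template like the one Bowers quotes. Everything here is an assumption made up for illustration: the record fields, the function names, the 10-point threshold for “much higher,” and the county-selection step, where a plain sort by size of change stands in for the “lightweight machine learning” the article mentions.

```python
from dataclasses import dataclass


@dataclass
class CountyChange:
    """Hypothetical record: registrations for one voter group in one county."""
    county: str
    state: str
    before: int  # registrations 30 days ago
    after: int   # registrations now


def pct_change(before: int, after: int) -> float:
    """Percent change from before to after; 0.0 when there is no baseline."""
    if before == 0:
        return 0.0
    return 100.0 * (after - before) / before


def compare(value: float, baseline: float) -> str:
    """Describe value relative to baseline. The 10-point cutoff is arbitrary."""
    diff = value - baseline
    if diff > 10:
        return "much higher than"
    if diff > 0:
        return "higher than"
    if diff < -10:
        return "much lower than"
    if diff < 0:
        return "lower than"
    return "the same as"


def blurb(row: CountyChange, state_pct: float, national_pct: float, group: str) -> str:
    """Fill a fixed sentence template with one county's numbers."""
    pct = pct_change(row.before, row.after)
    direction = "increase" if pct >= 0 else "decrease"
    return (
        f"{row.county} has seen a {abs(pct):.0f} percent {direction} in {group} "
        f"voter registration over the last 30 days. This is "
        f"{compare(pct, state_pct)} the state as a whole, and "
        f"{compare(pct, national_pct)} the national average."
    )


def top_blurbs(rows, state_pcts, national_pct, group, n=1):
    """Rank counties by size of change (a stand-in for the real model)
    and return blurbs for the top n."""
    ranked = sorted(
        rows, key=lambda r: abs(pct_change(r.before, r.after)), reverse=True
    )
    return [blurb(r, state_pcts[r.state], national_pct, group) for r in ranked[:n]]
```

With invented inputs such as `CountyChange("Example County", "TX", 1000, 1350)`, a state-level change of 30 percent, and a national change of 5 percent, `top_blurbs` would produce a sentence of the same shape as the one Bowers describes.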
Not only are these data-driven sentences fact-checked and usable in a story, but they also influence where reporters decide to interview residents, ultimately leading to new kinds of coverage in counties that might otherwise have been ignored.
Taking on a Project with a Reasonable Scope
What’s important about this tool isn’t its technological complexity, Bowers says, but the fact that the scope of the project was perfect for The Post and Diakopoulos to handle in just a few months.
“Technically speaking, any graduate student could build something that is much more sophisticated,” he says. “It’s just sophisticated enough that we never would have built it by ourselves. And it’s just interesting enough as an academic challenge that it appeals to [Diakopoulos]. … It was also something that we could build without it going wildly wrong. If you build some big piece of machine learning…you can take months, or even years, to really refine it into something that’s useful. We had four months.”
Using Data to Create an Accurate Election Map
At the same time that Diakopoulos and his team were creating the Lead Locator, Bowers was also in the final months of testing election mapping software that had been in the works since November 2019. Called the Expected Votes Model, this tool helped visualize which way states were expected to fall on election night, using both the votes that had already been counted and those that had yet to be.
“There’s this problem in elections,” Bowers explains. “The [votes] that come in early-on might be more Democratic or more Republican than all the rest of the votes that are like outstanding. So, you can get this false narrative where it looks like a candidate has pulled ahead.”
In the 2020 Presidential Election, the phenomenon that he is referring to was called “the Red Mirage” — the appearance that incumbent President Donald Trump was winning the election, due to the fact that in-person votes (which leaned Republican) were counted before mail-in ballots (which heavily favored Democrat Joe Biden).
The Expected Votes Model tool combats this problem by visualizing how many of the uncounted votes are expected to be either Republican or Democratic based on voter history data, similar to that of the Lead Locator. So, instead of just coloring a state red or blue on the site’s virtual election map on November 3, The Post went a step further. Below the election map, there were a few bars. One showed Trump’s total counted votes, and his expected outstanding votes based on the tool’s analysis of voter history data. The other showed the same information for Biden.
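The bars described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not The Post’s model: the estimated number of outstanding votes and each candidate’s likely share of them are simply passed in as inputs (the real tool derives its expectations from voter history data), and the names `CandidateBar`, `build_bar`, and `ranges_overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateBar:
    """Hypothetical bar: counted votes plus a range of expected outstanding votes."""
    name: str
    counted: int
    expected_low: int
    expected_high: int

    @property
    def total_low(self) -> int:
        return self.counted + self.expected_low

    @property
    def total_high(self) -> int:
        return self.counted + self.expected_high


def build_bar(name: str, counted: int, outstanding: int,
              share_low: float, share_high: float) -> CandidateBar:
    """Turn an assumed share range of the outstanding vote into a bar."""
    if not 0.0 <= share_low <= share_high <= 1.0:
        raise ValueError("shares must satisfy 0 <= low <= high <= 1")
    return CandidateBar(
        name=name,
        counted=counted,
        expected_low=round(outstanding * share_low),
        expected_high=round(outstanding * share_high),
    )


def ranges_overlap(a: CandidateBar, b: CandidateBar) -> bool:
    """True when the two projected totals could still cross."""
    return a.total_low <= b.total_high and b.total_low <= a.total_high


# Made-up numbers: A leads the counted vote, but B is expected to win
# most of the 1,000 ballots still outstanding.
bar_a = build_bar("Candidate A", counted=5000, outstanding=1000,
                  share_low=0.30, share_high=0.35)
bar_b = build_bar("Candidate B", counted=4800, outstanding=1000,
                  share_low=0.65, share_high=0.70)
```

In this invented example, Candidate A is ahead on counted votes (5,000 to 4,800), yet the projected totals are 5,300 to 5,350 for A and 5,450 to 5,500 for B. The ranges do not overlap, which is the kind of situation where an early lead turns out to be misleading.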
In hotly contested states like Pennsylvania, these bars illustrating the level of uncertainty actually made calling the state easier. While people well versed in politics may have been able to hypothesize that mail-in Democratic votes would outweigh the in-person count, it wasn’t necessarily clear to all readers.
“Donald Trump had a very small number of possible outcomes, a really narrow window of how many votes he was expected to get in Pennsylvania. Biden’s [margin of uncounted votes] was much wider,” Bowers says. The Expected Votes Model, he continues, “doesn’t just help us confirm or deny our suspicions… It also puts this really quantifiable number on them, and it shows us a range of outcomes and says, ‘Hey, this is what that range of outcomes looks like.’”
Bowers’s team ran this software behind the scenes for every election in the United States between November 2019 and November 2020. So by election night this year, it had been tested dozens of times and had proven to be an accurate way to call states and predict elections. On November 3, the team finally pulled back the curtain and published the tool online, and “it totally worked,” Bowers says. The tool continues to be live for elections, including this month’s highly anticipated Georgia Senate runoff election, and The Post’s engineering team has even made it public by sharing open-source code.
“This is the equivalent of our moonshot,” he says of the election model’s success. “It’s like landing people right smack dab in the middle of the crater.”