The cost of data privacy

The other day I was talking to a friend of mine, a senior medical doctor at a research hospital. We were discussing clinical trials and how the recent staff shortages in the US made it difficult to start new clinical trials there. He mentioned offhand that clinical trials have been difficult to do in Europe for a few years now because of GDPR.

The European Union’s General Data Protection Regulation, or GDPR, is a regulation on data protection and privacy. It provides people with rights related to their data including, for example, the right to ask companies for data they collect about an individual (Article 15). GDPR is implemented in countries of the European Union, members of the European Economic Area and other countries which chose to implement it. The latter group includes Andorra, Argentina, Canada (only commercial organizations), Faroe Islands, Guernsey, Israel, Isle of Man, Jersey, New Zealand, Switzerland, Uruguay, Japan, the United Kingdom and South Korea.

The flip side of GDPR is that, for both companies and other organizations, it’s much harder to collect and process data. This may be a good idea when we’re thinking about companies which use these data to sell us more stuff, but it may be that these regulations have a less than beneficial effect for medical science. I wanted to see if there’s evidence for the latter.

In recent years medical researchers have begun registering their clinical trials on the US government’s website. This can help patients find relevant trials, improve recruitment, and also reduce the likelihood of cheating (see Ben Goldacre’s wonderful talk on this subject). I took these data and extracted from them the country where each clinical trial is held (some clinical trials are held in multiple countries and I accounted for those) and the date at which it was first registered.

The figure below shows the number of clinical trials registered each month between January 2010 and July 2021. I divided the countries where the trials were held into three groups: the United States, countries where GDPR was implemented, and all other countries of the world.

Number of new clinical trials per month in the United States (top), GDPR-implementing countries (middle) and other countries (bottom). Light colors are before May 2018 and dark ones after it. Dotted lines are linear fits to the curves, with the slopes and fit shown below them.

In the graph the light colors are the number of clinical trials per month prior to the implementation of GDPR and the dark colors are the same numbers after it. I’ve also fit linear regression curves to each of these. As one can see, the number of clinical trials up to May 2018, when GDPR was implemented, rises slowly. Interestingly, it rises more slowly in the US than in the other two groups.
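For readers who want to reproduce this kind of before/after comparison, here is a minimal sketch of fitting one straight line to the months before a cutoff and another to the months after it. It is not the code behind the figure: the monthly counts are made up, the function names are my own, and the cutoff of 25 May 2018 is the date GDPR took effect.

```python
from datetime import date

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def slopes_around_cutoff(monthly_counts, cutoff):
    """monthly_counts: list of (date, count) pairs, one per month.
    Returns the fitted slopes (change in trials per month) before and after cutoff."""
    ordered = sorted(monthly_counts)
    before = [(i, c) for i, (d, c) in enumerate(ordered) if d < cutoff]
    after = [(i, c) for i, (d, c) in enumerate(ordered) if d >= cutoff]
    slope_before, _ = fit_line(*zip(*before))
    slope_after, _ = fit_line(*zip(*after))
    return slope_before, slope_after

# Hypothetical data: growth of 2 trials per month, then a plateau.
GDPR_START = date(2018, 5, 25)
months = [date(2017 + m // 12, m % 12 + 1, 1) for m in range(24)]
counts = [100 + 2 * i if d < GDPR_START else 132 for i, d in enumerate(months)]
slope_before, slope_after = slopes_around_cutoff(list(zip(months, counts)), GDPR_START)
print(slope_before, slope_after)
```

Fitting the two periods separately keeps the comparison simple, at the price of ignoring any jump in level at the cutoff itself.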

After May 2018 the rise in the number of clinical trials in the US and in countries where GDPR was implemented abruptly stops and flattens. However, in countries which did not implement GDPR (and are not the US), the pace of growth rises dramatically and accounts for the expected growth in both this group of countries and most of what we would have expected in the previous two groups. It seems as though GDPR put a brake on clinical trials in countries where it was implemented, as well as in the United States.

Which countries benefited from this move from the US and GDPR-implementing countries? To test this, I computed, for each country, the fraction of its trials in the registry that were registered after the implementation of GDPR. I only looked at countries which had at least 500 clinical trials in the data. The 5 countries which had the largest fraction of trials post-GDPR are Pakistan, Egypt, Turkey, Indonesia, and China. Unfortunately, these countries are not bastions of human rights. According to Freedom House they are judged either “Not free” or “Partly free”.
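The per-country computation can be sketched roughly as follows. This is an illustration under my own assumptions, not the original analysis: the record layout, the function name and the toy registry are hypothetical, and a multi-country trial is simply counted once for each of its countries.

```python
from collections import defaultdict
from datetime import date

def post_cutoff_fraction(trials, cutoff, min_trials=500):
    """trials: iterable of (registration_date, [countries]) records.
    Returns (country, fraction registered on or after cutoff), highest first,
    keeping only countries with at least min_trials trials."""
    total = defaultdict(int)
    after = defaultdict(int)
    for registered, countries in trials:
        for country in set(countries):  # count each country once per trial
            total[country] += 1
            if registered >= cutoff:
                after[country] += 1
    fractions = {c: after[c] / n for c, n in total.items() if n >= min_trials}
    return sorted(fractions.items(), key=lambda item: item[1], reverse=True)

# Tiny made-up registry; min_trials is lowered so the toy data passes the filter.
toy = [
    (date(2017, 3, 1), ["A", "B"]),
    (date(2019, 6, 1), ["A"]),
    (date(2020, 1, 1), ["A", "C"]),
    (date(2016, 9, 1), ["B"]),
]
ranking = post_cutoff_fraction(toy, date(2018, 5, 25), min_trials=2)
print(ranking)
```

The minimum-count filter matters: without it, a country with a handful of recent trials would top the ranking.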

Thus, it seems that one of the negative aspects of GDPR was the movement of clinical trials from countries which implemented it to those which did not. Whether this is a price worth paying is a personal judgment. To me, it seems that GDPR must be changed so that studies which improve the lives of people should be able to continue even at minimal cost to data privacy.

The current state of things reminds me of a story, possibly apocryphal, told to me by a lecturer during my graduate studies: A colleague of my lecturer who was a pain researcher from one of the industrialized countries took his sabbatical in Libya. This was, I think, in the late 1980s. My lecturer said that he asked the researcher, “why Libya?”. The reply was “it’s easier to do work there”…

Let’s not have GDPR cause medical research to move to countries which don’t take human rights seriously.

Caveat: I know there may be confounders that appeared at similar times. This isn’t a scientific paper, so take my explanations above with a grain of salt.

COVID19 vaccines and Ivermectin: The strange story of trust, politics, and media sensationalism

Over the past couple of years we’ve seen several unusual (to say the least) methods for treating COVID19, ranging from anti-malaria drugs to Yoga. Some of us may recall bleach as another idea suggested by then-president Trump, but he didn’t actually suggest it.

One of the more recent ideas was to repurpose an anti-parasitic medication, Ivermectin, to treat COVID19. This drug is licensed for use in both humans and livestock, leading to the derogatory “cow dewormer” moniker. The evidence for effectiveness of this drug came initially from lab studies, but doses were far greater than approved for human use.

Several randomized controlled trials followed, with the most recent meta-analysis finding an interesting outcome: Studies in some countries outside the US found the drug to be effective, while those conducted in the US did not. It may be that in countries where parasitic infections are common, treating these infections helps people defeat COVID19, but the drug doesn’t help those who don’t have such infections.

Nevertheless, some media channels and politicians recommended using the drug, and if you believe recent media stories, many people decided to use ivermectin rather than choose the more effective solution and vaccinate against COVID19. It seems that overdoses of the drug became more common.

However, I wanted to see how many people were really interested in ivermectin compared to the vaccine.

As usual, I looked at Google Trends data (at the state level) for Ivermectin, Hydroxychloroquine, and the COVID19 vaccine. First, the volume of searches for ivermectin is negatively correlated with interest in the vaccine during 2021. However, there is no such relationship for hydroxychloroquine. In the graphs below the axes are search volumes.

Interest in ivermectin (horizontal axis) and the COVID19 vaccine (vertical axis) in different states, as measured through Google Trends search volume. The line is a linear regression curve.
Interest in hydroxychloroquine (horizontal axis) and the COVID19 vaccine (vertical axis) in different states, as measured through Google Trends search volume. The line is a linear regression curve.
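A state-level correlation like the one in these scatter plots can be computed along the following lines. This is only a sketch: the search volumes are invented numbers on the 0 to 100 scale that Google Trends uses, and the dictionaries and function name are my own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical normalized search volumes, keyed by state.
ivermectin = {"OK": 100, "TX": 80, "NY": 30, "CA": 25, "VT": 20}
vaccine = {"OK": 40, "TX": 50, "NY": 85, "CA": 90, "VT": 100}

# Only states present in both data sets can be compared.
states = sorted(ivermectin.keys() & vaccine.keys())
r = pearson([ivermectin[s] for s in states], [vaccine[s] for s in states])
print(round(r, 3))
```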

Second, interest in ivermectin is small compared to interest in the COVID19 vaccine, even in the states where it had the highest search volume. Below are figures for the entire USA and for Oklahoma.

Google Trends search volume in the USA for ivermectin and for the COVID19 vaccine.
Google Trends search volume in Oklahoma for ivermectin and for the COVID19 vaccine.

I tried to see if the voting results for the presidential elections in 2016, 2020 and the current governor of each state were a predictor of the search volume for the vaccine or for ivermectin. The most predictive factor for the 2016 election results is interest in vaccination. The accuracy of the prediction is very high (Area Under the Receiver Operating Characteristic Curve, or AUC, of 0.91), meaning that more interest in the vaccine correlated with voting for a Democrat in the 2016 elections. Outcomes of the 2020 elections are much harder to predict using interest in the vaccine (AUC=0.64).

Interest in hydroxychloroquine doesn’t predict election results, but search volume for ivermectin, and even better the ratio of search volume for ivermectin to the volume for vaccine predicts the 2016 election results (AUC=0.86). Here higher ratios of ivermectin to vaccine searches predict a vote for Trump.
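For a single predictor, the AUC has a simple interpretation: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way. It is not necessarily how the numbers above were obtained (a fitted model may have been involved), and the ratios and outcomes are made up.

```python
def auc(scores, labels):
    """Probability that a randomly chosen positive example scores higher than
    a randomly chosen negative one (ties count as half a win)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical states: ivermectin-to-vaccine search ratio, and whether Trump won.
ratio = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
trump_won = [True, True, False, True, False, False]
score = auc(ratio, trump_won)
print(round(score, 3))
```

An AUC of 0.5 means the predictor is no better than a coin toss, and 1.0 means it separates the two groups perfectly.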

What do all these findings show?

To me, the most interesting finding is that support for former President Trump is a strong predictor of interest in ivermectin over vaccines. This is somewhat similar to my previous blog post, and to a study, about Israeli politics.

As an aside, it seems to me that the ivermectin story was somewhat overblown by the media. Actual interest (as measured in search engine data) was much lower.

Politics, income and COVID19 vaccines in Israel

The rate of COVID19 vaccination is strongly correlated with party affiliation. Specifically, the Kaiser Family Foundation found that people in counties that voted for Biden during the last elections had significantly higher rates of COVID19 vaccination compared to those who voted for Trump.

I wondered if the same was true for Israel. So, I downloaded the vaccination rates in 253 towns and cities (as of August 10th) from the Israeli Health Ministry, the voting data from our last elections from the Israeli Central Elections Committee, and income data from the Israeli National Insurance Bureau. The last source also gave me the Gini index for each location.

I manually labelled the towns and cities as to whether they were predominantly Jewish or not. I also computed the percent of voters in each location who voted for the current coalition government.

Here are a few results. First, in predominantly Jewish towns and cities vaccination rates are strongly correlated with income, but even more strongly (and statistically significantly more so) with voting for the current government.

Vaccination rates as a function of income in Jewish cities
Vaccination rates as a function of the percentage of people who voted for the current coalition in Jewish cities

In predominantly non-Jewish cities the picture is more complicated. First, the correlation is much lower than the one we observed in Jewish cities. More interestingly, while income is still correlated with vaccination rates, voting for the current government is negatively correlated with vaccination rates.

Vaccination rates as a function of income in non-Jewish cities
Vaccination rates as a function of the percentage of people who voted for the current coalition in non-Jewish cities

A linear model of the data (with interactions) bears this out:

The model for Jewish towns reaches R2 of 0.67, which is extremely high. The statistically significant variables are vote for the coalition (positively correlated), Gini index (negatively correlated), and the interaction of income with the Gini index (positive) and with income (negative). Therefore, cities that voted for the government and had less inequality were more likely to vaccinate.

The model for non-Jewish towns reaches a lower R2 of 0.46. Here the statistically significant variables are vote for the coalition (negatively correlated) and the interaction of the Gini index with voting for the government (negatively correlated). This means that the most indicative variable for vaccination rate was not voting for the current government and, in cities that have more inequality and higher income this is even stronger.
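To make "a linear model with interactions" concrete, here is a small sketch that adds pairwise product terms to the features and fits the coefficients by least squares. It is an illustration only: the towns and coefficients are invented, it uses two features rather than the full set described above, and it reports no significance tests.

```python
from itertools import combinations

def add_interactions(rows):
    """Append every pairwise product x_i * x_j to each row of features."""
    return [list(r) + [r[i] * r[j] for i, j in combinations(range(len(r)), 2)]
            for r in rows]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up towns: (coalition vote share, Gini index). The outcome is generated
# from known coefficients so that the fit can be checked against them.
towns = [(0.2, 0.30), (0.5, 0.35), (0.7, 0.40), (0.4, 0.45), (0.9, 0.32), (0.6, 0.50)]
rates = [0.5 + 0.4 * v - 0.6 * g + 0.2 * v * g for v, g in towns]
coefs = fit_ols(add_interactions(towns), rates)  # [intercept, vote, gini, vote*gini]
print([round(c, 3) for c in coefs])
```

A non-zero coefficient on a product term means the effect of one variable depends on the level of the other, which is how statements like "in cities with more inequality this is even stronger" arise.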

My understanding from these results is that, in Israel as in the US, voting is correlated with vaccination rates. I don’t think, however, that one causes the other. Instead (at least in Israel) there is probably a third variable driving both. For example, the Arab party which joined the coalition is the Islamic party, whose voters tend to come from populations with lower income that live in areas with less access to healthcare. In the Jewish population, one of the main blocs not part of the current government is the Ultra Orthodox, who are also less likely to vaccinate. They are also poorer than the general population.

The bottom line? Vaccination rates in Israel are correlated with political affiliation, but perhaps for different reasons than those in the US.

COVID19: How are we doing?

Note: The following is somewhat different from my usual blog posts because it doesn’t involve internet data. It’s my analysis of publicly available health data which I did to answer a question I had.

The current phase of the COVID19 pandemic is affected by several trends which are driving the pandemic in opposing directions. On the one hand, the vaccination rate is high in many developed countries. On the other, new strains such as the Delta strain are more infective and the vaccines are thought to be less effective against these strains (even though they are still highly effective!).

What is the overall trend?

Scotland may be a good area to examine this question. On the one hand, at the time of writing 54% of the population has received two vaccine doses (73% have received at least one). On the other, since mid-May 2021 the Delta strain has been the dominant strain in Scotland.

Here is a plot of four indicators (source) of the pandemic: Number of daily positive cases, hospital admissions, ICU admissions and deaths. They are smoothed using a 7-day moving average. 

Four indicators of the COVID19 pandemic in Scotland

On average, hospital and ICU admissions correlate best with daily cases when the admissions are taken 7 days later (that is, it takes around a week until a case is hospitalized), and it takes another 7 days until deaths occur.
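One simple way to find such a lag is to shift one series against the other and keep the shift with the highest correlation. Below is a minimal sketch with synthetic data, in which admissions are constructed to follow cases by exactly seven days. The function names and numbers are mine, not taken from the Scottish data.

```python
from math import sqrt

def moving_average(values, window=7):
    """Trailing moving average; the first window - 1 points are dropped."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_lag(cases, outcome, max_lag=21):
    """Lag in days at which cases[t] correlates best with outcome[t + lag]."""
    def corr_at(lag):
        return pearson(cases[:len(cases) - lag], outcome[lag:])
    return max(range(max_lag + 1), key=corr_at)

# Synthetic series: admissions are 10% of the case count seven days earlier.
raw_cases = [100 + 50 * ((t % 30) / 30) ** 2 + 3 * (t % 7) for t in range(120)]
raw_admissions = [0.0] * 7 + [0.1 * c for c in raw_cases[:-7]]

# Smooth both series with the same 7-day window before comparing them.
lag = best_lag(moving_average(raw_cases), moving_average(raw_admissions))
print(lag)
```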

Therefore, I used the daily positive data to predict both hospital admissions and deaths at the appropriate lag (7 and 14 days). In both cases I used a non-linear model (second order polynomial to predict the quadratic root of the dependent variables) trained on data until the end of April 2021. The models had a good fit (R2=0.69 and 0.52, respectively).  
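A model of this general shape can be sketched as follows. I am reading the transformed target as the square root of the outcome, which is my assumption, and the data, function names and case units (thousands per day, chosen to keep the arithmetic well behaved) are all hypothetical.

```python
from math import sqrt

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_sqrt_quadratic(cases, outcome, lag):
    """Fit sqrt(outcome[t + lag]) = b0 + b1 * cases[t] + b2 * cases[t] ** 2."""
    xs = cases[:len(cases) - lag]
    targets = [sqrt(y) for y in outcome[lag:]]
    design = [[1.0, x, x * x] for x in xs]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(3)]
    return solve(xtx, xty)

def predict(coefs, cases_today):
    """Predicted outcome one lag ahead, squared back to the original scale."""
    b0, b1, b2 = coefs
    return (b0 + b1 * cases_today + b2 * cases_today ** 2) ** 2

# Synthetic training data: cases in thousands per day, and an outcome that
# follows the assumed model exactly with a 7-day delay.
cases = [0.2 + 0.1 * t for t in range(20)]
outcome = [1.0] * 7 + [(1 + 2 * x + 0.5 * x * x) ** 2 for x in cases[:-7]]
coefs = fit_sqrt_quadratic(cases, outcome, lag=7)
print(round(predict(coefs, 1.0), 2))
```

Training on an early period and then predicting a later one, as described above, shows how far reality has drifted from the old relationship between cases and outcomes.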

Here are the actual and predicted hospitalizations, compared to case numbers:

Daily positive cases, hospital admissions and predicted hospital admissions

As we can see, hospital admissions have been rising since mid-May, but not as fast as the prediction. We would expect around 170 people to be hospitalized at this point, but there are around 45. That’s around one quarter of the expected number.

A look at deaths is even more telling:

Daily positive cases, deaths and predicted deaths

Deaths have risen very slightly: We would have expected almost 40 per day at this stage, but are seeing around 2 (that’s one twentieth of the expected!).

My takeaway from this is that we will see a rise in hospitalizations and in deaths, but it will be much smaller than in previous waves of COVID19, especially in terms of deaths. The vaccines are providing significant protection against the worst aspects of COVID19.

Why does flu happen in winter? COVID19 could help us answer the question

There are reports of a Respiratory Syncytial Virus (RSV) outbreak in Israel. RSV is a virus which causes a flu-like illness and is especially dangerous for children. What’s strange about this outbreak is that it’s happening in early summer, whereas previously RSV outbreaks always happened in winter.

I was wondering if this is something special to RSV and to Israel or perhaps something bigger?

Luckily, a few years ago we looked at the association of search engine query volume and the incidence of RSV and found that it was quite high. Therefore, I extracted Google Trends data (using the Google Trends Anchor Bank toolbox) for RSV from the US, United Kingdom and Canada and plotted it below:

Query volume for RSV in Canada, United Kingdom and United States between May 2016 and June 2021. Note that Canada and UK are on a different axis from that of the US.

However, starting from April 2021 there is a dramatic rise of RSV in the US and UK, but not in Canada. Thus, Israel is similar to US and UK, but Canada seems an outlier.

Is there something special about RSV?

Here are the time series for several other seasonal viruses in the US:

Query volume for RSV, Rotavirus, Norovirus and Common cold in the United States between May 2016 and June 2021. Note that common cold is on a different axis from that of the other viruses.

Here we see similar correspondence, except for two outliers: First, common cold queries happened in the winter of 2021, but to a lesser extent. Second, RSV is rising, but so is norovirus, which started earlier and may already be on its way down.

Here is another virus, rabies, compared to RSV. Queries about rabies usually spike in summer, and in the summer of 2020 there was no spike. This year, however, they seem to be rising to normal levels. Note that it is unlikely that the search query volume for rabies represents rabies cases, as it does for RSV. Even though there is evidence for seasonality of rabies, in this case it probably reflects worry about rabies due to close contact with mammals.

Query volume for RSV and rabies in the United States between May 2016 and June 2021. Note that RSV is on the left axis and rabies on the right axis.

What’s happening here? Perhaps opening up for social gatherings in Israel, the US, and the UK has enabled RSV and other viruses to spike. We are looking into whether there is supporting evidence for this hypothesis.

These findings raise the interesting question of why RSV (and other viruses) occur in winter? Is it because of the colder weather which causes people to congregate indoors and perhaps constricts our airways? Is it because there is some level of immunity in the population which slowly decays over the year until, early in winter, it is low enough for an epidemic to begin?

COVID19 may allow us to resolve this question.

(Special thanks to Prof. Lev Muchnik for interesting discussions on this topic)

Should the president be impeached? It all depends who you ask

“The President, Vice President and all civil Officers of the United States, shall be removed from Office on Impeachment for, and Conviction of, Treason, Bribery, or other high Crimes and Misdemeanors.” Article II, Section 4, US Constitution

Professor Anat Rafaeli and I teach a course at the Technion which is intended to teach students with a background in psychology the tools of Data Science and specifically how to answer research questions from the social sciences with internet data. As part of the course, students choose a research question and answer it during the semester, using the tools we teach them.

A couple of years ago, while explaining their research question, one of the students (whose name I don’t have anymore, but will be happy to add him, if he’s reading this) showed us an intriguing chart: It displayed the search volume (from Google Trends) for the term “impeachment” for the period of a few months before and after President Trump was inaugurated. There were several spikes during that year from people searching for how to impeach the president. I didn’t find this particularly surprising given the news coming from the US.

What was surprising was a similar chart he showed us for the same period around President Obama’s inauguration. It showed similar spikes! I hadn’t heard of anyone who wanted to impeach Obama, so that spike was shocking to me.

Recently I repeated the exercise, this time adding data from a similar time period around President Biden’s inauguration (Technical note: For the first two presidents I used the “impeachment” topic, while for Biden I used the term “impeach Biden”, to exclude searches related to Trump’s impeachment trial). You can see the results in the figure below.

Search volume for impeachment during the months before and after inauguration of presidents Obama (blue), Trump (orange) and Biden (gray).

As you can see, each president has people searching for how to impeach them, first around the time that they are elected and then around inauguration. After these two events “impeachment” spikes every so often (as you can see in the Obama and Trump spikes at the end of May). More broadly, here’s Obama’s entire second term:

Google Trends search volume for impeachment over the period of President Obama’s second term in office.

Who are these people, who are so eager to impeach their president?

We can try to answer this question by looking at the correlation between how each state voted in the recent Presidential elections and the percentage of people searching for impeachment of a president. The graphs below plot the percentage of people in each state who voted for Biden as president (i.e., roughly speaking, are Democrats) compared to the search volume for impeachment.

Percentage of Biden voters in each state (Democrats) versus search volume for impeachment of Obama (top), Trump (center) and Biden (bottom). Outliers are marked.

People who wanted to impeach Obama were mostly from Republican states, as shown by the negative correlation between search volume and the percentage of Democrats in the state. The opposite is true for people who searched for impeachment during Trump’s first months in office. With Biden the correlation is much worse, but the data is skewed by a single point; once it is removed, the correlation is again reasonable (R2=0.13): For some reason, the mostly Democratic voters in Vermont are those searching more often for Biden’s impeachment.
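A quick way to check whether one state dominates a fit is to drop each state in turn and recompute R2. The sketch below does this on invented numbers in which one state (labelled VT here purely for illustration) sits far from the others; the function names are my own.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R2 of a one-variable linear fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

def most_influential(points):
    """points: {name: (x, y)}. Returns the point whose removal gives the highest
    R2, together with that R2."""
    best_name, best_r2 = None, -1.0
    for name in points:
        rest = [xy for other, xy in points.items() if other != name]
        r2 = r_squared([x for x, _ in rest], [y for _, y in rest])
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name, best_r2

# Hypothetical states: (percent voting Biden, search volume for impeachment).
states = {"S1": (30, 60), "S2": (40, 50), "S3": (50, 40), "S4": (60, 30), "VT": (66, 100)}
xs = [x for x, _ in states.values()]
ys = [y for _, y in states.values()]
r2_all = r_squared(xs, ys)
outlier, r2_without = most_influential(states)
print(outlier, round(r2_all, 2), round(r2_without, 2))
```

Dropping points after looking at the data is easy to abuse, so it is worth reporting the fit both with and without the outlier.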

What’s the bottom line? If you don’t know anyone who thinks your favorite president should be impeached, you just don’t know the right people.

Can we predict a stroke event?

Our newest paper suggests an intriguing possibility: We may be able to predict a stroke event by observing people’s activity on a search engine.

What does the evidence show?

We started with a group of anonymous Bing users who, at some point, indicated in their queries that they had undergone a stroke. We kept only those users who were active pretty much every day, then went inactive for between one and several days, and indicated their stroke just after that inactivity period. We hypothesize that the inactivity was due to their stroke, which happened just after they disappeared. We then tried to separate these users from other users, some who were of similar ages and others who indicated having other medical conditions.
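The activity filter can be pictured with a small sketch like the one below. It illustrates the idea only and is not the procedure from the paper: the thresholds (a 30-day history, 90% of days active, a gap of at most 7 days), the function names and the two example users are all hypothetical.

```python
from datetime import date, timedelta

def gap_before(active_days, mention_day):
    """Number of consecutive inactive days immediately before mention_day."""
    gap = 0
    day = mention_day - timedelta(days=1)
    while day not in active_days and gap < 365:  # cap the search at one year
        gap += 1
        day -= timedelta(days=1)
    return gap

def matches_pattern(active_days, mention_day, history=30, min_active=0.9, max_gap=7):
    """True if the user was active almost daily, then silent for 1..max_gap days,
    and then mentioned the event on mention_day."""
    gap = gap_before(active_days, mention_day)
    if not 1 <= gap <= max_gap:
        return False
    last_active = mention_day - timedelta(days=gap + 1)
    window = [last_active - timedelta(days=i) for i in range(history)]
    active_fraction = sum(d in active_days for d in window) / history
    return active_fraction >= min_active

mention = date(2021, 3, 10)
# User A: active daily until March 6, silent for three days, back on March 10.
user_a = {date(2021, 1, 1) + timedelta(days=i) for i in range(65)} | {mention}
# User B: active every single day, so there is no gap before the mention.
user_b = {date(2021, 1, 1) + timedelta(days=i) for i in range(69)}
print(matches_pattern(user_a, mention), matches_pattern(user_b, mention))
```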

To separate the users we represented them through a variety of attributes such as the time of day of queries, the time since their previous session, etc., but more importantly, attributes which were previously linked to cognitive decline such as the complexity of queries, the deepest link that was clicked, and more.

What we found was that it was quite easy to separate these populations of users. Of course, it may be that there were other things that were different between these populations even though we took care to select them in the same way that we chose the stroke population. However, we did find that people with cardiovascular diseases were harder to differentiate from the stroke population than people with other conditions.

We also applied our model to data that was collected a year later. Here we didn’t have many people who indicated a stroke, so we used a weaker label, which was the number of times each user was interested in stroke. This is an indicator that was used in the past to find people who are suffering from cancer. The model successfully found those people who are interested in stroke, just by looking at the metadata of their queries, through attributes such as those described above.

Predicting when a stroke will occur

It seems that it’s possible to differentiate populations of users who will undergo stroke from others. Can we also localize the stroke in time? That is, can we predict if a stroke will occur within the next few days?

The results here are not as strong, but they do indicate the possibility of localizing the stroke event. According to our findings, in the 3-4 months prior to the stroke event something begins to change and people’s attributes begin to be more similar to those of people who will undergo a stroke. This could be because of microstrokes or other cardiovascular events.

The summary of our findings is the intriguing possibility that stroke causes cognitive changes some time before a stroke happens, and that these changes can be identified through people’s interactions with search engines. If this is true, the upshot could be dramatic: We may be able to prevent stroke by analyzing people’s queries and, if they indicate a possible event in future, have doctors prescribe simple medications such as aspirin. As our medical partners (Prof. Stern and Dr. Shaklai) said, there’s a lot to do before stroke and not a lot after it.

However, all our data is derived from queries of people who indicated their health conditions. We don’t have their medical records. Therefore, we’re now trying to set up a clinical trial which will collect both query data and medical records from people and validate our hypothesis.

What do you want to do after you get your COVID19 vaccine?

As I write this blog post (May 18th, 2021), more and more people have been vaccinated against COVID-19. In the US, 47% of the population have received the first dose of the vaccine and 37% are fully vaccinated. In Israel around 70% of the population are fully vaccinated.

I thought to try and find out what people were most looking for now that COVID-19 will not be a risk for them. I started with Google’s autocomplete:

As you can see, some people want to know how to deal with the immediate aftermath of the vaccine. They ask about Tylenol and other pain medications, but also how soon they can eat, drink, or smoke. Many people ask about things they could do before COVID-19 but not during the pandemic. These include travel and exercise (presumably at the gym).

A fun exercise is to look at these needs across U.S. states and across different countries of the world. To do this, I queried Google Trends for the volume of queries for each of these needs (e.g., “after covid vaccine can I smoke?”) during the past 3 months and also for the volume of queries beginning “after covid vaccine”. The latter served as a baseline. I calculated the ratio between these two volume indicators given for each state. On a technical note, Google only gives a normalized score for each of the volumes, so we can’t treat this as excess searches per-se. Also, if the volume of queries is too low Google does not provide a number and these are missing data for us.
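The ratio computation, including the handling of regions for which Google reports nothing, might look like the sketch below. The volumes are made up, `None` stands in for a value Google Trends did not provide, and the function and variable names are my own.

```python
def need_ratios(need_volume, baseline_volume):
    """Ratio of the volume for a specific need to the baseline volume, per region.
    A region gets None when either volume is missing or the baseline is zero."""
    ratios = {}
    for region, base in baseline_volume.items():
        need = need_volume.get(region)
        if need is None or not base:
            ratios[region] = None
        else:
            ratios[region] = need / base
    return ratios

# Hypothetical normalized volumes (0-100) for one need and for the baseline
# query "after covid vaccine".
smoke = {"TX": 80, "CA": 40, "SD": None}
baseline = {"TX": 50, "CA": 100, "SD": 60, "WY": None}
ratios = need_ratios(smoke, baseline)
print(ratios)
```

Because both volumes are normalized scores, the ratio is only useful for comparing regions with each other, not as an absolute count of searches.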

Interestingly, the correlation between query volume for “after covid vaccine” and the percentage of fully vaccinated people in each state is quite high at 0.80, and only slightly lower (0.78) with the percentage of people who received at least one shot. Therefore, this does seem to reflect interest by people who are getting their vaccines.

Here are maps for these ratios, first for the immediate interests and then for the longer-term ones:

Gray countries are those for which there were too few queries. Colors represent how much more volume there was for the query in the title compared to the query “after covid vaccine” (scale is on the right of each image).

Side effects seem to worry everyone, but the least likely to be worried are people from South Dakota, Maine, Montana and Nevada. Californians and New Jerseyites really want their Tylenol. Once they stop worrying about their vaccines, many Texans would like a smoke and a drink (but drinking is also a favorite in California and Ohio).

As for the longer-term wants:

Gray countries are those for which there were too few queries. Colors represent how much more volume there was for the query in the title compared to the query “after covid vaccine” (scale is on the right of each image).

Californians (of course?) want to go back to the gym. Travel is yearned for in Georgia, New York and Washington.

Worldwide the data is much sparser. This is probably because I’m looking at queries in English. Nevertheless, here are some findings of note: Folks in the Philippines would like to go back to the gym. Alcohol is sought by people in Mexico, UK, India and (presumably expats) in the UAE. This is also true, albeit to a lesser extent, in Canada and Australia. Travel features high on the list for UAE, Canada and Australia.

What does all this mean? Probably not much beyond the obvious, but it’s still fun to see it in the data.

Should we expect a surge of COVID19 babies?

When the COVID19 pandemic began spreading, people started making predictions on what its short-term effects would be. There were predictions of a global recession, more home cooking, and even a rise in the divorce rate. One prediction was highly specific: There would be many “COVID babies”.

I’ve been trying to figure out if that prediction is true using search data. Here’s the trend of searches for pregnancy tests in the USA for the past 5 years, taken from Google Trends. It tells an interesting story.

As you can see, every year at roughly the end of March or early April there’s a spike of searches. I’ve marked them with triangles. There is also a wave of searches around the July timeframe.

The spikes correspond to the week or two after spring break. You can guess why… The July surge might be related to planned spring babies or perhaps it’s summer love?

But what happened this year? Interestingly, there’s a drop in searches corresponding to the time of the beginning of the pandemic. Perhaps people couldn’t go out to buy pregnancy tests or perhaps they were under so much stress due to the pandemic that they didn’t care about those tests. Interestingly, the spring break spike is there in all its glory. So is the July surge. In fact, it’s probably larger than in most years, seemingly compensating for the March dip.

Therefore, the bottom line is, there’s no abnormal spike in searches for pregnancy tests in the USA since the pandemic began. Does that mean there won’t be a surge of babies in another few months? I don’t know, but my guess is, probably not.

Black Lives Matter – Is it a party-political issue?

Black Lives Matter protests have been dominating my Twitter feed for the past several days. Google Trends data shows a similar trend:

(Interest in “Black Lives Matter” in the US over the past 30 days)

But is it the same experience across the US?

It turns out that interest is strongly related to political affiliation. Here is a scatter plot of Google Trends interest in “Black Lives Matter” at the US state level, compared to the percentage of the vote for Clinton in the 2016 presidential elections.

(50 states, excluding Washington DC)

The fit is pretty remarkable (R2=0.55), with states that have more Democratic votes showing more interest in the topic.

States which defy the trend are Utah, Idaho, and Wyoming (more interest in the topic than expected by their voting patterns) and, on the opposing side, Mississippi, South Dakota, and Florida (less interest than expected). Also, anecdotally, the “Related queries” shown by Google Trends in California and Oregon are related to donations to Black Lives Matter, whereas in Mississippi and Florida they are for merchandise.
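States that defy a trend like this can be found by fitting a line and looking at the residuals, that is, how far each state lies above or below the value predicted from its vote share. Here is a minimal sketch with invented numbers; the labels, values and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(points):
    """points: {name: (x, y)}. Returns {name: observed y minus fitted y}."""
    names = list(points)
    slope, intercept = fit_line([points[n][0] for n in names],
                                [points[n][1] for n in names])
    return {n: points[n][1] - (slope * points[n][0] + intercept) for n in names}

# Hypothetical states: (percent of the vote for Clinton, search interest).
states = {"A": (30, 40), "B": (40, 50), "C": (50, 60), "D": (60, 70),
          "U": (35, 70), "F": (55, 45)}
resid = residuals(states)
more_than_expected = max(resid, key=resid.get)
less_than_expected = min(resid, key=resid.get)
print(more_than_expected, less_than_expected)
```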

I also tested interest in the terms “protests” and “looting” across the different states. The former behaves similarly to “Black Lives Matter” while the latter had a similar trend to that of “Black Lives Matter”, only breaking for highly Democrat states, where there was far less interest in looting than expected by the overall trend.

Political scientists may want to theorize if this will change election results, but at least overall the pattern seems to suggest that this is (still?) an issue where interest depends on who you vote for.

Addendum (16 June 2020):

In July 2016 large-scale Black Lives Matter demonstrations were held in 88 cities. Google Trends data from that period (May – July 2016) shows no correlation (R2=0.00) with voting patterns.