One of the most common ways to model the spread of an infectious disease in a population is through compartment models, so called because they divide the population into compartments, with each person residing in one compartment. Perhaps the most common variant is the Susceptible-Infected-Recovered (SIR) model, where people are in one of those 3 compartments. A simple set of 3 differential equations describes the movement of people between these compartments. Thus, for example, the number of infected people in the next time step is dependent on the number of currently susceptible individuals, the number of infected people they come into contact with, and the infection rate, minus the number of people who recover in a time step.
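To make the compartment idea concrete, here is a minimal sketch of a discrete-time SIR update. The function names, the parameter values, and the choice of one update per time step are illustrative assumptions, not taken from any particular paper.

```python
def sir_step(s, i, r, beta, gamma):
    """Advance a discrete-time SIR model by one time step.

    s, i, r: current numbers of susceptible, infected and recovered people.
    beta:    infection rate (assumed value, chosen by the caller).
    gamma:   fraction of infected people who recover per time step.
    """
    n = s + i + r
    # New infections depend on how many susceptible people there are and
    # on the fraction of the population that is currently infected.
    new_infections = beta * s * i / n
    new_recoveries = gamma * i
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)


def simulate_sir(s0, i0, r0, beta, gamma, steps):
    """Return the list of (S, I, R) states, including the initial one."""
    history = [(s0, i0, r0)]
    for _ in range(steps):
        history.append(sir_step(*history[-1], beta, gamma))
    return history


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 people, 10 of them infected.
    for week, state in enumerate(simulate_sir(990.0, 10.0, 0.0, 0.3, 0.1, 5)):
        print(week, [round(x, 2) for x in state])
```

Each step only moves people between compartments, so the total population stays constant.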
In most cases people don’t just belong to one compartment, because populations are
not homogeneous. For example, it makes sense to divide the population not just
into the SIR compartments but also according to the country they live in.
We publish an extended SIR model which can model heterogeneous populations, divided,
for example, by area of residence, age group, etc. By fitting the model to Google
Trends data for two common viruses, we reveal information about the
complex spatial structure of disease spread.
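One generic way to handle a population split into groups (for example, states) is to give every pair of groups its own transmission rate. The sketch below shows that idea only; it is not the model from the paper, and the matrix `beta`, the function name and the numbers in the usage note are assumptions made for illustration.

```python
def multigroup_sir_step(S, I, R, beta, gamma):
    """One discrete-time step of an SIR model with several groups.

    S, I, R: lists with one entry per group.
    beta:    beta[j][k] is the (assumed) rate at which infected people in
             group k infect susceptible people in group j. Diagonal entries
             are within-group rates, off-diagonal entries are between-group.
    gamma:   fraction of infected people who recover per time step.
    """
    groups = range(len(S))
    N = [S[j] + I[j] + R[j] for j in groups]
    new_S, new_I, new_R = [], [], []
    for j in groups:
        # Infection pressure on group j, summed over every source group k.
        force = sum(beta[j][k] * I[k] / N[k] for k in groups)
        new_infections = min(S[j], force * S[j])
        new_recoveries = gamma * I[j]
        new_S.append(S[j] - new_infections)
        new_I.append(I[j] + new_infections - new_recoveries)
        new_R.append(R[j] + new_recoveries)
    return new_S, new_I, new_R
```

For instance, with two groups and `beta = [[0.3, 0.0], [0.05, 0.3]]`, an outbreak that starts in the first group can seed the second one through the off-diagonal entry `beta[1][0]`, while setting that entry to zero keeps the second group disease-free.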
The viruses we tested were Respiratory Syncytial Virus (RSV) and West Nile Virus (WNV). No COVID-19 data here. Sorry.
Although we make no prior assumptions on spatial structure, human movement patterns
in the US explain 27%–30% of the estimated inter-state transmission rates. The
transmission rates within states are correlated with known demographic
indicators, such as population density and average age.
Our model also allows prediction of disease spread in subsequent seasons using the model parameters estimated for previous seasons and as few as 7 weeks of data from the current season.
The work was done mostly by our then intern, Dr. Inbar Seroussi.
On September 6th this year the Centers for Disease Control and Prevention (CDC) put out an Investigation Notice concerning a (suspected) outbreak of lung illness associated with using E-cigarette products. According to this notice, CDC is reviewing reports of a severe pulmonary disease associated with E-cigarette products, following reports from 33 US states.
People who are suspected to have this disease report the following symptoms:
cough, shortness of breath, or chest pain
nausea, vomiting, or diarrhea
fatigue, fever, or weight loss
Brief summary if you want to decide whether or not to read further: Bing data seems to show that these symptoms appear in people who are likely using E-cigarettes, and offers a few additional likely symptoms.
I suspect it took a while to realize the possible adverse reactions associated with E-cigarettes because nobody thought of asking people who turned up at the doctors’ if they were using E-cigarettes. Additionally, the CDC reports that it can take weeks and sometimes longer for symptoms to develop.
Late-appearing symptoms and ones that might not immediately seem obvious to a doctor are exactly the kinds of symptoms that people’s search engine queries are good at detecting. Thus, I turned to search data to see what it might show.
I extracted 9 months (October 2018 – June 2019) of Bing search data. I chose this period of time because it was well before information about the new pulmonary disease was widely reported in the media. These data include searches by people in the United States. Each record comprises the text of the search, its time and date, and an anonymous user identifier.
To analyze the data I followed the methodology Evgeniy Gabrilovich and I developed for our paper on pharmacovigilance, which showed that it was possible to discover new side effects of drugs from search data. Specifically, I filtered the data to focus on those users who mentioned E-cigarette products. My list comprised general terms related to electronic cigarettes and vaporizers, as well as the brand names of popular E-cigarettes. Although not everyone who mentions an E-cigarette in their queries uses them, our experience with other products suggests that many who mention them are users. Approximately half a million users mentioned these products in their queries during the data period.
I then found all mentions of one of 195 medical symptoms that these users made before or after the first time they queried for an E-cigarette product. As a control population I found all the users who mentioned symptoms in their queries but did not mention an E-cigarette product. For those users I picked a random reference date between their first and last query in our data. I also removed topical queries (which spiked for a few days and then disappeared) and popular queries that were obviously unrelated to medical symptoms. These include, for example, queries mentioning celebrities and their medical issues.
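The choice of reference dates described above can be sketched as follows. This is an illustration only: the record layout `(user_id, timestamp, text)`, the simple substring matching, the term sets and the function name `reference_dates` are all assumptions, and the filtering of topical and celebrity queries is left out.

```python
import random
from datetime import datetime, timedelta


def reference_dates(records, product_terms, symptom_terms, rng):
    """Split users into exposed and control groups with a reference date each.

    records:       iterable of (user_id, timestamp, query_text) tuples
                   (a hypothetical layout).
    product_terms: terms that count as mentioning the product.
    symptom_terms: terms that count as mentioning a symptom.
    rng:           a random.Random instance, so runs are reproducible.
    """
    by_user = {}
    for user, ts, text in records:
        by_user.setdefault(user, []).append((ts, text.lower()))

    exposed, control = {}, {}
    for user, queries in by_user.items():
        queries.sort()
        mentions = [ts for ts, text in queries
                    if any(term in text for term in product_terms)]
        if mentions:
            # Exposed users: reference date is their first product mention.
            exposed[user] = mentions[0]
        elif any(term in text for _, text in queries for term in symptom_terms):
            # Control users: a random date between first and last query.
            first, last = queries[0][0], queries[-1][0]
            offset = rng.random() * (last - first).total_seconds()
            control[user] = first + timedelta(seconds=offset)
    return exposed, control
```

Symptom mentions can then be labeled as "before" or "after" by comparing each query's timestamp with the user's reference date.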
I then scored each symptom using QLRS statistics (see our 2013 paper). Briefly stated, a symptom will receive a high score if we saw a significant rise in the likelihood that it will be queried in the population that also queried for E-cigarettes after their first mention of the product, compared to the control population.
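To give a feel for this kind of comparison, here is a deliberately simplified stand-in. It is not the QLRS statistic; it is a plain ratio of before/after changes with add-one smoothing, and the function name, the smoothing constant and the counts in the usage note are assumptions for illustration.

```python
def before_after_score(exposed_before, exposed_after,
                       control_before, control_after, smoothing=1.0):
    """Ratio of the before-to-after change in the exposed population
    to the same change in the control population.

    Each argument is the number of users who queried a symptom before or
    after their reference date. Scores above 1 mean the symptom rose more
    in the exposed population than in the control population.
    """
    exposed_change = (exposed_after + smoothing) / (exposed_before + smoothing)
    control_change = (control_after + smoothing) / (control_before + smoothing)
    return exposed_change / control_change


def rank_symptoms(counts):
    """counts maps symptom -> (exposed_before, exposed_after,
    control_before, control_after). Returns symptoms, highest score first."""
    return sorted(counts, key=lambda s: before_after_score(*counts[s]),
                  reverse=True)
```

With made-up counts such as `{"cough": (10, 30, 100, 100), "rash": (10, 10, 100, 100)}`, `rank_symptoms` puts cough first, because its mentions rose after the reference date only in the exposed population.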
The symptoms that received the highest scores are shown in the table below. Notice that among the top 10 symptoms at least 4 are also mentioned in the CDC report. The top 3 are all known symptoms.
Incidentally, some of the other symptoms reported by CDC are ranked high, though not in the top 10. For example, diarrhea is ranked 12th.
The temporal profile of symptom mentions seems to support the CDC report. In the figure below I plotted the likelihood that a person in the E-cigarette population would ask about cough over time, normalized by the likelihood of asking about cough in the control population. According to this figure, in the first few days after the first mention of an E-cigarette product, cough is slightly less likely than in the control population. However, within a few weeks, cough becomes more prominent, to the point that it’s about 20% more likely than in the control population.
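A normalized curve of this kind could be computed roughly as in the sketch below. The weekly binning, the event layout `(user_id, days_since_reference)` and the function name are assumptions for illustration, not the exact procedure behind the figure.

```python
from collections import defaultdict


def relative_likelihood_by_week(exposed_events, control_events,
                                n_exposed, n_control):
    """Weekly likelihood of a symptom query in the exposed population,
    divided by the same likelihood in the control population.

    exposed_events, control_events: iterables of
        (user_id, days_since_reference) pairs for one symptom, where days
        are whole numbers counted from each user's reference date.
    n_exposed, n_control: total number of users in each population.
    """
    def weekly_fraction(events, n_users):
        users_by_week = defaultdict(set)
        for user, days in events:
            # Count each user at most once per week.
            users_by_week[days // 7].add(user)
        return {week: len(users) / n_users
                for week, users in users_by_week.items()}

    exposed = weekly_fraction(exposed_events, n_exposed)
    control = weekly_fraction(control_events, n_control)
    # Weeks with no control queries are skipped to avoid dividing by zero.
    return {week: exposed[week] / control[week]
            for week in sorted(exposed) if control.get(week)}
```

A value of 1.2 for a given week would correspond to the symptom being about 20% more likely in the exposed population than in the control population during that week.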
Given these findings I would suggest that search query data shows the traces of this mysterious new pulmonary disease, recently reported by CDC.
These results also suggest that researchers should investigate the possibility of additional adverse effects of E-cigarette use, including depression, anxiety, and perspiration.
Crowdsourced Health is a research project which aims to learn about health and medicine from online data. The latter include search engine queries, social media, and other online sources.
The output of the project is published in academic papers, listed here. Each paper I’ve published is accompanied by a social media post describing the paper, for people who prefer not to read the lengthier paper.
During our work we often have findings that are interesting, but perhaps not worth an entire paper. That’s why I’m going to try a new publication format, through short, concise blog posts. I hope these will be interesting for you.