To evaluate whether a digital surveillance model using Google Trends is feasible for obtaining accurate data on coronavirus disease 2019 and whether accurate predictions can be made regarding new cases.
Data on total and daily new cases in each US state were collected from January 22, 2020, to April 6, 2020. Information regarding 10 keywords was collected from Google Trends, and correlation analyses were performed for individual states as well as for the United States overall.
Among the 10 keywords analyzed from Google Trends, face mask, Lysol, and COVID stimulus check had the strongest correlations when looking at the United States as a whole, with R values of 0.88, 0.82, and 0.79, respectively. Lag and lead Pearson correlations were assessed for every state and all 10 keywords from 16 days before the first case in each state to 16 days after the first case. Strong correlations were seen up to 16 days prior to the first reported cases in some states.
This study documents the feasibility of syndromic surveillance of internet search terms to monitor new infectious diseases such as coronavirus disease 2019. This information could enable better preparation and planning of health care systems.
Further sequencing analysis revealed the involvement of a novel strain of virus named severe acute respiratory syndrome coronavirus 2 obtained from the samples of the lower respiratory tract of infected patients.
The number of cases quickly accelerated, and eventually the disease spread to the United States, with the first confirmed case announced in January 2020; the World Health Organization labeled the situation a pandemic on March 11, 2020.
Web-based big data analytics has been gaining popularity in its potential to predict the distribution of infectious diseases.
Internet usage has brought about a revolution when it comes to health care knowledge accessibility to the public. Monitoring and analysis of Internet data has come under the research field known as infodemiology, defined as obtaining data from Web-based resources and repurposing it to inform public health and health policymaking.
Web-based activity detection tools can play a vital role in early detection of infectious events and help in the timely preparedness of respective health care systems in order to avoid the adverse consequences of being caught by surprise. Among these Web-based surveillance tools, one of the most prominent is Google Trends.
Google Trends is one of the most efficient trend analyzers to determine Internet search behavior. Google search is based on pattern analysis focused on the most searched keywords that are centered around concerns of the general public. Google Trends provides valuable insights into community dynamics and health-related problems, particularly in the area of infectious diseases. Big data produced by Google Trends has proved to be valuable for correlation assessments and forecasting models of a number of infectious diseases including influenza, Middle East respiratory syndrome (MERS), Zika virus, and more; it has also been found to be a useful tool for the assessment of dementia cases in the population.
Since the first case of coronavirus disease 2019 (COVID-19) appeared in the United States, there has been an exponential increase in the daily number of cases. The United States now has the highest number of cases in the world, with the most deaths globally.
The purpose of this study was to explore whether there is a correlation between certain keywords searched by the general public in Google and the number of COVID-19 cases in the United States on a state-by-state basis. Significant correlations could suggest the utilization of Google Trends to predict new COVID-19 case locations and hotspots.
Google Trends Data
Google Trends processes the magnitude of Web searches performed for a specified keyword, among other searches, providing the relative search volume (RSV) for each keyword. This standardized value is calculated by dividing the total number of searches for a keyword by the total searches of the geography and time range it represents to compare relative popularity. The resultant number ranges from 0 to 100 and is based on the topic’s daily popularity compared with its search popularity over a given time frame.
Trend changes are displayed online for time series of interest. Keywords can be filtered by location (worldwide, country, state, city) and time span. Data are collected in a time series presented on a normalized scale of 0 to 100, where 0 represents no search and 100 represents the peak search activity for a particular keyword or string. Data can be downloaded as a “.csv” (comma-separated values) file. Google Trends’ daily base data were mined in our study from January 22, 2020, to April 6, 2020. In total, 10 keywords related to COVID-19 were chosen on the basis of popularity and increasing patterns on the Internet and Google News in the study period. The following keywords were searched: COVID symptoms, coronavirus symptoms, sore throat+shortness of breath+fatigue+cough, coronavirus testing center, loss of smell, Lysol (sanitizer), antibody, face mask, coronavirus vaccine, and COVID stimulus check. Keyword categories included disease symptoms, prevention, testing, and possible treatments. Our search method was to perform a query for each keyword for each US state individually. In total, we obtained data for 50 states for each selected keyword.
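The normalization described above can be illustrated with a minimal sketch. Google does not publish its exact computation, so this is an assumption-laden illustration rather than the Google Trends algorithm: the function name `relative_search_volume` and the input counts are hypothetical. The sketch divides keyword searches by total searches for each day and rescales the result so that the peak day equals 100.

```python
def relative_search_volume(keyword_counts, total_counts):
    """Scale daily keyword search shares to a 0-100 range (peak day = 100).

    Hypothetical helper: illustrates the idea of a relative search volume,
    not Google's actual (unpublished) implementation.
    """
    if len(keyword_counts) != len(total_counts):
        raise ValueError("both series must have the same length")
    # Share of all searches that were for the keyword on each day.
    shares = [k / t if t else 0.0 for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0 for _ in shares]
    # Rescale so the most popular day maps to 100, then round to integers.
    return [round(100 * s / peak) for s in shares]


# Made-up counts for four days (not real Google data).
rsv = relative_search_volume([10, 20, 40, 0], [1000, 1000, 1000, 1000])
print(rsv)
```

Because every value is expressed relative to the peak within the chosen geography and time range, RSV values from separate queries are not directly comparable as absolute search counts.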
COVID-19 Case Data
Data for the daily new and total number of confirmed cases and deaths have been tracked and reported by the Johns Hopkins University Center for Systems Science and Engineering. At the time of this study, the data provided included COVID-19 case data on a county-by-county basis for each of the 50 states. Total US cases reported from January 22, 2020, to April 6, 2020, were available; this is the time frame utilized in this study. County data for each state were combined to create a state-by-state data set.
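Combining county data into a state-by-state data set amounts to summing counts within each state and date. The following is a minimal sketch under stated assumptions: the field names (`state`, `county`, `date`, `new_cases`) and the sample rows are hypothetical and do not reflect the actual layout of the Johns Hopkins files.

```python
from collections import defaultdict


def aggregate_by_state(county_rows):
    """Sum county-level daily new cases into (state, date) totals."""
    totals = defaultdict(int)
    for row in county_rows:
        totals[(row["state"], row["date"])] += row["new_cases"]
    return dict(totals)


# Made-up rows for illustration only (not real case counts).
rows = [
    {"state": "Minnesota", "county": "Hennepin", "date": "2020-03-10", "new_cases": 2},
    {"state": "Minnesota", "county": "Ramsey", "date": "2020-03-10", "new_cases": 1},
    {"state": "Arizona", "county": "Maricopa", "date": "2020-03-10", "new_cases": 3},
]
state_totals = aggregate_by_state(rows)
print(state_totals)
```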
To assess the relationship between COVID-19 cases and keyword patterns in Google Trends, correlation analysis was performed using R version 3.6.2 (R Foundation for Statistical Computing). Ten keywords in Google Trends were searched and data were collected from January 22, 2020, to April 6, 2020. We plotted each keyword’s RSV from January to April of 2020. Pearson correlation coefficients were calculated between each keyword’s standardized RSV and the number of daily new COVID-19 cases, and 95% CIs were also calculated. We used the correlation coefficients of selected keywords and daily new COVID-19 cases to create a heat map for each of the 50 states at time zero (the day of the first case in the state). To study the association between COVID-19 cases and Google search trends for each of the 10 keywords, we created scatterplots showing the number of COVID-19 cases against a standardized daily Google search RSV value.
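For readers who want to see the calculation concretely, the sketch below computes a Pearson correlation coefficient and an approximate 95% CI. The analysis itself was run in R; this Python version is only illustrative. The text does not state how the CIs were derived, so the Fisher z-transformation used here is an assumption, and the function names and sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation.

    Assumed method (not stated in the text). Requires n > 3 and |r| < 1.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Made-up daily values for illustration only.
rsv = [10, 20, 35, 50, 80, 100]
new_cases = [1, 3, 4, 9, 15, 18]
r = pearson_r(rsv, new_cases)
lower, upper = fisher_ci(r, len(rsv))
print(f"R={r:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With short series the interval is wide, which is worth keeping in mind when comparing coefficients across states with few days of case data.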
Lag and lead Pearson correlation coefficients were calculated for all 50 states as well as the United States as a whole. The lag/lead times for each state started 16 days prior to time zero (day of the first case in that state) and 16 days after time zero. We compared the correlation coefficients for each keyword’s RSV and daily new COVID-19 cases between day −16 and day +16 in all 50 states as well as the United States as a whole.
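One common way to compute lag and lead correlations is to shift one series against the other and recompute the Pearson coefficient at each offset. The sketch below takes that approach; whether it matches the exact procedure used in this study is an assumption, and the function names, the sign convention for `shift`, and the sample data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def shifted_correlations(rsv, cases, max_shift=16):
    """Correlate rsv[t] with cases[t + shift] for each shift in -max_shift..max_shift.

    Assumed convention: a positive shift pairs searches with cases that
    occur `shift` days later.
    """
    if len(rsv) != len(cases):
        raise ValueError("both series must have the same length")
    n = len(rsv)
    results = {}
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            x, y = rsv[: n - shift], cases[shift:]
        else:
            x, y = rsv[-shift:], cases[: n + shift]
        if len(x) < 3:
            continue  # too few overlapping days to be meaningful
        try:
            results[shift] = pearson_r(x, y)
        except ValueError:
            continue  # skip offsets where one window is constant
    return results


# Made-up data: cases repeat the search curve 3 days later.
rsv = [0, 0, 1, 3, 6, 10, 6, 3, 1, 0, 0, 0, 0, 0]
cases = [0, 0, 0] + rsv[:-3]
results = shifted_correlations(rsv, cases, max_shift=5)
best_shift = max(results, key=results.get)
print(best_shift, round(results[best_shift], 2))
```

Each offset shortens the overlapping window, so coefficients at the largest offsets rest on the fewest paired days.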
Ten keywords were searched in Google Trends and data were compiled from January 22, 2020, to April 6, 2020. Keywords generally increased in search popularity over time compared with baseline; some keywords, such as COVID symptoms, peaked in popularity toward mid-March, while others, such as face mask, continued to increase in popularity into April (Figure 1). Correlation coefficients were calculated between each keyword and each of the 50 states’ daily new COVID-19 cases as well as the daily new COVID-19 cases in the United States as a whole. When looking at the United States as a whole, keyword correlations ranged from R=0.06 (coronavirus symptoms) to R=0.88 (face mask); 6 of the 10 keywords had moderate correlations (R=0.3 to 0.7) with daily new COVID-19 cases in the United States, while 3 of the 10 keywords had strong correlations (R=0.7 to 1) (Table 1). The 3 keywords with strong correlations for the United States as a whole were face mask, Lysol, and COVID stimulus check, with R values of 0.88, 0.82, and 0.79, respectively. When looking at correlations on a state-by-state basis, 4 keywords with considerable correlations nationwide included COVID symptoms, coronavirus testing center, loss of smell, and face mask. COVID symptoms had correlations ranging from 0.37 to 0.80, coronavirus testing center had correlations ranging from −0.06 to 0.63, loss of smell had correlations ranging from 0.02 to 0.76, and face mask had correlations ranging from 0.35 to 0.90 (Supplemental Table 1, available online at http://www.mayoclinicproceedings.org). These correlations are further represented in Figure 2 as a United States heat map.
Table 1. Overall US Correlation Coefficients for 10 Google Keywords and Daily New COVID-19 Cases
[Table body not recovered. Surviving column headers: 95% CI (lower), 95% CI (upper). Surviving row label: Sore throat + shortness of breath + fatigue + cough.]
Search popularity for each keyword varied with COVID-19 case numbers. Some keywords such as antibody and Lysol had higher popularity as COVID-19 cases increased; other keywords such as COVID symptoms and coronavirus vaccine had higher popularity when COVID-19 case numbers were lower (Figure 3). To further assess this difference, lag and lead Pearson correlation coefficients were calculated for all 10 keywords and each of the 50 states, along with the United States as a whole. Lag correlations were calculated up to 16 days before the first case, and lead correlations were calculated up to 16 days after the first case. Most of the keywords had moderate to strong correlations days before the first COVID-19 cases appeared, with diminishing correlations following the first case (Figure 4). Coronavirus symptoms, for example, had its strongest correlations 16 days prior to the first case in the United States (R=0.77) and in most of the 50 states individually. All calculated lag and lead correlation coefficients for each of the 10 keywords and the 50 states, as well as the United States overall, are displayed in Supplemental Table 1 (available online at http://www.mayoclinicproceedings.org). When looking at Minnesota, Arizona, Florida, and New York, strong keyword correlations were seen up to 16 days prior to the first reported cases in each of these states. These 4 states are reported here individually because our institution (Mayo Clinic) has campuses in 3 (Minnesota, Arizona, and Florida) and New York was selected because it was the most strongly impacted area during the beginning of the pandemic in the United States. For Minnesota, the strongest correlations for COVID symptoms, coronavirus symptoms, Lysol, and coronavirus vaccine were seen on lag day 8 (R=0.87), lag day 14 (R=0.85), lag day 15 (R=0.70), and lag day 16 (R=0.82), respectively (Table 2). 
For Arizona, the strongest correlations for COVID symptoms, coronavirus symptoms, sore throat + shortness of breath + fatigue + cough, loss of smell, Lysol, coronavirus vaccine, and COVID stimulus check were seen on lag day 9 (R=0.80), lag day 16 (R=0.82), lag day 11 (R=0.73), lag day 3 (R=0.66), lag day 1 (R=0.73), lag day 14 (R=0.69), and lag day 2 (R=0.84), respectively (Table 3). For Florida, nearly every keyword had strong correlations prior to the first case in the state; the strongest correlations for COVID symptoms, coronavirus symptoms, loss of smell, and coronavirus vaccine were seen on lag days 10 and 11 (R=0.74), lag day 16 (R=0.78), lag day 8 (R=0.70), and lag day 15 (R=0.75), respectively (Table 4). For New York, the strongest correlations for COVID symptoms, coronavirus symptoms, coronavirus testing center, loss of smell, and coronavirus vaccine were seen on lag days 5, 6, and 7 (R=0.87), lag day 16 (R=0.87), lag day 9 (R=0.76), lag days 2, 4, and 5 (R=0.78), and lag day 15 (R=0.80), respectively (Table 5).
Table 2. Minnesota Lag and Lead Correlation Coefficients for Each Google Keyword’s Relative Search Volume and New COVID-19 Cases
Our study found moderate to strong correlations between data obtained from searching COVID-19–related keywords in Google Trends and total COVID-19 cases in the United States as obtained from national data aggregators. Strong correlations were seen up to 16 days prior to the first reported cases in some states. This finding emphasizes the importance of digital surveillance and suggests that it can be a useful addition to our toolbelt when trying to monitor new infectious disease outbreaks.
Over the years, several studies have pointed to the role of Internet surveillance in helping with early prediction of other infectious disease outbreaks, including diseases such as dengue fever, Zika virus, H1N1, influenza, measles, and MERS.
There are several benefits to utilizing Internet surveillance methods vs traditional methods, and employing a combination of the two is likely the key to an effective surveillance system. One benefit to an Internet model is minimal costs because all of the data gathered from Google Trends were available free. Furthermore, the data are made available to the public in real time, with near-instant updates in regard to search results. This factor is extremely important when attempting to predict outbreaks and new hotspots for a pandemic because any delay in information could potentially miss the “golden window” that would allow for preparation prior to an outbreak in a certain location. Several other articles focusing on influenza have emphasized the pitfalls of traditional surveillance and how the US Centers for Disease Control and Prevention surveillance reports were often weeks behind search engine results and estimates because traditional systems take 1 to 2 weeks to gather and process surveillance data.
This type of lag was further supported in our study of COVID-19, as Google data on search trends predated the first reports of cases on a state-by-state basis. In a study on MERS reported in 2016, Shin et al found a similar lag pattern, with social media and search engine data reflecting disease outbreak earlier than conventional surveillance models. Scientists in China also looked for this data lag with COVID-19 in their country and had similar results. They looked back 14 days prior to the first reported cases and found that “the peak Internet searches and social media data about the COVID-19 outbreak occurred 10-14 days earlier than the peak of daily incidences in China.”
We suspect that our US data reveal similar lags in traditional surveillance data for a number of reasons. First, hospital reporting can vary from state to state and even county to county. Although we try to standardize reporting guidelines, during a pandemic, when hospital systems and the country are becoming increasingly stressed, appropriate reporting can break down. In fact, inappropriate reporting can lead to significant inaccuracies when data are released using traditional surveillance models. For example, on April 17, 2020, China raised its coronavirus death toll in Wuhan by 50% in comparison with its previously reported numbers.
A second important source of data lag using traditional surveillance in the United States is the lack of testing required for the current pandemic. Testing is evolving on a day-by-day basis, and, thankfully, we are moving in the right direction; however, the United States and the world still have a long way to go. Testing capabilities were sparse at the beginning of the US outbreaks, and many areas were backlogged in their abilities to test for COVID-19. Even if patient samples were available, the time to test that sample and report the diagnosis back to the physician and patient was delayed because testing capabilities were not robust. This issue, of course, results in a delay in reported cases and is where Internet surveillance could add value. As the pandemic continues to evolve, the need for quicker testing and an increase in the quantity of testing for COVID-19 is paramount. In an article by Gottlieb et al regarding the reopening of the United States, the authors stated, “We estimate that a national capacity of at least 750,000 tests per week would be sufficient. In conjunction with more widespread testing, we need to invest in new tools to make it efficient for providers to communicate test results and make data easily accessible to public-health officials working to contain future outbreaks.” Data accessibility and speed of communication are key; search engine surveillance meets both of these criteria and thus provides important up-to-date information while traditional models catch up.
It is important to note that our study looked at 10 keywords, and each had varying strengths of correlation with case numbers. If we had looked at 100 keywords, even stronger correlations may have been found. Search terms will also evolve as a pandemic progresses. Furthermore, Google itself is widely used in the United States, which makes it a good candidate for digital surveillance, but this is not the case for every country. For example, Google is not a major search engine in China.
Shin et al utilized Google and Twitter when conducting their study on MERS and found strong correlations using both sites. One other limitation of Google Trends is the granularity it provides. Although it does provide information on some cities, it does not currently provide a comprehensive town-by-town breakdown of its data. This issue would make it difficult to create appropriate forecast models on a town-by-town basis, and individuals would have to rely on broader state-wide predictions.
This study reveals the benefits of internet surveillance models and the use of Google Trends to monitor new infectious diseases such as COVID-19. For the United States, Google Trends data were highly correlated with cases of COVID-19 on a state-by-state basis and could potentially be used to predict new areas of outbreak and possible high-impact zones as the disease progresses. Furthermore, this study documents that there is information present in Google Trends that precedes outbreaks, and these data should be utilized to allow for better resource allocation in regard to tests, personal protective equipment, medication, and more.