Methods

How the Monitoring Lab collects data, analyzes impact on health, and generates insights

Tracking all available public media data, our monitoring systems provide real-time awareness of the health narratives trending across the country. This page is updated frequently as methodologies and data sources evolve.

Data sources and collection

Infodemiology.com currently shows insights and data on conversations about four health topics: mental health, opioids, reproductive health, and vaccines. All topics are monitored in English; vaccines are also monitored in Spanish. PGP’s Monitoring Lab produces the dashboards and insights available through Infodemiology.com.

The Monitoring Lab operates continuously, monitoring discourse around the clock, year round. Analysts engage in real-time and longitudinal media monitoring, or social listening, to gauge public knowledge, attitudes, and behaviors related to health issues. Analysts use a range of software tools to gather available public media data, including Meta’s CrowdTangle platform and link checker, Quid, Talkwalker, Zignal Labs, Google Trends, Google Alerts, and others. The Monitoring Lab’s systems collect data in real time across multiple media sources, including millions of websites, social and digital media, video-sharing sites, online forums, and traditional media such as newspapers, magazines, and television.

Data is collected into monitoring platforms from keyword searches created by Monitoring Lab analysts. The data for each health topic covered by Infodemiology.com comes from hundreds of keywords organized into complex Boolean search strings, or queries. Analysts update queries frequently to reflect the natural evolution of public discourse and to filter out irrelevant data. The information automatically collected, aggregated, and presented on health topic dashboards is from available public sources, meaning that it is accessible to anyone through a simple internet search.
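To illustrate how a Boolean search string of this kind behaves, the following is a minimal sketch in Python. It is not the Monitoring Lab's actual query or tooling: the keyword lists, the example posts, and the function name matches_query are all hypothetical, and real monitoring platforms use their own query syntax. The sketch treats a query as several groups of terms joined by AND, with the terms inside each group joined by OR, plus a list of exclusion (NOT) terms.

```python
import re


def matches_query(text, all_of_groups, none_of):
    """Return True if text satisfies (group1) AND (group2) ... AND NOT (exclusions).

    Within each group, terms are combined with OR.
    """
    lowered = text.lower()

    def contains(term):
        # Whole-word or whole-phrase match, so "safe" does not match inside "unsafe".
        return re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered) is not None

    if any(contains(term) for term in none_of):
        return False
    return all(any(contains(term) for term in group) for group in all_of_groups)


# Hypothetical query, for illustration only:
# (vaccine OR vaccination OR "flu shot") AND (safe OR safety OR "side effects")
# NOT (pet OR veterinary)
TOPIC_TERMS = ["vaccine", "vaccination", "flu shot"]
CONCERN_TERMS = ["safe", "safety", "side effects"]
EXCLUSIONS = ["pet", "veterinary"]

posts = [
    "Is the flu shot safe during pregnancy?",
    "Vaccination clinic opens downtown on Monday",
    "Our pet vaccine safety checklist for new dog owners",
]

matched = [p for p in posts if matches_query(p, [TOPIC_TERMS, CONCERN_TERMS], EXCLUSIONS)]
print(matched)
```

In this toy example only the first post is kept: the second mentions the topic but none of the concern terms, and the third is dropped by an exclusion term, which is the kind of filtering of irrelevant data that query updates are meant to achieve.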

Much public media data is not linked to a specific location; when insights do reference a geographic location, that determination is based on a combination of signals. For example, accounts may reference a location in their bio, or a conversation may make reference to a specific location. Analysts review data and adjust geographic filters manually to ensure location data is as accurate as possible.

Given that dashboards do not present comparisons across states or regions, data is not normalized; that is, analysts don’t attempt to standardize data to allow comparisons across areas. Raw numbers shown on dashboards are presented exactly as they are collected.

Process for analysis

Once data is collected, Monitoring Lab analysts identify conversation spikes and conversation themes. A conversation spike is a sudden and notable increase in the volume or intensity of discussion about a particular topic or event across various platforms. Recognizing when spikes occur is crucial for understanding shifts in conversation, pinpointing the events responsible for these surges, and staying up to date on events and trending narratives. Conversation spikes are identified by media monitoring software platforms via trend lines or alerts. When mapped over time, this data indicates whether narratives are emerging, persisting, or declining.
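The text above does not specify how the monitoring platforms decide that a trend line contains a spike, so the following Python sketch shows only one common, generic approach: flag a day when its mention count sits far above the average of the preceding days. The function name find_spikes, the seven-day window, the threshold of three standard deviations, and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the mean of the preceding
    `window` days by more than `threshold` sample standard deviations."""
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale for comparison;
            # this sketch simply skips such days.
            continue
        if (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily mention counts for one topic.
counts = [100, 110, 95, 105, 98, 102, 107, 480, 300, 120]
spike_days = find_spikes(counts)
print(spike_days)
```

With these made-up numbers, only the day with 480 mentions is flagged. The following day is not, because the spike itself has entered the baseline window and inflated its spread, which is one reason a simple rule like this would need analyst review rather than being trusted on its own.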

Conversation themes are persistent, recurring, and overarching ideas or topics in conversation. Grasping the primary themes in conversations is essential for understanding the kinds of content that consistently steer and sustain discussions. To identify and quantify themes, data is organized into themes and categories through human coding and artificial intelligence such as natural language processing.

Monitoring Lab analysts examine thousands of conversations to identify dominant conversation themes and the common words, phrases, and hashtags used within those themes. Lists of unique keywords are created for each theme throughout the coding process. Theme keywords are programmed into software that automatically tags mentions containing one of those keywords with the corresponding theme. Analysts engage in multiple rounds of theme generation, combining separate but related themes into larger themes and adding exclusion terms to reduce extraneous messages. This methodology can be referred to as theoretical coding in grounded theory. Data on the dashboards presents the top themes, and analysts create new themes over time depending on fluctuations in conversation.

Within conversation spikes and across narratives, analysts also use sentinel surveillance to identify and track influential voices driving conversations. Sentinel surveillance is an epidemiological method that gathers information on disease trends from a limited set of reporting sites rather than investigating individual cases. Applied to information networks, the idea is that certain nodes play a larger role in promoting narratives than others. These nodes provide accurate signals that conversation shifts are occurring, reducing the need to examine more of the network.
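The keyword-based theme tagging described above, in which a mention is tagged with a theme when it contains one of that theme's keywords and exclusion terms remove extraneous messages, can be sketched roughly as follows. This is an illustration only: the theme names, keywords, exclusion terms, and the function tag_themes are hypothetical, and the simple substring matching used here is cruder than what a production platform would apply.

```python
# Hypothetical theme definitions: each theme has keywords that trigger the tag
# and exclusion terms that suppress it.
THEMES = {
    "safety": {
        "include": ["side effect", "adverse reaction", "#vaccineinjury"],
        "exclude": ["giveaway"],
    },
    "access": {
        "include": ["appointment", "pharmacy", "cost"],
        "exclude": [],
    },
}


def tag_themes(text, themes=THEMES):
    """Return the names of all themes whose keywords appear in the text,
    skipping any theme for which an exclusion term also appears."""
    lowered = text.lower()
    tags = []
    for name, rules in themes.items():
        if any(term in lowered for term in rules["exclude"]):
            continue
        # Plain substring matching: "cost" would also match "costume".
        if any(term in lowered for term in rules["include"]):
            tags.append(name)
    return tags


print(tag_themes("Worried about side effects after my second dose"))
print(tag_themes("Could not get an appointment at my pharmacy"))
print(tag_themes("Giveaway! Comment 'side effect' to win"))
```

A rule like this has no sense of tone or context, so a sarcastic post containing a theme keyword would still be tagged, which is why manual review of tagged mentions remains part of the process.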

Insights

To create insights, Monitoring Lab analysts use machine learning and natural language processing to surface trending narratives and relevant themes across health topics. PGP’s Monitoring Lab has contracts with several competing media monitoring systems, and the data dashboards available on Infodemiology.com represent only one platform that analysts use. Analysts review all data from these systems and then deliver this data to trained journalists and science writers who follow the Society of Professional Journalists Code of Ethics for fact-checking. The Monitoring Lab takes a “weight of evidence” or “weight of experts” approach, linking to peer-reviewed research, scientific organizations, and reputable fact-checking sources to accurately report on how much agreement exists among scientists on a topic. 

Because not all narratives seen in data have equal potential to impact health, whether or not a narrative is included as an insight on Infodemiology.com is based on two factors: how a narrative might impact health decisions and the spread and velocity of the narrative. Assessing the potential impact of a narrative includes tracking how far a narrative has spread, where it’s circulating, who is driving the spread, and who the narrative is reaching. Analyzing velocity includes determining how quickly the narrative is increasing in shares or views and whether future spread is expected. 
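As a rough illustration of what measuring velocity could look like, the sketch below computes how quickly a narrative's share count grows between observations and whether that growth is speeding up. The source does not describe the actual calculation, so the function names velocity and is_accelerating, the hourly sampling, and the sample numbers are all assumptions.

```python
def velocity(cumulative_shares, hours_between_samples=1.0):
    """New shares per hour between each pair of consecutive samples."""
    return [
        (later - earlier) / hours_between_samples
        for earlier, later in zip(cumulative_shares, cumulative_shares[1:])
    ]


def is_accelerating(rates):
    """True if every sampled rate is higher than the one before it."""
    return len(rates) >= 2 and all(b > a for a, b in zip(rates, rates[1:]))


# Hypothetical cumulative share counts, sampled once per hour.
shares = [10, 30, 90, 250]
rates = velocity(shares)
print(rates)
print(is_accelerating(rates))
```

A narrative whose rate keeps rising between samples, as in this made-up series, is the kind of case where further spread might be expected; a flat or falling rate suggests the opposite.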

Infodemiology.com insights reports may include any or all of this information, summarized for particular audiences. Some insights may describe narratives that are limited in reach or lack the qualities necessary for future spread; these insights are often included because they indicate where the public has information gaps, confusion, or concerns. Other insights summarize narratives that pose a higher risk to health due to the current or predicted velocity, the tactics used to spread the narrative, or who the narrative targets. Narratives with the highest risk of impacting health decisions or behaviors are usually circulating widely across communities and rapidly engaging a large audience. Insights in this category on Infodemiology.com are categorized with a blue flag to emphasize their importance. However, each organization or individual reading the insights or dashboards on Infodemiology.com likely has their own way of assessing risk based on personal or organizational priorities. For partnership opportunities to tailor insights for your organization, head here for more.

Limitations

Insights included on Infodemiology.com are provided for informational purposes only and are not intended as medical advice. PGP’s Monitoring Lab monitors trending narratives across health topics; it is not intended as a tool solely to identify misinformation. While some insights reference specific false claims, myths, and conspiracy theories, others are included in order to indicate gaps in knowledge, concerns, opinions, or speculation and should not be considered false merely because they appear on Infodemiology.com.

Finally, data provided on dashboards is intended to give a general overview of the conversations happening in each topic. Data shown on dashboards is based on a keyword search query, meaning that posts that don’t contain these search terms are not included in the system. As a result, the data may not be representative of the overall conversation about the topic. For theme coding, it is possible that posts are miscoded, especially those using sarcasm. Monitoring Lab analysts regularly check themes to ensure posts are tagged with the correct theme, but it is not possible to review every post in a theme. To address limitations, analysts have extensively tested this system and stay up to date on research regarding supervised machine learning.

Further resources

Further details about the data collection and thematic analysis process have been documented in scientific journals, both by PGP (see Bonnevie, 2020; Bonnevie, 2020) and others in the field (Chew, 2010; Mollema, 2015; Mooney, 2018; Jamison, 2020; Karafillakis, 2021; Teague, 2023). PGP’s media monitoring methodology is regarded as the standard by many organizations and has been reviewed by an independent Institutional Review Board and considered exempt.