Description
This method offers a computationally inexpensive and fully transparent way to identify scientific online discourse from a dataset of social media posts. It was originally developed for analyzing tweets, but can be used for all types of social media posts. It applies word-based heuristics (see Technical Details) to identify three different forms of scientific relevance in posts:
- Category 1: The post contains a claim or question that can be scientifically verified.
- Category 2: The post references scientific findings (via a link).
- Category 3: The post mentions a scientific research context (e.g., mention of a scientist, scientific research efforts, research findings).
The strength of the method lies in its transparency, as it is based exclusively on rules that are understandable to humans.
Use Cases
To quickly and transparently create a social media dataset in which scientific online discourse occurs significantly more frequently than in the original data. To do this, the method is applied to a very large dataset of social media posts and all posts that do not fall into any of the three categories according to the method are filtered out. Example:
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., & Dietze, S. (2022, October). SciTweets-a dataset and annotation framework for detecting scientific online discourse. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (pp. 3988-3992).
Input Data
The method accepts a tab-separated value file with at least one text column as input. The category 2 heuristics also require a ‘urls’ column with all resolved URLs of the main text (i.e., the actual target URLs, not the shortened ones like “https://t.co/xxx”) as a comma-separated list of strings (possibly empty) in square brackets. The input file can contain any other columns (e.g., with the post IDs). All columns in the input file are copied to the output file, with new columns added to the right.
Example input data (data/example_tweets.tsv):
| text | urls |
|---|---|
| In @sciencemagazine: Rare variant found that raises HDL cholesterol and increases risk of coronary heart disease: https://t.co/4xpL3KlGy9 | [‘https://www.science.org/doi/full/10.1126/science.aad3517’] |
| Teach children to treat animals responsibly do not teach captivity! Join us http://t.co/UR15gQPatU #FreeAllCetacea via @FreeAllCetacea | [‘http://www.wdcs.org/’] |
| Violence is a leading cause of death for Americans 10-24 yrs old. @NICHD_NIH research on youth violence prevention https://t.co/9ZD58zGUhI | [‘https://1.usa.gov/1q0MqOd/’] |
| ““At a point now where I understand plant temperaments — how vulnerable they are, but also how strong they can become”” #Precision #AgTech” | [] |
Output Data
This method adds the output of the heuristics as new columns to the right of the input data. The new columns is_catX (X being 1, 2 or 3), contains True in a row if the respective text is identified to be from a scientific discourse as per category X. For full transparency, the other new columns provide the output of all intermediate steps of the heuristics (see Technical Details).
The heuristics are applied one category at a time. If the heuristics are applied in sequence to the example data (See How to Use), the output is the same as in (data/example_tweets_cat1_cat2_cat3.tsv):
| text | is_claim | claim_sentence | has_sciterm | sciterms | is_cat1 | urls | sci_subdomain | has_sci_subdomain | sci_mag_domain | has_sci_mag_domain | sci_news_domain | has_sci_news_domain | is_cat2 | mentions_science_research_in_general | mentions_scientist | mentions_publications | mentions_research_method | is_cat3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| In @sciencemagazine: Rare variant found that raises HDL cholesterol and increases risk of coronary heart disease: https://t.co/4xpL3KlGy9 | True | in @sciencemagazine: rare variant found that raises hdl cholesterol and increases risk of coronary heart disease: https://t.co/4xpl3klgy9 | True | [‘cholesterol’, ‘disease’, ‘heart’, ‘risk’, ‘variant’] | True | [‘https://www.science.org/doi/full/10.1126/science.aad3517’] | www.science.org | True | www.science.org | True | False | True | False | |||||
| Teach children to treat animals responsibly do not teach captivity! Join us http://t.co/UR15gQPatU #FreeAllCetacea via @FreeAllCetacea | True | teach children to treat animals responsibly do not teach captivity! | True | [‘animals’] | True | [‘http://www.wdcs.org/’] | False | False | False | False | False | |||||||
| Violence is a leading cause of death for Americans 10-24 yrs old. @NICHD_NIH research on youth violence prevention https://t.co/9ZD58zGUhI | True | violence is a leading cause of death for americans 10-24 yrs old. | True | [‘prevention’, ‘research’] | True | [‘https://1.usa.gov/1q0MqOd/’] | False | False | False | False | research on | True | ||||||
| ““At a point now where I understand plant temperaments — how vulnerable they are, but also how strong they can become”” #Precision #AgTech” | False | True | [‘plant’] | False | [] | [] | False | [] | False | [] | False | False | False |
Hardware Requirements
This method has low hardware requirements. For example, it runs on a small machine with 1 x86 CPU core, 2 GB RAM, and 2 GB HDD. This method can run out of memory if your dataset is too large to fit into main memory (RAM).
Environment Setup
This method requires at least Python version 3.9. To avoid problems with your system’s Python installation, create and activate a virtual environment.
Then install all requirements using:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
How to Use
Apply the heuristics for each category to the example input data, then display the result:
# Apply category 1 heuristics and write to data/example_tweets_cat1.tsv
python3 src/apply_cat1_claim_question_heuristics.py data/example_tweets.tsv
# Apply category 2 heuristics and write to data/example_tweets_cat1_cat2.tsv
python3 src/apply_cat2_reference_heuristics.py data/example_tweets_cat1.tsv
# Apply category 3 heuristics and write to data/example_tweets_cat1_cat2_cat3.tsv
python3 src/apply_cat3_research_context_heuristics.py data/example_tweets_cat1_cat2.tsv
# Display result
cat data/example_tweets_cat1_cat2_cat3.tsv
Technical Details
This method employs a set of simple human-made rules (“heuristics”) to identify whether a social media post falls into one of three categories related to scientific online discourse. The heuristics for each category are:
Category 1: The post contains a claim or question that can be scientifically verified (source code).
- If the post contains a phrase with a verb from our list of verbs that indicate a scientific inquiry (output column
is_claim). The list was extracted from four research works on claims [1-4]. Moreover, in the sentence (output columnclaim_sentence), the verb has to be preceded by a noun and followed by a noun or by an adjective, and there has to be no personal pronoun, possessive pronoun or person entity in the same sentence (as these would indicate a specific person doing something, not a generic case). - And the post contains at least one term from a list of ~30k scientific terms from Wikipedia Glossaries, from which we hand-picked the categories that we deemed to be related to science (e.g, “medicine,” “history,” “biology”), or a list of ~1.5k scientific methods (see List of Social Science Research Methods), not counting terms that are slang words or that we identified as false scientific terms when looking at the most frequent terms in the dataset. Identified terms are in the output column
sciterms, withhas_scitermbeingTrueif it has at least one.
- If the post contains a phrase with a verb from our list of verbs that indicate a scientific inquiry (output column
Category 2: The post references scientific findings (via a link; source code).
- If the post contains a URL with a domain or subdomain of a scientific repository (17463 URLs, see List of Repositories; output columns
has_sci_subdomain(if any match) andsci_subdomain(first match)). - Or the post contains a URL with a domain or subdomain of a scientific magazine (14 magazines, manually curated; output columns
has_sci_mag_domain(if any match) andsci_mag_domain(first match)). - Or the post contains a URL with a domain or subdomain of a scientific news outlets and
/science(23 URLs, manually curated from major English language newspaper outlets with a dedicated science section; output columnshas_sci_news_domain(if any match) andsci_news_domain(first match)).
- If the post contains a URL with a domain or subdomain of a scientific repository (17463 URLs, see List of Repositories; output columns
Category 3: The post mentions a scientific research context (e.g., mention of a scientist, scientific research efforts, research findings; source code).
- If the post contains a noun or proper noun from a list of research or science phrases (15 terms, manually curated, output column
mentions_science_research_in_generalcontains the first noun found, if any). - Or the post contains a noun or proper noun from a list of scientists (12 terms, manually curated, output column
mentions_scientistcontains the first noun found, if any). - Or the post contains a noun or proper noun from a list of scientific publication types (10 terms, manually curated, output column
mentions_publicationscontains the first noun found, if any). - Or the post contains a term from a list of social science research methods (1476 methods, see List of Social Science Research Methods) with at least two words (output column
mentions_research_methodcontains the first term found, if any).
- If the post contains a noun or proper noun from a list of research or science phrases (15 terms, manually curated, output column
List of Repositories
The list of 17,463 subdomains is collected as follows:
- All full-text links included in the public CrossRef Snapshot from January 2021 were extracted
- Using the CrossRef API, we extracted the full-text links that were registered after January 2021
- All extracted links were annotated with their subdomain using the Python library tldextract
- The frequency for every subdomain was computed
- We excluded subdomains with a frequency lower than 50
- We excluded 38 subdomains that are not scientific (e.g., “youtube.com”, “yahoo.com”)
- We added 46 subdomains from a manually curated list (e.g., “semanticscholar.org”, “www.biorxiv.org”)
References
[1] Pinto, J. M. G., Wawrzinek, J., & Balke, W. T. (2019, June). What Drives Research Efforts? Find Scientific Claims that Count!. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (pp. 217-226). IEEE.
[2] González Pinto, J. M., & Balke, W. T. (2018, September). Scientific claims characterization for claim-based analysis in digital libraries. In International Conference on Theory and Practice of Digital Libraries (pp. 257-269). Springer, Cham.
[3] Kilicoglu, H., Rosemblat, G., Fiszman, M., & Rindflesch, T. C. (2011). Constructing a semantic predication gold standard from the biomedical literature. BMC bioinformatics, 12(1), 1-17.
[4] Smeros, P., Castillo, C., & Aberer, K. (2021, October). SciClops: Detecting and Contextualizing Scientific Claims for Assisting Manual Fact-Checking. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (pp. 1692-1702).
Contact Details
For questions or feedback, contact sebastian.schellhammer@gesis.org