Methods Hub (beta)

Semantic Search Over Social Media Posts

Abstract:

Find relevant social media posts in a collection via semantic search

Type: Method
DOI: 10.80218/semantic-search-over_social-media-posts
License: Apache License 2.0
Programming Language: Python

Description

This method allows users to perform a semantic search across a collection of social media posts (e.g., tweets) and retrieve the most relevant posts for a given query. For example, a social scientist studying public discourse on topics like social media, gender issues, or elections can use this tool to identify posts that share a similar meaning to the input query.

Use Case(s)

This method supports all use cases that require finding tweets (or other social media posts) for a specific topic, entity, or keyword. For example, one use case explores how users express emotions and build social connections on Twitter. By analyzing tweets for emotional sentiment, interaction patterns, and cultural references, researchers can uncover insights into individual well-being, community dynamics, and cultural identity trends.

Input Data

User Query: The easiest way to change the queries is by editing data/input_queries.txt (one query per line).

Example Queries: The current file contains the following keywords:
- Social Norms
- Cultural Identity
- Community Interaction

These keywords will be used to find tweets relevant to these topics in the dataset.
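For illustration, the queries could be loaded from this file with a few lines of Python (a sketch only; the notebook's actual loading code may differ):

```python
def load_queries(path):
    """Read one search query per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# queries = load_queries("data/input_queries.txt")
```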

Input Dataset: The input can be any collection of social media posts (e.g., tweets) in JSON format. For demonstration, the method uses the NLTK sample tweets (corpora/tweets.20150430-223406.json).
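The NLTK sample file stores one JSON object per line. A hedged loading sketch (assuming each object carries `id` and `text` fields, as in Twitter API exports; the notebook's own loader may differ) might look like:

```python
import json

def load_posts(path):
    """Load line-delimited JSON posts (one JSON object per line),
    keeping only the fields needed for semantic search."""
    posts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            posts.append({"id": str(obj["id"]), "text": obj["text"]})
    return posts

# posts = load_posts("corpora/tweets.20150430-223406.json")
```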

Output Data

After running all the cells in semantic-search-over_social-media-posts.ipynb, the results are saved as a JSON file at the following location:
File: data/output.json

Structure of Output:
Each result in the JSON output file includes the following fields:
- Post ID: The unique identifier of the social media post.
- Post Text: The content of the post.
- Similarity Score: A numerical value (ranging from 0 to 1) indicating how closely the post matches the input query.

Sample Output:
Below are the top-K most similar posts to the given query (with top-K set to 5 in this example):

  {
    "social media": [
      {
        "post ID": "13567",
        "post text": "There's something a bit \"dad dancing\" about the way the Tories try to electioneer via social media https://t.co/WH0cmv76VD",
        "sim score": "0.9372139191497816"
      },
      {
        "post ID": "9732",
        "post text": "It's extremely comforting to know that the power of mainstream media has been diluted by social media? #SNP",
        "sim score": "0.9371564729455584"
      },
      {
        "post ID": "18324",
        "post text": "@mmaher70 @RichardJMurphy So why cant they defend the position thats just total incompetence constantly allow Tories to set agenda esp media",
        "sim score": "0.918129503287474"
      }
    ],
    "women": [
      {
        "post ID": "287",
        "post text": "RT @macplus4: And. Miliband stumbled. Much bigger issues to discuss - NHS, mental health, foodbanks, homelessness, usual cuts to women & ch…",
        "sim score": "0.9999048991755727"
      },
      {
        "post ID": "2902",
        "post text": "Pigs sweat, men perspire https://t.co/6ZIU37HYPh",
        "sim score": "0.7674937266310939"
      }
    ],
    "election": [
      {
        "post ID": "19237",
        "post text": "#ELECTION2015 https://t.co/WgCyxkkAkc",
        "sim score": "0.9999999995861624"
      },
      {
        "post ID": "14156",
        "post text": "#NigelFarage #UKIP #Election2015 http://t.co/oyr8o5aJCv",
        "sim score": "0.99999999834465"
      }
    ]
  }
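A result file with this structure can be inspected programmatically. The sketch below (field names taken from the sample above, and assuming hits are already sorted by similarity, highest first) picks the best match per query:

```python
import json

def best_matches(output_path):
    """Return the highest-scoring post ID for each query in output.json."""
    with open(output_path, encoding="utf-8") as f:
        results = json.load(f)
    # Hits are assumed sorted by similarity score, highest first
    return {query: hits[0]["post ID"] for query, hits in results.items()}

# print(best_matches("data/output.json"))
```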

Hardware Requirements

The method runs on a modest virtual machine from any cloud provider (2 x86 CPU cores, 4 GB RAM, 40 GB HDD).

Environment Setup

  • Python v3.8 (preferably through Anaconda)
  • This method requires a collection of social media posts in JSON format. Place the collection in the corpora/ folder and update the file name in the config.json to point to your dataset. For demonstration purposes, the method uses a sample dataset of NLTK tweets located in corpora/tweets.20150430-223406.json.
  • Using Anaconda:
>conda create -n semantic_search python=3.8
>conda activate semantic_search
>conda install -c conda-forge notebook
>pip install -r requirements.txt
  • Using Python:
>python -m venv semantic_search
>semantic_search\Scripts\activate (on Windows, or: source semantic_search/bin/activate on Linux/macOS)
>pip install -r requirements.txt

How to Use

  • Start Jupyter Lab or Notebook:
>jupyter lab
  • Open semantic-search-over_social-media-posts.ipynb and run all cells.

Technical Details

The method reads search queries from data/input_queries.txt (one query per line) and writes the top-K most similar posts to data/output.json. It uses Fasttext embeddings loaded from embeddings/en_embeddings.p to obtain word/token embeddings, which are averaged to compute post/document embeddings.

Users can customize the behavior of the method by specifying their preferences and paths to resources in the config.json file. This assists replicability by allowing the method to be executed under different settings, e.g., with a different post collection, a different value of top-K, or with/without cleaning. Update config.json to adjust parameters like input_query_filepath, top-K, or preprocessing options ("ifpreprocess": true/false). For reproducibility of results across executions, the working environment is preserved in the requirements.txt file, random seed variables are defined, and the details needed to reuse the method are provided in the How to Use section.

To easily run and explore the method in a pre-configured environment, you can use Binder. It allows you to execute the notebook without needing to set up the environment locally. Click the badge below to get started.

The following figure illustrates how the method works: it computes embeddings for the words in the corpus posts and the input query, aggregates them at the document level, computes the cosine similarity between each query embedding and the post embeddings, and finally returns the top-K most similar posts from the corpus.
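The averaging-and-cosine-similarity pipeline described above can be sketched as follows. This is a minimal illustration with a toy embedding dictionary standing in for the pickled Fasttext vectors; the notebook's actual implementation (tokenization, preprocessing, tie-breaking) may differ:

```python
import numpy as np

def post_embedding(tokens, word_vectors):
    """Average the embeddings of all tokens found in the vocabulary."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_posts(query, posts, word_vectors, k=5):
    """Rank (post_id, text) pairs by cosine similarity between the
    averaged query embedding and each averaged post embedding."""
    q_vec = post_embedding(query.lower().split(), word_vectors)
    scored = []
    for post_id, text in posts:
        p_vec = post_embedding(text.lower().split(), word_vectors)
        if q_vec is not None and p_vec is not None:
            scored.append((post_id, text, cosine_similarity(q_vec, p_vec)))
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:k]
```

In the real method, `word_vectors` would be the dictionary unpickled from embeddings/en_embeddings.p.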

[Figure: semantic search workflow]
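For orientation, a config.json might look like the fragment below. Only input_query_filepath and ifpreprocess are documented above; the remaining key names and values are illustrative assumptions, so check the shipped config.json for the actual schema:

```json
{
  "input_query_filepath": "data/input_queries.txt",
  "dataset_filepath": "corpora/tweets.20150430-223406.json",
  "embeddings_filepath": "embeddings/en_embeddings.p",
  "output_filepath": "data/output.json",
  "topK": 5,
  "ifpreprocess": true
}
```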

Contact Details

For questions or feedback, contact Fakhri Momeni via fakhri.momeni@gesis.org.