Text Preprocessing Toolkit - Methods Hub

Learning Objective

This tutorial will teach you the essential techniques for text preprocessing using Python and spaCy, with a focus on practical applications in social science research. You will learn how to clean, structure, and transform raw text data—making it ready for analysis, modeling, and interpretation.

Text preprocessing is a critical first step in any Natural Language Processing (NLP) workflow. By mastering these methods, you will be able to: - Remove noise and inconsistencies from textual data - Standardize and normalize language for better analysis - Extract meaningful information for downstream tasks such as sentiment analysis, topic modeling, and entity recognition

Whether you are working with survey responses, interview transcripts, or social media data, these skills will help you unlock deeper insights for your research.

Target Audience

This project is designed for:

Researchers who want to analyze qualitative data from surveys, interviews, or media sources using modern NLP techniques.
Students and educators who are looking for a practical introduction to text pre-processing and its applications in social science research.
Data analysts and practitioners who are interested in cleaning, structuring, and extracting insights from large volumes of textual data.
Anyone new to NLP who wants a step-by-step notebook and clear code examples that make text processing accessible for beginners with basic Python knowledge.

No prior experience with spaCy or advanced machine learning is required. The tutorial guides you through each concept, making it easy to apply these techniques on your own.

Duration

~ 2 hours

Use Cases

Text preprocessing is a crucial step in social science research, enabling scholars to analyze large volumes of qualitative data efficiently and accurately. Here are some practical applications:

Survey and Interview Analysis. Automatically extract key themes, sentiments, and entities from open-ended survey responses or interview transcripts. For example, lemmatization and stopword removal help in identifying the most frequent topics discussed by participants.
Political Discourse Analysis. Tokenization, named entity recognition, and sentiment analysis can be used to study political speeches, debates, or social media posts. Researchers can track how politicians discuss certain issues, measure emotional tone, and identify key actors or organizations.
Media and News Studies. Use sentence segmentation and TF-IDF keyword extraction to compare coverage of events across different news outlets. Named entity recognition helps in mapping relationships between people, places, and organizations mentioned in articles.
Comparative Linguistic Studies. Vocabulary comparison functions allow researchers to analyze language differences between demographic groups, regions, or time periods. This is useful for studying language evolution, cultural trends, or the impact of policy changes.
Public Opinion and Sentiment Tracking. Sentiment analysis provides insights into public attitudes toward policies, social issues, or brands by analyzing social media, forums, or feedback forms.

By applying these techniques, social scientists can transform unstructured text into actionable data and uncover hidden patterns in texts.

Environment Setup

This tutorial is self-contained. All methods defined in it are also available in the text_processing_toolkit.py in the code repository.

Let’s start by installing and importing the necessary libraries for text processing and analysis.

!pip install --quiet spacy==3.8.7 scikit-learn==1.7.2
import collections
import json
import os
import spacy

To process text in some language, we need to load the appropriate spaCy model. The following functions help load and manage language models. We use the English model in this tutorial.

def choose_spacy_model(language):
    """
    Loads a spaCy language model for the specified language. If the model is not found,
    attempts to download it and then load it.
    """
    try:
        return spacy.load(f"{language}_core_web_sm")  # Small model (sm) is often sufficient
    except OSError:
        print(f"Model '{language}_core_web_sm' not found. Downloading...")
        try:
            download_command = f"python -m spacy download {language}_core_web_sm"
            exit_code = os.system(download_command)
        except:
            raise ValueError(f"Language '{language}' is not supported.")
        return spacy.load(f"{language}_core_web_sm")

nlp = choose_spacy_model("en")  # Loading the English model as 'nlp'

Model 'en_core_web_sm' not found. Downloading...

Collecting en-core-web-sm==3.8.0

  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.8 MB ? eta -:--:--
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 118.2 MB/s eta 0:00:00


Installing collected packages: en-core-web-sm

Successfully installed en-core-web-sm-3.8.0

✔ Download and installation successful

You can now load the package via spacy.load('en_core_web_sm')

1. Tokenization

Tokenization is the process of splitting text into individual words or tokens. This is the first step in most NLP pipelines and necessary for most advanced linguistic processing. The spaCy library adds metadata to each “token” that we will exploit in the next steps (the spaCy documentation contains the complete list). For the start, think of tokens to be like the words and punctuation marks of a text.

Useful for: word frequency analyses.

def tokenize(nlp, text):
    """
    Tokenize the input text using spaCy.

    Parameters:
    - nlp: The spaCy instance.
    - text (str): Input text.

    Returns:
    - list: List of tokens.
    """
    document = nlp(text)
    tokens = [token for token in document]
    return tokens

text = "Natural Language Processing enables computers to understand human language with most accuracy. It also allows computers to compute with text data more effectively."
tokens = tokenize(nlp, text)
print("Tokens:", tokens)

Tokens: [Natural, Language, Processing, enables, computers, to, understand, human, language, with, most, accuracy, ., It, also, allows, computers, to, compute, with, text, data, more, effectively, .]

The output displays each word and punctuation mark as a separate token. This allows us to analyze the structure and content of the text at the word level. For example to count how often each toke occurs:

print(collections.Counter(tokens))

Counter({Natural: 1, Language: 1, Processing: 1, enables: 1, computers: 1, to: 1, understand: 1, human: 1, language: 1, with: 1, most: 1, accuracy: 1, .: 1, It: 1, also: 1, allows: 1, computers: 1, to: 1, compute: 1, with: 1, text: 1, data: 1, more: 1, effectively: 1, .: 1})

The metadata stored with each token also allows to reconstruct the original text. In this case we need the text attribute to get the token as simple text string and whitespace_ attribute to tell us whether a space was after the token in the original text (e.g., this is not the case for “accuracy”, which is directly followed by a “.” in the original text).

def get_token_texts(tokens):
    """
    Get the "raw" text of the tokens, removing all metadata.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: Tokens as simple text strings.
    """
    token_texts = [token.text for token in tokens]
    return token_texts

def reconstruct_text_from_tokens(tokens):
    """
    Recreate the text from the tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - str: Text as string.
    """
    text = "".join([token.text + token.whitespace_ for token in tokens])
    return text

print(get_token_texts(tokens))
print(reconstruct_text_from_tokens(tokens))

['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'It', 'also', 'allows', 'computers', 'to', 'compute', 'with', 'text', 'data', 'more', 'effectively', '.']
Natural Language Processing enables computers to understand human language with most accuracy. It also allows computers to compute with text data more effectively.

2. Removing Stopwords and Punctuation

Stopwords are common words (typically function words like “the”, “is”, “and”) that usually do not add significant meaning to text analysis. Removing them helps to focus on meaningful words.

Useful for: keyword extraction, topic modeling.

def remove_stopwords_from_tokens(tokens):
    """
    Remove stopwords from spaCy tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: List of tokens without stopwords.
    """
    tokens_without_stopwords = [token for token in tokens if not token.is_stop]
    return tokens_without_stopwords

def remove_punctuation_from_tokens(tokens):
    """
    Remove punctuation from spaCy tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: List of tokens without punctuation.
    """
    tokens_without_punctuation = [token for token in tokens if not token.is_punct]
    return tokens_without_punctuation

tokens_without_stopwords = remove_stopwords_from_tokens(tokens)
tokens_without_stopwords_and_punctuation = remove_punctuation_from_tokens(tokens_without_stopwords)
print("Tokens:                                  ", tokens)
print("-----")
print("Tokens without stopwords:                ", tokens_without_stopwords)
print("-----")
print("Tokens without stopwords and punctuation:", tokens_without_stopwords_and_punctuation)

Tokens:                                   [Natural, Language, Processing, enables, computers, to, understand, human, language, with, most, accuracy, ., It, also, allows, computers, to, compute, with, text, data, more, effectively, .]
-----
Tokens without stopwords:                 [Natural, Language, Processing, enables, computers, understand, human, language, accuracy, ., allows, computers, compute, text, data, effectively, .]
-----
Tokens without stopwords and punctuation: [Natural, Language, Processing, enables, computers, understand, human, language, accuracy, allows, computers, compute, text, data, effectively]

Inference:
The result contains only the meaningful words, with common stopwords removed. This helps focus analysis on the most relevant terms in the text.

3. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). For example, “running” becomes “run”.

Useful for: reducing vocabulary size, improving matching in analysis.

def lemmatize_tokens(tokens):
    """
    Get the lemmas of spaCy tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: List of token lemmas.
    """
    lemmatized_tokens = [token.lemma_ for token in tokens]
    return lemmatized_tokens

lemmatized_tokens = lemmatize_tokens(tokens)
print("Tokens:           ", [token.text for token in tokens])
print("-----")
print("Lemmatized tokens:", lemmatized_tokens)

Tokens:            ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'It', 'also', 'allows', 'computers', 'to', 'compute', 'with', 'text', 'data', 'more', 'effectively', '.']
-----
Lemmatized tokens: ['Natural', 'Language', 'processing', 'enable', 'computer', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'it', 'also', 'allow', 'computer', 'to', 'compute', 'with', 'text', 'datum', 'more', 'effectively', '.']

Inference:
Each word is reduced to its base form (lemma), which standardizes variations and improves the accuracy of further text analysis. Another often used standardization is to only use the lowercase texts (token.text.lower() or token.lemma_.lower()).

4. Sentence Segmentation

Sentence segmentation splits text into individual sentences. This is useful for analyzing sentence structure and, for example, assessing the readability of a text (long sentences tend to be harder to read).

Useful for: readability analysis, sentiment per sentence.

def split_into_sentences(nlp, text):
    """
    Split the input text in its sentences using spaCy.

    Parameters:
    - nlp: The spaCy instance.
    - text (str): Input text.

    Returns:
    - list: List of sentences.
    """
    doc = nlp(text)
    assert doc.has_annotation("SENT_START")
    sentences = [sentence.text for sentence in doc.sents]
    return sentences
    
sentences = split_into_sentences(nlp, text)
print("Sentences:", sentences)
print("-----")
print("Sentence lengths (in characters):", [len(sentence) for sentence in sentences])

Sentences: ['Natural Language Processing enables computers to understand human language with most accuracy.', 'It also allows computers to compute with text data more effectively.']
-----
Sentence lengths (in characters): [94, 68]

Inference:
The output lists each sentence found in the text. This segmentation allows us to analyze text structure, readability, and perform sentence-level operations such as sentiment analysis or topic detection.

5. Named Entity Recognition (NER)

Named Entity Recognition identifies and classifies key entities in text, such as people, organizations, and locations.

Useful for: extracting actors, places, and organizations from documents.

def extract_named_entities(nlp, text):
    """
    Extract named entities (people, organizations, locations, etc.) from text.
    
    This function identifies and extracts important entities mentioned in your text,
    such as person names, company names, geographical locations, dates, and monetary values.
    This is particularly useful for analyzing political speeches, news articles, or 
    interview transcripts where you want to identify key actors and locations.

    Parameters:
    - text (str): The input text to analyze (e.g., "Barack Obama visited Paris in 2015")

    Returns:
    - list: List of dictionaries, each containing:
            - 'text': the entity text (e.g., "Barack Obama")
            - 'label': the entity type (e.g., "PERSON", "GPE" for geopolitical entity)
            - 'description': human-readable description of the entity type
    
    Example:
    Input: "Apple Inc. was founded by Steve Jobs in California."
    Output: [{'text': 'Apple Inc.', 'label': 'ORG', 'description': 'Organization'},
                {'text': 'Steve Jobs', 'label': 'PERSON', 'description': 'Person'},
                {'text': 'California', 'label': 'GPE', 'description': 'Geopolitical entity'}]
    """
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            'text': ent.text,
            'label': ent.label_,
            'description': spacy.explain(ent.label_)
        })
    return entities
    
text_NER = 'The film was shot in Los Angeles and many other locations, for example Berlin.'
entities = extract_named_entities(nlp, text_NER)
print("Named Entities:", entities)

Named Entities: [{'text': 'Los Angeles', 'label': 'GPE', 'description': 'Countries, cities, states'}, {'text': 'Berlin', 'label': 'GPE', 'description': 'Countries, cities, states'}]

Inference:
The output lists the named entities found in the text, such as people, organizations, and locations. This is useful for extracting key actors and places from documents.

For extended use of named entity recognition tools, e.g., to link the detected entitites to knowledge bases, see the Entity Fishing tutorial on the Methods Hub.

6. Keyword Extraction with TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) identifies important words and phrases in a collection of documents.

Useful for: finding distinctive themes, comparing language use across groups.

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords_tfidf(texts, max_features=20, ngram_range=(1, 2)):
    """
    Extract the most important keywords from a collection of texts using TF-IDF analysis.
    
    TF-IDF (Term Frequency-Inverse Document Frequency) helps identify words that are
    important in individual documents but not too common across all documents.
    This is excellent for finding distinctive themes in survey responses, interview
    transcripts, or comparing different groups' language use.

    Parameters:
    - texts (list): List of text documents to analyze (e.g., survey responses)
    - max_features (int): Maximum number of top keywords to return (default: 20)
    - ngram_range (tuple): Range of n-grams to consider. (1,1) for single words,
                           (1,2) for single words and two-word phrases (default: (1,2))

    Returns:
    - list: For each input document a list of tuples containing
            (keyword, importance_score), sorted by importance (highest first)
    
    Example:
    For analyzing political speeches, this might return:
    [('economic policy', 0.45), ('healthcare reform', 0.38), ('job creation', 0.32), ...]
    
    Note: You need at least 2 documents for meaningful TF-IDF analysis.
    """
    if len(texts) < 2:
        raise ValueError("TF-IDF analysis requires at least 2 documents for comparison.")
    
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    tfidf_matrix = vectorizer.fit_transform(texts)
    
    feature_names = vectorizer.get_feature_names_out()
    keywords_per_text = []
    for tfidf_row in tfidf_matrix:
        scores = [float(number) for number in tfidf_row.toarray()[0]]
        keywords_scores = list(zip(feature_names, scores))
        keywords_scores.sort(key=lambda x: x[1], reverse=True)
        keywords_per_text.append(keywords_scores)
    
    return keywords_per_text

texts = [
    "Natural Language Processing enables computers to understand human languages and process text data efficiently.",
    "Text analytics and machine learning are important for extracting insights from large volumes of textual data.",
    "Deep learning models help in analyzing and comprehending complex language patterns."
]

# Convert text to lowercase and remove punctuation and stopwords
processed_texts_tokens = [
    remove_stopwords_from_tokens(
        remove_punctuation_from_tokens(
            tokenize(nlp, text.lower())
        )
    )
    for text in texts
]
# Then combine the tokens again (using " ".join) as the method required untokenized text
processed_texts = [
    " ".join(get_token_texts(tokens)) for tokens in processed_texts_tokens
]

keywords_per_text = extract_keywords_tfidf(processed_texts)
print("Texts:          ", texts)
print("-----")
print("Processed texts:", processed_texts)
print("-----")
print("Keywords with TF-IDF-measured importance:")
for i in range(len(keywords_per_text)):
    print("- Text " + str(i + 1) + ":", keywords_per_text[i])
    print()

Texts:           ['Natural Language Processing enables computers to understand human languages and process text data efficiently.', 'Text analytics and machine learning are important for extracting insights from large volumes of textual data.', 'Deep learning models help in analyzing and comprehending complex language patterns.']
-----
Processed texts: ['natural language processing enables computers understand human languages process text data efficiently', 'text analytics machine learning important extracting insights large volumes textual data', 'deep learning models help analyzing comprehending complex language patterns']
-----
Keywords with TF-IDF-measured importance:
- Text 1: [('computers', 0.38532288602703124), ('computers understand', 0.38532288602703124), ('data efficiently', 0.38532288602703124), ('enables', 0.38532288602703124), ('enables computers', 0.38532288602703124), ('data', 0.2930479866955796), ('language', 0.2930479866955796), ('text', 0.2930479866955796), ('analytics', 0.0), ('analytics machine', 0.0), ('analyzing', 0.0), ('complex', 0.0), ('complex language', 0.0), ('comprehending', 0.0), ('comprehending complex', 0.0), ('deep', 0.0), ('deep learning', 0.0), ('extracting', 0.0), ('extracting insights', 0.0), ('learning', 0.0)]

- Text 2: [('analytics', 0.4175666238781924), ('analytics machine', 0.4175666238781924), ('extracting', 0.4175666238781924), ('extracting insights', 0.4175666238781924), ('data', 0.3175701804283441), ('learning', 0.3175701804283441), ('text', 0.3175701804283441), ('analyzing', 0.0), ('complex', 0.0), ('complex language', 0.0), ('comprehending', 0.0), ('comprehending complex', 0.0), ('computers', 0.0), ('computers understand', 0.0), ('data efficiently', 0.0), ('deep', 0.0), ('deep learning', 0.0), ('enables', 0.0), ('enables computers', 0.0), ('language', 0.0)]

- Text 3: [('analyzing', 0.3501387057719138), ('complex', 0.3501387057719138), ('complex language', 0.3501387057719138), ('comprehending', 0.3501387057719138), ('comprehending complex', 0.3501387057719138), ('deep', 0.3501387057719138), ('deep learning', 0.3501387057719138), ('language', 0.2662895107233706), ('learning', 0.2662895107233706), ('analytics', 0.0), ('analytics machine', 0.0), ('computers', 0.0), ('computers understand', 0.0), ('data', 0.0), ('data efficiently', 0.0), ('enables', 0.0), ('enables computers', 0.0), ('extracting', 0.0), ('extracting insights', 0.0), ('text', 0.0)]

Inference:
The result shows the most important keywords and phrases identified by TF-IDF across the provided documents. Typically, the documents are not just sentences but at least paragraphs and up to whole articles (originally: web pages), whereas the collection contains typically hundreds of documents. These keywords represent distinctive themes and help summarize the main topics present in the text collection.

For more details and different variants, see the Contrastive Keyword Extractor method on the Methods Hub.

Keyword Extraction with TF-IDF: Lemmatized vs. Non-Lemmatized Text

Keyword extraction using TF-IDF can yield different results depending on whether the input text is lemmatized. Lemmatization reduces words to their base forms, which helps group similar words and may improve the relevance of extracted keywords. Here, we compare the keywords extracted from raw text and lemmatized text.

# Lemmatize the tokens
processed_texts_lemmas = [ [ token.lemma_ for token in tokens ] for tokens in processed_texts_tokens ]

# Then combine the lemmas again (using " ".join) as the method required untokenized text
processed_lemmatized_texts = [
    " ".join(lemmas) for lemmas in processed_texts_lemmas
]

keywords_per_lemmatized_text = extract_keywords_tfidf(processed_lemmatized_texts)

for i in range(len(keywords_per_text)):
    print("Text " + str(i + 1) + ":")
    print("- Not lemmatized")
    print("  Text:    ", processed_texts[i])
    print("  Keywords:", keywords_per_text[i])
    print("- Lemmatized")
    print("  Text:    ", processed_lemmatized_texts[i])
    print("  Keywords:", keywords_per_lemmatized_text[i])
    print()

Text 1:
- Not lemmatized
  Text:     natural language processing enables computers understand human languages process text data efficiently
  Keywords: [('computers', 0.38532288602703124), ('computers understand', 0.38532288602703124), ('data efficiently', 0.38532288602703124), ('enables', 0.38532288602703124), ('enables computers', 0.38532288602703124), ('data', 0.2930479866955796), ('language', 0.2930479866955796), ('text', 0.2930479866955796), ('analytics', 0.0), ('analytics machine', 0.0), ('analyzing', 0.0), ('complex', 0.0), ('complex language', 0.0), ('comprehending', 0.0), ('comprehending complex', 0.0), ('deep', 0.0), ('deep learning', 0.0), ('extracting', 0.0), ('extracting insights', 0.0), ('learning', 0.0)]
- Lemmatized
  Text:     natural language processing enable computer understand human language process text datum efficiently
  Keywords: [('language', 0.5226272583901186), ('computer', 0.3435960195291681), ('computer understand', 0.3435960195291681), ('datum efficiently', 0.3435960195291681), ('enable', 0.3435960195291681), ('enable computer', 0.3435960195291681), ('datum', 0.2613136291950593), ('text', 0.2613136291950593), ('analytic', 0.0), ('analytic machine', 0.0), ('analyze', 0.0), ('complex', 0.0), ('complex language', 0.0), ('comprehend', 0.0), ('comprehend complex', 0.0), ('deep', 0.0), ('deep learning', 0.0), ('extract', 0.0), ('extract insight', 0.0), ('learning', 0.0)]

Text 2:
- Not lemmatized
  Text:     text analytics machine learning important extracting insights large volumes textual data
  Keywords: [('analytics', 0.4175666238781924), ('analytics machine', 0.4175666238781924), ('extracting', 0.4175666238781924), ('extracting insights', 0.4175666238781924), ('data', 0.3175701804283441), ('learning', 0.3175701804283441), ('text', 0.3175701804283441), ('analyzing', 0.0), ('complex', 0.0), ('complex language', 0.0), ('comprehending', 0.0), ('comprehending complex', 0.0), ('computers', 0.0), ('computers understand', 0.0), ('data efficiently', 0.0), ('deep', 0.0), ('deep learning', 0.0), ('enables', 0.0), ('enables computers', 0.0), ('language', 0.0)]
- Lemmatized
  Text:     text analytic machine learning important extract insight large volume textual datum
  Keywords: [('analytic', 0.4175666238781924), ('analytic machine', 0.4175666238781924), ('extract', 0.4175666238781924), ('extract insight', 0.4175666238781924), ('datum', 0.3175701804283441), ('learning', 0.3175701804283441), ('text', 0.3175701804283441), ('analyze', 0.0), ('complex', 0.0), ('complex language', 0.0), ('comprehend', 0.0), ('comprehend complex', 0.0), ('computer', 0.0), ('computer understand', 0.0), ('datum efficiently', 0.0), ('deep', 0.0), ('deep learning', 0.0), ('enable', 0.0), ('enable computer', 0.0), ('language', 0.0)]

Text 3:
- Not lemmatized
  Text:     deep learning models help analyzing comprehending complex language patterns
  Keywords: [('analyzing', 0.3501387057719138), ('complex', 0.3501387057719138), ('complex language', 0.3501387057719138), ('comprehending', 0.3501387057719138), ('comprehending complex', 0.3501387057719138), ('deep', 0.3501387057719138), ('deep learning', 0.3501387057719138), ('language', 0.2662895107233706), ('learning', 0.2662895107233706), ('analytics', 0.0), ('analytics machine', 0.0), ('computers', 0.0), ('computers understand', 0.0), ('data', 0.0), ('data efficiently', 0.0), ('enables', 0.0), ('enables computers', 0.0), ('extracting', 0.0), ('extracting insights', 0.0), ('text', 0.0)]
- Lemmatized
  Text:     deep learning model help analyze comprehend complex language pattern
  Keywords: [('analyze', 0.3501387057719138), ('complex', 0.3501387057719138), ('complex language', 0.3501387057719138), ('comprehend', 0.3501387057719138), ('comprehend complex', 0.3501387057719138), ('deep', 0.3501387057719138), ('deep learning', 0.3501387057719138), ('language', 0.2662895107233706), ('learning', 0.2662895107233706), ('analytic', 0.0), ('analytic machine', 0.0), ('computer', 0.0), ('computer understand', 0.0), ('datum', 0.0), ('datum efficiently', 0.0), ('enable', 0.0), ('enable computer', 0.0), ('extract', 0.0), ('extract insight', 0.0), ('text', 0.0)]

Inference:
The keywords extracted from lemmatized text are more standardized and may group similar concepts (e.g., for the first text, “language” and “languages” both become “language”). This reduces redundancy and highlights the most relevant terms. In contrast, keywords from raw text may include multiple forms of the same word, leading to less focused results. Lemmatization generally improves the quality and interpretability of keyword extraction for downstream analysis.

7. Basic Sentiment Analysis

Sentiment analysis determines whether text expresses positive, negative, or neutral emotions. We here use a simple and very fast approach using custom word lists, as this allows you to look inside the method. In general you should rather use an existing implementation, like the ones discussed in the Sentiment Analysis Tutorial on the Methods Hub.

Useful for: analyzing public opinion, customer feedback, or political discourse.

def analyze_sentiment_basic(tokens):
    """
    Perform basic sentiment analysis to determine if text expresses positive or negative emotions.
    
    This function analyzes the emotional tone of text by looking for positive and negative
    words. While not as sophisticated as machine learning approaches, it provides a quick
    way to gauge overall sentiment in survey responses, social media posts, or interviews.
    
    Useful for: analyzing public opinion, customer feedback, political discourse, or
    any text where emotional tone matters for your research.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - dict: Dictionary containing:
            - 'sentiment': overall sentiment ('positive', 'negative', or 'neutral')
            - 'positive_words': list of positive words found
            - 'negative_words': list of negative words found
            - 'score': numerical score (positive = above 0, negative = below 0)
    
    Example:
    Input: tokenize(nlp, "I love this new policy but I hate the implementation.")
    Output: {'sentiment': 'neutral', 'positive_words': ['love'], 
                'negative_words': ['hate'], 'score': 0}
    """
    # Basic positive and negative word lists (you might want to expand these)
    positive_words = {'good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic', 
                        'love', 'like', 'enjoy', 'happy', 'pleased', 'satisfied', 'positive',
                        'benefit', 'advantage', 'success', 'improve', 'better', 'best'}
    
    negative_words = {'bad', 'terrible', 'awful', 'horrible', 'hate', 'dislike', 'angry',
                        'sad', 'disappointed', 'frustrated', 'negative', 'problem', 'issue',
                        'difficult', 'hard', 'impossible', 'fail', 'failure', 'worse', 'worst'}
    
    found_positive = []
    found_negative = []
    
    for token in tokens:
        if token.lemma_.lower() in positive_words:
            found_positive.append(token.text)
        elif token.lemma_.lower() in negative_words:
            found_negative.append(token.text)
    
    score = len(found_positive) - len(found_negative)
    
    if score > 0:
        sentiment = 'positive'
    elif score < 0:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'
    
    return {
        'sentiment': sentiment,
        'positive_words': found_positive,
        'negative_words': found_negative,
        'score': score
    }

texts_sentiment = [
    "I love Natural Language Processing. It is sooo wonderful!",
    "I love Natural Language Processing, but sometimes it is terrible.",
    "I hate Natural Language Processing."
]
for text in texts_sentiment:
    tokens = tokenize(nlp, text)
    sentiment_result = analyze_sentiment_basic(tokens)
    print("Text:     ", text)
    print("Sentiment:", sentiment_result["sentiment"])
    print("Analysis: ", sentiment_result)
    print()

Text:      I love Natural Language Processing. It is sooo wonderful!
Sentiment: positive
Analysis:  {'sentiment': 'positive', 'positive_words': ['love', 'wonderful'], 'negative_words': [], 'score': 2}

Text:      I love Natural Language Processing, but sometimes it is terrible.
Sentiment: neutral
Analysis:  {'sentiment': 'neutral', 'positive_words': ['love'], 'negative_words': ['terrible'], 'score': 0}

Text:      I hate Natural Language Processing.
Sentiment: negative
Analysis:  {'sentiment': 'negative', 'positive_words': [], 'negative_words': ['hate'], 'score': -1}

Inference:
The sentiment score and lists of positive/negative words indicate the overall emotional tone of the text, which can be used to gauge public opinion or feedback. While state-of-the-art models use AI to also cope with negations–“it is not wonderful!” is rather negative, despite “wonderful”–, dictionary methods like the one above still perform surprisingly well and are much more efficient, which is important for big data processing.

For more details, see the Sentiment Analysis Tutorial on the Methods Hub.

8. Text Statistics

Text statistics provide quantitative measures of text complexity and structure, such as word count, sentence count, and lexical diversity.

Useful for: comparing documents, analyzing readability, and studying vocabulary richness.

def get_text_statistics(nlp, text):
    """
    Calculate comprehensive statistics about a text document.
    
    This function provides detailed quantitative measures of text complexity and structure.
    These statistics are valuable for comparing different types of documents, analyzing
    readability, or understanding the linguistic characteristics of different speakers
    or writers in your research.

    Parameters:
    - nlp: The spaCy instance.
    - text (str): The text to analyze.

    Returns:
    - dict: Dictionary containing detailed statistics:
            - 'word_count': total number of words
            - 'sentence_count': total number of sentences
            - 'character_count': total characters (including spaces)
            - 'avg_words_per_sentence': average sentence length
            - 'avg_characters_per_word': average word length
            - 'unique_words': number of unique words (vocabulary richness)
            - 'lexical_diversity': ratio of unique words to total words (0-1 scale)
            - 'pos_distribution': distribution of parts of speech (nouns, verbs, etc.)
    
    Example use cases:
    - Compare complexity of political speeches across different candidates
    - Analyze linguistic development in student essays
    - Study vocabulary richness in interview responses across different demographics
    """
    doc = nlp(text)

    # Basic counts
    words = [token for token in doc if not token.is_space and not token.is_punct]
    sentences = list(doc.sents)
    
    word_count = len(words)
    sentence_count = len(sentences)
    character_count = len(text)
    
    # Calculate averages
    avg_words_per_sentence = word_count / sentence_count if sentence_count > 0 else 0
    word_lengths = [len(token.text) for token in words]
    avg_characters_per_word = sum(word_lengths) / len(word_lengths) if word_lengths else 0
    
    # Vocabulary analysis
    word_texts = [token.text.lower() for token in words if token.is_alpha]
    unique_words = len(set(word_texts))
    lexical_diversity = unique_words / len(word_texts) if word_texts else 0
    
    # Parts of speech distribution
    pos_counts = {}
    for token in words:
        pos = token.pos_
        pos_counts[pos] = pos_counts.get(pos, 0) + 1
    
    # Convert to percentages
    pos_distribution = {pos: (count/word_count)*100 for pos, count in pos_counts.items()}
    
    return {
        'word_count': word_count,
        'sentence_count': sentence_count,
        'character_count': character_count,
        'avg_words_per_sentence': round(avg_words_per_sentence, 2),
        'avg_characters_per_word': round(avg_characters_per_word, 2),
        'unique_words': unique_words,
        'lexical_diversity': round(lexical_diversity, 3),
        'pos_distribution': pos_distribution
    }

stats = get_text_statistics(nlp, text)
print("Text:", text)
print("Text Statistics:", stats)

Text: I hate Natural Language Processing.
Text Statistics: {'word_count': 5, 'sentence_count': 1, 'character_count': 35, 'avg_words_per_sentence': 5.0, 'avg_characters_per_word': 6.0, 'unique_words': 5, 'lexical_diversity': 1.0, 'pos_distribution': {'PRON': 20.0, 'VERB': 20.0, 'PROPN': 40.0, 'NOUN': 20.0}}

Inference:
The statistics provide a quantitative overview of the text, including word and sentence counts, average lengths, vocabulary richness, and part-of-speech distribution. These metrics are useful for comparing documents, assessing complexity, and understanding linguistic characteristics.

9. Comparing Vocabulary Between Texts

Vocabulary comparison helps identify similarities and differences in word usage between two texts.

Useful for: comparing speeches, analyzing language differences between groups, or studying terminology evolution.

def compare_texts_vocabulary(nlp, text1, text2, top_n=10):
    """
    Compare the vocabulary usage between two texts to identify similarities and differences.
    
    This function is particularly useful for comparative analysis in social science research,
    such as comparing political speeches from different parties, analyzing language differences
    between demographic groups, or studying how language use changes over time.

    Parameters:
    - nlp: The spaCy instance.
    - text1 (str): First text for comparison
    - text2 (str): Second text for comparison  
    - top_n (int): Number of top unique words to return for each text (default: 10)

    Returns:
    - dict: Dictionary containing:
            - 'common_words': words that appear in both texts with their frequencies
            - 'unique_to_text1': words that appear only in the first text
            - 'unique_to_text2': words that appear only in the second text
            - 'similarity_score': percentage of vocabulary overlap (0-100)
    
    Example use cases:
    - Compare campaign speeches from different political candidates
    - Analyze language differences between age groups in survey responses
    - Study evolution of terminology in policy documents over time
    """
    # Process both texts
    doc1 = nlp(text1.lower())
    doc2 = nlp(text2.lower())
    
    # Extract meaningful words (no stopwords, punctuation, or short words)
    words1 = [token.lemma_ for token in doc1 if not token.is_stop and not token.is_punct 
                and token.is_alpha and len(token.text) > 2]
    words2 = [token.lemma_ for token in doc2 if not token.is_stop and not token.is_punct 
                and token.is_alpha and len(token.text) > 2]
    
    # Count word frequencies
    freq1 = collections.Counter(words1)
    freq2 = collections.Counter(words2)
    
    # Find common and unique words
    set1 = set(freq1.keys())
    set2 = set(freq2.keys())
    
    common_words = {}
    for word in set1.intersection(set2):
        common_words[word] = {'text1_frequency': freq1[word], 'text2_frequency': freq2[word]}
    
    unique_to_text1 = {word: freq1[word] for word in set1 - set2}
    unique_to_text2 = {word: freq2[word] for word in set2 - set1}
    
    # Sort by frequency and get top N
    unique_to_text1 = dict(sorted(unique_to_text1.items(), key=lambda x: x[1], reverse=True)[:top_n])
    unique_to_text2 = dict(sorted(unique_to_text2.items(), key=lambda x: x[1], reverse=True)[:top_n])
    
    # Calculate similarity score
    total_unique_words = len(set1.union(set2))
    similarity_score = (len(common_words) / total_unique_words * 100) if total_unique_words > 0 else 0
    
    return {
        'common_words': common_words,
        'unique_to_text1': unique_to_text1,
        'unique_to_text2': unique_to_text2,
        'similarity_score': round(similarity_score, 2)
    }

# Two dissimilar texts for vocabulary comparison
text1 = "Natural Language Processing enables computers to understand human language."
text2 = "Machine learning and deep learning are important for artificial intelligence."

comparison = compare_texts_vocabulary(nlp, text1, text2)
print("Vocabulary Comparison (dissimilar):")
print(json.dumps(comparison, indent=4))
print()

# Two similar texts for vocabulary comparison
text1 = "Natural Language Processing helps computers understand human language and analyze text data efficiently."
text2 = "Text analytics and Natural Language Processing enable machines to process and comprehend human language quickly."

# Compare vocabulary usage between the two texts
comparison = compare_texts_vocabulary(nlp, text1, text2)
print("Vocabulary Comparison (similar):")
print(json.dumps(comparison, indent=4))

Vocabulary Comparison (dissimilar):
{
    "common_words": {},
    "unique_to_text1": {
        "language": 2,
        "computer": 1,
        "processing": 1,
        "natural": 1,
        "enable": 1,
        "human": 1,
        "understand": 1
    },
    "unique_to_text2": {
        "learning": 2,
        "intelligence": 1,
        "important": 1,
        "machine": 1,
        "deep": 1,
        "artificial": 1
    },
    "similarity_score": 0.0
}

Vocabulary Comparison (similar):
{
    "common_words": {
        "text": {
            "text1_frequency": 1,
            "text2_frequency": 1
        },
        "processing": {
            "text1_frequency": 1,
            "text2_frequency": 1
        },
        "language": {
            "text1_frequency": 2,
            "text2_frequency": 2
        },
        "natural": {
            "text1_frequency": 1,
            "text2_frequency": 1
        },
        "human": {
            "text1_frequency": 1,
            "text2_frequency": 1
        }
    },
    "unique_to_text1": {
        "computer": 1,
        "efficiently": 1,
        "analyze": 1,
        "help": 1,
        "datum": 1,
        "understand": 1
    },
    "unique_to_text2": {
        "quickly": 1,
        "comprehend": 1,
        "enable": 1,
        "process": 1,
        "machine": 1,
        "analytic": 1
    },
    "similarity_score": 29.41
}

Inference:
The comparison highlights common and unique words between two texts, helping us understand similarities and differences in language use.

Conclusion

In this tutorial, we explored a comprehensive set of text pre-processing techniques using Python, spaCy, and scikit-learn. Starting from basic tokenization and stopword removal, we progressed through lemmatization, sentence segmentation, named entity recognition, and keyword extraction with TF-IDF. We also covered sentiment analysis, text statistics, and vocabulary comparison between texts.

These foundational steps are essential for preparing and analyzing textual data in any Natural Language Processing (NLP) project. By mastering these techniques, you can unlock deeper insights from your data, improve the performance of downstream models, and make your analyses more robust and interpretable.

Feel free to experiment further with your own texts and datasets. Text pre-processing is a powerful tool—use it to make your NLP workflows more effective and