Learning Objectives
Named entity recognition (NER) is the task of locating and classifying named entities in text. This tutorial aims to provide a comprehensive understanding of named entity recognition using the Entity-Fishing tool, including:
- Understand the basics of NER (named entity recognition) and its applications
- Use the Entity-Fishing tool to extract entities from text
- Analyze and interpret the results
This tutorial targets people who want to use various state-of-the-art entity linking and disambiguation tools, offering all information in one place.
- The tutorial assumes basic knowledge of Python programming and natural language processing, in particular the concepts of named entities and knowledge graphs
- It focuses on off-the-shelf named entity recognition and disambiguation tools; training custom models is not part of this tutorial
Duration
2 hours
Use Cases
- Research on socioeconomic disparities using a dataset of news articles, research papers, and community forum posts requires extracting people’s names, locations, social organizations, government agencies, and so on. The knowledge gained in this tutorial can be applied to recognize and disambiguate named entities in such texts. For example, Washington may refer to the first President of the United States, the capital of the United States, or the state.
- Research on the impact of celebrity endorsements on social attitudes involves extracting celebrity name mentions and recognizing the people and entities associated with them. For example, the name Ryan Reynolds refers to a Canadian actor as well as a New Zealand cricketer. The NER tools presented in this tutorial can be applied to large-scale social media data to link posts to the correct person.
Environment Setup
This tutorial requires at least Python 3.7. All required packages are installed in the sections discussing the respective tools.
Overview
In the vast landscape of natural language processing (NLP), Named Entity Recognition (NER) and Named Entity Disambiguation play pivotal roles in understanding and extracting valuable information from text.
Named entities are specific, named elements in text, such as names of people, organizations, locations, dates, and more. Named Entity Recognition is the process of automatically identifying and classifying these named entities within a given text. It forms the foundation for a wide range of applications, from information retrieval and question answering to sentiment analysis and knowledge graph construction.
However, the journey doesn’t stop at just recognizing named entities. In real-world scenarios, the same name can often refer to multiple entities depending on the context. This brings us to the challenge of Named Entity Disambiguation, which is the process of determining the correct entity a name refers to, particularly in cases of ambiguity. For instance, does “Michael Jordan” refer to the actor or the sportsperson? This is where disambiguation comes into play, making NER not only about identification but also about understanding context.
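To make the disambiguation idea concrete, here is a deliberately simplified sketch: for a given mention, it picks the candidate entity whose description shares the most words with the surrounding context. The candidate list and descriptions below are invented for illustration only; real tools draw on a full knowledge base and far richer features.

```python
# Hypothetical toy disambiguation: pick the candidate entity whose
# description shares the most words with the mention's context.
# All candidate data below is illustrative, not from a real knowledge base.
CANDIDATES = {
    "Washington": [
        ("George Washington", "first president of the united states"),
        ("Washington, D.C.", "capital city of the united states"),
        ("Washington (state)", "state in the pacific northwest of the united states"),
    ]
}

def disambiguate(mention, context):
    """Return the candidate whose description overlaps most with the context."""
    context_words = set(context.lower().split())
    best, best_score = None, -1
    for name, description in CANDIDATES.get(mention, []):
        # Score = number of shared words between context and description
        score = len(context_words & set(description.lower().split()))
        if score > best_score:
            best, best_score = name, score
    return best

print(disambiguate("Washington", "The president lived in Washington before the war."))
```

This word-overlap heuristic is far too crude for production use, but it captures the core intuition: context decides which knowledge-base entry a surface form points to.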
Sample Data
For this tutorial we use a small list of three sentences as texts to illustrate named entity recognition and disambiguation. In real applications, texts are typically much longer and datasets much larger.
texts = [
'Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.',
'Tesla CEO Elon Musk and Jeffrey Epstein associate Ghislaine Maxwell were once photographed together.',
'Amazon CEO Andy Jassy spoke alongside climate activist Greta Thunberg at the sustainability summit in Berlin.'
]
1. Entity Fishing Tool
The first tool we discuss is entity-fishing (also known as NERD), which performs general entity recognition and disambiguation against the Wikidata knowledge base. The tool currently supports 15 languages: English, French, German, Spanish, Italian, Arabic, Japanese, Chinese (Mandarin), Russian, Portuguese, Farsi, Ukrainian, Swedish, Bengali, and Hindi.
For English and French, grobid-ner is used for named entity recognition. GROBID (GeneRation Of BIbliographic Data) is an open-source machine learning library for extracting and structuring bibliographic metadata from scholarly documents. While GROBID’s primary focus is bibliographic data extraction, it also includes a named entity recognition component, grobid-ner, that extracts entities such as person names, dates, and locations. It is particularly useful for processing academic literature and extracting structured information from research papers and articles.
Key facts
- Named entities are recognized using GROBID, which is trained on Wikipedia articles and the CoNLL-2003 dataset to recognize 27 named entity classes
- Entity fishing disambiguates against Wikidata
Installation
!pip install --quiet entity-fishing-client
Execution
import json, logging
from nerd import nerd_client

# Suppress log messages below INFO from the client
logging.getLogger("nerd.nerd_client").setLevel("INFO")

entity_fishing_client = nerd_client.NerdClient()

def entity_fishing(text):
    # disambiguate_text returns a tuple; the first element holds the annotations
    return entity_fishing_client.disambiguate_text(text)[0]
# Use Entity Fishing on the sentences
entity_fishing_output = [
entity_fishing(text) for text in texts
]
# Write them to a JSON file
with open("entity-fishing-output.json", "w") as output_file:
json.dump(entity_fishing_output, output_file, indent=4)
print("done and saved")
done and saved
Analysis
output = entity_fishing_output[0] # change number to 1 or 2 to look at the other sentences
# Show disambiguation data
print(json.dumps(output, indent=4))
{
"text": "Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.",
"entities": [
{
"rawName": "Microsoft",
"offsetStart": 0,
"offsetEnd": 9,
"confidence_score": 0.7067,
"wikipediaExternalRef": 19001,
"wikidataId": "Q2283",
"domains": [
"Electronics",
"Commerce",
"Enterprise",
"Computer_Science"
]
},
{
"rawName": "Bill Gates",
"offsetStart": 18,
"offsetEnd": 28,
"confidence_score": 0.7072,
"wikipediaExternalRef": 3747,
"wikidataId": "Q5284",
"domains": [
"Home",
"Computer_Science",
"Electronics"
]
},
{
"rawName": "Indian Prime Minister Narendra Modi",
"offsetStart": 38,
"offsetEnd": 73,
"confidence_score": 0.4541,
"wikipediaExternalRef": 444222,
"wikidataId": "Q1058",
"domains": [
"Sociology",
"Biology"
]
},
{
"rawName": "partnerships",
"offsetStart": 96,
"offsetEnd": 108,
"confidence_score": 0.3742,
"wikipediaExternalRef": 22666280,
"wikidataId": "Q7888184",
"domains": [
"Administration"
]
}
],
"customisation": "generic",
"sentence": "true",
"language": {
"lang": "en",
"conf": 0.9999970757488004
}
}
# Show Wikidata entries for detected entities
for entity in output["entities"]:
    name_from_text = output["text"][entity["offsetStart"]:entity["offsetEnd"]]
    wikidata_page = "https://www.wikidata.org/wiki/" + entity["wikidataId"]
    print(f"{name_from_text}: {wikidata_page}")
Microsoft: https://www.wikidata.org/wiki/Q2283
Bill Gates: https://www.wikidata.org/wiki/Q5284
Indian Prime Minister Narendra Modi: https://www.wikidata.org/wiki/Q1058
partnerships: https://www.wikidata.org/wiki/Q7888184
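Because entity-fishing reports a confidence_score per entity, a common post-processing step is to keep only confident links. The sketch below operates on a hand-copied subset of the output shown above; the threshold of 0.5 is an arbitrary choice to be tuned for your own precision/recall trade-off.

```python
# Keep only entities whose confidence_score meets a chosen threshold.
# `output` mirrors a subset of the entity-fishing result printed above.
output = {
    "text": "Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.",
    "entities": [
        {"rawName": "Microsoft", "confidence_score": 0.7067, "wikidataId": "Q2283"},
        {"rawName": "Bill Gates", "confidence_score": 0.7072, "wikidataId": "Q5284"},
        {"rawName": "Indian Prime Minister Narendra Modi", "confidence_score": 0.4541, "wikidataId": "Q1058"},
        {"rawName": "partnerships", "confidence_score": 0.3742, "wikidataId": "Q7888184"},
    ],
}

THRESHOLD = 0.5  # arbitrary cut-off; raise it for precision, lower it for recall

confident_entities = [
    entity for entity in output["entities"]
    if entity["confidence_score"] >= THRESHOLD
]

for entity in confident_entities:
    print(entity["rawName"], entity["wikidataId"])
```

With this threshold, the low-confidence links for "Indian Prime Minister Narendra Modi" and "partnerships" are dropped, which also removes the spurious generic-noun annotation.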
2. DBpedia Spotlight
The second tool we discuss is DBpedia Spotlight, which annotates mentions of DBpedia resources in text. This links unstructured information sources to the Linked Open Data cloud through DBpedia. The tool performs both NER and entity linking by connecting recognized entities to their corresponding entries in the DBpedia knowledge base.
Key facts:
- DBpedia Spotlight disambiguates against DBpedia.
Installation
No installation is required: DBpedia Spotlight runs as a public API that can be accessed directly using regular web requests.
Execution
import requests

def dbpedia_spotlight(text):
    # Pass the text as form data so requests URL-encodes special characters
    result = requests.post(
        "https://api.dbpedia-spotlight.org/en/annotate",
        data={"text": text},
        headers={"Accept": "application/json"}
    )
    return result.json()
# Use DBpedia Spotlight on the sentences
dbpedia_spotlight_output = [
dbpedia_spotlight(text) for text in texts
]
# Write them to a JSON file
with open("dbpedia-spotlight-output.json", "w") as output_file:
json.dump(dbpedia_spotlight_output, output_file, indent=4)
print("done and saved")
done and saved
Analysis
output = dbpedia_spotlight_output[0] # change number to 1 or 2 to look at the other sentences
# Show disambiguation data
print(json.dumps(output, indent=4))
{
"@text": "Microsoft founder Bill Gates met with Indian Prime Minister Narendra Modi to discuss technology partnerships.",
"@confidence": "0.5",
"@support": "0",
"@types": "",
"@sparql": "",
"@policy": "whitelist",
"Resources": [
{
"@URI": "http://dbpedia.org/resource/Microsoft",
"@support": "37800",
"@types": "Wikidata:Q4830453,Wikidata:Q43229,Wikidata:Q24229398,DUL:SocialPerson,DUL:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:Agent,DBpedia:Company",
"@surfaceForm": "Microsoft",
"@offset": "0",
"@similarityScore": "0.9999578373542678",
"@percentageOfSecondRank": "3.5435327250394117E-5"
},
{
"@URI": "http://dbpedia.org/resource/Bill_Gates",
"@support": "2491",
"@types": "Http://xmlns.com/foaf/0.1/Person,Wikidata:Q729,Wikidata:Q5,Wikidata:Q215627,Wikidata:Q19088,DUL:NaturalPerson,Schema:Person,DBpedia:Species,DBpedia:Eukaryote,DBpedia:Animal,DBpedia:Person",
"@surfaceForm": "Bill Gates",
"@offset": "18",
"@similarityScore": "0.9999998577599557",
"@percentageOfSecondRank": "1.422313110181471E-7"
},
{
"@URI": "http://dbpedia.org/resource/Prime_Minister_of_India",
"@support": "4046",
"@types": "",
"@surfaceForm": "Indian Prime Minister",
"@offset": "38",
"@similarityScore": "1.0",
"@percentageOfSecondRank": "0.0"
},
{
"@URI": "http://dbpedia.org/resource/Narendra_Modi",
"@support": "4248",
"@types": "Http://xmlns.com/foaf/0.1/Person,Wikidata:Q82955,Wikidata:Q729,Wikidata:Q5,Wikidata:Q215627,Wikidata:Q19088,DUL:NaturalPerson,Schema:Person,DBpedia:Species,DBpedia:Person,DBpedia:Eukaryote,DBpedia:Animal,DBpedia:Politician",
"@surfaceForm": "Narendra Modi",
"@offset": "60",
"@similarityScore": "1.0",
"@percentageOfSecondRank": "0.0"
}
]
}
# Show DBpedia entries for detected entities
for entity in output["Resources"]:
    offset = int(entity["@offset"])
    name_from_text = output["@text"][offset:offset + len(entity["@surfaceForm"])]
    dbpedia_page = entity["@URI"]
    print(f"{name_from_text}: {dbpedia_page}")
Microsoft: http://dbpedia.org/resource/Microsoft
Bill Gates: http://dbpedia.org/resource/Bill_Gates
Indian Prime Minister: http://dbpedia.org/resource/Prime_Minister_of_India
Narendra Modi: http://dbpedia.org/resource/Narendra_Modi
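Since both tools report character offsets, their annotations on the same sentence can be aligned span by span. The sketch below uses spans copied from the two example outputs above; note that entity-fishing linked “Indian Prime Minister Narendra Modi” as a single span, while DBpedia Spotlight split it into two.

```python
# Align annotations from both tools on the same sentence by character span.
# The (start, end) offsets below are copied from the example outputs above.
entity_fishing_spans = {
    (0, 9): "Q2283",        # Microsoft
    (18, 28): "Q5284",      # Bill Gates
    (38, 73): "Q1058",      # Indian Prime Minister Narendra Modi
    (96, 108): "Q7888184",  # partnerships
}

dbpedia_spans = {
    (0, 9): "http://dbpedia.org/resource/Microsoft",
    (18, 28): "http://dbpedia.org/resource/Bill_Gates",
    (38, 59): "http://dbpedia.org/resource/Prime_Minister_of_India",
    (60, 73): "http://dbpedia.org/resource/Narendra_Modi",
}

# Spans on which both tools agree exactly
agreed = sorted(set(entity_fishing_spans) & set(dbpedia_spans))
print(agreed)
```

Only spans that match exactly survive the intersection; a looser, overlap-based alignment would also pair the two tools’ differing annotations of the Narendra Modi mention.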
Conclusion
In conclusion, this tutorial equipped you with the skills to use existing open-source, off-the-shelf NERD tools. It presented tools that link entities to the two most widely used knowledge bases in research, Wikidata and DBpedia, and should give you the confidence to explore other NERD tools beyond those covered here.
For more in-depth exploration, consider checking out the current state-of-the-art in NER on NLP-progress: Entity Linking.
As a further learning resource, we have collected answers to frequently asked questions below.
FAQ
Why use off-the-shelf tools for entity linking and disambiguation? Off-the-shelf tools offer pre-built solutions that save time and resources. They are often trained on large datasets, providing a good starting point for various applications.
What types of entities can be linked using these tools? Most tools support common entities like persons, organizations, and locations. Some may also handle specific domains or custom entities based on the tool’s training data.
How accurate are off-the-shelf tools? Accuracy varies among tools. It depends on factors such as the quality of training data, the diversity of entities, and the specific use case. Evaluation metrics like precision, recall, and F1 score help assess accuracy.
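To illustrate these metrics, here is a minimal sketch that scores predicted entity spans against gold spans using exact matching; the spans themselves are made up for the example.

```python
# Toy evaluation of predicted entity spans against gold spans.
# Spans are (start, end) character offsets; exact-match scoring.
gold = {(0, 9), (18, 28), (60, 73)}
predicted = {(0, 9), (18, 28), (38, 59)}

true_positives = len(gold & predicted)          # spans found in both sets
precision = true_positives / len(predicted)     # fraction of predictions that are correct
recall = true_positives / len(gold)             # fraction of gold spans that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # → 0.67 0.67 0.67
```

Note that exact-match scoring is strict: a prediction that overlaps a gold span but has slightly different boundaries counts as both a false positive and a false negative. Benchmark evaluations sometimes report a relaxed, overlap-based variant as well.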
Do these tools work for multiple languages? Many off-the-shelf tools support multiple languages, but the level of accuracy can vary. It’s essential to check the documentation for language support.
Can these tools be fine-tuned for domain-specific applications? Some tools offer the possibility of fine-tuning on domain-specific data. However, it depends on the tool’s architecture and capabilities.
How do these tools handle ambiguous references? Ambiguity resolution depends on context and available information. Some tools use machine learning models that consider surrounding words, phrases, or contextual information to disambiguate references.
Are there privacy concerns when using entity linking tools? Yes, privacy concerns may arise, especially if the text contains sensitive information. It’s crucial to review the tool’s privacy policy and consider using it with proper data anonymization practices.
What knowledge bases do these tools typically use? Tools may use popular knowledge bases like Wikidata, DBpedia, or Freebase. Some tools allow users to specify custom knowledge bases or integrate with proprietary databases.
Can these tools handle real-time processing? Real-time processing capabilities vary. Some tools are optimized for speed, while others may be more suitable for batch processing. Consider the specific requirements of your application.
How do these tools handle typos or misspellings? Some tools include mechanisms to handle typos or misspellings through fuzzy matching or probabilistic models. However, their effectiveness may vary.
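As an illustration of fuzzy matching, the sketch below uses Python’s standard-library difflib to match a possibly misspelled mention against a small, invented list of known names; real tools typically use more sophisticated candidate generation.

```python
from difflib import SequenceMatcher

# Toy fuzzy lookup: match a possibly misspelled mention against known
# entity names using difflib's similarity ratio (stdlib, no extra installs).
KNOWN_NAMES = ["Bill Gates", "Narendra Modi", "Elon Musk", "Greta Thunberg"]

def fuzzy_match(mention, threshold=0.8):
    """Return the best-matching known name, or None if below the threshold."""
    best_name, best_ratio = None, 0.0
    for name in KNOWN_NAMES:
        ratio = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= threshold else None

print(fuzzy_match("Bil Gates"))  # misspelled mention
```

The threshold of 0.8 is an arbitrary choice: set it too low and unrelated names start to match, too high and genuine typos are missed.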
Are there limitations to off-the-shelf tools? Yes, limitations can include handling rare entities, dealing with noisy or informal text, and adapting to highly specialized domains. It’s essential to understand the tool’s strengths and weaknesses.
Do these tools require internet access? Some tools may require internet access to query external knowledge bases. Check the tool’s documentation for offline or custom knowledge base options.
How scalable are these tools for large datasets? Scalability depends on the tool’s architecture. Some tools are designed for large-scale processing, while others may be more suitable for smaller datasets.
Can I combine multiple tools for better performance? Yes, combining multiple tools (ensemble methods) can improve performance and mitigate the limitations of individual tools. However, integration complexity should be considered.
Contact details
In case of questions and suggestions for this tutorial, contact Susmita.Gangopadhyay@gesis.org