Named Entity Recognition to Enrich Text

Oct 20, 2023
Open in Github

Named Entity Recognition (NER) is a Natural Language Processing task that identifies and classifies named entities (NE) into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.

This notebook demonstrates how to carry out NER with chat completion and functions-calling to enrich a text with links to a knowledge base such as Wikipedia:

Text:

In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.

Text enriched with Wikipedia links:

In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.

Inference Costs: The notebook also illustrates how to estimate OpenAI API costs.

%pip install --upgrade openai --quiet
%pip install --upgrade nlpia2-wikipedia --quiet
%pip install --upgrade tenacity --quiet
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.

This notebook works with the latest OpeanAI models gpt-3.5-turbo-0613 and gpt-4-0613.

import json
import logging
import os

import openai
import wikipedia

from typing import Optional
from IPython.display import display, Markdown
from tenacity import retry, wait_random_exponential, stop_after_attempt

logging.basicConfig(level=logging.INFO, format=' %(asctime)s - %(levelname)s - %(message)s')

OPENAI_MODEL = 'gpt-3.5-turbo-0613'
openai.api_key = os.getenv("OPENAI_API_KEY")

We define a standard set of NER labels to showcase a wide range of use cases. However, for our specific task of enriching text with knowledge base links, only a subset is practically required.

labels = [
    "person",      # people, including fictional characters
    "fac",         # buildings, airports, highways, bridges
    "org",         # organizations, companies, agencies, institutions
    "gpe",         # geopolitical entities like countries, cities, states
    "loc",         # non-gpe locations
    "product",     # vehicles, foods, appareal, appliances, software, toys 
    "event",       # named sports, scientific milestones, historical events
    "work_of_art", # titles of books, songs, movies
    "law",         # named laws, acts, or legislations
    "language",    # any named language
    "date",        # absolute or relative dates or periods
    "time",        # time units smaller than a day
    "percent",     # percentage (e.g., "twenty percent", "18%")
    "money",       # monetary values, including unit
    "quantity",    # measurements, e.g., weight or distance
]

The chat completions API takes a list of messages as input and delivers a model-generated message as an output. While the chat format is primarily designed for facilitating multi-turn conversations, it is equally efficient for single-turn tasks without any preceding conversation. For our purposes, we will specify a message for the system, assistant, and user roles.

The system message (prompt) sets the assistant's behavior by defining its desired persona and task. We also delineate the specific set of entity labels we aim to identify.

Although one can instruct the model to format its response, it has to be noted that both gpt-3.5-turbo-0613 and gpt-4-0613 have been fine-tuned to discern when a function should be invoked, and to reply with JSON formatted according to the function's signature. This capability streamlines our prompt and enables us to receive structured data directly from the model.

def system_message(labels):
    return f"""
You are an expert in Natural Language Processing. Your task is to identify common Named Entities (NER) in a given text.
The possible common Named Entities (NER) types are exclusively: ({", ".join(labels)})."""

Assistant messages usually store previous assistant responses. However, as in our scenario, they can also be crafted to provide examples of the desired behavior. While OpenAI is able to execute zero-shot Named Entity Recognition, we have found that a one-shot approach produces more precise results.

def assisstant_message():
    return f"""
EXAMPLE:
    Text: 'In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread / 
    of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.'
    {{
        "gpe": ["Germany", "Europe"],
        "date": ["1440"],
        "person": ["Johannes Gutenberg"],
        "product": ["movable-type printing press"],
        "event": ["Renaissance"],
        "quantity": ["3,600 pages"],
        "time": ["workday"]
    }}
--"""

The user message provides the specific text for the assistant task:

def user_message(text):
    return f"""
TASK:
    Text: {text}
"""

In an OpenAI API call, we can describe functions to gpt-3.5-turbo-0613 and gpt-4-0613 and have the model intelligently choose to output a JSON object containing arguments to call those functions. It's important to note that the chat completions API doesn't actually execute the function. Instead, it provides the JSON output, which can then be used to call the function in our code. For more details, refer to the OpenAI Function Calling Guide.

Our function, enrich_entities(text, label_entities gets a block of text and a dictionary containing identified labels and entities as parameters. It then associates the recognized entities with their corresponding links to the Wikipedia articles.

@retry(wait=wait_random_exponential(min=1, max=10), stop=stop_after_attempt(5))
def find_link(entity: str) -> Optional[str]:
    """
    Finds a Wikipedia link for a given entity.
    """
    try:
        titles = wikipedia.search(entity)
        if titles:
            # naively consider the first result as the best
            page = wikipedia.page(titles[0])
            return page.url
    except (wikipedia.exceptions.WikipediaException) as ex:
        logging.error(f'Error occurred while searching for Wikipedia link for entity {entity}: {str(ex)}')

    return None
def find_all_links(label_entities:dict) -> dict:
    """ 
    Finds all Wikipedia links for the dictionary entities in the whitelist label list.
    """
    whitelist = ['event', 'gpe', 'org', 'person', 'product', 'work_of_art']
    
    return {e: find_link(e) for label, entities in label_entities.items() 
                            for e in entities
                            if label in whitelist}
def enrich_entities(text: str, label_entities: dict) -> str:
    """
    Enriches text with knowledge base links.
    """
    entity_link_dict = find_all_links(label_entities)
    logging.info(f"entity_link_dict: {entity_link_dict}")
    
    for entity, link in entity_link_dict.items():
        text = text.replace(entity, f"[{entity}]({link})")

    return text

As previously highlighted, gpt-3.5-turbo-0613 and gpt-4-0613 have been fine-tuned to detect when a function should to be called. Moreover, they can produce a JSON response that conforms to the function signature. Here's the sequence we follow:

  1. Define our function and its associated JSON Schema.
  2. Invoke the model using the messages, functions and function_call parameters.
  3. Convert the output into a JSON object, and then call the function with the arguments provided by the model.

In practice, one might want to re-invoke the model again by appending the function response as a new message, and let the model summarize the results back to the user. Nevertheless, for our purposes, this step is not needed.

Note that in a real-case scenario it is strongly recommended to build in user confirmation flows before taking actions.

Since we want the model to output a dictionary of labels and recognized entities:

{   
    "gpe": ["Germany", "Europe"],   
    "date": ["1440"],   
    "person": ["Johannes Gutenberg"],   
    "product": ["movable-type printing press"],   
    "event": ["Renaissance"],   
    "quantity": ["3,600 pages"],   
    "time": ["workday"]   
}   

we need to define the corresponding JSON schema to be passed to the functions parameter:

def generate_functions(labels: dict) -> list:
    return [
        {
            "name": "enrich_entities",
            "description": "Enrich Text with Knowledge Base Links",
            "parameters": {
                "type": "object",
                    "properties": {
                        "r'^(?:' + '|'.join({labels}) + ')$'": 
                        {
                            "type": "array",
                            "items": {
                                "type": "string"
                            }
                        }
                    },
                    "additionalProperties": False
            },
        }
    ]

Now, we invoke the model. It's important to note that we direct the API to use a specific function by setting the function_call parameter to {"name": "enrich_entities"}.

@retry(wait=wait_random_exponential(min=1, max=10), stop=stop_after_attempt(5))
def run_openai_task(labels, text):
    messages = [
          {"role": "system", "content": system_message(labels=labels)},
          {"role": "assistant", "content": assisstant_message()},
          {"role": "user", "content": user_message(text=text)}
      ]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        functions=generate_functions(labels),
        function_call={"name": "enrich_entities"}, 
        temperature=0,
        frequency_penalty=0,
        presence_penalty=0,
    )

    response_message = response["choices"][0]["message"]
    
    available_functions = {"enrich_entities": enrich_entities}  
    function_name = response_message["function_call"]["name"]
    
    function_to_call = available_functions[function_name]
    logging.info(f"function_to_call: {function_to_call}")

    function_args = json.loads(response_message["function_call"]["arguments"])
    logging.info(f"function_args: {function_args}")

    function_response = function_to_call(text, function_args)

    return {"model_response": response, 
            "function_response": function_response}
text = """The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison, and Ringo Starr."""
result = run_openai_task(labels, text)
 2023-10-20 18:05:51,729 - INFO - function_to_call: <function enrich_entities at 0x0000021D30C462A0>
 2023-10-20 18:05:51,730 - INFO - function_args: {'person': ['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr'], 'org': ['The Beatles'], 'gpe': ['Liverpool'], 'date': ['1960']}
 2023-10-20 18:06:09,858 - INFO - entity_link_dict: {'John Lennon': 'https://en.wikipedia.org/wiki/John_Lennon', 'Paul McCartney': 'https://en.wikipedia.org/wiki/Paul_McCartney', 'George Harrison': 'https://en.wikipedia.org/wiki/George_Harrison', 'Ringo Starr': 'https://en.wikipedia.org/wiki/Ringo_Starr', 'The Beatles': 'https://en.wikipedia.org/wiki/The_Beatles', 'Liverpool': 'https://en.wikipedia.org/wiki/Liverpool'}
display(Markdown(f"""**Text:** {text}   
                     **Enriched_Text:** {result['function_response']}"""))
<IPython.core.display.Markdown object>

To estimate the inference costs, we can parse the response's "usage" field. Detailed token costs per model are available in the OpenAI Pricing Guide:

# estimate inference cost assuming gpt-3.5-turbo (4K context)
i_tokens  = result["model_response"]["usage"]["prompt_tokens"] 
o_tokens = result["model_response"]["usage"]["completion_tokens"] 

i_cost = (i_tokens / 1000) * 0.0015
o_cost = (o_tokens / 1000) * 0.002

print(f"""Token Usage
    Prompt: {i_tokens} tokens
    Completion: {o_tokens} tokens
    Cost estimation: ${round(i_cost + o_cost, 5)}""")
Token Usage
    Prompt: 331 tokens
    Completion: 47 tokens
    Cost estimation: $0.00059