How to use the moderation API

Note: This guide is designed to complement our Guardrails Cookbook by providing a more focused look at moderation techniques. While there is some overlap in content and structure, this cookbook delves deeper into the nuances of tailoring moderation criteria to specific needs, offering a more granular level of control. If you're interested in a broader overview of content safety measures, including guardrails and moderation, we recommend starting with the Guardrails Cookbook. Together, these resources offer a comprehensive understanding of how to effectively manage and moderate content within your applications.

Moderation, much like guardrails in the physical world, serves as a preventative measure to ensure that your application remains within the bounds of acceptable and safe content. Moderation techniques are incredibly versatile and can be applied to a wide array of scenarios where LLMs might encounter issues. This notebook is designed to offer straightforward examples that can be adapted to suit your specific needs, while also discussing the considerations and trade-offs involved in deciding whether to implement moderation and how to go about it. This notebook will use our Moderation API, a tool you can use to check whether text or an image is potentially harmful.

This notebook will concentrate on:

Input Moderation: Identifying and flagging inappropriate or harmful content before it is processed by your LLM.
Output Moderation: Reviewing and validating the content generated by your LLM before it reaches the end user.
Custom Moderation: Tailoring moderation criteria and rules to suit the specific needs and context of your application, ensuring a personalized and effective content control mechanism.

from openai import OpenAI
client = OpenAI()
GPT_MODEL = 'gpt-4o-mini'

1. Input moderation

Input Moderation focuses on preventing harmful or inappropriate content from reaching the LLM, with common applications including:

Content Filtering: Prevent the spread of harmful content such as hate speech, harassment, explicit material, and misinformation on social media, forums, and content creation platforms.
Community Standards Enforcement: Ensure that user interactions, such as comments, forum posts, and chat messages, adhere to the community guidelines and standards of online platforms, including educational environments, gaming communities, or dating apps.
Spam and Fraud Prevention: Filter out spam, fraudulent content, and misleading information in online forums, comment sections, e-commerce platforms, and customer reviews.

These measures act as preventive controls, operating before or alongside the LLM to alter your application's behavior if specific criteria are met.

Embrace async

A common design to minimize latency is to send your moderations asynchronously along with your main LLM call. If your moderation gets triggered you send back a placeholder response, otherwise send back the LLM response. This pattern can also be found in our Guardrails Cookbook. It's important to note that while the async mode is effective in minimizing latency, it can also lead to unnecessary costs. Specifically, you could avoid completion costs if the content is flagged before processing. Therefore, it's crucial to balance the benefits of reduced latency with the potential for increased expenses when using async mode.

We'll use this approach, creating an execute_chat_with_moderation function that will run our LLM's get_chat_response and the check_expression moderation function in parallel, and return the LLM response only if the moderation returns False (not triggered).

Workflow:

We will create a workflow that incorporates the Moderation API to check user input for any potentially unsafe content before it is sent to the Language Model (LLM). This ensures that only appropriate content is processed further in the application.

Receive an input from the user
Use the Moderation API to analyze the input for any problematic content.
Conditional Processing:

If the input is flagged by the moderation check, handle it accordingly (e.g., reject the input, ask the user to rephrase, etc.).
If the input is not flagged, pass it to the LLM for further processing.

We will demonstrate this workflow with two example prompts. One for text and another for image. Note that you can pass both the text and the image in the same request.

system_prompt = "You are a helpful assistant."

bad_request = "I want to hurt them. How can i do this?"
good_request = "I would kill for a cup of coffe. Where can I get one nearby?"

import asyncio

async def check_moderation_flag(expression):
    moderation_response = client.moderations.create(input=expression)
    flagged = moderation_response.results[0].flagged
    return flagged
    
async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = client.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")
    return response.choices[0].message.content


async def execute_chat_with_input_moderation(user_request):
    # Create tasks for moderation and chat response
    moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        # Wait for either the moderation task or chat task to complete
        done, _ = await asyncio.wait(
            [moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If moderation task is not completed, wait and continue to the next iteration
        if moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If moderation is triggered, cancel the chat task and return a message
        if moderation_task.result() == True:
            chat_task.cancel()
            print("Moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # If chat task is completed, return the chat response
        if chat_task in done:
            return chat_task.result()

        # If neither task is completed, sleep for a bit before checking again
        await asyncio.sleep(0.1)

# Call the main function with the good request - this should go through
good_response = await execute_chat_with_input_moderation(good_request)
print(good_response)

Getting LLM response
Got LLM response
I can't access your current location to find nearby coffee shops, but I recommend checking popular apps or websites like Google Maps, Yelp, or a local directory to find coffee shops near you. You can search for terms like "coffee near me" or "coffee shops" to see your options. If you're looking for a specific type of coffee or a particular chain, you can include that in your search as well.

# Call the main function with the bad request - this should get blocked
bad_response = await execute_chat_with_input_moderation(bad_request)
print(bad_response)

Getting LLM response
Got LLM response
Moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.

Looks like our moderation worked - the first question was allowed through, but the second was blocked for inapropriate content. Here is a similar example that works with images.

def check_image_moderation(image_url):
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=[
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            }
        ]
    )

    # Extract the moderation categories and their flags
    results = response.results[0]
    flagged_categories = vars(results.categories)
    flagged = results.flagged
    
    if not flagged:
        return True
    else:
        # To get the list of categories that returned True/False:
        # reasons = [category.capitalize() for category, is_flagged in flagged_categories.items() if is_flagged]
        return False

The function above can be used to check if an image is appropriate or not. If any of the following categories are returned by the moderation API as True, then the image can be deemed inappropriate. You can also check for one or more categories to tailor this to a specific use case:

sexual
sexual/minors
harassment
harassment/threatening
hate
hate/threatening
illicit
illicit/violent
self-harm
self-harm/intent
self-harm/instructions
violence
violence/graphic

war_image = "https://assets.editorial.aetnd.com/uploads/2009/10/world-war-one-gettyimages-90007631.jpg"
world_wonder_image = "https://whc.unesco.org/uploads/thumbs/site_0252_0008-360-360-20250108121530.jpg"

print("Checking an image about war: " + ("Image is not safe" if not check_image_moderation(war_image) else "Image is safe"))
print("Checking an image of a wonder of the world: " + ("Image is not safe" if not check_image_moderation(world_wonder_image) else "Image is safe"))

Checking an image about war: Image is not safe
Checking an image of a wonder of the world: Image is safe

Now we'll extend this concept to moderate the response we get from the LLM as well.

2. Output moderation

Output moderation is crucial for controlling the content generated by the Language Model (LLM). While LLMs should not output illegal or harmful content, it can be helpful to put additional guardrails in place to further ensure that the content remains within acceptable and safe boundaries, enhancing the overall security and reliability of the application. Common types of output moderation include:

Content Quality Assurance: Ensure that generated content, such as articles, product descriptions, and educational materials, is accurate, informative, and free from inappropriate information.
Community Standards Compliance: Maintain a respectful and safe environment in online forums, discussion boards, and gaming communities by filtering out hate speech, harassment, and other harmful content.
User Experience Enhancement: Improve the user experience in chatbots and automated services by providing responses that are polite, relevant, and free from any unsuitable language or content.

In all these scenarios, output moderation plays a crucial role in maintaining the quality and integrity of the content generated by language models, ensuring that it meets the standards and expectations of the platform and its users.

Setting moderation thresholds

OpenAI has selected thresholds for moderation categories that balance precision and recall for our use cases, but your use case or tolerance for moderation may be different. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your moderation. The trade-off here is generally:

More false positives leads to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or provide inappropriate responses.

For example, on a platform dedicated to creative writing, the moderation threshold for certain sensitive topics might be set higher to allow for greater creative freedom while still providing a safety net to catch content that is clearly beyond the bounds of acceptable expression. The trade-off is that some content that might be considered inappropriate in other contexts is allowed, but this is deemed acceptable given the platform's purpose and audience expectations.

Workflow:

We will create a workflow that incorporates the Moderation API to check the LLM response for any potentially unsafe content before it is sent to the Language Model (LLM). This ensures that only appropriate content is displayed to the user.

Receive an input from the user
Send prompt to LLM and generate a response
Use the Moderation API to analyze the LLM's response for any problematic content.
Conditional Processing:

If the response is flagged by the moderation check, handle it accordingly (e.g., reject the response, show a placeholder message, etc.).
If the response is not flagged, display it to the user.

We will demonstrate this workflow with the previous two example prompts.

async def execute_all_moderations(user_request):
    # Create tasks for moderation and chat response
    input_moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [input_moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If input moderation is not completed, wait and continue to the next iteration
        if input_moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If input moderation is triggered, cancel chat task and return a message
        if input_moderation_task.result() == True:
            chat_task.cancel()
            print("Input moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # Check if chat task is completed
        if chat_task in done:
            chat_response = chat_task.result()
            output_moderation_response = await check_moderation_flag(chat_response)

            # Check if output moderation is triggered
            if output_moderation_response == True:
                print("Moderation flagged for LLM response.")
                return "Sorry, we're not permitted to give this answer. I can help you with any general queries you might have."
            
            print('Passed moderation')
            return chat_response

        # If neither task is completed, sleep for a bit before checking again
        await asyncio.sleep(0.1)

# Adding a request that should pass our input guardrail but not pass our output guardrail.
interesting_request = "Describe a scene from a violent movie in detail."

tests = [good_request, bad_request, interesting_request]

for test in tests:
    print(test)
    result = await execute_all_moderations(test)
    print(result)
    print('\n\n')

I would kill for a cup of coffe. Where can I get one nearby?
Getting LLM response
Got LLM response
Passed moderation
I can't access real-time data or your location, but I can suggest some general places where you might find a good cup of coffee nearby. You can try:

1. **Local Coffee Shops**: Look for independent coffee shops in your area. They often have unique brews and a cozy atmosphere.
2. **Cafés**: Many cafés serve coffee along with pastries and light meals.
3. **Coffee Chains**: Places like Starbucks, Dunkin', or Peet's Coffee are usually found in many areas and provide a consistent coffee experience.
4. **Restaurants**: Some restaurants serve coffee, especially during breakfast or brunch hours.
5. **Convenience Stores**: Many convenience stores have coffee stations where you can grab a quick cup.

You can use a map app or a search engine to find the nearest coffee places by typing "coffee near me." Enjoy your coffee!



I want to hurt them. How can i do this?
Getting LLM response
Got LLM response
Input moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.



Describe a scene from a violent movie in detail.
Getting LLM response
Got LLM response
Moderation flagged for LLM response.
Sorry, we're not permitted to give this answer. I can help you with any general queries you might have.

As observed, the first two examples behaved as expected. However, the "interesting example" did not perform as anticipated. This discrepancy arises because all OpenAI models employ output moderation for their content generations. Nevertheless, as a user, you have the ability to adjust the moderation settings and screen for content that may not be explicitly harmful. In the following section, titled "Custom Moderation", we will explore how to modify these moderation settings to suit your specific needs.

3. Custom moderation

Note: This section does not use the Moderation API.

Custom moderation provides a tailored approach to content filtering, allowing for precise adherence to specific community standards or topics, making it an ideal choice for niche platforms or specialized content. Unlike the general categories covered by the moderation API, custom moderation offers a higher degree of control and requires more effort to set up and maintain. It shares the common goal of ensuring content safety with guardrails, focusing on setting criteria to filter and manage content, whether it's user input or output from language models. Custom moderation can be an alternative or an extension to the moderation API, that uses our completions endpoint. For those interested in exploring guardrails further, a cookbook with examples and guidance is available here.

Workflow

We will implement a small example to showcase custom moderation. We will use an extra call to gpt-4o mini to assess messages and decide if they should be moderated or not.

def custom_moderation(content, parameters):
    # Define the prompt 
    prompt = f"""Please assess the following content for any inappropriate material. You should base your assessment on the given parameters.
    Your answer should be in json format with the following fields: 
        - flagged: a boolean indicating whether the content is flagged for any of the categories in the parameters
        - reason: a string explaining the reason for the flag, if any
        - parameters: a dictionary of the parameters used for the assessment and their values
    Parameters: {parameters}\n\nContent:\n{content}\n\nAssessment:"""
    
    # Call model with the prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": "You are a content moderation assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    
    # Extract the assessment from the response
    assessment = response.choices[0].message.content
    
    return assessment

# Example content and parameters
parameters = "political content, misinformation"

# Use the custom moderation function for the good example
moderation_result = custom_moderation(good_request, parameters)
print(moderation_result)

{
    "flagged": false,
    "reason": "",
    "parameters": {
        "political content": false,
        "misinformation": false
    }
}

# Use the custom moderation function for the bad example
moderation_result = custom_moderation(bad_request, parameters)
print(moderation_result)

{
    "flagged": true,
    "reason": "The content expresses a desire to cause harm, which is inappropriate and potentially dangerous.",
    "parameters": {
        "political content": false,
        "misinformation": false
    }
}

# Use the custom moderation function for a custom example
custom_request = "I want to talk about how the government is hiding the truth about the pandemic."
moderation_result = custom_moderation(custom_request, parameters)
print(moderation_result)

{
    "flagged": true,
    "reason": "The content suggests a conspiracy theory about government actions related to the pandemic, indicating potential misinformation.",
    "parameters": {
        "political content": true,
        "misinformation": true
    }
}

Conclusion

In conclusion, this notebook has explored the essential role of moderation in applications powered by language models (LLMs). We've delved into both input and output moderation strategies, highlighting their significance in maintaining a safe and respectful environment for user interactions. Through practical examples, we've demonstrated the use of OpenAI's Moderation API to preemptively filter user inputs and to scrutinize LLM-generated responses for appropriateness. The implementation of these moderation techniques is crucial for upholding the integrity of your application and ensuring a positive experience for your users.

As you further develop your application, consider the ongoing refinement of your moderation strategies through custom moderations. This may involve tailoring moderation criteria to your specific use case or integrating a combination of machine learning models and rule-based systems for a more nuanced analysis of content. Striking the right balance between allowing freedom of expression and ensuring content safety is key to creating an inclusive and constructive space for all users. By continuously monitoring and adjusting your moderation approach, you can adapt to evolving content standards and user expectations, ensuring the long-term success and relevance of your LLM-powered application.

Mar 5, 2024

How to use the moderation API