How to implement LLM guardrails

Dec 19, 2023
Open in Github

In this notebook we share examples of how to implement guardrails for your LLM applications. A guardrail is a generic term for detective controls that aim to steer your application. Greater steerability is a common requirement given the inherent randomness of LLMs, and so creating effective guardrails has become one of the most common areas of performance optimization when pushing an LLM from prototype to production.

Guardrails are incredibly diverse and can be deployed to virtually any context you can imagine something going wrong with LLMs. This notebook aims to give simple examples that can be extended to meet your unique use case, as well as outlining the trade-offs to consider when deciding whether to implement a guardrail, and how to do it.

This notebook will focus on:

  1. Input guardrails that flag inappropriate content before it gets to your LLM
  2. Output guardrails that validate what your LLM has produced before it gets to the customer

Note: This notebook tackles guardrails as a generic term for detective controls around an LLM - for the official libraries that provide distributions of pre-built guardrails frameworks, please check out the following:

import openai

GPT_MODEL = 'gpt-3.5-turbo'

1. Input guardrails

Input guardrails aim to prevent inappropriate content getting to the LLM in the first place - some common use cases are:

  • Topical guardrails: Identify when a user asks an off-topic question and give them advice on what topics the LLM can help them with.
  • Jailbreaking: Detect when a user is trying to hijack the LLM and override its prompting.
  • Prompt injection: Pick up instances of prompt injection where users try to hide malicious code that will be executed in any downstream functions the LLM executes.

In all of these they act as a preventative control, running either before or in parallel with the LLM, and triggering your application to behave differently if one of these criteria are met.

Designing a guardrail

When designing guardrails it is important to consider the trade-off between accuracy, latency and cost, where you try to achieve maximum accuracy for the least impact to your bottom line and the user's experience.

We'll begin with a simple topical guardrail which aims to detect off-topic questions and prevent the LLM from answering if triggered. This guardrail consists of a simple prompt and uses gpt-3.5-turbo, maximising latency/cost over accuracy, but if we wanted to optimize further we could consider:

  • Accuracy: You could consider using a fine-tuned model or few-shot examples to increase the accuracy. RAG can also be effective if you have a corpus of information that can help determine whether a piece of content is allowed or not.
  • Latency/Cost: You could try fine-tuning smaller models, such as babbage-002 or open-source offerings like Llama, which can perform quite well when given enough training examples. When using open-source offerings you can also tune the machines you are using for inference to maximize either cost or latency reduction.

This simple guardrail aims to ensure the LLM only answers to a predefined set of topics, and responds to out-of-bounds queries with a canned message.

Embrace async

A common design to minimize latency is to send your guardrails asynchronously along with your main LLM call. If your guardrails get triggered you send back their response, otherwise send back the LLM response.

We'll use this approach, creating an execute_chat_with_guardrails function that will run our LLM's get_chat_response and the topical_guardrail guardrail in parallel, and return the LLM response only if the guardrail returns allowed.

Limitations

You should always consider the limitations of guardrails when developing your design. A few of the key ones to be aware of are:

  • When using LLMs as a guardrail, be aware that they have the same vulnerabilities as your base LLM call itself. For example, a prompt injection attempt could be successful in evading both your guardrail and your actual LLM call.
  • As conversations get longer, LLMs are more susceptible to jailbreaking as your instructions become diluted by the extra text.
  • Guardrails can harm the user experience if you make them overly restrictive to compensate for the issues noted above. This manifests as over-refusals, where your guardrails reject innocuous user requests because there are similarities with prompt injection or jailbreaking attempts.

Mitigations

If you can combine guardrails with rules-based or more traditional machine learning models for detection this can mitigate some of these risks. We've also seen customers have guardrails that only ever consider the latest message, to alleviate the risks of the model being confused by a long conversation.

We would also recommend doing a gradual roll-out with active monitoring of conversations so you can pick up instances of prompt injection or jailbreaking, and either add more guardrails to cover these new types of behaviour, or include them as training examples to your existing guardrails.

system_prompt = "You are a helpful assistant."

bad_request = "I want to talk about horses"
good_request = "What are the best breeds of dog for people that like cats?"
import asyncio


async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrail(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Call the main function with the good request - this should go through
response = await execute_chat_with_guardrail(good_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
If you're a cat lover considering getting a dog, it's important to choose a breed that typically has a more cat-like temperament. Here are some dog breeds that are known to be more cat-friendly:

1. Basenji: Known as the "barkless dog," Basenjis are independent, clean, and have a cat-like grooming habit.

2. Shiba Inu: Shiba Inus are often described as having a cat-like personality. They are independent, clean, and tend to be reserved with strangers.

3. Greyhound: Greyhounds are quiet, low-energy dogs that enjoy lounging around, much like cats. They are also known for their gentle and calm nature.

4. Bichon Frise: Bichon Frises are small, friendly dogs that are often compared to cats due to their playful and curious nature. They are also hypoallergenic, making them a good choice for those with allergies.

5. Cavalier King Charles Spaniel: These dogs are affectionate, gentle, and adaptable, making them a good match for cat lovers. They are known for their desire to be close to their owners and their calm demeanor.

Remember, individual dogs can have different personalities, so it's important to spend time with the specific dog you're considering to see if their temperament aligns with your preferences.
# Call the main function with the good request - this should get blocked
response = await execute_chat_with_guardrail(bad_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.

Looks like our guardrail worked - the first question was allowed through, but the second was blocked for being off-topic. Now we'll extend this concept to moderate the response we get from the LLM as well.

2. Output guardrails

Output guardrails govern what the LLM comes back with. These can take many forms, with some of the most common being:

  • Hallucination/fact-checking guardrails: Using a corpus of ground truth information or a training set of hallucinated responses to block hallucinated responses.
  • Moderation guardrails: Applying brand and corporate guidelines to moderate the LLM's results, and either blocking or rewriting its response if it breaches them.
  • Syntax checks: Structured outputs from LLMs can be returned corrupt or unable to be parsed - these guardrails detect those and either retry or fail gracefully, preventing failures in downstream applications.
    • This is a common control to apply with function calling, ensuring that the expected schema is returned in the arguments when the LLM returns a function_call.

Moderation guardrail

Here we implement a moderation guardrail that uses a version of the G-Eval evaluation method to score the presence of unwanted content in the LLM's response. This method is demonstrated in more detail in of our other notebooks.

To accomplish this we will make an extensible framework for moderating content that takes in a domain and applies criteria to a piece of content using a set of steps:

  1. We set a domain name, which describes the type of content we're going to moderate.
  2. We provide criteria, which outline clearly what the content should and should not contain.
  3. Step-by-step instructions are provided for the LLM to grade the content.
  4. The LLM returns a discrete score from 1-5.

Setting guardrail thresholds

Our output guardrail will assess the LLM's response and block anything scoring a 3 or higher. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your guardrail. The trade-off here is generally:

  • More false positives leads to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
  • More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or prompt inject/jailbreak it.

For example, for jailbreaking you may want to have a very low threshold, as the risk to your business if your LLM is hijacked and used to produce dangerous content that ends up on social media is very high. However, for our use case we're willing to accept a few false negatives, as the worst that could happen is someone ends up with a Bichon Frise who might have been better suited to a Labrador, which though sad will probably not cause lasting damage to our business (we hope).

domain = "animal breed recommendation"

animal_advice_criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""

animal_advice_steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""
async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=animal_advice_criteria,
            scoring_steps=animal_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content
    
    
async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                moderation_response = await moderation_guardrail(chat_response)

                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged with a score of {int(moderation_response)}")
                    return "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."

                else:
                    print('Passed moderation')
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Adding a request that should pass both our topical guardrail and our moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'
tests = [good_request,bad_request,great_request]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')
    
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 5
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Passed moderation
As a new dog owner, here are some helpful tips:

1. Choose the right breed: Research different dog breeds to find one that suits your lifestyle, activity level, and living situation. Some breeds require more exercise and attention than others.

2. Puppy-proof your home: Make sure your home is safe for your new furry friend. Remove any toxic plants, secure loose wires, and store household chemicals out of reach.

3. Establish a routine: Dogs thrive on routine, so establish a consistent schedule for feeding, exercise, and bathroom breaks. This will help your dog feel secure and reduce any anxiety.

4. Socialize your dog: Expose your dog to different people, animals, and environments from an early age. This will help them become well-adjusted and comfortable in various situations.

5. Train your dog: Basic obedience training is essential for your dog's safety and your peace of mind. Teach commands like sit, stay, and come, and use positive reinforcement techniques such as treats and praise.

6. Provide mental and physical stimulation: Dogs need both mental and physical exercise to stay happy and healthy. Engage in activities like walks, playtime, puzzle toys, and training sessions to keep your dog mentally stimulated.

7. Proper nutrition: Feed your dog a balanced and appropriate diet based on their age, size, and specific needs. Consult with a veterinarian to determine the best food options for your dog.

8. Regular veterinary care: Schedule regular check-ups with a veterinarian to ensure your dog's health and well-being. Vaccinations, parasite prevention, and dental care are important aspects of their overall care.

9. Be patient and consistent: Dogs require time, patience, and consistency to learn and adapt to their new environment. Stay positive, be patient with their training, and provide clear and consistent boundaries.

10. Show love and affection: Dogs are social animals that thrive on love and affection. Spend quality time with your dog, offer praise and cuddles, and make them feel like an important part of your family.

Remember, being a responsible dog owner involves commitment, time, and effort. With proper care and attention, you can build a strong bond with your new furry companion.



Conclusion

Guardrails are a vibrant and evolving topic in LLMs, and we hope this notebook has given you an effective introduction to the core concepts around guardrails. To recap:

  • Guardrails are detective controls that aim to prevent harmful content getting to your applications and your users, and add steerability to your LLM in production.
  • They can take the form of input guardrails, which target content before it gets to the LLM, and output guardrails, which control the LLM's response.
  • Designing guardrails and setting their thresholds is a trade-off between accuracy, latency, and cost. Your decision should be based on clear evaluations of the performance of your guardrails, and an understanding of what the cost of a false negative and false positive are for your business.
  • By embracing asynchronous design principles, you can scale guardrails horizontally to minimize the impact to the user as your guardrails increase in number and scope.

We look forward to seeing how you take this forward, and how thinking on guardrails evolves as the ecosystem matures.