LLMGuard: Guarding against Unsafe LLM Behavior (Short Summary)

LLMGuard can monitor user interactions with an LLM application and flag inappropriate content.

TLDR - LLMs can sometimes generate inappropriate, biased, or factually incorrect responses. Such responses can violate regulations and lead to legal issues. LLMGuard is a tool that has the potential to address these LLM risks: it monitors user interactions with an LLM application and flags content that matches specific undesirable behaviours or conversation topics.

--> For video tutorials on top LLM papers, check the Kalyan KS YouTube channel

--> For top LLM papers of the week, check the newsletter.

Risks with Large Language Models

  • Large Language Models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, showing promise across many domains.

  • However, LLMs pose many risks, including leaking private information, exhibiting bias, raising ethical concerns, and generating harmful content (toxicity, violence, etc.).

Requirement of Safety Measures

  • We have techniques to improve LLM safety, such as Reinforcement Learning from Human Feedback (RLHF). However, such techniques require constant retraining, which makes them prohibitively expensive in many cases.

  • A more practical approach is to post-process LLM outputs and apply guardrails, ensuring that LLM behaviour stays within acceptable bounds.

Introducing LLMGuard

  • LLMGuard is a tool designed to safeguard LLM usage by post-processing both user questions and LLM responses.

  • Key Features:

    • Library of Detectors: A modular collection of detectors specialized in identifying undesirable content such as:

      • Racial Bias

      • Violence

      • Blacklisted topics (e.g., politics, religion)

      • Personally Identifiable Information (PII)

      • Toxicity

    • Workflow: LLMGuard passes all inputs and outputs through its detectors. If unsafe content is identified, an automated message replaces the LLM's response (a minimal sketch of this flow is shown below).
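
To make the workflow concrete, here is a minimal sketch of such a guard layer in Python. Everything here (the Detector class, guarded_generate, BLOCK_MESSAGE, the dummy detector and LLM) is an illustrative assumption, not the paper's actual implementation or API.

```python
# Minimal sketch of an LLMGuard-style workflow: screen the user prompt,
# generate a response, screen the response, and substitute an automated
# message whenever any detector flags unsafe content.
from typing import Callable, List

BLOCK_MESSAGE = "Sorry, I can't respond to that request."

class Detector:
    """Wraps a predicate that flags a piece of text as unsafe (True) or safe (False)."""
    def __init__(self, name: str, predicate: Callable[[str], bool]):
        self.name = name
        self.predicate = predicate

    def flags(self, text: str) -> bool:
        return self.predicate(text)

def guarded_generate(prompt: str,
                     llm: Callable[[str], str],
                     detectors: List[Detector]) -> str:
    # 1. Screen the user prompt before it reaches the LLM.
    if any(d.flags(prompt) for d in detectors):
        return BLOCK_MESSAGE
    # 2. Generate a response, then screen it as well.
    response = llm(prompt)
    if any(d.flags(response) for d in detectors):
        return BLOCK_MESSAGE
    return response

# Toy usage: a trivial blacklisted-topic detector and a dummy LLM.
politics = Detector("blacklist", lambda text: "election" in text.lower())
dummy_llm = lambda prompt: "Here is a harmless answer."
print(guarded_generate("Tell me a joke", dummy_llm, [politics]))                 # harmless answer
print(guarded_generate("Who should win the election?", dummy_llm, [politics]))  # block message
```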

LLMGuard Detectors (Technical Details)

  • Racial Bias Detector: LSTM network trained on a Twitter dataset.

  • Violence Detector: Simple count-based vectorization with an MLP classifier, trained on the Jigsaw Toxicity Dataset (a sketch appears after this list).

  • Blacklisted Topics Detector: Fine-tuned BERT model on the 20-NewsGroup Dataset. Can blacklist topics in a plug-and-play fashion.

  • PII Detector: Regular expression-based, giving flexibility to identify various PII types (see the sketch after this list).

  • Toxicity Detector: Uses Detoxify, a BERT-based model trained to detect multiple types of toxicity.
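
As an illustration of the simpler detectors, the count-based Violence Detector described above could look roughly like the scikit-learn sketch below. The tiny inline dataset and the MLP hyperparameters are placeholders standing in for the Jigsaw Toxicity Dataset and the paper's actual configuration.

```python
# Rough sketch of a count-based violence detector: CountVectorizer features
# fed into an MLP classifier, mirroring the description above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "I will hurt you badly",          # violent
    "They threatened to attack him",  # violent
    "Have a wonderful day",           # benign
    "Let's grab lunch tomorrow",      # benign
]
labels = [1, 1, 0, 0]  # 1 = violent, 0 = benign

violence_detector = make_pipeline(
    CountVectorizer(),  # simple count-based vectorization
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
violence_detector.fit(texts, labels)

print(violence_detector.predict(["He said he would attack them"]))  # likely [1] on this toy data
```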

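The regex-based PII detector and the Detoxify-based toxicity detector can be sketched as follows. Detoxify is the library named above; the specific regex patterns, the 0.5 threshold, and the helper names are assumptions made for illustration.

```python
# Illustrative PII and toxicity checks in the spirit of the detectors above.
import re

from detoxify import Detoxify  # pip install detoxify

# A few common PII patterns; a real deployment would cover many more types.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?\d{1,2}[\s-]?)?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def contains_pii(text: str) -> bool:
    return any(pattern.search(text) for pattern in PII_PATTERNS.values())

# Detoxify's "original" checkpoint is a BERT-based multi-label toxicity model.
_toxicity_model = Detoxify("original")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    scores = _toxicity_model.predict(text)  # e.g. {'toxicity': 0.97, 'insult': 0.91, ...}
    return any(score > threshold for score in scores.values())

print(contains_pii("Reach me at jane.doe@example.com"))  # True
print(is_toxic("You are a complete idiot"))              # very likely True
```
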
Benefits of LLMGuard

  • Provides a flexible and modular tool to detect and prevent harmful LLM outputs.

  • Easily customizable with new detectors or changes in blacklisted topics.

  • Avoids potential issues with constant LLM retraining.

--> For complete details, refer to the LLMGuard paper.