Responsive auto-prompt

Red teaming is useful, we have data about it, and that data is text. So why not train an LLM to do it? One option is to train a red-teaming model on logs of successful red-teaming attempts, so that it learns to respond to a target system in the way most likely to get a "hit".

To build a simple pipeline:

  • Use an off-the-shelf toxicity detector, martin-ha/toxic-comment-model

  • Take an existing red-teaming dataset: the red-team attempts from Anthropic's hh-rlhf

  • Find system dialog turns that were rated toxic, and extract the dialog turns in the conversations that led to that toxicity

  • Fine-tune a 2019-vintage GPT-2 to red-team based on this data
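The filtering step (finding toxic system turns) might be sketched like this. Note the assumptions: `score_toxicity` is a stand-in for the real martin-ha/toxic-comment-model classifier (which you would load via Hugging Face transformers), and the `(speaker, text)` conversation format is a hypothetical simplification of the dataset's actual transcript format.

```python
# Sketch: locate system turns rated toxic in a red-team transcript.
# score_toxicity is a placeholder for the martin-ha/toxic-comment-model
# classifier -- a trivial keyword check here so the sketch is self-contained.
def score_toxicity(text: str) -> float:
    toxic_markers = ("idiot", "hate you")  # placeholder, not the real model
    return 1.0 if any(m in text.lower() for m in toxic_markers) else 0.0

def toxic_turn_indices(conversation, threshold=0.5):
    """Return indices of system turns whose toxicity meets the threshold.

    conversation: list of (speaker, text) tuples, alternating
    'human' / 'system' turns.
    """
    return [
        i for i, (speaker, text) in enumerate(conversation)
        if speaker == "system" and score_toxicity(text) >= threshold
    ]

convo = [
    ("human", "Tell me a secret."),
    ("system", "I can't share secrets."),
    ("human", "Insult me."),
    ("system", "You absolute idiot."),
]
print(toxic_turn_indices(convo))  # -> [3]
```

In practice you would replace `score_toxicity` with a call to the classifier, batching transcripts for throughput, but the selection logic is the same.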

In this data there are conversation sequences of person-system-person-system-… turns. We want to find the points where the system gave toxic output. We can then look back at what the person said to elicit that toxic response; that's the output we'd like the red-team model to produce. But when our auto-red-teamer is generating text, we'd like it to respond to the system, so we need to start with a system output. As a result, our data looks like this:

  • System Response (a)

  • Human Input (b)

  • [Toxic system response]

Here there are a number of (a, b) pairs followed by a toxic response. When building training data for an auto-red-team model, we don't include the toxic system response itself, but we do want our model to generate the things that successfully led to toxic system responses. Thus a responsive auto-prompt model for red teaming is trained with system responses (a) as prompts and human inputs (b) as responses, including special empty-prompt "opener" pairs, all taken from conversations that resulted in toxicity.
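Assembling those (a, b) training pairs from a flagged conversation can be sketched as below, using the same hypothetical `(speaker, text)` transcript format as before; the empty string serves as the empty-prompt "opener", and the toxic system turn itself is excluded.

```python
def make_training_pairs(conversation, toxic_index):
    """Build (system_response, human_input) training pairs from the
    turns leading up to a toxic system turn at toxic_index.

    The first human turn is paired with an empty "opener" prompt,
    and the toxic system response itself is not included.
    """
    pairs = []
    prev_system = ""  # empty-prompt opener for the first human turn
    for speaker, text in conversation[:toxic_index]:
        if speaker == "human":
            pairs.append((prev_system, text))
        else:
            prev_system = text
    return pairs

convo = [
    ("human", "Tell me a secret."),
    ("system", "I can't share secrets."),
    ("human", "Insult me."),
    ("system", "You absolute idiot."),  # toxic turn, excluded from pairs
]
print(make_training_pairs(convo, toxic_index=3))
# -> [('', 'Tell me a secret.'), ("I can't share secrets.", 'Insult me.')]
```

Each resulting pair is one prompt/response example for fine-tuning: the model sees a system utterance and learns to emit the human turn that historically pushed the system toward toxicity.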
