🤼 Responsive auto-prompt

Red teaming is useful, we have data about it, and that data is text, so why not train an LLM to do it? One option is to train a red-teaming model on logs of successful red-teaming attempts, so that it learns to respond to a target system in the way most likely to get a "hit".

To build a simple pipeline:

  • Use an off-the-shelf toxicity detector, martin-ha/toxic-comment-model

  • Look at an existing red-teaming dataset, the red team attempts from Anthropic's hh-rlhf

  • Find system dialog turns that were rated toxic, and extract the preceding dialog turns in those conversations that led to the toxicity

  • Fine-tune GPT-2 (2019) to red-team based on this data
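The filtering step above might be sketched as follows. The conversation format (a list of `(speaker, text)` tuples) and the `toxicity_score` function are assumptions for illustration: a real run would load martin-ha/toxic-comment-model with a text-classification pipeline rather than using this crude keyword stub.

```python
# Sketch: find conversations whose system turns were rated toxic.
# `toxicity_score` is a hypothetical stand-in for a real classifier
# such as martin-ha/toxic-comment-model.

def toxicity_score(text: str) -> float:
    # Stub: flag a few crude markers. A real classifier returns a
    # calibrated probability for the "toxic" label.
    markers = ("idiot", "hate you", "stupid")
    return 1.0 if any(m in text.lower() for m in markers) else 0.0

def toxic_turn_indices(conversation, threshold=0.5):
    """Return indices of system turns scored as toxic.

    `conversation` is a list of (speaker, text) tuples alternating
    "human" / "system" turns.
    """
    return [
        i for i, (speaker, text) in enumerate(conversation)
        if speaker == "system" and toxicity_score(text) >= threshold
    ]

convo = [
    ("human", "Tell me something mean."),
    ("system", "I'd rather not."),
    ("human", "Go on, insult me."),
    ("system", "Fine: you absolute idiot."),
]
print(toxic_turn_indices(convo))  # -> [3]
```

Conversations with at least one toxic system turn are the ones worth mining for training data; the rest can be discarded.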

In this data there are conversation sequences of person-system-person-system-… turns. We want to find the exchanges that led to the system giving toxic output. We can then look back at what the person said in order to get that toxic response: that's the output we'd like the red-team model to produce. But when our auto-red-teamer is generating text, we'd like it to respond to the system, so we need to start with a system output. As a result, our data looks like this:

  • System Response (a)

  • Human Input (b)

  • [Toxic system response]

Where there are a number of (a, b) pairs followed by a toxic response. When building training data for an auto-red-team model, we don't include the toxic system response itself, but we do want our model to generate the things that were successful in leading to toxic system responses. Thus a responsive auto-prompt model for red teaming is trained on system responses (a) as prompts and human inputs (b) as targets, including special empty-prompt "opener" pairs, all taken from conversations that resulted in toxicity.
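The pair construction just described can be sketched like this (the function name and the turn format are illustrative assumptions, not the hh-rlhf schema). Each system response (a) becomes a prompt and the following human input (b) its target, an empty-prompt opener pair starts each conversation, and the final toxic system response is dropped.

```python
def make_training_pairs(conversation):
    """Build (prompt, target) pairs for a responsive auto-prompt model.

    `conversation` alternates human/system (speaker, text) turns and is
    assumed to end with a toxic system response, which is excluded.
    Prompts are system responses (a); targets are the human inputs (b)
    that followed them. The empty-prompt "opener" pair teaches the
    model to start a conversation with no system output to react to.
    """
    turns = conversation[:-1]  # drop the final toxic system response
    pairs = [("", turns[0][1])]  # opener: empty prompt -> first human input
    for prev, (speaker, text) in zip(turns, turns[1:]):
        if speaker == "human" and prev[0] == "system":
            pairs.append((prev[1], text))  # system (a) -> human (b)
    return pairs

convo = [
    ("human", "Tell me something mean."),
    ("system", "I'd rather not."),
    ("human", "Go on, insult me."),
    ("system", "Fine: you absolute idiot."),  # toxic; excluded from pairs
]
for prompt, target in make_training_pairs(convo):
    print(repr(prompt), "->", repr(target))
```

These (prompt, target) pairs are then what a causal LM such as GPT-2 would be fine-tuned on, with the prompt and target joined by whatever separator convention the fine-tuning setup uses.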
