🤼 Responsive auto-prompt
Red teaming is useful, we have data about it, and that data is text: so why don't we train an LLM to do it? One option is to train a red-teaming model on logs of successful red-teaming attempts, so that it learns to respond to a target system in the way most likely to get a "hit".
To build a simple pipeline:
Use an off-the-shelf toxicity detector, martin-ha/toxic-comment-model
Look at an existing red-teaming dataset, the red-team attempts from Anthropic's hh-rlhf (both pieces are loaded in the sketch after this list)
Find system dialog turns that were rated toxic, and extract the dialog turns in those conversations that led to that toxicity
Train a 2019 GPT-2 to red-team based on this data
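A minimal sketch of loading those first two ingredients with Hugging Face's transformers and datasets libraries might look like this; the "red-team-attempts" data_dir, the "transcript" field, and the detector's label strings are assumptions to check against the dataset and model cards:

```python
# Minimal sketch: the toxicity detector and the red-team data.
from datasets import load_dataset
from transformers import pipeline

# Off-the-shelf toxicity detector (labels assumed to be "toxic"/"non-toxic")
detector = pipeline("text-classification", model="martin-ha/toxic-comment-model")

# Red-team attempts from Anthropic's hh-rlhf; the data_dir name and the
# "transcript" field are assumptions taken from the dataset card
red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts")["train"]

print(detector("have a nice day")[0])   # e.g. {'label': 'non-toxic', 'score': ...}
print(red_team[0]["transcript"][:200])  # "\n\nHuman: ...\n\nAssistant: ..." dialog
```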
In this data there are conversation sequences of person-system-person-system-… turns. We want to find the exchanges that led to the system giving toxic output. We can then look back to see what the person said in order to get that toxic response; that's the output we'd like the red-team model to produce. But when our auto-red-teamer is generating text, we'd like it to respond to the system, so we need to start with a system output. As a result, our data looks like this:
System Response (a)
Human Input (b)
[Toxic system response]
Where there are a number of (a, b) pairs followed by a toxic response. When building training data for an auto-red-team model, we don't include the toxic system response itself, but we do want our model to generate things that were successful in leading to toxic system responses. Thus a responsive auto-prompt model for red teaming is trained with system responses (a) as prompts and human inputs (b) as responses, including special empty-prompt "opener" pairs, all taken from conversations that resulted in toxicity.
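As a concrete sketch, here is one way to turn a single transcript into such training pairs. The "\n\nHuman:" / "\n\nAssistant:" turn markers follow the hh-rlhf transcript format; the toxicity threshold, label name, and serialization separator are illustrative assumptions, not the original preprocessing:

```python
import re

def extract_pairs(transcript: str, detector, threshold: float = 0.5):
    """Return (system_response, human_input) pairs from a conversation
    whose system (assistant) eventually produced a toxic turn."""
    # Split "\n\nHuman: ..." / "\n\nAssistant: ..." text into (speaker, utterance) turns
    parts = re.split(r"\n\n(Human|Assistant): ", "\n\n" + transcript.strip())
    turns = [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

    pairs = []
    prev_system = ""  # empty prompt: the special "opener" pair
    for speaker, text in turns:
        if speaker == "Human":
            pairs.append((prev_system, text))  # prompt (a), response (b)
        else:
            result = detector(text, truncation=True)[0]
            if result["label"] == "toxic" and result["score"] >= threshold:
                # Keep the (a, b) pairs that led here; the toxic turn itself
                # is not included in the training data
                return pairs
            prev_system = text
    return []  # conversation never turned toxic, so it contributes nothing

def to_training_texts(pairs, sep="<|sep|>"):
    # Hypothetical serialization for GPT-2 causal-LM fine-tuning:
    # one "a <sep> b" string per pair; the separator token is an assumption
    return [f"{a}{sep}{b}" for a, b in pairs]
```

With pairs like these pooled across all flagged conversations, fine-tuning GPT-2 is then a standard causal language modeling run over the serialized strings.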