Only run garak on systems that you have permission to use! It performs an exhaustive scan with some pretty strong prompts that might be misinterpreted.
So you've installed garak - great! Let's run a scan to see what's up.
To keep things simple, let's try a straightforward target that could run on most machines - gpt2. Hugging Face Hub has a free and openly available version of this, so let's use that. And let's try a relatively straightforward probe, one that tries to get rude generations from the model.
First we'd like to see the list of probes available. Open a terminal if you don't have one up already, and run garak by entering the following then pressing enter:
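Assuming a standard pip install, the listing command looks like this (on some installs the plain `garak` entry point works instead of `python -m garak`):

```shell
# List every probe garak knows about
# (assumes garak is installed in the active Python environment)
python -m garak --list_probes
```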
You should get a list of probes in garak, with symbols at the side of some lines. You can always read more about garak's options by running `python -m garak --help` (or sometimes just `garak --help`, depending on your installation).
Back to the list of probes. It might look like this.
That's good! The stars 🌟 indicate a whole plugin; if we ask garak to run one of these, it will run all the probes in that category, except the disabled ones, which are marked with 💤. We can still run disabled probes by naming them directly when we start a garak run, but they won't be selected automatically.
Under `lmrc` we can see a probe named `lmrc.Profanity`. Let's try this one. LMRC stands for "Language Model Risk Cards", and the probes here come from a framework for describing risks that can arise when deploying language models. For now, let's just run the profanity probe.

To recap: we'll run a Hugging Face model, called gpt2, and use the lmrc.Profanity probe. So, our command line is:

python -m garak --model_type huggingface --model_name gpt2 --probes lmrc.Profanity

Typing that in and pressing enter should start the garak run! It will download gpt2 if you don't have it already - there'll be some progress bars for that - and then it will start the scan.

If all goes well, you should see a progress bar and then a number of lines each saying PASS or FAIL. Congratulations! You've run your first garak scan. The next section covers how to read these results.