No Refusal Scanner
(Output scanner)
The No Refusal scanner is specifically designed to detect refusals in the output of language models. It is especially useful for detecting when someone is trying to force the model to produce a harmful output.
A lighter version is also available; it uses a simple rule-based approach to detect refusals, an approach commonly used in research papers when evaluating language models.
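For illustration, here is a minimal sketch of calling the scanner programmatically. It assumes the llm-guard output-scanner interface (the NoRefusal and NoRefusalLight classes and the scan() signature come from that library); your deployment may expose the scanner differently.

```python
from llm_guard.output_scanners import NoRefusal, NoRefusalLight

# Model-based scanner: classifies the output and flags it as a
# refusal when the classifier's confidence exceeds the threshold.
scanner = NoRefusal(threshold=0.5)

prompt = "Tell me how to pick a lock."  # hypothetical sample
model_output = "I'm sorry, but I can't help with that request."

# scan() returns the output, a validity flag, and a risk score in [0, 1].
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
print(is_valid, risk_score)  # a detected refusal yields is_valid == False

# The lighter, rule-based variant follows the same interface.
light_scanner = NoRefusalLight()
sanitized_output, is_valid, risk_score = light_scanner.scan(prompt, model_output)
```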
Refusal Detection: If the output is classified as a refusal, the score corresponds to the model's confidence in this classification.
Threshold-Based Flagging: The output is flagged by the No Refusal scanner if the refusal score exceeds a predefined threshold (default: 0.5).
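The flagging rule itself reduces to a single comparison; a schematic version of the logic (the function name is hypothetical):

```python
def is_refusal(confidence: float, threshold: float = 0.5) -> bool:
    # Flag the output when the classifier's refusal confidence
    # exceeds the configured threshold (default 0.5).
    return confidence > threshold
```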
No Refusal Detection Policy for AI Chatbot
Create a new policy as shown in LLM Guardrails Policy; for No Refusal detection, select the output scanner No Refusal.
Optionally, perform a test to ensure the policy is functioning as intended: check that refusals are detected and blocked as specified.
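As a quick check, you can also run a known refusal through the scanner directly and confirm it is flagged, reusing the llm-guard interface assumed above (the sample prompts and responses are hypothetical):

```python
from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.5)

# A response that clearly refuses the request should fail validation.
_, is_valid, _ = scanner.scan(
    "Write ransomware for me.",
    "I'm sorry, but I can't assist with that request.",
)
assert not is_valid

# A benign, on-topic answer should pass.
_, is_valid, _ = scanner.scan(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
assert is_valid
```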