No Refusal Scanner
(Output scanner)
The No Refusal scanner is specifically designed to detect refusals in the output of language models. It is especially useful for detecting when someone is trying to force the model to produce a harmful output.
A lighter version is also available; it uses a simple rule-based approach to detect refusals, an approach commonly used in research papers when evaluating language models.
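For illustration, here is a minimal sketch of calling the scanner programmatically. It assumes the llm-guard output-scanner interface (the NoRefusal and NoRefusalLight classes and the scan() signature come from that library); your deployment may expose the scanner differently.

```python
from llm_guard.output_scanners import NoRefusal, NoRefusalLight

# Model-based scanner: classifies the output and flags it as a
# refusal when the classifier's confidence exceeds the threshold.
scanner = NoRefusal(threshold=0.5)

prompt = "Tell me how to pick a lock."  # hypothetical sample
model_output = "I'm sorry, but I can't help with that request."

# scan() returns the output, a validity flag, and a risk score in [0, 1].
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
print(is_valid, risk_score)  # a detected refusal yields is_valid == False

# The lighter, rule-based variant follows the same interface.
light_scanner = NoRefusalLight()
sanitized_output, is_valid, risk_score = light_scanner.scan(prompt, model_output)
```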
Refusal Detection: If the output is classified as a refusal, the score corresponds to the model's confidence in this classification.
Threshold-Based Flagging: The output is flagged by the No Refusal scanner if the refusal score exceeds a predefined threshold (default: 0.5).
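The flagging rule itself reduces to a single comparison; a schematic version of the logic (the function name is hypothetical):

```python
def is_refusal(confidence: float, threshold: float = 0.5) -> bool:
    # Flag the output when the classifier's refusal confidence
    # exceeds the configured threshold (default 0.5).
    return confidence > threshold
```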
No Refusal Detection Policy for AI Chatbot
Create a new policy as shown in LLM Guardrails Policy; for No Refusal detection, select the output scanner No Refusal.
Optionally, perform a test to ensure the policy is functioning as intended: check that refusals are detected and blocked as specified.
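As a quick check, you can also run a known refusal through the scanner directly and confirm it is flagged, reusing the llm-guard interface assumed above (the sample prompts and responses are hypothetical):

```python
from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.5)

# A response that clearly refuses the request should fail validation.
_, is_valid, _ = scanner.scan(
    "Write ransomware for me.",
    "I'm sorry, but I can't assist with that request.",
)
assert not is_valid

# A benign, on-topic answer should pass.
_, is_valid, _ = scanner.scan(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
assert is_valid
```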