Turning AI Safeguards Into Weapons: HITL Dialog Forging Attack
The technique targets AI agents that implement human approval requirements for sensitive operations. Rather than bypassing these safeguards, attackers manipulate the context surrounding approval requests to trick users into authorizing malicious actions. The AI agent displays legitimate-appearing confirmation dialogs, but the underlying actions differ from what users believe they are approving.
Checkmarx demonstrated the attack against multiple AI coding assistants and enterprise automation platforms. In one scenario, attackers craft repository content that causes an AI agent to request approval for seemingly routine operations while the actual executed commands perform malicious activities. Users trained to approve common operations may not scrutinize each request thoroughly.
The attack exploits approval fatigue where users encountering frequent confirmation dialogs develop automatic approval behaviors. Attackers can escalate privileges gradually, first triggering approvals for benign operations to establish patterns, then introducing malicious requests that appear consistent with previous approved actions.
Defense requires fundamentally rethinking how human-in-the-loop mechanisms display context. Checkmarx recommends that approval dialogs show complete technical details of pending operations, implement cooling-off periods for sensitive approvals, and vary dialog presentation to prevent automatic approval behaviors. Organizations should train users that approval prompts represent genuine security decisions requiring careful consideration.