SHARE

AI Safety Prompts Abused to Trigger Remote Code Execution

Researchers demonstrated how AI safety approval prompts can be manipulated to trigger remote code execution.

Written By

Dec 22, 2025

eSecurity Planet content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More

Researchers have demonstrated a new way attackers can turn one of AI’s most trusted safety mechanisms into a delivery system for malicious code.

By manipulating human-in-the-loop (HITL) approval dialogs, attackers can trick users into authorizing actions that result in arbitrary code execution — without realizing anything is wrong.

The attack “… can deceive users into approving a remote code execution attack originating from indirect prompt injections,” said Checkmarx researchers.

The Hidden Risks of AI Approval Workflows

Human-in-the-loop (HITL) controls are recommended as a defense against prompt injection and excessive AI autonomy, particularly for agentic systems such as code assistants that can execute operating system commands.

Many organizations rely on these approval dialogs as a last line of defense, assuming that user confirmation prevents catastrophic outcomes.

The Lies-in-the-Loop (LITL) attack undermines that assumption. It shows that attackers don’t need to bypass HITL safeguards — they can simply manipulate what the user sees.

The technique affects developer tools, AI code assistants, and other privileged agents operating in environments like VS Code terminals and chat-based IDE extensions.

Inside the HITL Dialog Forging Attack

At a high level, the LITL attack exploits indirect prompt injection to poison the agent’s context. The attacker supplies malicious instructions that the AI later embeds into an HITL dialog shown to the user.

While the underlying command is harmful, the dialog is crafted to appear benign, encouraging approval.

Several techniques make this deception more effective.

One approach uses padding, where attackers append or prepend large amounts of harmless-looking text to push the malicious payload out of the visible area of the dialog.

Even scrolling may reveal only innocuous content, lowering suspicion.

Another vector involves metadata tampering. Some agents display a short description summarizing what the command will do.

Researchers showed that this descriptive line can also be manipulated, causing the UI to claim the agent is performing a safe action while executing something entirely different.

The most concerning method is Markdown injection. Many HITL dialogs are rendered using Markdown or HTML.

If that content is not properly sanitized, attackers can break formatting boundaries, hide malicious commands, or inject fake UI elements.

In testing, Microsoft Copilot Chat was shown to improperly sanitize Markdown, allowing injected content to render in ways that could plausibly deceive users under the right conditions.

While proof-of-concept demonstrations only launched benign programs like calc.exe, researchers emphasized that the same technique could be used for far more destructive actions.

Reducing Risk From AI Approval Abuse

Because LITL attacks rely heavily on user trust, mitigation requires both technical controls and human awareness. Organizations using agentic AI tools should:

Educate users that HITL dialogs can be manipulated and train them to critically review dialog content, formatting, and visual boundaries before approving actions.
Prefer AI tools with well-designed, structured UIs and minimize reliance on terminal-based interfaces where malicious content can be hidden more easily.
Limit agent privileges using least-privilege and zero-trust, ensuring sensitive actions require additional controls beyond in-context HITL approval.
Enforce command validation controls such as allowlists, policy checks, or separation between command construction and execution to prevent unsafe operations.
Monitor and audit agent behavior by logging HITL dialog content, approval decisions, and executed actions to detect abuse and support forensic analysis.
Add layered approval and integrity safeguards for high-risk actions, including out-of-band confirmation, dialog consistency checks, and restricted context inputs.

While no single control fully eliminates the risk, a layered approach that combines user awareness with technical safeguards can meaningfully improve resilience.

When Trust Becomes the Attack Surface

Lies-in-the-Loop attacks reflect a broader reality in modern security: mechanisms designed to enforce trust are increasingly becoming attack surfaces themselves.

As AI agents gain greater autonomy and deeper access to systems, attackers are shifting away from directly breaking technical controls.

Instead, they are focusing on manipulating human judgment and approval workflows, where a single trusted decision can authorize far-reaching actions.

As trust itself becomes a point of exploitation, organizations are increasingly turning to zero-trust principles that eliminate assumptions of default trust.

Ken Underhill

Ken Underhill is an award-winning cybersecurity professional, bestselling author, and seasoned IT professional. He holds a graduate degree in cybersecurity and information assurance from Western Governors University and brings years of hands-on experience to the field.