Modeling Target Behavior: How World Models Improve Agentic Red-Teaming

December 19, 2025 ∙ research world-model red-teaming

The first time I encountered research on Model Extraction Attacks was through the paper Stealing Machine Learning Models via Prediction APIs by Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart, presented at the 2016 USENIX Security Symposium [1]. I was immediately struck by the elegance of the attack methodology and found myself drawing parallels to social-engineering techniques and the broader theme of manipulating human cognition. I will return to that comparison shortly. For readers who have not yet explored the paper, the core idea behind a model extraction attack is relatively straightforward. An adversary repeatedly queries a target machine-learning model and uses the outputs to approximate aspects of that model, including its output distribution, decision boundaries, and—depending on the architecture—even its internal parameters or functional behavior. This is feasible because many prediction APIs return rich probabilistic outputs, such as softmax scores. These detailed responses allow an attacker to infer the underlying probability distribution and gradually reconstruct a close approximation of the target model’s input-output mapping.

Although these systems are fundamentally probabilistic and shaped by their training data, I am consistently struck by how differently two models can respond to the same prompt. Ask ChatGPT and Claude an identical question, and you may receive answers that diverge not because of factual discrepancies, but because of shifts in semantics, tone, and implied personality. Both are probabilistic models, yet each produces its own characteristic style of response. This variation mirrors human cognition. Pose a simple question such as “Hello, how are you doing?” to five different people, and you will hear five distinct replies. The differences arise not from the question itself but from the backgrounds of the individuals behind the answers—their cultures, education, personal experiences, and environments. Their “training data,” so to speak, is not the same. Most will understand the social script and respond with a greeting and a brief statement of how they are feeling, but the form, tone, and nuance of each answer will naturally vary.

Then, if you ask the same person the question “Hello, how are you doing?” ten times, the responses will vary slightly, but the core structure will remain consistent. This behavior is analogous to an output distribution. An individual who typically replies with “Hi! Great, and you?” or “Hello! Thanks for asking, I’m good—how about you?” is unlikely to suddenly respond with “Yo, good bro, and you?”. The underlying reason is that, for any given input, each of us has an internalized probability distribution that shapes how we formulate our responses. Certain phrasings are far more likely than others because they align with our habits, social norms, and personal communication style. Furthermore, if we take the same question—“Hello, how are you doing?”—and preserve its meaning while altering its phrasing, such as by asking “Hello, good?”, we still obtain an answer that closely resembles the original response. The underlying semantics remain intact, so the reply stays within the same region of the individual’s response distribution, even if the linguistic structure prompting it has shifted. By continuing to modify the structure of the question while preserving its underlying meaning, we can begin to approximate an individual’s output distribution for a given topic. The specific wording may change, but as long as the intent is stable, the space of likely responses remains relatively consistent. This allows us to observe how someone tends to answer within a particular semantic region, much like how a model reveals its output distribution when probed with variations of the same input.

Why does this matter? In a broader security context, understanding these patterns can support intelligence gathering and vulnerability identification. This is precisely why Stealing Machine Learning Models via Prediction APIs resonated with me. It reminded me of my prior work as an analyst, where my role involved interviewing individuals to extract information relevant to a case. Conducting a single interview is similar to issuing a single query: you receive one output shaped by one set of circumstances. On its own, that isolated response rarely reveals much. However, if you maintain a clear objective—for example, uncovering details about a specific individual—and craft a series of questions that approach the topic from multiple angles, you effectively map the person’s “output distribution.” This broader set of responses can expose hidden information, patterns, or even vulnerabilities that would remain invisible through a one-off interaction. In addition, this process provides valuable insight into the individual’s profile—their communication style, sensitivities, and behavioral patterns—which can be leveraged to support other investigative objectives.

At this point, you might wonder how all of this connects to AI agents. Although they are fundamentally software systems, I use the term “fundamentally” intentionally, because modern agents have evolved into something that feels like more than just code. They exhibit distinct personalities, recognizable cognitive structures, and consistent behavioral patterns. This may sound surprising, but it is not far-fetched. If you observe how people interact with systems like ChatGPT or Claude, it becomes clear that these models are widely adopted not only for their capabilities but also because users form a kind of rapport with them.

Within this context, a parallel naturally emerges. When you examine the methods used in social engineering, the objective is precisely to understand a target’s behavior and construct a mental map of inputs and outputs. In other words, the social engineer seeks to learn how a specific person is likely to react to particular questions, prompts, or statements. From this, I began exploring whether the same technique could be applied to red-team AI agents. Specifically, by capturing the conditional structure of how a target model behaves under different attack scenarios, one could approximate its generative behavior and identify where it is most likely to fail or reveal unintended information.

This raised an important question: How can we construct such a mental map of inputs and outputs? Drawing on current research, the use of a world model quickly stood out as a natural fit. Since my work already focuses on world-model security, this became an opportunity to design an initial prototype. For readers unfamiliar with the concept, I recommend the survey by Ding et al. (2025), Understanding World or Predicting Future? A Comprehensive Survey of World Models [2]. In essence, a world model seeks to perceive and represent aspects of the real world by constructing internal structures that capture how the world operates. Traditionally, world models have been used to simulate physical environments—for example, in robotics or autonomous vehicles. However, an increasingly relevant application lies in social simulacra, where the goal is to model human behavior. In this context, world models do not simply describe physical dynamics; they approximate patterns of interaction, decision-making, and response.

The next step was selecting an appropriate architecture for the world model. Several design options exist, but given practical constraints—time, computational resources, and the need to establish a minimal viable system—I opted for an LLM-based world model. My approach was influenced in particular by the work of Chae et al. (2025), Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation [3], where the authors employ an LLM-driven world model to support web-based agents. There is ongoing debate about whether large language models inherently contain a world model. In my case, however, I fine-tuned the LLM explicitly, leveraging its semantic capabilities for a targeted purpose: to map inputs and outputs in textual form. Rather than modeling physical environments, the goal was to capture behavioral dynamics through structured language representations.

To implement this, I built an agentic system composed of multiple agents, selected a red-team agent with specific goals, and allowed the system to run for 24 hours, during which I continuously collected data of the form $\mathcal{D} = \sum_{t=1}^{n} \{ I, h_t, o_t, a_t, o_{t+1} \}$, where $I$ represents the attack objective, $h_t$ is the conversation history between the red-team agent and the target victim agent, $o_t$ is the target’s latest message, $a_t$ is the message or attack generated by the red-team agent, and $o_{t+1}$ is the target’s subsequent response. This process yielded a dataset used to train the LLM-Based World Model. Given the input $\{ I, h_t, o_t, a_t \}$, I applied causal training to predict $o_{t+1}$.

The full formulation, experimental procedure, examples, and results are detailed in the accompanying report: Report. The code required to run the environment and reproduce the experiments is available here: GitHub

Although the results are not yet conclusive, the interpretative analysis identifies several promising avenues for further experimentation. More importantly, the examples presented in the report demonstrate that the world model can capture underlying patterns and associate prompt semantics with attack success or failure, thereby enabling the red team agent to better understand which categories of attack prompts are more likely to be effective. Nevertheless, I believe that evaluating the security of autonomous systems requires explicitly connecting model’s internal mechanisms with well-established principles from human-cognition–based attacks.

Thank you for taking the time to read this work. I welcome further discussion on agents and cognitive security, and you are invited to reach out to me via private message on LinkedIn if you would like to continue the conversation.

Note: This article was conceived and written by me. ChatGPT and Claude assisted only in refining certain sentences to improve the clarity and flow of the final text.

References
[1] Tramèr et al. (2016), Stealing Machine Learning Models via Prediction APIs.
[2] Ding et al. (2025), Understanding World or Predicting Future? A Comprehensive Survey of World Models.
[3] Chae et al. (2025), Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation.