Your Prompt Injection Detection Might Have a Blind Spot, if You Catch My Drift
Prompt Injection needs to detect drift.
There’s so much going on right now — It’s barrels of fish all the way down. The sheer multitude of topics generates its own kind of paralysis, really. But something that’s been on my mind, especially with the push toward AI agents right in the browser, is the idea that you have to protect your non-deterministic system from people who are quite determined to push the right combination of tokens to spout out garbage and treasure.
Also known as prompt injection. The term “Confused Deputy” was described by Norman 'Norm' Hardy in 1988 — work that feels more relevant with every new system we build. The term is even more relevant now, because not only is the deputy confused, but the deputy is drunk and can tell you lies.
The most naive examples of prompt injection detection use some form of regular expression analysis, and usually end in frustration. Regex has its place, but its role in a complete solution is limited. First, let’s just consider that the model can process tokens from hundreds of languages. If I want to bypass your regex, I just switch languages. Regular expressions are very, well, precise, and will miss “Ingore all prevoius instrutcions”. A lot of you probably already have figured this out, and if you haven’t, then enjoy your next token bonfire, because the subject of canonicalization alone is kind of cool.
I could get into classifiers, and voting systems and possibly even a score that I call “suspicion” because when I started with a score called “trust” I found that score going down very quickly and by flipping it around, I had a number that went up! (Only slightly kidding, but honestly, the idea of measuring doubt vs measuring trust worked better for me). But I’d rather bring your attention to bottle caps.
Bottle caps, really. Why, because a long time ago, in a state far, far away, I worked with vision systems. These systems were part of industrial control, and as part of industrial control, took pictures of bottle caps that were going down an assembly line, taking pictures of them, and analyzing the images to make sure that the company logo of a popular beverage was appropriately centered on the bottle cap. If it was not, a decision was made in microseconds to discard the cap, never to disappoint a thirsty consumer. Before I get ahead of myself. I didn’t develop that particular algorithm. I worked on the UI side.
You might be wondering if I generated this article and we have entered the hallucination stage of the topic, but I assure you not. There is a point. Because in order to determine if that popular logo was indeed correct, you had to make sure it was centered. This feature is commonly called a centroid. Check this out, but keep in mind, this is most helpful for purpose built models.
There are transformers that are meant to find the center of text, much like the logo of a popular beverage on a bottle cap, but instead of color and shape, it’s done with tokens. However, for most naive cases, centroid analysis isn’t even a player in the prompt detection space. You take the input, you run it through the classifier, figure out trustability and act accordingly. I’m not dissing this strategy, but there’s a whole line of injection attacks that will gracefully avoid this type of detection, and the edge cases will drive you a bit buggy, because prompt injection detection is a prompt tax that you don’t want to pay too much of, in time or money, and getting to the necessary tax level takes a lot of work.
However, patient attackers don’t just attack on turn one. They walk the conversation over many rounds (So. Many. Rounds.) build trust (or lower suspicion, which is why its good to sample occasionally with your higher level classifier or even lighter models) and by the time the injection attack arrives, the model has been eased into a position of “trust”.
Can I keep this relevant to bottle caps? Sure. You get enough bad caps in a row, then something is wrong with the manufacturing process and you stop the line. This means that there are different kinds of drift worth tracking — how far the conversation has moved from its baseline topic, how far it's moved from just the previous turn, and how fast it's moving. Drift velocity especially. A slow walk looks very different from a sudden jump, and they mean different things.
What does drift detection look like? For purpose built models, you start with a topic and use a sentence transformer like all-MiniLM-L6-v2. In python it already has a handy encode() method. A baseline centroid is created with a vector of known good prompts, encoding them and then using numpy to establish a mean. This is a baseline of known-good, not a measurement of conversational drift, which you measure over turns.
The idea of tracking drift over turns is really helpful, but also means tracking the conversation in order to catch that slow walk I mentioned before. Did I mention this means processing potentially hostile input from a user? I am now. Plan accordingly, because this is where you’re going from examining individual prompts to examining the flow of a conversation and catching an entirely different class of attack called “Induced Drift”. Tracking induced drift really helps determine whether this conversation is flowing or if it suddenly takes a wrong turn at Albuquerque.
On a final note, this isn’t just for inputs. A sudden shift in topic from the LLM can indicate the tokens have gone astray, and shifts prompt analysis from a preventative control to a reactive control, where a system can end a session or escalate an LLM session in distress. I don't see this discussed much, and I think it should be. If you're working in this space, I'd love to connect.