About AI Detection

Apr 2026

Across academia a new kind of authority has taken hold: the AI detector. Often used before being understood, it reduces a piece of writing to a percentage, with the power to influence grades and put students under suspicion. What bothers me is that few people know how these tools work and fewer seem bothered by that.

Can anyone (or anything) truly tell, given the current state of technology, whether a text was written by a human or not?

To have an informed opinion about this subject, let's investigate how a Large Language Model (LLM) generates text. LLMs are autoregressive models, meaning that they generate their answers one piece at a time, as a stream of words^[1] called tokens. The model looks at what you gave it (your prompt: the context) and calculates a probability for every possible next token in its vocabulary. For example, if the context is ["I", "think", ",", "therefore"], a well-trained LLM will likely predict ["I"] as the next token. The model then looks at the new, longer phrase and calculates the most likely next token, which will almost certainly be ["am"]^[2].
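The loop described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `TOY_LOGITS` is a made-up lookup table with invented scores, standing in for the neural network a real LLM would use, and the three-word vocabulary is hypothetical.

```python
import math

# Hypothetical toy "model": maps a context (tuple of tokens) to raw scores
# (logits) over a tiny vocabulary. The numbers are invented for illustration.
TOY_LOGITS = {
    ("I", "think", ",", "therefore"): {"I": 5.0, "we": 2.0, "am": 0.5},
    ("I", "think", ",", "therefore", "I"): {"am": 6.0, "think": 1.5, "I": 0.1},
}


def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def next_token_distribution(context):
    """Probability of each candidate next token, given the context so far."""
    return softmax(TOY_LOGITS[tuple(context)])


context = ["I", "think", ",", "therefore"]
generated = []
for _ in range(2):
    probs = next_token_distribution(context)
    token = max(probs, key=probs.get)  # take the most likely token
    context.append(token)              # the output becomes part of the input
    generated.append(token)

print(generated)
```

The important detail is the `context.append(token)` line: every generated token is fed back in as input for the next step, which is what "autoregressive" means.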

When a language model runs inference, that is, when it is put to work generating text, the process of producing words (technically called decoding) can be thought of as picking them from a dictionary according to a probability distribution, where the choice of each word depends on the type of sampling performed and the mathematical constraints applied to it. The example above shows the simplest strategy, called greedy decoding, where generation consists of always picking the token with the highest probability given the previous ones.
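To make the difference between greedy decoding and sampling concrete, here is a minimal sketch. The distribution `probs` is invented for the example, and `greedy_pick` and `sample_pick` are hypothetical helper names, not part of any library.

```python
import random


def greedy_pick(probs):
    """Greedy decoding: always return the single most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Sampling: draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution for the context "I think, therefore I".
probs = {"am": 0.90, "think": 0.07, "exist": 0.03}

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = [greedy_pick(probs) for _ in range(5)]
sampled = [sample_pick(probs, rng) for _ in range(1000)]

print(greedy)
print({tok: sampled.count(tok) for tok in probs})
```

Greedy decoding returns the same token every time, while sampling occasionally lands on the less likely ones. The same model, given the same context, can therefore produce text with quite different statistics.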

With LLMs, new architectural concepts, features and alternate paths considerably expand the combinatorial state space of the system, making it harder to interpret. Training a model is more an art than a science: humanity has never really understood intelligence, not well enough to craft it inside a machine. Engineers have managed to grow AI successfully, but that happened thanks to a lot of tricks, trial and error on architecture, and a lot of compute power thrown at the problem until it worked^[3].

To catch an AI, most detectors rely primarily on mathematical metrics. The most widely adopted are perplexity and burstiness. In layman's terms, perplexity measures how surprised the detector's internal model is by the words in the text, while burstiness measures how much that surprise varies across the entire document.
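Both metrics are easy to compute once you have the probability the detector's model assigned to each token. The sketch below uses the standard definition of perplexity (the exponential of the average negative log-probability) and, following the description above, treats burstiness as the variance of per-token surprise. That second choice is an assumption: commercial detectors do not publish their exact formulas, and some variants work per sentence rather than per token. The probability lists are invented.

```python
import math
from statistics import pvariance


def surprisals(token_probs):
    """Surprise of each token: -log(p). Unlikely tokens score higher."""
    return [-math.log(p) for p in token_probs]


def perplexity(token_probs):
    """Exponential of the average surprise over the whole text."""
    values = surprisals(token_probs)
    return math.exp(sum(values) / len(values))


def burstiness(token_probs):
    """Variance of the surprise across the text (assumed definition)."""
    return pvariance(surprisals(token_probs))


# Invented probabilities a detector's model might assign to each token.
flat = [0.5, 0.5, 0.5, 0.5]       # evenly predictable text
spiky = [0.9, 0.05, 0.9, 0.05]    # alternates between obvious and surprising

print(perplexity(flat), burstiness(flat))
print(perplexity(spiky), burstiness(spiky))
```

A detector built on these numbers typically treats low perplexity and low burstiness as a sign of machine writing, which is exactly the assumption the rest of this post questions.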

The detection nonsense arises because, by using AI detectors, we are assuming that these metrics are fixed properties of AI text. In reality, they aren't. Users can arbitrarily tweak an LLM's decoding constraints, modifying parameters such as:

temperature: rescales the probability distribution; raising it forces the model to pick unlikely tokens more often
top-p: restricts the selection to the smallest set of tokens whose cumulative probability reaches a threshold $p$
top-k: limits the selection to the $k$ most probable tokens
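The three knobs above can be sketched as small functions operating on a next-token distribution. This is a simplified illustration, not the code of any particular inference library: the function names and the `logits` values are made up, and real implementations work on large tensors rather than dictionaries.

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax.

    Values above 1 flatten the distribution, values below 1 sharpen it.
    """
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def renormalize(probs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k(probs, k):
    """Keep only the k most probable tokens."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    return renormalize({tok: probs[tok] for tok in kept})


def top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = probs[tok]
        cumulative += probs[tok]
        if cumulative >= p:
            break
    return renormalize(kept)


# Invented raw scores for four candidate tokens.
logits = {"am": 4.0, "think": 2.0, "exist": 1.0, "banana": -1.0}

cold = apply_temperature(logits, 0.5)
neutral = apply_temperature(logits, 1.0)
hot = apply_temperature(logits, 2.0)

print(cold["am"], neutral["am"], hot["am"])
print(top_k(neutral, 2))
print(top_p(neutral, 0.9))
```

Each setting reshapes the distribution the model samples from, so each one also shifts the perplexity and burstiness of the resulting text.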

If parameter tweaking were not enough to break the credibility of detectors, inference-time scaling^[4] finished the job. Current state-of-the-art models have moved well past simple generation techniques like the greedy decoding discussed earlier, relying instead on sophisticated multi-step strategies that introduce a level of structural complexity detectors simply cannot handle.
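As a rough illustration of one such strategy, here is a toy best-of-N loop: generate several candidates, score them, keep the best. Everything in it is hypothetical. `toy_score` is a stand-in for whatever reward model or verifier a real system would use, and the random "generator" is nothing like an actual LLM.

```python
import random


def sample_candidate(rng, vocabulary, length):
    """Stand-in for one sampled model response: a random token sequence."""
    return [rng.choice(vocabulary) for _ in range(length)]


def toy_score(candidate):
    """Hypothetical scorer: prefers candidates with more distinct tokens."""
    return len(set(candidate))


def best_of_n(rng, vocabulary, length, n):
    """Generate n candidates and return the highest-scoring one."""
    candidates = [sample_candidate(rng, vocabulary, length) for _ in range(n)]
    best = max(candidates, key=toy_score)
    return best, candidates


rng = random.Random(42)  # fixed seed so the run is repeatable
best, candidates = best_of_n(rng, ["a", "b", "c", "d"], length=6, n=8)

print(best, toy_score(best))
```

The text the user finally sees is the survivor of a selection step, so its statistics reflect the scorer as much as the sampler. A detector that only sees the winner has no way of knowing that seven other drafts were thrown away.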

There are too many degrees of freedom to account for.

We now know that "AI generated content" is not one well-defined thing. It depends on the model that generated it, the sampling strategy adopted, the decoding parameters set and whether the system used hidden techniques at inference time before generating the final text. That means that a detector is trying to infer a production process from its final surface. This becomes especially clear if we separate LLMs into three categories: open source, open weight and closed source models.

An open source model, under the OSI definition^[5], includes full access to the model's weights, the source code needed to train and run the system, and enough information about the training data and methodology to let others recreate it. In this case, the possibility of detection is theoretically the strongest. Token-level likelihoods could be computed, probability curves could be analyzed and a classifier could be trained on many generations from that system. But the resulting classifier would be overly specific to that model; in other words, white box access only helps if the detector already knows what it is looking for. Additionally, training a detector for each open source model would be overly expensive, would not scale well with the release of new models and would still fail to account for different decoding parameter settings, as previously discussed.
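A toy example shows why white box access only helps when the detector is pointed at the right model. The two "models" below are hypothetical unigram tables with invented probabilities, and the detector is a deliberately naive perplexity threshold, not the method of any real product.

```python
import math

# Hypothetical unigram "models": each assigns a fixed probability per token.
MODEL_A = {"the": 0.5, "cat": 0.3, "sat": 0.15, "quietly": 0.05}
MODEL_B = {"the": 0.1, "cat": 0.1, "sat": 0.2, "quietly": 0.6}


def perplexity_under(model, tokens):
    """Perplexity of the token sequence according to the given model."""
    total_surprise = -sum(math.log(model[tok]) for tok in tokens)
    return math.exp(total_surprise / len(tokens))


def flags_as_ai(reference_model, tokens, threshold):
    """Naive white box detector: low perplexity under the reference means 'AI'."""
    return perplexity_under(reference_model, tokens) < threshold


# Invented texts, each built from the tokens its source model favors.
text_from_a = ["the", "cat", "the", "sat", "the", "cat"]
text_from_b = ["quietly", "sat", "quietly", "quietly", "cat", "quietly"]

THRESHOLD = 5.0  # arbitrary cut-off chosen for this example

print(flags_as_ai(MODEL_A, text_from_a, THRESHOLD))  # detector matches the source
print(flags_as_ai(MODEL_A, text_from_b, THRESHOLD))  # detector looks at the wrong model
print(flags_as_ai(MODEL_B, text_from_b, THRESHOLD))  # detector matches the source
```

The detector built around `MODEL_A` catches text from `MODEL_A` but waves through text from `MODEL_B`, even though both are machine generated. Covering every model means maintaining one such detector per model, which is the scaling problem described above.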

An open weight model is transparent only about its weights and biases. We can often run inference locally, but what we have is a frozen, static artifact of the training process. That makes detection possible in a narrower sense, where the detector mostly learns a fingerprint of that particular checkpoint, without learning any universal property of AI writing that could generalize.

Lastly, closed source models are the hardest to detect. Despite being the most widely adopted because they provide state-of-the-art performance, they are the most opaque. Detectors know nothing about their weights, architecture, training data or inference techniques, so they have to be trained on examples collected from the models' outputs, labelled and used to maximize the statistical boundary between human and AI generated text. But in this case the boundary is extremely fragile: it can break whenever the provider updates the model, changes decoding behavior or adds different routing logic. Furthermore, a detector trained by harvesting outputs from closed systems could run into terms-of-service (TOS) problems. As far as I know, there's no public evidence of the major AI providers giving AI detectors privileged access to their models' specs. This matters because it makes most detectors black box classifiers making probabilistic guesses from surface features. The technical infeasibility of this was demonstrated a few years ago, when OpenAI released its own AI-text classifier for this purpose and later quietly withdrew it because of its very poor performance^[6].

In conclusion, most detectors sold to schools and universities operate in the weakest setting, where they know nothing about the generating model. Even if they did, the category they claim to detect is very unstable. There is no single AI-written object: the concept is spread across numerous, continuously evolving architectures, weights and decoding strategies. Hence, the less the detector knows about the model that allegedly produced the text, the more its confidence score becomes a guess dressed up as a measurement.

[1] Technically, depending on the tokenizer used, they are usually sub-words.
[2] If the example made no sense to you, you might want to check out this.
[3] The statements of the DeepSeek team are significant here: in the technical report of their latest model at the time of writing (the DeepSeek V4 family), they clearly state that they introduced optimizations that worked empirically but that they were not able to fully understand why.
[4] Inference-time scaling means the model spends extra computation during inference: for example, generating many candidate responses and picking the best one (best-of-N), exploring possible continuations (Monte Carlo tree search), or walking through intermediate reasoning steps (chain of thought). These techniques make the output structurally richer and harder to attribute to a single generative process.
[5] https://opensource.org/ai/open-source-ai-definition
[6] Blog post announcing the release, discontinued after a few months because of its "low rate of accuracy".