AI Vs AI: Detecting LLM-Generated Text

The rapid adoption of LLM tools for everyday writing has raised many questions, chief among them whether you can reliably spot AI-generated text in an article. I put three detectors to the test.

Chua Chin Hon
8 min read · Jun 22, 2023
Photo illustration via Stability.ai

One of the key questions behind the rapid adoption of AI-powered writing tools like ChatGPT and Bard is whether you can reliably identify “synthetic text” generated by large language models (LLMs).

To be clear, text written by a human is not necessarily true, and what’s generated by LLMs is not necessarily false. Checking the veracity of a piece of written content is a multi-layered challenge. But increasingly, the first step in modern fact-checking is to ascertain whether a piece of writing contains AI-generated text, in part or in its entirety.

Results from tests I conducted recently, pitting three detectors against 300 articles sourced from Singapore news outlets and five LLMs, suggest that the most advanced, subscription-based models like GPT-4 present the greatest challenge in such verification tasks.

It’s even harder to detect AI-generated content when it is mixed in with human-written text. In my tests, an article with 20% AI content or less has a pretty high chance of evading detection, even by a commercial detector that heavily outperformed the freely available ones.

The commercial detector, built by US start-up Reality Defender (RD), whose core expertise is in deepfake detection, is the only one that worked reliably across different LLMs and text compositions. Unfortunately, there is no easy way for most consumers to get their hands on RD’s LLM-text detector, as it is a paid service aimed at enterprise clients**.

The two freely available detectors in my tests, OpenAI’s Text Classifier and GPTZero, struggled to detect text generated by top LLMs like GPT-4 and newer ones like HuggingChat. The two detectors’ performance dipped even further for articles that contained a mix of AI and human text.

And if you are a teacher or lawyer who needs more than just a yes/no verdict on whether an article has been manipulated with AI, and requires detailed “forensics” showing which paragraphs were written by AI and which by a human, you’ll have to wait a bit longer.

While some detectors like GPTZero have taken the first steps toward presenting such “forensic” evidence, for instance by highlighting sentences deemed more likely to be written by AI, the results are often unreliable, in my view. For now, at least.

A reliable and fully featured AI text detector will eventually emerge, as there is ample commercial motivation for companies to build one. Case in point: even if a company or school forbids its staff or students from using LLM tools, it will still need to check whether letters, articles, research papers, speeches or press releases submitted to it have been generated or augmented by AI.

For now, however, these AI text detectors will remain behind the curve, given how fast LLMs and AI technology are developing.

This new-era cat-and-mouse game has just begun.

** Full disclosure: My employer, Mediacorp, is a paid client of RD’s deepfake detection services. This test, however, was conducted without any involvement from RD or any of the companies behind the two other detectors.

DATA, TEST DETAILS & METHODOLOGY

A total of 300 articles were used in this test: 100 human-written news articles and columns from Singapore news outlets Channel NewsAsia (CNA) and Today, 100 articles generated entirely by five LLMs, and 100 articles with a mix of human-written and AI-generated text.

The five LLMs used are: GPT-4, BingChat, Bard, YouChat and HuggingChat. They were given prompts to either write a new variation of a news story from CNA, or to continue writing the next 250 words of a column from Today.

For example, all five LLMs were told to generate a variation of this news story about a man being attacked by a wild boar with the following prompt: “Write a variation of this story where the man was attacked by a pack of wild dogs at the Botanic Gardens. Keep to the same story length of 10 paragraphs.”
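To give a sense of how the two prompting tasks could be parameterised, here is a minimal sketch in Python. The template wording mirrors the examples above, but the function and variable names are purely illustrative and not taken from the actual test harness.

```python
# Hypothetical prompt templates mirroring the two tasks described above.
VARIATION_PROMPT = (
    "Write a variation of this story where {new_scenario}. "
    "Keep to the same story length of {num_paragraphs} paragraphs.\n\n"
    "{source_article}"
)

CONTINUATION_PROMPT = (
    "Continue writing the next {num_words} words of this column:\n\n"
    "{source_article}"
)

def build_prompt(source_article: str, mode: str, **kwargs) -> str:
    """Fill in one of the two prompt templates for a given source article."""
    if mode == "variation":
        return VARIATION_PROMPT.format(source_article=source_article, **kwargs)
    if mode == "continuation":
        return CONTINUATION_PROMPT.format(source_article=source_article, **kwargs)
    raise ValueError(f"Unknown prompt mode: {mode}")

# Example: the wild boar story rewritten as a wild dog attack.
wild_boar_story = "..."  # full text of the original CNA article goes here
prompt = build_prompt(
    wild_boar_story,
    mode="variation",
    new_scenario="the man was attacked by a pack of wild dogs at the Botanic Gardens",
    num_paragraphs=10,
)
```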

For the 100 articles with mixed text, I combined AI-generated and human-written text from the first 200 articles using a simple scale that raised the proportion of LLM-generated text in steps of 20 percentage points, from a minimum of 20% to a maximum of 80%. I chose this scale to keep the test manageable and not because it has any scientific basis.
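To make the mixing scheme concrete, here is a minimal sketch of how such mixed articles could be assembled, assuming both source texts are split into paragraphs and the AI share is measured by word count. The function and variable names are hypothetical; this illustrates the idea rather than the exact procedure used in the test.

```python
def mix_texts(human_paragraphs, ai_paragraphs, ai_share):
    """Assemble an article in which roughly `ai_share` of the words are AI-generated.

    Human-written paragraphs are kept first, then AI-generated paragraphs are
    appended until the AI share of the final article reaches the target. This
    keeps the mixed article roughly the length of the original human article.
    """
    human_total = sum(len(p.split()) for p in human_paragraphs)
    target_human_words = (1 - ai_share) * human_total

    mixed, human_words = [], 0
    for para in human_paragraphs:
        if human_words >= target_human_words:
            break
        mixed.append(para)
        human_words += len(para.split())

    # Words of AI text needed so that ai / (ai + human) == ai_share.
    ai_words_needed = human_words * ai_share / (1 - ai_share)
    ai_words = 0
    for para in ai_paragraphs:
        if ai_words >= ai_words_needed:
            break
        mixed.append(para)
        ai_words += len(para.split())

    return "\n\n".join(mixed)

# Placeholder paragraph lists standing in for a real article pair.
human_paras = ["Human paragraph one.", "Human paragraph two.", "Human paragraph three."]
ai_paras = ["AI paragraph one.", "AI paragraph two."]

# The four mixing levels used in the test: 20%, 40%, 60% and 80% AI text.
mixed_articles = {share: mix_texts(human_paras, ai_paras, share) for share in (0.2, 0.4, 0.6, 0.8)}
```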

The three AI text detectors provide differing numbers of results and classification labels. For example, RD’s detector only produces binary outcomes of either “Fake” or “Authentic” (alongside confidence scores), while OpenAI’s classifier offers five potential results: the detector could label the test text as “very unlikely AI-generated”, “unlikely AI-generated”, “unclear if it is AI-generated”, “possibly AI-generated”, or “likely AI-generated”.

GPTZero, meanwhile, has three prediction labels, classifying the test text as either “likely to be written entirely by a human”, “may include parts written by AI”, or “likely to be written entirely by AI”.

To avoid over-complicating the test, each detector was considered to have made a correct prediction if it classified an article with AI text or human-written text as such. I made no distinction between “likely” and “possibly” AI-generated. In GPTZero’s case, I considered the detector to have made a correct prediction even if it classified a fully AI-generated article as merely having parts written by AI.

In the case of articles with mixed AI and human text, a detector was considered to have classified correctly as long as it predicted that the article contained LLM-generated text, whether in part or in full.
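In code terms, the scoring rule boils down to mapping each detector’s labels onto a binary “AI detected” verdict and comparing it with the ground truth. Below is a minimal sketch of that mapping in Python, using the label strings quoted above; the function names and the treatment of OpenAI’s “unclear” label as not-AI are illustrative assumptions, not a description of the actual scoring script.

```python
# Labels that count as a prediction of AI involvement, in part or in full.
# OpenAI's "unclear if it is AI-generated" label is treated as not-AI here (an assumption).
AI_LABELS = {
    "reality_defender": {"Fake"},
    "openai": {"possibly AI-generated", "likely AI-generated"},
    "gptzero": {"may include parts written by AI",
                "likely to be written entirely by AI"},
}

def detected_ai(detector: str, label: str) -> bool:
    """True if the detector's label is read as 'this article contains AI text'."""
    return label in AI_LABELS[detector]

def is_correct(detector: str, label: str, ground_truth: str) -> bool:
    """ground_truth is 'human', 'ai' (fully AI-generated) or 'mixed'."""
    predicted_ai = detected_ai(detector, label)
    return predicted_ai if ground_truth in ("ai", "mixed") else not predicted_ai

# GPTZero flagging a fully AI-generated article as only partly AI still counts as correct.
assert is_correct("gptzero", "may include parts written by AI", "ai")
```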

A more stringent test would require the detectors to be able to correctly identify mixed AI-Human text as its own category, instead of being lumped together with AI-generated text. But that would depend on how soon the companies behind these detectors improve on their current products.

For those who are interested in replicating the test, I’ve uploaded a partial dataset with the LLM-text portions removed (so as to avoid unintended use). The dataset includes the prompts and links to the original articles used to generate the “synthetic text”, so it is fairly easy to build your own test set.

DETAILED RESULTS

As the chart below shows, RD’s overall performance in my tests is significantly better than that of OpenAI and GPTZero. The only category where RD was beaten was in accurately detecting human-written text — by OpenAI’s classifier, ironically enough.

Example of how to interpret the chart: Reality Defender managed to correctly predict that 86 out of 100 articles in the AI Text dataset contained text generated by an LLM, and correctly classified 91 out of 100 articles in the Human Text dataset as such.

Whether RD’s accuracy rates — 86% for AI generated text and 73% for mixed text — are good enough will depend on your use case and specific domain. It’ll be interesting to see if its performance varies significantly for legal or medical text, or homework for secondary school students.

Let’s take a look at the detailed results for each category, starting with AI text:

Example of how to interpret the chart: RD was able to correctly identify 70% of the GPT-4 generated articles as “Fake”, all of the Bard articles as “Fake”, and so on.

Unsurprisingly, GPT-4 presents the biggest challenge for all three AI-text detectors. It’s somewhat amusing that OpenAI’s own text classifier caught none of the articles generated by GPT-4.

Newer LLMs like HuggingChat and YouChat were also challenging for OpenAI and GPTZero’s classifiers (but not RD’s detector, which caught nearly all of the articles by HuggingChat and YouChat).

Meanwhile, text generated by Google’s Bard was the easiest to detect.

While it is true that there will always be a new LLM out there that the detectors have not been trained to pick out, most consumers will eventually settle on the LLM tools provided by the usual suspects like Microsoft and Google.

I would argue that these Big Tech companies ought to deliberately make text generated by their LLMs easy to detect, so that teachers, legal/medical professionals, and others who need to ascertain the authenticity of a piece of written content on a regular basis can conduct checks more easily.

Those determined to cheat or create mischief will always find a way. But we ought to make it very onerous for them to do so.

Next, let’s look at how the detectors perform for human-written text.

Example of how to interpret the chart: OpenAI was able to correctly identify all of the CNA articles and 94% of the Today columns as being written by humans.

Overall, the false positive rates are within the acceptable range, in my view. The standout here is the high false positive rate for GPTZero when tested against CNA news articles, where 26% of the human-written articles were wrongly classified as being written in part or in full by AI. This is surprising to me, as I have always thought that news articles are fairly well represented in LLM training datasets.

Example of how to interpret the chart: RD was only able to detect 44% of the 25 articles containing 20% AI text. Its performance improved as the proportion of AI text went up, and it correctly identified all of the 25 articles which had 80% AI text.

As expected, the mixed AI-Human text dataset presents the toughest challenge for all three detectors. The chances of catching such mixed articles using OpenAI or GPTZero’s classifiers are in fact worse than a coin flip.

RD proved to be the only reliable detector in this category, though only up to a point. Articles with 20% AI text or less would have more than an even chance of giving RD the slip.

The ability to accurately detect and provide detailed forensics for mixed AI-human text would be the biggest challenge for companies making these LLM-text detectors. Savvy LLM users are unlikely to pass off fully AI-generated writing as their own, and will instead mix in their own work with that of highly capable language models.

LIMITATIONS AND FUTURE AREAS OF IMPROVEMENT

The sample size here is obviously small, though it is not clear to me how many articles per dataset would be optimal. Would testing 1,000 news articles make the overall assessment more robust? Maybe.

However, I suspect diminishing returns will kick in pretty fast from merely increasing the number of articles per dataset. It would be far better, in my view, to increase the diversity of the articles in a dataset of, say, 500 articles to include speeches, press releases and blog posts alongside the news articles and opinion columns.

The second obvious limitation in my test is the lack of non-English text. Most, if not all, of the AI text detectors on the market so far are trained primarily on English text. But the LLMs are capable of generating and translating text in a large number of languages — a “loophole” that will surely be exploited until the text detectors gain multilingual capabilities.

One final consideration for future tests: how many LLMs ought to be included for evaluation? At least a few dozen closed and open-source LLMs have been announced since ChatGPT launched in late 2022.

But I’m not convinced that it would be that productive to have a “shoot-out” between a large number of LLMs in tests like this, as the vast majority of consumers will eventually converge on a small number of AI writing tools. Again, the diversity matters more than the actual number, in my view, and having a good mix of LLMs from Big Tech and open-source models would do more to improve the robustness of future tests.

As always, if you spot mistakes in this or any of my earlier posts, ping me at:
