Turnitin vs. Originality: Testing 15 ChatGPT Essays and 15 Humanized Essays

Education


Introduction

In a recent gravity R test, I observed a significant decline in the performance of Turnitin's AI detector, the poorest result since the July 2024 update. The detector flagged only 54% of an essay generated by ChatGPT and just 43% of the same essay after it had been run through a humanizer. This drop in sensitivity may be a consequence of the September update, which was designed to reduce false positives, cases where human-written work is mistakenly flagged as AI-generated. Alternatively, it may simply be an anomaly.

In this article, I compare 30 scores from Turnitin's AI detector with 30 scores from Originality's AI detector. The test comprises 15 essays written by ChatGPT and 15 humanized versions of those essays, produced with a range of tools including AI Undetect, Bypass AI, and Undetectable AI. The experiment was run on October 14, 2024.

Before diving into the results, I must disclose that Jon Gillham, the founder and CEO of Originality, provided me with 10,000 credits for their AI detector in September. The credits were granted with the understanding that my testing would remain unbiased.

Testing Results

The results indicate whether Originality is a more effective AI detector than Turnitin and whether it is better suited for evaluating student assignments.

  1. ChatGPT Essay (SA1), humanized with AI Undetect:
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 52% likely AI (humanized)
  2. ChatGPT Essay (SA2):
    • Turnitin: 56% likely AI (ChatGPT); 78% likely AI (humanized)
    • Originality: 100% likely AI for both versions
  3. ChatGPT Essay (SA3):
    • Turnitin: 75% likely AI (ChatGPT); 39% likely AI (humanized)
    • Originality: 100% likely AI for both versions
  4. ChatGPT Essay (SA4):
    • Turnitin: insufficient text to generate a score
    • Originality: 100% likely AI (ChatGPT); 98% likely original (humanized)
  5. ChatGPT Essay (SA5):
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 94% likely original (humanized)
  6. ChatGPT Essay (SA6), humanized with JY:
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 100% likely original (humanized)
  7. ChatGPT Essay (SA7):
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 51% likely original (humanized)
  8. ChatGPT Essay (SA8):
    • Turnitin: 56% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 71% likely original (humanized)
  9. ChatGPT Essay (SA9):
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 100% likely original (rewrite)
  10. ChatGPT Essay (SA10):
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 100% likely original (humanized)
  11. ChatGPT Essay (SA11):
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 100% likely original (humanized)
  12. ChatGPT Essay (SA12), humanized with Stealth Writer:
    • Turnitin: 42% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 66% likely original (humanized)
  13. ChatGPT Essay (SA13):
    • Turnitin: 100% likely AI (ChatGPT); 100% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 77% likely AI (humanized)
  14. ChatGPT Essay (SA14):
    • Turnitin: 78% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 73% likely original (humanized)
  15. ChatGPT Essay (SA15):
    • Turnitin: 100% likely AI (ChatGPT); 0% likely AI (humanized)
    • Originality: 100% likely AI (ChatGPT); 51% likely original (humanized)

Summary of Findings

From the results, it is evident that Turnitin's AI detector has significant shortcomings: it flagged only 10 of the 30 submissions at 100% likely AI (a quick tally of the scores above is sketched below). Originality, by contrast, flagged every unedited ChatGPT essay at 100% likely AI, but it still misclassified several humanized AI essays as likely original.
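
To make the arithmetic explicit, here is a minimal Python sketch that tallies the Turnitin scores transcribed from the results above. It assumes the 10-of-30 figure counts only submissions scored at exactly 100% likely AI, and it records SA4, which Turnitin could not score, as None; the tally is purely an illustration, not part of the original testing workflow.

    # Turnitin scores (percent "likely AI") transcribed from the results above,
    # stored as (ChatGPT essay, humanized version) pairs. None = no score returned.
    turnitin_scores = {
        "SA1":  (100, 0),     "SA2":  (56, 78),    "SA3":  (75, 39),
        "SA4":  (None, None), "SA5":  (100, 0),    "SA6":  (100, 0),
        "SA7":  (100, 0),     "SA8":  (56, 0),     "SA9":  (100, 0),
        "SA10": (100, 0),     "SA11": (100, 0),    "SA12": (42, 0),
        "SA13": (100, 100),   "SA14": (78, 0),     "SA15": (100, 0),
    }

    # Flatten the 15 pairs into the 30 individual submissions.
    all_scores = [score for pair in turnitin_scores.values() for score in pair]

    fully_flagged = sum(1 for s in all_scores if s == 100)  # scored 100% likely AI
    missed = sum(1 for s in all_scores if s == 0)           # scored 0% likely AI

    print(f"Submissions:               {len(all_scores)}")  # 30
    print(f"Flagged at 100% likely AI: {fully_flagged}")    # 10
    print(f"Scored 0% likely AI:       {missed}")           # 11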

The tests also show that certain humanizers performed noticeably better than others. The most reliable were Stealth GPT, Reify, JY, and Hicks, while tools such as Undetectable AI and AI to Human Converter performed poorly. Even where these tools evade detection, it is crucial to assess the quality of the writing they produce, since text verging on gibberish can still slip past an AI detector.

Turnitin's performance raises questions about its capabilities, which I aim to investigate further in future tests using more complex and detailed prompts.


Keywords:

  • Turnitin
  • Originality
  • AI detector
  • ChatGPT
  • Humanizer
  • Test results
  • Performance metrics
  • Student assignments

FAQ:

  1. What is the purpose of the Turnitin versus Originality test?
    The test aims to compare the accuracy of Turnitin's and Originality’s AI detectors in identifying AI-generated essays versus human-written content.

  2. What were the results of the AI detector tests?
    Turnitin flagged only 10 of the 30 submissions (ChatGPT essays and humanized versions combined) at 100% likely AI, whereas Originality caught every unedited ChatGPT essay but misclassified several humanized versions as likely original.

  3. Which humanizers performed best in the tests?
    The most effective humanizers included Stealth GPT, Reify, JY, and Hicks.

  4. What inconsistencies were found with Turnitin's performance?
    Turnitin scored several unedited ChatGPT essays well below 100% likely AI and rated most humanized AI essays at 0%, an inconsistency that indicates its accuracy needs further assessment.

  5. How will the testing continue?
    Future tests will explore how Turnitin performs against more complex and detailed prompts to gauge its accuracy further.