Inside the Dataset Powering the Next Generation of AI Text Detection | HackerNoon

News Room · Published 10 February 2026 (last updated 10:28 PM)

Table of Links

  1. Abstract and Introduction
  2. Dataset
    2.1 Baseline
    2.2 Proposed Model
    2.3 Our System
    2.4 Results
    2.5 Comparison With Proprietary Systems
    2.6 Results Comparison
  3. Conclusion
    3.1 Strengths and Weaknesses
    3.2 Possible Improvements
    3.3 Possible Extensions and Applications
    3.4 Limitations and Potential for Misuse

A. Other Plots and Information

B. System Description

C. Effect of Text Boundary Location on Performance

2 Dataset

The dataset used is part of the M4GT-bench dataset (Wang et al., 2024a), consisting of texts that are partially human-written and partially machine-generated, sourced from PeerRead reviews and OUTFOX student essays (Koike et al., 2023), all in English. The generators used were GPT-4 (OpenAI, 2024), ChatGPT, and LLaMA2 7/13/70B (Touvron et al., 2023). Table 1 shows the source, generator, and data split of the dataset. The generators were given partially human-written essays or reviews along with problem statements and instructions to complete the text. In the training data, the proportion of human-written content at the start of each sample ranged from 0 to 50%, with the rest machine-generated; in the development and test sets it varied from 0 to 100%. The texts ranged in length from a single sentence to over 20, with a median word count of 212 and a mean word count of 248.

2.1 Baseline

The provided baseline is a Longformer fine-tuned over 10 epochs. It classifies tokens individually as human- or machine-generated, then maps the tokens to words to identify the text boundary between the machine-generated and human-written portions. The final prediction is the label of the word after which the text boundary lies. The detection criterion is the first change from 0 to 1 or vice versa; we also tried an additional approach that counts a change only if the following consecutive tokens agree. The baseline model achieved an MAE of 3.53 on the development set, which shares sources and generators with the training data, and an MAE of 21.535 on the test set, which consists of unseen domains and generators.

2.2 Proposed Model

We built several models, of which DeBERTa-CRF was used as the official submission. We fine-tuned DeBERTa (He et al., 2023), SpanBERT (Joshi et al., 2020), Longformer (Beltagy et al., 2020), and Longformer-pos (Longformer trained only on position embeddings), each both alone and with a Conditional Random Field (CRF) (McCallum, 2012) layer and with different text-boundary identification logic, training on the training set only; after hyperparameter tuning, predictions were made on both the development and test sets. CRFs played a vital role in improving model performance, as their architecture is well suited to pattern recognition in sequential data. The primary metric was Mean Absolute Error (MAE) between the predicted word index of the text boundary and the actual boundary word index. Mean Absolute Relative Error (MARE), the ratio of MAE to text length in words, was also used for a better understanding. Some plots and information could not be included due to page limits and are available here [1], along with the code used [2]. A hypothetical example in Figure 1 demonstrates how the model works: the tokens are classified first and then mapped to words. In cases where part of a longer word is predicted as human and the rest as machine, the word as a whole is classified as machine-generated.

Figure 1: A visual example of how the model works
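The token-to-word aggregation just described can be sketched as follows (a minimal illustration under stated assumptions, not the authors' code; the function name and the `word_ids` convention, as produced by a subword tokenizer, are hypothetical):

```python
def words_from_token_labels(word_ids, token_labels):
    """Aggregate per-token labels (0 = human, 1 = machine) into per-word labels.

    word_ids[i] is the index of the word that token i belongs to, or None for
    special tokens. If any token of a word is labelled machine-generated,
    the whole word is labelled machine-generated.
    """
    word_labels = {}
    for wid, label in zip(word_ids, token_labels):
        if wid is None:  # special tokens ([CLS], [SEP], padding)
            continue
        word_labels[wid] = max(word_labels.get(wid, 0), label)
    return [word_labels[w] for w in sorted(word_labels)]
```

For example, a word split into two subword tokens with labels (1, 0) comes out as machine-generated (1), matching the rule in the paragraph above.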

2.3 Our system

We used 'deberta-v3-base' with a CRF layer, trained with the Adam optimizer (Kingma and Ba, 2017) over 30 epochs, with a learning rate of 2e-5 and a weight decay of 1e-2 to prevent overfitting. The other models used were 'spanbert-base-cased', 'longformer-base-4096', and 'longformer-base-4096-extra.pos.embd.only', the last of which is similar to Longformer but pretrained to preserve and freeze the weights from RoBERTa (Liu et al., 2019) and train only the position embeddings. The large variants of these were also tested, but the base variants achieved better performance on both the development and test sets. Predictions were made on both sets by training on the training set only. Two approaches were used to detect the text boundary: 1) looking for any change in token predictions, i.e. from 1 to 0 or 0 to 1; and 2) looking for a change sustained over consecutive tokens, i.e. 1 to 0,0 or 0 to 1,1. Approach 2 achieved better results than approach 1 in all cases and was used in the official submission.
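The two boundary-detection approaches can be sketched over a sequence of per-word labels (a minimal sketch, assuming the boundary index is the position of the first word of the second segment; function names are hypothetical):

```python
def boundary_first_change(labels):
    """Approach 1: boundary at the first change in word labels (0->1 or 1->0)."""
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            return i
    return len(labels)  # no change found: no boundary inside the text

def boundary_consecutive(labels):
    """Approach 2: boundary only where the new label persists for two
    consecutive words (e.g. 0 -> 1,1), which ignores isolated
    misclassified words."""
    for i in range(1, len(labels) - 1):
        if labels[i] != labels[i - 1] and labels[i + 1] == labels[i]:
            return i
    return len(labels)
```

On a sequence with one spurious flip, such as `[0, 0, 0, 1, 0, 1, 1, 1]`, approach 1 stops at the isolated `1` (index 3), while approach 2 skips it and finds the sustained change at index 5, which illustrates why approach 2 is more robust to stray token errors.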

2.4 Results

The results of the different models with the two approaches on the development and test sets can be seen in Table 2. These models were trained over 30 epochs, and the best results across several attempts with varying hyperparameters are reported. The provided baseline, however, was trained with approach 1 only, over 10 epochs, using the base variant of Longformer. Each model was then used to make predictions on the test set, without further training or changes, using the set of hyperparameters that produced its best results on the development set. Since MAE, the primary metric of the task, does not take the length of the text into consideration, MARE (Mean Absolute Relative Error) was also calculated for a better understanding.
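The two metrics can be written out as follows (a sketch of one plausible reading of the MARE definition given earlier, in which each boundary error is normalised by its own text's word count before averaging; function names are hypothetical):

```python
def mae(pred_idx, true_idx):
    """Mean Absolute Error between predicted and true boundary word indices."""
    return sum(abs(p - t) for p, t in zip(pred_idx, true_idx)) / len(true_idx)

def mare(pred_idx, true_idx, lengths):
    """Mean Absolute Relative Error: each boundary error is divided by the
    text length in words, so long and short texts contribute comparably."""
    return sum(abs(p - t) / n
               for p, t, n in zip(pred_idx, true_idx, lengths)) / len(true_idx)
```

For instance, a 2-word error on a 10-word text and an exact prediction on a 20-word text give MAE = 1.0 but MARE = 0.1.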

2.5 Comparison With Proprietary Systems

Some proprietary systems built to detect machine-generated text indicate which parts of an input text are likely machine-generated at a sentence or paragraph level. Many popular systems, such as GPTZero and GPTkit, were found to be less reliable for the task of detecting the text boundary in partially machine-generated texts. Of the existing systems, only ZeroGPT was found to produce a reliable level of accuracy. For an accurate comparison, the percentage accuracy of classifying each sentence as human- or machine-generated is used, since ZeroGPT performs detection at a sentence level.

2.6 Results Comparison

Since the comparison is done at a sentence level, in cases where the actual boundary lies inside a sentence, metrics are calculated on the remaining sentences; when the actual boundary is at the start of a sentence, all sentences are taken into consideration. Regarding predictions, a sentence prediction is deemed correct only when a sentence that is entirely human-written is predicted as completely human-written, and vice versa. The two metrics used were average sentence accuracy, the average over input texts of the percentage of sentences correctly classified in each text, and overall sentence accuracy, the percentage of sentences in the entire dataset classified accurately. The results on the development and test sets are shown in Table 3. Since it is difficult to do the same on the 12,000 items of the test set, a small subset of 500 random samples was used for comparison; performance was similar to the development set, with 15-20 percent lower accuracy than the proposed models. Since ZeroGPT's API does not cover sentence-level predictions, they were calculated manually over the development set and can be found here [3].
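The distinction between the two sentence-level accuracies above can be made concrete (a minimal sketch, assuming each input text is already reduced to per-sentence 0/1 labels; the function name is hypothetical):

```python
def sentence_accuracies(preds, golds):
    """preds/golds: one list of 0/1 sentence labels per input text.

    Returns (average sentence accuracy, overall sentence accuracy):
    - average: mean over texts of each text's fraction of correct sentences
    - overall: fraction of correct sentences pooled over the whole dataset
    """
    per_text = []
    correct = total = 0
    for p, g in zip(preds, golds):
        hits = sum(1 for a, b in zip(p, g) if a == b)
        per_text.append(hits / len(g))
        correct += hits
        total += len(g)
    return sum(per_text) / len(per_text), correct / total
```

The two values differ whenever texts have different sentence counts: a short text weighs as much as a long one in the average metric, but less in the pooled one.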

:::info
Author:

(1) Ram Mohan Rao Kadiyala, University of Maryland, College Park ([email protected]).

:::


:::info
This paper is available on arxiv under CC BY-NC-SA 4.0 license.

:::
