World of Software

Wikipedia offers AI developers its article data on Kaggle to stop automated scraping

News Room
Last updated: 2025/04/21 at 2:10 AM
News Room Published 21 April 2025

The Wikimedia Foundation, the organization behind the internet’s largest free encyclopedia Wikipedia, is offering an artificial intelligence-ready dataset on Kaggle that’s aimed at dissuading AI companies and large language model trainers from scraping the website.

“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content — making this ideal for training models, building features, and testing NLP pipelines,” Wikimedia said in the announcement on Wednesday.

Kaggle is a data science and machine learning community owned and governed by Google LLC that hosts datasets and data science challenges.

The dataset upload is available as of April 15 and includes high-quality elements such as abstracts, short descriptions, infobox key-value data, image links and segmented article sections. It excludes references and non-prose elements such as images and charts themselves.
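As a rough illustration of working with structured records like these, the sketch below reads JSON Lines text and flattens each article into plain text. The file layout (one JSON object per line) and every field name used here (`name`, `abstract`, `infobox`, `sections`, `title`, `text`) are assumptions made for illustration, not the dataset's documented schema, so check the dataset's own documentation before relying on them.

```python
import json

# Hypothetical sample records. The one-object-per-line layout and all field
# names are assumptions for illustration, not the dataset's documented schema.
SAMPLE_JSONL = """\
{"name": "First", "abstract": "First abstract.", "infobox": {"Owner": "Example owner"}, "sections": [{"title": "History", "text": "Section text."}]}
{"name": "Second", "abstract": "Second abstract.", "sections": []}
"""


def iter_articles(jsonl_text):
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def to_training_text(article):
    """Join the abstract and each section's title and text into one string."""
    parts = [article.get("abstract", "")]
    for section in article.get("sections", []):
        parts.append(f"{section.get('title', '')}: {section.get('text', '')}")
    # Drop empty pieces so records with missing fields still produce clean text.
    return "\n".join(p for p in parts if p)


articles = list(iter_articles(SAMPLE_JSONL))
texts = [to_training_text(a) for a in articles]
```

Because each record is already parsed into fields, no HTML or wikitext cleanup step is needed before handing `texts` to a tokenizer or feature pipeline.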

Because the content is taken from Wikipedia, it’s licensed under the Creative Commons Attribution-ShareAlike license, an open license that allows sharing, adapting and remixing content. It is also licensed under the GNU Free Documentation License, or GFDL, although in some cases public domain or alternative licenses may apply.

“Kaggle is already a top place people go to find datasets, and there are few open datasets that have more impact than those hosted by the Wikimedia Foundation,” said Brenda Flynn, partnerships lead at Kaggle.

LLM developers depend heavily on internet data to train their models, and they often build their datasets by scraping public-facing websites. Web scraping is the automated extraction of content, usually text and images, from websites. Scraping software can be aggressive, adding load to web servers beyond normal human traffic.

That additional load is a costly performance hit for the web servers that have to bear it. The scraped data also must be reformatted before machine learning and AI workflows can use it as training data.

Wikimedia and Kaggle said in the joint announcement that the dataset is designed to short-circuit this scraping: it reduces the need to scrape, lowers the burden on Wikimedia’s web servers and provides clean, pre-parsed, developer-friendly data.

Kaggle is host to more than 461,000 freely accessible datasets for AI and machine learning covering a wide variety of topics. Wikipedia’s dataset will join datasets on health (such as diabetes and cancer), finance (such as credit card fraud and the stock market) and social sciences (such as social media trends and education). There’s even a dataset containing nutrition information on 80 cereal products and one about UFO sightings.

The new Wikipedia dataset is available in French and English language editions on Kaggle as an early beta release. Since this is an early release, Kaggle is welcoming feedback and discussions about the dataset from the community directly.

