The Wikimedia Foundation, the organization behind the internet’s largest free encyclopedia Wikipedia, is offering an artificial intelligence-ready dataset on Kaggle that’s aimed at dissuading AI companies and large language model trainers from scraping the website.
“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content — making this ideal for training models, building features, and testing NLP pipelines,” Wikimedia said in the announcement on Wednesday.
Kaggle is a data science and machine learning community owned and governed by Google LLC that hosts datasets and data science challenges.
The dataset upload is available as of April 15 and includes high-quality elements such as abstracts, short descriptions, infobox key-value data, image links and segmented article sections. It excludes references and non-prose elements such as images and charts themselves.
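As a rough illustration of what working with such pre-parsed records might look like, the Python sketch below flattens one JSON record into plain text for an NLP pipeline. The field names (`name`, `abstract`, `description`, `infobox`, `image_url`, `sections`, `title`, `text`) and the sample record are hypothetical stand-ins for the elements listed above, not the dataset's documented schema, so check the dataset's own documentation before relying on them.

```python
import json

# Hypothetical record: the field names below are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "abstract": "A short summary of the article.",
    "description": "illustrative short description",
    "infobox": {"Founded": "2001", "Type": "Encyclopedia"},
    "image_url": "https://example.org/image.png",
    "sections": [
        {"title": "History", "text": "First paragraph of the section."},
        {"title": "Usage", "text": "Another paragraph of prose."},
    ],
})


def to_training_text(line: str) -> str:
    """Flatten one JSON record into plain text for an NLP pipeline."""
    record = json.loads(line)
    # Start with the article name and abstract, then append each section.
    parts = [record.get("name", ""), record.get("abstract", "")]
    for section in record.get("sections", []):
        parts.append(f"{section.get('title', '')}\n{section.get('text', '')}")
    # Drop empty pieces so missing fields do not leave blank paragraphs.
    return "\n\n".join(p for p in parts if p.strip())


if __name__ == "__main__":
    print(to_training_text(SAMPLE_LINE))
```

Because each record is already structured, the only work left is choosing which fields to keep. No HTML parsing or markup stripping is involved, which is the convenience the announcement describes.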
Because the content is taken from Wikipedia, it’s licensed under Creative Commons Attribution-ShareAlike, an open license that allows for sharing, adapting and remixing content as long as the source is credited and derivative works carry the same license. It is also licensed under the GNU Free Documentation License, or GFDL, although in some cases public domain or alternative licenses may apply.
“Kaggle is already a top place people go to find datasets, and there are few open datasets that have more impact than those hosted by the Wikimedia Foundation,” said Brenda Flynn, partnerships lead at Kaggle.
LLM developers depend heavily on data from the internet to train their models, and they often get their datasets by scraping that data from public-facing websites. Web scraping is the automated extraction of content, usually text and images, from websites. The software that does it can be aggressive, adding load to web servers above and beyond normal human traffic.
That additional load is a costly performance hit for the web servers that have to bear it. The scraped data also must be reformatted so that machine learning and AI workflows can use it for training data.
Wikimedia and Kaggle said in the joint announcement that the dataset is designed to make such scraping unnecessary. That would lower the burden on Wikimedia’s web servers while giving developers clean, pre-parsed and developer-friendly data.
Kaggle is host to more than 461,000 freely accessible datasets for AI and machine learning covering a wide variety of topics. Wikipedia’s dataset will join datasets on health (such as diabetes and cancer), finance (such as credit card fraud and the stock market) and social sciences (such as social media trends and education). There’s even a dataset containing nutrition information on 80 cereal products and one about UFO sightings.
The new Wikipedia dataset is available in French and English language editions on Kaggle as an early beta release. Since this is an early release, Kaggle is welcoming feedback and discussions about the dataset from the community directly.
Image: News/Microsoft Designer