ClueWeb12
Large-scale web crawl dataset of 733 million web pages collected in 2012, maintained by the Lemur Project and widely used for IR research; referenced in awesome-style dataset listings.
About this tool
Category: Themed Directories
Tags: datasets, web, information-retrieval
Source: http://lemurproject.org/clueweb12/
Overview
ClueWeb12 is a large-scale English web crawl dataset created to support research in information retrieval and related human language technologies. It contains 733,019,372 English web pages collected between February 10, 2012 and May 10, 2012. It is a companion and successor to the ClueWeb09 dataset and has been distributed for research use since January 2013.
Features
Research focus
- Designed for information retrieval (IR) and human language technology research.
- Distributed strictly for research purposes.
Scale and content
- 733,019,372 English web pages in total.
- Content collected from the public web.
- Documents provided in HTML format.
Collection period
- Crawl dates: February 10, 2012 – May 10, 2012.
Dataset variants
- ClueWeb12-Full
  - 733M documents.
  - HTML format.
  - Distributed on 1 × 8 TB disk.
- ClueWeb12-B13
  - 50M documents.
  - HTML format.
  - Distributed on 1 × 500 GB disk.
Access and licensing
- Distributed by Carnegie Mellon University (CMU).
- Requires signing an Organizational Agreement (for the research group/unit) with CMU.
- Each individual user must sign an Individual Agreement retained by the organization.
- Intended for a single research group or unit within a larger legal entity (e.g., a lab within a university).
- Typical processing time to obtain the dataset: 4–6 weeks after initiating the license and payment process.
Online exploration (historical note)
- Access to the ClueWeb12-B13 search engine requires credentials tied to a ClueWeb12 data license. ClueWeb09 credentials were accepted until January 31, 2014.
- The Lemur Project’s online ClueWeb12 services, where available, are free to use; the dataset itself still requires a license and a distribution fee.
Distribution process (summary)
- Organization signs the Organizational Agreement (all pages initialed and signed by an authorized person).
- Agreement and order form are emailed (PDF preferred) to CMU.
- CMU acknowledges receipt and issues an invoice.
- Payment is made in U.S. dollars; the purchaser must then notify CMU by email so that the deposit can be tracked.
- After payment is confirmed, disks containing the dataset are shipped.
Sponsorship
- Creation of ClueWeb12 was sponsored by the U.S. National Science Foundation (NSF), grant CNS-0934358.
Pricing
Fees are for dataset distribution and exclude shipping costs; payment must be in U.S. dollars.
| Dataset Variant | Document Count | Format | Distribution Media | Cost (USD)* |
|-----------------|----------------|--------|--------------------|-------------|
| ClueWeb12-Full  | 733M           | HTML   | 1 × 8 TB disk      | $380        |
| ClueWeb12-B13   | 50M            | HTML   | 1 × 500 GB disk    | $185        |
*Shipping costs are additional.
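To compare the two variants, the listed figures can be turned into a cost per million documents and a subset share. The following is a minimal sketch in Python; the names `VARIANTS`, `cost_per_million`, and `subset_share` are illustrative and not part of any Lemur Project tooling. The B13 count is the rounded 50M figure from this listing, and shipping is excluded.

```python
# Figures copied from this listing; distribution fees only, shipping excluded.
VARIANTS = {
    # Exact page count as stated in the Overview.
    "ClueWeb12-Full": {"documents": 733_019_372, "cost_usd": 380},
    # Rounded "50M" figure from the table (an approximation).
    "ClueWeb12-B13": {"documents": 50_000_000, "cost_usd": 185},
}


def cost_per_million(name: str) -> float:
    """Distribution fee in USD per one million documents."""
    variant = VARIANTS[name]
    return variant["cost_usd"] / (variant["documents"] / 1_000_000)


def subset_share(subset: str, full: str) -> float:
    """Fraction of the full collection's documents covered by a subset."""
    return VARIANTS[subset]["documents"] / VARIANTS[full]["documents"]


if __name__ == "__main__":
    for name in VARIANTS:
        print(f"{name}: ${cost_per_million(name):.2f} per million documents")
    share = subset_share("ClueWeb12-B13", "ClueWeb12-Full")
    print(f"B13 covers about {share:.1%} of the full collection")
```

On these figures, the full collection costs far less per document, while B13 has the lower absolute fee and covers roughly 7% of the pages.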
Brand
- Provider: Lemur Project / Carnegie Mellon University
- Brand logo: http://lemurproject.org/images/lemur-logo.png
Links
- Dataset page: http://lemurproject.org/clueweb12/
- Dataset progress/info: http://boston.lti.cs.cmu.edu/clueweb12/
Similar Products
Research corpus of about 1 billion web pages collected in 2009 by the Lemur Project, designed for information retrieval and web mining experiments and commonly listed in awesome datasets directories.
Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.
Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web page and metadata for large-scale analytics. A canonical item in many awesome datasets and web data directories.
The Laboratory for Web Algorithmics (LAW) at the University of Milan provides a structured collection of large‑scale web and hyperlink graph datasets. This page acts as a directory of web graph and network datasets suitable for experiments in web mining, graph algorithms, and network analysis, aligning with "awesome"-type meta-lists of reusable data resources.
Indie Map provides a social graph and crawl data of prominent IndieWeb sites, cataloged in the Awesome Data Project as a specialized social network dataset for IndieWeb communities.
A curated awesome-style list of open 3D semantic city and region models (e.g., CityGML datasets), providing a centralized directory of high-quality 3D urban data sources.