ClueWeb12

Large-scale web crawl dataset of 733 million web pages collected in 2012, maintained by the Lemur Project and widely used for IR research; referenced in awesome-style dataset listings.

About this tool

Category: Themed Directories
Tags: datasets, web, information-retrieval
Source: http://lemurproject.org/clueweb12/

Overview

ClueWeb12 is a large-scale English web crawl dataset created to support research in information retrieval and related human language technologies. It contains 733,019,372 English web pages collected between February 10, 2012 and May 10, 2012. It is a companion and successor to the ClueWeb09 dataset and has been distributed for research use since January 2013.

Features

Research focus
- Designed for information retrieval (IR) and human language technology research.
- Distributed strictly for research purposes.
Scale and content
- 733,019,372 English web pages in total.
- Content collected from the public web.
- Documents provided in HTML format.
Collection period
- Crawl dates: February 10, 2012 – May 10, 2012.
Dataset variants
- ClueWeb12-Full
  - 733M documents.
  - HTML format.
  - Distributed on 1 × 8 TB disk.
- ClueWeb12-B13
  - 50M documents.
  - HTML format.
  - Distributed on 1 × 500 GB disk.
Access and licensing
- Distributed by Carnegie Mellon University (CMU).
- Requires signing an Organizational Agreement (for the research group/unit) with CMU.
- Each individual user must sign an Individual Agreement retained by the organization.
- Intended for a single research group or unit within a larger legal entity (e.g., a lab within a university).
- Typical processing time to obtain the dataset: 4–6 weeks after initiating the license and payment process.
Online exploration (historical note)
- Access to the ClueWeb12-B13 search engine requires credentials tied to a ClueWeb12 data license. ClueWeb09 credentials were also accepted until January 31, 2014.
- The Lemur Project’s online ClueWeb12 services are free to use where available; the dataset itself still requires a license and a distribution fee.
Distribution process (summary)
- Organization signs the Organizational Agreement (all pages initialed and signed by an authorized person).
- Agreement and order form are emailed (PDF preferred) to CMU.
- CMU acknowledges receipt and issues an invoice.
- Payment is made in U.S. dollars; purchaser must notify CMU by email after payment so they can track the deposit.
- After payment is confirmed, disks containing the dataset are shipped.
Sponsorship
- Creation of ClueWeb12 was sponsored by the U.S. National Science Foundation (NSF), grant CNS-0934358.

Pricing

Fees are for dataset distribution and exclude shipping costs; payment must be in U.S. dollars.

| Dataset Variant | Document Count | Format | Distribution Media | Cost (USD)* |
|-----------------|----------------|--------|--------------------|-------------|
| ClueWeb12-Full | 733M | HTML | 1 × 8 TB disk | $380 |
| ClueWeb12-B13 | 50M | HTML | 1 × 500 GB disk | $185 |

*Shipping costs are additional.
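
To compare the two variants on a like-for-like basis, the listed fees can be normalized by document count. The following is a minimal Python sketch that uses only the rounded figures from the pricing above; the `Variant` class and `fee_per_million_docs` helper are hypothetical names introduced for illustration and are not part of any Lemur Project tooling. Shipping is excluded, as in the listed fees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    documents: int  # rounded document count as listed (733M, 50M)
    media: str      # distribution media as listed
    fee_usd: int    # distribution fee in USD, shipping excluded


# Figures copied from the pricing listing; counts are the rounded values.
VARIANTS = [
    Variant("ClueWeb12-Full", 733_000_000, "1 x 8 TB disk", 380),
    Variant("ClueWeb12-B13", 50_000_000, "1 x 500 GB disk", 185),
]


def fee_per_million_docs(v: Variant) -> float:
    """Distribution fee divided by the document count in millions."""
    return v.fee_usd / (v.documents / 1_000_000)


for v in VARIANTS:
    print(f"{v.name}: ${fee_per_million_docs(v):.2f} per million documents")
```

On these rounded figures, the full collection works out to roughly $0.52 per million documents and the B13 subset to $3.70 per million documents.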

Brand

Provider: Lemur Project / Carnegie Mellon University
Brand logo: http://lemurproject.org/images/lemur-logo.png

Information

Website: lemurproject.org

Published: Dec 30, 2025
