ClueWeb12

Large-scale web crawl dataset of about 733 million English web pages collected in 2012. It is maintained by the Lemur Project, widely used for information retrieval (IR) research, and referenced in awesome-style dataset listings.

Information

Website: lemurproject.org

Published: Dec 30, 2025

Tags

#datasets

#web

#information-retrieval

Similar Products

ClueWeb09

Research corpus of about 1 billion web pages collected in 2009 by the Lemur Project, designed for information retrieval and web mining experiments and commonly listed in awesome datasets directories.

3.5B Web Pages from CommonCrawl 2012

Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.

CommonCrawl Web Data

Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web page and metadata for large-scale analytics. A canonical item in many awesome datasets and web data directories.

LAW Network Datasets

The Laboratory for Web Algorithmics (LAW) at the University of Milan provides a structured collection of large‑scale web and hyperlink graph datasets. This page acts as a directory of web graph and network datasets suitable for experiments in web mining, graph algorithms, and network analysis, aligning with "awesome"-type meta-lists of reusable data resources.

Indie Map

Indie Map provides a social graph and crawl data of prominent IndieWeb sites, cataloged in the Awesome Data Project as a specialized social network dataset for IndieWeb communities.

Awesome 3D Semantic City Models

A curated awesome-style list of open 3D semantic city and region models (e.g., CityGML datasets), providing a centralized directory of high-quality 3D urban data sources.
