ClueWeb12
Large-scale web crawl dataset of 733 million web pages collected in 2012, maintained by the Lemur Project and widely used for IR research; referenced in awesome-style dataset listings.
Similar Products
6 result(s)
Research corpus of about 1 billion web pages collected in 2009 by the Lemur Project, designed for information retrieval and web mining experiments and commonly listed in awesome datasets directories.
Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.
Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web page and metadata for large-scale analytics. A canonical item in many awesome datasets and web data directories.
The Laboratory for Web Algorithmics (LAW) at the University of Milan provides a structured collection of large-scale web and hyperlink graph datasets. This page acts as a directory of web graph and network datasets suitable for experiments in web mining, graph algorithms, and network analysis, aligning with "awesome"-type meta-lists of reusable data resources.
Indie Map provides a social graph and crawl data of prominent IndieWeb sites, cataloged in the Awesome Data Project as a specialized social network dataset for IndieWeb communities.
A curated awesome-style list of open 3D semantic city and region models (e.g., CityGML datasets), providing a centralized directory of high-quality 3D urban data sources.