3.5B Web Pages from CommonCrawl 2012
Overview
Large-scale web crawl dataset containing approximately 3.5 billion web pages collected by CommonCrawl in 2012. Intended for research and experimentation in areas such as web mining, search, and network analysis. Listed within an “awesome-style” collection of computer networks datasets.
- Category: Themed directories
- Type: Dataset (web crawl / big data)
- Year of crawl: 2012
- Source: CommonCrawl (referenced via BigDataNews)
- URL: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
- Tags: datasets, web, big-data
Features
- Contains about 3.5 billion web pages from a large-scale crawl.
- Based on the CommonCrawl 2012 corpus.
- Suitable for web mining research, including:
  - Content analysis at large scale
  - Topic modeling and classification
  - Language and text mining experiments
- Applicable to search and information retrieval research, such as:
  - Indexing and ranking experiments
  - Query log–independent search evaluation scenarios
- Supports network and graph analysis, including:
  - Web graph construction
  - Link structure and connectivity studies
  - Page-level and domain-level graph metrics
- Appropriate for big data processing frameworks (e.g., Hadoop/Spark-style workflows), given its scale.
- Included in an awesome-style curated list of computer networks datasets, indicating use as a reference dataset for networking and web research communities.
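The web graph construction mentioned above can be sketched in miniature. The following standalone Python example (names such as `build_host_graph` and `page_out_links` are illustrative, not part of any CommonCrawl tooling) extracts out-links from crawled HTML and aggregates them into a host-level adjacency map, the kind of per-record step a Hadoop/Spark-style job over the 2012 corpus might perform:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from collections import defaultdict

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def page_out_links(base_url, html):
    """Return absolute out-link URLs for one crawled page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def build_host_graph(pages):
    """Aggregate page-level links into a host-level link graph.

    `pages` is an iterable of (url, html) pairs, standing in for
    records read from a crawl archive.
    """
    graph = defaultdict(set)
    for url, html in pages:
        src_host = urlparse(url).netloc
        for link in page_out_links(url, html):
            dst_host = urlparse(link).netloc
            if dst_host and dst_host != src_host:
                graph[src_host].add(dst_host)
    return graph

# Tiny inline sample in place of real crawl records.
sample = [("http://example.com/index.html",
           '<a href="http://other.org/page">x</a><a href="/local">y</a>')]
print(dict(build_host_graph(sample)))
# {'example.com': {'other.org'}}
```

At corpus scale the same map/aggregate shape parallelizes naturally: link extraction runs independently per page, and the host-level edge sets are merged in a reduce step.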
Use Cases
- Academic and industrial web-scale research projects.
- Benchmarking big data processing pipelines and distributed systems.
- Building and testing experimental search engines.
- Studying web structure, connectivity, and evolution around 2012.
Pricing
- Not specified in the provided content. (CommonCrawl datasets are typically freely available, but the exact access terms should be confirmed on the linked source.)
Similar Products
- Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web pages and metadata for large-scale analytics. A canonical item in many awesome datasets and web data directories.
- Research corpus of about 1 billion web pages collected in 2009 by the Lemur Project, designed for information retrieval and web mining experiments and commonly listed in awesome datasets directories.
- Large-scale web crawl dataset of 733 million web pages collected in 2012, maintained by the Lemur Project and widely used for IR research; referenced in awesome-style dataset listings.
- The Laboratory for Web Algorithmics (LAW) at the University of Milan provides a structured collection of large-scale web and hyperlink graph datasets. This page acts as a directory of web graph and network datasets suitable for experiments in web mining, graph algorithms, and network analysis, aligning with "awesome"-type meta-lists of reusable data resources.
- Clickstream dataset with 53.5 billion web clicks from 100,000 anonymized users at Indiana University, useful for studying browsing behavior, recommendation, and network traffic patterns; included in an awesome curated datasets list.
- Indie Map provides a social graph and crawl data of prominent IndieWeb sites, cataloged in the Awesome Data Project as a specialized social network dataset for IndieWeb communities.