CommonCrawl Web Data
Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web page content and metadata for large-scale analytics. A canonical entry in many awesome-datasets and web-data directories.
About this tool
CommonCrawl Web Data
URL: http://commoncrawl.org/the-data/get-started/
Overview
CommonCrawl Web Data is a public, petabyte-scale repository of web crawl data collected over many years. It provides raw web page content and associated metadata from regular crawls of the public web, enabling large-scale analytics, research, and machine learning applications.
Features

- Long-term crawl history
  - Multiple named crawl releases from 2015 through 2025 (e.g., CC-MAIN-2025-51, CC-MAIN-2025-47, …, CC-MAIN-2015-32).
  - Enables longitudinal and time-based analyses of web content.
- Petabyte-scale web corpus
  - Large-scale collection of raw web pages and metadata.
  - Suitable for big data processing, web mining, and training large models.
- Public AWS S3 hosting
  - Data is stored in Amazon S3 in the us-east-1 region, and S3-based access should originate from that region.
  - Accessible via s3://commoncrawl/... paths; see the boto3 sketch after this list.
  - Many AWS services (e.g., EMR) can consume the S3 paths directly, often with wildcard support.
- S3 access recommendations
  - Access the data from within us-east-1 to avoid inter-region data transfer charges and to reduce latency.
  - Be cautious with Elastic IPs or load balancers, which may incur additional routed-traffic costs.
  - On non-EMR Hadoop clusters, use the S3A protocol (e.g., s3a://commoncrawl/...) for better compatibility and performance.
- HTTP/HTTPS access without an AWS account
  - Data can be downloaded directly over HTTPS via URLs of the form https://data.commoncrawl.org/[path_to_file].
  - Compatible with standard HTTP download tools such as cURL and wget.
  - No AWS account is required for HTTP-based access.
- AWS CLI integration
  - Data can be accessed and managed with the AWS Command Line Interface, pointed at the Common Crawl S3 bucket.
  - Works with AWS services that accept S3 as an input data source.
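Since the corpus sits in a public S3 bucket, a few lines of boto3 are enough to explore it. The sketch below is a minimal example, not an official client: it assumes AWS credentials are configured and that it runs in us-east-1, and the crawl-data/CC-MAIN-2025-51/ prefix is illustrative, built from a release name listed above.

```python
import boto3

# Minimal sketch: list a few objects from the Common Crawl bucket.
# Assumes AWS credentials are configured; run from us-east-1 to avoid
# inter-region transfer charges (see the recommendations above).
s3 = boto3.client("s3", region_name="us-east-1")

# Illustrative prefix derived from a named crawl release.
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2025-51/",
    MaxKeys=5,
)

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

The AWS CLI gives the same view (e.g., aws s3 ls s3://commoncrawl/), which is what the AWS CLI integration bullet refers to.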
Access Methods

- From AWS (recommended for large-scale processing)
  - Region: us-east-1
  - S3 URL scheme: s3://commoncrawl/path_to_file
  - Hadoop (non-EMR): use s3a://commoncrawl/path_to_file; see the PySpark sketch below.
- From local machines or external clusters
  - HTTPS URL scheme: https://data.commoncrawl.org/path_to_file
  - Use tools like curl or wget to download files; see the download sketch below.
  - No AWS account is needed for this method.
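For the Hadoop/S3A route, the sketch below shows what non-EMR access can look like from PySpark. It is a sketch under assumptions, not a verified recipe: the hadoop-aws connector and matching AWS SDK jars must be on the classpath, credentials must resolve via the default provider chain, and the file path is illustrative.

```python
from pyspark.sql import SparkSession

# Sketch of S3A access from a non-EMR Hadoop/Spark cluster. Assumes the
# hadoop-aws connector (and matching AWS SDK jars) are on the classpath
# and that AWS credentials resolve via the default provider chain.
spark = SparkSession.builder.appName("commoncrawl-s3a-demo").getOrCreate()

# Illustrative path: a per-crawl listing file. Hadoop's input format
# decompresses the gzip transparently, yielding one line per record.
lines = spark.sparkContext.textFile(
    "s3a://commoncrawl/crawl-data/CC-MAIN-2025-51/warc.paths.gz"
)
print(lines.take(5))

spark.stop()
```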
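For the HTTPS route, any HTTP client works. Below is a minimal Python sketch using requests that streams a file to disk, mirroring what curl or wget would do; the path is illustrative and should be replaced with a real [path_to_file].

```python
import requests

# Illustrative URL; substitute a real [path_to_file] from a crawl listing.
url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-51/warc.paths.gz"

# Stream the body to disk in 1 MiB chunks so large files are never
# held fully in memory (the HTTP analogue of `wget <url>`).
with requests.get(url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("warc.paths.gz", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```

No AWS account is involved; this is plain HTTPS, so it also works from external clusters and local machines.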
Pricing
- No explicit pricing plans are specified; the data itself is openly available.
- Cost considerations relate to AWS infrastructure charges when accessing via S3 (e.g., inter-region data transfer, Elastic IP or load balancer traffic); these are billed by AWS, not by Common Crawl.
Similar Products
- Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.
- Research corpus of about 1 billion web pages collected in 2009 by the Lemur Project, designed for information retrieval and web mining experiments and commonly listed in awesome datasets directories.
- Large-scale web crawl dataset of 733 million web pages collected in 2012, maintained by the Lemur Project and widely used for IR research; referenced in awesome-style dataset listings.
- The Laboratory for Web Algorithmics (LAW) at the University of Milan provides a structured collection of large-scale web and hyperlink graph datasets. This page acts as a directory of web graph and network datasets suitable for experiments in web mining, graph algorithms, and network analysis, aligning with "awesome"-type meta-lists of reusable data resources.
- Clickstream dataset with 53.5 billion web clicks from 100,000 anonymized users at Indiana University, useful for studying browsing behavior, recommendation, and network traffic patterns; included in an awesome curated datasets list.
- Indie Map provides a social graph and crawl data of prominent IndieWeb sites, cataloged in the Awesome Data Project as a specialized social network dataset for IndieWeb communities.