
CommonCrawl Web Data

Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web page content and metadata for large-scale analytics. A canonical entry in many awesome-style datasets and web data directories.

About this tool

URL: http://commoncrawl.org/the-data/get-started/

Overview

CommonCrawl Web Data is a public, petabyte-scale repository of web crawl data collected over many years. It provides raw web page content and associated metadata from regular crawls of the public web, enabling large-scale analytics, research, and machine learning applications.

Features

  • Long-term crawl history

    • Multiple named crawl releases from 2015 through 2025 (e.g., CC-MAIN-2025-51, CC-MAIN-2025-47, …, CC-MAIN-2015-32).
    • Enables longitudinal and time-based analyses of web content.
  • Petabyte-scale web corpus

    • Large-scale collection of raw web pages and metadata.
    • Suitable for big data processing, web mining, and training large models.
  • Public AWS S3 hosting

    • Data is stored in Amazon S3 in the us-east-1 region; S3-based access must be made from that region.
    • Accessible via s3://commoncrawl/... paths.
    • Many AWS services (e.g., EMR) can directly consume the S3 paths, often with wildcard support.
  • S3 access recommendations

    • Access from within us-east-1 to avoid inter-region data transfer charges and improve latency.
    • Be cautious about routing traffic through Elastic IPs or load balancers, which can incur additional charges for the routed traffic.
    • On non-EMR Hadoop clusters, use the S3A protocol (e.g., s3a://commoncrawl/...) for improved compatibility and performance.
  • HTTP/HTTPS access without AWS account

    • Data can be downloaded directly over HTTPS via URLs of the form:
      https://data.commoncrawl.org/[path_to_file]
    • Compatible with standard HTTP download tools such as cURL and wget.
    • No AWS account required for HTTP-based access.
  • AWS CLI integration

    • Data can be accessed and managed with the AWS Command Line Interface, pointed at the Common Crawl S3 bucket (see the example after this list).
    • Works with AWS services that support S3 as an input data source.
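
As a concrete illustration of S3-based access, the sketch below lists crawl releases and fetches one release's WARC path listing with the AWS CLI, then shows the equivalent S3A path for a non-EMR Hadoop cluster. The release name (CC-MAIN-2025-51) is taken from the releases named above; the warc.paths.gz file name assumes Common Crawl's usual crawl-data layout.

    # List available crawl releases (run from us-east-1; AWS credentials required)
    aws s3 ls s3://commoncrawl/crawl-data/ --region us-east-1

    # Fetch the WARC path listing for one release (file name assumes the
    # usual crawl-data layout)
    aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2025-51/warc.paths.gz . --region us-east-1

    # On a non-EMR Hadoop cluster, the same prefix is reachable via S3A
    hadoop fs -ls s3a://commoncrawl/crawl-data/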

Access Methods

  • From AWS (recommended for large-scale processing)

    • Region: us-east-1
    • S3 URL scheme: s3://commoncrawl/path_to_file
    • Hadoop (non-EMR): use s3a://commoncrawl/path_to_file.
  • From local machines or external clusters

    • HTTPS URL scheme: https://data.commoncrawl.org/path_to_file
    • Use tools like curl or wget to download files (see the example after this list).
    • No AWS account needed for this method.
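
For the HTTPS route, a minimal sketch using the tools named above; the path mirrors the S3 layout under https://data.commoncrawl.org/, and the specific file name is illustrative, assuming the usual crawl-data layout.

    # Download a release's WARC path listing over HTTPS (no AWS account needed)
    curl -O https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-51/warc.paths.gz

    # Equivalent download with wget
    wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-51/warc.paths.gz

    # Decompress and preview the first few WARC file paths
    gzip -dc warc.paths.gz | head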

Pricing

  • No explicit pricing plans are specified.
  • The noted cost considerations are AWS infrastructure charges incurred when accessing the data via S3 (e.g., inter-region data transfer, Elastic IP or load balancer traffic); these are billed by AWS, not by Common Crawl.

Information

Website: commoncrawl.org
Published: Dec 30, 2025

Categories

Themed Directories

Tags

#datasets
#web
#big-data

Similar Products

3.5B Web Pages from CommonCrawl 2012

Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.

ClueWeb09

Research corpus of about 1 billion web pages collected in 2009 by the Lemur Project, designed for information retrieval and web mining experiments and commonly listed in awesome datasets directories.

ClueWeb12

Large-scale web crawl dataset of 733 million web pages collected in 2012, maintained by the Lemur Project and widely used for IR research; referenced in awesome-style dataset listings.

LAW Network Datasets

The Laboratory for Web Algorithmics (LAW) at the University of Milan provides a structured collection of large-scale web and hyperlink graph datasets. This page acts as a directory of web graph and network datasets suitable for experiments in web mining, graph algorithms, and network analysis, aligning with "awesome"-type meta-lists of reusable data resources.

53.5B Web Clicks of 100K Users in Indiana University

Clickstream dataset with 53.5 billion web clicks from 100,000 anonymized users at Indiana University, useful for studying browsing behavior, recommendation, and network traffic patterns; included in an awesome curated datasets list.

Indie Map

Indie Map provides a social graph and crawl data of prominent IndieWeb sites, cataloged in the Awesome Data Project as a specialized social network dataset for IndieWeb communities.
