
    53.5B Web Clicks of 100K Users in Indiana University

    Clickstream dataset of 53.5 billion web clicks from 100,000 anonymized users at Indiana University, useful for studying browsing behavior, recommender systems, and network traffic patterns; included in a curated awesome-datasets list.


    Overview

    A large-scale clickstream dataset of approximately 53.5 billion HTTP requests collected at Indiana University (IU), capturing web browsing activity of about 100,000 anonymized users. The data supports research on web traffic networks, user browsing behavior, recommendation systems, and network traffic patterns.

    Data was captured at IU’s border router using packet-level collection, enabling analysis of real user navigation paths (via referrer information) while minimizing biases typical of server logs or browser-based data.

    Features

    Scope and Scale

    • ~53.5 billion HTTP GET requests
    • ~60 million requests per day
    • ~30 GB/day of raw traffic during collection
    • Collection period: September 2006 – May 2010
    • Data missing for about 275 days total
    • Approximately 0.85 TB (raw) + 1.5 TB (raw-url) compressed

    Collections

    1. raw collection

      • ~25 billion requests
      • Referrer: only host name of the referrer retained
      • Time span: 26 Sep 2006 – 3 Mar 2008
      • Missing 98 days of data (including entire June 2007)
      • Size: ~0.85 TB, compressed
    2. raw-url collection

      • ~28.6 billion requests
      • Referrer: full referrer URL retained
      • Time span: 3 Mar 2008 – 31 May 2010
      • Missing 179 days of data, including full months of Dec 2008, Jan 2009, Feb 2009
      • Size: ~1.5 TB, compressed

    Data Collection Methodology

    • Source: Mirror of traffic passing through Indiana University’s border router
    • Filter: Berkeley Packet Filter matching all traffic to TCP port 80
    • Tooling: Long-running collection process using the pcap library
    • Extraction: Regular expressions applied to packet payloads to identify HTTP GET requests
    • Only requests are logged; server responses are not analyzed
    • No TCP stream reassembly performed
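    The extraction step described above can be sketched in Python. The actual collector was a long-running process built on the pcap library, and its exact regular expressions are not published, so the pattern below is only an illustrative assumption:

    ```python
    import re

    # Illustrative pattern (an assumption, not the collector's actual regex):
    # match the request line of an HTTP GET plus its Host header somewhere
    # in a raw TCP payload captured on port 80.
    GET_RE = re.compile(
        rb"^GET (?P<path>\S+) HTTP/1\.[01]\r\n"   # request line
        rb"(?:[^\r\n]+\r\n)*?"                    # lazily skip intervening headers
        rb"Host: (?P<host>[^\r\n]+)\r\n",         # Host header
        re.MULTILINE,
    )

    def extract_get(payload: bytes):
        """Return (host, path) if the payload contains a GET request, else None."""
        m = GET_RE.search(payload)
        if m is None:
            return None
        return m.group("host").decode("latin-1"), m.group("path").decode("latin-1")
    ```

    Because no TCP stream reassembly is performed, a request whose headers span multiple packets would simply fail to match, which is consistent with the packet-level approach described above.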

    Recorded Fields per Request

    Each logged record includes:

    • Timestamp (32-bit Unix epoch in seconds, little-endian)
    • Requested URL (split into host and path fields)
    • Referring URL / host (depending on collection: hostname only or full URL)
    • User agent classification:
      • Browser ("B") vs. other/unknown/bot ("?")
    • Direction flag (traffic origin/destination relative to IU):
      • "I" – External traffic to IU (outside → IU)
      • "O" – Internal traffic from IU to the outside (IU → outside)

    Traffic Coverage and Sampling Notes

    1. Inside vs. outside IU traffic

      • "Outside IU" traffic: requests from outside IU for pages inside IU
      • "Inside IU" traffic: requests from people at IU (~100,000 users) for resources outside IU
      • These two sets have different and important sampling biases
    2. Anonymization and privacy

      • No client-identifying data retained:
        • No MAC addresses
        • No IP addresses
        • No unique client indices
    3. Limitations

      • No stream reassembly; only individual packets with GET requests
      • Server responses not logged
      • Only HTTP (port 80) traffic captured

    File Organization and Format

    • Dataset is broken into hourly files

    • First line of each file: a set of flags (can be ignored by most users)

    • Each record has the structure:

      XXXXADreferrer
      host
      path
      

      Where:

      • XXXX – 32-bit Unix epoch timestamp (seconds, little-endian)
      • A – User-agent flag:
        • "B" = browser
        • "?" = other (including bots)
      • D – Direction flag:
        • "I" = external traffic to IU
        • "O" = internal traffic from IU to outside
      • referrer – Referrer hostname or full URL (newline-terminated)
      • host – Target hostname (newline-terminated)
      • path – Target path (newline-terminated)
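    Reading the layout above literally, a minimal record parser might look like the following sketch. It assumes the fields are packed exactly as described (4-byte little-endian timestamp, two flag bytes, three newline-terminated strings); the text encoding is not specified, so Latin-1 is used as a byte-preserving guess:

    ```python
    import struct

    def parse_record(buf: bytes, offset: int = 0):
        """Parse one record starting at `offset`: a 4-byte little-endian
        Unix timestamp, the user-agent flag, the direction flag, then three
        newline-terminated strings (referrer, host, path).
        Returns (record_dict, offset_of_next_record)."""
        ts, agent, direction = struct.unpack_from("<Icc", buf, offset)
        pos = offset + 6
        fields = []
        for _ in range(3):                      # referrer, host, path
            end = buf.index(b"\n", pos)
            fields.append(buf[pos:end].decode("latin-1"))
            pos = end + 1
        referrer, host, path = fields
        return {
            "timestamp": ts,                    # seconds since the Unix epoch
            "is_browser": agent == b"B",        # "B" = browser, "?" = other/bot
            "direction": direction.decode(),    # "I" = into IU, "O" = IU to outside
            "referrer": referrer,
            "host": host,
            "path": path,
        }, pos
    ```

    Repeated calls starting at the returned offset would walk an hourly file, after skipping its first (flags) line.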

    Potential Applications

    • Modeling the structure and dynamics of web traffic networks
    • Studying user browsing behavior and navigation paths
    • Developing and evaluating recommendation systems
    • Analyzing network traffic patterns and workload characteristics
    • Designing and optimizing network infrastructure, websites, and server software
    • Forecasting traffic trends
    • Classifying websites by activity patterns
    • Improving ranking algorithms for search results

    Source

    • More information and access: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/

    Information

    Website: cnets.indiana.edu
    Published: Dec 30, 2025

    Categories

    Datasets

    Tags

    #datasets #internet #big-data

    Similar Products

    6 results

    3.5B Web Pages from CommonCrawl 2012

    Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.

    Awesome Astrodata

    Awesome list for astronomy data and resources for self-learning. Datasets, catalogs, and educational materials for exploring astronomical data.

    Awesome Cybersecurity Datasets

    A curated list of amazingly awesome cybersecurity datasets maintained by Santiago H. Ramos. Features network intrusion data, malware samples, botnet traffic, and web attack payloads used by universities and researchers worldwide. Approximately 1.9k stars and 326 forks.

    apd-core - NaturalLanguage section

    A curated Awesome-style sub-collection within the APD (Awesome Public Datasets) core repository that indexes multiple high‑quality natural language datasets and lexical resources via individual YAML meta files (e.g., SQuAD, Universal Dependencies, WordNet). It serves as a meta directory of links to external NLP datasets, aligning with the broader Awesome ecosystem as a directory-of-resources pattern.

    Featured

    Awesome 3D Semantic City Models

    A curated awesome-style list of open 3D semantic city and region models (e.g., CityGML datasets), providing a centralized directory of high-quality 3D urban data sources.

    Featured

    Awesome Data - Biology Datasets (Meta)

    A curated Awesome-style collection of biological and genomics datasets, including ENCODE, EMPIAR, Ensembl Genomes, GEO, Gene Ontology, GloBI, LINCS, HGDP, HMP, ICOS PSP Benchmark, HapMap, JCB DataViewer (via BioStudies), and KEGG. Each entry links out to the primary dataset resource along with a corresponding YAML metadata file in the awesomedata/apd-core GitHub repository, making this part of a larger meta collection of Awesome data directories.

    Featured