53.5B Web Clicks of 100K Users in Indiana University

A clickstream dataset of about 53.5 billion web clicks from roughly 100,000 anonymized users at Indiana University, useful for studying browsing behavior, recommendation systems, and network traffic patterns. It is listed in a curated "awesome" datasets collection.

Overview

This is a large-scale clickstream dataset of approximately 53.5 billion HTTP requests collected at Indiana University (IU), capturing the web browsing activity of about 100,000 anonymized users. The data supports research on web traffic networks, user browsing behavior, recommendation systems, and network traffic patterns.

Data was captured at IU’s border router using packet-level collection, enabling analysis of real user navigation paths (via referrer information) while minimizing biases typical of server logs or browser-based data.

Features

Scope and Scale

  • ~53.5 billion HTTP GET requests
  • ~60 million requests per day
  • ~30 GB/day of raw traffic during collection
  • Collection period: September 2006 – May 2010
  • Data is missing for about 275 days in total (98 days in the raw collection, 179 in raw-url)
  • Compressed size: approximately 0.85 TB (raw) + 1.5 TB (raw-url)

Collections

  1. raw collection

    • ~25 billion requests
    • Referrer: only host name of the referrer retained
    • Time span: 26 Sep 2006 – 3 Mar 2008
    • Missing 98 days of data (including entire June 2007)
    • Size: ~0.85 TB, compressed
  2. raw-url collection

    • ~28.6 billion requests
    • Referrer: full referrer URL retained
    • Time span: 3 Mar 2008 – 31 May 2010
    • Missing 179 days of data, including full months of Dec 2008, Jan 2009, Feb 2009
    • Size: ~1.5 TB, compressed
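
As a rough consistency check on the figures above, the per-collection numbers can be summed and compared with the headline totals. The sketch below uses only the values quoted in this listing; the dictionary layout and variable names are illustrative. The sums come to 277 missing days and 2.35 TB, in line with the rounded "about 275 days" and the two quoted sizes. The average per collecting day it derives (roughly 50 million) is a crude estimate over the whole window and need not match the quoted ~60 million requests per day.

```python
from datetime import date

# Figures copied from the two collection descriptions above.
collections = {
    "raw": {"requests_billion": 25.0, "missing_days": 98, "size_tb": 0.85},
    "raw-url": {"requests_billion": 28.6, "missing_days": 179, "size_tb": 1.5},
}

total_requests_billion = sum(c["requests_billion"] for c in collections.values())
total_missing_days = sum(c["missing_days"] for c in collections.values())
total_size_tb = sum(c["size_tb"] for c in collections.values())

# Calendar days from 26 Sep 2006 through 31 May 2010, both ends included.
span_days = (date(2010, 5, 31) - date(2006, 9, 26)).days + 1
collecting_days = span_days - total_missing_days

# Crude average: total requests spread evenly over days with data.
avg_million_per_day = total_requests_billion * 1000 / collecting_days

print(f"requests: ~{total_requests_billion:.1f} billion")
print(f"missing days: {total_missing_days}")
print(f"compressed size: ~{total_size_tb:.2f} TB")
print(f"collecting days: {collecting_days}")
print(f"average: ~{avg_million_per_day:.0f} million requests/day")
```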

Data Collection Methodology

  • Source: Mirror of traffic passing through Indiana University’s border router
  • Filter: Berkeley Packet Filter matching all traffic to TCP port 80
  • Tooling: Long-running collection process using the pcap library
  • Extraction: Regular expressions applied to packet payloads to identify HTTP GET requests
  • Only requests are logged; server responses are not analyzed
  • No TCP stream reassembly performed
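
The extraction step can be illustrated with a minimal sketch. The regular expressions actually used during collection are not given here, so the patterns, the `extract_get` helper, and the `BPF_FILTER` string below are all assumptions made for illustration. Each payload is treated as a single packet, so a request whose headers span several packets would be only partly recovered, which mirrors the "no stream reassembly" limitation.

```python
import re
from typing import Dict, Optional

# Assumed pcap filter expression for "all traffic to TCP port 80".
BPF_FILTER = "tcp dst port 80"

# Hypothetical patterns; the original collection's expressions are not published here.
_REQUEST_LINE = re.compile(rb"GET (\S+) HTTP/1\.[01]\r?\n")
_HOST = re.compile(rb"^Host:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)
_REFERER = re.compile(rb"^Referer:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def extract_get(payload: bytes) -> Optional[Dict[str, Optional[bytes]]]:
    """Return path, host, and referrer of a GET request found at the start
    of a single packet payload, or None if the payload is not a GET."""
    request = _REQUEST_LINE.match(payload)
    if request is None:
        return None
    # Search the header lines that follow the request line.
    host = _HOST.search(payload, request.end())
    referer = _REFERER.search(payload, request.end())
    return {
        "path": request.group(1),
        "host": host.group(1) if host else None,
        "referrer": referer.group(1) if referer else None,
    }


payload = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: example.org\r\n"
    b"Referer: http://www.google.com/search?q=iu\r\n"
    b"\r\n"
)
print(extract_get(payload))
```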

Recorded Fields per Request

Each logged record includes:

  • Timestamp (32-bit Unix epoch in seconds, little-endian)
  • Requested URL (split into host and path fields)
  • Referring URL / host (depending on collection: hostname only or full URL)
  • User agent classification:
    • Browser ("B") vs. other/unknown/bot ("?")
  • Direction flag (traffic origin/destination relative to IU):
    • "I" – External traffic to IU (outside → IU)
    • "O" – Internal traffic from IU to the outside (IU → outside)

Traffic Coverage and Sampling Notes

  1. Inside vs. outside IU traffic

    • "Outside IU" traffic: requests from outside IU for pages inside IU
    • "Inside IU" traffic: requests from people at IU (~100,000 users) for resources outside IU
    • These two sets have different and important sampling biases
  2. Anonymization and privacy

    • No client-identifying data retained:
      • No MAC addresses
      • No IP addresses
      • No unique client indices
  3. Limitations

    • No stream reassembly; only individual packets with GET requests
    • Server responses not logged
    • Only HTTP (port 80) traffic captured

File Organization and Format

  • Dataset is broken into hourly files

  • First line of each file: a set of flags (can be ignored by most users)

  • Each record has the structure:

    XXXXADreferrer
    host
    path
    

    Where:

    • XXXX – 32-bit Unix epoch timestamp (seconds, little-endian)
    • A – User-agent flag:
      • "B" = browser
      • "?" = other (including bots)
    • D – Direction flag:
      • "I" = external traffic to IU
      • "O" = internal traffic from IU to outside
    • referrer – Referrer hostname or full URL (newline-terminated)
    • host – Target hostname (newline-terminated)
    • path – Target path (newline-terminated)
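
A reader for this record layout might look like the sketch below. It assumes the timestamp is unsigned, that the flags line ends with a newline, and that text fields can be decoded as Latin-1; none of these details are confirmed by the description above. The `Click` type and `parse_clicks` function are illustrative names. Because the 4-byte timestamp can itself contain a newline byte, the parser reads it by position rather than splitting the file on newlines.

```python
import struct
from datetime import datetime, timezone
from typing import Iterator, NamedTuple


class Click(NamedTuple):
    timestamp: datetime
    agent: str      # "B" = browser, "?" = other
    direction: str  # "I" = outside -> IU, "O" = IU -> outside
    referrer: str
    host: str
    path: str


def parse_clicks(data: bytes) -> Iterator[Click]:
    """Yield records from the contents of one hourly file."""
    header_end = data.find(b"\n")
    if header_end == -1:
        return  # no flags line, nothing to parse
    pos = header_end + 1  # skip the flags line
    while pos + 6 <= len(data):
        # 4-byte little-endian epoch seconds (assumed unsigned).
        (epoch,) = struct.unpack_from("<I", data, pos)
        agent = chr(data[pos + 4])
        direction = chr(data[pos + 5])
        pos += 6
        fields = []
        for _ in range(3):  # referrer, host, path
            end = data.find(b"\n", pos)
            if end == -1:
                return  # truncated record at end of file
            fields.append(data[pos:end].decode("latin-1"))
            pos = end + 1
        when = datetime.fromtimestamp(epoch, tz=timezone.utc)
        yield Click(when, agent, direction, *fields)


# Synthetic sample: a flags line followed by two records. The second
# timestamp deliberately contains newline bytes (0x0A).
sample = (
    b"FLAGS\n"
    + struct.pack("<I", 1204502400) + b"BO"
    + b"www.google.com\nexample.org\n/index.html\n"
    + struct.pack("<I", 0x470A0A0A) + b"?I"
    + b"example.com\nwww.indiana.edu\n/\n"
)
clicks = list(parse_clicks(sample))
for click in clicks:
    print(click)
```

For a real hourly file, the bytes could be read with `open(path, "rb").read()` after decompression; the file naming and compression scheme are not described here.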

Potential Applications

  • Modeling the structure and dynamics of web traffic networks
  • Studying user browsing behavior and navigation paths
  • Developing and evaluating recommendation systems
  • Analyzing network traffic patterns and workload characteristics
  • Designing and optimizing network infrastructure, websites, and server software
  • Forecasting traffic trends
  • Classifying websites by activity patterns
  • Improving ranking algorithms for search results
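
As one illustration of the first application, a weighted host-level traffic graph can be built by counting referrer-to-target transitions. The sketch below is a minimal example on made-up records, not a method taken from the dataset's documentation. The `host_graph` function, the tuple layout, and the choice to skip records with an empty referrer are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

# (agent flag, direction flag, referrer host, target host)
Record = Tuple[str, str, str, str]


def host_graph(records: Iterable[Record],
               agent: str = "B",
               direction: str = "O") -> Counter:
    """Count clicks per (referrer host, target host) edge, keeping only
    records with the given user-agent and direction flags."""
    edges: Counter = Counter()
    for rec_agent, rec_direction, referrer, host in records:
        if rec_agent != agent or rec_direction != direction:
            continue
        if not referrer:
            continue  # assumed: an empty referrer means no known source page
        edges[(referrer, host)] += 1
    return edges


# Made-up records for illustration only.
records = [
    ("B", "O", "www.google.com", "example.org"),
    ("B", "O", "www.google.com", "example.org"),
    ("B", "O", "example.org", "example.net"),
    ("?", "O", "www.google.com", "example.org"),   # non-browser, skipped
    ("B", "I", "example.com", "www.indiana.edu"),  # inbound, skipped
    ("B", "O", "", "example.org"),                 # no referrer, skipped
]
graph = host_graph(records)
for (source, target), weight in graph.most_common():
    print(f"{source} -> {target}: {weight}")
```

Restricting the graph to browser traffic leaving IU is one possible design choice; inbound traffic is sampled differently and would usually be analyzed separately.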

Source

  • More information and access: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/
Information

Website: cnets.indiana.edu
Published: Dec 30, 2025

Categories

Datasets

Tags

#datasets
#internet
#big-data

Similar Products

3.5B Web Pages from CommonCrawl 2012

Large-scale web crawl dataset containing 3.5 billion web pages from CommonCrawl (2012), suitable for web mining, search, and network analysis research. Listed as part of an awesome-style collection of computer networks datasets.

CAIDA Internet Datasets

Collection of Internet measurement and topology datasets from CAIDA, covering traffic traces, topology, routing, and security data. Frequently referenced in awesome data/networking lists for large-scale Internet research.

CommonCrawl Web Data

Open repository of petabyte-scale web crawl data spanning multiple years, offering raw web page and metadata for large-scale analytics. A canonical item in many awesome datasets and web data directories.

B3FD - Biometrically Filtered Famous Figure Dataset for Age Estimation

A facial age and gender estimation dataset with approximately 375k images of famous figures, biometrically filtered to improve label quality. Indexed within an awesome machine learning datasets collection.

Brain Catalogue

An open resource providing high-resolution 3D reconstructions and anatomical data of brains from multiple species for comparative neuroanatomy and neuroscience research.

Canada Science and Technology Museums Corporation Open Data

An open data portal from the Canada Science and Technology Museums Corporation, listing machine‑readable datasets about the collections and activities of Canadian science and technology museums, cataloged within the Awesome Public Datasets museums section.
