

Clickstream dataset with 53.5 billion web clicks from 100,000 anonymized users at Indiana University, useful for studying browsing behavior, recommendation, and network traffic patterns; included in an awesome curated datasets list.
A large-scale clickstream dataset of approximately 53.5 billion HTTP requests collected at Indiana University (IU), capturing web browsing activity of about 100,000 anonymized users. The data supports research on web traffic networks, user browsing behavior, recommendation systems, and network traffic patterns.
Data was captured at IU’s border router using packet-level collection, enabling analysis of real user navigation paths (via referrer information) while minimizing biases typical of server logs or browser-based data.
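As a rough illustration of how referrer information can be turned into navigation paths, the sketch below counts host-to-host transitions from (referrer, target host) pairs. The function names referrer_host and host_transitions are hypothetical, the example hostnames are made up, and the handling of empty referrers (skipped here) is an assumption rather than a documented convention of the dataset.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_host(referrer: str) -> str:
    """Return the hostname of a referrer given as a bare host or a full URL."""
    if "://" in referrer:
        # Full URL: let urlsplit extract (and lowercase) the hostname.
        return urlsplit(referrer).hostname or ""
    # Bare hostname, possibly followed by a path.
    return referrer.split("/", 1)[0].lower()


def host_transitions(pairs):
    """Count referrer-host -> target-host transitions, skipping empty referrers."""
    edges = Counter()
    for referrer, host in pairs:
        source = referrer_host(referrer)
        if source:
            edges[(source, host.lower())] += 1
    return edges


# Made-up (referrer, target host) pairs for illustration only.
pairs = [
    ("a.example", "b.example"),
    ("http://a.example/page", "b.example"),
    ("", "c.example"),
    ("b.example", "c.example"),
]
edges = host_transitions(pairs)
```

The resulting counter can be read as a weighted, directed graph of hosts, where each edge weight is the number of observed clicks from one host to another.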
raw collection
raw-url collection
Each logged record includes:
Target URL (host and path fields)
User-agent flag: browser ("B") vs. other/unknown/bot ("?")
Direction flag (inside vs. outside IU traffic):
"I" – External traffic to IU (outside → IU)
"O" – Internal traffic from IU to the outside (IU → outside)
Anonymization and privacy
Limitations
Dataset is broken into hourly files
First line of each file: a set of flags (can be ignored by most users)
Each record has the structure:
XXXXADreferrer
host
path
Where:
XXXX – 32-bit Unix epoch timestamp (seconds, little-endian)
A – User-agent flag: "B" = browser, "?" = other (including bots)
D – Direction flag: "I" = external traffic to IU, "O" = internal traffic from IU to outside
referrer – Referrer hostname or full URL (newline-terminated)
host – Target hostname (newline-terminated)
path – Target path (newline-terminated)
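The record layout described above could be read with a short parser along the following lines. This is a minimal sketch, not an official reader: the names Click and read_clicks are hypothetical, the sample bytes are made up, and decoding text fields as Latin-1 is an assumption chosen only so that arbitrary bytes never raise an error.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple


class Click(NamedTuple):
    timestamp: int  # Unix epoch seconds
    agent: str      # "B" = browser, "?" = other (including bots)
    direction: str  # "I" = external traffic to IU, "O" = IU to outside
    referrer: str
    host: str
    path: str


def read_clicks(stream: BinaryIO) -> Iterator[Click]:
    """Yield records from one hourly file opened in binary mode."""
    stream.readline()  # first line holds flags; ignored here
    while True:
        header = stream.read(6)  # 4-byte timestamp + agent flag + direction flag
        if len(header) < 6:
            return  # end of file (or a truncated trailing record)
        timestamp, agent, direction = struct.unpack("<I1s1s", header)
        # The three text fields are each terminated by a newline.
        referrer, host, path = (
            stream.readline().rstrip(b"\n").decode("latin-1") for _ in range(3)
        )
        yield Click(
            timestamp,
            agent.decode("latin-1"),
            direction.decode("latin-1"),
            referrer,
            host,
            path,
        )


# Made-up sample: a flags line followed by one record.
sample = (
    b"flags\n"
    + struct.pack("<I", 2570)  # timestamp bytes are 0a 0a 00 00
    + b"BO"
    + b"a.example\nb.example\n/index.html\n"
)
clicks = list(read_clicks(io.BytesIO(sample)))
```

The fixed-size header is read by byte count rather than by line because the binary timestamp can itself contain the newline byte (as in the sample, where 2570 encodes to two 0x0A bytes), so splitting the whole file on newlines would misalign records.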