The web analytics organization that began donating its crawl data to the Internet Archive in 1996 is Alexa Internet (often referred to as Alexa.com). Alexa supplied this corporate crawl data to the Internet Archive's "Worldwide Web Crawl" project, and collection metadata sometimes labels its source as "Alexa" or "Special" [1]. According to collection metadata on the Internet Archive's own pages, the "DR Crawl data" refers specifically to a series of "DomainRank" crawls that Alexa conducted between 1999 and 2003 [1]. Rather than attempting a comprehensive breadth-first crawl of the entire web, these crawls focused on top-level domains (TLDs) and used a "DomainRank" metric to prioritize which sites and pages to fetch [1][2].
The public accessibility status of this "DR" subset is currently **limited**. The raw ARC/WARC crawl files exist in the Internet Archive's storage, and some derived datasets (such as a graph of link structures) have been made available for research, but the full text and HTML content of these specific crawls is **not publicly accessible** for general browsing or bulk download [1][2]. Access is restricted in part because of Alexa's original donation agreement, which included provisions for respecting robots.txt directives and potentially other contractual terms [1]. The metadata indicates that most of Alexa's general crawl data does flow into the Wayback Machine's public index, whereas the distinct "DR" (DomainRank) series remains a separate, restricted collection used primarily for research under specific agreements rather than being integrated into the public-facing archive [2].
## References
[1] Internet Archive Blogs, "Researching with Web Data: Alexa Crawls," archive.org.
[2] Internet Archive Collection, "Alexa DomainRank Web Crawl, 1999-2003," archive.org/details/alexadomainrank.
**Alexa Internet** is the web analytics organization that began donating its crawl data to the Internet Archive in 1996. This data forms a significant portion of the Wayback Machine's holdings, with crawls conducted regularly since that year.[5]
The **'DR' Crawl data** refers to Alexa's **Domain Reference (DR) crawls**, a subset focused on broad domain-level captures rather than full-site indexing. Metadata from Internet Archive collections indicates these crawls started in 1996, with examples like the "Alexa DR Crawl - 1996-11" containing early snapshots donated directly by Alexa.[5] Most Alexa data is integrated into the main Wayback Machine, where it can be viewed publicly through URL searches (e.g., http://web.archive.org/web/*/[domain]). The full raw crawl datasets, stored as WARC files, are housed in specialized collections.
Current public accessibility status for **'DR' data** is **partially restricted**. Processed snapshots from DR crawls are publicly viewable in the Wayback Machine calendar interface, color-coded by HTTP status (blue for 2xx success, green for redirects, etc.).[2] However, the complete raw 'DR' WARC files and metadata are available only to **qualified researchers** via the Internet Archive's Research Datasets program. Access requires an application and approval based on research merit, followed by on-site or remote download under non-commercial terms; these conditions reflect the data volume (terabytes) and privacy considerations. General users cannot download bulk DR crawls directly: queries to the CDX API return only lists of captures, not the archived content in bulk.[1][3]
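
To make the point about the CDX API concrete, here is a minimal Python sketch of what a capture-list query looks like. It assumes the public Wayback CDX endpoint at `http://web.archive.org/cdx/search/cdx` with its `url`, `output`, `limit`, `from`, and `to` parameters. The function names and the sample response are hypothetical and written by hand for illustration; they are not real DR crawl records, and the sketch makes no network request.

```python
import json
from urllib.parse import urlencode

# Public Wayback CDX endpoint (assumed here; not specific to DR crawls).
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, year_from=None, year_to=None, limit=5):
    """Build a CDX query URL that asks for a JSON list of captures."""
    params = {"url": domain, "output": "json", "limit": limit}
    if year_from is not None:
        params["from"] = str(year_from)
    if year_to is not None:
        params["to"] = str(year_to)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(text):
    """Turn a CDX JSON response (header row + data rows) into dicts."""
    rows = json.loads(text)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record):
    """Build the public Wayback URL for one capture record."""
    return f"http://web.archive.org/web/{record['timestamp']}/{record['original']}"


# Hand-written, illustrative response: the values are made up,
# only the column layout follows the CDX JSON output format.
SAMPLE_RESPONSE = """[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "19961101000000", "http://example.com/", "text/html", "200", "HYPOTHETICALDIGEST", "1234"]
]"""

if __name__ == "__main__":
    print(build_cdx_query("example.com", 1996, 1997, limit=3))
    for rec in parse_cdx_json(SAMPLE_RESPONSE):
        print(rec["timestamp"], rec["statuscode"], snapshot_url(rec))
```

Each record is index metadata (timestamp, original URL, status code, digest, length) that points at a single capture. Retrieving content this way means following snapshot URLs one at a time, which is different from downloading the underlying crawl files in bulk.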
This setup balances preservation with controlled research access, as bulk releases like the 2011 80TB crawl were experimental and seeded from Alexa's top 1M list.[3] For specific DR metadata, search archive.org collections for "Alexa DR" to view summaries and sample WARCs where permitted.
## References
[1] https://sangaline.com/post/wayback-machine-scraper/
[2] https://help.archive.org/help/using-the-wayback-machine/
[3] https://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/
[5] https://help.archive.org/help/wayback-machine-general-information/