Dataset: Whole-of-Australian Government Web Crawl


Description

The dataset includes publicly accessible, human-readable material from Australian government websites. The crawl used the Australian Government Organisations Register (AGOR) as its seed list and domain categorisation source, obeyed robots.txt and sitemap.xml directives, and gathered content over a 10-day period.

Several non-*.gov.au domains are included in the AGOR; these have been crawled up to a limit of 10,000 URLs.
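
For readers who want to separate the *.gov.au material from the other AGOR domains, the following is a minimal sketch in Python. The function name `is_gov_au` is hypothetical and is not part of the dataset or its tooling; the check works only on the URL's host name.

```python
from urllib.parse import urlsplit


def is_gov_au(url: str) -> bool:
    """Return True when the URL's host is gov.au or a subdomain of it.

    Hypothetical helper for illustration; not part of the published dataset.
    """
    # hostname is None for URLs without a network location, hence the fallback.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Require the leading dot so that hosts such as "notgov.au" do not match.
    return host == "gov.au" or host.endswith(".gov.au")
```

Applied to the target URL of each record, this would split the crawl into the two groups described above.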

Several binary file formats are included and have been converted to HTML:
doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx
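
The list above can be used to flag records that probably originated as binary documents. The sketch below matches on the URL's file extension, which is an assumption: the crawler may have identified formats by content type instead, so treat the result as a hint only. The names `CONVERTED_EXTENSIONS` and `converted_extension` are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions listed in the dataset description as converted to HTML.
CONVERTED_EXTENSIONS = frozenset(
    "doc,docm,docx,dot,epub,keys,numbers,pages,pdf,"
    "ppt,pptm,pptx,rtf,xls,xlsm,xlsx".split(",")
)


def converted_extension(url):
    """Return the lower-cased extension if it is in the converted list, else None.

    Hypothetical helper; it inspects only the URL path, not the response.
    """
    path = urlsplit(url).path  # drops any query string or fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return ext if ext in CONVERTED_EXTENSIONS else None
```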

URLs returning responses larger than 10MB are not included in the dataset.

Raw gathered data (including metadata) is published in the Web ARChive (WARC) format, both as a single multi-gigabyte WARC file and as a split series.
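
As an illustration of how WARC records are laid out, here is a minimal reader sketch using only the Python standard library. It assumes an uncompressed stream and simple single-line headers, and it is not the tooling used to produce this dataset; the function name `iter_warc_records` is hypothetical.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    A simplified sketch: header continuation lines are not handled.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the size of the record block in bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Tiny in-memory example with a made-up record.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://www.example.gov.au/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample)))
```

If the published files are gzip-compressed, the stream would need to be opened with `gzip.open` first. For real work, a maintained WARC library is likely a better choice than this sketch.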

Metadata extracted from pages after filtering is published in JSON format, with fields defined in a data dictionary.

Licence

Web content contained within these WARC files was originally authored by the agency hosting the referenced material. Each authoring agency is responsible for the choice of licence attached to its original material.

A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.

General Information

Distributions