Friday, April 26, 2024

Tag: Internet Archive

The last 22 years of UK politics just became searchable online

Jeff Overs/BBC News & Current Affairs via Getty Images

Britain's GOV.UK portal has been online since Netscape Navigator 2.0 was state of the art. Now, a National Archives project to make this trove of historical content more accessible has shifted 22 years' worth of government websites to the cloud, where they have been re-indexed and made searchable through its updated UK Government Web Archive.

The archive provides valuable historical insight into the changing policies and attitudes of Britain's official government communications, and there's a trove of information to be found for anyone with an interest in the finer detail of government publications. For example, a search for Brexit returns 19,043 results, the first of which is a 2014 upload of a higher education funding presentation originally produced in April 2013, three months after then-Prime Minister David Cameron announced that the government would hold a referendum on EU membership.

Over a period of two weeks, 120TB of the British government's archived GOV.UK web data was transferred from 72 individual two-terabyte hard disks to a pair of physical AWS Snowball transfer devices, which were then dispatched to one of Amazon's UK cloud storage facilities, where The National Archives' websites and content are hosted. Everything had to be indexed and fully text searchable, which meant that MirrorWeb had to develop new tools. "We attempted to use traditional Hadoop tooling but found it to be impractical for big data sets stored in the cloud," explains MirrorWeb CTO Philip Clegg.

By comparison, archives of official government sites such as Your Vote Matters date from as recently as 2018.

Given that GOV.UK's official deletion policy means that content can be removed if it "was published by mistake" or "if it could result in a risk to health, finances or reputation", it's probably safe to bet that we won't be seeing any official return of that compromising tweet. That's going to change, though.