24 Matching Annotations
  1. Mar 2022
  2. Jan 2022
    1. Extracting a WARC record

      Once we’ve identified the offset and length of a particular record (in this case, an offset of 1260 bytes and a length of 1085 bytes), we can snip out an individual record like this:

      $ tail -c +1261 hello-world.warc | head -c 1085
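
The same extraction can be sketched in Python. This is a minimal illustration, assuming the offset is counted from zero (which is why `tail` is given `+1261`: it counts bytes starting at one). The function name `read_record` is hypothetical.

```python
def read_record(path, offset, length):
    """Return `length` bytes starting at the zero-based byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)           # skip everything before the record
        return f.read(length)    # read exactly one record's worth of bytes
```

With the numbers from the text, the call would be `read_record("hello-world.warc", 1260, 1085)`.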
      
    2. Making the WARC

      To create a WARC, we used wget:

      $ wget --warc-file hello-world http://iipc.github.io/warc-specifications/primers/web-archive-formats/hello-world.txt
      

      …which created the compressed hello-world.warc.gz file. These special block-compressed files are often used directly, but in this primer we uncompress the file so we can see what’s going on:

      $ gunzip hello-world.warc.gz
      

      …leaving us with hello-world.warc.
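
As a rough Python counterpart to `gunzip`, the sketch below builds a tiny two-member gzip stream in memory and decompresses it. It assumes that "block-compressed" means each record is gzipped as its own member and the members are then concatenated. The record contents are placeholders, not real WARC data.

```python
import gzip

# Two placeholder "records", each compressed as a separate gzip member.
records = [b"record one\n", b"record two\n"]
blob = b"".join(gzip.compress(r) for r in records)

# gzip.decompress reads through all concatenated members in turn,
# which is the behaviour gunzip relies on for a .warc.gz file.
plain = gzip.decompress(blob)
print(plain.decode())
```

Because every member is a complete gzip stream, a reader that knows where one member starts can decompress that record alone without touching the rest of the file.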

  3. Dec 2021
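    1. import warc
      
      # Python 2 code: StringIO and httplib became io and http.client in Python 3.
      from StringIO import StringIO
      from httplib import HTTPResponse
      
      class FakeSocket():
          """Wrap a raw response string so HTTPResponse can read it like a socket."""
          def __init__(self, response_str):
              self._file = StringIO(response_str)
          def makefile(self, *args, **kwargs):
              return self._file
      
      # Print the URL of every HTML response captured in the WARC file.
      for record in warc.open("eada.warc.gz"):
          if record.type == "response":
              resp = HTTPResponse(FakeSocket(record.payload.read()))
              resp.begin()  # parse the status line and headers
              # Exact match: a value with a charset parameter is skipped.
              if resp.getheader("content-type") == "text/html":
                  print record['WARC-Target-URI']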
      

      I sorted the output and came up with a nice list of URLs for the website. Here is a brief snippet:

      http://mith.umd.edu/eada/gateway/winslow.php
      http://mith.umd.edu/eada/gateway/winthrop.php
      http://mith.umd.edu/eada/gateway/witchcraft.php
      http://mith.umd.edu/eada/gateway/wood.php
      http://mith.umd.edu/eada/gateway/woolman.php
      http://mith.umd.edu/eada/gateway/yeardley.php
      http://mith.umd.edu/eada/guesteditors.php
      http://mith.umd.edu/eada/html/display.php?docs=acrelius_founding.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=alsop_character.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=arabic.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=ashbridge_account.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=banneker_letter.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=barlow_anarchiad.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=barlow_conspiracy.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=barlow_vision.xml&action=show
      http://mith.umd.edu/eada/html/display.php?docs=barlowe_voyage.xml&action=show
      
    2. $ wget --warc-file eada --mirror --page-requisites --adjust-extension --convert-links --wait 1 --execute robots=off --no-parent http://mith.umd.edu/eada/ > /dev/null 
      WARC output does not work with timestamping, timestamping will be disabled.
      Opening WARC file ‘eada.warc.gz’.
      
      --2021-12-29 17:43:08--  http://mith.umd.edu/eada/
      Resolving mith.umd.edu (mith.umd.edu)... 174.129.6.250
      Connecting to mith.umd.edu (mith.umd.edu)|174.129.6.250|:80... connected.
      HTTP request sent, awaiting response... 301 Moved Permanently
      Location: https://mith.umd.edu/eada/ [following]
      
           0K                                                       100% 25,6M=0s
      
      --2021-12-29 17:43:10--  https://mith.umd.edu/eada/
      Connecting to mith.umd.edu (mith.umd.edu)|174.129.6.250|:443... connected.
      HTTP request sent, awaiting response... 301 Moved Permanently
      Location: https://archive.mith.umd.edu/eada/ [following]
      
           0K                                                       100% 41,1M=0s
      
      --2021-12-29 17:43:11--  https://archive.mith.umd.edu/eada/
      Resolving archive.mith.umd.edu (archive.mith.umd.edu)... 174.129.6.250
      Connecting to archive.mith.umd.edu (archive.mith.umd.edu)|174.129.6.250|:443... connected.
      HTTP request sent, awaiting response... 301 Moved Permanently
      Location: http://eada.lib.umd.edu [following]
      
           0K                                                       100% 42,9M=0s
      
      --2021-12-29 17:43:13--  http://eada.lib.umd.edu/
      Resolving eada.lib.umd.edu (eada.lib.umd.edu)... 129.2.19.174
      Connecting to eada.lib.umd.edu (eada.lib.umd.edu)|129.2.19.174|:80... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 5210 (5,1K) [text/html]
      Saving to: ‘mith.umd.edu/eada/index.html’
      
           0K .....                                                 100%  447M=0s
      
      2021-12-29 17:43:13 (447 MB/s) - ‘mith.umd.edu/eada/index.html’ saved [5210/5210]
      
      FINISHED --2021-12-29 17:43:13--
      Total wall clock time: 5,2s
      Downloaded: 1 files, 5,1K in 0s (447 MB/s)
      Converting links in mith.umd.edu/eada/index.html... 2-7
      Converted links in 1 files in 0,001 seconds.
      
    1. WET Response Format

      As many tasks only require textual information, the CommonCrawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

      WARC/1.0
      WARC-Type: conversion
      WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
      WARC-Date: 2013-12-04T15:30:35Z
      WARC-Record-ID: 
      WARC-Refers-To: 
      WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
      Content-Type: text/plain
      Content-Length: 5765
      
      
      ...Text Content...
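
The layout described above (header lines, a blank line, then exactly `Content-Length` bytes of text) can be read with a few lines of Python. This is a minimal sketch, assuming CRLF line endings and a single record held in memory. The function `parse_record` and the sample record are made up for illustration.

```python
def parse_record(raw):
    """Split one WARC/WET record into (version, headers, body)."""
    head, _, rest = raw.partition(b"\r\n\r\n")    # a blank line ends the headers
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                            # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")      # split at the first colon only
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]        # body is exactly `length` bytes


# Made-up sample record, much shorter than a real one.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world\r\n\r\n"
)
version, headers, text = parse_record(sample)
print(headers["WARC-Target-URI"], len(text))
```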
      
    2. WAT Response Format

      WAT files contain important metadata about the records stored in the WARC format. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

      This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.

      The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.

          Envelope
              WARC-Header-Metadata
              Payload-Metadata
                  HTTP-Response-Metadata
                      Headers
                      HTML-Metadata
                          Head
                              Title
                              Scripts
                              Metas
                              Links
                          Links
              Container
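
For inspecting the whitespace-stripped JSON, Python's standard `json` module is one of the pretty-print options. The sketch below also prints just the nested key names, which gives a skeleton similar to the one above. The `compact` string is a made-up, heavily trimmed stand-in for a WAT entry, and `outline` is a hypothetical helper.

```python
import json


def outline(node, depth=0):
    """Yield the nested key names of a JSON value, indented by depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield "    " * depth + key
            yield from outline(value, depth + 1)
    elif isinstance(node, list):
        for item in node:
            yield from outline(item, depth)


# Made-up stand-in for one compact WAT JSON document.
compact = (
    '{"Envelope":{"WARC-Header-Metadata":{"WARC-Type":"response"},'
    '"Payload-Metadata":{"HTTP-Response-Metadata":{"Headers":{"Server":"nginx"}}}}}'
)

doc = json.loads(compact)
print(json.dumps(doc, indent=2))    # human-readable rendering of the same data
print("\n".join(outline(doc)))      # key skeleton only
```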
      
    3. WARC Format

      The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

      For the HTTP responses themselves, the raw response is stored. This includes not only the response body (what you would get if you downloaded the file) but also the HTTP header information, which can be used to glean a number of interesting insights.

      In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received an HTML page in response. We can also see the page was served from the nginx web server and that a special header has been added, X-hacker, purely for the purposes of advertising to a very specific audience of programmers who might look at the HTTP headers!

      WARC/1.0
      WARC-Type: response
      WARC-Date: 2013-12-04T16:47:32Z
      WARC-Record-ID: 
      Content-Length: 73873
      Content-Type: application/http; msgtype=response
      WARC-Warcinfo-ID: 
      WARC-Concurrent-To: 
      WARC-IP-Address: 23.0.160.82
      WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
      WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
      WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU
      
      HTTP/1.0 200 OK
      Server: nginx
      Content-Type: text/html; charset=UTF-8
      Vary: Accept-Encoding
      Vary: Cookie
      X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
      Content-Encoding: gzip
      Date: Wed, 04 Dec 2013 16:47:32 GMT
      Content-Length: 18953
      Connection: close
      
      
      ...HTML Content...
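
Because the block of a response record is a raw HTTP response, Python's standard `http.client` can parse its headers. The sketch below assumes the block is already in memory as bytes. `FakeSocket` is a small stand-in class, and the sample block is a shortened, made-up version of the record above.

```python
from io import BytesIO
from http.client import HTTPResponse


class FakeSocket:
    """Minimal socket stand-in so HTTPResponse can parse bytes held in memory."""

    def __init__(self, data):
        self._file = BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_http_response(block):
    resp = HTTPResponse(FakeSocket(block))
    resp.begin()                  # reads the status line and the headers
    return resp


# Shortened, made-up stand-in for the block of a response record.
block = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: nginx\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Vary: Accept-Encoding\r\n"
    b"Vary: Cookie\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)
resp = parse_http_response(block)
body = resp.read()                # the payload that follows the headers
print(resp.status, resp.getheader("Server"))
print(resp.getheader("Vary"))     # repeated headers come back joined
```

Note that the `Content-Type` value carries a charset parameter here, so a check for HTML pages is safer with `startswith("text/html")` than with an exact comparison.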
      
  4. Oct 2020
  5. Oct 2018
    1. InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, builds a CDXJ index with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.
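
The benefit of splitting header and payload can be illustrated with a toy content-addressable store. This is only a sketch of the idea: it does not use ipwb's or IPFS's actual API, a dictionary keyed by SHA-256 stands in for the IPFS network, and `ContentStore`, `split_response`, and the sample captures are all hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: blobs are keyed by their SHA-256 hash."""

    def __init__(self):
        self.blobs = {}

    def add(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data        # identical data always maps to the same key
        return key

    def get(self, key):
        return self.blobs[key]


def split_response(block):
    """Split an HTTP response into header bytes and payload bytes."""
    header, _, payload = block.partition(b"\r\n\r\n")
    return header, payload


store = ContentStore()
index = []   # (header_key, payload_key) pairs, standing in for index entries

# Two made-up captures of the same page on different days.
captures = [
    b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Oct 2018 10:00:00 GMT\r\n\r\n<p>same page</p>",
    b"HTTP/1.1 200 OK\r\nDate: Tue, 02 Oct 2018 10:00:00 GMT\r\n\r\n<p>same page</p>",
]
for block in captures:
    header, payload = split_response(block)
    index.append((store.add(header), store.add(payload)))

# Replay: fetch header and payload by hash and join them again.
header_key, payload_key = index[0]
replayed = store.get(header_key) + b"\r\n\r\n" + store.get(payload_key)
print(len(store.blobs))   # counts distinct blobs actually stored
```

The headers differ between captures because of the `Date` line, so hashing whole responses would store the page twice. Hashing the payload separately lets both captures share one stored copy.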
  6. Sep 2018
    1. The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is an extension of the ARC file format (ARC) that has traditionally been used to store “web crawls” as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content.
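
Since records are simply concatenated, a reader can walk an uncompressed WARC file by parsing each header, skipping `Content-Length` bytes of block, and repeating. The sketch below is a minimal illustration, assuming CRLF line endings and two CRLFs closing each record. `index_records` and `make_record` are hypothetical helpers, and the records are made up.

```python
def index_records(data):
    """Yield (offset, length) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\r\n\r\n", pos) + 4     # end of the header block
        headers = data[pos:header_end].decode("utf-8").split("\r\n")
        content_length = next(
            int(h.split(":", 1)[1])
            for h in headers
            if h.lower().startswith("content-length:")
        )
        end = header_end + content_length + 4             # block plus closing CRLF CRLF
        yield pos, end - pos
        pos = end                                         # next record starts here


def make_record(body):
    # Hypothetical helper that builds a minimal record for the demo.
    header = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: %d\r\n\r\n" % len(body)
    return header + body + b"\r\n\r\n"


first = make_record(b"first")
second = make_record(b"second block")
for offset, length in index_records(first + second):
    print(offset, length)
```

Each (offset, length) pair is the kind of value needed to snip a single record out of the file.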
  7. Apr 2018