Planning a new CPAN module for WARC support (DSLIP: IdpOp)

by jcb (Pilgrim)
on Aug 08, 2019 at 23:59 UTC (#11104202)

jcb has asked for the wisdom of the Perl Monks concerning the following question:

After a long sojourn in the wilderness, I have returned to the Monastery with part of an API in hand and several questions for my fellow monks.

  1. Is there a better namespace for this than the top-level? If so, where?

    Archive:: seems to fit at first glance, but this module has a radically different interface from most of the modules in that namespace because WARC files store subtly different information. (Most archives store "files"; WARC files can store files, but are designed to store HTTP request/response exchanges.) An Archive::WARC interface could be reasonable, but it would provide a special "view" of a WARC file that omits most details. (Recognizing this was a major step in designing this API -- and took me a few years to do!)

    HTTP:: could be a possibility, but does not really fit because WARC files can also store information from other protocols. (The WARC spec envisions storing DNS records "as observed" as an example.)

    LWP:: fits the eventual goal of providing transparent access to WARC files as a sort of "local Wayback" but is probably better reserved for the interface modules that *actually* implement that "local Wayback" than the generic support for accessing and building WARC files. (The baseline WARC distribution uses the HTTP::* classes, but has no other dependencies on LWP and no dependencies in the LWP:: namespace.)

  2. Any problems with the use of "meaningful" constructors?

    The WARC::Collection and WARC::Volume modules provide read-only access to existing (collections of)? WARC files. The constructors have been given names to reflect this: WARC::Volume->mount and WARC::Collection->assemble.

    The use of "open" for a WARC::Volume constructor was considered, but cannot be used in the indirect object syntax that I prefer for a constructor due to a parse conflict with the "open" builtin that perl resolves by raising a parse error instead of looking for a class method.

    ("open WARC::File ($name)" would have been ideal, but looks too much like a typo using the "open" builtin.)

  3. How best to provide options on the "replay" method of WARC::Record?

    The current API envisions some means of retrieving the content of a WARC record as a file handle or string and another means of getting a reconstructed protocol response object. (An HTTP::Response in the usual case, but possibly something else.)

    Options also include whether or not to actually retrieve the request chain or to just synthesize a request from the information in the "response" record. (There is no point in reading several WARC records for a long redirect chain if the user only cares about the URL and the server's final response.) This is a significant concern because the common CDX index format only indexes response records.

  4. Should the tied hash and tied array interfaces for WARC::Fields be automatically invoked using overloaded dereference operators?

    Or is this asking for trouble?

  5. Is overloading the == (or <=>) operator on WARC::Record to use file:offset tuples as good an idea as it seems?

    This would be most useful to coalesce duplicate records from multiple indexes. Logically, two record objects that refer to the same physical record should compare as equal.

  6. What to do with a segmented record if we lack index information to find the next segment?

    WARC file names are normally systematic: we can probably guess the next WARC filename in "normal" cases, but there will always be edge cases where we have no idea.

    How far should I go in trying to make this Just Work? When the "It Just Works" logic fails, is it better to return an undefined value or raise an exception? And should we ensure that all segments are available when first opening a segmented payload or defer failure to when we actually "run out of road"?

  7. Should the WARC::Collection class have a concept of "next volume"?

    This would mean that $record->next on the last record in a file returns the first record in the next file.

    Related: Should WARC::Collection expose information about the set of volumes in a collection? If so, how?

  8. Any advice on attaching contents to WARC records?

    Simply keeping the contents in memory is not always an option -- WARC segmentation permits payloads of unlimited size.
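To make question 5 concrete, here is a minimal sketch of comparing records by file:offset tuples with overloaded operators. The Sketch::Record class, its constructor arguments, and the _compare helper are hypothetical stand-ins for illustration only; none of this is the planned WARC::Record API.

```perl
use strict;
use warnings;

package Sketch::Record;    # hypothetical stand-in, not the planned WARC::Record

use overload
    '<=>' => \&_compare,
    '=='  => sub { _compare(@_) == 0 };

sub new {
    my ($class, %args) = @_;
    return bless { file => $args{file}, offset => $args{offset} }, $class;
}

# Order by file name first, then by byte offset within that file.
sub _compare {
    my ($left, $right, $swapped) = @_;
    ($left, $right) = ($right, $left) if $swapped;
    return $left->{file} cmp $right->{file}
        || $left->{offset} <=> $right->{offset};
}

package main;

# The same physical record reported by two different indexes, plus one other.
my @found = (
    Sketch::Record->new(file => 'crawl-00001.warc.gz', offset => 4096),
    Sketch::Record->new(file => 'crawl-00000.warc.gz', offset => 512),
    Sketch::Record->new(file => 'crawl-00001.warc.gz', offset => 4096),
);

# Coalesce duplicates: after sorting, equal records are adjacent.
my @unique;
for my $record (sort { $a <=> $b } @found) {
    push @unique, $record unless @unique && $unique[-1] == $record;
}
printf "%s:%d\n", $_->{file}, $_->{offset} for @unique;
```

Two objects that were constructed separately but point at the same file and offset compare as equal here, which is the behavior the question asks about.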

Nothing is too trivial here: this is intended for CPAN and bikeshedding public APIs is the best way to avoid backwards compatibility becoming unpleasant later.

The modules are not ready for CPAN yet, mostly due to the still-lingering namespace question. Nor has any significant code been written yet, since I prefer to have a solid idea of the API before getting too involved in implementation. The rest of this node is a copy of the current documentation draft as formatted with pod2html: (internal links are probably broken, sorry)


NAME

WARC - Web ARChive support for Perl


SYNOPSIS

  use WARC;
  $collection = assemble WARC::Collection (@indexes);
  $record = $collection->search(url => $url, time => $when);
  $volume = mount WARC::Volume ($filename);
  $record = $volume->first_record;
  $next_record = $record->next;
  $record = $volume->record_at($offset);
  # $record is a WARC::Record object


DESCRIPTION

The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.

Overview of the WARC reader support modules

WARC::Collection
A WARC::Collection object represents a set of indexed WARC files.

WARC::Volume
A WARC::Volume object represents a single WARC file.

WARC::Record
Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.

WARC::Record::Payload
WARC::Record::Segment
WARC::Fields
A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved.

The key-value format used in WARC headers has its own MIME type ``application/warc-fields'' and is also usable as the contents of a ``warcinfo'' record and elsewhere. The WARC::Fields class also provides support for objects of this type.

WARC::Index
WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.

WARC::Index::CDX
Access module for the common CDX WARC index format.

WARC::Index::SDBM
Planned ``fast index'' format using ``SDBM_File'' to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is planned because it is included with Perl, and its 1008 byte record limit should be only a minor problem that can be handled by storing URL prefixes and splitting records.

WARC::Index::SQLite
Another planned ``fast index'' format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.

Overview of the WARC writer support modules

WARC::Volume::Builder
The WARC::Volume::Builder class provides a means to write new WARC files.

WARC::Index::CDX::Builder
WARC::Index::SDBM::Builder
WARC::Index::SQLite::Builder
The WARC::Index::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after-the-fact by scanning an existing WARC file.

The build constructor that WARC::Index provides uses one of these classes for the actual work.


CAVEATS

Support for WARC record segmentation is planned but not yet implemented.

Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.

The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for ``WARC-alike'' handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.

Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.


AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>


SEE ALSO

Information about the WARC format at http://bibnum.bnf.fr/WARC/.

An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.

# TODO: add relevant RFCs.

The POD pages for the modules mentioned in the overview lists.


COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.



NAME

WARC::Builder - Web ARChive construction support for Perl


SYNOPSIS

  use WARC::Builder;
  $warcinfo_data = new WARC::Fields (software => 'MyWebCrawler/1.2.3 ...',
                                     format => 'WARC File Format 1.0',
                                     # other fields omitted ...
                                     );
  $warcinfo = new WARC::Record (type => 'warcinfo',
                                content => $warcinfo_data);
  # for a small-scale crawl
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename => $warcfilename);
  # for a large-scale crawl
  $index1 = build WARC::Index::CDX (into => $indexprefix.'.cdx');
  $index2 = build WARC::Index::SDBM (into => $indexprefix.'.sdbm');
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename_template =>
                                $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz',
                              index => [$index1, $index2]);
  # for each collected object
  $build->append(@records);     # or ...
  $build->append($record1, $record2, ... );


DESCRIPTION

The WARC::Builder class is the high-level interface for writing WARC archives. It is a very simple interface, because, at this level, WARC is a very simple format: a simple sequence of WARC records, which WARC::Builder accepts as WARC::Record objects to append to the in-progress WARC file.

WARC file size limits are handled automatically if configured.

Methods

$build = new WARC::Builder (key => value, ...)
Construct a WARC::Builder object. The following keys are supported:
index => [$index]
index => [$index1, $index2, ...]
If set, must be a reference to an array of index builder objects. Each newly-added WARC::Record will be presented to all index builder objects in this list.

filename => $warcfilename
If set, create a single WARC file with the given file name. The file name must match m/\.warc(?:\.gz)?$/. The presence of a final ``.gz'' indicates that the WARC file should be written with per-record gzip compression.

This option is mutually exclusive with the filename_template option.

Using this option inhibits starting a new WARC file and causes the max_file_size option to be ignored. A warning is emitted in this case.

filename_template => $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz'
Establish an sprintf format string to construct file names. The file name produced by the template string must match m/\.warc(?:\.gz)?$/. The presence of a final ``.gz'' indicates that the WARC file should be written with per-record gzip compression.

The filename_template option gives the format string, while filename_template_vars gives an array reference of named parameters to be used with the format.

If constructing file names in accordance with the IIPC WARC implementation guidelines, this string should be of the form 'PREFIX-%s-%05d-HOSTNAME.warc.gz' where PREFIX is any chosen prefix to name the crawl and HOSTNAME is the name or other identifier for the machine writing the file.

This option is mutually exclusive with the filename option.

filename_template_vars => [qw/timestamp serial/]
Provide the list of parameters to the sprintf call used to produce a WARC filename from the filename_template option.

The available variables are:

serial
A number, incremented each time adding a record causes a new WARC file to be started.

timestamp
A 14-digit timestamp in the YYYYmmddHHMMSS format recommended in the IIPC WARC implementation guidelines. The timestamp is always in UTC. The time used is the time at which the WARC::Builder object was constructed and is constant between WARC files. This should be substituted as a string.

Default [qw/timestamp serial/] in accordance with IIPC guidelines.
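As a rough illustration of how filename_template and filename_template_vars could fit together, the sketch below expands a template with sprintf. The warc_filename helper, the sample prefix and hostname, and the fixed construction time are all assumptions made for the example, not part of the drafted API.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: expand a template using the named variables, in order.
sub warc_filename {
    my ($template, $vars, %values) = @_;
    my $name = sprintf $template, map { $values{$_} } @$vars;
    die "not a WARC file name: $name\n"
        unless $name =~ m/\.warc(?:\.gz)?$/;
    return $name;
}

my $constructed = 1565308740;    # stands in for the builder's construction time
my %values = (
    # 14-digit UTC timestamp, constant for the life of the builder
    timestamp => strftime('%Y%m%d%H%M%S', gmtime($constructed)),
    # incremented each time a new WARC file is started
    serial    => 0,
);

my $template = 'mycrawl-%s-%05d-crawler01.warc.gz';
my $name = warc_filename($template, [qw/timestamp serial/], %values);
print "$name\n";
```

Keeping the variable names in a separate list means a template can reorder or omit variables without the builder having to parse the format string.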

first_serial => $count
The initial value of the serial filename variable for this object. Default 0.

max_file_size => $size
Maximum size of a WARC file. A new WARC file is started if appending a record would cause the current file to exceed this length.

The limit can be specified as an exact number of bytes, or a number followed by a size suffix m/[KMG]i?/. The ``K'', ``M'', and ``G'' suffixes indicate base-10 multiples (10**(3*n)), while the ``Ki'', ``Mi'', and ``Gi'' suffixes indicate base-2 multiples (2**(10*n)) widely used in computing.

Default ``1G'' == 1_000_000_000.
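The suffix rules above could be implemented along these lines. This is only a sketch; parse_size and %SIZE_MULTIPLIER are hypothetical names chosen for the example.

```perl
use strict;
use warnings;

# Base-10 multipliers for K/M/G, base-2 multipliers for Ki/Mi/Gi.
my %SIZE_MULTIPLIER = (
    ''  => 1,
    K   => 10**3,  M  => 10**6,  G  => 10**9,
    Ki  => 2**10,  Mi => 2**20,  Gi => 2**30,
);

# Hypothetical helper: convert a max_file_size value to a byte count.
sub parse_size {
    my ($spec) = @_;
    my ($count, $suffix) = $spec =~ m/^(\d+)([KMG]i?)?$/
        or die "invalid size: $spec\n";
    return $count * $SIZE_MULTIPLIER{ $suffix // '' };
}

printf "%-4s => %d\n", $_, parse_size($_) for qw(4096 1G 1Gi 500M);
```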

warcinfo => $warcinfo_record
A WARC::Record object of type ``warcinfo'' that will be written at the start of each WARC file. This record will be cloned and written with a distinct ``WARC-Record-ID'' as the first record in each WARC file, including the first. As a consequence, it does not require a ``WARC-Record-ID'' header and any ``WARC-Record-ID'' given is silently ignored.

Each clone of this record will also have the ``WARC-Filename'' header added.

Each clone of this record will also have the ``WARC-Date'' header set to the time at which the WARC::Builder object was constructed.

warcversion => 'WARC/1.0'
Set the version of the WARC format to be written. This string is the first line of each WARC record. It must begin with the prefix 'WARC/' and should be the version from the WARC specification that the crawler follows.

Default ``WARC/1.0''.

$build->append( $record1, ... )
Add any number of WARC::Record objects to the growing WARC file. If WARC file size limits are configured, and a record would cause the current WARC file to exceed the configured size limits, a new WARC file is opened automatically.

All records passed to a single append call are added to the same WARC file. If a new WARC file is to be started, it will be started before any records are written.

All records passed to a single append call are considered ``concurrent'': every record after the first in that call will have a ``WARC-Concurrent-To'' header added referencing the first record. This is a convenience feature for simpler crawlers and is inhibited if any record already has a ``WARC-Concurrent-To'' header when append is called.

If a WARC::Record passed to this method lacks a ``WARC-Record-ID'' header, a warning will be emitted using carp(), a UUID will be generated, and a record ID of the form ``urn:uuid:UUID'' will be assigned. If the record object is read-only, this method will croak() instead.

If a WARC::Record passed to this method lacks any of the ``WARC-Date'', ``WARC-Type'', or ``Content-Length'' headers, this method will croak().


AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>


SEE ALSO

WARC, the WARC::Record manpage

International Internet Preservation Consortium (IIPC) WARC implementation guidelines. https://netpreserve.org/resources/WARC_Guidelines_v1.pdf


...



NAME

WARC::Collection - Interface to a group of WARC files


SYNOPSIS

  use WARC::Collection;
  $collection = assemble WARC::Collection ($index_1, $index_2, ...);
  $collection = assemble WARC::Collection from => ($index_1, ...);
  $record = $collection->search(url => $url, time => $when);


DESCRIPTION

The WARC::Collection class is the primary means by which user code is expected to use the WARC library. This class uses indexes to efficiently search for records in one or more WARC files.

Methods

$collection = assemble WARC::Collection ($index_1, $index_2, ...);
$collection = assemble WARC::Collection from => ($index_1, ...);
Assemble a collection of WARC files from one index or multiple indexes, specified either as objects derived from WARC::Index or filenames.

While multiple indexes can be used in a collection, note that searching a collection requires individually searching every index in the collection.

$record = $collection->search( ... )
@records = $collection->search( ... )
Search the index for records matching the parameters and return the best match in scalar context or a list of all matches in list context. The returned values are WARC::Record objects.

The parameters are specified as key => value pairs and each narrows the search, sorts the results, or both, indicated in the following list with ``[N ]'', ``[ S]'', or ``[NS]'', respectively.

The keys supported are:

[N ] url
An exact match for a URL.

[NS] url_prefix
A prefix match for a URL. Prefers records with shorter URLs.

[ S] time
Prefer records collected nearer to the requested time.
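The interplay of narrowing and sorting keys might look roughly like this. The in-memory @entries list and the free-standing search function are hypothetical, and ranking by URL length before nearness in time is an assumption of this sketch; the draft does not say how the sort keys combine.

```perl
use strict;
use warnings;

# Hypothetical index entries; a real index would be read from a CDX file.
my @entries = (
    { url => 'http://example.test/',      time => 1000 },
    { url => 'http://example.test/',      time => 5000 },
    { url => 'http://example.test/about', time => 4000 },
);

sub search {
    my (%param) = @_;
    my @hits = @entries;

    # [N ] url: exact match only.
    my $url = $param{url};
    @hits = grep { $_->{url} eq $url } @hits if defined($url);

    # [NS] url_prefix: prefix match; shorter URLs are preferred below.
    my $prefix = $param{url_prefix};
    @hits = grep { index($_->{url}, $prefix) == 0 } @hits if defined($prefix);

    # [ S] time: records nearer to the requested time are preferred.
    my $when = $param{time};

    @hits = sort {
        (defined($prefix) ? length($a->{url}) <=> length($b->{url}) : 0)
            || (defined($when)
                ? abs($a->{time} - $when) <=> abs($b->{time} - $when)
                : 0)
    } @hits;

    # Best match in scalar context, all matches in list context.
    return wantarray ? @hits : $hits[0];
}

my $best = search(url_prefix => 'http://example.test/', time => 4800);
my @all  = search(url => 'http://example.test/', time => 900);
printf "best: %s at %d; %d exact matches\n",
    $best->{url}, $best->{time}, scalar @all;
```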


...



NAME

WARC::Date - datestamp objects for WARC library


SYNOPSIS

  use WARC::Date;
  $datestamp = WARC::Date->now();               # construct from current time
  $datestamp = WARC::Date->from_epoch(time);    # likewise
  # construct from string
  $datestamp = parse WARC::Date ($text);        # full-featured
  $datestamp = WARC::Date->from_text($string);  # standard format only
  $time = $datestamp->as_epoch;         # as seconds since epoch
  $text = $datestamp->as_string;        # as "YYYY-MM-DDThh:mm:ssZ"


DESCRIPTION

WARC::Date objects encapsulate the details of the required format for timestamps in WARC headers.

Methods

$datestamp = WARC::Date->now
Construct a WARC::Date object representing the current time.

$datestamp = WARC::Date->from_epoch( $timestamp )
Construct a WARC::Date object representing the time indicated by an epoch timestamp.

$datestamp = WARC::Date->from_text( $string )
Construct a WARC::Date object representing the time indicated by a string in the same format returned by the as_string method.

$datestamp = parse WARC::Date ($text)
Construct a WARC::Date object from a textual representation. If the HTTP::Date manpage is installed, accepts any input acceptable to HTTP::Date::str2time. Otherwise, this method is equivalent to the from_text method.

$datestamp->as_string
Return a string in the format specified by [W3C-NOTE-datetime] restricted to 14 digits and UTC time zone, which is ``YYYY-MM-DDThh:mm:ssZ''.
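A sketch of the conversions behind as_string and from_text, using only core modules. The function names epoch_to_text and text_to_epoch are hypothetical; the drafted class would wrap something like this in methods.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Epoch seconds to "YYYY-MM-DDThh:mm:ssZ", always in UTC.
sub epoch_to_text {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
}

# The reverse conversion; accepts only the standard format.
sub text_to_epoch {
    my ($text) = @_;
    my ($year, $mon, $day, $hour, $min, $sec) =
        $text =~ m/^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
        or die "not a WARC datestamp: $text\n";
    return timegm($sec, $min, $hour, $day, $mon - 1, $year);
}

my $text = epoch_to_text(0);
print "$text\n";
print text_to_epoch($text), "\n";
```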


CAVEATS

WARC::Date objects use epoch time internally and are therefore limited by the range of Perl's integers.


AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>


SEE ALSO

WARC, the HTTP::Date manpage

[W3C-NOTE-datetime] ``Date and Time Formats'' http://www.w3.org/TR/NOTE-datetime.


...



NAME

WARC::Fields - WARC record headers and application/warc-fields


SYNOPSIS

  require WARC::Fields;
  $f = new WARC::Fields;
  $f = $record->fields;                 # get WARC record headers
  $f->field('WARC-Type' => 'metadata'); # set
  $f->field('WARC-Type');               # get
  $f->remove_field('WARC-Type');        # delete
  tie @field_names, ref $f, $f;         # bind ordered list of field names
  tie %fields, ref $f, $f;              # bind hash of field names => values


DESCRIPTION

The WARC::Fields class encapsulates information in the ``application/warc-fields'' format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers; however, the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.

Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type ``application/warc-fields''.

Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.

This class strives to faithfully represent the contents of a WARC file, although the field names are defined to be case-insensitive.

Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values, that is, the value associated with the header name will be an array reference. Similarly, the name of a recurring header is repeated in the tied array interface. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.

As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The method and tied hash interfaces allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.

Strictly, ``X-Crazy-Header'' and ``X_Crazy_Header'' are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used, otherwise y/_/-/ occurs and the name is rechecked for another exact match. If no match is found, case is folded and a third check performed. If a match is found, the existing header is updated, otherwise a new header is created with character case as given.

The WARC standard specifically states that field names are case-insensitive; accordingly, ``X-Crazy-Header'' and ``X-CRAZY-HeAdEr'' are considered the same header for the method and tied hash interfaces. They will appear exactly as given in the tied array interface, however.
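The three-step lookup described above can be sketched as follows. The resolve_field_name helper is hypothetical, and it assumes that a name given with a leading ':' skips the '_' to '-' translation but still takes part in the case-insensitive check.

```perl
use strict;
use warnings;

# Hypothetical helper: given the ordered list of stored field names and a
# requested name, return the stored name to use, or the name under which a
# new field would be created.
sub resolve_field_name {
    my ($names, $wanted) = @_;
    my $literal = ($wanted =~ s/^://);    # leading ':' disables translation

    my @candidates = ($wanted);
    unless ($literal) {
        (my $dashed = $wanted) =~ tr/_/-/;
        push @candidates, $dashed;
    }

    # Checks 1 and 2: exact match as given, then exact match after y/_/-/.
    for my $candidate (@candidates) {
        for my $name (@$names) {
            return $name if $name eq $candidate;
        }
    }

    # Check 3: case-insensitive match on the (possibly translated) name.
    for my $name (@$names) {
        return $name if lc($name) eq lc($candidates[-1]);
    }

    # No match: a new field is created with character case as given.
    return $candidates[-1];
}

my @stored = ('WARC-Type', 'X_Crazy_Header', 'X-Crazy-Header');
printf "%-16s => %s\n", $_, resolve_field_name(\@stored, $_)
    for qw(X_Crazy_Header WARC_Type warc_type Content_Length :New_Field);
```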

Methods

$f = WARC::Fields->new
Construct a new WARC::Fields object. Initial contents can be passed as key-value pairs to this constructor and will be added in the given order.

$f->clone
Copy a WARC::Fields object. A copy of a read-only object is writable.

$f->field( $name )
$f->field( $name => $value )
$f->field( $n1 => $v1, $n2 => $v2, ... )
Get or set the value of one or more fields. The field name is not case sensitive, but WARC::Fields will preserve its case if a new entry is created.

$f = WARC::Fields->parse( $text )
$f = WARC::Fields->parse_from( $fh )
Construct a new WARC::Fields object, reading initial contents from the provided text string or filehandle.

If either parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing retried. If the line is not valid after this change, the parse method croaks.

$f->as_string
Return the contents as a formatted WARC header or application/warc-fields block.

$f->set_readonly
Mark a WARC::Fields object read-only. All methods that modify the object will croak() if called on a read-only object.

Tied Array Access

The order of field names can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. Removing a name from the array effectively removes the field from the object, but the value for that name is still remembered, allowing names to be moved about without loss of data.

WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.

Tied Hash Access

The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal list.


...



NAME

WARC::Index - base class for WARC index classes


SYNOPSIS

  use WARC::Index::CDX; # or ...
  use WARC::Index::SDBM;
  # or some other WARC::Index::* implementation
  $index = attach WARC::Index::CDX (...);       # or ...
  $index = attach WARC::Index::SDBM (...);
  $record = $index->search(url => $url, time => $when);
  @results = $index->search(url => $url, time => $when);
  build WARC::Index::CDX (...); # or ...
  build WARC::Index::SDBM (...);


DESCRIPTION

WARC::Index is an abstract base class for indexes on WARC files and WARC-alike files. This class establishes the expected interface and provides a simple interface for building indexes.

Methods

$index = attach WARC::Index::* (...)
Construct an index object using the indicated technology and whatever parameters the index implementation needs.

Typically, indexes are file-based and a single parameter is the name of an index file which in turn contains the names of the indexed WARC files.

$record = $index->search( ... )
@records = $index->search( ... )
Search an index for records matching parameters. The WARC::Collection class uses this method to search each index in a collection.

build WARC::Index::* (into => $dest, from => ...)
build WARC::Index::* (from => [...], into => $dest)
Unlike the attach and search methods, which each index implementation must supply, the WARC::Index base class does provide this method. The build method works by loading the corresponding index builder class and driving the process or simply returning the newly-constructed object.

The build method itself handles the from key for specifying the files to index. The from key can be given an array reference, after which more key => value pairs may follow, or can simply use the rest of the argument list as its value.

If the from key is given, the build method will read the indicated files, construct an index, and return nothing. If the from key is not given, the build method will construct and return an index builder.

All index builders accept at least the into key for specifying where to store the index. See the documentation for WARC::Index::*::Builder for more information.
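One way the two forms of the from key could be separated from the other options is sketched below. The split_build_args helper is hypothetical and covers only the argument handling, not the index building itself.

```perl
use strict;
use warnings;

# Hypothetical helper: split a build argument list into builder options and
# the list of files to index. The second return value is undef when no from
# key was given, in which case build would return an index builder.
sub split_build_args {
    my @args = @_;
    my (%options, $from);
    while (@args) {
        my $key = shift @args;
        if ($key eq 'from') {
            if (ref($args[0]) eq 'ARRAY') {
                $from = [ @{ shift @args } ];    # more pairs may follow
            } else {
                $from = [ splice @args ];        # rest of the list is the value
            }
        } else {
            $options{$key} = shift @args;
        }
    }
    return (\%options, $from);
}

my ($opts1, $from1) =
    split_build_args(into => 'crawl.cdx', from => 'a.warc', 'b.warc');
my ($opts2, $from2) =
    split_build_args(from => ['a.warc'], into => 'crawl.cdx');
my ($opts3, $from3) = split_build_args(into => 'crawl.cdx');

print "form 1 indexes @$from1 into $opts1->{into}\n";
print "form 2 indexes @$from2 into $opts2->{into}\n";
print "form 3 returns a builder for $opts3->{into}\n" unless defined $from3;
```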

Index system registration

The WARC::Index package also maintains a registry of loaded index support. The register function adds the calling package to the list.

WARC::Index::register( filename => $filename_re )
Add the calling package to an internal list of available index handlers. The calling package must be a subclass of WARC::Index or this function will croak().

The filename key indicates that the calling package expects to handle index files with names matching the provided regex.

WARC::Index::find_handler( $filename )
Return the registered handler for $filename or undef if none match.
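A registry of this kind can be quite small. In the sketch below, Sketch::Index and Sketch::Index::CDX are hypothetical stand-ins for WARC::Index and one of its subclasses, and the first registered pattern that matches is assumed to win.

```perl
use strict;
use warnings;

package Sketch::Index;    # hypothetical stand-in for WARC::Index
use Carp;

my @handlers;             # list of [ filename regex, package name ]

# Add the calling package to the registry; it must be a subclass.
sub register {
    my (%args) = @_;
    my $package = caller;
    croak "$package is not a subclass of " . __PACKAGE__
        unless $package->isa(__PACKAGE__);
    push @handlers, [ $args{filename}, $package ];
    return;
}

# Return the first registered package whose pattern matches the file name.
sub find_handler {
    my ($filename) = @_;
    for my $entry (@handlers) {
        return $entry->[1] if $filename =~ $entry->[0];
    }
    return;
}

package Sketch::Index::CDX;    # hypothetical index implementation
our @ISA = ('Sketch::Index');
Sketch::Index::register(filename => qr/\.cdx$/);

package main;

for my $file (qw(crawl.cdx crawl.sdbm)) {
    my $handler = Sketch::Index::find_handler($file);
    printf "%s => %s\n", $file, defined $handler ? $handler : '(none)';
}
```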


...



NAME

WARC::Record - one record from a WARC file


SYNOPSIS

  use WARC;             # or ...
  use WARC::Volume;     # or ...
  use WARC::Collection;
  # WARC::Record objects are returned from ->record_at and ->search methods
  # Construct a record, as when preparing a WARC file
  $warcinfo = new WARC::Record (type => 'warcinfo');

...


DESCRIPTION

WARC::Record objects come in two flavors with a common interface. Records read from WARC files are read-only and have meaningful return values from the methods listed in ``Methods on records from WARC files''. Records constructed in memory can be updated and those same methods all return undef.

Common Methods

$record->fields
Get the internal WARC::Fields object that contains WARC record headers.

$record->field( $name )
Get the value of the WARC header named $name from the internal WARC::Fields object.

Methods on records from WARC files

These methods all return undef if called on a WARC::Record object that does not represent a record in a WARC file.

$record->protocol
Return the format and version tag for this record. For WARC 1.0, this method returns 'WARC/1.0'.

$record->volume
Return the WARC::Volume object representing the file in which this record is located.

$record->offset
Return the file offset at which this record can be found.

$record->next
Return the next WARC::Record in the WARC file that contains this record.

$record->replay
Return a protocol-specific object representing the record contents.

This method returns undef if the library does not recognize the protocol message stored in the record.

A record with Content-Type ``application/http'' with an appropriate ``msgtype'' parameter produces an HTTP::Request or HTTP::Response object. An unknown ``msgtype'' on ``application/http'' produces a generic HTTP::Message. The returned object may be a subclass to support deferred loading of entity bodies.

$record->open_payload
Return a tied filehandle that reads the WARC record payload.

The WARC record payload is defined as the decoded content of the protocol response or other resource stored in the record. This method returns undef if called on a WARC record that has no payload or content that we do not recognize.

Methods on fresh WARC records

$record = new WARC::Record (key => value, ...)
Construct a fresh WARC record, suitable for use with WARC::Builder.


...



NAME

WARC::Volume - Web ARChive file access for Perl


SYNOPSIS

  use WARC::Volume;
  $volume = mount WARC::Volume ($filename);
  $record = $volume->first_record;
  $record = $volume->record_at($offset);
  $record = $volume->search(url => $url, time => $when);


DESCRIPTION

WARC::Volume ...

Methods

$volume = mount WARC::Volume ($filename)
Construct a WARC::Volume object. The parameter is the name of an existing WARC file. An exception is raised if the first record does not have a valid WARC header.

$volume->first_record
Construct and return a WARC::Record object representing the first WARC record in $volume. This should be a ``warcinfo'' record, but it is not required to be so.

$volume->record_at( $offset )
Construct and return a WARC::Record object representing the WARC record beginning at $offset within $volume. An exception is raised if an appropriate magic number is not found at $offset.


AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>


...


COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Edited 2019-08-09 by jcb: Demote headings and elide boilerplate to make draft documentation easier to read. Also clarify first question.

Edited 2019-08-09 by jcb: Oops: the only class that has tied array/hash interfaces is WARC::Fields, not WARC::Record.

Replies are listed 'Best First'.
Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp)
by shmem (Chancellor) on Aug 09, 2019 at 23:51 UTC

    1. It is a sad (or joyous?) fact that namespaces aren't related but by convention. Your module under the Archive:: namespace doesn't have to follow the conventions of the other modules under this namespace. If that were so, a transparent Archive.pm would be in sight.
      Then, your module says "WARC - Web ARChive support for Perl", so it is definitely an Archive type of module.
    2. No.
    3. can't answer yet.
    4. You might do that, and my guess is that it is not asking for trouble, since overload occurs just in that package. But I remember having trouble with overloading and subclassing.
    5. No, at first glance. What benefit does overloading provide you over calling a function with arguments? Overloading is useful to extend something (Math::BigInt) but has its overloading price.
    6. You should probably leave that to code using the module. Methods qw(next previous) and done. Also...
    7. yes. See previous point ;-)
    8. Hash::Util::FieldHash perhaps? an object which knows about its size and limit?

    Just an opinion of some monk.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      1. That is what I meant by Archive:: seeming to fit at first glance: I had the same idea, after all "Web ARChive" is literally the name of the format. I could argue that Archive::Web would be an appropriate root, but then you have the problem that WARC is not the only format for storing Web documents, merely the one favored by the Internet Archive and a few national libraries. Argument by weighty authority is still argument by authority. :-(

        While you are correct that conventions can be ignored, I would prefer to reserve Archive::WARC for a (future) simpler file-ish interface. There are ways to treat a WARC file much like a ZIP or ZOO archive.

      2. That is good, thank you.

      3. Is that a lack of information or just not having had time to look yet? (In other words, is more information needed or only patience?)

      4. WARC::Fields is a fairly simple ordered in-memory key-value store and unlikely to need subclasses. Overloading the dereference operators would make the tied array/hash interfaces nearly transparent, which seems nice to me.

        This would make $record->fields->{'WARC-Type'} or $record->fields->{'WARC-Target-URI'} shorthand for $record->field('WARC-Type') or $record->field('WARC-Target-URI'), since the field method on a WARC::Record is passed to the embedded WARC::Fields object. (The keys need quoting because of the hyphens.)

        That is not very useful, but the real reason for overloading hash dereference to use the tied hash interface is to make keys %{$record->fields} valid and exactly what it looks like. Why roll my own iterator API when Perl already has one?

        On a side note, I realized that this question mentions the wrong package. Oops, fixed. (It had been part of WARC::Record originally before I decided to follow the same split as HTTP::Message and HTTP::Headers. I had been keeping a list of questions, and updating that fell through the cracks somehow. Oops!)

      5. Overloading provides convenience mostly, like being able to use sort on an array of WARC::Record without having to specify a comparison. The overload would probably be to a compareTo or compare_to method anyway. An overload to a method should work with subclasses, although I would expect an overload to a coderef to cause problems unless subclasses also use overload to override it. If I understand the overload documentation correctly the overhead of overloaded operators is tiny for packages that do not use them and is really the cost of supporting overloading at all.

        That is a fairly good argument against using overloads on WARC::Record, except that, without overloads, none of the overloadable operators make sense on a WARC::Record. There is ==, but that is object identity and exactly the most obvious candidate for overloading to make WARC::Record objects compare equal iff they refer to the same physical record even if they were obtained from two different indexes and therefore have been constructed separately and have different memory addresses.

      6. The purpose of WARC segmentation is to store payloads that are too large for a single WARC file. (The format has no inherent limit, but the specification recommends a policy of limiting WARC files to 1G each.) We run into this problem inside the READ or READLINE method implementing the tied file handle returned from open_payload on a WARC::Record object. Reading a payload from a WARC collection should be transparent, so the WARC library must recombine segments here.

        Also, due to limitations of the WARC format, there is no previous method: its implementation would require starting at the first record in the WARC file and repeatedly following next, a nasty performance surprise for the unwary. Better to let the module user do that if they really need it. At least that way, they should know it will be very slow.

      7. So I must ask the related question: How should WARC::Collection expose information about the volumes in the collection? Collections can be large enough that the indexes must be primarily stored on disk. Common Crawl, as an example, is hundreds of TB: hundreds of thousands of 1GB WARC files storing many billions of records per crawl. Then again, simply returning an array should work here: two hundred thousand WARC::Volume objects should fit in a few hundred MB or so of RAM. Is array memory overhead still significantly smaller than hash memory overhead? I will have to carefully think about expected live object counts when choosing internal representations.

        Or should this be another tied array interface, where the list of WARC files is drawn from an index as needed? That can only work if the collection object is only using one index, but I think requiring a merged index for collections too large for even a complete list of WARC files to fit in RAM is reasonable.

      8. This is less of a problem for reading WARC files — the open_payload method provides a tied file handle that reads the payload from a WARC record; the real problem is supplying the data when writing a WARC file, especially in a way that is compatible with future support for transparently saving LWP exchanges to WARC files. Are temporary files really the only practical option here? (I suspect probably so.)

        Temporary file space can be bounded even if payload size is not: segments can be recorded as they arrive.
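To make the dereference overload from point 4 concrete, here is a minimal sketch. It assumes a hypothetical My::Fields class standing in for WARC::Fields, with a separate hypothetical My::Fields::TiedHash class for the tie. The object is an array reference, so the '%{}' overload never has to dereference itself as a hash. Repeated field names are ignored here; a real implementation would need a policy for them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WARC::Fields: an ordered list of [name, value] rows.
package My::Fields;

use overload '%{}' => \&_as_hash, fallback => 1;

sub new {
    my ($class, @pairs) = @_;
    my @rows;
    while (my ($name, $value) = splice @pairs, 0, 2) {
        push @rows, [$name, $value];
    }
    # Blessed ARRAY reference: hash dereference is free for the overload.
    return bless \@rows, $class;
}

# Case-insensitive lookup of the first row with the given name.
sub field {
    my ($self, $name) = @_;
    for my $row (@$self) {
        return $row->[1] if lc $row->[0] eq lc $name;
    }
    return undef;
}

# Overload handler: build a read-only tied hash view of this object.
sub _as_hash {
    my ($self) = @_;
    tie my %view, 'My::Fields::TiedHash', $self;
    return \%view;
}

# The tie class is deliberately different from the object's class.
package My::Fields::TiedHash;

sub TIEHASH {
    my ($class, $fields) = @_;
    return bless { fields => $fields, pos => 0 }, $class;
}

sub FETCH  { my ($self, $name) = @_; return $self->{fields}->field($name) }
sub EXISTS { my ($self, $name) = @_; return defined $self->{fields}->field($name) }

sub FIRSTKEY { my ($self) = @_; $self->{pos} = 0; return $self->NEXTKEY }

sub NEXTKEY {
    my ($self) = @_;
    my $row = $self->{fields}[ $self->{pos}++ ];   # native array access
    return defined $row ? $row->[0] : undef;
}

package main;

my $fields = My::Fields->new(
    'WARC-Type'       => 'response',
    'WARC-Target-URI' => 'http://example.com/',
);

my @names = keys %{$fields};          # Perl's own iterator API, in file order
my $type  = $fields->{'WARC-Type'};   # same as $fields->field('WARC-Type')

print join(', ', @names), "\n";
print "$type\n";
```

The sketch returns a fresh tied hash on every dereference, which is simple but not free; caching the view would be one possible refinement.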

      Edited 2019-08-10 by jcb: Correct size of Common Crawl datasets and redo math. The conclusion seems to remain valid due to a previous math error.
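The cost of emulating previous (point 6 above) can be shown with a short sketch. It assumes only that records expose next and some position accessor; the My::ScanRecord class, its offset method, and the previous_record helper are hypothetical names used for illustration. The helper has to walk forward from the first record, so the work grows with the number of records before the target.

```perl
use strict;
use warnings;

# Hypothetical helper: find the record before $target by scanning forward
# from $first. Returns undef if $target is the first record or is not found.
sub previous_record {
    my ($first, $target) = @_;
    my $prev;
    for (my $rec = $first; defined $rec; $rec = $rec->next) {
        return $prev if $rec->offset == $target->offset;
        $prev = $rec;
    }
    return undef;
}

# Hypothetical record class; only offset() and next() matter here.
package My::ScanRecord;

sub new    { my ($class, %args) = @_; return bless {%args}, $class }
sub offset { return $_[0]{offset} }
sub next   { return $_[0]{next} }

package main;

# Three records chained in file order.
my $r3 = My::ScanRecord->new(offset => 900);
my $r2 = My::ScanRecord->new(offset => 400, next => $r3);
my $r1 = My::ScanRecord->new(offset => 0,   next => $r2);

my $before = previous_record($r1, $r3);
print "record before offset 900 starts at ", $before->offset, "\n";
```

Leaving this loop to the caller, as suggested above, keeps the linear cost visible in the calling code.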

Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp)
by haukex (Chancellor) on Aug 10, 2019 at 12:09 UTC

    Just a couple of my opinions:

    1. I'm not aware of any requirement placed on the Archive:: namespace for all modules there to have a similar API or to only work on certain archives. At the moment it feels to me like the most natural place for such a module.
    2. No, I don't see any issues with using constructor names different from new, in fact this might make the code more readable later on. Just make sure to pick names that really do describe what the constructor is doing, and don't overload it too much - feel free to add more than one constructor with different names if that fits better.
    3. I would say key/value pairs (hash), as in $record->replay( foo => "bar" ) - that is IMO one of the most flexible ways of doing it.
    4. If you mean that $object->method as well as $object->[...] and $object->{...} should work, then yes, overloaded array/hash dereferencing that returns a tied array/hash does work (Update: I've done this myself before, but my classes for the two tie classes are different from the object's class!). Just keep in mind that you wouldn't be able to use that API for anything else then.
    5. Before you overload an operator, I'd suggest providing a method to do the operation. An overloaded operator can always be added later. (Similarly for the above point.)
    6. I'm not sure, but I would suggest providing both a low-level API that doesn't try to do anything fancy, so users can choose to use that for precise control of what happens, and optionally a higher-level API that tries to do the "right" thing (what that means will also be a question of experience with the module).
    7. I would say "why not?", but not meant rhetorically - I probably don't know all the issues involved with doing this?
    8. I don't know enough about WARC to give a good answer here...
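As an illustration of point 5 (method first, overload later), here is a minimal sketch. It assumes a hypothetical My::Record class in which a record is identified by its volume file name and byte offset; both fields are assumptions, not part of the draft API. The method works on its own, and the overload line can be added in a later release without changing it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WARC::Record.
package My::Record;

sub new {
    my ($class, %args) = @_;
    return bless { volume => $args{volume}, offset => $args{offset} }, $class;
}

# Step 1: an ordinary comparison method, usable without any overloading.
# The third argument is only set when perl calls this through an overload
# with the operands swapped.
sub compare_to {
    my ($self, $other, $swapped) = @_;
    my $cmp = $self->{volume} cmp $other->{volume}
           || $self->{offset} <=> $other->{offset};
    return $swapped ? -$cmp : $cmp;
}

# Step 2 (optional): map <=> onto the method by name, so subclasses that
# override compare_to are respected. With fallback => 1, perl derives
# ==, !=, < and friends from <=>.
use overload '<=>' => 'compare_to', fallback => 1;

package main;

# Two separately constructed objects that name the same physical record.
my $x = My::Record->new(volume => 'a.warc.gz', offset => 1024);
my $y = My::Record->new(volume => 'a.warc.gz', offset => 1024);
my $z = My::Record->new(volume => 'a.warc.gz', offset => 0);

my @sorted = sort { $a->compare_to($b) } ($x, $z);   # works without step 2

print "x and y are the same record\n" if $x == $y;   # needs step 2
print "first record is at offset $sorted[0]{offset}\n";
```

Without step 2, == on these objects would compare memory addresses, so $x == $y would be false.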
      1. The main reason that I find this reasoning unconvincing is that Archive::WARC:: felt like the most natural place for this to me for a long time, too.

        While there may not be a rule that requires this in some bureaucratic sense, the Principle of Least Surprise suggests (at least to me) that modules in the same namespace should share, in principle, similar interfaces. While the method names are often different, all of the modules I have looked at in Archive:: map some kind of string-like filename to an archive member. While conceptually, this is possible for a subset of WARC records, I want this library to provide complete support for WARC files, and think that that simpler read interface should eventually go into an Archive::WARC package that is a front-end to this library.

        While I mentioned Archive::Web:: as a possibility in an earlier reply, I have since realized that I cannot actually use that: people will be searching for "WARC" so the name needs to include it.

        Another reason to put this at top-level is that the WARC format is actually a generic container, not unlike YAML or JSON or MIME. The plan for a WARC::Alike:: hierarchy to put WARC-like interfaces on other related formats also suggests to me that this library is looking more like a type of framework than a simple archive access tool.

      2. Describing what the constructors do is the main reason for not using new. The WARC::Volume, WARC::Index::{CDX,SDBM,...}, and WARC::Collection classes all work only for reading existing data. (The WARC::Index->build class method inherited by index implementations constructs an index builder, planned as build WARC::Index::CDX (...) returning a WARC::Index::CDX::Builder object if not given the from option. Or should it always return the index builder, even if it "took care" of indexing some volumes for you?) So, in the current draft, volumes are mounted, indexes are attached, and collections are assembled.

      3. So, $record->replay to read whatever most closely matches the actual record (and probably croak() if we do not have a class for it), $record->replay( as => 'http' ) to read an HTTP::Response (possibly translated a la LWP from some other protocol, probably also croak()ing if we cannot do it), $record->replay( as => 'http', with => 'request' ) to actually read the HTTP request rather than synthesizing a stub, and $record->replay( as => 'http', with => 'chain' ) to fetch an entire HTTP redirect chain along with the final request/response pair?

        And feel free to bikeshed the values for the with option, if anyone reading has any ideas.

        The concern I had was about having one method do too much, but logically replay is a single operation, even if it dispatches to _replay_as_* methods to handle protocol translations.

      4. Considering that WARC::Fields is a simple in-memory ordered key-value store with a few convenience semantics, I do not expect that to be a problem, although your comment suggests that the array FETCH should perhaps return an object that stringifies to the key name, but also has an "offset" field indicating which of multiple occurrences of the same key this item represents. The idea is that the array interface should provide the "field name" column from an "application/warc-fields" document. (The WARC record headers also have their own MIME type.)

        Here is a sample, extracted from a WARC file I have around (actually that I made in order to have some "real-world" data for developing this):

        software: Wget/1.16 (linux-gnu)
        format: WARC File Format 1.0
        conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
        robots: classic

        That is from the "warcinfo" record that Wget wrote. For this example, the tied array would contain: qw/software format conformsTo robots/ or objects that stringify to those values. Although if FETCH returns an object, it could also include the value for that line as well. Hmmmm...

        I realized fairly quickly that the tie classes need to be different, and that the tied objects need to be different as well. (I recall something about self-tying causing segmentation faults in several versions of perl, but I do not have an exact citation for that at hand.) As I currently understand, while the various access methods need to be in subclasses, the TIEHASH and TIEARRAY methods are responsible for blessing the references that they return and can put them into any class desired, so tying a hash to WARC::Fields can invoke WARC::Fields->TIEHASH which returns a WARC::Fields::TiedHash object. The tied object class name will be a string constant, to allow the "empty subclass" test to pass, since a subclass can always override TIEHASH, call SUPER::TIEHASH, and then re-bless the returned object.

      5. The overloaded array/hash dereference on WARC::Fields is a convenience for tie, which would remain documented. (I think the underlying tied object would actually be a scalar reference to the WARC::Fields object or its data.) The overloaded <=> on WARC::Record would probably be use overload '<=>' => 'compareTo'; with the use of camelCase in the method name as a hint that there is something special about that method: it is not directly called by perl, but it is called implicitly.

        That said, the main reason to overload <=> on WARC::Record is to redefine == to return true iff both objects refer to the same physical record, even if they are distinct objects. This is "value semantics" if I understand the term correctly.

      6. The WARC::Record generally is that low-level API. The open_payload method returns a tied filehandle which is a higher-level API that reads the stored entity in a record or possibly multiple records if segmentation is used. (I would expect an Archive::WARC::open_member_file call to eventually map to open_payload somehow.)

        This suggests an open_content method that returns a tied filehandle that reads from the body of a (single) WARC record without performing decoding. Now that I think about it, that could be very useful for implementing the open_payload method. Thanks for pointing me in this direction.

      7. The most significant issue I see is "which volume should be 'next'?": a collection can use multiple indexes that may partially overlap and that are presumably from multiple (possibly simultaneous) crawls. Is there a way to impose a total order amongst the WARC volumes that is least surprising, or is this not possible in general?

        Remember that reading indexes into memory may not be possible and even just a list of WARC volumes may be too large to hold in RAM. While physical hardware with "that kind of disk space" probably has "that kind of RAM" too, thanks to networks and cloud computing, we may be on an instance that has access to that much data, even mapped into the local filesystem, but definitely does not have "that much" RAM. I am thinking about Common Crawl here. While I personally do not have much use for that at this time, I do want this library to scale well enough for those who do have those uses.

      8. This comes back to WARC being a generic format, and one of the goals when developing WARC was to allow dumping network traffic (at a certain layer) directly into the growing archive. This is why WARC stores HTTP messages as records with Content-Type "application/http" and entities with transfer encodings intact.

        I have an eventual goal to be able to use WARC on a small scale as a type of persistent cache, nearly transparently integrating into LWP. This library is the first step: routines for handling the on-disk format. Later steps include interfaces that allow LWP::UserAgent to transparently return items from a WARC collection when appropriate, or even to (transparently) use only a WARC collection, which could be useful for testing. Long term ideal goals include coordinating with the LWP maintainers to add hooks that enable an LWP/WARC interface to record the exact bytes sent and received over the socket. But first, I need to implement reliable access to and construction of WARC files. All the rest builds on this layer.
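The replay dispatch described in point 3 can be sketched as follows. The option names as and with come from the discussion; the My::ReplayRecord class, its protocol field, and the returned hash are hypothetical placeholders (a real version would presumably return HTTP::Response objects). The single public method looks up a _replay_as_* handler by name and croaks when none exists.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WARC::Record.
package My::ReplayRecord;

use Carp qw(croak);

sub new {
    my ($class, %args) = @_;
    return bless { protocol => $args{protocol} }, $class;
}

# One public entry point taking key/value options. Without "as", the record's
# own protocol is used; subclasses can add further _replay_as_* methods.
sub replay {
    my ($self, %opt) = @_;
    my $as = defined $opt{as} ? $opt{as} : $self->{protocol};
    my $handler = $self->can("_replay_as_$as")
        or croak "no replay handler for '$as'";
    return $self->$handler(%opt);
}

# Placeholder handler: only reports what it was asked to do.
sub _replay_as_http {
    my ($self, %opt) = @_;
    my $with = defined $opt{with} ? $opt{with} : 'stub';
    return { as => 'http', with => $with };
}

package main;

my $record = My::ReplayRecord->new(protocol => 'http');

my $plain = $record->replay;                                 # closest match
my $full  = $record->replay(as => 'http', with => 'request');
my $error = eval { $record->replay(as => 'gopher'); 1 } ? '' : $@;

print "default: as=$plain->{as} with=$plain->{with}\n";
print "full:    as=$full->{as} with=$full->{with}\n";
print "error:   $error";
```

Looking the handler up with can keeps replay itself small while still allowing a subclass to support another protocol by adding one method.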

Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp)
by stevieb (Canon) on Aug 09, 2019 at 21:23 UTC

    Hey jcb, this is a great presentation here, but to be honest, I feel that it's a bit overwhelming.

    It might be easier to digest for our busy Monks if you could put the code into a repository of some sort (Github/Bitbucket etc), then ask your questions in a shorter, more direct and concise post, referring to the code in the external location where necessary.

    Not trying to dissuade you here... I've definitely asked for code review numerous times here over the years. I'm just making a suggestion from experience that may get more eyes on what you're trying to achieve/ask.

    -stevieb

      If not for my very first question, I would have uploaded a "preview release" to CPAN already.

      If the answer to that first question is (as I suspect) "No, put it at top-level", then I can start making early releases to CPAN. I expect that "0.0.0 alpha N" is a reasonable version number for "no code yet". :-)

      And there seems to be a small misunderstanding: I am really asking for an API design review. There is effectively no code written yet because I am hoping for monks more experienced than myself to say either "Yes, that API is sound and will be a good addition to CPAN." or "You will have problems here, here, and here. Have you considered ...?" before I put too much effort into writing code that will make problems later.
