NAME
    HTML::TableParser - Extract data from an HTML table

VERSION
    version 0.43

SYNOPSIS
      use HTML::TableParser;

      @reqs = (
               {
                id => 1.1,                    # id for embedded table
                hdr => \&header,              # function callback
                row => \&row,                 # function callback
                start => \&start,             # function callback
                end => \&end,                 # function callback
                udata => { Snack => 'Food' }, # arbitrary user data
               },
               {
                id => 1,                      # table id
                cols => [ 'Object Type',
                          qr/object/ ],       # column name matches
                obj => $obj,                  # method callbacks
               },
              );

      # create parser object
      $p = HTML::TableParser->new( \@reqs,
                       { Decode => 1, Trim => 1, Chomp => 1 } );
      $p->parse_file( 'foo.html' );


      # function callbacks
      sub start {
        my ( $id, $line, $udata ) = @_;
        #...
      }

      sub end {
        my ( $id, $line, $udata ) = @_;
        #...
      }

      sub header {
        my ( $id, $line, $cols, $udata ) = @_;
        #...
      }

      sub row  {
        my ( $id, $line, $cols, $udata ) = @_;
        #...
      }

DESCRIPTION
    HTML::TableParser uses HTML::Parser to extract data from an HTML table.
    The data is returned via a series of user defined callback functions or
    methods. Specific tables may be selected either by matching a unique
    table id or by matching against the column names. Multiple (even nested)
    tables may be parsed in a document in one pass.

  Table Identification
    Each table is given a unique id, relative to its parent, based upon its
    order and nesting. The first top level table has id 1, the second 2,
    etc. The first table nested in table 1 has id 1.1, the second 1.2, etc.
    The first table nested in table 1.1 has id 1.1.1, etc. These, as well as
    the tables' column names, may be used to identify which tables to parse.
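
    The numbering scheme can be sketched without the parser. In the
    snippet below the nested array structure and the table_ids function
    are hypothetical, made up only to illustrate how ids follow from order
    and nesting; this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical nesting: each table is an arrayref of the tables nested in it.
# Two top-level tables; the first holds two tables, and the first of those
# holds one more.
my @document = ( [ [ [] ], [] ], [] );

# Walk the nesting depth-first and build ids relative to the parent table.
sub table_ids {
    my ( $tables, $prefix ) = @_;
    my @ids;
    my $n = 0;
    for my $nested (@$tables) {
        $n++;
        my $id = defined $prefix ? "$prefix.$n" : "$n";
        push @ids, $id, table_ids( $nested, $id );
    }
    return @ids;
}

my @ids = table_ids( \@document );
print join( ' ', @ids ), "\n";    # prints "1 1.1 1.1.1 1.2 2"
```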

  Data Extraction
    As the parser traverses a selected table, it will pass data to user
    provided callback functions or methods after it has digested particular
    structures in the table. All functions are passed the table id (as
    described above), the line number in the HTML source where the table was
    found, and a reference to any table specific user provided data.

    Table Start
            The start callback is invoked when a matched table has been
            found.

    Table End
            The end callback is invoked after a matched table has been
            parsed.

    Header  The hdr callback is invoked after the table header has been read
            in. Some tables do not use the <th> tag to indicate a header, so
            this function may not be called. It is passed the column names.

    Row     The row callback is invoked after a row in the table has been
            read. It is passed the column data.

    Warn    The warn callback is invoked when a non-fatal error occurs
            during parsing. Fatal errors croak.

    New     This is the class method to call to create a new object when
            HTML::TableParser is supposed to create new objects upon table
            start.

  Callback API
    Callbacks may be functions or methods or a mixture of both. In the
    latter case, an object must be passed to the constructor. (More on that
    later.)

    The callbacks are invoked as follows:

      start( $tbl_id, $line_no, $udata );

      end( $tbl_id, $line_no, $udata );

      hdr( $tbl_id, $line_no, \@col_names, $udata );

      row( $tbl_id, $line_no, \@data, $udata );

      warn( $tbl_id, $line_no, $message, $udata );

      new( $tbl_id, $udata );
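
    As a sketch of how the hdr and row signatures fit together, the
    callbacks below use udata as shared storage and turn each row into a
    hash keyed by column name. The parser is not loaded here; its calls
    are simulated by hand with invented values, and the names %store and
    %table_req are only illustrative.

```perl
use strict;
use warnings;

# udata: storage that is handed back to every callback.
my %store = ( header => [], records => [] );

sub hdr {
    my ( $tbl_id, $line_no, $col_names, $udata ) = @_;
    @{ $udata->{header} } = @$col_names;
    return;
}

sub row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    my %record;
    @record{ @{ $udata->{header} } } = @$data;    # column name => cell
    push @{ $udata->{records} }, \%record;
    return;
}

# A request of the shape that would be passed to the constructor.
my %table_req = ( id => 1, hdr => \&hdr, row => \&row, udata => \%store );

# Simulate the documented calls (table id, line number, data, udata).
$table_req{hdr}->( '1', 3, [ 'Name', 'Qty' ], $table_req{udata} );
$table_req{row}->( '1', 4, [ 'apple', 3 ],    $table_req{udata} );
$table_req{row}->( '1', 5, [ 'pear',  7 ],    $table_req{udata} );

printf "%s: %s\n", $_->{Name}, $_->{Qty} for @{ $store{records} };
```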

  Data Cleanup
    There are several cleanup operations that may be performed
    automatically:

    Chomp   chomp() the data

    Decode  Run the data through HTML::Entities::decode.

    DecodeNBSP
            Normally HTML::Entities::decode changes a non-breaking space
            into a character which doesn't seem to be matched by Perl's
            whitespace regexp. Setting this attribute changes the HTML
            "nbsp" character to a plain old blank.

    Trim    remove leading and trailing white space.
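
    The effect of these operations on a single cell can be approximated
    by hand. The clean_cell function below is hypothetical and is not the
    module's implementation; it assumes the non-breaking space arrives as
    the character 0xA0 and leaves out entity decoding entirely.

```perl
use strict;
use warnings;

# Hand-written approximation of Chomp, DecodeNBSP and Trim for one cell.
sub clean_cell {
    my ( $text, %opt ) = @_;
    chomp $text             if $opt{Chomp};
    $text =~ tr/\xA0/ /     if $opt{DecodeNBSP};    # NBSP -> plain blank
    $text =~ s/^\s+|\s+$//g if $opt{Trim};
    return $text;
}

my $raw   = "\xA0 42 km/s \n";
my $clean = clean_cell( $raw, Chomp => 1, DecodeNBSP => 1, Trim => 1 );
print "[$clean]\n";    # prints "[42 km/s]"
```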

  Data Organization
    Column names are derived from cells delimited by the <th> and </th>
    tags. Some tables have header cells which span one or more columns or
    rows to make things look nice. HTML::TableParser determines the actual
    number of columns used and provides column names for each column,
    repeating names for spanned columns and concatenating spanned rows and
    columns. For example, if the table header looks like this:

     +----+--------+----------+-------------+-------------------+
     |    |        | Eq J2000 |             | Velocity/Redshift |
     | No | Object |----------| Object Type |-------------------|
     |    |        | RA | Dec |             | km/s |  z  | Qual |
     +----+--------+----------+-------------+-------------------+

    The columns will be:

      No
      Object
      Eq J2000 RA
      Eq J2000 Dec
      Object Type
      Velocity/Redshift km/s
      Velocity/Redshift z
      Velocity/Redshift Qual

    Row data are derived from cells delimited by the <td> and </td> tags.
    Cells which span more than one column or row are handled correctly, i.e.
    the values are duplicated in the appropriate places.
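
    One way to picture the header handling is to lay the cells out on a
    grid, copying each cell into every slot it spans, and then to join the
    distinct texts found in each column. The sketch below does this for
    the header shown above. It is an illustration under that assumption,
    not the module's code, and column_names and the cell layout are
    invented for the example.

```perl
use strict;
use warnings;

# The two header rows above; each cell is [ text, colspan, rowspan ].
my @header = (
    [ [ 'No', 1, 2 ], [ 'Object', 1, 2 ], [ 'Eq J2000', 2, 1 ],
      [ 'Object Type', 1, 2 ], [ 'Velocity/Redshift', 3, 1 ] ],
    [ [ 'RA', 1, 1 ], [ 'Dec', 1, 1 ],
      [ 'km/s', 1, 1 ], [ 'z', 1, 1 ], [ 'Qual', 1, 1 ] ],
);

sub column_names {
    my @rows = @_;
    my @grid;
    for my $r ( 0 .. $#rows ) {
        my $c = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;
            $c++ while defined $grid[$r][$c];    # slot taken by a rowspan
            for my $dr ( 0 .. $rowspan - 1 ) {
                $grid[ $r + $dr ][ $c + $_ ] = $text for 0 .. $colspan - 1;
            }
            $c += $colspan;
        }
    }

    # Join the texts in each column, top to bottom, skipping repeats.
    my @names;
    for my $c ( 0 .. $#{ $grid[0] } ) {
        my @parts;
        for my $r ( 0 .. $#grid ) {
            my $text = $grid[$r][$c];
            push @parts, $text
              if defined $text && ( !@parts || $parts[-1] ne $text );
        }
        push @names, join( ' ', @parts );
    }
    return @names;
}

my @names = column_names(@header);
print "$_\n" for @names;
```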

METHODS
    new
               $p = HTML::TableParser->new( \@reqs, \%attr );

            This is the class constructor. It is passed a list of table
            requests as well as attributes which specify defaults for common
            operations. Table requests are documented in "Table Requests".

            The %attr hash provides default values for some of the table
            request attributes, namely the data cleanup operations (
            "Chomp", "Decode", "Trim" ), and the multi match attribute
            "MultiMatch", i.e.,

              $p = HTML::TableParser->new( \@reqs, { Chomp => 1 } );

            will set Chomp on for all of the table requests, unless
            overridden by them. The data cleanup operations are documented
            above; "MultiMatch" is documented in "Table Requests".

            Decode defaults to on; all of the others default to off.

    parse_file
            This is the same function as in HTML::Parser.

    parse   This is the same function as in HTML::Parser.

Table Requests
    A table request is a hash used by HTML::TableParser to determine which
    tables are to be parsed, the callbacks to be invoked, and any data
    cleanup. There may be multiple requests processed by one call to the
    parser; each table is associated with a single request (even if several
    requests match the table).

    A single request may match several tables; however, unless the
    MultiMatch attribute is specified for that request, it will be used for
    the first matching table only.

    A table request which matches a table id of "DEFAULT" will be used as a
    catch-all request, and will match all tables not matched by other
    requests. Please note that tables are compared to the requests in the
    order that the latter are passed to the new() method; place the DEFAULT
    request last for proper behavior.

  Identifying tables to parse
    HTML::TableParser needs to be told which tables to parse. This can be
    done by matching table ids or column names, or a combination of both.
    The table request hash elements dedicated to this are:

    id      This indicates a match on table id. It can take one of these
            forms:

            exact match
                      id => $match
                      id => '1.2'

                    Here $match is a scalar which is compared directly to
                    the table id.

            regular expression
                      id => $re
                      id => qr/1\.\d+\.2/

                    $re is a regular expression, which must be constructed
                    with the "qr//" operator.

            subroutine
                      id => \&my_match_subroutine
                      id => sub { my ( $id, $oids ) = @_ ;
                               $oids->[0] > 3 && $oids->[1] < 2 }

                    Here "id" is assigned a coderef to a subroutine which
                    returns true if the table matches, false if not. The
                    subroutine is passed two arguments: the table id as a
                    scalar string ( e.g. 1.2.3) and the table id as an
                    arrayref (e.g. "$oids = [ 1, 2, 3]").

            "id" may be passed an array containing any combination of the
            above:

              id => [ '1.2', qr/1\.\d+\.2/, sub { ... } ]

            Elements in the array may be preceded by a modifier indicating
            the action to be taken if the table matches on that element. The
            modifiers and their meanings are:

            "-"     If the id matches, it is explicitly excluded from being
                    processed by this request.

            "--"    If the id matches, it is skipped by all requests.

            "+"     If the id matches, it will be processed by this request.
                    This is the default action.

            An example:

              id => [ '-', '1.2', 'DEFAULT' ]

            indicates that this request should be used for all tables,
            except for table 1.2.

              id => [ '--', '1.2' ]

            This indicates that table 1.2 is skipped altogether.
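
            A subroutine matcher can be tried out on its own before it
            is placed in a request. The sketch below accepts tables nested
            directly inside top-level table 1. The matcher name and the
            list of sample ids are invented, and the second argument is
            built by hand in the documented arrayref form.

```perl
use strict;
use warnings;

# Match tables nested exactly one level below top-level table 1.
my $child_of_table_1 = sub {
    my ( $id, $oids ) = @_;
    return @$oids == 2 && $oids->[0] == 1;
};

# A request using the matcher; the '-' modifier in front of '1.2' excludes
# that table from this request, as described above.
my %table_req = ( id => [ '-', '1.2', $child_of_table_1 ], row => sub { } );

# Try the matcher on a few sample ids.
my %result;
for my $id ( '1', '1.1', '1.3', '2.1', '1.1.1' ) {
    my $oids = [ split /\./, $id ];
    $result{$id} = $child_of_table_1->( $id, $oids ) ? 'match' : 'no match';
    printf "%-6s %s\n", $id, $result{$id};
}
```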

    cols    This indicates a match on column names. It can take one of these
            forms:

            exact match
                      cols => $match
                      cols => 'Snacks01'

                    Here $match is a scalar which is compared directly to
                    the column names. If any column matches, the table is
                    processed.

            regular expression
                      cols => $re
                      cols => qr/Snacks\d+/

                    $re is a regular expression, which must be constructed
                    with the "qr//" operator. Again, a successful match
                    against any column name causes the table to be
                    processed.

            subroutine
                      cols => \&my_match_subroutine
                      cols => sub { my ( $id, $oids, $cols ) = @_ ;
                                    ... }

                    Here "cols" is assigned a coderef to a subroutine which
                    returns true if the table matches, false if not. The
                    subroutine is passed three arguments: the table id as a
                    scalar string ( e.g. 1.2.3), the table id as an arrayref
                    (e.g. "$oids = [ 1, 2, 3]"), and the column names, as an
                    arrayref (e.g. "$cols = [ 'col1', 'col2' ]"). This
                    option gives the calling routine the ability to make
                    arbitrary selections based upon table id and columns.

            "cols" may be passed an arrayref containing any combination of
            the above:

              cols => [ 'Snacks01', qr/Snacks\d+/, sub { ... } ]

            Elements in the array may be preceded by a modifier indicating
            the action to be taken if the table matches on that element.
            They are the same as the table id modifiers mentioned above.
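
            Because the scalar and regular expression forms succeed when
            any one column matches, a subroutine is a natural way to
            require several columns at once. The sketch below is one such
            matcher; its name and the sample column lists are invented for
            the example.

```perl
use strict;
use warnings;

# Accept a table only if it has both coordinate columns.
my $has_position = sub {
    my ( $id, $oids, $cols ) = @_;
    my %seen = map { $_ => 1 } @$cols;
    return $seen{'Eq J2000 RA'} && $seen{'Eq J2000 Dec'} ? 1 : 0;
};

# A request of the shape that would use it.
my %table_req = ( cols => $has_position, row => sub { } );

my $full    = $has_position->( '1', [1], [ 'No', 'Eq J2000 RA', 'Eq J2000 Dec' ] );
my $partial = $has_position->( '2', [2], [ 'No', 'Eq J2000 RA' ] );
print "full: $full, partial: $partial\n";    # prints "full: 1, partial: 0"
```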

    colre   This is deprecated, and is present for backwards compatibility
            only. An arrayref containing the regular expressions to match,
            or a scalar containing a single regular expression.

    More than one of these may be used for a single table request. A request
    may match more than one table. By default a request is used only once
    (even the "DEFAULT" id match!). Set the "MultiMatch" attribute to enable
    multiple matches per request.

    When attempting to match a table, the following steps are taken:

    1       The table id is compared to the requests which contain an id
            match. The first such match is used (in the order given in the
            passed array).

    2       If no explicit id match is found, column name matches are
            attempted. The first such match is used (in the order given in
            the passed array).

    3       If no column name match is found (or there were none requested),
            the first request which matches an id of "DEFAULT" is used.
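
    The three steps, together with the use-once rule, can be modelled in
    a few lines. The select_request function below is a simplified sketch
    written for this illustration: it handles only exact-match ids and
    column names, tracks use with an invented "used" flag, and is not the
    module's own matching code.

```perl
use strict;
use warnings;

my @reqs = (
    { name => 'by_cols',  cols => 'Object Type' },
    { name => 'by_id',    id   => '1.1' },
    { name => 'fallback', id   => 'DEFAULT', MultiMatch => 1 },
);

sub select_request {
    my ( $id, $cols, $reqs ) = @_;

    my @steps = (
        # Step 1: explicit id match.
        sub { my $r = shift; defined $r->{id} && $r->{id} eq $id },
        # Step 2: any column name matches.
        sub {
            my $r = shift;
            defined $r->{cols} && scalar grep { $_ eq $r->{cols} } @$cols;
        },
        # Step 3: the DEFAULT catch-all.
        sub { my $r = shift; defined $r->{id} && $r->{id} eq 'DEFAULT' },
    );

    for my $step (@steps) {
        for my $req (@$reqs) {
            next if $req->{used} && !$req->{MultiMatch};    # use-once rule
            next unless $step->($req);
            $req->{used} = 1;
            return $req->{name};
        }
    }
    return;
}

my @picked;
for my $table (
    [ '1.1', ['Object Type'] ],
    [ '2',   [ 'No', 'Object Type' ] ],
    [ '3',   ['Object Type'] ],
    [ '4',   ['Other'] ],
  )
{
    my ( $id, $cols ) = @$table;
    push @picked, select_request( $id, $cols, \@reqs );
}
print "@picked\n";    # prints "by_id by_cols fallback fallback"
```

    In this model table 1.1 goes to the id request even though a column
    request is listed first, and table 3 falls through to the catch-all
    because the column request has already been used once.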

  Specifying the data callbacks
    Callback functions are specified with the callback attributes "start",
    "end", "hdr", "row", and "warn". They should be set to code references,
    i.e.

      %table_req = ( ..., start => \&start_func, end => \&end_func )

    To use methods, specify the object with the "obj" key, and the method
    names via the callback attributes, which should be set to strings. If
    you don't specify method names they will default to (you guessed it)
    "start", "end", "hdr", "row", and "warn".

      $obj = SomeClass->new();
      # ...
      %table_req_1 = ( ..., obj => $obj );
      %table_req_2 = ( ..., obj => $obj, start => 'start',
                                 end => 'end' );

    You can also have HTML::TableParser create a new object for you for each
    table by specifying the "class" attribute. By default the constructor is
    assumed to be the class new() method; if not, specify it using the "new"
    attribute:

      use MyClass;
      %table_req = ( ..., class => 'MyClass', new => 'mynew' );

    To use a function instead of a method for a particular callback, set the
    callback attribute to a code reference:

      %table_req = ( ..., obj => $obj, end => \&end_func );

    You don't have to provide all the callbacks. You should not use both
    "obj" and "class" in the same table request.

    HTML::TableParser automatically determines if your object or class has
    one of the required methods. If you wish it *not* to use a particular
    method, set it equal to "undef". For example

      %table_req = ( ..., obj => $obj, end => undef )

    indicates the object's end method should not be called, even if it
    exists.
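
    A handler object might look like the sketch below. The RowCounter
    class is hypothetical, and the example assumes that method callbacks
    receive the documented callback arguments after the invocant. The
    parser's calls are simulated by hand with invented values.

```perl
use strict;
use warnings;

package RowCounter;    # hypothetical handler class

sub new {
    my ($class) = @_;
    return bless { cols => [], rows => 0 }, $class;
}

sub hdr {
    my ( $self, $tbl_id, $line_no, $col_names, $udata ) = @_;
    $self->{cols} = [@$col_names];
    return;
}

sub row {
    my ( $self, $tbl_id, $line_no, $data, $udata ) = @_;
    $self->{rows}++;
    return;
}

sub end {
    my ( $self, $tbl_id, $line_no, $udata ) = @_;
    $self->{ended} = $tbl_id;
    return;
}

package main;

my $counter = RowCounter->new;

# hdr and row default to the methods of the same name; end is switched off.
my %table_req = ( id => 1, obj => $counter, end => undef );

# Simulated parser calls; end is deliberately never invoked.
$counter->hdr( '1', 2, [ 'Name', 'Qty' ], undef );
$counter->row( '1', 3, [ 'apple', 3 ], undef );
$counter->row( '1', 4, [ 'pear',  7 ], undef );

printf "%d rows, columns: %s\n", $counter->{rows},
  join( ', ', @{ $counter->{cols} } );    # prints "2 rows, columns: Name, Qty"
```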

    You can specify arbitrary data to be passed to the callback functions
    via the "udata" attribute:

      %table_req = ( ..., udata => \%hash_of_my_special_stuff )

  Specifying Data cleanup operations
    Data cleanup operations may be specified uniquely for each table. The
    available keys are "Chomp", "Decode", "Trim". They should be set to a
    non-zero value if the operation is to be performed.

  Other Attributes
    The "MultiMatch" key is used when a request is capable of handling
    multiple tables in the document. Ordinarily, a request will process a
    single table only (even "DEFAULT" requests). Set it to a non-zero value
    to allow the request to handle more than one table.

BUGS
    Please report any bugs or feature requests on the bugtracker website
    <https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-TableParser> or
    by email to bug-HTML-TableParser AT rt.org
    <mailto:bug-HTML-TableParser AT rt.org>.

    When submitting a bug or request, please include a test-file or a patch
    to an existing test-file that illustrates the bug or desired feature.

SOURCE
    The development version is on github at
    <https://github.com/djerius/html-tableparser> and may be cloned from
    <git://github.com/djerius/html-tableparser.git>

AUTHOR
    Diab Jerius <djerius AT cpan.org>

COPYRIGHT AND LICENSE
    This software is Copyright (c) 2018 by Smithsonian Astrophysical
    Observatory.

    This is free software, licensed under:

      The GNU General Public License, Version 3, June 2007