# HTML::TableExtract

## NAME
    [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) - Perl module for extracting the content contained in tables within an HTML
    document, either as text or encoded element trees.

## SYNOPSIS
     # Matched tables are returned as table objects; tables can be matched
     # using column headers, depth, count within a depth, table tag
     # attributes, or some combination of the four.

     # Example: Using column header information.
     # Assume an HTML document with tables that have "Date", "Price", and
     # "Cost" somewhere in a row. The columns beneath those headings are
     # what you want to extract. They will be returned in the same order as
     # you specified the headers since 'automap' is enabled by default.

     use HTML::TableExtract;
     my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
     $te->parse($html_string);

     # Examine all matching tables
     foreach my $ts ($te->tables) {
       print "Table (", join(',', $ts->coords), "):\n";
       foreach my $row ($ts->rows) {
          print join(',', @$row), "\n";
       }
     }

     # Shorthand...top level rows() method assumes the first table found in
     # the document if no arguments are supplied.
     foreach my $row ($te->rows) {
        print join(',', @$row), "\n";
     }

     # Example: Using depth and count information.
     # Every table in the document has a unique depth and count tuple, so
     # when both are specified it is a unique table. Depth and count both
     # begin with 0, so in this case we are looking for a table (depth 2)
     # within a table (depth 1) within a table (depth 0, which is the top
     # level HTML document). In addition, it must be the third (count 2)
     # such instance of a table at that depth.

     my $te = HTML::TableExtract->new( depth => 2, count => 2 );
     $te->parse_file($html_file);
     foreach my $ts ($te->tables) {
        print "Table found at ", join(',', $ts->coords), ":\n";
        foreach my $row ($ts->rows) {
           print "   ", join(',', @$row), "\n";
        }
     }

     # Example: Using table tag attributes.
     # If multiple attributes are specified, all must be present and equal
     # for match to occur.

     my $te = HTML::TableExtract->new( attribs => { border => 1 } );
     $te->parse($html_string);
     foreach my $ts ($te->tables) {
       print "Table with border=1 found at ", join(',', $ts->coords), ":\n";
       foreach my $row ($ts->rows) {
          print "   ", join(',', @$row), "\n";
       }
     }

     # Example: Extracting as an HTML::Element tree structure
     # Rather than extracting raw text, the html can be converted into a
     # tree of element objects. The HTML document is composed of
     # HTML::Element objects and the tables are HTML::ElementTable
     # structures. Using this, the contents of tables within a document can
     # be edited in-place.

     use HTML::TableExtract qw(tree);
     # headers must be passed as an array reference
     my $te = HTML::TableExtract->new( headers => [qw(Fee Fie Foe Fum)] );
     $te->parse_file($html_file);
     my $table = $te->first_table_found;
     my $table_tree = $table->tree;
     $table_tree->cell(4,4)->replace_content('Golden Goose');
     my $table_html = $table_tree->as_HTML;
     my $table_text = $table_tree->as_text;
     my $document_tree = $te->tree;
     my $document_html = $document_tree->as_HTML;

## DESCRIPTION
    [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) is a subclass of [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown) that serves to extract the information from
    tables of interest contained within an HTML document. The information from each extracted table
    is stored in table objects. Tables can be extracted as text, HTML, or [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown)
    structures (for in-place editing or manipulation).

    There are currently four constraints available to specify which tables you would like to extract
    from a document: *Headers*, *Depth*, *Count*, and *Attributes*.

    *Headers*, the most flexible and adaptive of the techniques, involves specifying text in an
    array that you expect to appear above the data in the tables of interest. Once all headers have
    been located in a row of that table, all further cells beneath the columns that matched your
    headers are extracted. All other columns are ignored: think of it as vertical slices through a
    table. In addition, TableExtract automatically rearranges each row in the same order as the
    headers you provided. If you would like to disable this, set *automap* to 0 during object
    creation, and instead rely on the column_map() method to find out the order in which the headers
    were found. Furthermore, TableExtract will automatically compensate for cell span issues so that
    columns are really the same columns as you would visually see in a browser. This behavior can be
    disabled by setting the *gridmap* parameter to 0. HTML is stripped from the entire textual
    content of a cell before header matches are attempted -- unless the *keep_html* parameter was
    enabled.
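
    As a minimal sketch of header-based slicing, the following uses a made-up document (the
    $html string and its contents are assumptions for illustration) in which the headers appear
    in a different order than requested and one column is not requested at all. With *automap*
    left at its default, each row should come back in the order of the requested headers.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample document: "Name" is not requested, and the
# requested headers appear in a different order in the table.
my $html = <<'HTML';
<table>
  <tr><th>Price</th><th>Name</th><th>Date</th></tr>
  <tr><td>1.50</td><td>apple</td><td>2020-01-01</td></tr>
  <tr><td>0.75</td><td>pear</td><td>2020-01-02</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new( headers => [qw(Date Price)] );
$te->parse($html);
$te->eof;    # inherited from HTML::Parser; flushes any buffered text

# Collect each extracted row as a comma-separated line.
my @lines;
foreach my $ts ($te->tables) {
    foreach my $row ($ts->rows) {
        push @lines, join(',', @$row);
    }
}
print "$_\n" for @lines;
```

    The header row itself is not part of the output because *keep_headers* is off by default.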

    *Depth* and *Count* are more specific ways to specify tables in relation to one another. *Depth*
    represents how deeply a table resides in other tables. The depth of a top-level table in the
    document is 0. A table within a top-level table has a depth of 1, and so on. Each depth can be
    thought of as a layer; tables sharing the same depth are on the same layer. Within each of these
    layers, *Count* represents the order in which a table was seen at that depth, starting with 0.
    Providing both a *depth* and a *count* will uniquely specify a table within a document.
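
    The depth and count numbering can be illustrated with a small nested document. This is a
    sketch only; the $html string is invented for the example. Two tables sit inside the first
    top-level table, so they are at depth 1 with counts 0 and 1.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical document: two top-level tables (depth 0), the first of
# which contains two nested tables (depth 1, counts 0 and 1).
my $html = <<'HTML';
<table>
  <tr><td>outer A
    <table><tr><td>inner A0</td></tr></table>
    <table><tr><td>inner A1</td></tr></table>
  </td></tr>
</table>
<table>
  <tr><td>outer B</td></tr>
</table>
HTML

# Ask for the second table (count 1) found at depth 1.
my $te = HTML::TableExtract->new( depth => 1, count => 1 );
$te->parse($html);
$te->eof;

my @found;
foreach my $ts ($te->tables) {
    my $cells = join('|', map { @$_ } $ts->rows);
    push @found, join(',', $ts->coords) . ": $cells";
}
print "$_\n" for @found;
```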

    *Attributes* match based on the attributes of the html <table> tag, for example, border widths
    or background color.

    The *Headers*, *Depth*, *Count*, and *Attributes* specifications are cumulative in their
    effect on the overall extraction. For instance, if you specify only a *Depth*, then you get all
    tables at that depth (note that these could very well reside in separate higher-level tables
    throughout the document since depth extends across tables). If you specify only a *Count*, then
    the tables at that *Count* from all depths are returned (i.e., the *n*th occurrence of a table
    at each depth). If you only specify *Headers*, then you get all tables in the document
    containing those column headers. If you have specified multiple constraints of *Headers*,
    *Depth*, *Count*, and *Attributes*, then each constraint has veto power over whether a
    particular table is extracted.

    If no *Headers*, *Depth*, *Count*, or *Attributes* are specified, then all tables match.
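
    A short sketch of how constraints combine, using an invented three-table document. The
    count_matches() helper is hypothetical and exists only for this illustration; it builds a
    fresh extractor for each set of constraints and counts the tables that matched.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical document: two tables carry a "Name" header, but only the
# first has border="1"; the third table has no header row at all.
my $html = <<'HTML';
<table border="1">
  <tr><th>Name</th></tr>
  <tr><td>alice</td></tr>
</table>
<table>
  <tr><th>Name</th></tr>
  <tr><td>bob</td></tr>
</table>
<table>
  <tr><td>misc</td></tr>
</table>
HTML

# Illustrative helper (not part of the module): count matching tables.
sub count_matches {
    my %constraints = @_;
    my $te = HTML::TableExtract->new(%constraints);
    $te->parse($html);
    $te->eof;
    my @tables = $te->tables;
    return scalar @tables;
}

my $all     = count_matches();                          # no constraints
my $by_head = count_matches( headers => ['Name'] );     # headers only
my $by_both = count_matches( headers => ['Name'],
                             attribs => { border => 1 } );

print "all=$all headers=$by_head headers+attribs=$by_both\n";
```

    Each added constraint can only narrow the result, which is the "veto" behavior described
    above.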

    When extracting only text from tables, the text is decoded with [HTML::Entities](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AEntities/markdown) by default; this
    can be disabled by setting the *decode* parameter to 0.
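
    The effect of *decode* can be seen on a cell that contains an entity. This is a sketch under
    the assumption of a one-cell document; first_cell() is a hypothetical helper written for
    the example.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical one-cell document containing an HTML entity.
my $html = '<table><tr><td>Fish &amp; Chips</td></tr></table>';

# Illustrative helper: parse with the given options, return cell (0,0).
sub first_cell {
    my %options = @_;
    my $te = HTML::TableExtract->new(%options);
    $te->parse($html);
    $te->eof;
    return $te->first_table_found->cell(0, 0);
}

my $decoded = first_cell();                # entities decoded (default)
my $raw     = first_cell( decode => 0 );   # entities left untouched

print "decoded: $decoded\n";
print "raw:     $raw\n";
```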

### Extraction Modes
    The default mode of extraction for [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) is raw text or HTML. In this mode,
    embedded tables are completely decoupled from one another. In this case, [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) is a
    subclass of [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown):

      use HTML::TableExtract;

    Alternatively, tables can be extracted as [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) structures, which are in turn
    embedded in an [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) tree representing the entire HTML document. Embedded tables are not
    decoupled from one another since this tree structure must be maintained. In this case,
    [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) is a subclass of [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown) (itself a subclass of HTML::Parser):

      use HTML::TableExtract qw(tree);

    In either case, the basic interface for [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) and the resulting table objects
    remains the same -- all that changes is what you can do with the resulting data.

    [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) is a subclass of [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown), and as such inherits all of its basic methods
    such as "parse()" and "parse_file()". During scans, "start()", "end()", and "text()" are
    utilized. Feel free to override them, but if you do not eventually invoke them in the SUPER
    class with some content, results are not guaranteed.
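
    A minimal sketch of overriding one of these hooks in text mode. The CountingExtract package,
    its table_tags_seen() accessor, and the hash key it uses are all hypothetical names chosen
    for this example; the sketch assumes start() receives the tag name as its first argument, as
    in the classic HTML::Parser subclassing interface. The important part is the call back into
    SUPER::start() with the original arguments.

```perl
use strict;
use warnings;
use HTML::TableExtract;

package CountingExtract;
use parent -norequire, 'HTML::TableExtract';

# Count <table> start tags, then hand the event on unchanged so that
# HTML::TableExtract can still do its own bookkeeping.
sub start {
    my $self = shift;
    my ($tag) = @_;
    $self->{_counting_extract_tables}++ if lc($tag) eq 'table';
    return $self->SUPER::start(@_);
}

# Hypothetical accessor for the counter kept above.
sub table_tags_seen {
    my $self = shift;
    return $self->{_counting_extract_tables} || 0;
}

package main;

my $html = '<table><tr><td>a</td></tr></table>'
         . '<table><tr><td>b</td></tr></table>';

my $te = CountingExtract->new;
$te->parse($html);
$te->eof;

my @tables = $te->tables;
print "start tags seen: ", $te->table_tags_seen, "\n";
print "tables matched:  ", scalar(@tables), "\n";
```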

### Advice
    The main point of this module was to provide a flexible method of extracting tabular information
    from HTML documents without relying too heavily on the document layout. For that reason, I
    suggest using *Headers* whenever possible -- that way, you are anchoring your extraction on what
    the document is trying to communicate rather than some feature of the HTML comprising the
    document (other than the fact that the data is contained in a table).

## METHODS
    The following are the top-level methods of the [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) object. Tables that have
    matched a query are actually returned as separate objects of type [HTML::TableExtract::Table](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract%3A%3ATable/markdown).
    These table objects have their own methods, documented further below.

  CONSTRUCTOR
### new
        Return a new [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) object. Valid attributes are:

        headers
            Passed as an array reference, headers specify strings of interest at the top of columns
            within targeted tables. They can be either strings or regular expressions (qr//). If
            they are strings, they will eventually be passed through a non-anchored,
            case-insensitive regular expression, so regexp special characters are allowed.

            The table row containing the headers is not returned, unless "keep_headers" was
            specified or you are extracting into an element tree. In either case the header row can
            be accessed via the hrow() method from within the table object.

            Columns that are not beneath one of the provided headers will be ignored unless
            "slice_columns" was set to 0. Columns will, by default, be rearranged into the same
            order as the headers you provide (see the *automap* parameter for more information)
            *unless* "slice_columns" is 0.

            Additionally, by default columns are considered what you would see visually beneath that
            header when the table is rendered in a browser. See the "gridmap" parameter for more
            information.

            HTML within a header is stripped before the match is attempted, unless the "keep_html"
            parameter was specified and "strip_html_on_match" is false.

        depth
            Specify how embedded in other tables your tables of interest should be. Top-level tables
            in the HTML document have a depth of 0, tables within top-level tables have a depth of
            1, and so on.

        count
            Specify which table within each depth you are interested in, beginning with 0.

        attribs
            Passed as a hash reference, attribs specify attributes of interest within the HTML
            <table> tag itself.

        automap
            Automatically applies the ordering reported by column_map() to the rows returned by
            rows(). This only makes a difference if you have specified *Headers* and they turn out
            to be in a different order in the table than what you specified. Automap will rearrange
            the columns into the order in which you specified the headers. To get the original ordering, you
            will need to take another slice of each row using column_map(). *automap* is enabled by
            default.

        slice_columns
            Enabled by default, this option controls whether vertical slices are returned from under
            headers that match. When disabled, all columns of the matching table are retained,
            regardless of whether they had a matching header above them. Disabling this also disables
            "automap".

        keep_headers
            Disabled by default, and only applicable when header constraints have been specified,
            "keep_headers" will retain the matching header row as the first row of table data when
            enabled. This option has no effect if extracting into an element tree structure. In any
            case, the header row is accessible from the table method "hrow()".

        gridmap
            Controls whether the table contents are returned as a grid or a tree. ROWSPAN and
            COLSPAN issues are compensated for, and columns really are columns. Empty phantom cells
            are created where they would have been obscured by ROWSPAN or COLSPAN settings. This
            really becomes an issue when extracting columns beneath headers. Enabled by default.

        subtables
            Extract all tables embedded within matched tables.

        decode
            Automatically decode retrieved text with [HTML::Entities::decode_entities](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AEntities%3A%3Adecodeentities/markdown)(). Enabled by
            default. Has no effect if "keep_html" was specified or if extracting into an element
            tree structure.

        br_translate
            Translate <br> tags into newlines. Sometimes the remaining text can be hard to parse if
            the <br> tag is simply dropped. Enabled by default. Has no effect if *keep_html* is
            enabled or if extracting into an element tree structure.

        keep_html
            Return the raw HTML contained in the cell, rather than just the visible text. Embedded
            tables are not retained in the HTML extracted from a cell. Patterns for header matches
            must take into account HTML in the string if this option is enabled. This option has no
            effect if extracting into an element tree structure.

        strip_html_on_match
            When "keep_html" is enabled, HTML is stripped by default during attempts at matching
            header strings (so if "strip_html_on_match" is not enabled and "keep_html" is, you would
            have to include potential HTML tags in the regexp for header matches). Stripped header
            tags are replaced with an empty string, e.g. 'hot d<em>og</em>' would become 'hot dog'
            before attempting a match.

        error_handle
            Filehandle where error messages are printed. STDERR by default.

        debug
            Prints some debugging information to STDERR, more for higher values. If "error_handle"
            was provided, messages are printed there rather than STDERR.
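
    Several of the constructor attributes above can be compared side by side. The sketch below
    uses an invented document and a hypothetical extract_rows() helper; it shows a qr// header
    pattern together with "keep_headers", and then "slice_columns" set to 0.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical document with one column ("Notes") that is not requested.
my $html = <<'HTML';
<table>
  <tr><th>Item</th><th>Unit Price</th><th>Notes</th></tr>
  <tr><td>apple</td><td>1.50</td><td>red</td></tr>
</table>
HTML

# Illustrative helper: return the first table's rows as joined strings.
sub extract_rows {
    my %options = @_;
    my $te = HTML::TableExtract->new(%options);
    $te->parse($html);
    $te->eof;
    my $table = $te->first_table_found or return;
    return map { join(',', @$_) } $table->rows;
}

# Headers may mix plain strings and compiled patterns.
my @headers = ( 'Item', qr/^unit\s+price$/i );

# keep_headers: the matching header row becomes the first data row.
my @kept = extract_rows( headers => \@headers, keep_headers => 1 );

# slice_columns => 0: every column of the matching table is retained.
my @unsliced = extract_rows( headers => \@headers, slice_columns => 0 );

print "keep_headers:  $_\n" for @kept;
print "slice_columns: $_\n" for @unsliced;
```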

  REGULAR METHODS
    The following methods are invoked directly from an [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) object.

### depths
        Returns all depths that contained matched tables in the document.

### counts
        For a particular depth, returns all counts that contained matched tables.

### table
        For a particular depth and count, return the table object for the table found, if any.
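
        The depths(), counts(), and table() methods can be combined to walk every matched table
        by its coordinates. A sketch, assuming an invented nested document and no constraints
        (so every table matches):

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical document: two top-level tables, the first containing two
# nested tables.
my $html = <<'HTML';
<table>
  <tr><td>outer A
    <table><tr><td>inner A0</td></tr></table>
    <table><tr><td>inner A1</td></tr></table>
  </td></tr>
</table>
<table>
  <tr><td>outer B</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

# Visit each (depth, count) pair that holds a matched table.
my @seen;
foreach my $depth ($te->depths) {
    foreach my $count ($te->counts($depth)) {
        my $table = $te->table($depth, $count) or next;
        push @seen, join(',', $table->coords);
    }
}
print "$_\n" for @seen;
```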

### tables
        Return table objects for all tables that matched. Returns an empty list if no tables
        matched.

### first_table_found
        Return the table object for the first table matched in the document. Returns undef if
        no tables were matched.

### current_table
        Returns the current table object while parsing the HTML. Only useful if you're messing
        around with overriding [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown) methods.

### tree
        If the module was invoked in tree extraction mode, returns a reference to the top node of
        the [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) tree structure for the entire document (which includes, ultimately, all
        tables within the document).

### tables_report
        Return a string summarizing extracted tables, along with their depth and count. Optionally
        takes a $show_content flag which will dump the extracted contents of each table as well with
        columns separated by $col_sep. Default $col_sep is ':'.

### tables_dump
        Same as "tables_report()" except dump the information to STDOUT.

    start
    end
    text
        These are the hooks into [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown). If you want to subclass this module and have things
        work, you must at some point call these with content.

  DEPRECATED METHODS
    Tables used to be called 'table states'. Accordingly, the following methods still work but have
    been deprecated:

### table_state
        Is now table()

### table_states
        Is now tables()

### first_table_state_found
        Is now first_table_found()

  TABLE METHODS
    The following methods are invoked from an [HTML::TableExtract::Table](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract%3A%3ATable/markdown) object, such as those
    returned from the "tables()" method.

### rows
        Return all rows within a matched table. Each row returned is a reference to an array
        containing the text, HTML, or reference to the [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) object of each cell depending on
        the mode of extraction. Tables with rowspan or colspan attributes will have some cells
        containing undef. Returns a list or a reference to an array depending on context.

### columns
        Return all columns within a matched table. Each column returned is a reference to an array
        containing the text, HTML, or reference to [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) object of each cell depending on
        the mode of extraction. Tables with rowspan or colspan attributes will have some cells
        containing undef.

### row
        Return a particular row from within a matched table either as a list or an array reference,
        depending on context.

### column
        Return a particular column from within a matched table as a list or an array reference,
        depending on context.

### cell
        Return a particular item from within a matched table, whether it be the text, HTML, or
        reference to the [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) object of that cell, depending on the mode of extraction. If
        the cell was covered due to rowspan or colspan effects, will return undef.

### space
        The same as cell(), except in cases where the given coordinates were covered due to rowspan
        or colspan issues, in which case the content of the covering cell is returned rather than
        undef.
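
        The difference between cell() and space() shows up only on covered grid positions. The
        sketch below assumes an invented table whose first cell spans two rows and two columns,
        with *gridmap* left at its default.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical table: the "big" cell covers grid positions
# (0,0), (0,1), (1,0) and (1,1).
my $html = <<'HTML';
<table>
  <tr><td colspan="2" rowspan="2">big</td><td>a</td></tr>
  <tr><td>b</td></tr>
  <tr><td>c</td><td>d</td><td>e</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

my $table = $te->first_table_found;

my $origin  = $table->cell(0, 0);    # the spanning cell itself
my $covered = $table->cell(1, 1);    # covered position
my $filled  = $table->space(1, 1);   # content of the covering cell

print "cell(0,0):  $origin\n";
print "cell(1,1):  ", (defined $covered ? $covered : 'undef'), "\n";
print "space(1,1): $filled\n";
```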

### depth
        Return the depth at which this table was found.

### count
        Return the count for this table within the depth it was found.

### coords
        Return depth and count in a list.

### tree
        If the module was invoked in tree extraction mode, this accessor provides a reference to the
        [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) structure encompassing the table.

### hrow
        Returns the header row as a list when headers were specified as a constraint. If
        "keep_headers" was specified initially, this is equivalent to the first row returned by the
        "rows()" method.

### column_map
        Return the order (via indices) in which the provided headers were found. These indices can
        be used as slices on rows to either order the rows in the same order as headers or restore
        the rows to their natural order, depending on whether the rows have been pre-adjusted using
        the *automap* parameter.

### lineage
        Returns the path of matched tables that led to matching this table. The path is a list of
        array refs containing depth, count, row, and column values for each ancestor table involved.
        Note that corresponding table objects will not exist for ancestral tables that did not match
        specified constraints.

## NOTES ON TREE EXTRACTION MODE
    As mentioned above, [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown) can be invoked in 'tree' mode where the resulting HTML
    and extracted tables are encoded in [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) tree structures:

      use HTML::TableExtract 'tree';

    There are a number of things to take note of while using this mode. The entire HTML document is
    encoded into an [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) tree. Each table is part of this structure, but nevertheless is
    tracked separately via an [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) structure, which is a specialized form of
    [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) tree.

    The [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) objects are accessible by invoking the tree() method from within each
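    table object returned by [HTML::TableExtract](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract/markdown). The [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) objects have their own
    row() and column() methods, which are not the same as the methods of those names on
    [HTML::TableExtract::Table](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract%3A%3ATable/markdown) objects.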

    For example, the row() method from [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) will provide a reference to a 'glob' of
    all the elements in that row. Actions (such as setting attributes) performed on that row
    reference will affect all elements within that row. On the other hand, the row() method from the
    [HTML::TableExtract::Table](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract%3A%3ATable/markdown) object will return an array (either by reference or list, depending on
    context) of the contents of each cell within the row. In tree mode, the content is represented
    by individual references to each cell -- these are references to the same [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) objects
    that reside in the [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown) tree.
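
    A small tree-mode sketch, assuming HTML::TreeBuilder and HTML::ElementTable are installed
    and using an invented two-by-two document. It reads a row through the
    HTML::TableExtract::Table object (which yields element references) and then edits a cell
    through the HTML::ElementTable structure, so the change is visible in the document tree.

```perl
use strict;
use warnings;
use HTML::TableExtract qw(tree);

# Hypothetical document with a single 2x2 table.
my $html = '<html><body><table>'
         . '<tr><td>Fee</td><td>Fie</td></tr>'
         . '<tr><td>Foe</td><td>Fum</td></tr>'
         . '</table></body></html>';

my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

my $table = $te->first_table_found;

# In tree mode each cell is an element object, not a string.
my @cells = $table->row(0);
my @texts = map { $_->as_text } @cells;
print "row 0: @texts\n";

# Edit one cell in place via the HTML::ElementTable structure.
$table->tree->cell(1, 1)->replace_content('Golden Goose');

# The edit is reflected in the tree for the whole document.
my $doc_html = $te->tree->as_HTML;
print $doc_html, "\n";
```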

    The cell() methods provided in both cases will therefore return references to the same object.
    The exception to this is when a 'cell' in the table grid was originally 'covered' due to rowspan
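    or colspan issues -- in this case the cell content will be undef. Likewise, the row() or
    column() methods of [HTML::TableExtract::Table](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract%3A%3ATable/markdown) objects may return arrays
    containing a mixture of object references and undefs. If you're going to be doing lots of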
    manipulation of the table elements, it might be more efficient to access them via the methods
    provided by the [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) object instead. See [HTML::ElementTable](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElementTable/markdown) for more information
    on how to manipulate those objects.

    An alternative to the cell() method in [HTML::TableExtract::Table](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATableExtract%3A%3ATable/markdown) is the space() method. It is
    largely similar to cell(), except when given coordinates of a cell that was covered due to
    rowspan or colspan effects, it will return the contents of the cell that was covering that space
    rather than undef. So if, for example, cell (0,0) had a rowspan of 2 and colspan of 2, cell(1,1)
    would return undef and space(1,1) would return the same content as cell(0,0) or space(0,0).

## REQUIRES
    HTML::[Parser(3)](https://www.chedong.com/phpMan.php/man/Parser/3/markdown), HTML::[Entities(3)](https://www.chedong.com/phpMan.php/man/Entities/3/markdown)

## OPTIONALLY REQUIRES
    HTML::[TreeBuilder(3)](https://www.chedong.com/phpMan.php/man/TreeBuilder/3/markdown), HTML::[ElementTable(3)](https://www.chedong.com/phpMan.php/man/ElementTable/3/markdown)

## AUTHOR
    Matthew P. Sisk, <sisk@mojotoad.com>

## COPYRIGHT
    Copyright (c) 2000-2017 Matthew P. Sisk. All rights reserved. All wrongs revenged. This program
    is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

## SEE ALSO
    HTML::[Parser(3)](https://www.chedong.com/phpMan.php/man/Parser/3/markdown), HTML::[TreeBuilder(3)](https://www.chedong.com/phpMan.php/man/TreeBuilder/3/markdown), HTML::[ElementTable(3)](https://www.chedong.com/phpMan.php/man/ElementTable/3/markdown), [perl(1)](https://www.chedong.com/phpMan.php/man/perl/1/markdown).

