{
    "content": [
        {
            "type": "text",
            "text": "# Web::Scraper (perldoc)\n\n## NAME\n\nWeb::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions\n\n## SYNOPSIS\n\nuse URI;\nuse Web::Scraper;\nuse Encode;\n# First, create your scraper block\nmy $authors = scraper {\n# Parse all TDs inside 'table[width=\"100%]\"', store them into\n# an array 'authors'.  We embed other scrapers for each TD.\nprocess 'table[width=\"100%\"] td', \"authors[]\" => scraper {\n# And, in each TD,\n# get the URI of \"a\" element\nprocess \"a\", uri => '@href';\n# get text inside \"small\" element\nprocess \"small\", fullname => 'TEXT';\n};\n};\nmy $res = $authors->scrape( URI->new(\"http://search.cpan.org/author/?A\") );\n# iterate the array 'authors'\nfor my $author (@{$res->{authors}}) {\n# output is like:\n# Andy Adler      http://search.cpan.org/~aadler/\n# Aaron K Dancygier       http://search.cpan.org/~aakd/\n# Aamer Akhter    http://search.cpan.org/~aakhter/\nprint Encode::encode(\"utf8\", \"$author->{fullname}\\t$author->{uri}\\n\");\n}\nThe structure would resemble this (visually) { authors => [ { fullname => $fullname, link =>\n$uri }, { fullname => $fullname, link => $uri }, ] }\n\n## DESCRIPTION\n\nWeb::Scraper is a web scraper toolkit, inspired by Ruby's equivalent Scrapi. It provides a\nDSL-ish interface for traversing HTML documents and returning a neatly arranged Perl data\nstructure.\n\n## Sections\n\n- **NAME**\n- **SYNOPSIS**\n- **DESCRIPTION**\n- **METHODS**\n- **EXAMPLES**\n- **NESTED SCRAPERS**\n- **FILTERS**\n- **AUTHOR**\n- **LICENSE**\n- **SEE ALSO**\n\nUse structuredContent.sections for detailed options, examples, and full documentation.\n"
        }
    ],
    "structuredContent": {
        "command": "Web::Scraper",
        "section": "",
        "mode": "perldoc",
        "summary": "Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions",
        "synopsis": "use URI;\nuse Web::Scraper;\nuse Encode;\n# First, create your scraper block\nmy $authors = scraper {\n# Parse all TDs inside 'table[width=\"100%]\"', store them into\n# an array 'authors'.  We embed other scrapers for each TD.\nprocess 'table[width=\"100%\"] td', \"authors[]\" => scraper {\n# And, in each TD,\n# get the URI of \"a\" element\nprocess \"a\", uri => '@href';\n# get text inside \"small\" element\nprocess \"small\", fullname => 'TEXT';\n};\n};\nmy $res = $authors->scrape( URI->new(\"http://search.cpan.org/author/?A\") );\n# iterate the array 'authors'\nfor my $author (@{$res->{authors}}) {\n# output is like:\n# Andy Adler      http://search.cpan.org/~aadler/\n# Aaron K Dancygier       http://search.cpan.org/~aakd/\n# Aamer Akhter    http://search.cpan.org/~aakhter/\nprint Encode::encode(\"utf8\", \"$author->{fullname}\\t$author->{uri}\\n\");\n}\nThe structure would resemble this (visually) { authors => [ { fullname => $fullname, link =>\n$uri }, { fullname => $fullname, link => $uri }, ] }",
        "tldr_summary": null,
        "tldr_examples": [],
        "tldr_source": null,
        "flags": [],
        "examples": [
            "There are many examples in the \"eg/\" dir packaged in this distribution. It is recommended to",
            "look through these."
        ],
        "see_also": [],
        "section_outline": [
            {
                "name": "NAME",
                "lines": 2,
                "subsections": []
            },
            {
                "name": "SYNOPSIS",
                "lines": 31,
                "subsections": []
            },
            {
                "name": "DESCRIPTION",
                "lines": 7,
                "subsections": []
            },
            {
                "name": "METHODS",
                "lines": 83,
                "subsections": []
            },
            {
                "name": "EXAMPLES",
                "lines": 3,
                "subsections": []
            },
            {
                "name": "NESTED SCRAPERS",
                "lines": 12,
                "subsections": []
            },
            {
                "name": "FILTERS",
                "lines": 22,
                "subsections": []
            },
            {
                "name": "AUTHOR",
                "lines": 2,
                "subsections": []
            },
            {
                "name": "LICENSE",
                "lines": 3,
                "subsections": []
            },
            {
                "name": "SEE ALSO",
                "lines": 4,
                "subsections": []
            }
        ],
        "sections": {
            "NAME": {
                "content": "Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions\n",
                "subsections": []
            },
            "SYNOPSIS": {
                "content": "use URI;\nuse Web::Scraper;\nuse Encode;\n\n# First, create your scraper block\nmy $authors = scraper {\n# Parse all TDs inside 'table[width=\"100%]\"', store them into\n# an array 'authors'.  We embed other scrapers for each TD.\nprocess 'table[width=\"100%\"] td', \"authors[]\" => scraper {\n# And, in each TD,\n# get the URI of \"a\" element\nprocess \"a\", uri => '@href';\n# get text inside \"small\" element\nprocess \"small\", fullname => 'TEXT';\n};\n};\n\nmy $res = $authors->scrape( URI->new(\"http://search.cpan.org/author/?A\") );\n\n# iterate the array 'authors'\nfor my $author (@{$res->{authors}}) {\n# output is like:\n# Andy Adler      http://search.cpan.org/~aadler/\n# Aaron K Dancygier       http://search.cpan.org/~aakd/\n# Aamer Akhter    http://search.cpan.org/~aakhter/\nprint Encode::encode(\"utf8\", \"$author->{fullname}\\t$author->{uri}\\n\");\n}\n\nThe structure would resemble this (visually) { authors => [ { fullname => $fullname, link =>\n$uri }, { fullname => $fullname, link => $uri }, ] }\n",
                "subsections": []
            },
            "DESCRIPTION": {
                "content": "Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent Scrapi. It provides a\nDSL-ish interface for traversing HTML documents and returning a neatly arranged Perl data\nstructure.\n\nThe *scraper* and *process* blocks provide a method to define what segments of a document to\nextract. It understands HTML and CSS Selectors as well as XPath expressions.\n",
                "subsections": []
            },
            "METHODS": {
                "content": "scraper\n$scraper = scraper { ... };\n\nCreates a new Web::Scraper object by wrapping the DSL code that will be fired when *scrape*\nmethod is called.\n\nscrape\n$res = $scraper->scrape(URI->new($uri));\n$res = $scraper->scrape($htmlcontent);\n$res = $scraper->scrape(\\$htmlcontent);\n$res = $scraper->scrape($httpresponse);\n$res = $scraper->scrape($htmlelement);\n\nRetrieves the HTML from URI, HTTP::Response, HTML::Tree or text strings and creates a DOM\nobject, then fires the callback scraper code to retrieve the data structure.\n\nIf you pass URI or HTTP::Response object, Web::Scraper will automatically guesses the encoding\nof the content by looking at Content-Type headers and META tags. Otherwise you need to decode\nthe HTML to Unicode before passing it to *scrape* method.\n\nYou can optionally pass the base URL when you pass the HTML content as a string instead of URI\nor HTTP::Response.\n\n$res = $scraper->scrape($htmlcontent, \"http://example.com/foo\");\n\nThis way Web::Scraper can resolve the relative links found in the document.\n\nprocess\nscraper {\nprocess \"tag.class\", key => 'TEXT';\nprocess '//tag[contains(@foo, \"bar\")]', key2 => '@attr';\nprocess '//comment()', 'comments[]' => 'TEXT';\n};\n\n*process* is the method to find matching elements from HTML with CSS selector or XPath\nexpression, then extract text or attributes into the result stash.\n\nIf the first argument begins with \"//\" or \"id(\" it's treated as an XPath expression and\notherwise CSS selector.\n\n# <span class=\"date\">2008/12/21</span>\n# date => \"2008/12/21\"\nprocess \".date\", date => 'TEXT';\n\n# <div class=\"body\"><a href=\"http://example.com/\">foo</a></div>\n# link => URI->new(\"http://example.com/\")\nprocess \".body > a\", link => '@href';\n\n# <div class=\"body\"><!-- HTML Comment here --><a href=\"http://example.com/\">foo</a></div>\n# comment => \" HTML Comment here \"\n#\n# NOTES: A comment nodes are accessed when installed\n# the HTML::TreeBuilder::XPath (version >= 0.14) and/or\n# the HTML::TreeBuilder::LibXML (version >= 0.13)\nprocess \"//div[contains(@class, 'body')]/comment()\", comment => 'TEXT';\n\n# <div class=\"body\"><a href=\"http://example.com/\">foo</a></div>\n# link => URI->new(\"http://example.com/\"), text => \"foo\"\nprocess \".body > a\", link => '@href', text => 'TEXT';\n\n# <ul><li>foo</li><li>bar</li></ul>\n# list => [ \"foo\", \"bar\" ]\nprocess \"li\", \"list[]\" => \"TEXT\";\n\n# <ul><li id=\"1\">foo</li><li id=\"2\">bar</li></ul>\n# list => [ { id => \"1\", text => \"foo\" }, { id => \"2\", text => \"bar\" } ];\nprocess \"li\", \"list[]\" => { id => '@id', text => \"TEXT\" };\n\nprocessfirst\n\"processfirst\" is the same as \"process\" but stops when the first matching result is found.\n\n# <span class=\"date\">2008/12/21</span>\n# <span class=\"date\">2008/12/22</span>\n# date => \"2008/12/21\"\nprocessfirst \".date\", date => 'TEXT';\n\nresult\n\"result\" allows one to return not the default value after processing but a single value\nspecified by a key or a hash reference built from several keys.\n\nprocess 'a', 'want[]' => 'TEXT';\nresult 'want';\n",
                "subsections": []
            },
            "EXAMPLES": {
                "content": "There are many examples in the \"eg/\" dir packaged in this distribution. It is recommended to\nlook through these.\n",
                "subsections": []
            },
            "NESTED SCRAPERS": {
                "content": "Scrapers can be nested thus allowing to scrape already captured data.\n\n# <ul>\n# <li class=\"foo\"><a href=\"foo1\">bar1</a></li>\n# <li class=\"bar\"><a href=\"foo2\">bar2</a></li>\n# <li class=\"foo\"><a href=\"foo3\">bar3</a></li>\n# </ul>\n# friends => [ {href => 'foo1'}, {href => 'foo2'} ];\nprocess 'li', 'friends[]' => scraper {\nprocess 'a', href => '@href',\n};\n",
                "subsections": []
            },
            "FILTERS": {
                "content": "Filters are applied to the result after processing. They can be declared as anonymous\nsubroutines or as class names.\n\nprocess $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];\nprocess $exp, $key => [ 'TEXT', 'Something' ];\nprocess $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];\n\nFilters can be stacked\n\nprocess $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \\&baz ];\n\nMore about filters you can find in Web::Scraper::Filter documentation.\n\nXML backends\nBy default HTML::TreeBuilder::XPath is used, this can be replaces by a XML::LibXML backend using\nWeb::Scraper::LibXML module.\n\nuse Web::Scraper::LibXML;\n\n# same as Web::Scraper\nmy $scraper = scraper { ... };\n",
                "subsections": []
            },
            "AUTHOR": {
                "content": "Tatsuhiko Miyagawa <miyagawa@bulknews.net>\n",
                "subsections": []
            },
            "LICENSE": {
                "content": "This library is free software; you can redistribute it and/or modify it under the same terms as\nPerl itself.\n",
                "subsections": []
            },
            "SEE ALSO": {
                "content": "<http://blog.labnotes.org/category/scrapi/>\n\nHTML::TreeBuilder::XPath\n",
                "subsections": []
            }
        }
    }
}