{
    "mode": "perldoc",
    "parameter": "Web::Scraper",
    "section": "",
    "url": "https://www.chedong.com/phpMan.php/perldoc/Web%3A%3AScraper/json",
    "generated": "2026-06-12T13:54:40Z",
    "synopsis": "use URI;\nuse Web::Scraper;\nuse Encode;\n# First, create your scraper block\nmy $authors = scraper {\n# Parse all TDs inside 'table[width=\"100%]\"', store them into\n# an array 'authors'.  We embed other scrapers for each TD.\nprocess 'table[width=\"100%\"] td', \"authors[]\" => scraper {\n# And, in each TD,\n# get the URI of \"a\" element\nprocess \"a\", uri => '@href';\n# get text inside \"small\" element\nprocess \"small\", fullname => 'TEXT';\n};\n};\nmy $res = $authors->scrape( URI->new(\"http://search.cpan.org/author/?A\") );\n# iterate the array 'authors'\nfor my $author (@{$res->{authors}}) {\n# output is like:\n# Andy Adler      http://search.cpan.org/~aadler/\n# Aaron K Dancygier       http://search.cpan.org/~aakd/\n# Aamer Akhter    http://search.cpan.org/~aakhter/\nprint Encode::encode(\"utf8\", \"$author->{fullname}\\t$author->{uri}\\n\");\n}\nThe structure would resemble this (visually) { authors => [ { fullname => $fullname, link =>\n$uri }, { fullname => $fullname, link => $uri }, ] }",
    "sections": {
        "NAME": {
            "content": "Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions\n",
            "subsections": []
        },
        "SYNOPSIS": {
            "content": "use URI;\nuse Web::Scraper;\nuse Encode;\n\n# First, create your scraper block\nmy $authors = scraper {\n# Parse all TDs inside 'table[width=\"100%]\"', store them into\n# an array 'authors'.  We embed other scrapers for each TD.\nprocess 'table[width=\"100%\"] td', \"authors[]\" => scraper {\n# And, in each TD,\n# get the URI of \"a\" element\nprocess \"a\", uri => '@href';\n# get text inside \"small\" element\nprocess \"small\", fullname => 'TEXT';\n};\n};\n\nmy $res = $authors->scrape( URI->new(\"http://search.cpan.org/author/?A\") );\n\n# iterate the array 'authors'\nfor my $author (@{$res->{authors}}) {\n# output is like:\n# Andy Adler      http://search.cpan.org/~aadler/\n# Aaron K Dancygier       http://search.cpan.org/~aakd/\n# Aamer Akhter    http://search.cpan.org/~aakhter/\nprint Encode::encode(\"utf8\", \"$author->{fullname}\\t$author->{uri}\\n\");\n}\n\nThe structure would resemble this (visually) { authors => [ { fullname => $fullname, link =>\n$uri }, { fullname => $fullname, link => $uri }, ] }\n",
            "subsections": []
        },
        "DESCRIPTION": {
            "content": "Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent Scrapi. It provides a\nDSL-ish interface for traversing HTML documents and returning a neatly arranged Perl data\nstructure.\n\nThe *scraper* and *process* blocks provide a method to define what segments of a document to\nextract. It understands HTML and CSS Selectors as well as XPath expressions.\n",
            "subsections": []
        },
        "METHODS": {
            "content": "scraper\n$scraper = scraper { ... };\n\nCreates a new Web::Scraper object by wrapping the DSL code that will be fired when *scrape*\nmethod is called.\n\nscrape\n$res = $scraper->scrape(URI->new($uri));\n$res = $scraper->scrape($htmlcontent);\n$res = $scraper->scrape(\\$htmlcontent);\n$res = $scraper->scrape($httpresponse);\n$res = $scraper->scrape($htmlelement);\n\nRetrieves the HTML from URI, HTTP::Response, HTML::Tree or text strings and creates a DOM\nobject, then fires the callback scraper code to retrieve the data structure.\n\nIf you pass URI or HTTP::Response object, Web::Scraper will automatically guesses the encoding\nof the content by looking at Content-Type headers and META tags. Otherwise you need to decode\nthe HTML to Unicode before passing it to *scrape* method.\n\nYou can optionally pass the base URL when you pass the HTML content as a string instead of URI\nor HTTP::Response.\n\n$res = $scraper->scrape($htmlcontent, \"http://example.com/foo\");\n\nThis way Web::Scraper can resolve the relative links found in the document.\n\nprocess\nscraper {\nprocess \"tag.class\", key => 'TEXT';\nprocess '//tag[contains(@foo, \"bar\")]', key2 => '@attr';\nprocess '//comment()', 'comments[]' => 'TEXT';\n};\n\n*process* is the method to find matching elements from HTML with CSS selector or XPath\nexpression, then extract text or attributes into the result stash.\n\nIf the first argument begins with \"//\" or \"id(\" it's treated as an XPath expression and\notherwise CSS selector.\n\n# <span class=\"date\">2008/12/21</span>\n# date => \"2008/12/21\"\nprocess \".date\", date => 'TEXT';\n\n# <div class=\"body\"><a href=\"http://example.com/\">foo</a></div>\n# link => URI->new(\"http://example.com/\")\nprocess \".body > a\", link => '@href';\n\n# <div class=\"body\"><!-- HTML Comment here --><a href=\"http://example.com/\">foo</a></div>\n# comment => \" HTML Comment here \"\n#\n# NOTES: A comment nodes are accessed when installed\n# the HTML::TreeBuilder::XPath (version >= 0.14) and/or\n# the HTML::TreeBuilder::LibXML (version >= 0.13)\nprocess \"//div[contains(@class, 'body')]/comment()\", comment => 'TEXT';\n\n# <div class=\"body\"><a href=\"http://example.com/\">foo</a></div>\n# link => URI->new(\"http://example.com/\"), text => \"foo\"\nprocess \".body > a\", link => '@href', text => 'TEXT';\n\n# <ul><li>foo</li><li>bar</li></ul>\n# list => [ \"foo\", \"bar\" ]\nprocess \"li\", \"list[]\" => \"TEXT\";\n\n# <ul><li id=\"1\">foo</li><li id=\"2\">bar</li></ul>\n# list => [ { id => \"1\", text => \"foo\" }, { id => \"2\", text => \"bar\" } ];\nprocess \"li\", \"list[]\" => { id => '@id', text => \"TEXT\" };\n\nprocessfirst\n\"processfirst\" is the same as \"process\" but stops when the first matching result is found.\n\n# <span class=\"date\">2008/12/21</span>\n# <span class=\"date\">2008/12/22</span>\n# date => \"2008/12/21\"\nprocessfirst \".date\", date => 'TEXT';\n\nresult\n\"result\" allows one to return not the default value after processing but a single value\nspecified by a key or a hash reference built from several keys.\n\nprocess 'a', 'want[]' => 'TEXT';\nresult 'want';\n",
            "subsections": []
        },
        "EXAMPLES": {
            "content": "There are many examples in the \"eg/\" dir packaged in this distribution. It is recommended to\nlook through these.\n",
            "subsections": []
        },
        "NESTED SCRAPERS": {
            "content": "Scrapers can be nested thus allowing to scrape already captured data.\n\n# <ul>\n# <li class=\"foo\"><a href=\"foo1\">bar1</a></li>\n# <li class=\"bar\"><a href=\"foo2\">bar2</a></li>\n# <li class=\"foo\"><a href=\"foo3\">bar3</a></li>\n# </ul>\n# friends => [ {href => 'foo1'}, {href => 'foo2'} ];\nprocess 'li', 'friends[]' => scraper {\nprocess 'a', href => '@href',\n};\n",
            "subsections": []
        },
        "FILTERS": {
            "content": "Filters are applied to the result after processing. They can be declared as anonymous\nsubroutines or as class names.\n\nprocess $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];\nprocess $exp, $key => [ 'TEXT', 'Something' ];\nprocess $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];\n\nFilters can be stacked\n\nprocess $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \\&baz ];\n\nMore about filters you can find in Web::Scraper::Filter documentation.\n\nXML backends\nBy default HTML::TreeBuilder::XPath is used, this can be replaces by a XML::LibXML backend using\nWeb::Scraper::LibXML module.\n\nuse Web::Scraper::LibXML;\n\n# same as Web::Scraper\nmy $scraper = scraper { ... };\n",
            "subsections": []
        },
        "AUTHOR": {
            "content": "Tatsuhiko Miyagawa <miyagawa@bulknews.net>\n",
            "subsections": []
        },
        "LICENSE": {
            "content": "This library is free software; you can redistribute it and/or modify it under the same terms as\nPerl itself.\n",
            "subsections": []
        },
        "SEE ALSO": {
            "content": "<http://blog.labnotes.org/category/scrapi/>\n\nHTML::TreeBuilder::XPath\n",
            "subsections": []
        }
    },
    "summary": "Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions",
    "flags": [],
    "examples": [
        "There are many examples in the \"eg/\" dir packaged in this distribution. It is recommended to",
        "look through these."
    ],
    "see_also": []
}