{
    "mode": "perldoc",
    "parameter": "HTML::TreeBuilder",
    "section": "",
    "url": "https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/json",
    "generated": "2026-06-10T03:52:01Z",
    "synopsis": "use HTML::TreeBuilder 5 -weak; # Ensure weak references in use\nforeach my $filename (@ARGV) {\nmy $tree = HTML::TreeBuilder->new; # empty tree\n$tree->parse_file($filename);\nprint \"Hey, here's a dump of the parse tree of $filename:\\n\";\n$tree->dump; # a method we inherit from HTML::Element\nprint \"And here it is, bizarrely rerendered as HTML:\\n\",\n$tree->as_HTML, \"\\n\";\n# Now that we're done with it, we must destroy it.\n# $tree = $tree->delete; # Not required with weak references\n}",
    "sections": {
        "NAME": {
            "content": "HTML::TreeBuilder - Parser that builds a HTML syntax tree\n",
            "subsections": []
        },
        "VERSION": {
            "content": "This document describes version 5.07 of HTML::TreeBuilder, released August 31, 2017 as part of\nHTML-Tree.\n",
            "subsections": []
        },
        "SYNOPSIS": {
            "content": "use HTML::TreeBuilder 5 -weak; # Ensure weak references in use\n\nforeach my $filename (@ARGV) {\nmy $tree = HTML::TreeBuilder->new; # empty tree\n$tree->parse_file($filename);\nprint \"Hey, here's a dump of the parse tree of $filename:\\n\";\n$tree->dump; # a method we inherit from HTML::Element\nprint \"And here it is, bizarrely rerendered as HTML:\\n\",\n$tree->as_HTML, \"\\n\";\n\n# Now that we're done with it, we must destroy it.\n# $tree = $tree->delete; # Not required with weak references\n}\n",
            "subsections": []
        },
        "DESCRIPTION": {
            "content": "(This class is part of the HTML::Tree dist.)\n\nThis class is for HTML syntax trees that get built out of HTML source. The way to use it is to:\n\n1. start a new (empty) HTML::TreeBuilder object,\n\n2. then use one of the methods from HTML::Parser (presumably with \"$tree->parse_file($filename)\"\nfor files, or with \"$tree->parse($document_content)\" and \"$tree->eof\" if you've got the content\nin a string) to parse the HTML document into the tree $tree.\n\n(You can combine steps 1 and 2 with the \"new_from_file\" or \"new_from_content\" methods.)\n\n2b. call \"$root->elementify()\" if you want.\n\n3. do whatever you need to do with the syntax tree, presumably involving traversing it looking\nfor some bit of information in it,\n\n4. previous versions of HTML::TreeBuilder required you to call \"$tree->delete()\" to erase the\ncontents of the tree from memory when you're done with the tree. This is not normally required\nanymore. See \"Weak References\" in HTML::Element for details.\n",
            "subsections": []
        },
        "ATTRIBUTES": {
            "content": "Most of the following attributes native to HTML::TreeBuilder control how parsing takes place;\nthey should be set *before* you try parsing into the given object. You can set the attributes by\npassing a TRUE or FALSE value as argument. E.g., \"$root->implicit_tags\" returns the current\nsetting for the \"implicit_tags\" option, \"$root->implicit_tags(1)\" turns that option on, and\n\"$root->implicit_tags(0)\" turns it off.\n\nimplicit_tags\nSetting this attribute to true will instruct the parser to try to deduce implicit elements and\nimplicit end tags. If it is false you get a parse tree that just reflects the text as it stands,\nwhich is unlikely to be useful for anything but quick and dirty parsing. (In fact, I'd be\ncurious to hear from anyone who finds it useful to have \"implicit_tags\" set to false.) Default\nis true.\n\nImplicit elements have the \"_implicit\" in HTML::Element attribute set.\n\nimplicit_body_p_tag\nThis controls an aspect of implicit element behavior, if \"implicit_tags\" is on: If a text\nelement (PCDATA) or a phrasal element (such as \"<em>\") is to be inserted under \"<body>\", two\nthings can happen: if \"implicit_body_p_tag\" is true, it's placed under a new, implicit \"<p>\"\ntag. (Past DTDs suggested this was the only correct behavior, and this is how past versions of\nthis module behaved.) But if \"implicit_body_p_tag\" is false, nothing is implicated -- the PCDATA\nor phrasal element is simply placed under \"<body>\". Default is false.\n\nno_expand_entities\nThis attribute controls whether entities are decoded during the initial parse of the source.\nEnable this if you don't want entities decoded to their character value. e.g. '&amp;' is decoded\nto '&' by default, but will be unchanged if this is enabled. Default is false (entities will be\ndecoded.)\n\nignore_unknown\nThis attribute controls whether unknown tags should be represented as elements in the parse\ntree, or whether they should be ignored. 
Default is true (to ignore unknown tags.)\n\nignore_text\nDo not represent the text content of elements. This saves space if all you want is to examine\nthe structure of the document. Default is false.\n\nignore_ignorable_whitespace\nIf set to true, TreeBuilder will try to avoid creating ignorable whitespace text nodes in the\ntree. Default is true. (In fact, I'd be interested in hearing if there's ever a case where you\nneed this off, or where leaving it on leads to incorrect behavior.)\n\nno_space_compacting\nThis determines whether TreeBuilder compacts all whitespace strings in the document (well,\noutside of PRE or TEXTAREA elements), or leaves them alone. Normally (default, value of 0), each\nstring of contiguous whitespace in the document is turned into a single space. But that's not\ndone if \"no_space_compacting\" is set to 1.\n\nSetting \"no_space_compacting\" to 1 might be useful if you want to read in a tree just to make\nsome minor changes to it before writing it back out.\n\nThis method is experimental. If you use it, be sure to report any problems you might have with\nit.\n\np_strict\nIf set to true (and it defaults to false), TreeBuilder will take a narrower than normal view of\nwhat can be under a \"<p>\" element; if it sees a non-phrasal element about to be inserted under a\n\"<p>\", it will close that \"<p>\". Otherwise it will close \"<p>\" elements only for other \"<p>\"'s,\nheadings, and \"<form>\" (although the latter may be removed in future versions).\n\nFor example, when going through this snippet of code,\n\n<p>stuff\n<ul>\n\nTreeBuilder will normally (with \"p_strict\" false) put the \"<ul>\" element under the \"<p>\"\nelement. 
However, with \"p_strict\" set to true, it will close the \"<p>\" first.\n\nIn theory, there should be strictness options like this for other/all elements besides just\n\"<p>\"; but I treat this as a special case simply because of the fact that \"<p>\" occurs so\nfrequently and its end-tag is omitted so often; and also because application of strictness rules\nat parse-time across all elements often makes tiny errors in HTML coding produce drastically bad\nparse-trees, in my experience.\n\nIf you find that you wish you had an option like this to enforce content-models on all elements,\nthen I suggest that what you want is content-model checking as a stage after TreeBuilder has\nfinished parsing.\n\nstore_comments\nThis determines whether TreeBuilder will normally store comments found while parsing content\ninto $root. Currently, this is off by default.\n\nstore_declarations\nThis determines whether TreeBuilder will normally store markup declarations found while parsing\ncontent into $root. This is on by default.\n\nstore_pis\nThis determines whether TreeBuilder will normally store processing instructions found while\nparsing content into $root -- assuming a recent version of HTML::Parser (old versions won't\nparse PIs correctly). Currently, this is off (false) by default.\n\nIt is somewhat of a known bug (to be fixed one of these days, if anyone needs it?) that PIs in\nthe preamble (before the \"<html>\" start-tag) end up actually *under* the \"<html>\" element.\n\nwarn\nThis determines whether syntax errors during parsing should generate warnings, emitted via\nPerl's \"warn\" function.\n\nThis is off (false) by default.\n",
            "subsections": []
        },
        "METHODS": {
            "content": "Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods\ninherited from HTML::Parser are used for building the HTML tree, and the methods inherited from\nHTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder)\ndocumentation, you must also carefully read the HTML::Element documentation, and also skim the\nHTML::Parser documentation -- probably only its parse and parse_file methods are of interest.\n\nnew_from_file\n$root = HTML::TreeBuilder->new_from_file($filename_or_filehandle);\n\nThis \"shortcut\" constructor merely combines constructing a new object (with the \"new\" method,\nbelow), and calling \"$new->parse_file(...)\" on it. Returns the new object. Note that this\nprovides no way of setting any parse options like \"store_comments\" (for that, call \"new\", and\nthen set options, before calling \"parse_file\"). See the notes (below) on parameters to\n\"parse_file\".\n\nIf HTML::TreeBuilder is unable to read the file, then \"new_from_file\" dies. The error can also\nbe found in $!. (This behavior is new in HTML-Tree 5. Previous versions returned a tree with\nonly implicit elements.)\n\nnew_from_content\n$root = HTML::TreeBuilder->new_from_content(...);\n\nThis \"shortcut\" constructor merely combines constructing a new object (with the \"new\" method,\nbelow), and calling \"for(...){$new->parse($_)}\" and \"$new->eof\" on it. Returns the new object.\nNote that this provides no way of setting any parse options like \"store_comments\" (for that,\ncall \"new\", and then set options, before calling \"parse\"). 
Example usages:\n\"HTML::TreeBuilder->new_from_content(@lines)\", or\n\"HTML::TreeBuilder->new_from_content($content)\".\n\nnew_from_url\n$root = HTML::TreeBuilder->new_from_url($url)\n\nThis \"shortcut\" constructor combines constructing a new object (with the \"new\" method, below),\nloading LWP::UserAgent, fetching the specified URL, and calling \"$new->parse(\n$response->decoded_content)\" and \"$new->eof\" on it. Returns the new object. Note that this\nprovides no way of setting any parse options like \"store_comments\".\n\nIf LWP is unable to fetch the URL, or the response is not HTML (as determined by\n\"content_is_html\" in HTTP::Headers), then \"new_from_url\" dies, and the HTTP::Response object is\nfound in $HTML::TreeBuilder::lwp_response.\n\nYou must have installed LWP::UserAgent for this method to work. LWP is not installed\nautomatically, because it's a large set of modules and you might not need it.\n\nnew\n$root = HTML::TreeBuilder->new();\n\nThis creates a new HTML::TreeBuilder object. This method takes no attributes.\n\nparse_file\n$root->parse_file(...)\n\n[An important method inherited from HTML::Parser, which see. Current versions of HTML::Parser\ncan take a filespec, or a filehandle object (like *FOO, or some object from class IO::Handle,\nIO::File, IO::Socket) or the like. I think you should check that a given file exists *before*\ncalling \"$root->parse_file($filespec)\".]\n\nWhen you pass a filename to \"parse_file\", HTML::Parser opens it in binary mode, which means it's\ninterpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16,\nthis will not do the right thing.\n\nOne solution is to open the file yourself using the proper \":encoding\" layer, and pass the\nfilehandle to \"parse_file\". 
You can automate this process by using \"html_file\" in IO::HTML,\nwhich will use the HTML5 encoding sniffing algorithm to automatically determine the proper\n\":encoding\" layer and apply it.\n\nIn the next major release of HTML-Tree, I plan to have it use IO::HTML automatically. If you\nreally want your file opened in binary mode, you should open it yourself and pass the filehandle\nto \"parse_file\".\n\nThe return value is \"undef\" if there's an error opening the file. In that case, the error will\nbe in $!.\n\nparse\n$root->parse(...)\n\n[An important method inherited from HTML::Parser, which see. See the note below for\n\"$root->eof()\".]\n\neof\n$root->eof();\n\nThis signals that you're finished parsing content into this tree; this runs various kinds of\ncrucial cleanup on the tree. This is called *for you* when you call \"$root->parse_file(...)\",\nbut not when you call \"$root->parse(...)\". So if you call \"$root->parse(...)\", then you *must*\ncall \"$root->eof()\" once you've finished feeding all the chunks to \"parse(...)\", and before you\nactually start doing anything else with the tree in $root.\n\nparse_content\n$root->parse_content(...);\n\nBasically a handy alias for \"$root->parse(...); $root->eof\". Takes the exact same arguments as\n\"$root->parse()\".\n\ndelete\n$root->delete();\n\n[A previously important method inherited from HTML::Element, which see.]\n\nelementify\n$root->elementify();\n\nThis changes the class of the object in $root from HTML::TreeBuilder to the class used for all\nthe rest of the elements in that tree (generally HTML::Element). Returns $root.\n\nFor most purposes, this is unnecessary, but if you call this after (after!!) you've finished\nbuilding a tree, then it keeps you from accidentally trying to call anything but HTML::Element\nmethods on it. 
(I.e., if you accidentally call \"$root->parse_file(...)\" on the already-complete\nand elementified tree, then instead of charging ahead and *wreaking havoc*, it'll throw a fatal\nerror -- since $root is now an object just of class HTML::Element which has no \"parse_file\"\nmethod.)\n\nNote that \"elementify\" currently deletes all the private attributes of $root except for \"_tag\",\n\"_parent\", \"_content\", \"_pos\", and \"_implicit\". If anyone requests that I change this to leave\nin yet more private attributes, I might do so, in future versions.\n\nguts\n@nodes = $root->guts();\n$parent_for_nodes = $root->guts();\n\nIn list context (as in the first case), this method returns the topmost non-implicit nodes in a\ntree. This is useful when you're parsing HTML code that you know isn't a complete HTML\ndocument, but instead just a fragment of an HTML document. For example, if you wanted the parse\ntree for a file consisting of just this:\n\n<li>I like pie!\n\nThen you would get that with \"@nodes = $root->guts();\". It so happens that in this case, @nodes\nwill contain just one element object, representing the \"<li>\" node (with \"I like pie!\" being its\ntext child node). However, consider if you were parsing this:\n\n<hr>Hooboy!<hr>\n\nIn that case, \"$root->guts()\" would return three items: an element object for the first \"<hr>\",\na text string \"Hooboy!\", and another \"<hr>\" element object.\n\nFor cases where you want definitely one element (so you can treat it as a \"document fragment\",\nroughly speaking), call \"guts()\" in scalar context, as in \"$parent_for_nodes = $root->guts()\".\nThat works like \"guts()\" in list context, with these differences: if \"guts()\" in list context\nwould have returned exactly one value, and that value would have been an object (as opposed to a\ntext string), then that's what \"guts\" in scalar context will return. Otherwise, if \"guts()\" in list context would\nhave returned no values at all, then \"guts()\" in scalar context returns undef. 
In all other\ncases, \"guts()\" in scalar context returns an implicit \"<div>\" element node, with children\nconsisting of whatever nodes \"guts()\" in list context would have returned. Note that that may\ndetach those nodes from $root's tree.\n\ndisembowel\n@nodes = $root->disembowel();\n$parent_for_nodes = $root->disembowel();\n\nThe \"disembowel()\" method works just like the \"guts()\" method, except that disembowel\ndefinitively destroys the tree above the nodes that are returned. Usually when you want the guts\nfrom a tree, you're just going to toss out the rest of the tree anyway, so this saves you the\nbother. (Remember, \"disembowel\" means \"remove the guts from\".)\n",
            "subsections": []
        },
        "INTERNAL METHODS": {
            "content": "You should not need to call any of the following methods directly.\n\nelement_class\n$classname = $h->element_class;\n\nThis method returns the class which will be used for new elements. It defaults to HTML::Element,\nbut can be overridden by subclassing or esoteric means best left to those who will read the\nsource and then not complain when those esoteric means change. (Just subclass.)\n\ncomment\nAccept a \"here's a comment\" signal from HTML::Parser.\n\ndeclaration\nAccept a \"here's a markup declaration\" signal from HTML::Parser.\n\ndone\nTODO: document\n\nend\nEither: Accept an end-tag signal from HTML::Parser Or: Method for closing currently open\nelements in some fairly complex way, as used by other methods in this class.\n\nTODO: Why is this hidden?\n\nprocess\nAccept a \"here's a PI\" signal from HTML::Parser.\n\nstart\nAccept a signal from HTML::Parser for start-tags.\n\nTODO: Why is this hidden?\n\nstunt\nTODO: document\n\nstunted\nTODO: document\n\ntext\nAccept a \"here's a text token\" signal from HTML::Parser.\n\nTODO: Why is this hidden?\n\ntighten_up\nLegacy\n\nRedirects to \"delete_ignorable_whitespace\" in HTML::Element.\n\nwarning\nWrapper for CORE::warn\n\nTODO: why not just use carp?\n",
            "subsections": []
        },
        "SUBROUTINES": {
            "content": "DEBUG\nAre we in Debug mode? This is a constant subroutine, to allow compile-time optimizations. To\ncontrol debug mode, set $HTML::TreeBuilder::DEBUG *before* loading HTML::TreeBuilder.\n",
            "subsections": []
        },
        "HTML AND ITS DISCONTENTS": {
            "content": "HTML is rather harder to parse than people who write it generally suspect.\n\nHere's the problem: HTML is a kind of SGML that permits \"minimization\" and \"implication\". In\nshort, this means that you don't have to close every tag you open (because the opening of a\nsubsequent tag may implicitly close it), and if you use a tag that can't occur in the context\nyou seem to be using it in, under certain conditions the parser will be able to realize you mean to\nleave the current context and enter the new one, that being the only one that your code could\ncorrectly be interpreted in.\n\nNow, this would all work flawlessly and unproblematically if: 1) all the rules that both\nprescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of\nthese rules and wrote their code in compliance with them.\n\nHowever, it didn't happen that way, and so most HTML pages are difficult if not impossible to\ncorrectly parse with nearly any set of straightforward SGML rules. That's why the internals of\nHTML::TreeBuilder consist of lots and lots of special cases -- instead of being just a generic\nSGML parser with HTML DTD rules plugged in.\n\nTRANSLATIONS?\nThe techniques that HTML::TreeBuilder uses to perform what I consider very robust parses on\neveryday code are not things that can work only in Perl. To date, the algorithms at the center\nof HTML::TreeBuilder have been implemented only in Perl, as far as I know; and I don't foresee\ngetting around to implementing them in any other language any time soon.\n\nIf, however, anyone is looking for a semester project for an applied programming class (or if\nthey merely enjoy *extra-curricular* masochism), they might do well to see about choosing as a\ntopic the implementation/adaptation of these routines to any other interesting programming\nlanguage that you feel currently suffers from a lack of robust HTML-parsing. 
I welcome\ncorrespondence on this subject, and point out that one can learn a great deal about languages by\ntrying to translate between them, and then comparing the result.\n\nThe HTML::TreeBuilder source may seem long and complex, but it is rather well commented, and\nsymbol names are generally self-explanatory. (You are encouraged to read the Mozilla HTML parser\nsource for comparison.) Some of the complexity comes from little-used features, and some of it\ncomes from having the HTML tokenizer (HTML::Parser) being a separate module, requiring somewhat\nof a different interface than you'd find in a combined tokenizer and tree-builder. But most of\nthe length of the source comes from the fact that it's essentially a long list of special cases,\nwith lots and lots of sanity-checking, and sanity-recovery -- because, as Roseanne Rosannadanna\nonce said, \"it's always *something*\".\n\nUsers looking to compare several HTML parsers should look at the source for Raggett's Tidy\n(\"<http://www.w3.org/People/Raggett/tidy/>\"), Mozilla (\"<http://www.mozilla.org/>\"), and\npossibly root around the browsers section of Yahoo to find the various open-source ones\n(\"<http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Browsers/>\").\n",
            "subsections": []
        },
        "BUGS": {
            "content": "* Framesets seem to work correctly now. Email me if you get a strange parse from a document with\nframesets.\n\n* Really bad HTML code will, often as not, make for a somewhat objectionable parse tree.\nRegrettable, but unavoidably true.\n\n* If you're running with \"implicit_tags\" off (God help you!), consider that\n\"$tree->content_list\" probably contains the tree or grove from the parse, and not $tree itself\n(which will, oddly enough, be an implicit \"<html>\" element). This seems counter-intuitive and\nproblematic; but seeing as how almost no HTML ever parses correctly with \"implicit_tags\" off,\nthis interface oddity seems the least of your problems.\n",
            "subsections": []
        },
        "BUG REPORTS": {
            "content": "When a document parses in a way different from how you think it should, I ask that you report\nthis to me as a bug. The first thing you should do is copy the document, trim out as much of it\nas you can while still producing the bug in question, and *then* email me that mini-document\n*and* the code you're using to parse it, to the HTML::Tree bug queue at\n\"<bug-html-tree at rt.cpan.org>\".\n\nInclude a note as to how it parses (presumably including its \"$tree->dump\" output), and then a\n*careful and clear* explanation of where you think the parser is going astray, and how you would\nprefer that it work instead.\n",
            "subsections": []
        },
        "SEE ALSO": {
            "content": "For more information about the HTML-Tree distribution: HTML::Tree.\n\nModules used by HTML::TreeBuilder: HTML::Parser, HTML::Element, HTML::Tagset.\n\nFor converting between XML::DOM::Node, HTML::Element, and XML::Element trees: HTML::DOMbo.\n\nFor opening a HTML file with automatic charset detection: IO::HTML.\n",
            "subsections": []
        },
        "AUTHOR": {
            "content": "Current maintainers:\n\n*   Christopher J. Madsen \"<perl AT cjmweb.net>\"\n\n*   Jeff Fearn \"<jfearn AT cpan.org>\"\n\nOriginal HTML-Tree author:\n\n*   Gisle Aas\n\nFormer maintainers:\n\n*   Sean M. Burke\n\n*   Andy Lester\n\n*   Pete Krawczyk \"<petek AT cpan.org>\"\n\nYou can follow or contribute to HTML-Tree's development at\n<https://github.com/kentfredric/HTML-Tree>.\n",
            "subsections": []
        },
        "COPYRIGHT AND LICENSE": {
            "content": "Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy Lester, 2006 Pete Krawczyk,\n2010 Jeff Fearn, 2012 Christopher J. Madsen.\n\nThis library is free software; you can redistribute it and/or modify it under the same terms as\nPerl itself.\n\nThe programs in this library are distributed in the hope that they will be useful, but without\nany warranty; without even the implied warranty of merchantability or fitness for a particular\npurpose.\n",
            "subsections": []
        }
    },
    "summary": "HTML::TreeBuilder - Parser that builds a HTML syntax tree",
    "flags": [],
    "examples": [],
    "see_also": []
}