" element node, with children consisting of whatever nodes "guts()" in list context would have returned. Note that that may detach those nodes from $root's tree. disembowel @nodes = $root->disembowel(); $parent_for_nodes = $root->disembowel(); The "disembowel()" method works just like the "guts()" method, except that disembowel definitively destroys the tree above the nodes that are returned. Usually when you want the guts from a tree, you're just going to toss out the rest of the tree anyway, so this saves you the bother. (Remember, "disembowel" means "remove the guts from".) ## INTERNAL METHODS You should not need to call any of the following methods directly. element_class $classname = $h->element_class; This method returns the class which will be used for new elements. It defaults to [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown), but can be overridden by subclassing or esoteric means best left to those will will read the source and then not complain when those esoteric means change. (Just subclass.) comment Accept a "here's a comment" signal from [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown). declaration Accept a "here's a markup declaration" signal from [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown). done TODO: document end Either: Accept an end-tag signal from [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown) Or: Method for closing currently open elements in some fairly complex way, as used by other methods in this class. TODO: Why is this hidden? process Accept a "here's a PI" signal from [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown). start Accept a signal from [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown) for start-tags. TODO: Why is this hidden? 
### stunt

TODO: document

### stunted

TODO: document

### text

Accept a "here's a text token" signal from [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown).

TODO: Why is this hidden?

### tighten_up

Legacy. Redirects to "delete_ignorable_whitespace" in [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown).

### warning

Wrapper for [CORE::warn](https://www.chedong.com/phpMan.php/perldoc/CORE%3A%3Awarn/markdown).

TODO: why not just use carp?

## SUBROUTINES

### DEBUG

Are we in Debug mode? This is a constant subroutine, to allow compile-time optimizations. To control debug mode, set $HTML::TreeBuilder::DEBUG *before* loading [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown).

## HTML AND ITS DISCONTENTS

HTML is rather harder to parse than people who write it generally suspect.

Here's the problem: HTML is a kind of SGML that permits "minimization" and "implication". In short, this means that you don't have to close every tag you open (because the opening of a subsequent tag may implicitly close it), and if you use a tag that can't occur in the context you seem to be using it in, under certain conditions the parser will be able to realize you mean to leave the current context and enter the new one, that being the only one that your code could correctly be interpreted in.

Now, this would all work flawlessly and unproblematically if: 1) all the rules that both prescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of these rules and wrote their code in compliance with them.

However, it didn't happen that way, and so most HTML pages are difficult if not impossible to correctly parse with nearly any set of straightforward SGML rules.
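
As a minimal sketch of implication in practice (not part of the original documentation), the snippet below feeds [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown) a fragment in which neither "p" tag is closed and no "html", "head", or "body" tags appear. It assumes the default parser options; the sample markup and the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: two paragraphs, neither closed, no html/head/body tags.
my $source = '<p>first paragraph<p>second paragraph';

# Parse with default options ("implicit_tags" is on by default).
my $tree = HTML::TreeBuilder->new_from_content($source);

# Collect every "p" element the builder created.
my @paras = $tree->look_down( _tag => 'p' );

# Record which element each paragraph ended up under.
my @parent_tags = map { $_->parent->tag } @paras;

for my $i ( 0 .. $#paras ) {
    printf "%s (child of %s)\n", $paras[$i]->as_text, $parent_tags[$i];
}

# Inspect the full tree, including the elements the parser implied.
$tree->dump;

# Free the tree explicitly when done.
$tree->delete;
```

If the implication rules behave as described above, both paragraphs should show up as siblings under a "body" element that the source never mentions. One rule like this is easy to state; everyday pages bend many of them at once.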
That's why the internals of [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown) consist of lots and lots of special cases -- instead of being just a generic SGML parser with HTML DTD rules plugged in.

### TRANSLATIONS?

The techniques that [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown) uses to perform what I consider very robust parses on everyday code are not things that can work only in Perl. To date, the algorithms at the center of [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown) have been implemented only in Perl, as far as I know; and I don't foresee getting around to implementing them in any other language any time soon.

If, however, anyone is looking for a semester project for an applied programming class (or if they merely enjoy *extra-curricular* masochism), they might do well to see about choosing as a topic the implementation/adaptation of these routines to any other interesting programming language that they feel currently suffers from a lack of robust HTML-parsing. I welcome correspondence on this subject, and point out that one can learn a great deal about languages by trying to translate between them, and then comparing the result.

The [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown) source may seem long and complex, but it is rather well commented, and symbol names are generally self-explanatory. (You are encouraged to read the Mozilla HTML parser source for comparison.) Some of the complexity comes from little-used features, and some of it comes from the HTML tokenizer ([HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown)) being a separate module, which requires a somewhat different interface than you'd find in a combined tokenizer and tree-builder.
But most of the length of the source comes from the fact that it's essentially a long list of special cases, with lots and lots of sanity-checking, and sanity-recovery -- because, as Roseanne Roseannadanna once said, "it's always *something*".

Users looking to compare several HTML parsers should look at the source for Raggett's Tidy, Mozilla, and possibly root around the browsers section of Yahoo to find the various open-source ones.

## BUGS

* Framesets seem to work correctly now. Email me if you get a strange parse from a document with framesets.

* Really bad HTML code will, often as not, make for a somewhat objectionable parse tree. Regrettable, but unavoidably true.

* If you're running with "implicit_tags" off (God help you!), consider that "$tree->content_list" probably contains the tree or grove from the parse, and not $tree itself (which will, oddly enough, be an implicit "html" element). This seems counter-intuitive and problematic; but seeing as how almost no HTML ever parses correctly with "implicit_tags" off, this interface oddity seems the least of your problems.

## BUG REPORTS

When a document parses in a way different from how you think it should, I ask that you report this to me as a bug. The first thing you should do is copy the document, trim out as much of it as you can while still producing the bug in question, and *then* email that mini-document *and* the code you're using to parse it to the [HTML::Tree](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATree/markdown) bug queue. Include a note as to how it parses (presumably including its "$tree->dump" output), and then a *careful and clear* explanation of where you think the parser is going astray, and how you would prefer that it work instead.

## SEE ALSO

For more information about the HTML-Tree distribution: [HTML::Tree](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATree/markdown).

Modules used by [HTML::TreeBuilder](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATreeBuilder/markdown): [HTML::Parser](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AParser/markdown), [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown), [HTML::Tagset](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ATagset/markdown).

For converting between [XML::DOM::Node](https://www.chedong.com/phpMan.php/perldoc/XML%3A%3ADOM%3A%3ANode/markdown), [HTML::Element](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3AElement/markdown), and [XML::Element](https://www.chedong.com/phpMan.php/perldoc/XML%3A%3AElement/markdown) trees: [HTML::DOMbo](https://www.chedong.com/phpMan.php/perldoc/HTML%3A%3ADOMbo/markdown).

For opening a HTML file with automatic charset detection: [IO::HTML](https://www.chedong.com/phpMan.php/perldoc/IO%3A%3AHTML/markdown).

## AUTHOR

Current maintainers:

* Christopher J. Madsen
* Jeff Fearn

Original HTML-Tree author:

* Gisle Aas

Former maintainers:

* Sean M. Burke
* Andy Lester
* Pete Krawczyk

You can follow or contribute to HTML-Tree's development at <>.

## COPYRIGHT AND LICENSE

Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn, 2012 Christopher J. Madsen.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The programs in this library are distributed in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.