\" element node, with children\nconsisting of whatever nodes \"guts()\" in list context would have returned. Note that that may\ndetach those nodes from $root's tree.\n\ndisembowel\n@nodes = $root->disembowel();\n$parentfornodes = $root->disembowel();\n\nThe \"disembowel()\" method works just like the \"guts()\" method, except that disembowel\ndefinitively destroys the tree above the nodes that are returned. Usually when you want the guts\nfrom a tree, you're just going to toss out the rest of the tree anyway, so this saves you the\nbother. (Remember, \"disembowel\" means \"remove the guts from\".)\n", "subsections": [] }, "INTERNAL METHODS": { "content": "You should not need to call any of the following methods directly.\n\nelement_class\n$classname = $h->element_class;\n\nThis method returns the class which will be used for new elements. It defaults to HTML::Element,\nbut can be overridden by subclassing or esoteric means best left to those who will read the\nsource and then not complain when those esoteric means change. (Just subclass.)\n\ncomment\nAccept a \"here's a comment\" signal from HTML::Parser.\n\ndeclaration\nAccept a \"here's a markup declaration\" signal from HTML::Parser.\n\ndone\nTODO: document\n\nend\nEither: Accept an end-tag signal from HTML::Parser Or: Method for closing currently open\nelements in some fairly complex way, as used by other methods in this class.\n\nTODO: Why is this hidden?\n\nprocess\nAccept a \"here's a PI\" signal from HTML::Parser.\n\nstart\nAccept a signal from HTML::Parser for start-tags.\n\nTODO: Why is this hidden?\n\nstunt\nTODO: document\n\nstunted\nTODO: document\n\ntext\nAccept a \"here's a text token\" signal from HTML::Parser.\n\nTODO: Why is this hidden?\n\ntighten_up\nLegacy\n\nRedirects to \"delete_ignorable_whitespace\" in HTML::Element.\n\nwarning\nWrapper for CORE::warn\n\nTODO: why not just use carp?\n", "subsections": [] }, "SUBROUTINES": { "content": "DEBUG\nAre we in Debug mode?
This is a constant subroutine, to allow compile-time optimizations. To\ncontrol debug mode, set $HTML::TreeBuilder::DEBUG *before* loading HTML::TreeBuilder.\n", "subsections": [] }, "HTML AND ITS DISCONTENTS": { "content": "HTML is rather harder to parse than people who write it generally suspect.\n\nHere's the problem: HTML is a kind of SGML that permits \"minimization\" and \"implication\". In\nshort, this means that you don't have to close every tag you open (because the opening of a\nsubsequent tag may implicitly close it), and if you use a tag that can't occur in the context\nyou seem to be using it in, under certain conditions the parser will be able to realize you mean to\nleave the current context and enter the new one, that being the only one that your code could\ncorrectly be interpreted in.\n\nNow, this would all work flawlessly and unproblematically if: 1) all the rules that both\nprescribe and describe HTML were (and had been) clearly set out, and 2) everyone was aware of\nthese rules and wrote their code in compliance with them.\n\nHowever, it didn't happen that way, and so most HTML pages are difficult if not impossible to\ncorrectly parse with nearly any set of straightforward SGML rules. That's why the internals of\nHTML::TreeBuilder consist of lots and lots of special cases -- instead of being just a generic\nSGML parser with HTML DTD rules plugged in.\n\nTRANSLATIONS?\nThe techniques that HTML::TreeBuilder uses to perform what I consider very robust parses on\neveryday code are not things that can work only in Perl.
To date, the algorithms at the center\nof HTML::TreeBuilder have been implemented only in Perl, as far as I know; and I don't foresee\ngetting around to implementing them in any other language any time soon.\n\nIf, however, anyone is looking for a semester project for an applied programming class (or if\nthey merely enjoy *extra-curricular* masochism), they might do well to see about choosing as a\ntopic the implementation/adaptation of these routines to any other interesting programming\nlanguage that you feel currently suffers from a lack of robust HTML-parsing. I welcome\ncorrespondence on this subject, and point out that one can learn a great deal about languages by\ntrying to translate between them, and then comparing the result.\n\nThe HTML::TreeBuilder source may seem long and complex, but it is rather well commented, and\nsymbol names are generally self-explanatory. (You are encouraged to read the Mozilla HTML parser\nsource for comparison.) Some of the complexity comes from little-used features, and some of it\ncomes from having the HTML tokenizer (HTML::Parser) being a separate module, requiring somewhat\nof a different interface than you'd find in a combined tokenizer and tree-builder. But most of\nthe length of the source comes from the fact that it's essentially a long list of special cases,\nwith lots and lots of sanity-checking, and sanity-recovery -- because, as Roseanne Rosannadanna\nonce said, \"it's always *something*\".\n\nUsers looking to compare several HTML parsers should look at the source for Raggett's Tidy\n(\"\"), Mozilla (\"\"), and\npossibly root around the browsers section of Yahoo to find the various open-source ones\n(\"\").\n", "subsections": [] }, "BUGS": { "content": "* Framesets seem to work correctly now. 
Email me if you get a strange parse from a document with\nframesets.\n\n* Really bad HTML code will, often as not, make for a somewhat objectionable parse tree.\nRegrettable, but unavoidably true.\n\n* If you're running with \"implicit_tags\" off (God help you!), consider that\n\"$tree->content_list\" probably contains the tree or grove from the parse, and not $tree itself\n(which will, oddly enough, be an implicit \"\" element). This seems counter-intuitive and\nproblematic; but seeing as how almost no HTML ever parses correctly with \"implicit_tags\" off,\nthis interface oddity seems the least of your problems.\n", "subsections": [] }, "BUG REPORTS": { "content": "When a document parses in a way different from how you think it should, I ask that you report\nthis to me as a bug. The first thing you should do is copy the document, trim out as much of it\nas you can while still producing the bug in question, and *then* email me that mini-document\n*and* the code you're using to parse it, to the HTML::Tree bug queue at\n\"\".\n\nInclude a note as to how it parses (presumably including its \"$tree->dump\" output), and then a\n*careful and clear* explanation of where you think the parser is going astray, and how you would\nprefer that it work instead.\n", "subsections": [] }, "SEE ALSO": { "content": "For more information about the HTML-Tree distribution: HTML::Tree.\n\nModules used by HTML::TreeBuilder: HTML::Parser, HTML::Element, HTML::Tagset.\n\nFor converting between XML::DOM::Node, HTML::Element, and XML::Element trees: HTML::DOMbo.\n\nFor opening a HTML file with automatic charset detection: IO::HTML.\n", "subsections": [] }, "AUTHOR": { "content": "Current maintainers:\n\n* Christopher J. Madsen \"\"\n\n* Jeff Fearn \"\"\n\nOriginal HTML-Tree author:\n\n* Gisle Aas\n\nFormer maintainers:\n\n* Sean M.
Burke\n\n* Andy Lester\n\n* Pete Krawczyk \"\"\n\nYou can follow or contribute to HTML-Tree's development at\n.\n", "subsections": [] }, "COPYRIGHT AND LICENSE": { "content": "Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy Lester, 2006 Pete Krawczyk,\n2010 Jeff Fearn, 2012 Christopher J. Madsen.\n\nThis library is free software; you can redistribute it and/or modify it under the same terms as\nPerl itself.\n\nThe programs in this library are distributed in the hope that they will be useful, but without\nany warranty; without even the implied warranty of merchantability or fitness for a particular\npurpose.\n", "subsections": [] } }, "summary": "HTML::TreeBuilder - Parser that builds a HTML syntax tree", "flags": [], "examples": [], "see_also": [] }