{
    "content": [
        {
            "type": "text",
            "text": "# XML::SAX::Intro (perldoc)\n\n## NAME\n\nXML::SAX::Intro - An Introduction to SAX Parsing with Perl\n\n## Sections\n\n- **NAME**\n- **Introduction**\n- **Replacing XML::Parser**\n- **Introducing SAX**\n- **Callback Parameters**\n- **Tip of the iceberg**\n- **AUTHOR**\n\nUse structuredContent.sections for detailed options, examples, and full documentation.\n"
        }
    ],
    "structuredContent": {
        "command": "XML::SAX::Intro",
        "section": "",
        "mode": "perldoc",
        "summary": "XML::SAX::Intro - An Introduction to SAX Parsing with Perl",
        "synopsis": null,
        "tldr_summary": null,
        "tldr_examples": [],
        "tldr_source": null,
        "flags": [],
        "examples": [],
        "see_also": [],
        "section_outline": [
            {
                "name": "NAME",
                "lines": 2,
                "subsections": []
            },
            {
                "name": "Introduction",
                "lines": 6,
                "subsections": []
            },
            {
                "name": "Replacing XML::Parser",
                "lines": 34,
                "subsections": []
            },
            {
                "name": "Introducing SAX",
                "lines": 74,
                "subsections": []
            },
            {
                "name": "Callback Parameters",
                "lines": 106,
                "subsections": []
            },
            {
                "name": "Tip of the iceberg",
                "lines": 77,
                "subsections": []
            },
            {
                "name": "AUTHOR",
                "lines": 4,
                "subsections": []
            }
        ],
        "sections": {
            "NAME": {
                "content": "XML::SAX::Intro - An Introduction to SAX Parsing with Perl\n",
                "subsections": []
            },
            "Introduction": {
                "content": "XML::SAX is a new way to work with XML Parsers in Perl. In this article we'll discuss why you\nshould be using SAX, why you should be using XML::SAX, and we'll see some of the finer\nimplementation details. The text below assumes some familiarity with callback, or push based\nparsing, but if you are unfamiliar with these techniques then a good place to start is Kip\nHampton's excellent series of articles on XML.com.\n",
                "subsections": []
            },
            "Replacing XML::Parser": {
                "content": "The de-facto way of parsing XML under perl is to use Larry Wall and Clark Cooper's XML::Parser.\nThis module is a Perl and XS wrapper around the expat XML parser library by James Clark. It has\nbeen a hugely successful project, but suffers from a couple of rather major flaws. Firstly it is\na proprietary API, designed before the SAX API was conceived, which means that it is not easily\nreplaceable by other streaming parsers. Secondly it's callbacks are subrefs. This doesn't sound\nlike much of an issue, but unfortunately leads to code like:\n\nsub handlestart {\nmy ($e, $el, %attrs) = @;\nif ($el eq 'foo') {\n$e->{insidefoo}++; # BAD! $e is an XML::Parser::Expat object.\n}\n}\n\nAs you can see, we're using the $e object to hold our state information, which is a bad idea\nbecause we don't own that object - we didn't create it. It's an internal object of XML::Parser,\nthat happens to be a hashref. We could all too easily overwrite XML::Parser internal state\nvariables by using this, or Clark could change it to an array ref (not that he would, because it\nwould break so much code, but he could).\n\nThe only way currently with XML::Parser to safely maintain state is to use a closure:\n\nmy $state = MyState->new();\n$parser->setHandlers(Start => sub { handlestart($state, @) });\n\nThis closure traps the $state variable, which now gets passed as the first parameter to your\ncallback. Unfortunately very few people use this technique, as it is not documented in the\nXML::Parser POD files.\n\nAnother reason you might not want to use XML::Parser is because you need some feature that it\ndoesn't provide (such as validation), or you might need to use a library that doesn't use expat,\ndue to it not being installed on your system, or due to having a restrictive ISP. Using SAX\nallows you to work around these restrictions.\n",
                "subsections": []
            },
            "Introducing SAX": {
                "content": "SAX stands for the Simple API for XML. And simple it really is. Constructing a SAX parser and\npassing events to handlers is done as simply as:\n\nuse XML::SAX;\nuse MySAXHandler;\n\nmy $parser = XML::SAX::ParserFactory->parser(\nHandler => MySAXHandler->new\n);\n\n$parser->parseuri(\"foo.xml\");\n\nThe important concept to grasp here is that SAX uses a factory class called\nXML::SAX::ParserFactory to create a new parser instance. The reason for this is so that you can\nsupport other underlying parser implementations for different feature sets. This is one thing\nthat XML::Parser has always sorely lacked.\n\nIn the code above we see the parseuri method used, but we could have equally well called\nparsefile, parsestring, or parse(). Please see XML::SAX::Base for what these methods take as\nparameters, but don't be fooled into believing parsefile takes a filename. No, it takes a file\nhandle, a glob, or a subclass of IO::Handle. Beware.\n\nSAX works very similarly to XML::Parser's default callback method, except it has one major\ndifference: rather than setting individual callbacks, you create a new class in which to receive\nthe callbacks. Each callback is called as a method call on an instance of that handler class. An\nexample will best demonstrate this:\n\npackage MySAXHandler;\nuse base qw(XML::SAX::Base);\n\nsub startdocument {\nmy ($self, $doc) = @;\n# process document start event\n}\n\nsub startelement {\nmy ($self, $el) = @;\n# process element start event\n}\n\nNow, when we instantiate this as above, and parse some XML with this as the handler, the methods\nstartdocument and startelement will be called as method calls, so this would be the equivalent\nof directly calling:\n\n$object->startelement($el);\n\nNotice how this is different to XML::Parser's calling style, which calls:\n\nstartelement($e, $name, %attribs);\n\nIt's the difference between function calling and method calling which allows you to subclass SAX\nhandlers which contributes to SAX being a powerful solution.\n\nAs you can see, unlike XML::Parser, we have to define a new package in which to do our\nprocessing (there are hacks you can do to make this uneccessary, but I'll leave figuring those\nout to the experts). The biggest benefit of this is that you maintain your own state variable\n($self in the above example) thus freeing you of the concerns listed above. It is also an\nimprovement in maintainability - you can place the code in a separate file if you wish to, and\nyour callback methods are always called the same thing, rather than having to choose a suitable\nname for them as you had to with XML::Parser. This is an obvious win.\n\nSAX parsers are also very flexible in how you pass a handler to them. You can use a constructor\nparameter as we saw above, or we can pass the handler directly in the call to one of the parse\nmethods:\n\n$parser->parse(Handler => $handler,\nSource => { SystemId => \"foo.xml\" });\n# or...\n$parser->parsefile($fh, Handler => $handler);\n\nThis flexibility allows for one parser to be used in many different scenarios throughout your\nscript (though one shouldn't feel pressure to use this method, as parser construction is\ngenerally not a time consuming process).\n",
                "subsections": []
            },
            "Callback Parameters": {
                "content": "The only other thing you need to know to understand basic SAX is the structure of the parameters\npassed to each of the callbacks. In XML::Parser, all parameters are passed as multiple options\nto the callbacks, so for example the Start callback would be called as mystart($e, $name,\n%attributes), and the PI callback would be called as myprocessinginstruction($e, $target,\n$data). In SAX, every callback is passed a hash reference, containing entries that define our\n\"node\". The key callbacks and the structures they receive are:\n\nstartelement\nThe startelement handler is called whenever a parser sees an opening tag. It is passed an\nelement structure consisting of:\n\nLocalName\nThe name of the element minus any namespace prefix it may have come with in the document.\n\nNamespaceURI\nThe URI of the namespace associated with this element, or the empty string for none.\n\nAttributes\nA set of attributes as described below.\n\nName\nThe name of the element as it was seen in the document (i.e. including any prefix associated\nwith it)\n\nPrefix\nThe prefix used to qualify this element's namespace, or the empty string if none.\n\nThe Attributes are a hash reference, keyed by what we have called \"James Clark\" notation. This\nmeans that the attribute name has been expanded to include any associated namespace URI, and put\ntogether as {ns}name, where \"ns\" is the expanded namespace URI of the attribute if and only if\nthe attribute had a prefix, and \"name\" is the LocalName of the attribute.\n\nThe value of each entry in the attributes hash is another hash structure consisting of:\n\nLocalName\nThe name of the attribute minus any namespace prefix it may have come with in the document.\n\nNamespaceURI\nThe URI of the namespace associated with this attribute. If the attribute had no prefix,\nthen this consists of just the empty string.\n\nName\nThe attribute's name as it appeared in the document, including any namespace prefix.\n\nPrefix\nThe prefix used to qualify this attribute's namepace, or the empty string if none.\n\nValue\nThe value of the attribute.\n\nSo a full example, as output by Data::Dumper might be:\n\n....\n\nendelement\nThe endelement handler is called either when a parser sees a closing tag, or after\nstartelement has been called for an empty element (do note however that a parser may if it is\nso inclined call characters with an empty string when it sees an empty element. There is no\nsimple way in SAX to determine if the parser in fact saw an empty element, a start and end\nelement with no content..\n\nThe endelement handler receives exactly the same structure as startelement, minus the\nAttributes entry. One must note though that it should not be a reference to the same data as\nstartelement receives, so you may change the values in startelement but this will not affect\nthe values later seen by endelement.\n\ncharacters\nThe characters callback may be called in several circumstances. The most obvious one is when\nseeing ordinary character data in the markup. But it is also called for text in a CDATA section,\nand is also called in other situations. A SAX parser has to make no guarantees whatsoever about\nhow many times it may call characters for a stretch of text in an XML document - it may call\nonce, or it may call once for every character in the text. In order to work around this it is\noften important for the SAX developer to use a bundling technique, where text is gathered up and\nprocessed in one of the other callbacks. This is not always necessary, but it is a worthwhile\ntechnique to learn, which we will cover in XML::SAX::Advanced (when I get around to writing it).\n\nThe characters handler is called with a very simple structure - a hash reference consisting of\njust one entry:\n\nData\nThe text data that was received.\n\ncomment\nThe comment callback is called for comment text. Unlike with \"characters()\", the comment\ncallback *must* be invoked just once for an entire comment string. It receives a single simple\nstructure - a hash reference containing just one entry:\n\nData\nThe text of the comment.\n\nprocessinginstruction\nThe processing instruction handler is called for all processing instructions in the document.\nNote that these processing instructions may appear before the document root element, or after\nit, or anywhere where text and elements would normally appear within the document, according to\nthe XML specification.\n\nThe handler is passed a structure containing just two entries:\n\nTarget\nThe target of the processing instruction\n\nData\nThe text data in the processing instruction. Can be an empty string for a processing\ninstruction that has no data element. For example <?wiggle?> is a perfectly valid processing\ninstruction.\n",
                "subsections": []
            },
            "Tip of the iceberg": {
                "content": "What we have discussed above is really the tip of the SAX iceberg. And so far it looks like\nthere's not much of interest to SAX beyond what we have seen with XML::Parser. But it does go\nmuch further than that, I promise.\n\nPeople who hate Object Oriented code for the sake of it may be thinking here that creating a new\npackage just to parse something is a waste when they've been parsing things just fine up to now\nusing procedural code. But there's reason to all this madness. And that reason is SAX Filters.\n\nAs you saw right at the very start, to let the parser know about our class, we pass it an\ninstance of our class as the Handler to the parser. But now imagine what would happen if our\nclass could also take a Handler option, and simply do some processing and pass on our data\nfurther down the line? That in a nutshell is how SAX filters work. It's Unix pipes for the 21st\ncentury!\n\nThere are two downsides to this. Number 1 - writing SAX filters can be tricky. If you look into\nthe future and read the advanced tutorial I'm writing, you'll see that Handler can come in\nseveral shapes and sizes. So making sure your filter does the right thing can be tricky.\nSecondly, constructing complex filter chains can be difficult, and simple thinking tells us that\nwe only get one pass at our document, when often we'll need more than that.\n\nLuckily though, those downsides have been fixed by the release of two very cool modules. What's\neven better is that I didn't write either of them!\n\nThe first module is XML::SAX::Base. This is a VITAL SAX module that acts as a base class for all\nSAX parsers and filters. It provides an abstraction away from calling the handler methods, that\nmakes sure your filter or parser does the right thing, and it does it FAST. So, if you ever need\nto write a SAX filter, which if you're processing XML -> XML, or XML -> HTML, then you probably\ndo, then you need to be writing it as a subclass of XML::SAX::Base. Really - this is advice not\nto ignore lightly. I will not go into the details of writing a SAX filter here. Kip Hampton, the\nauthor of XML::SAX::Base has covered this nicely in his article on XML.com here <URI>.\n\nTo construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker whose modules you will\nprobably have heard of or used, wrote a very clever module called XML::SAX::Machines. This\ncombines some really clever SAX filter-type modules, with a construction toolkit for filters\nthat makes building pipelines easy. But before we see how it makes things easy, first lets see\nhow tricky it looks to build complex SAX filter pipelines.\n\nuse XML::SAX::ParserFactory;\nuse XML::Filter::Filter1;\nuse XML::Filter::Filter2;\nuse XML::SAX::Writer;\n\nmy $outputstring;\nmy $writer = XML::SAX::Writer->new(Output => \\$outputstring);\nmy $filter2 = XML::SAX::Filter2->new(Handler => $writer);\nmy $filter1 = XML::SAX::Filter1->new(Handler => $filter2);\nmy $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);\n\n$parser->parseuri(\"foo.xml\");\n\nThis is a lot easier with XML::SAX::Machines:\n\nuse XML::SAX::Machines qw(Pipeline);\n\nmy $outputstring;\nmy $parser = Pipeline(\nXML::SAX::Filter1 => XML::SAX::Filter2 => \\$outputstring\n);\n\n$parser->parseuri(\"foo.xml\");\n\nOne of the main benefits of XML::SAX::Machines is that the pipelines are constructed in natural\norder, rather than the reverse order we saw with manual pipeline construction.\nXML::SAX::Machines takes care of all the internals of pipe construction, providing you at the\nend with just a parser you can use (and you can re-use the same parser as many times as you need\nto).\n\nJust a final tip. If you ever get stuck and are confused about what is being passed from one SAX\nfilter or parser to the next, then Devel::TraceSAX will come to your rescue. This perl debugger\nplugin will allow you to dump the SAX stream of events as it goes by. Usage is really very\nsimple just call your perl script that uses SAX as follows:\n\n$ perl -d:TraceSAX <scriptname>\n\nAnd preferably pipe the output to a pager of some sort, such as more or less. The output is\nextremely verbose, but should help clear some issues up.\n",
                "subsections": []
            },
            "AUTHOR": {
                "content": "Matt Sergeant, matt@sergeant.org\n\n$Id$\n",
                "subsections": []
            }
        }
    }
}