| # sax js |
| |
| A sax-style parser for XML and HTML. |
| |
| Designed with [node](http://nodejs.org/) in mind, but should work fine in |
| the browser or other CommonJS implementations. |
| |
| ## What This Is |
| |
| * A very simple tool to parse through an XML string. |
| * A stepping stone to a streaming HTML parser. |
| * A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML |
| docs. |
| |
| ## What This Is (probably) Not |
| |
| * An HTML Parser - That's a fine goal, but this isn't it. It's just |
| XML. |
| * A DOM Builder - You can use it to build an object model out of XML, |
| but it doesn't do that out of the box. |
| * XSLT - No DOM = no querying. |
| * 100% Compliant with (some other SAX implementation) - Most SAX |
| implementations are in Java and do a lot more than this does. |
| * An XML Validator - It does a little validation when in strict mode, but |
| not much. |
| * A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic |
| masochism. |
| * A DTD-aware Thing - Fetching DTDs is a much bigger job. |
| |
| ## Regarding `<!DOCTYPE`s and `<!ENTITY`s |
| |
| The parser will handle the basic XML entities in text nodes and attribute |
| values: `& < > ' "`. It's possible to define additional |
| entities in XML by putting them in the DTD. This parser doesn't do anything |
| with that. If you want to listen to the `ondoctype` event, and then fetch |
| the doctypes, and read the entities and add them to `parser.ENTITIES`, then |
| be my guest. |
| |
| Unknown entities will fail in strict mode, and in loose mode, will pass |
| through unmolested. |
| |
| ## Usage |
| |
| ```javascript |
| var sax = require("./lib/sax"), |
| strict = true, // set to false for html-mode |
| parser = sax.parser(strict); |
| |
| parser.onerror = function (e) { |
| // an error happened. |
| }; |
| parser.ontext = function (t) { |
| // got some text. t is the string of text. |
| }; |
| parser.onopentag = function (node) { |
| // opened a tag. node has "name" and "attributes" |
| }; |
| parser.onattribute = function (attr) { |
| // an attribute. attr has "name" and "value" |
| }; |
| parser.onend = function () { |
| // parser stream is done, and ready to have more stuff written to it. |
| }; |
| |
| parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close(); |
| |
| // stream usage |
| // takes the same options as the parser |
| var saxStream = require("sax").createStream(strict, options) |
| saxStream.on("error", function (e) { |
| // unhandled errors will throw, since this is a proper node |
| // event emitter. |
| console.error("error!", e) |
| // clear the error |
| this._parser.error = null |
| this._parser.resume() |
| }) |
| saxStream.on("opentag", function (node) { |
| // same object as above |
| }) |
| // pipe is supported, and it's readable/writable |
| // same chunks coming in also go out. |
| fs.createReadStream("file.xml") |
| .pipe(saxStream) |
| .pipe(fs.createWriteStream("file-copy.xml")) |
| ``` |
| |
| |
| ## Arguments |
| |
| Pass the following arguments to the parser function. All are optional. |
| |
| `strict` - Boolean. Whether or not to be a jerk. Default: `false`. |
| |
| `opt` - Object bag of settings regarding string formatting. All default to `false`. |
| |
| Settings supported: |
| |
| * `trim` - Boolean. Whether or not to trim text and comment nodes. |
| * `normalize` - Boolean. If true, then turn any whitespace into a single |
| space. |
| * `lowercase` - Boolean. If true, then lowercase tag names and attribute names |
| in loose mode, rather than uppercasing them. |
| * `xmlns` - Boolean. If true, then namespaces are supported. |
| * `position` - Boolean. If false, then don't track line/col/position. |
| * `strictEntities` - Boolean. If true, only parse [predefined XML |
| entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent) |
| (`&`, `'`, `>`, `<`, and `"`) |
| |
| ## Methods |
| |
| `write` - Write bytes onto the stream. You don't have to do this all at |
| once. You can keep writing as much as you want. |
| |
| `close` - Close the stream. Once closed, no more data may be written until |
| it is done processing the buffer, which is signaled by the `end` event. |
| |
| `resume` - To gracefully handle errors, assign a listener to the `error` |
| event. Then, when the error is taken care of, you can call `resume` to |
| continue parsing. Otherwise, the parser will not continue while in an error |
| state. |
| |
| ## Members |
| |
| At all times, the parser object will have the following members: |
| |
| `line`, `column`, `position` - Indications of the position in the XML |
| document where the parser currently is looking. |
| |
| `startTagPosition` - Indicates the position where the current tag starts. |
| |
| `closed` - Boolean indicating whether or not the parser can be written to. |
| If it's `true`, then wait for the `ready` event to write again. |
| |
| `strict` - Boolean indicating whether or not the parser is a jerk. |
| |
| `opt` - Any options passed into the constructor. |
| |
| `tag` - The current tag being dealt with. |
| |
| And a bunch of other stuff that you probably shouldn't touch. |
| |
| ## Events |
| |
| All events emit with a single argument. To listen to an event, assign a |
| function to `on<eventname>`. Functions get executed in the this-context of |
| the parser object. The list of supported events are also in the exported |
| `EVENTS` array. |
| |
| When using the stream interface, assign handlers using the EventEmitter |
| `on` function in the normal fashion. |
| |
| `error` - Indication that something bad happened. The error will be hanging |
| out on `parser.error`, and must be deleted before parsing can continue. By |
| listening to this event, you can keep an eye on that kind of stuff. Note: |
| this happens *much* more in strict mode. Argument: instance of `Error`. |
| |
| `text` - Text node. Argument: string of text. |
| |
| `doctype` - The `<!DOCTYPE` declaration. Argument: doctype string. |
| |
| `processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument: |
| object with `name` and `body` members. Attributes are not parsed, as |
| processing instructions have implementation dependent semantics. |
| |
| `sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>` |
| would trigger this kind of event. This is a weird thing to support, so it |
| might go away at some point. SAX isn't intended to be used to parse SGML, |
| after all. |
| |
| `opentagstart` - Emitted immediately when the tag name is available, |
| but before any attributes are encountered. Argument: object with a |
| `name` field and an empty `attributes` set. Note that this is the |
| same object that will later be emitted in the `opentag` event. |
| |
| `opentag` - An opening tag. Argument: object with `name` and `attributes`. |
| In non-strict mode, tag names are uppercased, unless the `lowercase` |
| option is set. If the `xmlns` option is set, then it will contain |
| namespace binding information on the `ns` member, and will have a |
| `local`, `prefix`, and `uri` member. |
| |
| `closetag` - A closing tag. In loose mode, tags are auto-closed if their |
| parent closes. In strict mode, well-formedness is enforced. Note that |
| self-closing tags will have `closeTag` emitted immediately after `openTag`. |
| Argument: tag name. |
| |
| `attribute` - An attribute node. Argument: object with `name` and `value`. |
| In non-strict mode, attribute names are uppercased, unless the `lowercase` |
| option is set. If the `xmlns` option is set, it will also contains namespace |
| information. |
| |
| `comment` - A comment node. Argument: the string of the comment. |
| |
| `opencdata` - The opening tag of a `<![CDATA[` block. |
| |
| `cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get |
| quite large, this event may fire multiple times for a single block, if it |
| is broken up into multiple `write()`s. Argument: the string of random |
| character data. |
| |
| `closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block. |
| |
| `opennamespace` - If the `xmlns` option is set, then this event will |
| signal the start of a new namespace binding. |
| |
| `closenamespace` - If the `xmlns` option is set, then this event will |
| signal the end of a namespace binding. |
| |
| `end` - Indication that the closed stream has ended. |
| |
| `ready` - Indication that the stream has reset, and is ready to be written |
| to. |
| |
| `noscript` - In non-strict mode, `<script>` tags trigger a `"script"` |
| event, and their contents are not checked for special xml characters. |
| If you pass `noscript: true`, then this behavior is suppressed. |
| |
| ## Reporting Problems |
| |
| It's best to write a failing test if you find an issue. I will always |
| accept pull requests with failing tests if they demonstrate intended |
| behavior, but it is very hard to figure out what issue you're describing |
| without a test. Writing a test is also the best way for you yourself |
| to figure out if you really understand the issue you think you have with |
| sax-js. |