MXL MicroXML Parser

MicroXML is a simplified subset of XML 1.0 (5th edition) created by James Clark and John Cowan. The specification is at:

  https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html

It is currently the subject of a W3C Community Group, but is not a W3C Standard or on the Standards Track. An informative pair of articles about it by Uche Ogbuji is at:

  http://www.ibm.com/developerworks/library/x-microxml1/
  http://www.ibm.com/developerworks/library/x-microxml2/

The MXL Parser is designed to parse MicroXML by two different methods. It produces a Data Model in strict accordance with the spec, and also provides SAX-type push parsing at the same time. Either or both are selectable.

MXL reports as errors everything in the document that does not conform to the MicroXML spec. In its FullXML mode it does the same, and when SAX mode is also enabled it additionally reports the content of four constructs excluded from the data model: comments, DOCTYPEs, CDATA sections, and PIs. Each has its own callback function. None of the four ever appears in the Data Model itself.

MXL is written in C++ and is currently compiled with Visual C++ 6.0. It references windows.h along with stdio.h and stdlib.h, but does not use Microsoft-specific functions, so it should be readily portable to other platforms. The Windows version consists of two parts: mxlparser.dll, the parser itself, and mxl.exe, a simple console driver for it.

Operation

In a Windows command prompt window, the MXL parser is invoked by:

  Usage: mxl [sourcefile (default is stdin)] [options]
  Options:
    -o outputfile  Default is stdout
    -e errorfile   Default is stderr
    -n  No content model, otherwise sent as JSON to outputfile
    -s  Send SAX messages (as diagnostics) to errorfile
    -f  FullXML report DOCTYPE, CDATA, and PIs as SAX messages
        to errorfile instead of reporting them as errors
    -x  Expat callbacks for start and end tags, text, and PIs
    -a  Provide brief help on the mxlparser.dll API
    -h or -?, Provide help (this message)

The API help mentioned there is:

  API for mxlparser.dll:
  First create an MxlParser with:
    MxlParser *Parser = new MxlParser();
  Optionally, set up options and SAX callbacks:
    Parser->SetOptions(UseSAX, UseModel, FullXML);  (all bools)
    Parser->SetCallbacks(ErrFileName, ReportErrorFunc,
      StartTagFunc, EndTagFunc, TextContentFunc,
      ReportCDataFunc, ReportPIFunc, ReportDoctypeFunc,
      ReportCommentFunc);
  For expat-compatible callbacks, use SetExpatCallbacks instead,
  which has a longer list of callbacks.
  Finally, parse the file:
    element *DataModel = Parser->ParseFile(SourceFileName);
  Error messages and comments are sent to ErrFileName (default
  stderr) unless the Report*Func says otherwise.
  If UseSax, the Tag and Text callbacks are used; the stub
  functions for them report the UTF-32 strings in JSON to
  ErrFileName.
  If UseModel, the data model is returned at the end as a struct
  with all strings in UTF-32 encoding, zero terminated.
  If FullXML, the DOCTYPE, CDATA, and PIs are reported as SAX
  messages instead of errors; they are never in the data model.

Data Model

The Data Model is returned as a structure by mxlparser.dll upon completion of the parse.
All text items in it are zero-terminated UTF-32 strings, whose length is also given in the structure:

  typedef unsigned char unc;
  typedef unsigned long unl;

  struct element {   // data model uses one top element per doc
    unl *name;       // array of UTF-32 chars
    long namelen;
    pair **attrs;    // array of attribute pairs
    long attrcnt;
    cont **content;  // arrays of element ptrs or UTF-32 chars
    long contcnt;
  };

  struct pair {
    unl *name;       // attribute name, UTF-32
    long namelen;
    unl *val;        // attribute value, UTF-32
    long vallen;
  };

  struct cont {
    void *it;        // ptr to array of UTF-32 chars or element
    long cnt;        // count if chars, 0 if element
  };

For convenient study, the driver converts the structure to the JSON format used in the spec, and writes it to stdout (or to a specified file) at completion. Here is what it produces for the sample in par. 3.1 of the spec:

  [ "comment",
    { "lang": "en",
      "date": "2012-09-11"
    },
    [ "\nI ",
      [ "em",
        {},
        [ "love" ]
      ],
      " \u00B5XML!",
      [ "br",
        {},
        []
      ],
      "\nIt's so clean & simple."
    ]
  ]

The formatting differs slightly from the spec, because we wanted every brace and bracket holding more than one item to have matching start and end columns.

SAX Callbacks

When SAX mode is enabled, the parser calls back to these functions, which are sent character data in UTF-32:

  void StartTag(unl *name, long namecnt, pair **attrs, long attrcnt);
  void EndTag(unl *name, long namecnt);
  void TextContent(unl *text, long textcnt);

In FullXML SAX mode, it also uses these, which are sent character data in UTF-8:

  void ReportComment(char *comment);
  void ReportPI(char *pi);
  void ReportDoctype(char *doctype);
  void ReportCData(char *cdata);

Whether in SAX mode or not, it always reports errors via this function, which is sent character data in UTF-8:

  void ReportError(long line, char *warning, char *cpt, bool fatal);

Hardly any errors are considered fatal; for most, some form of recovery is at least attempted.
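As a rough illustration of the callback signatures listed above, the following sketch shows what user-supplied callbacks might look like. It is not taken from the MXL sources. The typedef and the pair struct are copied from the Data Model section; Utf32ToUtf8 and g_fatalCount are hypothetical names introduced only for this example, and the meaning of the cpt argument is not documented here, so the sketch ignores it. It assumes a C++11 compiler rather than Visual C++ 6.0.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

typedef unsigned long unl;   // UTF-32 character, as in the data model

struct pair {                // attribute pair, as in the data model
    unl *name;               // attribute name, UTF-32
    long namelen;
    unl *val;                // attribute value, UTF-32
    long vallen;
};

// Hypothetical helper (not part of MXL): encode n UTF-32 code points
// as UTF-8. It does not validate surrogates or out-of-range values.
std::string Utf32ToUtf8(const unl *s, long n) {
    std::string out;
    for (long i = 0; i < n; ++i) {
        unl c = s[i];
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical counter of errors flagged as fatal.
static int g_fatalCount = 0;

// Same signature as the documented StartTag callback: print the tag
// and its attributes as UTF-8.
void StartTag(unl *name, long namecnt, pair **attrs, long attrcnt) {
    std::printf("<%s", Utf32ToUtf8(name, namecnt).c_str());
    for (long i = 0; i < attrcnt; ++i) {
        std::printf(" %s=\"%s\"",
            Utf32ToUtf8(attrs[i]->name, attrs[i]->namelen).c_str(),
            Utf32ToUtf8(attrs[i]->val, attrs[i]->vallen).c_str());
    }
    std::printf(">\n");
}

// Same signature as the documented EndTag callback.
void EndTag(unl *name, long namecnt) {
    std::printf("</%s>\n", Utf32ToUtf8(name, namecnt).c_str());
}

// Same signature as the documented TextContent callback.
void TextContent(unl *text, long textcnt) {
    std::printf("text: %s\n", Utf32ToUtf8(text, textcnt).c_str());
}

// Same signature as the documented ReportError callback. The message
// is already UTF-8; cpt is ignored because its meaning is assumed
// unknown in this sketch.
void ReportError(long line, char *warning, char *cpt, bool fatal) {
    (void)cpt;
    std::fprintf(stderr, "line %ld: %s%s\n",
                 line, fatal ? "FATAL: " : "", warning);
    if (fatal) ++g_fatalCount;
}
```

Functions like these would presumably be passed to SetCallbacks in the order given in the API help. Converting to UTF-8 at the callback boundary is only one possible design; a program that works in UTF-32 throughout could use the strings directly.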
For example, when an end tag doesn't match the current start tag, the parser tries to match it against the parents of the current element. It reports any such issues and fixes as errors.

The stub functions provided for the callbacks all report the name of the callback and the text sent to it, in JSON format, to stderr (or to the errorfile set by the user). Hence callbacks and errors precede the output of the Data Model when SAX is specified but the calling program has set no callbacks in mxlparser.dll.

expat-compatible Callbacks

When callbacks are set with SetExpatCallbacks, these are used:

  void StartTag(void *userdata, char *name, long namecnt, char **attrs);
  void EndTag(void *userdata, char *name, long namecnt);
  void TextContent(void *userdata, char *text, long textcnt);
  void StartCdataSection(void *userdata);
  void EndCdataSection(void *userdata);
  void ReportPI(void *userdata, char *target, char *data);
  void XMLDecl(void *userdata, char *version, char *encoding,
    int standalone);  [standalone = -1]
  void StartDoctypeDecl(void *userdata, char *name, char *sys,
    char *pub, int internalsubset);  [internalsubset = 0]
  void EndDoctypeDecl(void *userdata);
  void ReportComment(void *userdata, char *comment);

The stub functions add "Ex" to the start of the reports, as in "ExPI:". All names and content are in UTF-8.

Licensing

The MXL Parser is entirely written by Jeremy H. Griffith of Omni Systems. Omni intends to use it for an upcoming product, working name uDoc, which is a MicroXML editor specifically configured for a document format similar to a simplified DITA. We intend to license at least the parser, and probably the entire product, as FOSS. We are currently considering the GPL, although the Apache license is also a possibility. At that point, we will create a SourceForge project for it.

Omni currently has three products available.
The first is Mif2Go, a commercial converter from FrameMaker source to a variety of output formats, including Word, DITA, HTML, and many forms of Help such as the FOSS OmniHelp hosted on SourceForge. Mif2Go is free for a large number of its users: the unemployed, retired, underemployed consultants, academics (staff, faculty, students), most nonprofits, and FOSS developers. Quite a few of its paying customers are Fortune 100 companies and government agencies, who can afford to support the rest.

The second is DITA2Go, a converter from DITA to the same outputs as Mif2Go, with which it shares a large part of its code.

The third is uDoc2Go, which converts from uDoc to the same outputs as Mif2Go, with which it also shares a large part of its code.

Part of the impetus for the newest product is concern over the deteriorating quality and increasing cost of Adobe's FrameMaker. The other part is concern over the difficulties many users are experiencing with the increasing complication of DITA. MicroXML fits well with a product meant to improve life for the Technical Writers using both Frame and DITA.