
Why You Should Care About XML

Since we began talking about the StartWithXML project, a few offline comments have come in suggesting that imposing XML on authors (and editors for that matter) won’t work.

When framed that way, I’m in violent agreement. I would never argue that authors and editors should or will become fluent in XML or be expected to manually mark up their content. I naively tried fighting that battle before, and was soundly defeated every time. It is simply too much “extra” work that gets in the way of the writing process.

But there are several reasons why it’s critically important for publishers to start paying attention to XML right now, and across their entire workflow:

  • XML is here to stay, for the reasonably foreseeable future. While it’s always dangerous to attempt to predict expiration dates on technology, I think it’s fair to assume XML will have a shelf life at least as long as ASCII, which has been with us for more than 40 years, and isn’t going anywhere soon.
  • Web publishing and print publishing are converging, and writing and production for print will be much more influenced by the Web than vice versa. It will only get harder to succeed in publishing without putting the Web on par with (or ahead of) print as the primary target. The longer you wait to get that content into Web-friendly and re-usable XML, the worse off you’ll be.

Many in publishing balk at bringing XML “up the stack” to the production, editing, or even the authoring stage. And with good reason; XML isn’t really meant to be created or edited by hand (though a nice feature is that in a pinch it easily can be). There are two places to look for useful clues about how XML will actually fit into a publisher’s workflow: Web publishing and the “alpha geeks.”

Web Publishing

In the early days (mid ’90s), there were two primary ways that content got from a writer to the Web:

  1. Adventurous authors dove into HTML, learning the code needed to express lists and headings and tables. Most people relied on simple text editors, though HTML-specific tools like BBEdit began to emerge.
  2. For many other writers, the workflow didn’t change much — articles were written with a word processor, then handed off to the production staff — in the case of the Web, for markup as HTML rather than for composition into print.

Today, the writers behind successful new media and content companies like the Huffington Post, PaidContent, TechCrunch, or Gawker depend on Web-friendly tools like blog platforms, RSS readers, and more recently dedicated writing software for bloggers (I’m writing this post with MarsEdit, though Ecto and Windows Live Writer are other popular choices). For most writers, most of the time, there’s little need to know more than minimal tagging (how to fix an errant hyperlink, for example). The substantial complexity of the XML at work is hidden. But no one will become the next Huffington Post by accepting submissions as Word attachments. The tools will evolve, and there’s a real opportunity for publishers and writers willing to experiment on the edge, which brings me to the next place to look for clues about the future:

Alpha Geeks

By “alpha geeks” I mean those experimenting and innovating out on the edge, often doing it as much for the challenge and the learning value as for any specific payback. These early experiments can have a sizable impact on the direction of later effort and innovation. In the context of publishing, I’d say that much of what Harlequin has been doing lately qualifies, as does Bookworm, the Web-based EPUB reader project. Here at O’Reilly we believe in “eating our own dogfood,” and for a large chunk of our frontlist, books are either written directly in XML, or are converted to XML as the first step of production. That’s meant the ability to rapidly prototype new design elements and features, as well as to effectively separate design and content, and to achieve real “single source” publishing for many titles — simultaneously creating on-demand output for print, for Web-friendly PDFs, for ebooks, and for online-access via Safari Books Online. What we’re doing might not make sense for a lot of publishers today, but sitting on the sidelines waiting indefinitely for tools that don’t require new knowledge or skills doesn’t make much sense either.
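To make the “single source” idea concrete, here’s a minimal sketch of the pattern: one XML source rendered into two different outputs. The toy `<chapter>`/`<title>`/`<para>` schema below is invented for illustration and is not O’Reilly’s actual production format.

```python
# Toy single-source pipeline: one XML source, multiple outputs.
# The chapter/title/para schema here is made up for illustration.
import xml.etree.ElementTree as ET

SOURCE = """<chapter>
  <title>Getting Started</title>
  <para>XML separates content from presentation.</para>
</chapter>"""

def to_html(xml_text):
    """Render the chapter as simple HTML (one possible output)."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title")
    paras = "".join(f"<p>{p.text}</p>" for p in root.iter("para"))
    return f"<h1>{title}</h1>{paras}"

def to_plain(xml_text):
    """Render the same source as plain text (another output)."""
    root = ET.fromstring(xml_text)
    lines = [root.findtext("title").upper()]
    lines += [p.text for p in root.iter("para")]
    return "\n".join(lines)

print(to_html(SOURCE))
print(to_plain(SOURCE))
```

A real toolchain would swap in XSLT or a PDF renderer for these functions, but the design point is the same: the content lives once, and each output format is just another renderer over it.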

I wouldn’t be surprised at all if publishers start seeing “manuscripts” in the form of a series of blog posts, or a set of Google Docs. In either case, that’s already Web-friendly XML, and if publishers want to spend their time and money pushing it into Quark, then onto PDF, and finally on to a vendor to create an ebook, that’s their choice. But someone more nimble and willing to work natively in a Web-friendly format will be difficult to compete with.
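To illustrate why a blog-based manuscript is already structured data, here’s a sketch of pulling chapter-ready titles and XHTML bodies straight out of an RSS-style feed with Python’s standard library. The feed snippet is invented for the example:

```python
# A blog feed is already Web-friendly XML: each post's title and
# XHTML body can be extracted directly, with no detour through
# Quark or PDF. The feed content below is hypothetical.
import xml.etree.ElementTree as ET

FEED = """<rss><channel>
  <item>
    <title>Chapter 1: Getting Started</title>
    <description>&lt;p&gt;The draft, as the author wrote it.&lt;/p&gt;</description>
  </item>
</channel></rss>"""

def posts_as_chapters(feed_xml):
    """Return (title, xhtml_body) pairs for every post in the feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("description"))
            for item in root.iter("item")]

for title, body in posts_as_chapters(FEED):
    print(title)   # Chapter 1: Getting Started
    print(body)    # <p>The draft, as the author wrote it.</p>
```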

Arguing over whether authors can/should/will “use XML” is not a debate I’m interested in having. Maybe they will, maybe they won’t. But XML is becoming part of the fabric of what will only become more digital and more networked content creation, production, and distribution, and continuing to treat it as just an output format for a vendor or developer to care about means missing substantial opportunities. And as books become more connected to the Web, the collaboration and communication made possible become powerful motivators. As Marc Andreessen said (quoted in The World is Flat) about the early Web and the technical hurdles it presented:

People will change their habits quickly when they have a strong reason to do so, and people have an innate urge to connect with other people. And when you give people a new way to connect with other people, they will punch through any technical barrier, they will learn new languages — people are wired to want to connect with other people and they find it objectionable not to be able to.

But what authors will (or won’t) do with XML doesn’t change the importance for publishers of understanding and applying XML to their workflow.

  • bowerbird

    there’s so many big contradictions in this post
    they can’t be contained in this tiny comment-box.

    but if you _really_ want to see what “alpha geeks”
    are doing in this sphere, check out light markup.

    the best known — used in many blogging tools –
    is “markdown”. currently light-markup systems are
    geared toward _creating_ x/html, but the next step
    logically is to move their conversion routines
    into the browser (and other viewer-programs) so
    there won’t be any need for anyone to know x.m.l.

    simplicity is key. unfortunately for o’reilly,
    simplicity doesn’t sell lotsa conference tickets.

    but hey, dinosaurs, keep spending your money on
    bloated formats. mammals expect you to be slow.

    -bowerbird

  • http://toc.oreilly.com Andrew Savikas

    @bowerbird — you raise a very good point; a mention of markdown was actually in an early draft of this post. I don’t see it as a contradiction at all; quite the opposite — it’s a great example of just the kind of tools that are emerging to mask much of the complexity of the markup, without sacrificing the value of the markup.

    Regardless of how effectively tools like markdown hide complexity for an author, publishers still need to develop internal knowledge and expertise around XML.

    BTW, bowerbird, your comments are often insightful, but anonymity diminishes your credibility. Why not join the conversation as yourself?

  • http://toc.oreilly.com Andrew Savikas

    @bowerbird — if you’re suggesting it’s “simple” to store content in (non-standard!) markup like markdown and then rely on browsers to render, I think you and I must have different definitions of “simple”. Browsers can’t even be relied on to consistently handle XSLT stylesheets — there’s no way they’ll be dynamically transforming markdown anytime soon.

    And given the choice between searching, filtering, manipulating and rendering massive amounts of markdown or XML, I’d pick the XML any day.

  • bowerbird

    andrew, i’m not “anonymous”.

    nor am i worried about my “credibility”…

    yes, “bowerbird” is a name i gave myself –
    initially as my performance poetry name –
    but now i use it for all creative purposes.

    if you kept up with the e-book community,
    you would probably have known that, since
    i’ve been an outspoken member in that arena
    since the internet tubes started filling up.

    you might as well ask mark twain or bob dylan
    or george orwell to go by their “real” name…

    i’m even willing to put my phone number out
    — 310.980.9202 — if anyone wants to chat…
    i’ve got nothing to hide. i’m an open e-book.

    -bowerbird

    p.s. and yes, i’m suggesting that the future
    of publishing is indeed with a “simple” means
    of storing the text. the complexity occurs
    when you “remix” that text in various ways…

    (which, by the way, is also true with x.m.l.
    indeed, xslt stuff is _exceedingly_ complex.
    dealing with the obstacle of markup is _hard_.)

    so you go with your way, and i’ll go with mine,
    and we will see in the end who provides people
    with the most cost-effective way to do the job.

    oh, and browsers could “dynamically transform”
    a markdown file by incorporating the small amount
    of code constituting an average markdown parser.
    there’s very little utility — if any — in the
    “middle step” of an independent (x)html file.

  • bowerbird

    oh yeah, i should have mentioned “pandoc”…

    > http://johnmacfarlane.net/pandoc/

    pandoc is basically a rosetta stone between
    different formats, which (to me) demonstrates
    there’s nothing special about any one of ‘em.

    trying to put the “smarts” into the _format_
    is the wrong approach. instead, make the format
    as transparent as possible, and concentrate instead
    on putting the “smarts” into the _applications_…

    -bowerbird

  • http://toc.oreilly.com Andrew Savikas

    @bowerbird — thanks for clarifying; it seems your reputation failed to precede you among those of us who are relatively new to “the ebook community”.

    We’ll have to agree to disagree on the value of XML as a storage and delivery format.

    Thanks for the pointer to pandoc — I’m always happy to check out “universal” conversion tools and see how they perform against our content.

  • bowerbird

    > it seems your reputation failed to precede you
    > among those of us who are relatively new to
    > “the ebook community”.

    ask around. you’ll find i’ve been banned from
    all of the finest establishments… :+)

    and as soon as you deem me to be bad for your
    conference business, you will ban me as well.

    -bowerbird

  • http://www.apsed.com/blog/ Alain Pierrot

    bumblebee, hummingbird
    I guess places which host and welcome the somewhat ominous drone of bowerbird get more pollination than sting…
    Even if I disagree with a too-steep condemnation of XML, I do hear the call for ‘markdown’.
    Let’s let bowerbirds build their crafty nests within our discussions?

  • http://www.schlagergroup.com Neil Schlager

    Andrew, let me diverge from bowerbird and say thanks for a most interesting and useful post. I agree completely that publishers who don’t begin to engage with XML on some level are going to be at a serious competitive disadvantage. My company creates our content in XML from the get-go, but we are also a new publisher with no legacy data or systems to worry about. Thus, it’s simpler for us. However, I’ve found that creating or getting content into XML is really the easy part. Far harder is finding/using tools to output that content into print or onto the Web, and to manage it throughout the production process. Small publishers like us, with limited budgets and a decided lack of “alpha geeks” on staff, must look to off-the-shelf programs that are user-friendly and affordable at the same time. It’s these tools that could help move the entire industry forward.

  • bowerbird

    neil said:
    > Far harder is finding/using tools to output
    > that content into print or onto the Web, and to
    > manage it throughout the production process.

    that’s not “diverging” from my position, neil…
    you’re _making_ my position. :+)

    the main reason there aren’t a lot of good tools
    to manage x.m.l. workflow and output the content
    is because it’s darn hard to write programs that
    can dodge the markup when they need to do that,
    and implement it when they need to do _that_…

    even when those tools come, they won’t be cheap.

    > Small publishers like us, with limited budgets
    > and a decided lack of “alpha geeks” on staff,
    > must look to off-the-shelf programs that are
    > user-friendly and affordable at the same time.

    and here’s another problem area.

    although typically sold as a “generic” solution,
    an x.m.l. workflow usually requires customization.
    most especially when you’ve utilized the “x” part,
    and “extensibled” what you want the content to do.

    odds are that the off-the-shelf software will not
    handle one or more of those aspects the way that
    you _intended_ it to be handled, and you must then
    revisit your x.m.l. coding so it conforms to the
    assumptions implicit in your off-the-shelf tools.
    (which might not be that readily apparent to you.)
    this is when the software _can_ do what you want.
    sometimes you’re just gonna be plain out of luck,
    and have to accept the fact that it cannot do it.

    the x.m.l. proponents don’t tell you all of this,
    not in a way that really drives home the point…

    of course, they’ll be willing to hold a conference
    where you can _learn_ these things. for a price.
    “start with x.m.l.” appears to be such a thing…
    take a look at its price. that’s to get started.
    (there will be more conferences to come later on.)

    don’t get me wrong. x.m.l. can be made to work.
    but other methods can _also_ “be made to work”.

    so the question is “which is most cost-effective?”
    x.m.l. works. it’s just not cheap, and not fast.
    and — perhaps worst of all — it requires you to
    pay costs up-front, while the benefits come later.

    of course, if you’ve already decided that x.m.l.
    “is the future”, then i highly suggest you bite
    the bullet and accept the high cost of doing it,
    including the high cost of starting out with it.

    because if you don’t do it right, from the start,
    you’ll find that it will get even more and more
    expensive down the line to correct your errors…

    heck, maybe i _am_ good for o’reilly’s business.

    > It’s these tools that could
    > help move the entire industry forward.

    well, it will be some time before there are enough
    competitors in the sphere that the price declines.
    you might even wait for tools that _never_ arrive.
    does that warp the cost-benefit equation for you?

    -bowerbird

  • http://www.apsed.com/blog/ Alain Pierrot

    @bowerbird & Neil Schlager
    Prices do decline on the market for XML editors and do not compare with Adobe’s and Quark’s quasi-mandatory package pricing…

    Take EditiX XML Editor: $60/$120 (small business)
    [OK, it’s an ‘alpha geek’ tool].

    You also have Pixware XMLMind XML Editor, with a very generous licence for the free version, and their open source, professional user licence at $300 — with DocBook, XSL-FO, (X)HTML output…

    Add into the landscape Open Office’s (free) management of XSL stylesheets leveraging the .odt bundles…

    The initial investment can be kept fairly reasonable.

  • http://www.schlagergroup.com Neil Schlager

    Dear Alain, good point about the cost and growing ubiquity of XML editors. We use oXygen, which is a little more expensive but works quite well. The tougher challenge is on the CMS side: finding tools that can manage your content not only for print but also for the Web, single-source publishing, output software on the page layout and typesetting front, etc. There are a zillion CMSs, including many open source ones, but finding one that can really assist with single-source publishing is not so easy. On the other end of the spectrum are packages like the Really Suite, but these get very expensive, indeed. Perhaps some convergence of these various tools will assist publishers. Do you see that happening?

  • http://www.apsed.com/blog/ Alain Pierrot

    “Ay, there’s the rub”, CMS are still far from what multiple media publishing requires.

    Some hope and experiments seem on the right track, with eXist (a native XML db) and XQuery techniques.

    A lot has yet to be invented, shared and discussed about
    versioning,
    generic/device specific data to be stored,
    profiles of the actors responsible for the creation, edition, use and maintenance of the different components of a content…

  • bowerbird

    alain said:
    > Prices do decline on the market for XML editors
    > and do not compare with Adobe’s and Quark’s
    > quasi-mandatory packages pricing…

    but do those editors do the job as well? there is
    a reason adobe and quark can do their extortion,
    and that reason is that their tools actually work…

    bill hill over at microsoft has commented on this:
    > http://billhillsblog.blogspot.com/2008/08/lack-of-decent-tools-holding-back-web.html

    > A lot has yet to be invented, shared and discussed

    that’s true. but the tech is not being sold that way.
    it’s being pitched as if it were some mature process.

    -bowerbird

  • http://thinkubator.ccsp.sfu.ca John Maxwell

    As I read this article (and the rest of the StartWithXML pieces) part of me has been struggling to properly articulate my discomfort. But when I read bowerbird’s first post here, I relaxed, because he nailed it.

    The issue isn’t XML vs. not-XML. The issue is simplicity vs complexity.

    The “light markup” approach leads straight to XHTML if you think about it. In conjunction with a Web CMS’s higher-level categories/tagging architecture this can give you loads to work with. It is just not necessary to design/adopt/commit to complex schemas, toolkits, and the “XML industry” in order to make this stuff work. Maybe if your business is large-scale aggregation you need to go the distance with tagging everything within an inch of eternity… but for most prose publications? Look to the simple, ubiquitous tools already in play.

    We did some work last year with wiki-based content fed (via some trivial transforms) into InDesign for print output, and it worked very nicely (http://thinkubator.ccsp.sfu.ca/FunnelWeb). The InDesign end of it was by far the most difficult, because the tool’s goals vis-à-vis XML are so scrambled and opaque. Getting nicely organized XHTML (with metadata) out of our wiki was utterly trivial in comparison.

    Embrace simplicity; this is the only way to make XML work for most of us.

  • http://toc.oreilly.com Andrew Savikas

    @John — your perspective is not at all uncommon, and completely understandable. Especially for small-scale projects, the approach you’ve described might make perfect sense (though few publishers have the in-house expertise to transform wiki-markup into InDesign, however trivial it may seem to you).

    Over the past six years, I’ve built and/or maintained plumbing among Word, OpenOffice, FrameMaker, XHTML, InDesign, PageMaker, troff, DocBook, RedCloth and MediaWiki (among others). In my experience, paths like the one you’ve described work very well on a small scale, with a team working closely together that includes at least one person with strong scripting and markup skills. But I’ve not seen any that scale beyond a handful of projects.

    It’s also my experience that once someone becomes comfortable with XML, they find much of the workflow far simpler than what they faced previously. If your workflow makes sense for your projects, that’s great. On our end, we’re doing things faster, cheaper, and better atop a solid XML-based foundation, and I absolutely believe it’s a useful model for a good number of other publishers.

    It also doesn’t have to be super expensive. The only piece of our XML toolchain that we pay for is our PDF renderer, which costs less than a few licenses of InDesign. Plenty of people are doing it completely non-commercial, like the Subversion project (we currently use Subversion in lieu of a CMS or DAM — not a bad example of choosing simplicity). There’s a lot of room for experimentation and innovation, especially when the interfaces are standards-based (wiki formats proliferate, but they all send HTML to the browser). Thanks for your comments.

  • bowerbird

    andrew said:
    > But I’ve not seen any that scale
    > beyond a handful of projects.

    well, it’s hard to prove an ability to scale,
    unless you actually _do_ scale.

    but one way to prove it is to demonstrate an
    ability to handle a comprehensive test-suite.
    if you can deal with everything in the suite,
    then presumably you can scale without problems.

    so how about a little experiment here, andrew?

    provide the raw material — text, images, etc. –
    for one of your most difficult-to-handle books,
    and that will form the beginning of a test-suite.

    then anyone who comes along with a system can
    prove its cost-benefit value against the suite.

    -bowerbird

    p.s. not to diss o’reilly, but i’d say that your
    timetable for putting out most of your catalog in
    kindle format indicates an _inability_ to scale.

  • http://toc.oreilly.com Andrew Savikas

    @bowerbird — A number of our books are already available in a variety of formats under licenses that allow the kind of tinkering you propose. Have at it, and please do share your findings.

  • http://thinkubator.ccsp.sfu.ca/ John Maxwell

    Andrew wrote:

    > @John — your perspective is not at all uncommon, and completely understandable.
    > Especially for small-scale projects, the approach you’ve described might make perfect
    > sense (though few publishers have the in-house expertise to transform wiki-markup
    > into InDesign, however trivial it may seem to you).

    Just clarifying for the sake of accuracy… we didn’t transform wiki markup — the wiki does that already. The trivial transforms were simply XHTML into something a little more palatable to InDesign’s stylesheet system.

    The point being, it *is* XML… but it doesn’t have to be complicated XML, or require complicated toolsets (open source or otherwise).

  • bowerbird

    andrew said:
    > A number of our books are already available

    unless we agree that it is a test-suite, and develop it
    with that specific purpose in mind, and then evaluate
    the results on agreed dimensions, with acute attention
    to both the costs and the benefits, there is little point.

    but perhaps research that explicit wouldn’t benefit you.

    -bowerbird

  • http://www.magellanmediapartners.com Brian O'Leary

    Or, you could find an example of an installation that actually did scale using the approach you endorse. I would be delighted to document such a case and include it in the research paper that will accompany the conference.

  • bowerbird

    > Or, you could find an example of an installation that
    > actually did scale using the approach you endorse.

    i’ve been as nonspecific in my description of
    “the approach i endorse” as you have been…

    however, i don’t mind getting more concrete.

    but there are too many free parameters here.

    what are the objectives?

    what is the state of the input?

    what types of output do we expect?

    what are the specific terms of evaluation?

    how do we measure the costs and benefits?

    on the big scale, how do we design the cyberlibrary?

    i’ve developed my own answers to these questions,
    but i think a healthy dialog on them would be good,
    for everyone concerned… and crucial if we _really_
    wanna design our future. let’s hear what you think.

    -bowerbird

  • http://www.magellanmediapartners.com Brian O'Leary

    You may be making my suggestion overly complicated. I’m saying that, if you know of a publishing operation that has scaled using the kind of approach you endorse, I would be happy to contact them, profile their work and include them as an “alternative to XML” in our case studies.

  • bowerbird

    i’m not familiar with any specific “publishing operations”…

    but if you’re trying to sell companies on an x.m.l. approach,
    you must know a lot of them that aren’t using it currently…

    perhaps they are succeeding, perhaps muddling through.
    perhaps on the brink of disaster, and you can save them.
    i just don’t keep up with the box scores of the dinosaurs.

    still, if you know of no competing approaches to x.m.l.,
    then i guess you have got the whole market to yourself!

    x.m.l. is inevitable. isn’t it?

    -bowerbird

    p.s. i don’t think those questions are “complicated” at all.
    i think answering them is an essential aspect of our future.

  • http://www.magellanmediapartners.com Brian O'Leary

    Taking your thoughts in turn:

    - My offer to profile companies that you feel offer credible alternatives to the use of XML remains open.

    - I think we are making a good-faith effort to do what we can to provide readers and forum attendees with a range of alternatives that would help them improve content production and reuse. That we favor XML is a sticking point for you, but our work involves assessments of alternatives and priorities based on a publisher’s situation.

    - I didn’t ask you to identify companies that are not using XML or are using it poorly; I invited you to name a firm that employed a lighter, more nimble approach at scale, with the offer to profile them as part of the research. I appreciate that you are unable to name a firm.

    - I didn’t find your questions complicated; I found your proposal unduly involved when all you wanted to do is prove that a non-XML option can scale.

    I accept your point that the questions need to be answered as new systems are designed and evaluated. I’ll see what I can do to include them in the next draft of the research outline.

  • bowerbird

    brian-

    i do appreciate that your concern is with
    the preparation of your upcoming seminar.

    best of luck with that. :+)

    as for scaling, i’m doing research on
    the project gutenberg corpus, with its
    e-texts numbering 15-20,000 currently.

    i think of that as my proof-of-concept,
    and will then turn to the o.c.a. corpus,
    which should number in the millions then,
    and will eventually hit tens of millions…

    scale doesn’t scare me. just part of the job.

    -bowerbird