Why it’s so hard to get print stories online
These days my world at work revolves around getting news stories out of my employer’s newspaper publishing system and onto the web via my employer’s in-house web CMS.
Welcome to my nightmare.
You’d think it would be easy: just issue a SELECT * FROM news WHERE date = $today; and be done with it, right? Well, not so much.
1980 called, they want their markup back
The first hurdle is that a fair percentage of newspaper publication systems are either old or based on old underpinnings. The system we have is no exception. There exists an “export” mechanism which gets you an “ASCII” dump of whatever stories you asked for, except it’s not really ASCII because it’s got all these custom high-bit characters in it that you have to deal with. Nope, it’s not UTF-8 or ISO-8859-1 either. Standards?!? What were you thinking?
But that’s not the hard part. In these days of HTML and XML, and even XML-based standards like NewsML and NITF, the notion that a print publication system would have and maintain its own proprietary markup language is mind-boggling. But ours does. Modern notions like “close all open tags” have no meaning here.
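For the curious, here is a minimal Python sketch of one way to tame an “almost ASCII” export like that: pass the low 7 bits through untouched and translate the high-bit bytes through a hand-built table. The byte values in HIGH_BIT_MAP are invented for illustration; the real table has to be worked out from the system’s own output.

```python
# Hypothetical mapping of proprietary high-bit bytes to Unicode.
# These byte values are made up; they are NOT the real system's codes.
HIGH_BIT_MAP = {
    0x80: "\u2014",  # m-dash
    0x81: "\u201c",  # left double quote
    0x82: "\u201d",  # right double quote
}


def decode_export(raw: bytes, unknown: str = "\ufffd") -> str:
    """Decode an 'almost ASCII' export.

    Plain ASCII passes through, mapped high-bit bytes are translated,
    and anything unmapped becomes the replacement character so it is
    easy to find and add to the table later.
    """
    out = []
    for byte in raw:
        if byte < 0x80:
            out.append(chr(byte))
        else:
            out.append(HIGH_BIT_MAP.get(byte, unknown))
    return "".join(out)
```

Flagging unmapped bytes rather than dropping them is a deliberate choice here: a visible replacement character on a test page is a lot easier to chase down than a silently missing quote mark.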
Now the overall file format isn’t too bad, if you overlook the smattering of control characters scattered randomly throughout the data (never have figured that one out). It’s the content of the individual story bits that is a witch’s brew of bizarre tagging rules.
Submitted for your approval:
[TEXTOBJ]<USNEWS><NO1>
Web hed: Man bites dog
card/b1/mmedit
<NO>[BY]By Sam Spade
<MC>FOOTOWN GAZETTE WRITER
[TEXT]Lorum Ipsum blaa blaa blaa…
See, this is slick, you have markup tags within <...> brackets, and context-sensitive style “macros” between the [...] brackets. Tags can be “closed” by other tags or the classic </...> or the end of a block or nutso stuff like the <NO1>...<NO> above. That, by the way, is an “editor’s note” — be sure to strip all of those lest you let through damning internal comments. Good place for additional printable data too (um, not).
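A rough Python sketch of the first two chores, stripping editor’s notes and splitting what is left into tags, macros and text. It assumes that <NO1> opens a note and the next <NO> closes it, as in the sample above, and it treats an unclosed note as running to the end of the block so internal comments fail safe. Both are my assumptions, not documented rules of the print system.

```python
import re

# Assumption: <NO1> opens an editor's note and the next <NO> closes it.
# An unclosed note swallows the rest of the block on purpose.
EDITOR_NOTE = re.compile(r"<NO1>.*?(?:<NO>|\Z)", re.DOTALL)

# One alternative per token kind: <tag>, [macro], or a run of plain text.
TOKEN = re.compile(
    r"<(?P<tag>[^<>]+)>|\[(?P<macro>[^\[\]]+)\]|(?P<text>[^<\[]+)"
)


def strip_editor_notes(markup: str) -> str:
    """Remove every editor's note, including its opening and closing tags."""
    return EDITOR_NOTE.sub("", markup)


def tokenize(markup: str) -> list:
    """Split markup into (kind, value) pairs, kind being tag, macro or text."""
    tokens = []
    for match in TOKEN.finditer(markup):
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
    return tokens
```

Run against the sample, strip_editor_notes drops everything from <NO1> through <NO>, and tokenize then yields the [TEXTOBJ] macro, the <USNEWS> tag, the [BY] macro and the byline text in order. Working out which tag closes which is left as the exercise it truly is.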
Some markup can reference config files, like <CFnn> means “change font” and will contain a number that references the font face specified in some INI file on the (UNIX) server somewhere (yes, an INI file).
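Resolving those font numbers might look something like this in Python. The [fonts] section name and the number = face layout are guesses at what such an INI file contains; only the <CFnn> tag shape comes from the real markup.

```python
import configparser
import re

# Hypothetical INI contents; the real section and key names are unknown.
SAMPLE_INI = """
[fonts]
1 = Times Roman
12 = Helvetica Bold
"""

CF_TAG = re.compile(r"<CF(\d+)>")


def load_fonts(ini_text: str) -> dict:
    """Parse the INI text into a {font number: face name} dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {int(num): face for num, face in parser["fonts"].items()}


def fonts_used(markup: str, fonts: dict) -> list:
    """Return the face for each <CFnn> tag, or None if the number is unmapped."""
    return [fonts.get(int(num)) for num in CF_TAG.findall(markup)]
```

Returning None for an unknown number, instead of raising, keeps one stale config entry from killing an entire export run.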
When I say the style macros are context-sensitive I mean they essentially apply a set of <...> tags depending on their position. A [RAIL] in Sports may be completely different than a [RAIL] in Features… and lest we forget, a [...] tag is optional, you can always specify the <...> tags directly too. Ah, good times.
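One simple way to model that context sensitivity is a lookup keyed on both section and macro name. This Python sketch uses invented expansions throughout; the real tag sets live in the print system’s style definitions and differ by publication.

```python
# Invented expansions keyed by (section, macro). The tags on the right
# are placeholders, not the real style definitions.
MACRO_TABLE = {
    ("Sports", "RAIL"): ["<CF12>"],
    ("Features", "RAIL"): ["<CF3>", "<MC>"],
}


def expand_macro(section: str, macro: str) -> list:
    """Return the raw tags a [macro] stands for in the given section.

    Unknown combinations return an empty list so the caller can log
    them instead of crashing mid-export.
    """
    return MACRO_TABLE.get((section, macro), [])
```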
Oh, and don’t even get me started on the mixed DOS and Unix style line enders.
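A small Python sketch of the cleanup pass for line endings and the stray control characters mentioned earlier. It assumes the control characters carry no meaning and can simply be dropped, which I have never been able to confirm.

```python
import re

# Control characters other than tab (\x09), newline (\x0a) and
# carriage return (\x0d); carriage returns are handled separately.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def normalize_text(text: str) -> str:
    """Convert DOS and bare-CR line endings to \\n, then drop stray control characters."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return CONTROL_CHARS.sub("", text)
```

The order matters: replacing \r\n before bare \r avoids turning every DOS line ending into two newlines.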
Thinking about running away screaming yet? I’m just getting started.
Field of Nightmares
Fielded data. Every web story database expects radical, forward-thinking things like headlines in headline fields and bylines in byline fields, etc. Well, in our print system there are in fact headline “objects” and there are even photo objects… and separate caption objects too (because separating photos and their captions makes sense… somehow).
There is however no separate delineation for bylines or datelines or anything else, because that’s all body copy. It’s just body copy with a different style applied, or maybe it’s a tag… or one of several, nested tags, or maybe a tag and a macro… whatever as long as it looks good in print.
On the upside, you get really good at Regular Expressions.
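As a taste, here is a hedged Python sketch that pulls a byline out of body copy. It assumes the byline is marked with the [BY] macro, starts with “By”, and may be followed by an <MC> credit line, exactly as in the sample story above. Any section that styles its bylines differently would need its own pattern.

```python
import re

# Assumption: "[BY]By <name>" on one line, optionally followed by
# "<MC><credit>" on the next, as in the sample story.
BYLINE = re.compile(
    r"\[BY\]By\s+(?P<author>[^\n<\[]+)\n?"
    r"(?:<MC>(?P<credit>[^\n<\[]+)\n?)?"
)


def split_byline(body: str) -> dict:
    """Return the author, the credit line, and the body with the byline removed."""
    match = BYLINE.search(body)
    if match is None:
        return {"author": None, "credit": None, "body": body}
    credit = match.group("credit")
    return {
        "author": match.group("author").strip(),
        "credit": credit.strip() if credit else None,
        "body": body[:match.start()] + body[match.end():],
    }
```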
But wait, there’s more!
On the page layout, someone has to specifically mark each headline, story, photo and caption as belonging together. They need to be referenced together. This has no bearing on the printed page, mind you, so how often do you think this happens properly? Know references, know headlines; no references, no headlines.
It’s my way or the highway
Ok, so you have your proprietary markup format, and your custom styles by section and publication. A tall order for even Perl’s legendary regular expression engine. Let’s make it worse.
Dig it: you can have fundamentally different ways of doing things at different publications. Isn’t that cool?
Take headlines for instance: one of the sites I support uses a “headline object” for the main headline and another one for each subhead. This makes perfect sense until you find out that another site uses a single headline object for all their headlines, with the first one listed being the main headline.
No problem, a little if...then action and… oh wait… a third site uses a headline object for the main headline and [...] styles within the body copy for subheads. Ok, now we’re into code hell.
When I suggested that some sites might want to change the way they do things… well, let’s just say I’d have been better off suggesting boiled cat for lunch. “The way WE do it is better than the way THEY do it.”
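One way to keep that from turning into a wall of if...then is a strategy function per site. This Python sketch is illustrative only: the site names, the story dictionary layout, the one-headline-per-line assumption for the second site, and the [SUBHED] macro name are all hypothetical.

```python
import re

# Hypothetical subhead macro; each publication defines its own styles.
SUBHEAD_MACRO = re.compile(r"\[SUBHED\]([^\n\[<]+)")


def one_object_each(story: dict) -> tuple:
    """First headline object is the main head, the rest are subheads."""
    heads = story["headline_objects"]
    return heads[0], heads[1:]


def all_in_first_object(story: dict) -> tuple:
    """One headline object holds every headline, one per line (assumed)."""
    lines = [ln.strip() for ln in story["headline_objects"][0].splitlines() if ln.strip()]
    return lines[0], lines[1:]


def subheads_in_body(story: dict) -> tuple:
    """Main head is a headline object; subheads are macros in the body copy."""
    subs = [s.strip() for s in SUBHEAD_MACRO.findall(story["body"])]
    return story["headline_objects"][0], subs


# Hypothetical site names mapped to their headline conventions.
STRATEGIES = {
    "site_a": one_object_each,
    "site_b": all_in_first_object,
    "site_c": subheads_in_body,
}


def extract_headlines(site: str, story: dict) -> tuple:
    """Return (main_headline, subheads) using the site's own convention."""
    return STRATEGIES[site](story)
```

All three return the same shape, so everything downstream can stay ignorant of which site the story came from. A story with no headline objects at all still raises an IndexError here, which the caller has to catch.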
Yeah, I’m sure it is. Would you like another helping of cat with that?
Lost in translation
Finally we have massaged and cajoled all this data into whatever convoluted output format is being asked of us (usually some XML variant). We’ve converted and handled all the quirks and even mapped the print product’s lettered sections to the arbitrary section names used online (sort of).
It sucks and you know it. You strive for 80% of the content in the right place and you know you’re not there. No matter, you shove it off to the custom-built in-house web CMS (designed for newspapers!) and only then do you find out that the CMS has no notion of, oh I don’t know, how about stuff like “editions” or even freaking page numbers.
On top of that you get esoteric little nits like XML entity parsing errors. We spent a lot of time making things like m-dashes show up as &#151; in the XML feeds, only to find out that the parser being used by the CMS doesn’t handle that right at all and converts them to &amp;#151;, which gets you a literal &#151; on the page, which you might notice is not an m-dash.
That’s ok, there’s a bug report on it and it’s slated to be addressed in an update scheduled for Q1 2008, which is a nice way of saying, “fix it your own damn self.”
And in Q3 2008 when they roll in that fix, whose pager do you think is gonna go off?
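To make the entity problem concrete, here is a Python sketch of emitting an m-dash as a numeric character reference, and of what happens when something downstream escapes the ampersand a second time. The double-escaping step is my guess at the failure mode, not a description of the CMS’s actual code.

```python
from xml.sax.saxutils import escape


def to_ascii_xml(text: str) -> str:
    """Escape &, < and >, then turn every non-ASCII character into a
    numeric character reference so the feed stays pure ASCII."""
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


def escape_again(xml_fragment: str) -> str:
    """What a consumer produces if it escapes text that is already escaped."""
    return escape(xml_fragment)
```

Note that to_ascii_xml emits &#8212; for an m-dash, which is the Unicode code point; 151 is the Windows-1252 byte value for the same character. Either way, once the ampersand itself gets escaped, the browser shows the reference as literal text instead of a dash.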
So what’s my point
Most editors don’t care about any of this. They’re paying bottom dollar for a staff of 20-something-year-old upload monkeys to get their paper’s news online and rants over such obscure things as ASCII markup and XML entities are, to them, moot and irrelevant.
But my little rant isn’t just to vent my spleen (ok, maybe a little), it’s to illustrate a point. The entire technology foundation that many newspapers are built on today was never designed for web publishing (or any other non-newsprint-based publishing) and is in some cases so ill-suited to it as to become a huge road block to taking advantage of web-based opportunities. And I’m not just referring to editorial systems either, classified, advertising, billing… they all suffer like this.
Solutions become Rube Goldberg monstrosities duct-taped onto the outside of these legacy systems and we wonder why they don’t work worth a damn. We wonder why our butts get handed to us on almost a daily basis. These systems are broken and no amount of duct tape will fix them.
“Don’t give me a big mess of patches where each new patch you apply makes water squirt out somewhere else,” an editor once told me in a meeting about story exports to the web.
“But that’s what you already have!” I said.