hvelarde: December 2007

Thursday, December 27, 2007

Mapping NITF into Plone's metadata

As I mentioned previously, Plone's standard News Items do not have enough metadata to be used seriously in a newspaper publication.

When we started using Plone for our breaking news site, we tried to fill the gap using some fancy Web 2.0 features like tags and tag clouds. I was convinced (well, in fact I am still convinced) that online media should face the problem of organizing news in a different way. Unfortunately, the implementation of this solution was problematic from the beginning.

First, tag clouds are wild beasts and you have to learn a lot before trying to implement one (my early version of the TagCloud product proved to have a lot of flaws). Second, social bookmarking in our environment showed a lot of disadvantages, like the lack of a controlled vocabulary, the use of "unclear" tags and spelling errors.

Worst of all, publishers never liked the concept.

A tag cloud was used for navigation in the microsite for the Mexican general election in 2006

I also think many ordinary readers never understood the tag cloud either, because not many newspapers were using it at that time (even today, almost two years later, it's difficult to find tag clouds in online versions of traditional media).

In the end we just had to abandon the idea, and it became quite evident that we needed to extend the functionality of our site to include concepts like sections and a way to indicate whether one news article was more important than another.

At La Jornada we have been using NITF to store news articles for the printed edition's site for some time, and it has simplified our work. We needed to bring that experience to the breaking news site.

As you can see in its documentation, NITF defines a lot of metadata to be used at the different stages of a news article's life. So the first thing I did was map Plone's metadata to NITF elements and attributes:

Subject (nitf/head/docdata/key-list): list of keywords; holds a list of keywords about the document
Contributors (nitf/body/body.end/tagline): a byline at the end of a story
Creation Date (nitf/head/docdata/date.issue): date/time document was issued
Last Modified Date (nitf/head/revision-history): information about the creative history of the document; also used as an audit trail (includes who made changes, when the changes were made, and why)
Effective Date (nitf/head/docdata/date.release): date/time document is available to be released
Expiration Date (nitf/head/docdata/date.expire): date/time at which the document has no validity
Language (nitf/body/@xml:lang): language value governed by RFC3066
Rights (nitf/head/docdata/doc.copyright): copyright information for document header
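The mapping above can be kept as plain data so that code on either side can look it up. This is only a minimal sketch: the names PLONE_TO_NITF and split_location are hypothetical, and treating a trailing "@name" segment as an attribute is my own convention for reading the paths listed above.

```python
# Plone metadata field -> NITF location, copied from the list above.
# A trailing "@name" segment denotes an attribute rather than an element.
PLONE_TO_NITF = {
    "Subject": "nitf/head/docdata/key-list",
    "Contributors": "nitf/body/body.end/tagline",
    "Creation Date": "nitf/head/docdata/date.issue",
    "Last Modified Date": "nitf/head/revision-history",
    "Effective Date": "nitf/head/docdata/date.release",
    "Expiration Date": "nitf/head/docdata/date.expire",
    "Language": "nitf/body/@xml:lang",
    "Rights": "nitf/head/docdata/doc.copyright",
}


def split_location(location):
    """Split a location into (element_path, attribute_name or None)."""
    path, separator, attribute = location.partition("/@")
    return (path, attribute if separator else None)
```

With this helper, split_location(PLONE_TO_NITF["Language"]) yields the element path and the attribute name separately, while element-only entries come back with None as the attribute.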

Then I identified the information we were missing:

Property (nitf/head/tobject/tobject.property/@tobject.property.type): subject code property; includes such items as analysis, feature, and obituary. In our case we use it to differentiate news articles produced in-house from the ones written by our associates
Section (nitf/head/pubdata/@position.section): named section of a publication where a news object appears, such as Science, Sports, Weekend, etc.
Urgency (nitf/head/docdata/urgency/@ed-urg): is used to define the importance of a news article (1=most, 5=normal, 8=least)
Byline (nitf/body/body.head/byline): container for byline information; it can be unstructured text or structured text with direct specification of the responsible person/entity and their title
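To make the four missing fields concrete, here is a rough sketch that builds the corresponding NITF skeleton with Python's standard xml.etree.ElementTree module. The function name build_nitf and the sample values are hypothetical, the 1 to 8 range check simply follows the urgency scale quoted above, and the output is not validated against the NITF DTD.

```python
import xml.etree.ElementTree as ET


def build_nitf(property_type, section, urgency, byline):
    """Build a bare NITF tree holding only the four fields listed above."""
    if not 1 <= urgency <= 8:
        raise ValueError("urgency must be between 1 (most) and 8 (least)")

    nitf = ET.Element("nitf")
    head = ET.SubElement(nitf, "head")

    # nitf/head/tobject/tobject.property/@tobject.property.type
    tobject = ET.SubElement(head, "tobject")
    ET.SubElement(tobject, "tobject.property",
                  {"tobject.property.type": property_type})

    # nitf/head/docdata/urgency/@ed-urg
    docdata = ET.SubElement(head, "docdata")
    ET.SubElement(docdata, "urgency", {"ed-urg": str(urgency)})

    # nitf/head/pubdata/@position.section
    ET.SubElement(head, "pubdata", {"position.section": section})

    # nitf/body/body.head/byline
    body = ET.SubElement(nitf, "body")
    body_head = ET.SubElement(body, "body.head")
    ET.SubElement(body_head, "byline").text = byline

    return nitf


# Hypothetical sample values, for illustration only.
article = build_nitf("feature", "Sports", 5, "By Jane Doe")
print(ET.tostring(article, encoding="unicode"))
```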

With this in mind, I started looking for the easiest way to accomplish the task.

Friday, December 21, 2007

Plone development in La Jornada. A little bit of history

Back in 1995, La Jornada was the first Spanish-language newspaper to have a web site. Over the years, the site evolved from static HTML files, written entirely by hand, to some dynamic content creation in PHP, but the work involved in maintaining the content and structure was huge. We needed a change, so in 2005 we started looking for a CMS.

The web site of La Jornada in 1996

I asked La Mancha, a friend of mine who was one of the visionaries who brought the newspaper onto the web, and he gave me two options: one was written in Perl and the other one in Python. The choice was obvious for me, and we soon started looking for more information on Plone.

I wrote to the guys at Enfold Systems and they put me in touch with Carlos de la Guardia. In three weeks, Carlos gave us an introduction to the Python/Zope/Plone world and he started acting as a consultant for the development.

Those were the early days of Plone 2.1 and Archetypes was the way to go. Our idea was to create content types for everything: editions, current news, analysis, features, opinions... you name it.

We spent several months working on that and, I have to admit, our first tests were not very successful. We had many things to learn and very little experience in many topics.

In the middle of 2006 we had to create a microsite to cover the Mexican general election. We decided to use a plain-vanilla Plone instance with no funky stuff on it and just install a basic skin. The experience was so good that we started using the same system, with some minor changes, as the breaking news edition of the newspaper.

In late 2006, when we were about to restart the work, I traveled to Seattle for the Plone Conference 2006 and, after attending Martin Aspeli's talk, I finally saw the light: the way we'd been working was fine... but there were better and easier ways to do it!

Moving from content types to adapters was the new paradigm, and we wanted to embrace it as soon as possible. Unfortunately, we couldn't restart the work because we started experiencing performance issues as a result of the increasing traffic. We spent many months understanding and fixing the problems.

The graph shows the traffic growth in La Jornada in the last 5 years (from 2003 to 2007)

The main development has been on hold since then, but we've been slowly releasing some of our work. We decided to move some of our products to the Collective in order to enhance collaboration.

I've been talking lately with Carlos about current trends in Plone and Zope deployment, and we now have a better idea of what we want to implement in the near future.

Tuesday, December 18, 2007

Beyond News Items: the need for news industry standards in Plone

News articles in Plone are instances of the News Item content type: they can contain a title, a description, a body text, an image and some basic metadata. If you publish a couple of items from time to time, this is fine.

But suppose you have to publish dozens of items every day... How do you tell your readers who they are about? What do they cover? Where did they take place? And, more importantly, how do you classify them? How do you organize them? How do you tell your readers which ones are newsworthy?

To solve these and other issues, the IPTC developed XML standards to define the content and structure of news articles. NITF, NewsML and NewsCodes are among these standards, and they support the classification, identification and description of a huge number of news article characteristics.

NITF and NewsML have different uses: NITF is intended to structure independent news articles; NewsML is for the structuring of multimedia news packages. Inside a NewsML object we can have things like alternative representations of the same article to be used on different media, or different translations. NewsML can also be used to encapsulate many NITF objects to form an edition.

NewsCodes are about consistent coding of news metadata (taxonomies). NewsCodes define sets of topics to be assigned as metadata values to any object inside a news article (text, photos, graphs, audios and videos).

As you can imagine this is quite powerful. Typical uses of these standards are: in and between editorial systems, between news agencies and their customers, between publishers and news aggregators, and between news service providers and end users. Many content providers and system vendors currently support them worldwide.

If Plone could understand NITF and NewsML, it would be easy to interoperate with editorial systems and feed a whole edition of a newspaper or magazine into a website, even in different languages and formats, and including all sorts of multimedia content. You could also use Plone to distribute your content to different aggregators or sell it to your customers.

If Plone could handle NewsCodes it would be easier to describe, manage, transmit and exchange news articles.

That's a lot of work, and that's what we had in mind when we started the Julius project in early 2006.

Wednesday, December 5, 2007

ZODB performance: a small change can make a big difference

Tuning Zope's ZODB performance can be a big challenge. Almost everything you can find on the web about it can be summed up as follows: use the Edisonian approach, that is, trial-and-error discovery.

At La Jornada we are using ZEO to serve about 500,000 pages to more than 60,000 visitors every day. Some weeks ago I noticed we were facing performance issues with the servers holding most of the ZEO clients of our installation: at peak times the load was so high that we were unable to serve some pages.

At first I thought I had made a mistake with the amount of memory dedicated to the servers but, based on our previous experience with the main server, I started investigating how to decrease CPU usage, looking for bottlenecks with the usual Linux tools like vmstat and iostat.

I didn't find anything clear but, suddenly, I noticed that the clients on the main server were also acting strangely: one of them had very high CPU usage while the other had almost none. I started looking for a mistake in the load-balancing configuration, but after some time I discovered that one of the ZEO clients had a value of 20,000 in the cache-size directive and the other, the one running smoothly, a value of 30,000.

I made the following change in the zope.conf file of all my ZEO clients:

<zodb_db main>
    # target number of objects kept in each connection's cache
    cache-size 30000
    <zeoclient>
    …
    </zeoclient>
    mount-point /
</zodb_db>

After restarting them I was shocked by the results: a minimal increase in memory usage and an amazing decrease in CPU usage. You can see below the behavior of one of the servers in the previous month (the change in the configuration occurred in the middle of the graph):

This chart shows processor load on a ZEO server in the previous month

So, talking about cache-size, how big is big enough? According to Chris McDonough's excellent presentation at the Plone Symposium 2005 in New Orleans: as big as you can make it without causing your machine to swap.

And, how many objects can we store given a certain amount of memory? According to Dieter Maurer, as the size of objects can vary unboundedly, this gives very imprecise control over the main memory used for the cache.

In our case, with about 700,000 objects in the ZODB, memory consumption is around 1GB on a cluster of 3 ZEO clients (each with 4 threads and 30,000 objects in cache).
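As a back-of-the-envelope check of those numbers, the sketch below counts how many cache slots the cluster has in total. It assumes that each thread works with its own connection and that cache-size applies to each connection separately. The function name is hypothetical, and the total is an upper bound on cache slots, not a count of distinct objects, because the same object can sit in several connection caches at once.

```python
def max_cache_slots(zeo_clients, threads_per_client, cache_size):
    """Upper bound on cached objects across all connection caches."""
    return zeo_clients * threads_per_client * cache_size


# Figures quoted in the post.
total_slots = max_cache_slots(zeo_clients=3, threads_per_client=4,
                              cache_size=30000)

# Share of the roughly 700,000 objects that a single connection can hold.
share_per_connection = 30000 / 700000.0

print(total_slots)
print(round(share_per_connection * 100, 1))
```

This says nothing about bytes per object: as noted above, object sizes vary too much for the object count to translate into a precise memory figure.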

As we reserved 2GB for this server, we still have some room to grow; we only need to find some time to test it.

Monday, December 3, 2007

Implementing a photo gallery viewlet for CompositePack using jQuery

CompositePack is a beautiful piece of code written by Godefroid Chapelle. We use CompositePack at La Jornada to create the frontpage for the breaking news edition.

Some time ago I was asked to create a viewlet for a photo gallery and we started testing some gallery products to accomplish this task. We tried FriendlyAlbum, Plone SmoothGallery, plonegalleryview and, the best and most promising by far, Slideshow Folder.

All products had limitations that kept me from using them as a base for my solution. I didn't want to create any new content type either, so the approach I followed was this:

use a folder as a container for the gallery; the name and description of the folder would be the title and introduction text for the gallery
use Image as the content type for the photos; the name and description of every image would be the alternative text (the alt attribute) and caption of the photo

With this in mind, I started analyzing how to integrate one of the many JavaScript libraries available to create the gallery. I wanted to use KSS, as this framework is on its way to becoming the standard in the Plone world, but I soon abandoned the idea. KSS is still under development and most of the current work is being done on the Plone 3.0 branch. The only thing I found for Plone 2.5 was a product named PloneAzax with a 1.0 release in alpha state. PloneAzax was more a demo of how AJAX would be used in Plone 3.0 than a usable product. Worse: it had some conflicts with CompositePack and I didn't want to mess with it.

After some research on the web I decided to use the jQuery JavaScript library. jQuery is fast and concise, and it lets you traverse HTML documents and handle events. From here all my work was pretty straightforward... well, I had to fight a little bit with jQuery and IE, but that's another story.

I created the viewlet the usual way and I inserted the JavaScript code with the help of a Python script. To get the list of images in a given folder I used context.atctListAlbum(images=1). I took this idea from atct_album_view.pt, the page template used to view the images in a folder as small thumbnails. I registered the viewlet for the Folder content type using a modified version of compositepack-customisation-policy.py, and the JavaScript file using portal_javascripts, the JavaScript Registry.
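The idea of turning a folder into gallery data can be sketched roughly as follows. This is not the actual script: gallery_items is a hypothetical name, and FakeImage and FakeFolder are stand-ins so the example runs outside Plone. It assumes that atctListAlbum(images=1) returns a dictionary whose "images" entry holds the Image objects, which is how atct_album_view.pt appears to use it, and that each image offers the usual Title(), Description() and absolute_url() accessors.

```python
def gallery_items(folder):
    """Return one dict per image: URL, alt text and caption."""
    album = folder.atctListAlbum(images=1)
    return [
        {
            "src": image.absolute_url(),
            "alt": image.Title(),            # name -> alt attribute
            "caption": image.Description(),  # description -> caption
        }
        for image in album["images"]
    ]


class FakeImage:
    """Stand-in for a Plone Image object (illustration only)."""

    def __init__(self, url, title, description):
        self._url = url
        self._title = title
        self._description = description

    def absolute_url(self):
        return self._url

    def Title(self):
        return self._title

    def Description(self):
        return self._description


class FakeFolder:
    """Stand-in for a Plone Folder exposing atctListAlbum (assumed shape)."""

    def __init__(self, images):
        self._images = list(images)

    def atctListAlbum(self, images=0, folders=0, subimages=0, others=0):
        return {"images": self._images if images else []}


folder = FakeFolder([
    FakeImage("http://example.org/gallery/one.jpg", "First photo", "A caption"),
])
print(gallery_items(folder))
```

Inside Plone the folder argument would be the context of the script, and the resulting list could then be handed to the JavaScript that drives the gallery.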

The photo gallery is working fine on IE 6.0+, Firefox 1.5+, Safari 2.0+ and Opera 9.0+. Please note that only one gallery is allowed per page. It would be nice to add some effects, but that will have to wait.