Wednesday, January 2, 2008

Relating content automatically in Plone

A question came up today on the Plone general mailing list (a.k.a. Plone-users): is it possible to create a list of related content automatically?

Well, the answer is yes and I'm going to tell you how.

Some time ago Benjamin Saller created a proof-of-concept product called Haystack to do auto-classification of content. Haystack was built around Open Text Summarizer and the haystack_tool included a couple of methods to summarize text and to get a list of "topics" extracted from the content. Haystack also included some portlets to demonstrate its functionality.

We used Haystack at La Jornada for some time with mixed results. The summarizer worked well: we called it via Ajax to fill in the description field of our content, which reduced the work of our publishers at editing time.

On the other hand, we used the "topics" to build a portlet that retrieved related content. The main problems here were the low quality of the "topics" and the way the relation was implemented. Sometimes we got embarrassing results, such as a story about Iraq being related to one about, let's say, Shakira, just because they shared a "topic".

Haystack didn't understand the meaning of words and, of course, Ben Saller was aware of that. The last time I saw him was at the Plone Conference 2006 in Seattle. He gave a talk on Haystack 2.0 and he was really excited about its new features: linguistic mapping and automated conceptual mapping, providing high-quality relationships with little or no human effort.

Unfortunately for us, Ben has been somewhat away from the Plone community for some time, so I don't know the status of his work.

Going back to the original question on the mailing list, Matt Bowen pointed out to me that Yahoo! has a web service called Term Extraction that does almost the same thing, and he even found a Python implementation for it.

I tested Term Extraction with some text in Spanish and I was very pleased with the results:

<ResultSet xsi:schemaLocation="urn:yahoo:cate http://api.search.yahoo.com/ContentAnalysisService/V1/TermExtractionResponse.xsd">
    <Result>wong kar wai</Result>
    <Result>stephen frears</Result>
    <Result>festival de cannes</Result>
    <Result>sean penn</Result>
    <Result>25 de mayo</Result>
    <Result>cines</Result>
    <Result>organizadores</Result>
    <Result>evidencia</Result>
    <Result>el presidente</Result>
    <Result>hace mucho tiempo</Result>
    <Result>afp</Result>
    <Result>ya</Result>
</ResultSet>
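
To use these terms from Python you first have to pull them out of the XML. What follows is only a minimal sketch using the standard library: the function name extract_terms and the shortened sample response are my own assumptions for illustration, not part of the web service or of the implementation Matt found. It ignores any XML namespace on the elements, so it does not matter whether the response declares one.

```python
import xml.etree.ElementTree as ET


def extract_terms(xml_text):
    """Return the text of every <Result> element, in document order.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    terms = []
    for element in root.iter():
        # Drop a leading "{namespace}" prefix, if the parser added one.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Result" and element.text:
            terms.append(element.text.strip())
    return terms


# Shortened, simplified copy of the response above (attributes omitted).
SAMPLE_RESPONSE = """<ResultSet>
    <Result>wong kar wai</Result>
    <Result>festival de cannes</Result>
    <Result>cines</Result>
</ResultSet>"""

print(extract_terms(SAMPLE_RESPONSE))
```

The resulting list of strings is what you would store in the Subject field or in a dedicated field.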

Implementing this in Plone doesn't seem too complicated: you can trigger a script in a workflow transition, or use Content Rules in Plone 3.0, to fill the Subject field or, better, to fill an additional field added to store this information. Just remember that the Term Extraction web service is limited to 5,000 queries per IP address per day.

Yes, I know this solution suffers from the same problems as Haystack, but the "topics" obtained here are of better quality and you can always find a better algorithm to compute the relation, such as requiring more than one shared "topic" or using only "topics" longer than one word.
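
Those two ideas can be sketched in a few lines of plain Python. This is just an illustration under my own assumptions, not tested inside Plone: the names multiword_topics and related_items, the min_shared parameter and the dictionary of item ids to "topics" are all hypothetical, and in a real site the "topics" would come from the catalog instead.

```python
def multiword_topics(topics):
    """Keep only "topics" longer than one word, normalized to lowercase."""
    return {topic.strip().lower() for topic in topics if len(topic.split()) > 1}


def related_items(item_id, topics_by_item, min_shared=2):
    """Return ids of items sharing at least min_shared multi-word "topics".

    Items with more shared "topics" come first; ties are sorted by id.
    """
    own_topics = multiword_topics(topics_by_item[item_id])
    scored = []
    for other_id, other_topics in topics_by_item.items():
        if other_id == item_id:
            continue
        shared = own_topics & multiword_topics(other_topics)
        if len(shared) >= min_shared:
            scored.append((len(shared), other_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored]


# Made-up example data: item ids mapped to their extracted "topics".
TOPICS_BY_ITEM = {
    "cannes-1": ["wong kar wai", "festival de cannes", "cines"],
    "cannes-2": ["festival de cannes", "wong kar wai", "ya"],
    "cannes-3": ["festival de cannes", "sean penn", "wong kar wai"],
    "music": ["shakira", "cines", "ya"],
}

print(related_items("cannes-1", TOPICS_BY_ITEM))
```

In this made-up data the "music" item shares the "topic" "cines" with "cannes-1", but a single one-word "topic" is no longer enough to relate them.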

Anyway, I will put this on my list of pending stuff to test (with a little help from Matt Bowen, of course).
