Enterprise Search at The Guardian

I went to this on Monday, and it seems to have been a bit of an activity trigger, so I thought I’d write it up. OK, this one is a bit of a copy-and-paste cheat: it’s from an internal post I made for other people’s interest at work, with internal references removed.

Earlier this week I went to http://www.meetup.com/es-london/calendar/14829629/ – a highly interesting & informative event about search technologies at The Guardian, a long-running UK newspaper at the forefront of digital publishing.

I’m expecting that the presentations given will be published online; in the interim, here are some points you may find interesting:

They started in the mid-90s with various in-house and customised search technologies, and search has since had to grow massively in scope and content
Now consolidating on Apache Solr – this is powerful enough to run out-of-the-box, yet scale to serve millions of documents across all their properties – they are even migrating 3rd-party applications
To ensure the search is scalable, their search schema is kept simple – most searches are by free text and curated tags. They then use search features such as faceted search & related items to do the heavy lifting
Expanding into Linked Data and Reference IDs – http://www.guardian.co.uk/open-platform/blog/linked-data-open-platform for more
Even though they are in the content publishing business, published content is basically a frontend to a bunch of search queries – ‘search as a platform’ – http://www.flax.co.uk/blog/2010/10/19/when-search-isnt-just-search-at-the-guardian/
They are constantly adding content via a system of automated indexing/replication, 24 hours a day. They have optimised their publishing system for continual, incremental updates rather than large pushes
They have also opened the technology, and the content, to the world – http://www.guardian.co.uk/open-platform describes their Open Platform
That link is a bit of a kid-in-a-candy-store one (this must be how our users feel when we do a Data Upload) – some sections I found useful are http://www.guardian.co.uk/open-platform/faq , http://www.guardian.co.uk/news/datablog , http://www.guardian.co.uk/open-platform/content-api-content-search-reference-guide , and http://explorer.content.guardianapis.com/ to explore the content itself. Two example queries:
http://content.guardianapis.com/search?q=spending+review&order-by=newest&format=json
http://explorer.content.guardianapis.com/#/search?tag=sport%2Fboxing%2Csport%2Fchess&order-by=newest&format=json
Their strategy appears to be, as a content organization, to invest in the platform as a core resource, and then expose it for use by partners – “Our vision is to weave the Guardian into the fabric of the Internet, to become ‘of’ the web rather than ‘on’ the web.” They are already seeing large benefits with their commercial partners from this approach – http://blog.rodger-brown.com/2010/10/guardian-open-platform-it-or-innovation.html
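To make the "free text plus curated tags" idea a bit more concrete, here is a rough Python sketch of how the query strings in the example links above could be put together. It only builds URLs and does not call anything. The function names are my own (hypothetical), and the parameter names `q`, `tag`, `order-by` and `format` are simply the ones visible in those links; the API may have changed since, or may need a key. The second helper shows what a faceted Solr query might look like using Solr's standard `facet` and `facet.field` parameters. The Solr URL and the `tag` field name there are pure assumptions for illustration, not The Guardian's actual schema.

```python
from urllib.parse import urlencode

# Endpoint taken from the example link in this post; it may have changed since.
CONTENT_API_SEARCH = "http://content.guardianapis.com/search"


def build_search_url(q=None, tags=None, order_by="newest", fmt="json"):
    """Build a Content API search URL from free text and/or curated tags.

    Hypothetical helper: parameter names mirror the example links only.
    """
    params = {}
    if q:
        params["q"] = q
    if tags:
        # Multiple tags are joined with commas, as in the boxing/chess link.
        params["tag"] = ",".join(tags)
    params["order-by"] = order_by
    params["format"] = fmt
    # urlencode turns spaces into '+' and escapes '/' and ','.
    return CONTENT_API_SEARCH + "?" + urlencode(params)


def build_solr_facet_url(solr_select_url, text, facet_field="tag", rows=10):
    """Build a faceted Solr select URL.

    Assumption: 'tag' is an illustrative field name, not a known schema field.
    """
    params = {
        "q": text,
        "facet": "true",             # switch faceting on
        "facet.field": facet_field,  # count matching documents per value of this field
        "rows": rows,
        "wt": "json",
    }
    return solr_select_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url(q="spending review"))
    print(build_search_url(tags=["sport/boxing", "sport/chess"]))
    # Placeholder local Solr address, for illustration only.
    print(build_solr_facet_url("http://localhost:8983/solr/select", "spending review"))
```

The first two calls reproduce the query strings from the example links, which is a reasonable sanity check that a simple schema really can be driven by a handful of parameters.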