Bring on them pings!

In standard web search, we seldom need results that include pages published just a few minutes ago.

This is very different for a semantic index that wants to help you build applications that respond to “information out there” as quickly as possible. One way to make this happen is to efficiently process large numbers of “pings”.

Pinging Sindice is simple… submit your URL and find your page in the index 15 minutes later. Go to http://sindice.com/main/submit and paste in your URL, or use the ping API described at http://sindice.com/developers/api.
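
For illustration, here is a minimal sketch of pinging programmatically from Java. The endpoint path and parameter name below are placeholders, not the documented API; see http://sindice.com/developers/api for the real details.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SindicePing {
        public static void main(String[] args) throws Exception {
            // The page you want indexed.
            String pageUrl = "http://example.org/foaf.rdf";
            // Placeholder endpoint -- check the API docs for the actual one.
            URL endpoint = new URL("http://sindice.com/main/submit");
            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");
            byte[] body = ("url=" + URLEncoder.encode(pageUrl, "UTF-8"))
                    .getBytes("UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            System.out.println("Ping response: " + conn.getResponseCode());
        }
    }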

At least, that’s the theory. In practice, this has historically never worked very well in Sindice.

Up until two weeks ago.

As usual, the story starts earlier. Three months ago I answered “yes” when Giovanni asked “can you fix our ping manager?”. I didn’t know what I was letting myself in for… but it’s fixed now and Giovanni is a happy man.

Here’s how it works:

  1. Pings are received by a servlet and inserted into a MySQL database with a status of ‘queued’.
  2. Every 5 minutes the ping manager wakes up and looks for any queued pings. For each ping (a sketch of this loop follows the list):
    - change the ping status to ‘in_progress’
    - fetch the ping using any23
    - apply TBox reasoning, recursively loading referenced ontologies and fetching any missing ones as required
    - apply ABox reasoning
    - update the Sindice page repository
    - update the Sindice master index (SIREn)
    - change the ping status to ‘processed’.
  3. Every 20 minutes copy the master (writable) index to the slave (read-only) index.
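
For the curious, here is a minimal sketch of that polling loop in Java, assuming a JDBC connection to the pings table. The table, column, and method names are illustrative, and the pipeline stages are stubbed out.

    import java.sql.*;

    public class PingManager {
        private final Connection db;

        public PingManager(Connection db) { this.db = db; }

        // Invoked every 5 minutes by a scheduler.
        public void processQueuedPings() throws SQLException {
            try (Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT id, url FROM pings WHERE status = 'queued'")) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String url = rs.getString("url");
                    setStatus(id, "in_progress");
                    Object triples = fetchWithAny23(url);  // fetch + extract RDF
                    applyTBoxReasoning(triples);           // ontologies loaded recursively
                    applyABoxReasoning(triples);
                    updatePageRepository(url, triples);
                    updateMasterIndex(url, triples);       // SIREn
                    setStatus(id, "processed");
                }
            }
        }

        private void setStatus(long id, String status) throws SQLException {
            try (PreparedStatement ps = db.prepareStatement(
                    "UPDATE pings SET status = ? WHERE id = ?")) {
                ps.setString(1, status);
                ps.setLong(2, id);
                ps.executeUpdate();
            }
        }

        // Stubs standing in for the real pipeline stages.
        private Object fetchWithAny23(String url)             { return url; }
        private void applyTBoxReasoning(Object triples)       { }
        private void applyABoxReasoning(Object triples)       { }
        private void updatePageRepository(String u, Object t) { }
        private void updateMasterIndex(String u, Object t)    { }
    }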

Some of the changes we’ve made recently to improve the ping manager are:

  1. Precise monitoring of each stage of the ping manager in our master control panel
  2. Reports and triggers when something goes wrong
  3. We added a ‘slow queue’ for large sets of pings. While we typically receive about 5000 pings each day, we have received as many as 100,000 from a single site! In these cases we move the large set of pings into the ‘slow queue’, allowing them to be processed with a lower priority than individual pings or smaller data sets (see the sketch after this list).
  4. We’ve done a lot of refactoring to handle large ontologies and to limit ontology-fetching recursion.
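
As an illustration of the slow queue, here is a sketch of how the demotion might work, assuming the queue marker lives in the same pings table. The table and column names and the daily threshold are made up for the example.

    import java.sql.*;

    public class SlowQueue {
        // Illustrative threshold: a site submitting more pings than this
        // in one day gets demoted to the low-priority queue.
        private static final int DAILY_THRESHOLD = 10000;

        public static void demoteBulkSubmitters(Connection db) throws SQLException {
            try (Statement st = db.createStatement()) {
                // The derived table ("bulk") works around MySQL's rule
                // against reading the updated table in a subquery.
                st.executeUpdate(
                    "UPDATE pings SET queue = 'slow' " +
                    "WHERE status = 'queued' AND site IN (" +
                    "  SELECT site FROM (" +
                    "    SELECT site FROM pings " +
                    "    WHERE received > NOW() - INTERVAL 1 DAY " +
                    "    GROUP BY site HAVING COUNT(*) > " + DAILY_THRESHOLD +
                    "  ) bulk)");
            }
        }
    }

The ping manager then simply drains the normal queue before touching the slow one.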

Still a work in progress, the upcoming improvements include:

  1. Batch processing of large sets of pings. We will fetch the pings with the Sindice crawler and process them in our Hadoop cluster. Note that this is what already happens when Sindice processes large sites via semantic sitemaps ( http://sw.deri.org/2007/07/sitemapextension/ ) or blocks of crawled data. Once this is done, ping volume will no longer be limited by a single machine.
  2. Loopback from the ping manager into the crawler. We will use the pings as seeds for the Sindice crawler, which will navigate sites honouring the robots.txt file.
  3. Auto refresh of pings. We will refresh our index automatically after your site changes in order to maintain up-to-date results.

Fifteen years in the software game and I’m still learning every day. Here are three important lessons I have learned or re-learned in the past few months:

  1. Caching is magic. Applying reasoning is a fairly expensive operation, but we’ve found that very many documents use the same set of ontologies. By caching each set of TBox reasoning results we save tons of computational overhead, far more than anticipated. Right now we are using JCS (http://jakarta.apache.org/jcs/), which is working great. Later I hope we can move to memcached (http://www.danga.com/memcached/) in order to a) easily share the cache between servers and b) automatically expire the cached results. (A sketch of the caching idea follows the list.)
  2. Reason in a sandbox. Renaud’s contextual reasoning approach [1] provides good and reliable results without any risk of contamination. Renaud is working on a blog article that will explain in practice what it does and what you can do to make your data work nicely with it.
  3. Java profilers are very useful. YourKit is great. (http://www.yourkit.com/java/profiler/)
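
Here is a minimal sketch of the caching idea using JCS, keyed on the set of ontology URIs a document uses. The region name and the stubbed-out reasoning step are made up for the example.

    import java.io.Serializable;
    import java.util.Set;
    import java.util.TreeSet;
    import org.apache.jcs.JCS;
    import org.apache.jcs.access.exception.CacheException;

    public class TBoxCache {
        private final JCS cache;

        public TBoxCache() throws CacheException {
            // "tboxResults" is an illustrative region name, configured in cache.ccf.
            cache = JCS.getInstance("tboxResults");
        }

        public Serializable getOrCompute(Set<String> ontologyUris) throws CacheException {
            // A sorted copy yields the same key for the same set of
            // ontologies, whatever order they were discovered in.
            TreeSet<String> key = new TreeSet<String>(ontologyUris);
            Serializable result = (Serializable) cache.get(key);
            if (result == null) {
                result = runTBoxReasoning(key);  // the expensive part
                cache.put(key, result);
            }
            return result;
        }

        // Stub standing in for the actual TBox reasoner.
        private Serializable runTBoxReasoning(Set<String> ontologies) {
            return new TreeSet<String>(ontologies);
        }
    }

Since many documents share the same ontology sets, most lookups hit the cache and the reasoner only runs for genuinely new combinations.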

So don’t forget, and don’t be shy: make sure your RDF-producing app gets indexed right away by pinging us whenever new information is made available.

Keep them pings coming! We are now ready for plenty of them.

And special thanks to Michele, Renaud, Stephen, Giovanni and the rest of the great team here at DERI, it’s a pleasure working with ye!


[1] R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2008), Karlsruhe, Germany, 2008.
