Monday, November 10, 2014

Preloading data at Solr startup

Yesterday I was explaining some Solr things to a friend of mine, so I had several cores with different configurations and some sample data.

While switching between one example and another, I realized that each (first) time I had to load and index the sample data manually. That was fine the very first time, but after that it was just repetitive work. So I started looking for some helpful built-in thing... and I found it (at least I think so): the SolrEventListener.

SolrEventListener is an interface that defines a set of callbacks on several Solr lifecycle events:
  • void postCommit()
  • void postSoftCommit()
  • void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher)
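Roughly, the whole contract looks like this (a simplified sketch; the exact declaration may differ slightly between Solr versions, and the init(...) callback shows up again below):

public interface SolrEventListener {
    // configuration hook: receives the parameters declared in solrconfig.xml
    void init(NamedList args);

    // called after a hard / soft commit
    void postCommit();
    void postSoftCommit();

    // called for both "firstSearcher" and "newSearcher" events (see below)
    void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher);
}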
For this example, I'm not interested in the first two callbacks because, as their names suggest, the corresponding invocations happen after hard and soft commit events.
The interesting method is instead newSearcher(...), which allows me to register a custom event listener associated with two events:
  • firstSearcher
  • newSearcher
In Solr, the index searcher which serves requests at a given time is called the current searcher. At startup time there's no current searcher because the first one is still being created; this is the "firstSearcher" event, which is exactly what I was looking for ;)

When another (i.e. new) searcher is opened, it is prepared (i.e. auto-warmed) while the current one still serves the incoming requests. When the new searcher is ready, it becomes the current searcher and handles any new search requests, while the old searcher is closed (as soon as all the requests it was serving have finished). This is the scenario where the "newSearcher" callback is invoked.

As you can see, the callback method for those two events is the same; there isn't a separate "firstSearcher" and "newSearcher" method. The difference resides in the input arguments: for "firstSearcher" events there is no current searcher, so the second argument is null; for "newSearcher" callbacks, instead, both arguments contain a valid searcher reference.
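In code, telling the two events apart is just a null check on the second argument; a minimal sketch:

@Override
public void newSearcher(final SolrIndexSearcher newSearcher, final SolrIndexSearcher currentSearcher) {
    if (currentSearcher == null) {
        // "firstSearcher" event: Solr is starting up, no searcher was serving requests yet
    } else {
        // "newSearcher" event: currentSearcher keeps serving requests
        // while newSearcher is being warmed
    }
}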

Returning to my scenario, all I need is:
  • to declare that listener in solrconfig.xml
  • a concrete implementation of SolrEventListener
In solrconfig.xml, within the <updateHandler> section, I can declare my listener:

<listener event="firstSearcher" class="a.b.c.SolrStartupListener">
    <str name="datafile">${solr.solr.home}/sample/data.xml</str>
</listener>

The listener will be initialized with just one parameter: the file that contains the sample data. Using the "event" attribute I can tell Solr which kind of event I'm interested in (i.e. firstSearcher).

The implementation class is quite simple: it implements SolrEventListener:

public class SolrStartupListener implements SolrEventListener

In the init(...) method it retrieves the input argument:

@Override
public void init(final NamedList args) {
    this.datafile = (String) args.get("datafile");
}
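
The datafile member is just a String field on the listener, and since the class implements the interface directly, the commit callbacks need (empty) implementations too; a sketch:

private String datafile;

@Override
public void postCommit() {
    // nothing to do here for this scenario
}

@Override
public void postSoftCommit() {
    // nothing to do here for this scenario
}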

Finally, the newSearcher(...) method preloads the data:

@Override
public void newSearcher(final SolrIndexSearcher newSearcher, final SolrIndexSearcher currentSearcher) {
    LocalSolrQueryRequest request = null;
    try {
        // 1. Create the arguments map for the update request
        final NamedList<Object> args = new SimpleOrderedMap<>();
        args.add(UpdateParams.ASSUME_CONTENT_TYPE, "text/xml");
        addEventParms(currentSearcher, args);

        // 2. Create a new local (update) request
        request = new LocalSolrQueryRequest(newSearcher.getCore(), args);

        // 3. Fill the request with the (datafile) input stream
        final List<ContentStream> streams = new ArrayList<>();
        streams.add(new ContentStreamBase() {
            @Override
            public InputStream getStream() throws IOException {
                return new FileInputStream(datafile);
            }
        });
        request.setContentStreams(streams);

        // 4. Create a new Solr response
        final SolrQueryResponse response = new SolrQueryResponse();

        // 5. And finally invoke the update handler
        SolrRequestInfo.setRequestInfo(new SolrRequestInfo(request, response));
        newSearcher
            .getCore()
            .getRequestHandler("/update")
            .handleRequest(request, response);
    } finally {
        // clear the per-thread request info and release the request
        SolrRequestInfo.clearRequestInfo();
        if (request != null) {
            request.close();
        }
    }
}


Et voilà, if you start Solr you will see the sample data loaded. Besides saving me a lot of repetitive work, this could be useful when you're using a SolrCore as a NoSQL storage, for example if you are storing SKOS vocabularies for synonyms, translations and broader / narrower searches.
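
A quick way to double-check that the preload worked is to query the core right after startup (assuming the core is named "example" and Solr runs on the default port, which of course depends on your setup):

curl "http://localhost:8983/solr/example/select?q=*:*&rows=0"

The numFound value should already match the number of documents in data.xml.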
