Sunday, December 04, 2016

Composing and reusing request handlers: the "Invisible Queries Request Handler"

Here is an extract of an old article [1] on Lucidworks.com by Grant Ingersoll:

"It is often necessary in many applications to execute more than one query for any given user query.  For instance, in applications that require very high precision (only good results, forgoing marginal results), the app. may have several fields, one for exact matches, one for case-insensitive matches and yet another with stemming.  Given a user query, the app may try the query against the exact match field first and if there is a result, return only that set.  If there are no results, then the app would proceed to search the next field, and so on."


The sentence above assumes the reader is able to change the (client) application behaviour, issuing several subsequent requests to Solr on top of a single user query.

What if you don't have such control? Imagine you're the relevance engineer of an e-commerce portal built with Magento, which, in this scenario, acts as the Solr client; someone installed and configured the Solr connector and everything is working: when the user submits a search, the connector forwards the request to Solr, which in turn executes a (single) query according to the configuration.

What if that query returns no results? The interaction is over, and the user will probably see something like "Sorry, no results for your search". Although this sounds perfectly reasonable, in this post I'd like to focus on an alternative approach (which could still end with a no-results message), based on the "invisible queries" idea described in the extract above.

The main point here is a precondition: I cannot change the client code. That could be because, for example:
  • I don't want to introduce custom code in Magento
  • I don't know PHP
  • I'm strictly responsible for the Solr infrastructure and the frontend developer doesn't want / is not able to implement this feature in a configurable way
  • I want to move as much of the search logic as possible into Solr
What I'd like to do is to provide a single entry point (i.e. one request handler) to my clients and, behind the scenes, execute a workflow like this:


In a few words: I want to execute several request handlers until one of them produces a positive result.
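The pattern itself, stripped of any Solr dependency, can be sketched in a few lines of plain Java (class and names are mine, just for illustration): each handler is modelled as a supplier of results, and the chain stops at the first non-empty outcome.

```java
import java.util.List;
import java.util.function.Supplier;

public class ChainSketch {
    // Executes the "handlers" in order and returns the result of the first
    // one producing a non-empty list; an empty list if nobody succeeds.
    static List<String> firstPositive(List<Supplier<List<String>>> chain) {
        return chain.stream()
                .map(Supplier::get)                     // run the handler
                .filter(results -> !results.isEmpty())  // skip "no results" outcomes
                .findFirst()                            // stop at the first hit
                .orElse(List.of());
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> chain = List.of(
                () -> List.of(),                  // e.g. exact-match handler: nothing found
                () -> List.of("doc1", "doc2"),    // e.g. stemmed handler: 2 results
                () -> List.of("doc3"));
        System.out.println(firstPositive(chain)); // prints [doc1, doc2]
    }
}
```

Note that, thanks to stream laziness, the handlers after the first positive one are never executed.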

Before diving into the implementation, which is very simple and can (will) be improved with a lot of cool things, here, in my github account, you can find a working version of the component:


Other than the commented source code, you can find a brief documentation, unit and integration tests and a Maven repository.

The underlying idea is to provide a Facade which is able to chain several handlers; something like this, in solrconfig.xml:


<requestHandler name="/search" class="...InvisibleQueriesRequestHandler">
    <str name="chain">/rh1,/rh2,/rh3</str>
</requestHandler>

where /rh1, /rh2 and /rh3 are standard SearchHandler instances you've already declared that you want to chain in the workflow described in the diagram above. 

The InvisibleQueriesRequestHandler implementation is very simple: as you can see the handleRequestBody method sequentially executes the configured handler references, and stops when a query returns positive results (i.e. numFound > 0):

chain.stream()
    // Get the request handler associated with a given name
    .map(refName -> requestHandler(request, refName))
    // Only SearchHandler instances are allowed in the chain
    .filter(SearchHandler.class::isInstance)
    // Execute the handler logic
    .map(handler -> executeQuery(request, response, params, handler))
    // Discard negative (i.e. no results) executions
    .filter(qresponse -> howManyFound(qresponse) > 0)
    // Stop at the first positive execution
    .findFirst()
    // or, if there are no positive executions, just return an empty response.
    .orElse(emptyResponse(request, response));

I tried to use a composed-method approach, so the remaining part of the class is made up of several small and (hopefully) cohesive methods. I think the code is more readable this way.

As I said before, although the handler works, it can be improved a lot. A useful thing could be, for example, to declare default parameters / values in the InvisibleQueriesRequestHandler and have each chained handler just override them (instead of declaring everything from scratch).
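For instance, the configuration could one day look like this (a hypothetical snippet: the defaults section here is the improvement idea, not something the current version supports):

```xml
<requestHandler name="/search" class="...InvisibleQueriesRequestHandler">
    <!-- Hypothetical: parameters shared by every handler in the chain -->
    <lst name="defaults">
        <str name="rows">10</str>
        <str name="df">title</str>
    </lst>
    <str name="chain">/rh1,/rh2,/rh3</str>
</requestHandler>
```

Each chained handler (/rh1, /rh2, /rh3) would then declare only the parameters it wants to override.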

If you want to give it a try without diving into the implementation details, there's a Maven repository with the latest stable version of the library; see the README in the github repository for detailed instructions.

Feel free to try and, as usual, any feedback is warmly welcome!  


Monday, November 21, 2016

Quickly debug your Solr add-on

Let’s say you need to write a component, or a request handler, or in general some piece of custom code that needs to be plugged in Solr. 

You will probably write unit tests, integration tests, everything needed to make sure things behave as you expect; but while developing, it is (at least for me) tremendously useful to have a productive debug environment where it is possible to follow, step by step, what's happening in my code or within the hosting framework (Solr, in this case), taking a deep look at how things actually work behind the scenes.


The following is the simple sequence of steps I usually follow to quickly set up such an environment.
All you need is:

  • Java (of course)
  • Eclipse or Intellij (I will use Eclipse in this post) equipped with Maven or Gradle
The approach is the same you would use for running integration tests with Solr, as described here

The difference, this time, is that you will use a "dummy" integration test that starts up Solr and just waits.
I know, you're thinking that's not an elegant solution...yeah, I agree with you, but I never said I was going to tell you about an elegant solution...just keep in mind that everything can be set up in 5 minutes...this is the real focus here, and in my opinion the most important thing; on top of that you can spend the next decade writing down the most beautiful-x-unit-oriented test suite, but you can do that beside this "dummy" test.

Step #1: Create a new Maven project


The pom.xml (you can find the complete pom.xml in github) has two interesting sections; the first is a central area where we declare the main dependency versions as general properties:

<properties>
     <jdk.version>1.8</jdk.version>
     ...
     <solr.version>6.2.1</solr.version>
</properties>

I usually declare this section because it is a central point where I can easily control the main artifact versions (e.g. if I want to switch to Solr 5.5 it's just a matter of replacing 6.2.1 with 5.5).


The second section includes the dependencies:

<dependency>
     <groupId>org.apache.solr</groupId>
     <artifactId>solr-core</artifactId>
     <version>${solr.version}</version>
</dependency>
<dependency>
     <groupId>org.apache.solr</groupId>
     <artifactId>solr-test-framework</artifactId>
     <version>${solr.version}</version>
</dependency>
...

Step #2: Write your add-on

The code I'm working on while writing this post is a custom RequestHandler (a nice subject for a next post); something like this:
public class CompositeSearchHandler extends SearchHandler { 
    @Override
    public void init(NamedList args) {
       super.init(args);
    }
 
    @Override
    public void handleRequestBody(
       final SolrQueryRequest request, 
       final SolrQueryResponse response) throws Exception {
  
      Arrays.stream(refNames(request.getParams()))
         .forEach(refName -> {
            requestHandler(refName).handleRequest(request, response);
         });
 ...
}
It's not important here what my custom code is actually doing; what I want is
  • put a debug breakpoint on the highlighted line (the handleRequest call)
  • run the component within Solr 
  • debug step by step what happens, starting from that line and going into the Solr codebase (RequestHandlerBase, in this case)

Step #3: Write a base (abstract) Solr IntegrationTest 


This will be the superclass of my "dummy" test: it will just provide a Solr environment. The core method is decorated with the @BeforeClass annotation:
public abstract class BaseIntegrationTest extends SolrJettyTestBase {
    protected static JettySolrRunner SOLR;

    @BeforeClass
    public static void init() { 
       System.setProperty(
              "solr.data.dir", initCoreDataDir.getAbsolutePath());
   
       try {
            SOLR = createJetty(
                 "path-to-solr-home",
                 JettyConfig.builder()
                    .setPort(8983)
                    .setContext("/solr")
                    .stopAtShutdown(true)
                    .build());  
      } catch (final Exception exception) {
        throw new RuntimeException(exception);
      }
    }
}

Step #4: Write the "dummy" test


I guess no explanation is needed here....
public class StartDevSolr extends BaseIntegrationTest {
    @Test
    public void start() throws IOException {
        System.out.println("Press any key to stop Solr");
        System.in.read(); 
    }
}

Step #5: Done! Start debugging


Right click on the test case and "Debug as -> JUnit test". As you can see, an embedded Solr is started, and your classes are there, loaded within the same JVM. Now it is up to you. In my example, the request handler answers on the "/sample" endpoint, so all I need is a command line like this:
> curl "http://127.0.0.1:8983/solr/gazza/sample?q=something" 
and as expected, this is what I see in Eclipse:


Enjoy!

Tuesday, March 29, 2016

Randomizing top-n results in Solr

So, after shuffling a bit [1] the top-n search results returned by Solr, you may want to effectively randomize them in a *non-repeatable* way. Why? I don't know...I'm just enjoying some coding experiment while I'm travelling :)

What I want to do is: run a query and (pseudo) randomly reorder the first top results. I will be using the query re-ranking feature again, but this time I need a re-ranking query that produces different results each time it is invoked.

I created a simple function [2] (i.e. a ValueSourceParser plus a ValueSource subclass) based on a (thread-local) java.util.Random instance which simply returns a (pseudo) random number each time it is invoked.
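Solr plumbing aside, the heart of such a function is just a per-thread random source; here is a minimal, self-contained sketch of the idea (class and method names are mine, not the ones in the gist):

```java
import java.util.Random;

public class RandomScoreSketch {
    // One Random per thread: concurrent searcher threads don't contend on
    // (or corrupt the internal state of) a shared instance.
    private static final ThreadLocal<Random> RANDOM =
            ThreadLocal.withInitial(Random::new);

    // What the ValueSource would answer for each document:
    // a (pseudo) random value in [0, 1), different on every invocation.
    public static float floatVal() {
        return RANDOM.get().nextFloat();
    }

    public static void main(String[] args) {
        System.out.println(floatVal());
        System.out.println(floatVal()); // almost surely a different value
    }
}
```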

Once the two classes have been packed in a jar, put under the lib folder and configured in solrconfig.xml with the name rnd:

<valueSourceParser name="rnd" class="com.faearch.search.function.RandomValueSourceParser"/>

I only need to use it in a re-rank query using the boost parser:

<requestHandler ...>
    <str name="rqq">{!boost b=rnd() v=$q}</str>
    <str name="rq">{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=1.2}</str>
...

You can now start Solr, index some documents, run the same query several times (by default ordered by score) and see what happens. Don't forget to include the score in the field list (fl) parameter; in this way you will see the concrete effect of the multiplicative random boost:

http://...?q=shoes&fl=score,*

1st time

<result name="response" numFound="2" start="0" maxScore="0.32487732">
  <doc>
    <str name="product_name">shoes B</str>
    <float name="score">0.32487732</float>
  </doc>
  <doc>
    <str name="product_name">shoes A</str>
    <float name="score">0.22645184</float>
  </doc>
</result>

2nd time (ooops, that's the same order...don't worry, it's the randomness, and I indexed only 2 docs; look at the score values, which are different from the previous example)

<result name="response" numFound="2" start="0" maxScore="0.61873287">
  <doc>
    <str name="product_name">shoes B</str>
    <float name="score">0.61873287</float>
  </doc>
  <doc>
    <str name="product_name">shoes A</str>
    <float name="score">0.3067757</float>
  </doc>
</result>
  
3rd time

<result name="response" numFound="2" start="0" maxScore="0.24988756">
  <doc>
    <str name="product_name">shoes A</str>
    <float name="score">0.24988756</float>
  </doc>
  <doc>
    <str name="product_name">shoes B</str>
    <float name="score">0.22548665</float>
  </doc>
</result>

See you next time ;)

[1] http://andreagazzarini.blogspot.it/2015/11/shuffling-top-results-with-query-re.html
[2] https://gist.github.com/agazzarini/a802eff3b50c03fae2364458719be94e

Sunday, November 08, 2015

Shuffling top results in Solr with query re-ranking

You built a cool e-commerce portal on top of Apache Solr; brands and shops are sending you their data in CSV and you index everything with little effort, just a matter of a few commands (more than one, as the content of each CSV slightly changes between sources).

Now it's search time but...yes, there's a but: sometimes the first top results (10, 20, 30 or more) belong to the same shop (or the same brand), even if other shops (or brands) have that kind of product.

For instance, a search for "shirt"
  • returns 5438 results in 109 pages (60 results / page)
  • the first 118 results (the first two pages) belong to the "C0C0BABE" brand
  • starting from the 119th result, other brands appear
This could be a problem, because sooner or later other brands will complain about that "hiding" issue: the impression rate of the third page is definitely lower than the first page's. As a consequence, it seems like your website is selling only items from "C0C0BABE".

What can we do? Results need to be sorted by score; any other criterion would necessarily compromise the computed relevancy.

Well, in this scenario I discovered the Query Re-Ranking [1] capability of Solr; I know, it is not a new feature, it was introduced in Solr a very long time ago...I just never met a scenario like this before ("Mater artium necessitas").

From the official Solr Reference Guide:

"Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance than just using the complex query B by itself"

The component interface is very simple. You need to provide three parameters:
  • reRankQuery: this is the query that will be used for re-ranking;
  • reRankDocs: the (minimum) number of top N results to re-rank; Solr could increase that number during the re-ranking
  • reRankWeight: a multiplicative factor applied to the score that each document in the top reRankDocs set gets from the reRankQuery. For each of those documents, that additional (weighted) score is added to the original score of the document (i.e. the score resulting from the main query)
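To make the combination concrete: if my reading of the parameters above is right, a document in the top reRankDocs that also matches the re-rank query ends up with its original score plus the weighted re-rank score. A tiny worked example (helper name is mine):

```java
public class ReRankScoreSketch {
    // Combined score as described above: the re-rank score, scaled by
    // reRankWeight, is added to the document's original score.
    static double combined(double originalScore, double reRankScore, double reRankWeight) {
        return originalScore + reRankWeight * reRankScore;
    }

    public static void main(String[] args) {
        // original score 0.8, re-rank query score 0.5, reRankWeight 1.2:
        // 0.8 + 1.2 * 0.5 = 1.4
        System.out.println(combined(0.8, 0.5, 1.2));
    }
}
```

Documents in the top set that do not match the re-rank query simply keep their original score.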
Cool! But the actual question was: what about the reRankQuery? I needed to emulate a random behaviour, like randomly querying a field with non-structured content. In the end that is exactly what I did: I saw in the schema a non-structured field, the product description, which contains free text.

Then I created a copy of that field in another searchable "shuffler" (Text) field, with a minimal text analysis (standard tokenization, lowercasing, word delimiter):
<field name="shuffler" type="unstemmed-text" indexed="true" .../>
<copyField src="prd_descr" dest="shuffler"/>
As a last step, I configured the request handler with the re-rank parameters as follows:
<str name="rqq">
    {!lucene q.op=OR df=shuffler v=$rndq}
</str>
<str name="rq">
     {!rerank reRankQuery=$rqq reRankDocs=220 reRankWeight=1.2}
</str> 
As you can see I'm using a plain Solr query parser for executing a search on the "shuffler" field mentioned above. What about the $rndq parameter? That is the query, which should contain a (probably long) list of terms. I defined a default value like this:
<str name="rndq">
    (just top bottom button style fashion up down chic elegance ... )
</str> 
What is the goal here? The default operator of the query parser has been set to OR, so the reRankQuery will give the first reRankDocs a chance to collect an additional "bonus" score if their shuffler field contains one or (better) more of the terms provided in the $rndq parameter.

The default value, of course, will always be the same, but a client could provide its own $rndq parameter with a different list of terms for each request.

For the other parameters (reRankWeight and reRankDocs) those are the values that work for me...you should run some tests with your dataset and adjust them.

The overall approach is not precise, not so deterministic...but it works ;)

-----------------------------------------------------

[1] https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking 
   

Saturday, October 17, 2015

How to do Integration tests with Solr 5.x

Please keep in mind that what is described below is valid only if you have a Solr instance with a single core. Thanks to +Alessandro Benedetti for alerting me on this sneaky stuff ;) 
I recently migrated a project [1] from Solr 4.x to Solr 5.x (actually Solr 5.3.1), and the only annoying part was a (small) refactoring of my integration test suite.

Previously, I always used the cool Maven Cargo plugin for running and stopping Solr (ehmm, Jetty with a solr.war deployed in it) before and after my suite. For those who are still using Solr 4.x, here [2] is the configuration. It is just a matter of a single command:

> mvn install

Unfortunately, Solr 5.x is no longer (formally) a web application, so I needed to find another way to run the integration suite. After googling a bit without finding a solution, I asked myself: "How do Solr folks run their integration tests?" and I found this artifact [3] on the Maven repository: solr-test-framework..."well, the name sounds good", I said. 

Effectively, I found a lot of ready-made things that do a lot of stuff for you. In my case, I only had to change my integration suite superclass a bit; actually a simple change, because I had to extend org.apache.solr.SolrJettyTestBase.

This class provides methods for starting and stopping Jetty (yes, still Jetty, because even if formally Solr is no longer a JEE web application, it actually still is, and it comes bundled with a Jetty, which provides the HTTP connectivity). Starting the servlet container in your methods is up to you, by means of the several createJetty(...) static methods. In addition, that class provides an @AfterClass annotated method which stops Jetty at the end of the execution, of course in case it has been previously started.

You can find my code here [4], any feedback is warmly welcome ;) 

--------------------------
[1] https://github.com/agazzarini/SolRDF
[2] pom.xml using the Maven Cargo plugin and Solr 4.10.4
[3] http://mvnrepository.com/artifact/org.apache.solr/solr-test-framework/5.3.1
[4] SolRDF test superclass using the solr-test-framework

Sunday, June 07, 2015

Towards a scalable Solr-based RDF Store

SolRDF (i.e. Solr + RDF) is a set of Solr extensions for managing (index and search) RDF data.



In a preceding post I described how to quickly set up a standalone SolRDF instance in two minutes; here, after some work more or less described in this issue, I'll describe in a few steps how to run SolRDF in a simple cluster (using SolrCloud). The required steps are very similar to what you (hopefully) already did for the standalone instance. 

All you need  

  • A shell  (in case you are on the dark side of the moon, all steps can be easily done in Eclipse or whatever IDE) 
  • Java 7
  • Apache Maven (3.x)
  • Apache Zookeeper  (I'm using the 3.4.6 version)
  • git (optional, you can also download the repository from GitHub as a zipped file)


Start Zookeeper 


Open a shell and type the following

# cd $ZOOKEEPER_HOME/bin
# ./zkServer.sh start

That will start Zookeeper in the background (use start-foreground for foreground mode). By default it will listen on localhost:2181.

Checkout SolRDF


If it is the first time you hear about SolRDF you need to clone the repository. Open another shell and type the following:

# cd /tmp
# git clone https://github.com/agazzarini/SolRDF.git solrdf-download

Alternatively, if you've already cloned the repository you have to pull the latest version, or finally, if you don't have git, you can download the whole repository from here.

Build and Run SolRDF nodes


For this example we will set-up a simple cluster consisting of a collection with two shards.

# cd solrdf-download/solrdf
# mvn -DskipTests \
    -Dlisten.port=$PORT \
    -Dindex.data.dir=$DATA_DIR \
    -Dulog.dir=$ULOG_DIR \
    -Dzk=$ZOOKEEPER_HOST_PORT \
    -Pcloud \
    clean package cargo:run

Where
  • $PORT is the hosting servlet engine listen port;
  • $DATA_DIR is the directory where Solr will store its datafiles (i.e. the index)
  • $ULOG_DIR is the directory where Solr will store its transaction logs.
  • $ZOOKEEPER_HOST_PORT is the Zookeeper listen address (e.g. localhost:2181)
The very first time you run this command a lot of things will be downloaded, Solr included. At the end you should see something like this:

[INFO] Jetty 7.6.15.v20140411 Embedded started on port [8080]
[INFO] Press Ctrl-C to stop the container...

the first node of SolRDF is up and running! 

(The command above assumes the node is running on localhost:8080)

The second node can be started by opening another shell and re-executing the command above

# cd solrdf-download/solrdf
# mvn -DskipTests \
    -Dlisten.port=$PORT \
    -Dindex.data.dir=$DATA_DIR \
    -Dulog.dir=$ULOG_DIR \
    -Dzk=$ZOOKEEPER_HOST_PORT \
    -Pcloud \
    cargo:run

Note:
  • the "clean package" goals have been omitted: you already did that in the previous step
  • you need to declare different parameters values (port, data dir, ulog dir) if you are on the same machine
  • you can use the same parameters values if you are on a different machine
If you open the administration console you should see something like this:



(Distributed) Indexing


Open another shell and type the following (assuming a node is running on localhost:8080):

# curl -v http://localhost:8080/solr/store/update/bulk \
    -H "Content-Type: application/n-triples" \
    --data-binary @/tmp/solrdf-download/solrdf/src/test/resources/sample_data/bsbm-generated-dataset.nt 


Wait a moment...ok! You just added 5007 triples! They've been distributed across the cluster: you can see that by opening the administration consoles of the participating nodes. Selecting the "store" core of each node, you can see how many triples have been assigned to that specific node.



Querying


Open another shell and type the following:

# curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...  

# curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+xml"
...

In the examples above I'm using only the node running on localhost:8080 (for both indexing and querying) but you can send the query to any node in the cluster. For instance you can re-execute the query above against the other node (assuming it is running on localhost:8081):

# curl "http://127.0.0.1:8081/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...  


You will get the same results.

Is that ready for a production scenario? No, absolutely not. I think a lot needs to be done on the indexing and querying optimization side. At the moment only the functional side has been covered: the integration test suite includes about 150 SPARQL queries (ASK, CONSTRUCT, SELECT and DESCRIBE) and updates (e.g. INSERT, DELETE) taken from the Learning SPARQL book [1], all working regardless of whether the target service is running as a standalone or clustered instance.

I will run the first benchmarks as soon as possible but honestly at the moment I don't believe I'll see high throughputs.

Best,
Andrea

[1] http://www.learningsparql.com

Sunday, April 19, 2015

RDF Faceting with Apache Solr: SolRDF

"Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters."
(Source: Wikipedia)

Apache Solr built-in faceting capabilities are nicely described in the official Solr Reference Guide [1] or in the Solr Wiki [2].
In SolRDF, due to the nature of the underlying data, faceted search assumes a shape which is a bit different from traditional faceting over structured data. For instance, while in a traditional Solr schema we could have something like this:

<field name="title" .../>
<field name="author" .../>

<field name="publisher" .../>
<field name="publication_year" .../>
<field name="isbn" .../>
<field name="subject" .../>
...


In SolRDF, data is always represented as a sequence of triples, that is, a set of assertions (aka statements) representing the state of a given entity by means of three or four members: a subject, a predicate, an object and an optional context. The holding schema, better described in a dedicated section of the Wiki, is, simplifying, something like this:

<!-- Subject -->
<field name="s" .../>

<!-- Predicate -->
<field name="p" .../>
 
<!-- Object -->
<field name="o" .../>

A "book" entity would be represented, in RDF, in the following way:

<#xyz>
    dc:title "La Divina Commedia" ;  
    dc:creator "Dante Alighieri" ;
    dc:publisher "ABCD Publishing";
    ...


A faceted search makes sense only when the target aggregation field or criterion leads to a literal value, a number, something that can be aggregated. That's why, in a traditional Solr index of books, you will see a request like this: 

facet=true 
&facet.field=year 
&facet.field=subject  
&facet.field=author
 
In the example above, we are requesting facets for three fields: year, subject and author.

In SolRDF we don't have such "dedicated" fields like year or author; we always have s, p, o and an optional c. Faceting on those fields, although perfectly possible using plain Solr facet fields (e.g. facet.field=s&facet.field=p), doesn't make much sense because their values are always URIs or blank nodes.

Instead, the field where faceting reveals its power is the object. But again, asking for plain faceting on the o field (i.e. facet.field=o) will result in a facet that aggregates apples and bananas: each object represents a different meaning; it could have a different domain and data-type. We need a way to identify a given range of objects.

In RDF, what determines the range of the object of a given triple is the second member, the predicate. So instead of indicating the target field of a given facet, we will indicate a query that selects a given range of object values. An example will surely be clearer.
Solr (field) faceting:

facet=true&facet.field=author

SolRDF (field) faceting:

facet=true&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>

The query will select all objects having an author as value, and then faceting will use those values. The same concept can be applied to range faceting.

Facet Fields


Traditional field faceting is supported in SolRDF: you can have a field (remember: s, p, o or c) treated as a facet by means of the facet.field parameter. All the other parameters described in the Solr Reference Guide [1] are supported. Some examples: 

Ex #1: field faceting on predicates with a minimum count of 1

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.mincount=1   

Ex #2: field faceting on subjects and predicates with a different minimum count


q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.field=p
&facet.field=s
&f.s.facet.mincount=1
&f.p.facet.mincount=10

Ex #3: field faceting on predicates with a prefix (Dublin Core namespace) and minimum count constraints

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.prefix=<http://purl.org/dc
   

Object Queries Faceting


Facet object queries have basically the same meaning as facet fields: the only difference is that, instead of indicating a target field, faceting is always done on the o(bject) field, and you indicate, with a query, which objects will be faceted. Some examples:

Ex #1: faceting on publishers


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/publisher>


Ex #2: faceting on names (creators or collaborators)


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator> p:<http://purl.org/dc/elements/1.1/collaborator>


Ex #3: faceting on relationships of a given resource


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=s:<http://example.org#xyz> p:<http://purl.org/dc/elements/1.1/relation>


The facet.object.q parameter can be repeated using an optional progressive number as a suffix in the parameter name:  

Ex #4: faceting on creators and languages


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator> &facet.object.q=p:<http://purl.org/dc/elements/1.1/language>


or

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
 
In this case you will get a facet for each query, keyed using the query itself:

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="p:<http://purl.org/dc/elements/1.1/creator>">
      <int name="Ross, Karlint">12</int>
      <int name="Earl, James">9</int>
      <int name="Foo, John">9</int>
      ...
    </lst>
    <lst name="p:<http://purl.org/dc/elements/1.1/language>">
      <int name="en">3445</int>
      <int name="de">2958</int>
      <int name="it">2865</int>
      ...
    </lst>
  </lst>
</lst>

The suffix in the parameter name is not required, but it is useful to indicate an alias for each query:

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>

&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>  
&facet.object.q.alias.1=author
&facet.object.q.alias.2=language 

The response in this case will be (note that each facet is now associated with the alias):

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="author">
      <int name="Ross, Karlint">12</int>
      <int name="Earl, James">9</int>
      <int name="Foo, John">9</int>
      ...
    </lst>
    <lst name="language">
      <int name="en">3445</int>
      <int name="de">2958</int>
      <int name="it">2865</int>
      ...
    </lst>
  </lst>
</lst>



Object Range Queries Faceting


Range faceting is described in the Solr Reference Guide [1] or in the Solr Wiki [2]. You can get this kind of facet on all fields that support range queries (e.g. dates and numerics).
A request like this:

facet.range=year
&facet.range.start=2000
&facet.range.end=2015
&facet.range.gap=1

will produce a response like this:


<lst name="facet_ranges">
  <lst name="year">
    <lst name="counts">
      <int name="2000">3445</int>
      <int name="2001">2862</int>
      <int name="2002">2776</int>
      <int name="2003">2865</int>
      ...
    </lst>
    <int name="gap">1</int>
    <int name="start">2000</int>
    <int name="end">2010</int>
  </lst>
  ...

 

As briefly explained before, with semi-structured data like RDF we don't have a "year" or "price" or any strictly dedicated field for representing a given concept; we always have 3 or 4 fields:
  • a s(ubject)
  • a p(redicate)
  • an o(bject)
  • and optionally a c(ontext)
Requesting something like this:

facet.range=o

wouldn't work: we would mix apples and bananas again. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date), how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Range faceting on the s, p or c attributes doesn't make any sense at all because the corresponding URI datatype (i.e. string) doesn't support range queries.

In order to enable range faceting on SolRDF, the default FacetComponent has been replaced with a custom subclass that does something I called Objects Range Query Faceting, which is actually a mix between facet ranges and facet queries.
  • Facet because, of course, the final results are a set of facets
  • Object because faceting uses the o(bject) field
  • Range because what we are going to compute are facet ranges
  • Queries because instead of indicating the target attribute in request (by means of facet.range parameter), this kind of faceting requires a facet.range.q which is a query (by default parsed by the Solr Query Parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on them.
In this way, we can issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
&facet.range.start=2000
&facet.range.end=2010
&facet.range.gap=1

or like this

facet.range.q=p:<http://c.d.e#release_date>  
&facet.range.start=2000-01-10T17:00:00Z 
&facet.range.end=2010-01-10T17:00:00Z
&facet.range.gap=+1MONTH

You can also have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
  <lst name="p:<http://a.b.c#start_year>">
    <lst name="counts">
      <int name="2000">3445</int>
      <int name="2001">2862</int>
      <int name="2002">2776</int>
      <int name="2003">2865</int>
      ...
    </lst>
    <int name="gap">1</int>
    <int name="start">2000</int>
    <int name="end">2010</int>
  </lst>
  <lst name="p:<http://c.d.e#release_date>">
    <lst name="counts">
      <int name="2000-03-29T17:06:02Z">2516</int>
      <int name="2001-04-03T21:30:00Z">1272</int>
      ...
    </lst>
    <int name="gap">+1YEAR</int>
    <int name="start">2000-01-10T17:00:00Z</int>
    <int name="end">2010-01-10T17:00:00Z</int>
  </lst>
  ...


Aliasing is supported in the same way described for Facet Objects Queries. The same request as above, with aliases, would be:

facet.range.q.1=p:<http://a.b.c#start_year>
&facet.range.q.alias.1=start_year_alias
&facet.range.q.hint.1=num  <-- optional, "num" (numeric) is the default value

&facet.range.start.1=2000 
&facet.range.end.1=2010
&facet.range.gap.1=1
&facet.range.q.2=p:<http://c.d.e#release_date>
&facet.range.q.alias.2=release_date_alias
&facet.range.q.hint.2=date
&facet.range.start.2=2000-01-10T17:00:00Z
&facet.range.end.2=2010-01-10T17:00:00Z
&facet.range.gap.2=+1MONTH


Note in the response the aliases instead of the full queries:


<lst name="facet_ranges">
    <lst name="start_year_alias">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="release_date_alias">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...

Here you can find a sample response containing all facets described above.

You can find the same content of this post in the SolRDF Wiki [3]. As usual any feedback is warmly welcome!

-------------------------------------
[1] https://cwiki.apache.org/confluence/display/solr/Faceting
[2] https://wiki.apache.org/solr/SolrFacetingOverview
[3] https://github.com/agazzarini/SolRDF/wiki