deep dive on fulltext indexing with Neo4j

In a previous blog post I explained the differences between the types of indexes available in Neo4j. A common requirement in a lot of projects is fulltext indexing. With current versions of Neo4j (2.1.5 as of now) this can only be accomplished with manual indexes.

In this article I want to explain how you can use language specific analyzers for fulltext indexing and how to run regex searches against such indexes.

The reference manual section on fulltext indexing mentions that you can provide a custom analyzer class by specifying a config parameter analyzer upon index creation. Its value is the full class name of the analyzer. There are two ways to create a manual index like this, either using the Java API

GraphDatabaseService graphDb = ....
IndexManager indexManager = graphDb.index();
try (Transaction tx = graphDb.beginTx()) {
    // "analyzer" holds the fully qualified class name of the custom analyzer
    Map<String,String> params = Collections.singletonMap("analyzer",
        "my.package.Analyzer");
    Index<Node> index = indexManager.forNodes("myfulltextindex", params);
    tx.success(); // mark the transaction as successful
}

or using the REST API (here with the wonderful httpie HTTP command line client)

http -v -j localhost:7474/db/data/index/node \
   name=myfulltextindex config:='{"analyzer":"my.package.Analyzer"}'

Lucene provides an optional set of language specific analyzers. These analyzers have some knowledge of the language they're operating on and use it for word stemming; see http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene for details on the internals of the GermanAnalyzer. As an example, the German word for houses, “Häuser”, is stemmed to its singular form “Haus”. Consequently a query for “Haus” retrieves occurrences of both “Haus” and “Häuser”.

The language specific analyzers reside in an optional jar file called lucene-analyzers-3.6.2.jar that does not ship with Neo4j by default. Therefore copy lucene-analyzers-3.6.2.jar into Neo4j’s plugins folder.

When you then try to use e.g. Lucene’s GermanAnalyzer with

http -v -j localhost:7474/db/data/index/node name=fulltext_de \
   config:='{"analyzer":"org.apache.lucene.analysis.de.GermanAnalyzer"}'

you get back an HTTP status 500. The log files show a strange exception: java.lang.InstantiationException: org.apache.lucene.analysis.de.GermanAnalyzer. The reason for this exception is that Neo4j tries to instantiate the analyzer class using a no-arg default constructor. Unfortunately Lucene’s language specific analyzers don’t have such a constructor, see the javadocs. The solution is to write a thin analyzer class with a default constructor. Internally that class uses the Lucene provided analyzer as a delegate.
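The failure and the fix can be reproduced without Lucene. The following is a minimal sketch in plain Java: VersionedAnalyzer and DefaultAnalyzer are hypothetical stand-ins (not Lucene or Neo4j classes), and the reflective call is only an assumption about roughly how a class name from the config gets instantiated, based on the exception seen in the log.

```java
import java.util.Locale;

// Hypothetical stand-in for an analyzer that can only be constructed with an
// argument, similar in spirit to Lucene's language specific analyzers.
class VersionedAnalyzer {
    private final String version;

    VersionedAnalyzer(String version) {
        this.version = version;
    }

    String normalize(String text) {
        return text.toLowerCase(Locale.ROOT);
    }
}

// Thin wrapper: offers the no-arg constructor and forwards all work to the delegate.
class DefaultAnalyzer {
    private final VersionedAnalyzer delegate = new VersionedAnalyzer("3.6");

    DefaultAnalyzer() {
    }

    String normalize(String text) {
        return delegate.normalize(text);
    }
}

class InstantiationDemo {
    // Creates an instance from a class name via its no-arg constructor.
    // Returns null (and prints the exception) if that is not possible.
    @SuppressWarnings("deprecation")
    static Object instantiateOrNull(String className) {
        try {
            return Class.forName(className).newInstance();
        } catch (ReflectiveOperationException e) {
            System.out.println("failed: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        // No no-arg constructor available: instantiation fails.
        Object direct = instantiateOrNull(VersionedAnalyzer.class.getName());
        System.out.println("direct instance created: " + (direct != null));

        // The wrapper has a no-arg constructor, so the same call succeeds.
        DefaultAnalyzer wrapper =
            (DefaultAnalyzer) instantiateOrNull(DefaultAnalyzer.class.getName());
        System.out.println(wrapper.normalize("Haus"));
    }
}
```

A real wrapper would extend Lucene's Analyzer and delegate to e.g. GermanAnalyzer instead of these stand-ins, but the shape of the problem is the same.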

In order to simplify the process of setting this up I’ve created a small project on GitHub called neo4j-fti. It contains the mentioned wrappers in package org.neo4j.contrib.fti.analyzers for all languages that have a Lucene analyzer. It also provides a kernel extension for Neo4j that automatically creates fulltext indexes based on a config option. In neo4j.properties you need to set:

fullTextIndexes=fulltext_de:org.neo4j.contrib.fti.analyzers.German,\
    fulltext_en:org.neo4j.contrib.fti.analyzers.English
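The value of fullTextIndexes is a comma-separated list of indexName:analyzerClass pairs. As a rough illustration of that format, a parser for it could look like the sketch below. This is not the actual neo4j-fti code, which may handle the option differently; the class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FullTextIndexConfig {
    // Parses "name1:analyzerClass1,name2:analyzerClass2" into an ordered map
    // from index name to analyzer class name.
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty entries such as a trailing comma
            }
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                    "expected indexName:analyzerClass but got: " + trimmed);
            }
            result.put(trimmed.substring(0, colon).trim(),
                       trimmed.substring(colon + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> indexes = parse(
            "fulltext_de:org.neo4j.contrib.fti.analyzers.German,"
            + "fulltext_en:org.neo4j.contrib.fti.analyzers.English");
        for (Map.Entry<String, String> e : indexes.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```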

Additionally this project features an example of how to use regular expressions for searching an index. Using the Java API you need to pass a Lucene RegexQuery based on a Term holding your regular expression. The RegexQuery class isn’t part of lucene-core either, so be sure to have lucene-queries in your Neo4j’s plugins folder as well. This example is exposed in an unmanaged extension using the following code snippet:

try (Transaction tx = graphDatabaseService.beginTx()) {
    IndexManager indexManager = graphDatabaseService.index();
    if (!indexManager.existsForNodes(indexName)) {
        throw new IllegalArgumentException("index " + indexName + " does not exist");
    }
    Index<Node> index = indexManager.forNodes(indexName);
    // query the index with a regular expression on the given field
    IndexHits<Node> hits = index.query(new RegexQuery(new Term(field, regex)));

    // collect the ids of all matching nodes
    List<Long> result = new ArrayList<>();
    for (Node node: hits) {
        result.add(node.getId());
    }
}

Assuming an index named fulltext_de has been configured using the German analyzer (see above), use the following commands, again with httpie, to create a node, add it to the fulltext index and perform a regular expression index query:

# create a node
http -j localhost:7474/db/data/cypher query="create (n:Blog {description:'Auf der Straße stehen fünf Häuser'}) return id(n)"

# add it to the index (replace xxxx with the node id returned above):
http -j localhost:7474/db/data/index/node/fulltext_de \
   uri="http://localhost:7474/db/data/node/xxxx" \
   key="description" value="Auf der Straße stehen fünf Häuser"

# query the index for words starting with "h" and ending with "s"
http localhost:7474/regex/fulltext_de/description/h.*s
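Why does h.*s find this node? A regex query is evaluated against the individual terms stored in the index, not against the original sentence. The sketch below mimics this with plain java.util.regex, assuming each term has to match the expression as a whole. The term list is written by hand and is an assumption for illustration: it only relies on “Häuser” being stemmed to “haus” (in lowercase) as described above, while the other entries are simply simplified lowercase words and not real analyzer output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TermRegexDemo {
    // Returns the terms that match the regular expression as a whole.
    static List<String> matchingTerms(List<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hand-written, assumed terms for "Auf der Straße stehen fünf Häuser".
        List<String> terms =
            Arrays.asList("auf", "der", "strasse", "stehen", "funf", "haus");
        // Only "haus" starts with "h" and ends with "s".
        System.out.println(matchingTerms(terms, "h.*s"));
    }
}
```

Note that the expression is lowercase: a pattern such as H.*r, written with the original word “Häuser” in mind, would not match the stemmed term in this sketch.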