In a previous blog post I explained the differences between the types of indexes available in Neo4j. A common requirement in many projects is fulltext indexing. With current versions of Neo4j (2.1.5 as of now) this can only be accomplished with manual indexes.
In this article I want to explain how you can use language-specific analyzers for fulltext indexing and how to run regex searches against such indexes.
The reference manual's section on fulltext indexing mentions that you can provide a custom analyzer class by specifying a config parameter analyzer upon index creation. Its value is the fully qualified class name of the analyzer. There are two ways to create such a manual index, either using the Java API
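import java.util.*;
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.index.*;
GraphDatabaseService graphDb = ....  // obtain your database instance here
IndexManager indexManager = graphDb.index();
try (Transaction tx = graphDb.beginTx()) {
    // "analyzer" holds the fully qualified class name of the custom analyzer
    Map<String,String> params = Collections.singletonMap("analyzer",
        "my.package.Analyzer");
    Index<Node> index = indexManager.forNodes("myfulltextindex", params);
    tx.success();  // mark the transaction as successful before it is closed
}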
or using the REST API (here with the wonderful httpie command line HTTP client):
http -v -j localhost:7474/db/data/index/node \
name=myfulltextindex config:='{"analyzer":"my.package.Analyzer"}'
Lucene provides an optional set of language-specific analyzers. These analyzers have some knowledge of the language they're operating on and use it for word stemming; see http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene for details on the internals of the GermanAnalyzer. As an example, the German word for houses, “Häuser”, is stemmed to its singular form “Haus”. Consequently a query for “Haus” retrieves occurrences of both “Haus” and “Häuser”.
The language-specific analyzers reside in an optional jar file called lucene-analyzers-3.6.2.jar that does not ship with Neo4j by default. Therefore copy lucene-analyzers-3.6.2.jar into Neo4j’s plugins folder.
When you try to use e.g. Lucene’s GermanAnalyzer with
http -v -j localhost:7474/db/data/index/node name=fulltext_de \
config:='{"analyzer":"org.apache.lucene.analysis.de.GermanAnalyzer"}'
you get back an HTTP status 500. The log files show a strange exception: java.lang.InstantiationException: org.apache.lucene.analysis.de.GermanAnalyzer. The reason for this exception is that Neo4j tries to instantiate the analyzer class using a no-arg default constructor. Unfortunately Lucene’s language-specific analyzers don’t have such a constructor, see the javadocs. The solution is to write a thin analyzer class with a default constructor that internally uses the Lucene-provided analyzer as a delegate.
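The failure and the fix can be reproduced without Lucene or Neo4j on the classpath. The following sketch uses hypothetical stand-in classes (`NoArgWrapperDemo`, `VersionedAnalyzer` and `GermanWrapper` are invented for illustration and are not part of Lucene, Neo4j or neo4j-fti) and assumes the analyzer is instantiated reflectively through its no-arg constructor, as described above:

```java
import java.util.Locale;

class NoArgWrapperDemo {

    // Hypothetical stand-in for a Lucene language analyzer: its only
    // constructor takes an argument (GermanAnalyzer similarly needs a Version).
    static class VersionedAnalyzer {
        private final String version;

        VersionedAnalyzer(String version) {
            this.version = version;
        }

        String normalize(String text) {
            return text.toLowerCase(Locale.ROOT);
        }
    }

    // Thin wrapper: offers a no-arg constructor and delegates all work.
    static class GermanWrapper {
        private final VersionedAnalyzer delegate = new VersionedAnalyzer("3.6");

        GermanWrapper() {
        }

        String normalize(String text) {
            return delegate.normalize(text);
        }
    }

    // Assumption: mimics how a class configured by name might be
    // instantiated, i.e. through its no-arg constructor.
    @SuppressWarnings("deprecation")
    static String tryInstantiate(Class<?> type) {
        try {
            type.newInstance();
            return "ok";
        } catch (InstantiationException | IllegalAccessException e) {
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // No no-arg constructor available: prints "failed: InstantiationException"
        System.out.println(tryInstantiate(VersionedAnalyzer.class));
        // Wrapper with a no-arg constructor: prints "ok"
        System.out.println(tryInstantiate(GermanWrapper.class));
    }
}
```

A real wrapper would extend Lucene's Analyzer and forward its calls to a GermanAnalyzer instance instead of the toy `normalize` method used here.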
In order to simplify the process of setting this up I’ve created a small project on GitHub called neo4j-fti. It contains the mentioned wrappers in package org.neo4j.contrib.fti.analyzers for all languages that have a Lucene analyzer. It also provides a kernel extension for Neo4j that automatically creates fulltext indexes based on a config option. In neo4j.properties you need to set:
fullTextIndexes=fulltext_de:org.neo4j.contrib.fti.analyzers.German,\
fulltext_en:org.neo4j.contrib.fti.analyzers.English
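To make the format of this setting explicit: it is a comma-separated list of indexName:analyzerClass pairs. The sketch below shows one way such a value could be parsed. The class and method names are hypothetical, and this is not necessarily how neo4j-fti itself reads the option:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FullTextIndexConfig {

    // Parses "name1:analyzerClass1,name2:analyzerClass2" into an ordered map
    // from index name to analyzer class name.
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries such as "a:b,,c:d"
            }
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "expected name:analyzerClass but got '" + trimmed + "'");
            }
            result.put(trimmed.substring(0, colon).trim(),
                    trimmed.substring(colon + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String value = "fulltext_de:org.neo4j.contrib.fti.analyzers.German,"
                + "fulltext_en:org.neo4j.contrib.fti.analyzers.English";
        parse(value).forEach((index, analyzer) ->
                System.out.println(index + " -> " + analyzer));
    }
}
```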
Additionally this project features an example of how to use regular expressions to search an index. Using the Java API you need to pass a Lucene RegexQuery based on a Term holding your regular expression. The RegexQuery class isn’t part of lucene-core either, so be sure to have lucene-queries in your Neo4j’s plugins folder as well. This example is exposed in an unmanaged extension using the following code snippet:
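import java.util.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.regex.RegexQuery;
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.index.*;
try (Transaction tx = graphDatabaseService.beginTx()) {
    IndexManager indexManager = graphDatabaseService.index();
    // fail early instead of silently creating a new index
    if (!indexManager.existsForNodes(indexName)) {
        throw new IllegalArgumentException("index " + indexName + " does not exist");
    }
    Index<Node> index = indexManager.forNodes(indexName);
    // the regular expression is wrapped in a Term for the given field
    IndexHits<Node> hits = index.query(new RegexQuery(new Term(field, regex)));
    // collect the ids of all matching nodes
    List<Long> result = new ArrayList<>();
    for (Node node : hits) {
        result.add(node.getId());
    }
}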
Assuming an index named fulltext_de has been configured using the German analyzer (see above), use the following commands, again with httpie, to create a node, add it to the fulltext index and perform a regular expression index query:
# create a node
http -j localhost:7474/db/data/cypher query="create (n:Blog {description:'Auf der Straße stehen fünf Häuser'}) return id(n)"
# add it to the index (replace xxxx with the node id returned above):
http -j localhost:7474/db/data/index/node/fulltext_de \
uri="http://localhost:7474/db/data/node/xxxx" \
key="description" value="Auf der Straße stehen fünf Häuser"
# query the index for words starting with "h" and ending with "s"
http localhost:7474/regex/fulltext_de/description/h.*s
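Note that the regular expression is applied to the terms stored in the index, not to the original property value. That is why h.*s finds the node above although the text contains “Häuser”: following the stemming example earlier, the German analyzer indexes it as “Haus”. Here is a minimal sketch with plain java.util.regex, assuming the indexed term is the lowercased stem `haus` and that the expression has to match a whole term (both are assumptions, and the class name `TermRegexDemo` is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TermRegexDemo {

    // Returns every term that the regular expression matches completely.
    static List<String> matchingTerms(List<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Raw word from the text, lowercased (\u00e4 is the a-umlaut).
        List<String> rawWords = Arrays.asList("h\u00e4user");
        // Assumed term as written to the index by the analyzer.
        List<String> indexedTerms = Arrays.asList("haus");

        System.out.println(matchingTerms(rawWords, "h.*s"));     // prints []
        System.out.println(matchingTerms(indexedTerms, "h.*s")); // prints [haus]
    }
}
```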