In a previous blog post I explained the differences between the types of indexes available in Neo4j. A common requirement in many projects is fulltext indexing. With current versions of Neo4j (2.1.5 as of now) this can only be accomplished with manual indexes.
In this article I want to explain how you can use language-specific analyzers for fulltext indexing and how to run regex searches against such indexes.
When looking at the reference manual on fulltext indexing, there is the notion of providing a custom analyzer class by specifying a config parameter analyzer upon index creation. Its value is the full class name of the analyzer. There are two ways to create such a manual index, either using the Java API:
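GraphDatabaseService graphDb = ....
IndexManager indexManager = graphDb.index();
try (Transaction tx = graphDb.beginTx()) {
    // "analyzer" holds the fully qualified class name of the custom analyzer
    Map<String, String> params = Collections.singletonMap("analyzer",
            "my.package.Analyzer");
    Index<Node> index = indexManager.forNodes("myfulltextindex", params);
    tx.success(); // mark the transaction as successful so it commits on close
}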
or using the REST API (here with the wonderful httpie HTTP command line client):
http -v -j localhost:7474/db/data/index/node \
name=myfulltextindex config:='{"analyzer":"my.package.Analyzer"}'
Lucene provides an optional set of language-specific analyzers. These analyzers have some knowledge of the language they're operating on and use it for word stemming; see http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene for details on the internals of the GermanAnalyzer. As an example, the German word for houses, “Häuser”, is stemmed to its singular form “Haus”. Consequently, a query for “Haus” retrieves occurrences of both “Haus” and “Häuser”.
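To make the effect of stemming more tangible, here is a minimal toy sketch. It is not how Lucene's GermanAnalyzer works internally: the class ToyStemmingIndex and its hard-coded stem table are made up for illustration, and a real analyzer derives stems algorithmically. The point is only that the indexed text and the query string go through the same analysis step, so both word forms end up as the same term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: a lookup table stands in for a real stemmer.
class ToyStemmingIndex {
    // Hypothetical stem table mapping the German plural of "Haus" to its stem.
    private static final Map<String, String> STEMS = new HashMap<>();
    static {
        STEMS.put("h\u00e4user", "haus");
    }

    // term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lowercase, split on anything that is not a letter, then apply the stem table.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                terms.add(STEMS.getOrDefault(token, token));
            }
        }
        return terms;
    }

    // Index-time analysis: store the document id under every analyzed term.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query-time analysis: the query goes through the same analyze() step.
    Set<Integer> search(String word) {
        Set<Integer> result = new TreeSet<>();
        for (String term : analyze(word)) {
            result.addAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        return result;
    }
}
```

With one document containing "Haus" and another containing the plural form, search("Haus") returns both ids in this sketch, because both texts were reduced to the term "haus" when they were added.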
The language-specific analyzers reside in an optional jar file called lucene-analyzers-3.6.2.jar that does not ship with Neo4j by default. Therefore copy lucene-analyzers-3.6.2.jar into Neo4j’s plugins folder.
When trying, for example, to use Lucene’s GermanAnalyzer with
http -v -j localhost:7474/db/data/index/node name=fulltext_de \
config:='{"analyzer":"org.apache.lucene.analysis.de.GermanAnalyzer"}'
you get back an HTTP status 500. The log files show a strange exception: java.lang.InstantiationException: org.apache.lucene.analysis.de.GermanAnalyzer. The reason for this exception is that Neo4j tries to instantiate the analyzer class using a no-arg default constructor. Unfortunately Lucene’s language-specific analyzers don’t have such a constructor, see javadocs. The solution is to write a thin analyzer class with a default constructor. Internally that class uses the Lucene-provided analyzer as a delegate.
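The mechanism can be reproduced without Lucene or Neo4j on the classpath. The following is a minimal sketch with made-up stand-in classes: VersionedAnalyzer plays the role of a Lucene analyzer that needs a constructor argument, NoArgAnalyzer is the thin wrapper, and ReflectiveLoader mimics a framework that only knows the class and expects a no-arg constructor. None of these names exist in Lucene or Neo4j.

```java
// Hypothetical stand-in for a Lucene language analyzer: its only constructor needs an argument.
class VersionedAnalyzer {
    private final String version;

    VersionedAnalyzer(String version) {
        this.version = version;
    }

    String describe() {
        return "analyzer for " + version;
    }
}

// Thin wrapper with a no-arg constructor that delegates to the class above.
class NoArgAnalyzer {
    private final VersionedAnalyzer delegate = new VersionedAnalyzer("LUCENE_36");

    NoArgAnalyzer() {
    }

    String describe() {
        return delegate.describe();
    }
}

class ReflectiveLoader {
    // Mimics a framework that instantiates a class by reflection without any arguments.
    static boolean canInstantiate(Class<?> type) {
        try {
            type.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // A missing no-arg constructor surfaces as NoSuchMethodException here;
            // the older Class.newInstance() reports the same situation as an
            // InstantiationException.
            return false;
        }
    }
}
```

In this sketch ReflectiveLoader.canInstantiate(VersionedAnalyzer.class) fails while the wrapper can be created, which is the same trick a wrapper around GermanAnalyzer relies on.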
In order to simplify the process of setting this up, I’ve created a small project on GitHub called neo4j-fti. It contains the mentioned wrappers in package org.neo4j.contrib.fti.analyzers for all languages that have a Lucene analyzer. It also provides a kernel extension for Neo4j to automatically create fulltext indexes via a config option. In neo4j.properties you need to set:
fullTextIndexes=fulltext_de:org.neo4j.contrib.fti.analyzers.German,\
fulltext_en:org.neo4j.contrib.fti.analyzers.English
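The value is a comma-separated list of indexName:analyzerClass pairs; the trailing backslash is the usual line continuation of Java properties files. As a rough sketch of how such a value could be turned into a map, consider the following. This is not the actual code of neo4j-fti, and the class name FullTextIndexConfig is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not taken from neo4j-fti.
class FullTextIndexConfig {
    // Parses "name1:class1,name2:class2" into an ordered index name -> analyzer class map.
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "expected <indexName>:<analyzerClass> but got '" + trimmed + "'");
            }
            result.put(trimmed.substring(0, colon).trim(),
                    trimmed.substring(colon + 1).trim());
        }
        return result;
    }
}
```

Each resulting entry would then correspond to one manual index created with the given analyzer class.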
Additionally this project features an example of how to use a regular expression to search an index. Using the Java API you need to pass a Lucene RegexQuery based on a Term holding your regular expression. The RegexQuery class isn’t part of lucene-core either, so be sure to have lucene-queries in your Neo4j’s plugins folder as well. This example is exposed in an unmanaged extension using the following code snippet:
try (Transaction tx = graphDatabaseService.beginTx()) {
    IndexManager indexManager = graphDatabaseService.index();
    if (!indexManager.existsForNodes(indexName)) {
        throw new IllegalArgumentException("index " + indexName + " does not exist");
    }
    Index<Node> index = indexManager.forNodes(indexName);
    // RegexQuery matches the regular expression against the indexed terms of the given field
    IndexHits<Node> hits = index.query(new RegexQuery(new Term(field, regex)));
    List<Long> result = new ArrayList<>();
    for (Node node : hits) {
        result.add(node.getId());
    }
}
Assuming an index named fulltext_de has been configured using the German analyzer (see above), use the following commands, again with httpie, to create a node, add it to the fulltext index and perform a regular expression index query:
# create a node
http -j localhost:7474/db/data/cypher query="create (n:Blog {description:'Auf der Straße stehen fünf Häuser'}) return id(n)"
# add it to the index (replace xxxx with the node id returned above):
http -j localhost:7474/db/data/index/node/fulltext_de \
uri="http://localhost:7474/db/data/node/xxxx" \
key="description" value="Auf der Straße stehen fünf Häuser"
# query the index for words starting with "h" and ending with "s"
http localhost:7474/regex/fulltext_de/description/h.*s
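One thing worth keeping in mind is that a regular expression query is evaluated against the terms stored in the index, not against the original property value. The following minimal sketch shows the idea with plain java.util.regex. The class TermRegexFilter and the term list are made up for illustration, and I'm assuming the pattern has to match a whole term; the terms the GermanAnalyzer really produces for the sentence above may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper showing regex matching on a per-term basis.
class TermRegexFilter {
    // Returns the terms that the regular expression matches in full.
    static List<String> matching(List<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                result.add(term);
            }
        }
        return result;
    }
}
```

For an assumed term list of "auf", "der", "stehen" and "haus", the expression h.*s selects only "haus". An expression starting with an uppercase H selects nothing in this sketch, since the assumed terms are all lowercase.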
4 replies on “deep dive on fulltext indexing with Neo4j”
Hello alo!
I downloaded HTTPie, but got an HTTP/1.1 500 Server Error:
could you please help?
thank you!
gg4u-2:httpie gg4u$ http -v -j localhost:7474/db/data/index/node name=topic config:='{"analyzer":"org.apache.lucene.analysis.en"}'
POST /db/data/index/node HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Content-Length: 74
Content-Type: application/json; charset=utf-8
Host: localhost:7474
User-Agent: HTTPie/0.8.0
{
    "config": {
        "analyzer": "org.apache.lucene.analysis.en"
    },
    "name": "topic"
}
HTTP/1.1 500 Server Error
Cache-Control: must-revalidate,no-cache,no-store
Connection: close
Content-Length: 0
Content-Type: text/html; charset=ISO-8859-1
Date: Tue, 19 May 2015 04:01:10 GMT
Server: Jetty(9.2.4.v20141103)
Hi,
I should also add that:
http localhost:7474/regex/fulltext_de/description/h.*s
does not seem to exist:
http localhost:7474/regex/
(error 404)
Am I missing something?
Hi,
I guess the analyzer class you’ve specified does not exist. Check `data/graph.db/messages.log` for an exception. See https://github.com/sarmbruster/neo4j-fti/tree/master/src/main/java/org/neo4j/contrib/fti/analyzers for a list of the available analyzers.
Cheers,
Stefan
What’s your setting for `org.neo4j.server.thirdparty_jaxrs_classes` in `neo4j-server.properties`?