Tag Archives: neo4j

Neo4j cluster and firewalls

This article summarizes the required and optional communication channels for a Neo4j enterprise cluster and provides some sample rules. For the firewall rules I’m using Ubuntu’s ufw package.

As a sample setup we assume the following servers. All of them have 2 network interfaces, eth0 is for communication with outside world, eth1 is for cluster communication.

name eth0 ip address eth1 ip address

access to REST interfaces

By default Neo4j listens on port 7474 for http and on 7473 for https style requests to the rest api. To allow for remote access you need to set in neo4j-server.properties:

Inbound access needs to be granted using

When using SSL the rule needs to hit port 7473 instead.

cluster communication

A Neo4j cluster uses two different communication channels: one for cluster management (joining/leaving the cluster, master elections, etc.) and one for transaction propagation. By default ports 5001 and 6001 are used. On all cluster members we need to allow inbound and outbound traffic for these to the other cluster members:

online backup

Neo4j enterprise supports online backup – the default port for this 6362. To enable remote backup you need to set in neo4j.properties:

The corresponding ufw command is:

remote shell

This is pretty tricky. Under the hoods neo4j remote shell uses Java RMI. When a new connection is established the client communicates with server:1337. During this control session a secondary port is negotiated – unfortunately a random port is used. So the client opens a second connection to the server with the negotiated port – therefore we have to open up basically all ports for the ip addresses acting as shell clients:

There might be a more sophisticated approach by implementing a custom RMISocketFactory and register it with the JVM as described in the JVM docs. I have not yet tried this, so if you have explored that path yourself I’d appreciate to hear your solution to this.

neo4j and haproxy: some best practices and tricks


The goal of that blog post is to provide a reasonable haproxy configuration suitable for most neo4j clusters. It covers the Neo4j specific needs (authentication, stickyness, request routing) suitable for typically deployments.

Some words for introduction

When running a Neo4j cluster you typically want to front it with a load balancer. The load balancer itself is not part of Neo4j since a lot of customers already have hardware load balancers in their infrastructure. In case they have not or want some easy-to-go solution for that the recommended tool is haproxy.

In a Neo4j cluster its perfectly possible to run any operation (reads and writes) on any cluster member. However performance-wise it’s a good idea to send write requests directly to the master and send read-only requests to all cluster member or just to the slaves – depending how busy your master is.

What we need to cover

These days most folks interact with Neo4j by sending Cypher commands through the transactional cypher end point. This endpoint uses always a HTTP POST regardless if the Cypher commands are intended to perform read or write operation. This causes some pain for load balancing since we need to distinguish reads from writes.

One approach for this is to use a additional custom HTTTP header on the client side. Assume any request that should perform writes gets a X-Write:1 header in the request. Then haproxy just needs to check for this header in and acl and associate this acl with the backend containing only the master instance.

Another idea is that the load balancer inspects the request’s payload (aka the Cypher command you’re sending). If that payload contains write clauses like CREATE, MERGE, SET, DELETE, REMOVE or FOREACH it is most likely a write operation to be directed to the master. Luckily haproxy has capbilites to inspect the payload and apply regex matches.

A third approach would be to encapsulate each and every of your use cases into an unmanaged extension and use Cypher, Traversal API or core API inside as an implementation detail. By using the semantics of REST appropriately you use GET for side-effect free reads, PUT, POST and DELETE for writes. These can be used in haproxy’s acl as well very easily.

Another issue we have to deal when setting up the load balancer with is authentication. Since Neo4j 2.2 the database is protected by default with username and password. Of course we can switch that feature off, but if security is a concern it’s wise to have it enabled. In combination with haproxy we need to be aware that haproxy needs to use authentication as well for its status checks to identify who’s available and who’s master.

recommended setup

I’ve spent some time to fiddle out a haproxy config addressing all the points mentioned above. Let’s go through the relevant parts below line by line – for the rest I delegate to the excellent documentation for haproxy. Please note that this config requires a recent version of haproxy:

glory details

defining access control lists

The acl declare conditions to be used later one. We have acls to identify if the incoming request

  • has http method of POST, PUT or DELETE (l.16)
  • has a specific request header X-Write:1
  • contains one of the magic words in the payload to identify a write: CREATE, MERGE, DELETE, REMOVE, SET
  • is targeted at the transactional cypher endpoint
store a variable indicating tx endpoint

We do a nasty trick here: we store a internal variable holding a boolean depending on wether the request is for the tx endpoint. Later on we refer back to that variable. The rationale is that haproxy cannot access acls in other sections of the config file.

decide the backend to be used (aka do we read or write)

The acls defined above decide to which backend a request should be directed. We have one endpoint for the neo4j cluster’s master and another one for a pool of all available neo4j instances (master and slaves). A request is send to master if

  • the X-Write:1 header is set, or
  • the request hits the tx cypher endpoint AND it contains one of the magic words.
  • it is a POST, PUT or DELETE.
check status of neo4j instances, eventually with authentication

A backend needs to check which of its member is available. Therefore Neo4j has a REST endpoint exposing the individual instance’s status. In case we use Neo4j authentication the requests to the status endpoint need to be authenticated. Thanks to stackoverflow, I’ve figured out that you can add a Authorization to these requests. As usual, the value for the auth header is a base64 encoded string of “<username>:<password>”. On a unix command line use:

In case you’ve disabled authentication in Neo4j, use the commented lines instead.

stickyness for tx endpoint

Using our previously stored variable, we get the value back and use it to declare an acl in this backend. To make sure any request for the same transaction number is sent to the same neo4j instance we create a sticky table to store the transaction ids. You might want to adopt the size of that table (here 1k) to your needs, as well the expire time should be aligned with neo4j’s org.neo4j.server.transaction.timeout setting. It defaults to 60s, so if we expire the sticky table entry after 70s we’re on the safe side. If a http response from the tx endpoint has a Location header we store its 6th word – that is the transaction number – in the sticky table. If we hit the tx endpoint we check if the request path’s 4th word (the transaction id) is found in the sticky table. If so, the request gets send to the same server again, otherwise we use the default policy (round-robin).

define neo4j cluster instances

These lines hold a list of all cluster instances. Be sure to align the maxconn value with the amount of max_server_threads. By default Neo4j uses 10 threads per CPU core. So 32 is a good value for a 4 core box.

haproxy’s stats interface

Haproxy provides a nice status dashbord. In most cases you want to protect that with a password. haproxy_stats


I hope this post could provide you some insights on how to use haproxy in front of a neo4j cluster efficiently. Of course I am aware that my experience and knowledge with haproxy is limited. So I appreciate any feedback to improve the config described here.

Cypher fun: finding the position of an element in an array

At a onsite workshop with a new potential customer in wonderful Zurich I was challenged with a requirement I’ve never had in the last few years with Cypher.

Since I cannot disclose any details of the use case or the data used let’s create an artificial example with a similar structure:

  • a restaurant has several rooms
  • each room has various tables
  • each table is occupied with 0 to 6 guests

The goal of the query is to find the most occupied table for each room. As an example, I’ve created a Neo4j console at http://console.neo4j.org/r/ufmpwi.

The challenging here is that we want to do a kind of local filtering: for a given room we need the most occupied table. Calling the reduce function to the rescue. The idea is to run the reduce with 3 state variables:

  • the index of the highest occupation so far
  • the current index (aka iteration number)
  • the value of highest occupation so far

Reduce allows just one single state variable to be used but we can use a three element array instead (which is one variable :-) ). When the current element is larger than the maximum so far (aka the 3rd element of the state), we update the first element to the current position. The second element (current index) is always incremented. If the current element is not larger then the maximum so far, we just increment the current count (2nd element) and keep the other values:

The full query is:

The first line is pretty much obvious. Since there might be tables without guests OPTIONAL MATCH is required in line 2.

Cypher does not allow you to do direct aggregations of aggregations. Using multiple WITH helps here. In line 3 we first calculate counts of guests per table. Line 4 returns one line per room with two collections – one holding the tables, the other their occupation. Note that both collections do have the same order. Finally from line 5 on the reduce function is applied to find the most occupied table in each room.

deep dive on fulltext indexing with Neo4j

In a previous blog post I’ve explained the differences of the different types of indexes being available in Neo4j. A common requirement for a lot of projects is the usage of fulltext indexes. With current versions of Neo4j (2.1.5 as of now) this can only be accomplished with the usage of manual indexes.

In this article I want to explain how you can use language specific analyzers for fulltext indexing and how to do regex searches for those.

When looking at the reference manual on fulltext indexing there is the notion of providing a custom analyzer class by specifying a config parameter analyzer upon index creation. It’s value is the full class name of the analyzer. There are two ways to create a manual index this, either using java api

or using REST API (using the wonderful httpie http command line client)

Lucene provides an optional set of language specific analyzers. These analyzers have some knowledge on the language their operating on and use that for word stemming, see http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene for details on the internals of the GermanAnalyzer. As an example the German word for houses “Häuser” is stemmed to its singular form “Haus”. Consequently a query for “Haus” retrieves all both, occurrences of “Haus” and “Häuser”.

The language specific analyzers are residing in an optional jar file called lucene-analyzers-3.6.2.jar that is not shipping by default with Neo4j. Therefore copy lucene-analyzers-3.6.2.jar into Neo4j’s plugins folder.

When trying e.g. to use Lucene’s GermanAnalyzer using

you get back a HTTP status 500. The log files show up a strange exception java.lang.InstantiationException: org.apache.lucene.analysis.de.GermanAnalyzer. The reason for this exception is that Neo4j tries to instantiate the analyzer class using a noarg default constructor. Unfortunately Lucene’s language specific analyzers don’t have such a constructor, see javadocs. The solution for this is write a thin analyzer class with a default constructor. Internally that class uses the Lucene provided analyzer as a delegate.

In order to simplify the process of setting this up I’ve create a small project on github called neo4j-fti. It contains the mentioned wrappers in package org.neo4j.contrib.fti.analyzers for all languages having a lucene analyzer. It also provides a kernel extension to Neo4j to automatically create fulltext indexes by a config option. In neo4j.properties you need to set:

Additionally this project features an example how to use regular expression for search an index. Using Java API you need to pass a Lucene RegexQuery based on a Term holding your regular expression. The RegexQuery class isn’t part of lucene-core either, so be sure to have lucene-queries in your Neo4j’s plugins folder as well. This example is exposed in a unmanaged extension using the following code snippet:

Assuming a index named fulltext_de has been configured using the German analyzer (see above), use the following code using httpie again to create a node, add it to the fulltext index and perform a regular expression index query:

quick tooling tip for hacking Cypher statements – Linux only

When developing Cypher statements for a Neo4j based application there are multiple ways to do this.

A lot of people (including myself) love the new Neo4j browser shipped with 2.0 and subsequent releases. This is a nicely built locally running web application running in your browser. At the top users can easily type their Cypher code and see results after executing, either in tabular form or as a visualization enabling to click through.

Neo4j 2.0 Browser

Another way is to use the command line and either go with neo4j-shell or use the REST interface by a command line client like cURL or more conveniently httpie (which I’ve previously blogged about).

Typically while building a Cypher statement you take a lot of cycles to hack a little bit, test if it runs, hack a little bit, test, …. This cycle can be improved by automating execution as soon as the file containing the cypher statement has hanged.

Linux comes with a kernel feature called inotify that reports file system changes to applications. On Ubuntu/Debian there is a package called inotify-hookable available offering a convenient way to set up tracking for a specific file or directory and take a action triggered by a change in the file/directory.

Assume you want to quickly develop a complex cypher statement in $HOME/myquery.cql. Set up monitoring using:

Using your text editor of choice open $HOME/myquery.cql and change your code. After saving the statement will be automatically executed and you get instantly feedback.

using remote shell combined with Neo4j embedded

Neo4j can be deployed in multiple ways. Either you can run it as a server in a separate process, just like a classic database, or you can use embedded mode where your application controls the lifecycle of the graph database. Both, embedded and server mode can be used to setup a HA scenario with Neo4j enterprise edition.

In cases where Neo4j is used in embedded mode, there is often a demand for having a maintenance channel to the database, e.g. for fixing wrong data. Nothing simpler than that, there’s an easy way to enable the remote shell together with embedded mode, see a example written in groovy:

The trick is to

  1. have the neo4j-shell-<version>.jar on your classpath and
  2. pass in the config option remote_shell_enabled='true'

With this in place you can use the bin/neo4j-shell from your Neo4j distribution and access your embedded instance.


indexing in Neo4j – an overview

Neo4j as a graph database features indexing as the preferred way to find start points for graph traversals. Over the years multiple different indexing approach have been added. The goal of this article is to give an overview on this to avoid confusion esp. for those who just recently got started with Neo4j.

A graph database using a property graph model stores its data in nodes, relationships and properties. In Neo4j 2.0 this model was amended with labels.

no indexes in the beginning

In the very early days of Neo4j there was no index. The only way to walk through the graph was by linking interesting things to the reference node. The reference node or “node 0” acted as a global entry point. Up till versions 1.9.x the GraphDatabaseService had a deprecated getReferenceNode method being a historic relict cleaned up in 2.0.

manual indexes

The Neo4j hackers realized at some point that users don’t want to take the error prone and cumbersome way to find start points for graph traversals via the reference node. At this point a feature called ‘manual indexing’ appeared on the plate. This was back in the days before 1.0 – at a dark age without Cypher and server mode. The only way to speak to your graph was using the Java API. Therefore manual indexing was to be performed by Java API. The main entry point is calling graphDatabaseService.index() to get access to the IndexManager, see here for an example. Any index operation has to be done explicitly and manually. This approach enabled abuse of indexes as well. As a general pattern the index should be seen mainly as a lookup service and not as a secondary datastore. In general the index should not contain any information not residing in the graph itself.

Querying manual indexes was added to Cypher, so to access a manual index you use

With node you refer to a index on nodes, Person refers to the index named Person and name is the property within the index. With manual indexes you can index relationships as well. Indexing relationships is however a rare use case.

A pretty nice option for manual indexes is the fact that you can pass in options when the index is first created. This allows to configure a index for fulltext indexing or choose different analyzers, see http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html.

automatic indexes

In Neo4j 1.4 a new feature was introduced: auto indexing. Under the hoods it’s a manual index with a fixed name (node_auto_index, relationship_auto_index) combined with a TransactionEventHandler that mirrors changes on a set of configured property names to the index. Typically auto indexing is setup in neo4j.properties. This approach removes lot of burden from manually mirroring your property changes to the index and it allows Cypher statement to implicitly modify the index.

From Cypher perspective there is no difference to manual indexes aside that you have to use the predefined index names (node_auto_index here).

It’s important to know that a change to auto index configuration will not trigger reindexing of existing datasets. A commonly used trick is to set a property to its current value which forces reindexing.

Another shortcoming is that the configuration of property keys to be indexed is global. Assume you have persons with a name property and cities with a name property. Any query to n=node_auto_index(name='abc') can potentially return both persons and cities. Therefore you should choose distinct property keys for different semantics.

schema indexes

On of the most shiny new features in Neo4j are schema indexes. Schema indexes feel a lot like indexes as we’re used to from relational world. A schema index is declared based on a label for a certain property.

The above statement will create a index for the name property on all nodes carrying the Person label. Very convenient is the fact that the index will automatically be populated with preexisting data.

Queries do no longer have to explicitly use a index, it’s more the behaviour we know from SQL. When there is a index that can make a query more performant it will use. Assume a query like

In case of no index being set up this will look up all Person nodes and check if their name property matches Stefan. If a index is present it will be used transparently.

Constraints are used almost the same way as schema indexes. E.g. to ensure uniqueness on the name property for nodes having the Person label use

Currently schema indexes cannot be spawned over multiple properties but you can have multiple indexes for the same label. In case you want to do combined searches, it’s workaround to aggregate into a combined property. E.g. if you have firstName and lastName and want to do a combined lookup you might introduce a property name consisting of firstName + lastName and index only the name property.

Schema indexes are way more simple to use compared to manual/autoindexes – so anyone starting with Neo4j should mainly look at schema indexes. To make a clear point on this, the reference manual mentions manual and auto index in a section called ‘legacy indexes’.


running Neo4j graphgists locally with docker.io

Neo4j has a excellent tool for documenting graph models called graphgists. As the name suggests graphgists are typically stored as github gists in asciidoc format. Additionally to the regular asciidoc you can embed executable cypher and a Neo4j console in a graphgist. For most people it’s perfectly fine hosting their graphgists at github or dropbox. If you want to keep your graphgists private just use a secret gist.

However there are companies with an higher demand for security and privacy that don’t want to expose their stuff onto public networks. Graphgists itself is javascript based and works locally. For the console part it by default connects to http://console.neo4j.org to do the graph operations.

One of my colleagues at Neo Technology recently pointed me to docker.io, a nice LXC based tool to create, maintain, run and share lightweight containers. To get familiar with docker I’ve decided to set up a small project to make graphgists and Neo4j console available in a docker container. This approach allows to handle with your graphgists 100% locally – nothing leaves the container.

How the docker container is built up

Docker containers are assembled by a cookbook called a dockerfile. It specifies the container you want to inherit from and then issue couple of commands to apply your customizations. In my case we need to install a servlet container (tomcat7 here). Neo4j console’s source repo is https://github.com/neo4j-contrib/rabbithole. Based on a recent change it now allows to build a war file ready for deployment into servlet containers. Of course we could have installed maven and clone the repo and exec mvn war:war to build neo4j console. I’ve decided to provide and use a prebuilt war file located at bintray, this removes the need for downloaded a massive number of dependencies for the in-container maven installation. Neo4j console’s war file is deployed as console.war and therefore available in the console context of tomcat.

The graphgist repo is cloned into the location of tomcat’s root context and the location of CONSOLE_URL_BASE is adopted to the locally available neo4j console. Finally tomcat is started as a service. Here’s the full Dockerfile:

[getgit userid=”sarmbruster” repoid=”docker_neo4j_graphgist” path=”Dockerfile” language=”bash”]

How to use the docker container

Of course you need to install docker locally. The procedure differs among operating system and is documented here. Next is to pull the preconfigured container and run it:

This procedure might take some time on first invocation – a good candidate for having a nice espresso.

The second command maps a local directory (it’s crucial to use absolute path) into a in-container directory for accessing the gists. Port 8080 is mapped to the in-container port 8080.

When done your local graphgist setup is finished point your browser to http://localhost:8080. To create a gist, open your favourite text editor and save a <myname>.adoc file in the local gist directory used above when starting docker. For some samples of graphgist files, see https://github.com/neo4j-contrib/graphgist/tree/master/gists.Pointing the browser to http://localhost:8080?myname (without .adoc) should render your graphgist.

Using docker ps and docker stop <containerId> can be used to stop the graphgist containter.

closing words

Since I’m absolutely new to docker there might be better and more elegant ways to achieve locally running graphgists. I’m looking forward to read your comments and feedback on this.


some experiments with ratpack and neo4j

Back in May this year I’ve attended the Gr8conf in Copenhagen. As always this conference added couple of things to my personal “take-a-look-at-this” list. The most exciting thingy for me was ratpack, a lean toolkit for building web applications on the JVM. Ratpack is powered by Netty and provides an event driven network engine as opposed to classic servlet based containers like Tomcat or Jetty which bind threads to requests. In high load scenarios with a huge number of concurrent requests the thread based model suffers from thread blocking wheres Ratpack is almost non blocking. To get familiar with Ratpack I’ve decided to implement a server component for Neo4j based on Ratpack. The first goal was to have a cypher endpoint, just like the standard Neo4j offers. Secondary goals were some more features:

  • support for multiple output formats: json, html, csv, message pack
  • ability to get a list of currently running queries and a button to abort each one individually. This is IMHO a feature lacking in classic Neo4j server. Esp. people getting started with cypher tend to write queries that run very long and there is currently now way to abort them.

For the future I’d like to add some more features:

  • transactional cypher endpoint
  • tbd (if you have ideas, please send a comment)

The goal is by far not to create a full fledged alternative to the existing Neo4j server. This project should focus on maximum throughput and ease of use for a cypher-only server component. To get started I’ve cloned https://github.com/ratpack/example-ratpack-gradle-groovy-app. You’ll find my code at https://github.com/sarmbruster/neo4j-ratpack.

Handling Requests

In ratpack you either write inline handlers in src/ratpack/ratpack.groovy or, for more complex cases, write a handler class derived from AbstractHandler and register that in ratpack.groovy.

Ratpack features Google Guice as well, so we can register e.g. a GraphDatabaseService as injectable component. See Neo4jModule, we’re exposing and configuring a GraphDatabaseService, a Cypher ExecutionEngine, a guard (see below) and a QueryRegistry. Other components can refer to them using the @Inject constructor annotation.

The core piece of code is CypherHandler, it parses the cypher command and parameters out of the request, runs it and renders the result depending on the requested content type.

Terminate Queries

From tech perspective this was the most interesting part to write. Neo4j can be run with a optional guard. Since this feature is not part of the public API it is not officially documented and might therefore be changed without further notice – be warned. To enable the guard feature a config option execution_guard_enabled needs to be set to true. However you can get access to the guard by calling ((GraphDatabaseAPI)graphDb).dependencyResolver.resolveDependency(Guard.class). In neo4j-ratpack the guard is exposed as a guice component so any ratpack handler can just inject it.

Each query is registered with a QueryRegistry. Part of that process is setting up a VetoGuard that throws an exception based on a boolean flag. In case of an exception the query is aborted.

Load Tests

Next step was running some load tests to a standard Neo4j server and neo4j-ratpack in order to compare the performance of the server components. All tests were run on my ThinkPad x230 (i7-3520M, 2.9GHz, 16 GB RAM, Ubuntu 13.04). For simplicity load generation and the server itself were running on the same machine – which is by far not perfect, but a starting point.

The intention of these load tests is not measuring Neo4j itself – it focusses on the server component only.

Using jmeter I’ve run a cypher query

with different parameters against a graph db consisting of 1.6M nodes, 7M relationships and 7M properties. Kudos to my colleague Alex who helped me setting up the dataset based from the LDBC project he’s involved with.

Exactly the same graph.db was used by both Neo4j server and neo4j-ratpack. No specific JVM tuning parameters were set. I’ve run the load test with a increasing number of concurrent threads and focussed on observing throughput and latency. The following diagrams were created using a python matplot script orginating from http://www.metaltoad.com/blog/plotting-your-load-test-jmeter. Please note, the latency is displayed in green on logarithmic axis, throughput is in blue on linear axis (ranges are different for the diagrams).




We’re observing a increasing rate of errors when going beyond 25k threads. Since the loadgenerator is colocated with the system to test this seems to be point where jmeter’s own memory and CPU consumption influences the system under test too much – so we’ll disregard the range above 25k.

The most interesting finding is that with ratpack the latency remains nearly constant in the range of [2.5k – 10k] threads whereas the standard neo4j server shows increasing latency. At 2.5k threads ratpack shows fully saturated CPU that’s why throughput decreases. With more or faster CPU we could improve both, latency and throughput. The explanation for the difference observed can be found in the different threading model. Neo4j server uses internally jetty which does blocking IO in opposite to ratpack using Netty. To verify this, I’ve taken threaddumps with yourkit:

threading telemetry of neo4j server

threading telemetry of neo4j server

threading telemetry of neo4j-ratpack

threading telemetry of neo4j-ratpack

It’s interesting to see that Neo4j server uses 10 worker threads per core (40 in total on my laptop). Most of the time, most of them are in blocked status indicated by the red color. Ratpack on the other side has 8 worker threads being mostly in ‘green’ aka runnable status. So ratpack indeed uses non blocking IO.


For cypher-only use cases with high concurrency requirements using ratpack instead of neo4j server might be an interesting alternative. However be aware, ratpack is bleeding edge, the current version is 0.9-SNAPSHOT.


nice addition back in Neo4j 1.9.1: closeable ExecutionResult

When using Cypher from Java code one instantiates a ExecutionEngine and calls execute to get a instance of ExecutionResult. ExecutionResult is an Iterable and therefore provides access to an iterator() method. Up to Neo4j 1.9 it is recommended to fully consume the iterator until hasNext() returns null, otherwise it’s not guaranteed that all resources are freed up again.

Since Neo4j 1.9.1 ExecutionResult implements ResourceIterable as well. This means the iterator has a close() method to free up bound resources without completely consuming the iterator.

I guess a lot of Neo4j users might not have explored that small but very helpful addition yet, so I think it’s worth mentioning.