indexing in Neo4j – an overview

As a graph database, Neo4j uses indexes as the preferred way to find start points for graph traversals. Over the years several different indexing approaches have been added. The goal of this article is to give an overview of them and to avoid confusion, especially for those who only recently got started with Neo4j.

A graph database using a property graph model stores its data in nodes, relationships and properties. In Neo4j 2.0 this model was extended with labels.

no indexes in the beginning

In the very early days of Neo4j there was no index. The only way to walk through the graph was by linking `interesting` things to the reference node. The reference node or “node 0” acted as a global entry point. Up to and including the 1.9.x versions, GraphDatabaseService had a deprecated getReferenceNode method, a historic relic that was cleaned up in 2.0.

manual indexes

At some point the Neo4j hackers realized that users don’t want to take the error-prone and cumbersome route via the reference node to find start points for graph traversals. This is when a feature called ‘manual indexing’ appeared. This was back in the days before 1.0, a dark age without Cypher and server mode. The only way to talk to your graph was the Java API, so manual indexing had to be done through the Java API as well. The main entry point is graphDatabaseService.index(), which gives access to the IndexManager. Every index operation has to be done explicitly and manually: nothing keeps the index in sync with the graph for you. This freedom also made it easy to abuse indexes, for example by storing data in them that exists nowhere in the graph. As a general pattern, the index should be seen mainly as a lookup service and not as a secondary datastore: it should not contain any information that does not reside in the graph itself.
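
To make the ‘explicit and manual’ part concrete, here is a toy in-memory model in Python. It is not the Neo4j Java API; the ManualIndex class and its add/remove/get methods are hypothetical stand-ins chosen for illustration. It only shows that the index holds node ids for lookup and that nothing updates it unless the caller does.

```python
from collections import defaultdict


class ManualIndex:
    """Toy stand-in for a named manual index: (key, value) -> node ids."""

    def __init__(self):
        self._entries = defaultdict(set)

    def add(self, node_id, key, value):
        self._entries[(key, value)].add(node_id)

    def remove(self, node_id, key, value):
        self._entries[(key, value)].discard(node_id)

    def get(self, key, value):
        return sorted(self._entries[(key, value)])


# The "graph": node id -> properties. The index stores only ids, never data.
nodes = {1: {"name": "abc"}}
people = ManualIndex()
people.add(1, "name", "abc")

# Changing the property does not touch the index ...
nodes[1]["name"] = "xyz"
stale_hit = people.get("name", "abc")    # still finds node 1
missing_hit = people.get("name", "xyz")  # finds nothing

# ... so the caller has to mirror the change explicitly.
people.remove(1, "name", "abc")
people.add(1, "name", "xyz")
fresh_hit = people.get("name", "xyz")
```

Because the index only maps to node ids, everything you read after a lookup still comes from the graph, which is the ‘lookup service’ role described above.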

Querying manual indexes was added to Cypher, so to access a manual index you use

START n=node:Person(name='abc') RETURN n

With `node` you refer to an index on nodes, `Person` refers to the index named Person and `name` is the property within the index. With manual indexes you can index relationships as well. Indexing relationships is however a rare use case.

A nice aspect of manual indexes is that you can pass in options when the index is first created. This allows you to configure an index for fulltext indexing or to choose a different analyzer, see http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html.

automatic indexes

In Neo4j 1.4 a new feature was introduced: auto indexing. Under the hood it’s a manual index with a fixed name (node_auto_index, relationship_auto_index) combined with a TransactionEventHandler that mirrors changes on a set of configured property names to the index. Typically auto indexing is set up in neo4j.properties. This approach removes much of the burden of manually mirroring your property changes to the index, and it allows Cypher statements to modify the index implicitly.

START n=node:node_auto_index(name='abc') RETURN n

From Cypher’s perspective there is no difference to manual indexes, except that you have to use the predefined index names (node_auto_index here).

It’s important to know that a change to the auto index configuration will not trigger reindexing of existing data. A commonly used trick is to set each affected property to its current value, which forces the entry to be (re)indexed.
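
The reindexing trick is easier to see in a small model. The Python sketch below is a toy, not Neo4j code: AutoIndexedStore, set_property and lookup are made-up names, and the set of indexed keys stands in for the auto index configuration. It assumes only what is described above, namely that the index is updated as a side effect of property writes.

```python
class AutoIndexedStore:
    """Toy model: property writes are mirrored into an index for configured keys."""

    def __init__(self, indexed_keys=()):
        self.indexed_keys = set(indexed_keys)  # stands in for the auto index configuration
        self.nodes = {}   # node id -> {property key: value}
        self.index = {}   # (property key, value) -> set of node ids

    def set_property(self, node_id, key, value):
        props = self.nodes.setdefault(node_id, {})
        old_value = props.get(key)
        props[key] = value
        if key in self.indexed_keys:
            # Mirror the write: drop the old entry (if any), add the new one.
            self.index.get((key, old_value), set()).discard(node_id)
            self.index.setdefault((key, value), set()).add(node_id)

    def lookup(self, key, value):
        return sorted(self.index.get((key, value), set()))


store = AutoIndexedStore()            # nothing is auto indexed yet
store.set_property(1, "name", "abc")  # written before the configuration change

store.indexed_keys.add("name")        # configuration change: index 'name' from now on
before = store.lookup("name", "abc")  # the existing node was not picked up

# The trick: write every affected property back with its current value.
for node_id, props in store.nodes.items():
    if "name" in props:
        store.set_property(node_id, "name", props["name"])
after = store.lookup("name", "abc")   # now the node is found
```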

Another shortcoming is that the configuration of property keys to be indexed is global. Assume you have persons with a name property and cities with a name property. Any query using n=node:node_auto_index(name='abc') can potentially return both persons and cities. Therefore you should choose distinct property keys for different semantics.
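
The collision can be reproduced with a plain dictionary. This is a Python sketch under the assumption that the auto index knows only property keys and values, not what kind of thing a node represents; the function name and the personName/cityName keys are invented for the example.

```python
def build_auto_index(nodes, indexed_keys):
    """Index node ids by (property key, value); the node's kind is ignored."""
    index = {}
    for node_id, props in nodes.items():
        for key in indexed_keys:
            if key in props:
                index.setdefault((key, props[key]), []).append(node_id)
    return index


# Shared key: a person and a city both use 'name'.
shared = {
    1: {"kind": "person", "name": "abc"},
    2: {"kind": "city", "name": "abc"},
}
shared_hits = build_auto_index(shared, {"name"}).get(("name", "abc"), [])

# Distinct keys per meaning (key names here are made up for the example).
distinct = {
    1: {"kind": "person", "personName": "abc"},
    2: {"kind": "city", "cityName": "abc"},
}
distinct_index = build_auto_index(distinct, {"personName", "cityName"})
person_hits = distinct_index.get(("personName", "abc"), [])
city_hits = distinct_index.get(("cityName", "abc"), [])
```

With the shared key, the lookup returns both node ids; with distinct keys, each lookup returns only the node it was meant for.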

schema indexes

One of the shiniest new features in Neo4j 2.0 is schema indexes. Schema indexes `feel` a lot like the indexes we’re used to from the relational world. A schema index is declared on a label for a certain property.

CREATE INDEX ON :Person(name);

The above statement will create an index for the name property on all nodes carrying the Person label. Very conveniently, the index will automatically be populated with preexisting data.

Queries no longer have to name an index explicitly; the behaviour is closer to what we know from SQL. When there is an index that can make a query more performant, it will be used. Assume a query like

MATCH (p:Person {name: 'Stefan'}) RETURN p

If no index is set up, this will look up all Person nodes and check whether their name property matches Stefan. If an index is present, it will be used transparently.
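
The difference between the two execution strategies can be sketched in a few lines of Python. This is a conceptual toy, not how Neo4j implements either path; the function names and sample nodes are made up. Both strategies return the same nodes, they differ only in how much work is needed to find them.

```python
# Nodes carrying the Person label, as (node id, properties) pairs.
persons = [(10, {"name": "Stefan"}), (11, {"name": "Anna"}), (12, {"name": "Stefan"})]


def label_scan(label_nodes, key, value):
    """Without an index: visit every node with the label and compare."""
    return [node_id for node_id, props in label_nodes if props.get(key) == value]


def build_schema_index(label_nodes, key):
    """With an index: one pass up front, then direct lookups by value."""
    index = {}
    for node_id, props in label_nodes:
        if key in props:
            index.setdefault(props[key], []).append(node_id)
    return index


name_index = build_schema_index(persons, "name")  # populated from preexisting data
scanned = label_scan(persons, "name", "Stefan")   # touches all three nodes
indexed = name_index.get("Stefan", [])            # a single dictionary lookup
```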

Constraints are used almost the same way as schema indexes. E.g. to ensure uniqueness on the name property for nodes having the Person label use

CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE

Currently schema indexes cannot span multiple properties, but you can have multiple indexes for the same label. If you want to do combined searches, a workaround is to aggregate the values into a combined property. E.g. if you have firstName and lastName and want to do a combined lookup, you might introduce a property name consisting of firstName + lastName and index only the name property.
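
A minimal Python sketch of the combined-property workaround follows. The sample persons, the combined_name helper and the single-space separator are assumptions for illustration; the dictionary stands in for a schema index on the name property.

```python
def combined_name(first_name, last_name):
    # The separator is an assumption; the text only says firstName + lastName.
    return f"{first_name} {last_name}"


# Made-up sample data: node id -> properties.
persons = {
    1: {"firstName": "Stefan", "lastName": "Armbruster"},
    2: {"firstName": "Stefan", "lastName": "Miller"},
}

# Derive the combined property on every node ...
for props in persons.values():
    props["name"] = combined_name(props["firstName"], props["lastName"])

# ... and index only that one property.
name_index = {}
for node_id, props in persons.items():
    name_index.setdefault(props["name"], []).append(node_id)

# A combined lookup builds the key the same way it was stored.
hits = name_index.get(combined_name("Stefan", "Armbruster"), [])
```

The derived property has to be rewritten whenever firstName or lastName changes, and lookups must use exactly the same way of combining the values.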

Schema indexes are much simpler to use than manual or auto indexes, so anyone starting with Neo4j should mainly look at schema indexes. To make this point clear, the reference manual covers manual and auto indexes in a section called ‘legacy indexes’.

2 thoughts on “indexing in Neo4j – an overview”

  1. Stefan Armbruster Post author

    Hi Timmy, fulltext support for schema indexes is a frequently requested feature. It’s on the internal roadmap but not yet prioritized, so I cannot give any ETA at that point.
