Neo4J is unusual in that it offers more than one production-viable configuration for integrating with software. At a glance these are limited to either Embedded or Server, and I would suspect (completely unscientifically) that the most commonly asked question about Neo4J is “Embedded or Server?” in some variation.
So having been tinkering with Neo4J for a year or so now, and having personally switched my own mini project (http://size.me.uk) from Embedded to Server and then back to Embedded, I feel like I can throw in my 2 pence.
I’ll also discuss a couple of additional considerations and tools for your toolbox that can be explored, namely a pure in-memory implementation, server extensions and the server bootstrapper.
You will then hopefully see that “Embedded vs Server” is a bit simplistic. If we add in “Embedded + ServerBootstrapper” or “Server + Server Extensions” we can mitigate some of the arguments against in both cases. With this extra power comes responsibility, which I will try to draw attention to as well.
It is worth pointing out now that I am not considering scalability and the High Availability offering; however, you should be aware that these facilities exist for both Server and Embedded in the enterprise product.
Running Neo4J as a server will be the most common use case for a database in production. There exists a database and multiple applications share the data in it. The database is started as a separate service and clients interact with it via its API using JSON over HTTP, a fairly common setup in the NoSQL space.
Interacting with the database in server mode is essentially like interacting with a set of web services, and there are certainly a good number of them: one for nodes, one for relationships, another for properties, then Cypher and several more. The REST API does allow you to “batch” commands together (via a separate endpoint) to be completed as a transaction on the server. This feels a bit odd, but it ensures Neo4J can deliver ACID properties.
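To make the batch idea a little more concrete, here is a minimal sketch of what such a payload might look like, built by hand in plain Java. It assumes the batch format as I understand it from the REST documentation (a JSON array of jobs, each with a method, to, body and id, where {0} refers to the result of an earlier job). The class, the helper methods, the node names and the KNOWS relationship type are all made up for illustration, so check the documentation for your version before leaning on it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds the JSON body for a batch request by hand.
// BatchPayloadSketch and its helpers are hypothetical, not part of any Neo4J client.
final class BatchPayloadSketch {

    // One job in the batch: HTTP method, target path, JSON body and a job id.
    static String job(int id, String method, String to, String bodyJson) {
        return "{\"method\":\"" + method + "\",\"to\":\"" + to
                + "\",\"body\":" + bodyJson + ",\"id\":" + id + "}";
    }

    // The batch itself is simply a JSON array of jobs.
    static String batch(List<String> jobs) {
        return "[" + String.join(",", jobs) + "]";
    }

    // Two nodes and a relationship between them, sent as one unit of work.
    static String examplePayload() {
        List<String> jobs = new ArrayList<>();
        jobs.add(job(0, "POST", "/node", "{\"name\":\"Alice\"}"));
        jobs.add(job(1, "POST", "/node", "{\"name\":\"Bob\"}"));
        // Assumption: "{0}" and "{1}" are placeholders for the nodes created by jobs 0 and 1.
        jobs.add(job(2, "POST", "{0}/relationships", "{\"to\":\"{1}\",\"type\":\"KNOWS\"}"));
        return batch(jobs);
    }

    public static void main(String[] args) {
        // This string would be POSTed to the batch endpoint in a single request.
        System.out.println(examplePayload());
    }
}
```

The point is that all three operations travel in one request and succeed or fail together, rather than as three separate calls that each commit on their own.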
Perhaps the verbose nature of Java doesn’t help, but I didn’t enjoy interacting with Neo4J over the REST API. It takes a lot of code to support a run-of-the-mill application. I had numerous crises of conscience over whether to fetch a larger graph in one call, or smaller parts over several calls. I also seemed to spend an unpleasant amount of time examining failing JSON payloads, only to find something like a “my bad”, head-palming null on a field. Using the native REST API from Java certainly feels clunky.
Moving on, HTTP wire chatter is cited as possibly the main downside of the Server variation, as it is slower than the embedded option; this is an inevitable truth. A better question might be: is the combination of querying and wire transfer slower than using an alternative? Neo4J wins big on querying, especially with large datasets (due to the way graph databases are indexed), so it will likely still be faster.
Also consider where the hardware is located: can both servers be put on the same subnet? If, like me, you only have a noddy application and they both run on the same server anyway, latency isn’t going to be a debilitating factor.
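A back-of-the-envelope sketch of that reasoning follows. Every figure in it is a made-up assumption rather than a measurement, and the class is purely illustrative; the only thing it shows is that the network cost is paid once per call, so the number of round trips matters as much as the latency of each one.

```java
// Back-of-the-envelope sketch; all figures are hypothetical assumptions, not benchmarks.
final class RoundTripSketch {

    // Total time in microseconds for a use case needing `calls` requests,
    // each paying one network round trip plus the query time on the database.
    static long totalMicros(int calls, long roundTripMicros, long queryMicros) {
        return calls * (roundTripMicros + queryMicros);
    }

    public static void main(String[] args) {
        long query = 2_000;    // assumed: 2 ms of query time per call
        long sameHost = 500;   // assumed: 0.5 ms HTTP round trip on the same box
        long remote = 5_000;   // assumed: 5 ms HTTP round trip to another machine

        System.out.println("embedded, 20 calls:          " + totalMicros(20, 0, query) + " us");
        System.out.println("server same host, 20 calls:  " + totalMicros(20, sameHost, query) + " us");
        System.out.println("server remote, 20 calls:     " + totalMicros(20, remote, query) + " us");
    }
}
```

Plug in your own numbers; if the query time dominates, the wire overhead is noise, and if the call count dominates, it is worth looking at how chatty the application is before blaming the server.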
With the server variation you also get the management console out of the box, essentially a web UI for inspecting database statistics and querying the information held there. The console tab, which allows you to run Cypher queries and fire REST requests, is essential. Also bear with the visual representation of the graph; it looks a bit basic at first until you customize what nodes should look like.
At this point it is worth mentioning Spring Data as a persistence API. When I first started cutting code I had the attitude of “I don’t need no stinking API” to put JSON on a request. I was wrong though; not because it was hard, but because it was dull, laborious and largely unsatisfying; I wanted to play with Cypher!
There is a price to pay: your data carries an extra type property, and indexes, that define the binding class. By default this is the fully qualified class name, which can be tailored to something sensible with the @TypeAlias annotation.
Although I am yet to try it, one of the wins of Spring Data is that you are meant to be a flick of a switch away from using either an Embedded or a Server instance, should you want to change.
If ease of use is a bigger driver than absolute performance, it is definitely worth looking at rather than the standard API.
Embedded (In Process)
Firstly, the term Embedded should not be confused with an “in-memory” database in the way developers are used to using HSQLDB for throwaway testing. Both the Embedded and Server versions, size permitting, will work in memory; the embedded offering is “in-process” (of the parent application) and ultimately persists to disk.
Unfortunately sharing the same process has its limitations. Improper shutdown and termination could potentially leave the database in an unrecoverable state (using normal start-up). I have indeed seen this happen, albeit on my local machine, where admittedly I am a lot more brutal with terminating processes. Bear in mind that having the database in a separate process doesn’t guarantee safe shutdown either.
Mirroring the argument that the Server variation is slower, the most common argument for embedded is speed. Ideally we get to interact with the data completely in memory for the fastest results. Beyond that, if the graph is sufficiently large we will have to go to disk. Ultimately we have no wire transfer or network communication to contend with, and access gets orders of magnitude slower as we move from memory to disk to network.
You get fine-grained interaction with the database. Directly accessing the database allows for easy control of transactions that can span multiple writes. This will be familiar to developers and is much like traditional RDBMS interactions.
Unfortunately there is no administration console out of the box, although we can correct this concern; read on.
Neo4J does indeed offer a true in-memory implementation, the ImpermanentGraphDatabase, useful for integration tests or the cheeky “unit test” that exercises Cypher queries. Inevitably the data is lost after the process terminates, so it is of limited use for anything else.
Ok, so we have covered the headline acts and it is onto the supporting cast.
The lesser-known WrappingNeoServerBootstrapper class lets you wrap an embedded graph instance. This grants the additional benefits of the REST API and the administration console. This is actually the configuration I operate, but principally for the purpose of providing the administration console.
I’d have to urge real caution with letting other systems access the database via the REST API. If the host application fails, the database also fails; if you want to deploy a new version of the application, the database will become unavailable.
That said, a production database without an administration console is a real deal breaker, so this will likely be the common configuration for those running an embedded instance.
Server Extensions allow you to provide server-side code that augments the standard Neo4J REST API with your own operations.
After using the REST API for a time you might begin to feel like you want to move towards these. As I alluded to before, it can actually take a lot of requests to perform fairly uninteresting use cases.
So with server extensions you get the stock benefits of speed, as you can move some logic to the server side, and of reuse, because other applications can call the same operations. To me, however, the arguments start to smell the same as the pros for PL/SQL and stored procedures, and the cons are largely the same as well: extra load on the database server, applications that are harder to migrate away, and one-size-fits-all operations that are tough to create and even tougher to change.
That said, you will need server extensions in order to expose Neo4J’s lower-level APIs; you’re not going to be able to call the traversal API remotely in the same way you can invoke a Cypher query. So in this case there is really no choice in the matter.
I guess I would just urge caution about creating server extensions for the reuse argument, perhaps convince me with speed or another motivator. Reuse is too often cited by people as a benefit, but it often just means “dependency”.
Ultimately the choice depends on your use case, as there are hard constraints to consider around how many systems are going to interact with the database. Personally I think people will too easily dismiss embedded as a viable option because it is too different from the norm, i.e. a database is a shared resource and that’s how we build software.
I do however challenge you to consider whether we should still be building systems this way. If, for example, you are genuinely trying to deliver service-orientated systems, where only the owning service should have direct access to the data, then the embedded option gives you the ability not just to notionally encapsulate your data in a schema, but to physically encapsulate it as well.
Perhaps it is symptomatic of the organizations I have worked for, but this is a really understated plus. It leaves no room for shortcutting the service and accessing its data directly in an hour of need; similarly, some illicit system will not pop up complaining when you do some model refactoring.
If we concede that only one service should be accessing the data directly, I would venture that the benefits of Embedded outweigh those of running as a Server.
[Graph Databases] An excellent starting point for Neo4J and graph databases in general. It stays high level, which makes it great holiday reading as you won’t need your PC!