InfoQ: Neo4j Graph DB

My name is Charles Humble and I am here at QCon New York 2014 with Ian Robinson. Ian, can you introduce yourself to the InfoQ community?

Hello, I am Ian Robinson. I am an engineer at Neo Technology, I am based in London and I work on the Neo4j graph database.

2. What is a graph database?

A graph database? It’s a database, it’s an OLTP system, the kind of database that you could insert between a web request and a web response, but it allows you to model, store and query your data as a graph or as a network structure. So instead of modeling your data as a tabular structure and querying it with set-based operations, as you do in a relational database, with a graph database you are modeling your data as a network structure, and storing it as a network structure and then querying it as a network structure. So that structure comprises nodes and relationships and properties.
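To make that concrete, here is a minimal sketch in Cypher, Neo4j's query language; the labels, relationship types and property names are invented for illustration:

```cypher
// Store data as a graph: two nodes, a named, directed relationship,
// and key-value properties on both the nodes and the relationship
CREATE (alice:Person {name: 'Alice'})-[:KNOWS {since: 2010}]->(bob:Person {name: 'Bob'})

// Query it as a graph, by matching patterns rather than joining tables
MATCH (p:Person)-[:KNOWS]->(friend:Person)
RETURN p.name, friend.name
```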

3. I guess social networks have done a lot to popularise the idea of graphs; I think everyone’s heard of the Facebook social graph for instance. So what is it about social data that lends itself naturally to being treated as a graph?

Social data or social behaviors are intrinsically based upon relationships. People behave within groups and they have friends and there are friends of friends, and so on, and there is an enormous body of research that demonstrates that there are emergent behaviors within these group structures, or these network structures, and that there are properties that we can identify and effectively predict certain behaviors or the transmission of certain behaviors within those social groups. But those groups are always structured as a network, or as a graph-like structure, and therefore I think ... There have been many, many years of analysis about these kinds of things, but it’s really within the last five years or so where it’s become prominent in the public’s mind because of things like Facebook, which have really given us, everybody, the ability to make concrete all of those connections and to record them and take advantage of them over the internet.

4. So presumably, you can exploit that for things like recommendation engines and that sort of thing?

Yes, for recommendations engines, both in terms of being able to recommend other people within that network who you might want to connect with - there are emergent properties within that network that say if we have friends in common, we’re more likely to get to know one another than if we are completely disconnected within a network, and we can use those kinds of properties within a network to suggest or recommend people to connect with. And similarly, if you start introducing other elements into the network, things that people buy or actions that they perform or sports and hobbies and so on, again we can begin to identify or predict likely paths that will be pursued in the future and use that to effectively create recommendations.
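A friend-of-friend recommendation of the kind described here might be sketched in Cypher like this (all names hypothetical):

```cypher
// Suggest people Alice does not yet know, ranked by mutual friends
MATCH (me:Person {name: 'Alice'})-[:KNOWS]-(friend)-[:KNOWS]-(foaf)
WHERE foaf <> me AND NOT (me)-[:KNOWS]-(foaf)
RETURN foaf.name AS suggestion, count(DISTINCT friend) AS mutualFriends
ORDER BY mutualFriends DESC
```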

5. Tell me a bit about the history of Neo4j. When did you start development on it?

As a product, it’s more than ten years old; I guess it’s about 12 years old now. It started life as a database backing a content management system for the Swedish military, I believe, and it’s still being used there in that capacity. But several years later it was broken out as a separate product and then open-sourced, so it’s been an open source graph database for nearly ten years now, with lots of contributions from the community. Then about five or six years ago Neo Technology was founded to sponsor that development. Neo employs a lot of the developers who work on the kernel, and we still accept a lot of contributions from the community, but Neo sponsors the development of the kernel and ensures that the integrity and the quality of the product are as they should be moving forward.

Charles: And it’s JVM based, I understand. 

Yes, it’s JVM based and hence the 4j, a wonderful piece of foresight in naming the product all those years ago. But, yes, it’s 4j, it’s JVM based, it’s written almost entirely in Java, although some of the query language or the parsing for the query language is written in Scala.

Charles: So, tell me a bit about the data structure. It’s some kind of property graph I guess. 

Yes. I think it’s probably fair to say that Neo invented the term property graph and our original data model was those three fundamental elements: nodes, relationships, and properties. We use nodes to represent the things in your domain that you are interested in, things with identity, the entities in your domain; and then you can attach properties to those nodes, simple key value pairs to represent the attributes of those entities. So at that point you have a very simple record-like structure. But then you can begin to introduce even more structure by connecting those nodes with relationships; so the relationships are first class elements in the data model. Every relationship has a name and a direction, so that lends semantic clarity. We are kind of baking in the semantics of the ways in which things are connected at the level of data model. And then we can also attach properties to these relationships: so we can qualify the relationships with their strength or their weight or their quality. That’s really useful when you’ve got a network and you want to be able to calculate the shortest route between two parts of that network: you can take advantage of the properties on the relationships in order to calculate the shortest path.

So they’re the three fundamental parts of the model that we’ve had for years: nodes, relationships and properties. And then more recently we revised the data model, at the end of last year, introducing a fourth primitive which we call labels. Labels you can attach to nodes; you can ef-fectively label or tag nodes; and we use those labels to help you identify the role that a node plays within your dataset: this node represents a person or this node represents a product. So that gives us a very simple grouping semantic where you can say to the database, “Find me all of the nodes that are labeled product”, for example. But we also then use the labels to associate certain behaviors with those nodes, things such as indexing and constraints. So we can associate indexes and constraints with labels and we can say to the database, “For every node that’s labeled ‘book’, I want you to index it based on its ISBN property” or “Every node that’s labeled ‘book’, I want you to ensure that its ISBN property is unique within the context of this dataset”. So the label is a very, very simple way of tagging nodes, of helping express the role that they play in the dataset, but they also give us those additional capabilities of indexing and optional constraints that we can layer onto the structure.

6. What kinds of constraints can you apply with that?

Today we have two kinds of behavior: indexing and unique constraints; this is what we released at the end of last year; and we hang them both off labels. But in the future we will be hanging additional constraints off labels as well. So, I think you can expect within the next year or so to see additional constraints, such as: every node that is labeled ‘person’ must have a first name and a last name property; or every node that is labeled ‘author’ can only be connected by way of outgoing WROTE relationships to nodes labeled ‘book’; all different kinds of constraints. We don’t have those today; they are things that you have to build in the application; but they are some of the kinds of constraints that we have in mind for the future.

7. Neo4j is quite unusual in NoSQL in that it supports ACID transactions. Is it the data structure that informs that design choice?

Yes. Well, the data structure, plus the fact that we know that enterprises still care very much about their data and about its transactional characteristics when they are dealing with it. Traditionally Neo4j has been seen as part of that larger family of NoSQL databases, but other NoSQL databases can surrender certain ACID properties in order to gain other benefits: benefits around horizontal scalability or high availability. They can do that very often because they are dealing with what Martin Fowler calls aggregates: simple key-value pairs, or discrete documents, and so on.

We are slightly different. When you are working with a graph you are working with an intrinsically connected structure, and very often what you want to do is create a sub-graph in its entirety: lots of nodes, lots of relationships that connect them, and all the properties attached to them. And when you persist that, you want to be confident that the entire structure, not just the discrete elements, has been made durable on disk. Hence we have transactions; every change to the database has to take place in the context of a transaction. That way we can guarantee that when control returns to the client, that entire sub-graph has been made durable on disk. And the advantage there is we slot very easily into enterprises that require of us all of those guarantees around the quality and the durability of their data.

Charles: Presumably there are trade-offs though for having ACID compliance. 

Yes. Again, one of the benefits of relaxing some of those ACID properties and dealing with simple, aggregate-oriented data, as you get in a key-value store or a column-oriented database, is that it makes it very, very easy to distribute that data and provide for very large horizontal scalability. With Neo4j today, it’s not so easy to distribute the graph data. We cluster, so Neo4j is a clustered database product, but every instance in that cluster has a full copy of the dataset. Actually distributing the data across machine boundaries, and querying across machine boundaries, is a really tricky problem to solve in the general case. We know for sure that by far the best performance when you are creating and querying a graph structure is going to be achieved when you can work within the context of a single machine. The ACID requirements, and the fact that we are dealing with an intrinsically connected structure, make it more difficult for us to easily distribute that data across different machines. At the same time, we get the benefits of working locally within a single instance when we are satisfying a request, and that can give us orders-of-magnitude improved performance over the equivalent kind of query in a relational database.

Charles: And the data replication itself is master-slave, I think.

Yes. So, we have a cluster of machines and one of those machines at any point in time will be acting as the master, coordinating all of the writes to the system, and the other machines will be serving as slaves, polling at frequent intervals, pulling across recent transactions and applying them locally. This means that if you write directly to the master, by the time control returns to the client, that write is immediately consistent with regard to that particular interaction; but the overall system is eventually consistent, in the order of milliseconds. You can configure the polling interval, but there is that window of eventual consistency where the slaves are pulling across the recent transactions and catching up.

8. Do you have any upper limit on the number of slaves that you can have?

I think we are probably talking here in terms of clusters in the single digits, perhaps just touching the double digits. I’ve spun up clusters of 30-40 instances, but I only know of customers who run clusters of 11-12 instances or so. But we are talking those kinds of numbers.

9. We talked earlier on in the conversation about Facebook as an example of a use case. I’d like to look at some other examples. So, route finding and logistics I guess is one for you, is it?

Route finding goes back to the roots of graph theory itself: discovering routes and shortest routes through a system. Being able to calculate routes on a transport network, for example, is one of the ways that graph databases, and Neo4j in particular, are being employed today. In fact, the largest parcel network in Europe is using Neo4j today to route all of the parcels through that network; there are 2,000, at peak times 3,000, parcels per second entering the network at post offices and other delivery centers. And at that point in time they have a window of up to a second to calculate the most effective or efficient route through the parcel network to the ultimate destination, because the parcel is actually travelling down a chute, and at the bottom of the chute it’s going to get switched one way or the other depending on where we think it should be headed.

So, we’ve got that little window of time in which to calculate the route through the entire parcel network; and you are talking about millions of locations all connected by way of different delivery routes. And then the problem is made even more difficult because those delivery routes are time-sensitive. You’ve got more trucks on the road at Christmas time, when there are more parcels in the network, than you do during the summer, so when you are calculating the route you have to be aware of which routes are available today and in the next few days looking forwards. But as I say, Neo4j there is being used to effectively calculate the routes for 2,000 - 3,000 parcels per second and it has to do that within milliseconds; before that parcel has reached the bottom of the chute.

10. And do you have customers using Neo4j for network impact analysis?

Yes, there are several different telcos who are employing it in one form or another for network impact analysis. The nice thing here is that in the past a lot of those telcos would know about the different parts of their network through several different inventory systems; you’ve actually got a very heterogeneous set of elements in that network, from the customers themselves, through the applications and the services, the machines, the virtual machines and the data centers, down to all those lower-level network elements: the fibers and the switches and the routers and so on. These companies are now using Neo4j to model the entirety of that network in the database, and having done that they can use it for impact analysis. And you can do two kinds of impact analysis. The top-down analysis is: given this important customer, which parts of the network do they actually depend upon, which applications and services and machines and data centers and so on; do we have redundancy throughout the network on behalf of these people, and if we haven’t, how can we introduce it? The bottom-up analysis is: given this particular network element, this switch or this router, who is it going to impact if it fails or if we have to replace or repair it? So companies are using it to plan a lot of their repair programs so that they don’t knock out a path through the network that would impact their customers. And having modeled all of that in Neo4j, it’s helping reduce problem resolution time. In the past it would have taken hours or days just to query all of those inventory systems to identify the likely cause of a problem. Now you can identify the routes in seconds or milliseconds, and that naturally improves response times.

Another pattern I am seeing with some of those customers: they’ll have the model of the network, which they will refresh at least daily, probably more often; but then they may have separate systems, complex event processors perhaps, dealing with the stream of very low-level network events, hundreds of thousands of events per second coming off the network. If those complex event processors identify a significant business event, having looked at that window of low-level events and discovered that it looks like there is an outage in a particular region, they can then interrogate the graph to discover the likely impact and to anticipate what kind of mitigation they need to take.
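Both directions of impact analysis translate naturally into variable-length path queries; a rough sketch, with hypothetical labels and relationship types:

```cypher
// Bottom-up: who is affected if this router fails?
MATCH (r:Router {id: 'router-42'})<-[:DEPENDS_ON*1..6]-(affected)
RETURN DISTINCT labels(affected) AS kind, affected.name AS name

// Top-down: which network elements does a key customer depend on?
MATCH (c:Customer {name: 'BigCorp'})-[:DEPENDS_ON*1..6]->(element)
RETURN DISTINCT element.name
```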

11. Do you have financial services companies maybe doing fraud detection work with Neo as well?

Yes. In fact, I think we’ve written something recently on the Neo4j blog about using Neo for fraud detection. One of the use cases there is something called first-party fraud, where you have a group of people who work together to defraud financial institutions. You’ve got a fraud ring: a number of people constitute the ring, and they pool identity information, things like social security numbers, addresses, telephone numbers and so on. Those people will open multiple lines of credit, multiple accounts, credit cards and so on, picking different bits of identity information from that pool in order to support each application for that credit card or whatever. No two applications need share exactly the same set of identity information, but there is a lot of overlap: you’ve got a large number of people using the pool of identity information to open multiple lines of credit. And then for some period of time they will probably act like good citizens: they’ll pay their bills, they’ll pay off the credit card and so on. But at some point they’ll begin to extend those lines of credit, max out their credit cards, and then all of a sudden the entire ring will just close out, disappear off the radar and take all that money with them.

So, the graph there is being used to connect all those different things: the people, the accounts and the little bits of identity information, so that you can begin to identify the likelihood that you are dealing with a fraud ring. And you can apply this at the point in time when someone is applying for a new account: you can take the identity details that they are submitting, see whether there are existing pieces in the graph, connect them to those existing pieces, and introduce the new bits of identity where they refer to something new, such as a property move, and so on. Then you can look at the graph that surrounds these bits of identity information and see how broad the span is. You can imagine if you’ve got a family living at a particular address, a couple of the people there may have credit cards, bank accounts and so on; so you’d expect to see a small amount of shared information between several different accounts and several different people. But if the span is a lot broader, if you’ve got tens of people sharing that identity information, then the risk is that you are dealing with a fraud ring. So, the graph is being used there to help identify exposure and risk, and again, to help qualify applications at the point in time when someone is submitting an application online.
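A first cut at surfacing that suspicious sharing could look like this in Cypher (labels and relationship types invented; a real model would also weight the kind of identifier shared):

```cypher
// Identifiers (SSNs, addresses, phone numbers) shared by
// suspiciously many account holders
MATCH (a:AccountHolder)-[:HAS_SSN|HAS_ADDRESS|HAS_PHONE]->(id)
WITH id, count(DISTINCT a) AS holders
WHERE holders > 2
RETURN id, holders
ORDER BY holders DESC
```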

Charles: That’s very interesting. Ian, thank you very much.
