How to model complex data with a multi-model database: An aircraft fleet case study

Editor’s note: Full disclosure — the author is a developer and software architect at ArangoDB GmbH, which leads the development of the open source multi-model database ArangoDB.

In recent years, the idea of “polyglot persistence” has emerged and become popular (see, for example, Martin Fowler’s excellent blog post). Fowler’s basic idea is that it is beneficial to use a variety of appropriate data models for different parts of the persistence layer of larger software architectures. Following this idea, one would, for example, use a relational database to persist structured, tabular data; a document store for unstructured, object-like data; a key/value store for a hash table; and a graph database for highly linked referential data. Traditionally, this means that one has to use multiple databases in the same project, which leads to some operational friction (more complicated deployment, more frequent upgrades) as well as data consistency and duplication issues.

Figure 1: tables, documents, graphs and key/value pairs: different data models. Image courtesy of Max Neunhöffer.

This is the problem that a multi-model database addresses. Such a database combines a document store (JSON documents), a key/value store, and a graph database in one database engine, with a unifying query language and API that cover all three data models and even allow for mixing them in a single query. Without getting into too much technical detail, these three data models are specially chosen because an architecture like this can successfully compete with more specialised solutions on their own turf, both with respect to query performance and memory usage. The column-oriented data model has, for example, been left out intentionally. Nevertheless, this combination allows you, to a certain extent, to follow the polyglot persistence approach without the need for multiple data stores.

At first glance, the concept of a multi-model database might be a bit hard to swallow, so let me explain this idea briefly. Documents in a document collection usually have a unique primary key that encodes document identity, which makes a document store naturally into a key/value store, in which the keys are strings and the values are JSON documents. In the absence of secondary indexes, the fact that the values are JSON does not really impose a performance penalty and offers a good amount of flexibility. The graph data model can be implemented by storing a JSON document for each vertex and a JSON document for each edge. The edges are kept in special edge collections that ensure that every edge has “from” and “to” attributes that reference the starting and ending vertices of the edge respectively. Having unified the data for the three data models in this way, it only remains to devise and implement a common query language that allows users to express document queries, key/value lookups, “graphy queries,” and arbitrary mixtures of these. By “graphy queries,” I mean queries that involve the particular connectivity features coming from the edges, for example “ShortestPath,” “GraphTraversal,” and “Neighbors.”
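To make this unification concrete, here is a minimal in-memory sketch in Python rather than ArangoDB code. It assumes that documents are plain dictionaries keyed by their identity and that edges carry "from" and "to" attributes as described above. The sample documents and the `neighbors` helper are hypothetical.

```python
# Illustrative in-memory model; a real engine stores and indexes data differently.
# Every document lives in a dict keyed by its identity, so a document
# collection doubles as a key/value store.
vertices = {
    "aircraft/A1": {"type": "aircraft", "name": "Aircraft 1"},
    "components/Engine765": {"type": "engine", "isMaintainable": True},
    "parts/Screw56744": {"type": "screw", "material": "titanium"},
}

# Edges are documents too; "from" and "to" reference vertex identities.
edges = [
    {"from": "aircraft/A1", "to": "components/Engine765"},
    {"from": "components/Engine765", "to": "parts/Screw56744"},
]


def neighbors(vertex_id, direction="outbound"):
    """Return the ids of vertices adjacent to vertex_id in one direction."""
    if direction == "outbound":
        return [e["to"] for e in edges if e["from"] == vertex_id]
    return [e["from"] for e in edges if e["to"] == vertex_id]


# Key/value lookup: fetch a document by its identity.
engine = vertices["components/Engine765"]

# Graphy query: follow edges, then fetch the documents they point to.
contained = [vertices[v] for v in neighbors("components/Engine765")]
```

The linear scan in `neighbors` is only for brevity. A real graph engine would keep an index on the edge endpoints so that the edges of a vertex can be listed quickly.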

Aircraft fleet maintenance: A case study

One area to which the flexibility of a multi-model database is extremely well suited is the management of large amounts of hierarchical data, such as in an aircraft fleet. An aircraft fleet consists of several aircraft, and a typical aircraft consists of several million parts, which form subcomponents and larger and smaller components, so that we get a whole hierarchy of “items.” To organise the maintenance of such a fleet, one has to store a multitude of data at different levels of this hierarchy. There are names of parts or components, serial numbers, manufacturer information, maintenance intervals, maintenance dates, information about subcontractors, links to manuals and documentation, contact persons, warranty and service contract information, to name but a few. Every single piece of data is usually attached to a specific item in the above hierarchy.

This data is tracked in order to provide information and answer questions. Questions can include but are not limited to the following examples:

  1. What are all the parts in a given component?
  2. Given a (broken) part, what is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure?
  3. Which parts of this aircraft need maintenance next week?

A data model for an aircraft fleet

So, how do we model the data about our aircraft fleet if we have a multi-model database at our disposal?

There are probably several possibilities, but one good option here is the following (because it allows us to execute all required queries quickly): there is a JSON document for each item in our hierarchy. Due to the flexibility and recursive nature of JSON, we can store nearly arbitrary information about each item, and since the document store is schemaless, it is no problem that the data about an aircraft is completely different from the data about an engine or a small screw. Furthermore, we store containment as a graph structure. That is, the fleet vertex has an edge to every single aircraft vertex, an aircraft vertex has an edge to every top-level component it consists of, component vertices have edges to the subcomponents they are made of, and so on, until a small component has edges to every single individual part it contains. The graph that is formed in this way is in fact a directed tree:

Figure 2: A tree of items. Image courtesy of Max Neunhöffer.

We can either put all items in a single (vertex) collection or sort them into different ones — for example, grouping aircraft, components, and individual parts respectively. For the graph, this does not matter, but when it comes to defining secondary indexes, multiple collections are probably better. We can ask the database for exactly those secondary indexes we need, such that the particular queries for our application are efficient.
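As a rough illustration of this data model, the following Python sketch keeps one schemaless document per item, sorted into several collections, and stores containment as a separate edge list. It then checks that the containment graph really is a directed tree. All sample data and the helper names `all_ids` and `is_directed_tree` are hypothetical.

```python
from collections import Counter

# Hypothetical sample data: one collection per kind of item. Documents in
# different collections carry completely different attributes.
collections = {
    "fleets": {"F1": {"name": "Sample fleet"}},
    "aircraft": {"A1": {"seats": 180}},
    "components": {"Engine765": {"isMaintainable": True}},
    "parts": {"Screw56744": {"diameterMm": 6}},
}

# Containment edges as (container, contained) pairs of "collection/key" ids.
contains = [
    ("fleets/F1", "aircraft/A1"),
    ("aircraft/A1", "components/Engine765"),
    ("components/Engine765", "parts/Screw56744"),
]


def all_ids(colls):
    """Build 'collection/key' ids for every document in every collection."""
    return {f"{name}/{key}" for name, docs in colls.items() for key in docs}


def is_directed_tree(ids, edge_list, root):
    """True if root has no container, every other item has exactly one,
    and every item is reachable from root."""
    indegree = Counter(child for _, child in edge_list)
    if indegree[root] != 0:
        return False
    if any(indegree[v] != 1 for v in ids if v != root):
        return False
    children = {}
    for parent, child in edge_list:
        children.setdefault(parent, []).append(child)
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen == ids
```

Calling `is_directed_tree(all_ids(collections), contains, "fleets/F1")` on this sample returns `True`. Adding a second container for the screw would make it return `False`.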

Queries for aircraft fleet maintenance

We now come back to the typical questions we might ask of the data, and discuss which kinds of queries they might require. We will also look at concrete code examples for these queries using the ArangoDB Query Language (AQL).

  • What are all the parts in a given component?

This involves starting at a particular vertex in the graph and finding all vertices “below” it, that is, all vertices that can be reached by following edges in the forward direction. This is a graph traversal, which is a typical graphy query.

Figure 3: Finding all parts in a component. Image courtesy of Max Neunhöffer.

Here is an example of this type of query, which finds all vertices that can be reached from “components/Engine765” by doing a graph traversal:

RETURN GRAPH_TRAVERSAL("FleetGraph",
                       "components/Engine765",
                       "outbound")
                       

In ArangoDB, one can define graphs by giving them a name and by specifying which document collections contain the vertices and which edge collections contain the edges. Documents, regardless of whether they are vertices or edges, are uniquely identified by their _id attribute, which is a string that consists of the collection name, a slash “/” character and then the primary key. The call to GRAPH_TRAVERSAL thus only needs the graph name “FleetGraph”, the starting vertex, and “outbound” for the direction of the edges to be followed. You can specify further options, but that is not relevant here. AQL directly supports this type of graphy query.
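For readers who want to see what an outbound traversal computes, here is a small breadth-first sketch in Python. It is not how ArangoDB implements GRAPH_TRAVERSAL. The sample edges and the `traverse_outbound` helper are hypothetical, and including the start vertex in the result is a choice made for this sketch.

```python
from collections import deque

# Hypothetical containment edges: container id -> ids of contained items.
children = {
    "aircraft/A1": ["components/Engine765", "components/Wing12"],
    "components/Engine765": ["components/Turbine3", "parts/Screw56744"],
    "components/Turbine3": ["parts/Blade991"],
}


def traverse_outbound(start):
    """Return every vertex reachable from start via outbound edges,
    in breadth-first order, with start itself listed first."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                visited.append(child)
                queue.append(child)
    return visited


engine_items = traverse_outbound("components/Engine765")
```

On this sample, the traversal from the engine visits the turbine, the screw and the blade, but never the wing, because the wing sits in a different subtree.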

  • Given a (broken) part, what is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure?

This involves starting at a leaf vertex and searching upward in the tree until a component is found for which there is a maintenance procedure, which can be read off the corresponding JSON document. This is again a typical graphy query since the number of steps to go is not known a priori. This particular case is relatively easy since there is always a unique edge going upward.

Figure 4: Finding the smallest maintainable component. Image courtesy of Max Neunhöffer.

For example, the following is an AQL query that finds the shortest path from “parts/Screw56744” to a vertex whose isMaintainable attribute has the boolean value true, following the edges in the “inbound” direction:

RETURN GRAPH_SHORTEST_PATH("FleetGraph",
                           "parts/Screw56744",
                           {isMaintainable: true},
                           {direction: "inbound",
                            stopAtFirstMatch: true})

Note that here, we specify the graph name, the _id of the start vertex and a pattern for the target vertex. We could have given a concrete _id instead, or could have given further options in addition to the direction of travel in the last argument. We see again that AQL directly supports this type of graphy query.
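The upward search can be sketched in plain Python as follows. Because every item in a tree has at most one container, the shortest path is found by walking inbound edges until the pattern matches. The sample data and the `smallest_maintainable` helper are hypothetical, and this illustrates the idea rather than ArangoDB's implementation.

```python
# Hypothetical data: each item's document and its unique container.
items = {
    "aircraft/A1": {"isMaintainable": True},
    "components/Engine765": {"isMaintainable": True},
    "components/Turbine3": {"isMaintainable": False},
    "parts/Blade991": {},
}
parent = {
    "components/Engine765": "aircraft/A1",
    "components/Turbine3": "components/Engine765",
    "parts/Blade991": "components/Turbine3",
}


def smallest_maintainable(start):
    """Walk inbound edges from start and return the path to the first
    item whose isMaintainable attribute is true, or None if there is none."""
    path = [start]
    current = start
    while not items[current].get("isMaintainable", False):
        if current not in parent:
            return None  # reached the root without finding a match
        current = parent[current]
        path.append(current)
    return path


path = smallest_maintainable("parts/Blade991")
```

The number of steps is not fixed in advance: the loop stops at the first match, which here is the engine, two levels above the blade.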

  • Which parts of this aircraft need maintenance next week?

This is a query that does not involve the graph structure at all: rather, the result tends to be nearly orthogonal to the graph structure. Nevertheless, the document data model with the right secondary index is a perfect fit for this query.

Figure 5: Query whose result is orthogonal to the graph structure. Image courtesy of Max Neunhöffer.

With a pure graph database, we would be in trouble rather quickly for such a query. That is because we cannot use the graph structure in any sensible way, so we have to rely on secondary indexes — here, for example, on the attribute storing the date of the next maintenance. Obviously, a graph database could implement secondary indexes on its vertex data, but then it would essentially become a multi-model database.

To get our answer, we turn to a document query, which does not consider the graph structure. Here is one that finds the components that are due for maintenance:

FOR c IN components
  FILTER c.nextMaintenance <= "2015-05-15"
  RETURN {id: c._id,
          nextMaintenance: c.nextMaintenance}

What looks like a loop is AQL’s way to describe an iteration over the components collection. The query optimiser recognises the presence of a secondary index for the nextMaintenance attribute such that the execution engine does not have to perform a full collection scan to satisfy the FILTER condition. Note AQL’s way to specify projections by simply forming a new JSON document in the RETURN statement from known data. We see that the very same language supports queries usually found in a document store.
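To see why a secondary index avoids the full collection scan, consider this Python sketch of a sorted index on nextMaintenance. The sample documents and the `due_on_or_before` helper are hypothetical, and a real engine would use its own index structures instead of a sorted list.

```python
import bisect

# Hypothetical documents. Dates are ISO 8601 strings, so comparing the
# strings gives the same order as comparing the dates.
components = {
    "components/Engine765": {"nextMaintenance": "2015-05-12"},
    "components/Wing12": {"nextMaintenance": "2015-06-30"},
    "components/Turbine3": {"nextMaintenance": "2015-05-15"},
    "components/Gear7": {"nextMaintenance": "2015-04-01"},
}

# A sorted secondary index of (attribute value, document id) pairs.
index = sorted((doc["nextMaintenance"], doc_id) for doc_id, doc in components.items())


def due_on_or_before(date):
    """Answer the range query with a binary search on the index
    instead of inspecting every document."""
    # "\uffff" sorts after every id used here, so entries whose date
    # equals the bound are included in the result.
    end = bisect.bisect_right(index, (date, "\uffff"))
    return [{"id": doc_id, "nextMaintenance": value} for value, doc_id in index[:end]]


due = due_on_or_before("2015-05-15")
```

The returned dictionaries mirror the projection in the RETURN statement above: only the id and the maintenance date are kept.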

Using multi-model querying

To illustrate the potential of the multi-model approach, I’ll finally present an AQL query that mixes the three data models. The following query starts by finding parts with maintenance due, runs the above shortest path computation for each of them, and then performs a join operation with the contacts collection to add concrete contact information to the result:

FOR p IN parts
  FILTER p.nextMaintenance <= "2015-05-15"
  LET path = GRAPH_SHORTEST_PATH("FleetGraph", p._id,
                        {isMaintainable: true},
                        {direction: "inbound",
                         stopAtFirstMatch: true})
  LET pathverts = path[0].vertices
  LET c = DOCUMENT(pathverts[LENGTH(pathverts)-1])
  FOR person IN contacts
    FILTER person._key == c.contact
    RETURN {part: p._id, component: c, contact: person}
    

In AQL, the DOCUMENT function call performs a key/value lookup via the provided _id attribute; this is done for each vertex found as target of the shortest path computation. Finally, we can see AQL’s formulation for a join. The second FOR statement brings the contacts collection into play, and the query optimiser recognises that the FILTER statement can be satisfied best by doing a join, which in turn is very efficient because it can use the primary index of the contacts collection for a fast hash lookup.
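The three steps of this query can be mirrored in a short Python sketch: a filter over documents, an upward walk along the containment edges, and a join by key lookup. The sample data and the `maintenance_report` helper are hypothetical. Unlike the AQL query, the sketch returns the component's id instead of the whole document, and it scans all parts instead of using an index.

```python
# Hypothetical sample data for the collections and the containment edges.
parts = {
    "parts/Blade991": {"nextMaintenance": "2015-05-10"},
    "parts/Screw56744": {"nextMaintenance": "2015-09-01"},
}
components = {
    "components/Turbine3": {"isMaintainable": False},
    "components/Engine765": {"isMaintainable": True, "contact": "jdoe"},
}
contacts = {"jdoe": {"name": "Example Person"}}
parent = {
    "parts/Blade991": "components/Turbine3",
    "parts/Screw56744": "components/Engine765",
    "components/Turbine3": "components/Engine765",
}


def maintenance_report(deadline):
    report = []
    for part_id, part in parts.items():
        # Document query: keep only parts that are due by the deadline.
        if part["nextMaintenance"] > deadline:
            continue
        # Graphy query: walk upward to the first maintainable component.
        current = parent.get(part_id)
        while current is not None and not components[current].get("isMaintainable"):
            current = parent.get(current)
        if current is None:
            continue
        # Join: key/value lookup of the contact referenced by the component.
        person = contacts.get(components[current].get("contact"))
        if person is not None:
            report.append({"part": part_id, "component": current, "contact": person})
    return report


report = maintenance_report("2015-05-15")
```

Only the blade is due in this sample. Its report entry names the engine as the smallest maintainable component and carries the engine's contact document.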

This is a prime example of the potential of the multi-model approach. The query needs all three data models: documents with secondary indexes, graphy queries, and a join powered by fast key/value lookup. Imagine the hoops through which we would have to jump if the three data models did not reside in the same database engine, or if it were not possible to mix them in the same query.

Even more importantly, this case study shows that the three different data models were indeed necessary to achieve good performance for all queries arising from the application. Without a graph database, the graphy queries, whose path lengths are not known a priori, notoriously lead to nasty, inefficient multiple join operations. However, a pure graph database cannot satisfy our needs for the document queries, which we made efficient by using the right secondary indexes. The efficient key/value lookups complement the picture by allowing interesting join operations that give us further flexibility in the data modeling. For example, in the above situation, we did not have to embed the whole contact information with every single path, simply because we could perform the join operation in the last query.

Lessons learned for data modeling

The case study of aircraft fleet maintenance reveals several important points about data modeling and multi-model databases.

  • JSON is very versatile for unstructured and structured data. The recursive nature of JSON allows embedding of subdocuments and variable length lists. Additionally, you can even store the rows of a table as JSON documents, and modern data stores are so good at compressing data that there is essentially no memory overhead in comparison to relational databases. For structured data, schema validation can be implemented as needed using an extensible HTTP API.
  • Graphs are a good data model for relations. In many real world cases, a graph is a very natural data model. It captures relations and can hold label information with each edge and with each vertex. JSON documents are a natural fit to store this type of vertex and edge data.
  • A graph database is particularly good for graphy queries. The crucial thing here is that the query language must implement routines like “shortest path” and “graph traversal”; the fundamental capability these require is rapid access to the list of all outgoing or incoming edges of a vertex.
  • A multi-model database can compete with specialised solutions. The particular choice of the three data models (documents, key/value and graph) allows us to combine them in a coherent engine. This combination is no compromise: as a document store, it can be as efficient as a specialised solution, and as a graph database, it can be as efficient as a specialised solution (see this blog post for some benchmarks).
  • A multi-model database allows you to choose different data models with less operational overhead. Having multiple data models available in a single database engine alleviates some of the challenges of using different data models at the same time, because it means less operational overhead and less data synchronisation, and therefore allows for a huge leap in data modeling flexibility. You suddenly have the option to keep related data together in the same data store, even if it needs different data models. Being able to mix the different data models within a single query increases the options for application design and performance optimizations. And if you choose to split the persistence layer into several different database instances (even if they use the same data model), you still have the benefit of only having to deploy a single technology. Furthermore, a data model lock-in is prevented.
  • Multi-model has a larger solution space than relational. Considering all these possibilities for queries, the additional flexibility in data modeling and the benefits of polyglot persistence without the usually ensuing friction, the multi-model approach covers a solution space that is even larger than that of the relational model. This is all the more astonishing since the relational model has dominated the database market as well as database research for decades.

Further use cases for multi-model databases

Here are a few more use cases for which multi-model is well suited or even outright necessary:

  • Workflow management software often models the dependencies between tasks with a graph; some queries need these dependencies, while others ignore them and only look at the remaining data.
  • Knowledge graphs are enormous data collections. Most queries from expert systems use only the edges and graphy queries, but often enough you need “orthogonal” queries that consider only the vertex data.
  • E-commerce systems need to store customer and product data (JSON), shopping carts (key/value), orders and sales (JSON or graph) and data for recommendations (graph), and need a multitude of queries featuring all of these data items.
  • Enterprise hierarchies come naturally as graph data, and rights management typically needs a mixture of graphy and document queries.
  • Social networks are the prime example of large, highly connected graphs, and typical queries are graphy. Nevertheless, actual applications additionally need queries that totally ignore the social relationships and thus need secondary indexes and possibly joins with key lookups.
  • Version management applications usually work with a directed acyclic graph, but also need graphy queries and others.
  • Any application that deals with complex, user-defined data structures benefits dramatically from the flexibility of a document store and often has good applications for graph data as well.

The future of multi-model databases

Currently there are only two products that are multi-model in the sense used above, making use of JSON, key/value, and graphs: ArangoDB and OrientDB. A few others that support multiple data models are marketed under the term “multi-model” (for a complete overview, see the ranking at DB-engines), but none of them both supports graphs and targets the operational domain.

Other players, like MongoDB or Datastax, who have traditionally concentrated on a single data model, show signs of broadening their scope. MongoDB, which is a pure document store, made their storage engine pluggable with the 3.0 release in March 2015. Datastax, a company that produces a commercial product based on the column-based store Apache Cassandra, has recently acquired Aurelius, the company behind the distributed graph database TitanDB. Apple just acquired FoundationDB, a distributed key/value store with multiple “personalities” for different data models layered on top.

The arrival of the new players, as well as the moves of the more established ones, constitutes a rather recent trend toward support for multiple data models. At the same time, more and more NoSQL solutions are appearing that rediscover the traditional virtues of relational databases, such as ACID transactions, joins, and relatively strong consistency guarantees.

These are golden times for data modelers and software architects. Stay tuned, watch the exciting new developments in the database market and enjoy the benefits of an unprecedented amount of choice.

from:https://www.oreilly.com/ideas...
