Optimize Cube.js Performance with Pre-Aggregations

Reposted from: https://cube.dev/blog/high-performance-data-analytics-with-cubejs-pre-aggregations/ (a useful walkthrough of how pre-aggregations are processed)

This is an advanced tutorial. If you are just getting started with Cube.js, I recommend checking this tutorial first and then coming back here.

One of the most powerful features of Cube.js is pre-aggregations. Coupled with the data schema, it eliminates the need to organize, denormalize, and transform data before using it with Cube.js. The pre-aggregation engine builds a layer of aggregated data in your database at runtime and keeps it up to date.

When a request comes in, Cube.js first looks for a relevant pre-aggregation. If it cannot find one, it builds a new one. Once the pre-aggregation is built, all subsequent requests go to the pre-aggregated layer instead of hitting the raw data. This can speed up response times by hundreds or even thousands of times.
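The request flow described above can be sketched as a small lookup-or-build cache. This is a toy model only: the function names, the cache key, and the returned rows are invented for illustration and do not reflect Cube.js internals.

```javascript
// Toy model of the lookup-or-build flow; not Cube.js internals.
const preAggregations = new Map(); // key -> materialized rows
let rawQueries = 0;

// Stand-in for an expensive query against the raw table.
function queryRawData(key) {
  rawQueries += 1;
  return [{ key, total: 42 }];
}

function load(key) {
  if (!preAggregations.has(key)) {
    // First request: build the pre-aggregation from raw data.
    preAggregations.set(key, queryRawData(key));
  }
  // Every request is answered from the pre-aggregated layer.
  return preAggregations.get(key);
}

load("orders.amount/month");
load("orders.amount/month");
console.log(rawQueries); // prints 1: raw data was queried only once
```

Only the first call pays the cost of the raw query; the second is served from the stored result.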

Pre-aggregations are materialized query results persisted as tables. In order to start using pre-aggregations, Cube.js should have write access to the stb_pre_aggregations schema where pre-aggregation tables will be stored.

Cube.js also takes care of keeping the pre-aggregation up-to-date. It performs refresh checks and if it finds that a pre-aggregation is outdated, it schedules a refresh in the background.

Creating a Simple Pre-Aggregation

Let’s take a look at the example of how we can use pre-aggregations to improve query performance.

For testing purposes, we will use a Postgres database and will generate around ten million records using the generate_series function.

$ createdb cubejs_test

The following SQL creates a table, orders, and inserts a sample of generated records into it.

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  amount integer,
  created_at timestamp without time zone
);
CREATE INDEX orders_created_at_amount ON orders(created_at, amount);

INSERT INTO orders (created_at, amount)
SELECT
  created_at,
  floor((1000 + 500*random())*log(row_number() over())) as amount
FROM generate_series
  ( '1997-01-01'::date
  , '2017-12-31'::date
  , '1 minutes'::interval) created_at;
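As a quick sanity check on the "around ten million" figure, the expected row count can be estimated from the date range: one row per minute, with both endpoints included. The sketch below assumes the series is evaluated without time zone or daylight-saving shifts.

```javascript
// Estimate how many rows generate_series produces: one per minute,
// both endpoints included. Assumes no time zone or DST shifts.
const start = Date.UTC(1997, 0, 1);  // 1997-01-01 00:00 UTC
const end = Date.UTC(2017, 11, 31);  // 2017-12-31 00:00 UTC
const MS_PER_MINUTE = 60 * 1000;

const expectedRows = (end - start) / MS_PER_MINUTE + 1;
console.log(expectedRows); // 11043361
```

That is 7,669 days of 1,440 minutes each, plus the starting row, so roughly eleven million records.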

Next, create a new Cube.js application if you don't already have one.

$ npm install -g cubejs-cli
$ cubejs create test-app -d postgres

Change the content of .env in the project folder to the following.

CUBEJS_API_SECRET=SECRET
CUBEJS_DB_TYPE=postgres
CUBEJS_DB_NAME=cubejs_test

Finally, generate a schema for the orders table and start the Cube.js server.

$ cubejs generate -t orders
$ npm run dev

Now we can send a query to Cube.js with the Orders.amount measure and the Orders.createdAt time dimension, with granularity set to month.

curl -H "Authorization: EXAMPLE-API-TOKEN" -G --data-urlencode 'query={
  "measures" : ["Orders.amount"],
  "timeDimensions":[{
    "dimension": "Orders.createdAt",
    "granularity": "month",
    "dateRange": ["1997-01-01", "2017-01-01"]
  }]
}' http://localhost:4000/cubejs-api/v1/load
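The same request can be assembled from JavaScript. The sketch below only builds the URL, which corresponds to what curl does with -G and --data-urlencode; the helper name buildLoadUrl is made up for this example, and the token is the same placeholder used in the curl command.

```javascript
// Build the same /load request as the curl command above.
// The base URL comes from the curl example; the token is a placeholder.
const BASE_URL = "http://localhost:4000/cubejs-api/v1/load";

const query = {
  measures: ["Orders.amount"],
  timeDimensions: [{
    dimension: "Orders.createdAt",
    granularity: "month",
    dateRange: ["1997-01-01", "2017-01-01"],
  }],
};

// Hypothetical helper: URL-encodes the JSON query into the query string.
function buildLoadUrl(baseUrl, q) {
  const url = new URL(baseUrl);
  url.searchParams.set("query", JSON.stringify(q));
  return url.toString();
}

const loadUrl = buildLoadUrl(BASE_URL, query);
console.log(loadUrl);

// Sending it needs a running server and a runtime with fetch:
// const res = await fetch(loadUrl, { headers: { Authorization: "EXAMPLE-API-TOKEN" } });
// console.log(await res.json());
```

Keeping URL construction separate from the network call makes the encoding step easy to inspect on its own.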

Cube.js will respond with Continue wait, because this query takes more than 5 seconds to process. Let’s look at Cube.js logs to see exactly how long it took for our Postgres to execute this query.

Performing query completed:
{
   "queueSize":2,
   "duration":6514,
   "queryKey":[
      "
        SELECT
          date_trunc('month', (orders.created_at::timestamptz at time zone 'UTC')) \"orders.created_at_month\",
          sum(orders.amount) \"orders.amount\"
        FROM
            public.orders AS orders
        WHERE (
          orders.created_at >= $1::timestamptz
          AND   orders.created_at <= $2::timestamptz
        )
        GROUP BY 1
        ORDER BY 1 ASC limit 10000
      ",
      [
         "2000-01-01T00:00:00Z",
         "2017-01-01T23:59:59Z"
      ],
      []
   ]
}

It took 6,514 milliseconds (6.5 seconds) for Postgres to execute the above query. Although we have an index on the created_at and amount columns, it doesn't help much in this particular case because we're querying almost all the dates we have. The index would help if we queried a smaller date range, but even then it would be a matter of seconds, not milliseconds.

We can significantly speed it up by adding a pre-aggregation layer. To do this, add the following preAggregations block to schema/Orders.js:

preAggregations: {
  amountByCreated: {
    type: `rollup`,
    measureReferences: [amount],
    timeDimensionReference: createdAt,
    granularity: `month`
  }
}

The block above instructs Cube.js to build and use a rollup type of pre-aggregation when the “Orders.amount” measure and “Orders.createdAt” time dimension (with “month” granularity) are requested together. You can read more about pre-aggregation options in the documentation reference.
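To make the idea of a rollup concrete, here is a toy in-memory version of what a monthly rollup stores. The sample rows and the helper name buildMonthlyRollup are invented for illustration; Cube.js builds the real rollup as a database table, not in JavaScript.

```javascript
// Toy illustration of a monthly rollup; not how Cube.js builds tables.
const orders = [
  { createdAt: "1997-01-05T10:00:00Z", amount: 100 },
  { createdAt: "1997-01-20T12:30:00Z", amount: 250 },
  { createdAt: "1997-02-03T08:15:00Z", amount: 400 },
];

// Build step: truncate each timestamp to its month and sum the amounts,
// similar in spirit to date_trunc('month', created_at) with GROUP BY.
function buildMonthlyRollup(rows) {
  const rollup = new Map();
  for (const { createdAt, amount } of rows) {
    const month = createdAt.slice(0, 7); // "YYYY-MM"
    rollup.set(month, (rollup.get(month) || 0) + amount);
  }
  return rollup;
}

const amountByCreated = buildMonthlyRollup(orders);

// Query step: one small entry per month instead of every order row.
for (const [month, total] of amountByCreated) {
  console.log(month, total); // 1997-01 350, then 1997-02 400
}
```

With minute-level data spanning 21 years, the rollup shrinks millions of rows to a few hundred monthly entries, which is why queries against it are so much cheaper.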

Now, when we send the same request again, Cube.js will detect the pre-aggregation declaration and start building it. Once it's built, Cube.js will query it and send the result back. All subsequent queries will go to the pre-aggregation layer.

Here is how querying pre-aggregation looks in the Cube.js logs:

Performing query completed:
{
   "queueSize":1,
   "duration":5,
   "queryKey":[
      "
        SELECT
          \"orders.created_at_month\" \"orders.created_at_month\",
          sum(\"orders.amount\") \"orders.amount\"
        FROM
          stb_pre_aggregations.orders_amount_by_created
        WHERE (
          \"orders.created_at_month\" >= ($1::timestamptz::timestamptz AT TIME ZONE 'UTC')
          AND
          \"orders.created_at_month\" <= ($2::timestamptz::timestamptz AT TIME ZONE 'UTC')
        )
        GROUP BY 1 ORDER BY 1 ASC LIMIT 10000
      ",
      [
         "1995-01-01T00:00:00Z",
         "2017-01-01T23:59:59Z"
      ],
      [
        [
          "
            CREATE TABLE
                stb_pre_aggregations.orders_amount_by_created
            AS SELECT
                date_trunc('month', (orders.created_at::timestamptz AT TIME ZONE 'UTC')) \"orders.created_at_month\",
                sum(orders.amount) \"orders.amount\"
            FROM
                public.orders AS orders
            GROUP BY 1
          ",
          []
        ]
      ]
   ]
}

As you can see, the same data now comes back in only 5 milliseconds, about 1,300 times faster. Note also that the SQL has changed: it now reads from stb_pre_aggregations.orders_amount_by_created, the table Cube.js generated to store the pre-aggregation for this query. The second query in the log is the DDL statement that creates this pre-aggregation table.

Pre-Aggregations Refresh

Cube.js also takes care of keeping pre-aggregations up to date. When a new request arrives, Cube.js initiates a refresh check if the last one ran more than two minutes ago.

You can set up a custom refresh check strategy by using refreshKey. By default, pre-aggregations are refreshed every hour.
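As a sketch of a custom strategy, the fragment below extends the earlier amountByCreated pre-aggregation with a refreshKey whose SQL returns the newest order timestamp. The choice of MAX(created_at) is an assumption that suits an append-only orders table; pick a query whose result changes whenever your source data changes.

```javascript
preAggregations: {
  amountByCreated: {
    type: `rollup`,
    measureReferences: [amount],
    timeDimensionReference: createdAt,
    granularity: `month`,
    // Assumption: a new maximum timestamp means new orders have arrived.
    refreshKey: {
      sql: `SELECT MAX(created_at) FROM orders`
    }
  }
}
```

This is a schema fragment, not a standalone script; it belongs inside the Orders cube definition.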

If the result of the refresh check is different from the last one, Cube.js will initiate the rebuild of the pre-aggregation in the background and then hot swap the old one.
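The compare-then-swap behavior can be modeled in a few lines. Everything here is illustrative: sourceVersion stands in for the result of the refresh check query, and the function names are not Cube.js APIs.

```javascript
// Toy model of the refresh check; names are illustrative, not Cube.js APIs.
let sourceVersion = 1; // stand-in for the refresh check query result
let builds = 0;

function buildTable() {
  builds += 1;
  return { builtFrom: sourceVersion };
}

const state = { lastKey: sourceVersion, table: buildTable() };

function refreshCheck() {
  const key = sourceVersion;               // evaluate the refresh check
  if (key === state.lastKey) return false; // unchanged: keep the old table
  const fresh = buildTable();              // rebuild while the old table still serves reads
  state.table = fresh;                     // swap in the new table
  state.lastKey = key;
  return true;
}

refreshCheck();      // nothing changed, no rebuild
sourceVersion = 2;   // new data arrives
refreshCheck();      // result differs, so rebuild and swap
console.log(builds); // prints 2: the initial build plus one refresh
```

Readers never see a missing table in this model, because the old one stays in place until the replacement is ready.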

Next Steps

This guide is the first step to learning about pre-aggregations and how to start using them in your project. But there is much more you can do with them. You can find the pre-aggregations documentation reference here.

Also, here are some highlights with useful links to help you along the way.

Pre-aggregate queries across multiple cubes

Pre-aggregations work not only for measures and dimensions inside a single cube but also across multiple joined cubes. If you have joined cubes, you can reference measures and dimensions from any part of the join tree. The example below shows how the Users.country dimension can be used with the Orders.count and Orders.revenue measures.

cube(`Orders`, {
  sql: `select * from orders`,

  joins: {
    Users: {
      relationship: `belongsTo`,
      sql: `${CUBE}.user_id = ${Users}.id`
    }
  },

  // …

  preAggregations: {
    categoryAndDate: {
      type: `rollup`,
      measureReferences: [count, revenue],
      dimensionReferences: [Users.country],
      timeDimensionReference: createdAt,
      granularity: `day`
    }
  }
});
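For the Users.country reference above to resolve, a Users cube with a country dimension has to exist. A minimal sketch might look like the following; the users table and its country column are assumptions, since the original does not show this cube.

```javascript
// Hypothetical Users cube; the table and column names are assumptions.
cube(`Users`, {
  sql: `select * from users`,

  dimensions: {
    country: {
      sql: `country`,
      type: `string`
    }
  }
});
```

The Orders cube would likewise need count and revenue measures and a user_id column for the join condition to work.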

Generate pre-aggregations dynamically

Since pre-aggregations are part of the data schema, which is essentially JavaScript code, you can create all the required pre-aggregations dynamically. This guide covers how you can dynamically generate a Cube.js schema.

Time partitioning

You can instruct Cube.js to partition pre-aggregations by time using the partitionGranularity option. Instead of a single table for the whole pre-aggregation, Cube.js will generate a set of smaller tables. This can reduce refresh time and, in the case of BigQuery for example, cost.

Time partitioning documentation reference.

preAggregations: {
  categoryAndDate: {
    type: `rollup`,
    measureReferences: [count],
    timeDimensionReference: createdAt,
    granularity: `day`,
    partitionGranularity: `month`
  }
}
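To get a feel for what monthly partitioning means for the sample data, the sketch below lists the monthly partitions covering a date range. The helper monthPartitions and the "YYYY-MM" labels are made up for illustration; Cube.js names and manages its own partition tables.

```javascript
// List the monthly partitions that cover a date range (inclusive).
// Illustrative only; Cube.js names and manages its own partition tables.
function monthPartitions(startIso, endIso) {
  const [startYear, startMonth] = startIso.split("-").map(Number);
  const [endYear, endMonth] = endIso.split("-").map(Number);
  const partitions = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    partitions.push(`${year}-${String(month).padStart(2, "0")}`);
    month += 1;
    if (month > 12) {
      month = 1;
      year += 1;
    }
  }
  return partitions;
}

const all = monthPartitions("1997-01-01", "2017-12-31");
console.log(all.length); // 252 monthly partitions for 21 years
```

If new orders only ever land in the latest month, a refresh would in principle need to touch one of these 252 partitions rather than rebuild the whole pre-aggregation.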

Data Cube Lattices

Cube.js can automatically build rollup pre-aggregations without the need to specify which measures and dimensions to use. It learns from query history and selects an optimal set of measures and dimensions for a given query. Under the hood it uses the Data Cube Lattices approach.

This is very useful if you need a lot of pre-aggregations and don't know ahead of time exactly which ones. Using autoRollup saves you from manually coding all the possible aggregations.

You can find documentation for auto rollup here.

cube(`Orders`, {
  sql: `select * from orders`,

  preAggregations: {
    main: {
      type: `autoRollup`
    }
  }
});

Original post: https://www.cnblogs.com/rongfengliang/p/10807552.html

Time: 2024-11-02 18:59:24
