Calcite - StreamingSQL

https://calcite.apache.org/docs/stream.html

 

Calcite’s SQL is an extension to standard SQL, not another ‘SQL-like’ language. The distinction is important, for several reasons:

  • Streaming SQL is easy to learn for anyone who knows regular SQL.
  • The semantics are clear, because we aim to produce the same results on a stream as if the same data were in a table.
  • You can write queries that combine streams and tables (or the history of a stream, which is basically an in-memory table).
  • Lots of existing tools can generate standard SQL.

If you don’t use the STREAM keyword, you are back in regular standard SQL.

只是对于标准sql的扩展,StreamingSQL只是多个Stream关键词

 

An example schema

Our streaming SQL examples use the following schema:

  • Orders (rowtime, productId, orderId, units) - a stream and a table
  • Products (rowtime, productId, name) - a table
  • Shipments (rowtime, orderId) - a stream

以简单的订单,商品,发货为例子

可以看到这里可以同时处理,流式表和静态表;对于order即是流式表也是静态表,意思是实时数据在流式表中,而历史数据在静态表中

简单的加上STREAM就可以对流式表Orders进行查询,结果是unbounded的;如果不带STREAM就是对静态表进行查询,结果是bounded

 

Windows支持

  • tumbling window (GROUP BY)
  • hopping window (multi GROUP BY)
  • sliding window (window functions)
  • cascading window (window functions)

 

Tumbling windows

在sql中,所谓window就是对于时间的group

比如下面的例子,以小时为时间窗口

How did Calcite know that the 10:00:00 sub-totals were complete at 11:00:00, so that it could emit them? It knows that rowtime is increasing, and it knows that CEIL(rowtime TO HOUR) is also increasing. So, once it has seen a row at or after 11:00:00, it will never see a row that will contribute to a 10:00:00 total.

A column or expression that is increasing or decreasing is said to bemonotonic.

有个问题是如何知道11点之前的数据都已经到,这个取决于rowtime是单调递增的

所以对于group by时,必须要有一个column是单调递增的,monotonic

If column or expression has values that are slightly out of order, and the stream has a mechanism (such as punctuation or watermarks) to declare that a particular value will never be seen again, then the column or expression is said to be quasi-monotonic.

当然rowtime可能不是严格单调的,所以我们可以用watermark来限定一个时间段,在这个时间范围上是单调的;这样称为quasi-monotonic,拟单调

更优雅些,我们可以使用TUMBLE关键字

上面的例子是30分钟的时间窗口,但是非整点,而是有12分钟的偏移,alignment time

 

Hopping windows

Hopping windows are a generalization of tumbling windows that allow data to be kept in a window for a longer than the emit interval.

其实就是滑动窗口

以1小时为滑动,3小时为窗口大小

 

HAVING

聚合后的过滤

 

Sliding windows

非groupby方式的sliding windows,

Standard SQL features so-called “analytic functions” that can be used in the SELECT clause.
Unlike GROUP BY, these do not collapse records. For each record that goes in, one record comes out. But the aggregate function is based on a window of many rows.

SELECT STREAM rowtime,
  productId,
  units,
  SUM(units) OVER (ORDER BY rowtime RANGE INTERVAL ‘1‘ HOUR PRECEDING) unitsLastHour
FROM Orders;

这个和groupby的区别在于,窗口触发的时机

对于groupby,时间整点触发,会将窗口里records计算成一个值

而OVER,是record by record,每来一条record都会触发一次计算,上面的例子是,对每条record都会触发一次前一个小时的sum

这里更加复杂,

先声明Window product,表示order by rowtime,partition by productId

再基于product,OVER生成7天和10分钟的AVG(units)

 

 

子查询

The previous HAVING query can be expressed using a WHERE clause on a sub-query:

having也可以实现成子查询的形式

 

Since then, SQL has become a mathematically closed language, which means that any operation you can perform on a table can also perform on a query.

The closure property of SQL is extremely powerful. Not only does it render HAVING obsolete (or, at least, reduce it to syntactic sugar), it makes views possible:

sql具有闭合特性,即任何可以在table上执行的操作,也同样可以在query上执行,因为query的结果也是一个关系表

所以上面通过create view创建子查询

Many people find that nested queries and views are even more useful on streams than they are on relations.

Streaming queries are pipelines of operators all running continuously, and often those pipelines get quite long. Nested queries and views help to express and manage those pipelines.

嵌套查询对于Streaming非常有用,因为流其实就是一组operators的pipelines;以嵌套查询或view的方式去表示会很方便

 

And, by the way, a WITH clause can accomplish the same as a sub-query or a view:

With关键词,用于实现子查询或view

 

Sorting

 

Joining streams to tables

A stream-to-table join is straightforward if the contents of the table are not changing.

这个很直接,但有个问题是,静态表是会变化的,当数据record流过来时,我们需要和record发生时静态表做join,但如果静态表已经变化了,我们只能取到最新值

要解决这个问题,我们需要为静态表,创建版本表,保存每个时间的版本

One way to implement this is to have a table that keeps every version with a start and end effective date, ProductVersions in the following example:

当前会从productVersion里面,根据record rowtime找出包含这个时间的版本

 

Joining streams to streams

 

DML

It’s not only queries that make sense against streams; it also makes sense to run DML statements (INSERT, UPDATE, DELETE, and also their rarer cousins UPSERT and REPLACE) against streams.

 

DML is useful because it allows you do materialize streams or tables based on streams, and therefore save effort when values are used often.

时间: 2024-10-13 00:34:04

Calcite - StreamingSQL的相关文章

Calcite中的流式SQL

Calcite中的流式SQL Calcite中的流式SQL总体设计思路 总体语法应该兼容SQL,这个是和目前流处理SQL的发展趋势是一致的. 如果部分功能标准SQL中没有包含,则尽量采用业界标杆(Oracle).比如模式匹配的功能,目前流处理中还没有针对语法达成共识,那么在设计上,就采用Oracle data warehouse的Match Recognize的方式.还有滑窗功能. 如果还有功能目前业界标杆都没有,那么就通过函数的方式拓展,翻滚窗口和跳动窗口,这两个窗口在标准SQL中都是不包含的

calcite 理论

https://www.slideshare.net/JordanHalterman/introduction-to-apache-calcite https://blog.csdn.net/yunlong34574/article/details/46375733 https://www.jianshu.com/p/a6134865adf6 https://blog.csdn.net/wangxingxing2006/article/details/78907278 https://datap

Calcite分析 - Rule

Calcite源码分析,参考: http://matt33.com/2019/03/07/apache-calcite-process-flow/ https://matt33.com/2019/03/17/apache-calcite-planner/ Rule作为Calcite查询优化的核心, 具体看几个有代表性的Rule,看看是如何实现的 最简单的例子,Join结合律,JoinAssociateRule 首先所有的Rule都继承RelOptRule类 /** * Planner rule

Apache Calcite 学习 (一)

关于 Apache Calcite 的简单介绍可以参考 Apache Calcite:Hadoop 中新型大数据查询引擎 这篇文章,Calcite 一开始设计的目标就是 one size fits all,它希望能为不同计算存储引擎提供统一的 SQL 查询引擎,当然 Calcite 并不仅仅是一个简单的 SQL 查询引擎,在论文 Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterog

org.apache.calcite.sql.parser.impl.ParseException: Encountered "create"

在用calcite解析oracle的建表语句时报这样的错: Caused by: org.apache.calcite.sql.parser.impl.ParseException: Encountered "CREATE" at line 3, column 2. Was expecting one of: "SET" ... "RESET" ... "ALTER" ... "WITH" ... &quo

OLAP引擎——Kylin介绍

Kylin是ebay开发的一套OLAP系统,与Mondrian不同的是,它是一个MOLAP系统,主要用于支持大数据生态圈的数据分析业务,它主要是通过预计算的方式将用户设定的多维立方体缓存到HBase中(目前还仅支持hbase),这段时间对mondrian和kylin都进行了使用,发现这两个系统是时间和空间的一个权衡吧,mondrian是一个ROLAP系统,所有的查询可以通过实时的数据库查询完成,而不会有任何的预计算,大大节约了存储空间的要求(但是会有查询结果的缓存,目前是缓存在程序内存中,很容易

SQL数据分析概览——Hive、Impala、Spark SQL、Drill、HAWQ 以及Presto+druid

转自infoQ! 根据 O'Reilly 2016年数据科学薪资调查显示,SQL 是数据科学领域使用最广泛的语言.大部分项目都需要一些SQL 操作,甚至有一些只需要SQL. 本文涵盖了6个开源领导者:Hive.Impala.Spark SQL.Drill.HAWQ 以及Presto,还加上Calcite.Kylin.Phoenix.Tajo 和Trafodion.以及2个商业化选择Oracle Big Data SQL 和IBM Big SQL,IBM 尚未将后者更名为"Watson SQL&q

Kafka - SQL 引擎

Kafka - SQL 引擎分享 1.概述 大多数情况下,我们使用 Kafka 只是作为消息处理.在有些情况下,我们需要多次读取 Kafka 集群中的数据.当然,我们可以通过调用 Kafka 的 API 来完成,但是针对不同的业务需求,我们需要去编写不同的接口,在经过编译,打包,发布等一系列流程.最后才能看到我们预想的结果.那么,我们能不能有一种简便的方式去实现这一部分功能,通过编写 SQL 的方式,来可视化我们的结果.今天,笔者给大家分享一些心得,通过使用 SQL 的形式来完成这些需求. 2.

大数据:从入门到XX(二)

想了解APACHE 项目是怎么分类,又或者想了解APACHE项目是用什么语言开发的,直接访问APACHE官网中的By Category和By Programming Language就可以了,但是如果想同时看到每个 项目的分类信息和开发语言,看看下面这张表就可以了.有几个小调整需要说一下: 1.原始数据中JavaScript.Javascript:NODE.JS.NODE.js都当作两种语言了(大小写不一致),在这张表里做了合并. 2.原始数据中有67个项目没有LANGUAGE相关的描述,在这张