一篇来自hasura graphql-engine 百万级别live query 的实践

转自:https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md

Scaling to 1 million active GraphQL subscriptions (live queries)

Hasura is a GraphQL engine on Postgres that provides instant GraphQL APIs with authorization. Read more at hasura.ioand on github.com/hasura/graphql-engine.

Hasura allows ‘live queries‘ for clients (over GraphQL subscriptions). For example, a food ordering app can use a live query to show the live-status of an order for a particular user.

This document describes Hasura‘s architecture which lets you scale to handle a million active live queries.

TL;DR:

The setup: Each client (a web/mobile app) subscribes to data or a live-result with an auth token. The data is in Postgres. 1 million rows are updated in Postgres every second (ensuring a new result pushed per client). Hasura is the GraphQL API provider (with authorization).

The test: How many concurrent live subscriptions (clients) can Hasura handle? Does Hasura scale vertically and/or horizontally?

single instance configuration No. of active live queries CPU load average
1xCPU, 2GB RAM 5000 60%
2xCPU, 4GB RAM 10000 73%
4xCPU, 8GB RAM 20000 90%

At 1 million live queries, Postgres is under about 28% load with peak number of connections being around 850.

Notes on configuration:

  • AWS RDS postgres, Fargate, ELB were used with their default configurations and without any tuning.
  • RDS Postgres: 16xCPU, 64GB RAM, Postgres 11
  • Hasura running on Fargate (4xCPU, 8GB RAM per instance) with default configurations

GraphQL and subscriptions

GraphQL makes it easy for app developers to query for precisely the data they want from their API.

For example, let’s say we’re building a food delivery app. Here’s what the schema might look like on Postgres:

For an app screen showing a “order status” for the current user for their order, a GraphQL query would fetch the latest order status and the location of the agent delivering the order.

Underneath it, this query is sent as a string to a webserver that parses it, applies authorization rules and makes appropriate calls to things like databases to fetch the data for the app. It sends this data, in the exact shape that was requested, as a JSON.

Enter live-queries: Live queries is the idea of being able to subscribe to the latest result of a particular query. As the underlying data changes, the server should push the latest result to the client.

On the surface this is a perfect fit with GraphQL because GraphQL clients support subscriptions that take care of dealing with messy websocket connections. Converting a query to a live query might look as simple as replacing the word query with subscription for the client. That is, if the GraphQL server can implement it.

Implementing GraphQL live-queries

Implementing live-queries is painful. Once you have a database query that has all the authorization rules, it might be possible to incrementally compute the result of the query as events happen. But this is practically challenging to do at the web-service layer. For databases like Postgres, it is equivalent to the hard problem of keeping a materialized view up to date as underlying tables change. An alternative approach is to refetch all the data for a particular query (with the appropriate authorization rules for the specific client). This is the approach we currently take.

Secondly, building a webserver to handle websockets in a scalable way is also sometimes a little hairy, but certain frameworks and languages do make the concurrent programming required a little more tractable.

Refetching results for a GraphQL query

To understand why refetching a GraphQL query is hard, let’s look at how a GraphQL query is typically processed:

The authorization + data fetching logic has to run for each “node” in the GraphQL query. This is scary, because even a slightly large query fetching a collection could bring down the database quite easily. The N+1 query problem, also common with badly implemented ORMs, is bad for your database and makes it hard to optimally query Postgres. Data loader type patterns can alleviate the problem, but will still query the underlying Postgres database multiple times (reduces to as many nodes in the GraphQL query from as many items in the response).

For live queries, this problem becomes worse, because each client’s query will translate into an independent refetch. Even though the queries are the “same”, since the authorization rules create different session variables, independent fetches are required for each client.

Hasura approach

Is it possible to do better? What if declarative mapping from the data models to the GraphQL schema could be used to create a single SQL query to the database? This would avoid multiple hits to the database, whether there are a large number of items in the response or the number of nodes in the GraphQL query are large.

Idea #1: “Compile” a GraphQL query to a single SQL query

Part of Hasura is a transpiler that uses the metadata of mapping information for the data models to the GraphQL schema to “compile” GraphQL queries to the SQL queries to fetch data from the database.

GraphQL query → GraphQL AST → SQL AST → SQL

This gets rid of the N+1 query problem and allows the database to optimise data-fetching now that it can see the entire query.

But this in itself isn‘t enough as resolvers also enforce authorization rules by only fetching the data that is allowed. We will therefore need to embed these authorization rules into the generated SQL.

Idea #2: Make authorization declarative

Authorization when it comes to accessing data is essentially a constraint that depends on the values of data (or rows) being fetched combined with application-user specific “session variables” that are provided dynamically. For example, in the most trivial case, a row might container a user_id that denotes the data ownership. Or documents that are viewable by a user might be represented in a related table, document_viewers. In other scenarios the session variable itself might contain the data ownership information pertinent to a row, for eg, an account manager has access to any account [1,2,3…] where that information is not present in the current database but present in the session variable (probably provided by some other data system).

To model this, we implemented an authorization layer similar to Postgres RLS at the API layer to provide a declarative framework to configure access control. If you’re familiar with RLS, the analogy is that the “current session” variables in a SQL query are now HTTP “session-variables” coming from cookies or JWTs or HTTP headers.

Incidentally, because we had started the engineering work behind Hasura many years ago, we ended up implementing Postgres RLS features at the application layer before it landed in Postgres. We even had the same bug in our equivalent of the insert returning clause that Postgres RLS fixed ?? https://www.postgresql.org/about/news/1614/

Because of implementing authorization at the application layer in Hasura (instead of using RLS and passing user-details via current session settings in the postgres connection) had significant benefits, which we’ll soon see.

To summarise, wow that authorization is declarative and available at a table, view or even a function (if the function returns SETOF) it is possible to create a single SQL query that has the authorization rules.

GraphQL query → GraphQL AST → Internal AST with authorization rules → SQL AST → SQL

Idea #3: Batch multiple live-queries into one SQL query

With only ideas #1, #2 implemented we would still result in the situation where a 100k connected clients could result in a proportional load of 100k postgres queries to fetch the latest data (let’s say if 100k updates happen, 1 update relevant to each client).

However, considering that we have all the application-user level session variables available at the API layer, we can actually create a single SQL query to re-fetch data for a number of clients all at once!

Let’s say that we have clients running a subscription to get the latest order status and the delivery agent location. We can create a “relation” in-query that contains all the query-variables (the order IDs) and the session variables (the user IDs) as different rows. We can then “join” the query to fetch the actual data with this relation to ensure that we get the latest data for multiple clients in a single response. Each row in this response contains the final result for each user. This will allow fetching the latest result for multiple users at the same time, even though the parameters and the session variables that they provide are completely dynamic and available only at query-time.

When do we refetch?

We experimented with several methods of capturing events from the underlying Postgres database to decide when to refetch queries.

  1. Listen/Notify: Requires instrumenting all tables with triggers, events consumed by consumer (the web-server) might be dropped in case of the consumer restarting or a network disruption.
  2. WAL: Reliable stream, but LR slots are expensive which makes horizontal scaling hard, and are often not available on managed database vendors. Heavy write loads can pollute the WAL and will need throttling at the application layer.

After these experiments, we’ve currently fallen back to interval based polling to refetch queries. So instead of refetching when there is an appropriate event, we refetch the query based on a time interval. There were two major reasons for doing this:

  1. Mapping database events to a live query for a particular client is possible to some extent when the declarative permissions and the conditions used in the live queries are trivial (like order_id = 1 and user_id = cookie.session_id) but becomes intractable for anything complicated (say the query uses ‘status‘ ILIKE ‘failed_%‘). The declarative permissions can also sometimes span across tables. We made significant investment in investigating this approach coupled with basic incremental updating and have a few small projects in production that takes an approach similar to this talk.
  2. For any application unless the write throughput is very small, you’ll end up throttling/debouncing events over an interval anyway.

The tradeoff with this approach is latency when write-loads are small. Refetching can be done immediately, instead after X ms. This can be alleviated quite easily, by tuning the refetch interval and the batch size appropriately. So far we have focussed on removing the most expensive bottleneck first, the query refetching. That said, we will continue to look at improvements in the months to come, especially to use event dependency (in cases where it is applicable) to potentially reduce the number of live queries that are refetched every interval.

Please do note that we have drivers internally for other event based methods and would love to work with you if you have a use-case where the current approach does not suffice. Hit us up on our discord, and we’ll help you run benchmarks and figure out the best way to proceed!

Testing

Testing scalability & reliability for live-queries with websockets has been a challenge. It took us a few weeks to build the testing suite and the infra automation tooling. This is what the setup looks like:

  1. A nodejs script that runs a large number of GraphQL live-query clients and logs events in memory which are later ingested into a database. [github]
  2. A script that creates a write load on the database that causes changes across all clients running their live queries (1 million rows are updated every second)
  3. Once the test suites finish running, a verification script runs on the database where the logs/events were ingested to verify that there were no errors and all events were received
  4. Tests are considered valid only if:
    • There are 0 errors in the payload received
    • Avg latency from time of creation of event to receipt on the client is less than 1000ms

Benefits of this approach

Hasura makes live-queries easy and accessible. The notion of queries is easily extended to live-queries without any extra effort on the part of the developer using GraphQL queries. This is the most important thing for us.

  1. Expressive/featureful live queries with full support for Postgres operators/aggregations/views/functions etc
  2. Predictable performance
  3. Vertical & Horizontal scaling
  4. Works on all cloud/database vendors

Future work:

Reduce load on Postgres by:

    1. Mapping events to active live queries
    2. Incremental computation of result set

原文地址:https://www.cnblogs.com/rongfengliang/p/10958858.html

时间: 2024-08-03 03:46:59

一篇来自hasura graphql-engine 百万级别live query 的实践的相关文章

JAVA使用POI如何导出百万级别数据

经常使用Excel的人应该都能知道excel2007及以上版本可以轻松实现存储百万级别的数据,但是系统中的大量数据如何能够快速准确的导入到excel中这好像是个难题,对于一般的web系统,我们为了解决成本,基本都是使用的入门级web服务器tomcat,jdk在32为系统中支持的内存不能超过2个G,但是在64为中没有限制,但是在64位的系统中,性能并不是太好,所以为了解决上诉问题,我们要针对我们的代码来解决我们要解决的问题,在POI3.8之后新增加了一个类,SXSSFWorkbook,它可以控制e

hasura graphql server 集成gatsby

hasura graphql server 社区基于gatsby-source-graphql 开发了gatsby-postgres-graphql 插件, 可以快速的开发丰富的网站 基本使用 安装hasura graphql server 我使用的Heroku 已经部署好了 https://rongfengliang.herokuapp.com/ 说明:后边可能会删了,测试的话,最好的自己搭建 添加表结构以及数据(hasura server) gastby 集成测试 package.json

中台架构50篇资料精选,阿里/腾讯/京东...中台建设实践汇集

中台架构50篇资料精选,阿里/腾讯/京东...中台建设实践汇集 内容包括7大类:阿里专家谈中台.行业专家解读中台.大厂中台架构实践.数据中台.技术中台.组织中台.中台建设方法论. 01 阿里架构专家,谈中台架构 1.阿里技术专家玄难:小前台大中台是什么? 2.阿里云架构总监谢纯良:企业盲目跟风做中台会不会死 3.阿里技术专家玄难:从平台到中台的演进 4.阿里架构师古谦:未来最核心的竞争壁垒在数据技术 5.阿里搜索中台:开发运维一体化实践 6.阿里技术中台:研发管理难题如何应对? 7.阿里架构总监

百万级别长连接,并发测试指南

前言 都说haproxy很牛x, 可是测试的结果实在是不算满意, 越测试越失望,无论是长连接还是并发, 但是测试的流程以及工具倒是可以分享分享.也望指出不足之处. 100w的长连接实在算不上太难的事情,不过对于网上关于测试方法以及测试工具的相关文章实在不甚满意,才有本文. 本文有两个难点,我算不上完全解决. 后端代码的性能. linux内核参数的优化. 环境说明 下面所有的测试机器都是基于openstack云平台,kvm虚拟化技术创建的云主机. 由于一个socket连接一般占用8kb内存,所以百

写个百万级别full-stack小型协程库——原理介绍

其实说什么百万千万级别都是虚的,下面给出实现原理和测试结果,原理很简单,我就不上图了: 原理:为了简单明了,只支持单线程,每个协程共享一个4K的空间(你可以用堆,用匿名内存映射或者直接开个数组也都是可以的,总之得保证4K页对齐的空间),每个协程自己有私有栈空间指针privatestackptr,每个时刻只有一个协程在运行,此时栈空间在这个4K共享空间中(当然除了main以外),当切换协程时,动态分配一个堆内存,大小为此时协程栈实际大小(一般都很小,小的只有几十个Bytes, 大的有几百个Byte

hasura graphql 角色访问控制

目前从官方文档以及测试可以看出不加任何header的请求访问的是所有的数据,对于具有访问 控制的请求需要添加请求头,实际生产的使用需要集合web hook 的实现访问控制. 参考配置 访问请求 目前数据只有id=1 不匹配的 匹配的 没有添加角色的(获取所有数据) 几张官方的参考图 配置 开发环境测试 生产使用 参考资料 https://docs.hasura.io/1.0/graphql/manual/auth/index.html 原文地址:https://www.cnblogs.com/r

hasura graphql server 集成gitlab

默认官方是提供了gitlab 集成的demo的,但是因为gitlab 一些版本的问题, 跑起来总有问题,所以查找相关资料测试了一个可以运行的版本 项目使用docker-compose 运行 参考 https://github.com/Trantect/docker-compose.yamls 环境准备 docker-compose 文件 version: '2' services: redis: image: sameersbn/redis:4.0.9-1 command: - --loglev

Arcgis apis for flex项目实例—开发篇(3):地图级别控制器

地图级别控制器俗称“鱼骨头”,这几乎是Web地图的标配了.ESRI的flex api自带了级别控制器叫做zoomSlider,我觉得是非常的难看,难看的和“鱼骨头”都不太沾边,给任何客户提供这样的一个组件都是非常不严肃的事情,我的选择是要么索性不提供这个工具,要么,就自己做一个.这是我做的组件的效果: 这个组件可以整合很多的地图工具,本节只讨论中间的那个滑块部分,就是从“+”到“-”之间的部分,其他的工具留给下一节. 先来看布局,一个Group把所有的部件框起来,放下两个按钮.两个Rect.两个

QLoo graphql engine 学习一 基本试用(docker&&docker-compose)

说明:使用docker-compose 进行安装 代码框架 使用命令行工具创建 qlooctl install docker qloo-docker 运行qloo&&gloo 启动 cd ./qloo-docker docker-compose up 效果 配置glooctl &&qlooctl工具 下载 https://github.com/solo-io/qloo/releases https://github.com/solo-io/gloo/releases 配置环