Kappa Architecture: A Different Way to Process Data

https://www.blue-granite.com/blog/a-different-way-to-process-data-kappa-architecture

Kappa architecture proposes an immutable data stream as the primary source of record. Unlike lambda, kappa avoids the need to replicate code across multiple services. In my last post, I introduced the lambda architecture tooling options available in Microsoft Azure, sample reference architectures, and some limitations. In this post, I’ll discuss an alternative Big Data workload pattern: kappa architecture.

Below, I’ll give an overview of what kappa is, discuss some of the benefits and tradeoffs of implementing kappa versus lambda in Azure, and review a sample reference architecture. Finally, I’ll offer some added considerations when implementing enterprise-scale Big Data architectures.

Kappa Architecture: the Immutable, Persisted Log

Kappa architecture, attributed to Jay Kreps, CEO of Confluent, Inc. and co-creator of Apache Kafka, proposes an immutable data stream as the primary source of record, rather than point-in-time representations of databases or files. In other words, if a data stream containing all organizational data can be persisted indefinitely (or for as long as use cases might require), then revised code can be replayed against past events as needed. This allows unit testing and revision of streaming calculations in a way that lambda does not support. Kappa also eliminates the need for a batch-based ingress process, as all data are written as events to the persisted stream. Kappa is a novel approach to distributed-systems design, and I personally enjoy the philosophy behind it.
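
To make the replay idea concrete, here is a minimal sketch (assuming the confluent-kafka Python client, a placeholder broker address, and a hypothetical topic name) in which a new consumer group reads the persisted topic from the beginning so that revised logic can be applied to every historical event.

```python
# Minimal replay sketch (hypothetical broker/topic names) using confluent-kafka.
# A new consumer group with auto.offset.reset=earliest re-reads the full
# persisted stream, so revised processing logic runs against past events.
from confluent_kafka import Consumer


def process_event(value: bytes) -> None:
    """Placeholder for the revised streaming calculation."""
    print(value)


consumer = Consumer({
    "bootstrap.servers": "broker:9092",        # placeholder broker address
    "group.id": "recalculation-v2",            # new group -> no committed offsets
    "auto.offset.reset": "earliest",           # start from the oldest retained event
    "enable.auto.commit": False,
})
consumer.subscribe(["organizational-events"])  # hypothetical topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        process_event(msg.value())             # re-run the revised calculation
finally:
    consumer.close()
```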

Apache Kafka

Kafka is a streaming platform well suited to kappa, supporting time-to-live (TTL) retention of indefinite length. With log compaction enabled on the cluster, the Kafka event stream can grow as large as the storage you can add. There are petabyte-sized (imagine the U.S. Library of Congress) Kafka clusters in production today. This sets Kafka apart from other streaming and messaging platforms because it can replace databases as the system of record. Several fascinating write-ups on Kafka’s capabilities are available online.
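
As an illustration of that “retain forever” configuration, the sketch below creates a topic with log compaction enabled and time-based deletion disabled, using the confluent-kafka AdminClient; the broker address, topic name, and partition/replication counts are placeholders.

```python
# Sketch: create a Kafka topic configured for indefinite retention with log
# compaction. Broker address, topic name, and sizing are illustrative only.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker:9092"})  # placeholder broker

topic = NewTopic(
    "organizational-events",          # hypothetical topic name
    num_partitions=12,
    replication_factor=3,
    config={
        "cleanup.policy": "compact",  # keep the latest record per key
        "retention.ms": "-1",         # disable time-based deletion
    },
)

# create_topics returns a dict of topic -> future; result() raises on failure.
for name, future in admin.create_topics([topic]).items():
    future.result()
    print(f"created topic {name}")
```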

Lambda vs. Kappa

Let’s go with kappa architecture. What are we waiting for, right? Well, there’s no free lunch. Kappa offers newer capabilities than lambda, but you pay a price for implementing leading-edge technologies: as of today, you’re going to have to roll some of your own infrastructure to make this work.

No Managed-Service Options

You can’t support kappa architecture using fully managed, native cloud services. Cloud providers, including Azure, didn’t design their streaming services with kappa in mind. Running streams with a TTL greater than 24 hours is more expensive, and the maximum TTL generally tops out around seven days. If you want to run kappa, you’re going to have to run Platform as a Service (PaaS) or Infrastructure as a Service (IaaS) offerings, which adds more administration to your architecture. So, what might this look like in Azure?

Reference Architecture for Kappa with HDInsight

In this reference architecture, we choose to stream all organizational data into Kafka. Applications can read from and write to Kafka directly as they are developed, and for existing event sources, listeners are used to stream writes directly from database logs (or datastore equivalents), eliminating the need for batch processing during ingress. In practice, a one-time historical load of existing batch data is required to initially populate the data lake.
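
As a rough sketch of the application side of that ingress path (broker, topic, and keys are hypothetical), an application might publish its changes directly to Kafka as keyed events; a change-data-capture listener would emit equivalent events from existing database logs.

```python
# Sketch (hypothetical broker/topic/keys): an application publishing change
# events directly to Kafka rather than only writing to its own database.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder broker


def emit_change(entity_id: str, payload: dict) -> None:
    """Publish one change event, keyed by entity id so log compaction keeps
    the latest state per entity."""
    producer.produce(
        topic="customer-changes",   # hypothetical topic
        key=entity_id,
        value=json.dumps(payload).encode("utf-8"),
    )


emit_change("customer-42", {"name": "Contoso", "status": "active"})
producer.flush()  # block until outstanding events are delivered
```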

Apache Spark is the sole processing engine for transforming and querying during stream ingestion. Further processing against the data lake store can be performed for machine learning or other analytics requiring historical representations of data. As requirements change, we can change code and “replay” the stream, writing to a new version of the existing time slice in the data lake (v2, v3, and so on). Since our lake no longer acts as an immutable datastore of record, we can simply replay and rebuild our time slices as needed.
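One way that replay-and-rewrite step could look with Spark Structured Streaming is sketched below; the broker, topic, and storage paths are placeholders. The key idea is that re-running the job with startingOffsets set to earliest and a new output and checkpoint path produces a fresh “v2” copy of the time slices.

```python
# Sketch: Spark Structured Streaming job that reads the persisted Kafka topic
# from the beginning and writes a new, versioned slice of the data lake.
# Requires the spark-sql-kafka connector package on the classpath.
# Broker, topic, and storage paths below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kappa-replay-v2").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "organizational-events")
    .option("startingOffsets", "earliest")   # replay the full retained stream
    .load()
    .select(
        col("key").cast("string"),
        col("value").cast("string"),
        col("timestamp"),
    )
)

query = (
    events.writeStream.format("parquet")
    .option("path", "abfss://lake@account.dfs.core.windows.net/events/v2")
    .option("checkpointLocation",
            "abfss://lake@account.dfs.core.windows.net/checkpoints/events-v2")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```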

With kappa in place, we can eliminate any potential swamp by repopulating our data lake as necessary. We also eliminate the requirement of lambda to reproduce code in both streaming and batch processing – all ingress events and transforms occur solely within stream processing.

Additional Considerations

Schemas and Governance

You still need a solid data governance program regardless of which architecture you choose. For lambda, services like Azure Data Catalog can auto-discover and document file and database systems. Kafka doesn’t align with this tooling, so scaling to enterprise-sized environments strongly implies implementing Confluent Enterprise (available in the Azure Marketplace).

A key feature that Confluent Enterprise provides is Schema Registry. It allows topics to be self-describing and provides compatibility warnings for applications publishing to specific topics, ensuring that contracts with downstream consumers are maintained. Running Confluent Enterprise adds a third-party support relationship and additional licensing cost to your architecture, but it is invaluable for successful enterprise-scale deployments.
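
To give a sense of how the schema contract works in practice, here is a hedged sketch that registers an Avro schema for a topic’s value subject using the confluent-kafka Python client; the registry URL, subject name, and schema are placeholders, and an incompatible change to a registered subject would be rejected according to the configured compatibility mode.

```python
# Sketch: registering an Avro schema for a topic's value subject with
# Confluent Schema Registry. URL, subject name, and schema are placeholders.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

order_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "OrderEvent",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount",   "type": "double"}
      ]
    }
    """,
    schema_type="AVRO",
)

# Subjects conventionally follow "<topic>-value"; incompatible revisions to a
# registered subject are rejected, protecting downstream consumers.
schema_id = schema_registry.register_schema("orders-value", order_schema)
print(f"registered schema id {schema_id}")
```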
