ZooKeeper Recipes and Solutions

A Guide to Creating Higher-level Constructs with ZooKeeper

In this article, you'll find guidelines for using ZooKeeper to implement higher-order functions. All of them are conventions implemented at the client and do not require special support from ZooKeeper. Hopefully the community will capture these conventions in client-side libraries to ease their use and to encourage standardization.

One of the most interesting things about ZooKeeper is that even though ZooKeeper uses asynchronous notifications, you can use it to build synchronous consistency primitives, such as queues and locks. As you will see, this is possible because ZooKeeper imposes an overall order on updates, and has mechanisms to expose this ordering.

Note that the recipes below attempt to employ best practices. In particular, they avoid polling, timers or anything else that would result in a "herd effect", causing bursts of traffic and limiting scalability.

There are many useful functions that can be imagined that aren't included here - revocable read-write priority locks, as just one example. And some of the constructs mentioned here - locks, in particular - illustrate certain points, even though you may find other constructs, such as event handles or queues, a more practical means of performing the same function. In general, the examples in this section are designed to stimulate thought.

Out of the Box Applications: Name Service, Configuration, Group Membership

Name service and configuration are two of the primary applications of ZooKeeper. These two functions are provided directly by the ZooKeeper API. Another function directly provided by ZooKeeper is group membership. The group is represented by a node. Members of the group create ephemeral nodes under the group node. Nodes of the members that fail abnormally will be removed automatically when ZooKeeper detects the failure.
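
As a minimal sketch of the group-membership pattern, the fragment below uses the standard Java client. The group path "/mygroup", the memberName variable, and the already-connected org.apache.zookeeper.ZooKeeper handle zk are assumptions for illustration (imports from org.apache.zookeeper are elided in short fragments like this one):

    // Join the group: the ephemeral node is tied to this client's session,
    // so ZooKeeper removes it automatically if the member fails.
    zk.create("/mygroup/" + memberName, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // List current members; watch=true asks for one notification the next
    // time the membership changes.
    List<String> members = zk.getChildren("/mygroup", true);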

Barriers

Distributed systems use barriers to block processing of a set of nodes until a condition is met, at which time all the nodes are allowed to proceed. Barriers are implemented in ZooKeeper by designating a barrier node. The barrier is in place if the barrier node exists. Here's the pseudo code:

  1. Client calls the ZooKeeper API's exists() function on the barrier node, with watch set to true.
  2. If exists() returns false, the barrier is gone and the client proceeds.
  3. Else, if exists() returns true, the client waits for a watch event from ZooKeeper for the barrier node.
  4. When the watch event is triggered, the client reissues the exists() call, again waiting until the barrier node is removed (sketched below).
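
A minimal Java sketch of this loop follows, assuming an already-connected ZooKeeper handle; pairing a watcher with an object monitor is one common way to turn the asynchronous watch event into a synchronous wait, not the only one:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Block until the barrier node is removed (steps 1-4 above).
    public static void waitForBarrier(ZooKeeper zk, String barrierPath)
            throws KeeperException, InterruptedException {
        final Object monitor = new Object();
        // A watch fires at most once per exists() call, so the loop below
        // re-arms it on every iteration (step 4).
        Watcher watcher = event -> {
            synchronized (monitor) { monitor.notifyAll(); }
        };
        synchronized (monitor) {
            while (zk.exists(barrierPath, watcher) != null) {
                monitor.wait();   // step 3: wait for the watch event
            }
        }
        // exists() returned null: the barrier is gone and the client proceeds.
    }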

Double Barriers

Double barriers enable clients to synchronize the beginning and the end of a computation. When enough processes have joined the barrier, processes start their computation and leave the barrier once they have finished. This recipe shows how to use a ZooKeeper node as a barrier.

The pseudo code in this recipe represents the barrier node as b. Every client process p registers with the barrier node on entry and unregisters when it is ready to leave. A node registers with the barrier node via the Enter procedure below; it waits until x client processes register before proceeding with the computation. (The x here is up to you to determine for your system.) A Java sketch of the Enter procedure appears after the pseudo code.

Enter:
  1. Create a name n = b + "/" + p
  2. Set watch: exists(b + "/ready", true)
  3. Create child: create(n, EPHEMERAL)
  4. L = getChildren(b, false)
  5. If there are fewer children in L than x, wait for watch event
  6. Else create(b + "/ready", REGULAR)

Leave:
  1. L = getChildren(b, false)
  2. If no children, exit
  3. If p is only process node in L, delete(n) and exit
  4. If p is the lowest process node in L, wait on highest process node in L
  5. Else delete(n) if still exists and wait on lowest process node in L
  6. goto 1
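
Below is a hedged sketch of the Enter procedure in Java (Leave follows its numbered steps analogously). The threshold x, the paths, and the error handling are illustrative; a real implementation must also cope with session expiration:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public static void enterBarrier(ZooKeeper zk, String b, String p, int x)
            throws KeeperException, InterruptedException {
        final Object monitor = new Object();
        Watcher readyWatcher = event -> {
            synchronized (monitor) { monitor.notifyAll(); }
        };
        zk.exists(b + "/ready", readyWatcher);             // step 2: set watch
        zk.create(b + "/" + p, new byte[0],                // step 3: register
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        List<String> children = zk.getChildren(b, false);  // step 4
        if (children.size() >= x) {                        // step 6: we are last
            try {
                // PERSISTENT plays the role of REGULAR in the pseudo code.
                zk.create(b + "/ready", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // benign race: another process created the ready flag first
            }
        } else {                                           // step 5: wait
            synchronized (monitor) {
                while (zk.exists(b + "/ready", readyWatcher) == null) {
                    monitor.wait();
                }
            }
        }
    }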

On entering, all processes watch on a ready node and create an ephemeral node as a child of the barrier node. Each process but the last enters the barrier and waits for the ready node to appear at line 5. The process that creates the xth node, the last process, will see x nodes in the list of children and create the ready node, waking up the other processes. Note that waiting processes wake up only when it is time to exit, so waiting is efficient.

On exit, you can't use a flag such as ready because you are watching for process nodes to go away. By using ephemeral nodes, processes that fail after the barrier has been entered do not prevent correct processes from finishing. When processes are ready to leave, they need to delete their process nodes and wait for all other processes to do the same.

Processes exit when there are no process nodes left as children of b. However, as an efficiency, you can use the lowest process node as the ready flag. All other processes that are ready to exit watch for the lowest existing process node to go away, and the owner of the lowest process watches for any other process node (picking the highest for simplicity) to go away. This means that only a single process wakes up on each node deletion except for the last node, which wakes up everyone when it is removed.

Queues

Distributed queues are a common data structure. To implement a distributed queue in ZooKeeper, first designate a znode to hold the queue, the queue node. The distributed clients put something into the queue by calling create() with a pathname ending in "queue-", with the sequence and ephemeral flags in the create() call set to true. Because the sequence flag is set, the new pathnames will have the form _path-to-queue-node_/queue-X, where X is a monotonically increasing number. A client that wants to remove an item from the queue calls ZooKeeper's getChildren() function, with watch set to true on the queue node, and begins processing nodes with the lowest number. The client does not need to issue another getChildren() until it exhausts the list obtained from the first getChildren() call. If there are no children in the queue node, the reader waits for a watch notification to check the queue again.
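
The following sketch shows both halves of the recipe in Java. The queue path "/queue" is illustrative; the producer follows the text above in setting both the sequence and ephemeral flags, though note that ephemeral items disappear if the producer's session ends, so many practical queues use PERSISTENT_SEQUENTIAL instead:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Producer: the server appends a monotonically increasing suffix,
    // yielding names such as /queue/queue-0000000042.
    public static String enqueue(ZooKeeper zk, byte[] data)
            throws KeeperException, InterruptedException {
        return zk.create("/queue/queue-", data,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // Consumer: take the lowest-numbered element. The blocking wait on an
    // empty queue (via the watch set by getChildren) is elided for brevity.
    public static byte[] dequeue(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        List<String> items = zk.getChildren("/queue", true);
        Collections.sort(items);   // fixed-width suffixes sort numerically
        for (String item : items) {
            String path = "/queue/" + item;
            try {
                byte[] data = zk.getData(path, false, null);
                zk.delete(path, -1);   // version -1 matches any version
                return data;
            } catch (KeeperException.NoNodeException e) {
                // a competing consumer removed it first; try the next item
            }
        }
        return null;   // queue was empty; caller retries after the watch fires
    }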

Note

There now exists a Queue implementation in the ZooKeeper recipes directory. It is distributed with the release, in the src/recipes/queue directory of the release artifact.

Priority Queues

To implement a priority queue, you need only make two simple changes to the generic queue recipe. First, to add an item to the queue, the pathname ends with "queue-YY" where YY is the priority of the element, with lower numbers representing higher priority (just like UNIX). Second, when removing from the queue, a client uses an up-to-date children list, meaning that the client will invalidate previously obtained children lists if a watch notification triggers for the queue node.
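
As a sketch of the first change, the priority YY is embedded in the name passed to create(); zk, data, and priority are as in the previous sketch, and the zero-padded "%02d" format is an assumption that keeps lexicographic child order aligned with numeric priority:

    // With the sequence flag still set, the server appends its counter after
    // the priority, e.g. /queue/queue-03-0000000017, so children sort first
    // by priority and then by arrival order within a priority.
    String name = String.format("/queue/queue-%02d-", priority);
    zk.create(name, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL);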

Locks

Fully distributed locks that are globally synchronous (meaning that at any snapshot in time no two clients think they hold the same lock) can be implemented using ZooKeeper. As with priority queues, first define a lock node.

Note

There now exists a Lock implementation in the ZooKeeper recipes directory. It is distributed with the release, in the src/recipes/lock directory of the release artifact.

Clients wishing to obtain a lock do the following:

  1. Call create() with a pathname of "_locknode_/lock-" and the sequence and ephemeral flags set.
  2. Call getChildren() on the lock node without setting the watch flag (this is important to avoid the herd effect).
  3. If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol.
  4. The client calls exists() with the watch flag set on the path in the lock directory with the next lowest sequence number.
  5. If exists() returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.

The unlock protocol is very simple: clients wishing to release a lock simply delete the node they created in step 1. Both operations are sketched below.
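
A hedged sketch of both protocols in Java follows; error handling and session-loss recovery are omitted, and the monitor/watcher pairing is just one way to block on the single watched node:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Returns the full path of the lock znode, needed later by unlock().
    public static String lock(ZooKeeper zk, String dir)
            throws KeeperException, InterruptedException {
        final Object monitor = new Object();
        Watcher watcher = event -> {
            synchronized (monitor) { monitor.notifyAll(); }
        };
        // Step 1: sequence + ephemeral flags.
        String me = zk.create(dir + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = me.substring(me.lastIndexOf('/') + 1);
        while (true) {
            // Step 2: no watch here, to avoid the herd effect.
            List<String> children = zk.getChildren(dir, false);
            Collections.sort(children);
            int i = children.indexOf(myName);
            // Step 3: lowest sequence number holds the lock.
            if (i == 0) return me;
            // Steps 4-5: watch only the node just below ours.
            synchronized (monitor) {
                if (zk.exists(dir + "/" + children.get(i - 1), watcher) != null) {
                    monitor.wait();
                }
            }
        }
    }

    // Unlock: delete the node created in step 1.
    public static void unlock(ZooKeeper zk, String lockPath)
            throws KeeperException, InterruptedException {
        zk.delete(lockPath, -1);
    }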

Here are a few things to notice:

  • The removal of a node will only cause one client to wake up since each node is watched by exactly one client. In this way, you avoid the herd effect.
  • There is no polling or timeouts.
  • Because of the way you implement locking, it is easy to see the amount of lock contention, break locks, debug locking problems, etc.

Shared Locks

You can implement shared locks with a few changes to the lock protocol (a sketch of the read-lock side follows the two variants):

Obtaining a read lock:
  1. Call create() to create a node with pathname "_locknode_/read-". This is the lock node used later in the protocol. Make sure to set both the sequence and ephemeral flags.
  2. Call getChildren() on the lock node without setting the watch flag - this is important, as it avoids the herd effect.
  3. If there are no children with a pathname starting with "write-" and having a lower sequence number than the node created in step 1, the client has the lock and can exit the protocol.
  4. Otherwise, call exists(), with watch flag set, on the node in the lock directory with a pathname starting with "write-" that has the next lowest sequence number.
  5. If exists() returns false, go to step 2.
  6. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.

Obtaining a write lock:
  1. Call create() to create a node with pathname "_locknode_/write-". This is the lock node spoken of later in the protocol. Make sure to set both the sequence and ephemeral flags.
  2. Call getChildren() on the lock node without setting the watch flag - this is important, as it avoids the herd effect.
  3. If there are no children with a lower sequence number than the node created in step 1, the client has the lock and the client exits the protocol.
  4. Call exists(), with watch flag set, on the node with the pathname that has the next lowest sequence number.
  5. If exists() returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.
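
The read-lock side translates to Java as sketched below (the write-lock side differs only in steps 3-4, as the lists above show). The dir path and the monitor/watcher pairing from the earlier lock sketch are reused here as assumptions:

    // Hedged sketch: acquire a read lock under `dir`; blocks until no
    // "write-" node with a lower sequence number remains.
    String me = zk.create(dir + "/read-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    int mySeq = Integer.parseInt(me.substring(me.lastIndexOf('-') + 1));
    while (true) {
        List<String> children = zk.getChildren(dir, false);  // no watch: no herd
        String blocker = null;
        int blockerSeq = -1;
        for (String c : children) {
            int seq = Integer.parseInt(c.substring(c.lastIndexOf('-') + 1));
            // Readers are blocked only by writers with lower sequence numbers;
            // watch the highest such writer (the "next lowest" in step 4).
            if (c.startsWith("write-") && seq < mySeq && seq > blockerSeq) {
                blocker = c;
                blockerSeq = seq;
            }
        }
        if (blocker == null) break;   // step 3: read lock acquired
        synchronized (monitor) {      // steps 4-6
            if (zk.exists(dir + "/" + blocker, watcher) != null) {
                monitor.wait();
            }
        }
    }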

Note

It might appear that this recipe creates a herd effect: when there is a large group of clients waiting for a read lock, they all get notified more or less simultaneously when the "write-" node with the lowest sequence number is deleted. In fact, that's valid behavior: all those waiting reader clients should be released, since they now hold the lock. The herd effect refers to releasing a "herd" when in fact only a single or a small number of machines can proceed.

Recoverable Shared Locks

With minor modifications to the Shared Lock protocol, you can make shared locks revocable:

In step 1 of both the read and write lock protocols, call getData() with watch set immediately after the call to create(). If the client subsequently receives a notification for the node it created in step 1, it does another getData() on that node, with watch set, and looks for the string "unlock", which signals to the client that it must release the lock. This is because, according to this shared lock protocol, you can request that the client holding the lock give up the lock by calling setData() on the lock node, writing "unlock" to that node.

Note that this protocol requires the lock holder to consent to releasing the lock. Such consent is important, especially if the lock holder needs to do some processing before releasing the lock. Of course you can always implement Revocable Shared Locks with Freaking Laser Beams by stipulating in your protocol that the revoker is allowed to delete the lock node if, after some length of time, the lock isn't deleted by the lock holder.
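
A sketch of both sides of the revocation handshake follows. The releaseLock() hook and the variables zk, myLockNode, and holderLockNode are placeholders; only getData()/setData() and the "unlock" marker come from the protocol above:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    // Holder side: call immediately after create() in step 1; the watcher
    // re-arms itself on every data-change notification.
    Watcher revokeWatcher = new Watcher() {
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    byte[] data = zk.getData(event.getPath(), this, null);
                    if (data != null && "unlock".equals(new String(data))) {
                        releaseLock();   // placeholder: app cleanup, then delete
                    }
                } catch (Exception e) {
                    // handle per application policy (sketch only)
                }
            }
        }
    };
    zk.getData(myLockNode, revokeWatcher, null);

    // Revoker side: ask the current holder to give up the lock.
    zk.setData(holderLockNode, "unlock".getBytes(), -1);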

Two-phased Commit

A two-phase commit protocol is an algorithm that lets all clients in a distributed system agree either to commit a transaction or abort.

In ZooKeeper, you can implement a two-phased commit by having a coordinator create a transaction node, say "/app/Tx", and one child node per participating site, say "/app/Tx/s_i". When the coordinator creates the child nodes, it leaves their content undefined. Once each site involved in the transaction receives the transaction from the coordinator, the site reads each child node and sets a watch. Each site then processes the query and votes "commit" or "abort" by writing to its respective node. Once the write completes, the other sites are notified, and as soon as all sites have all votes, they can decide either "abort" or "commit". Note that a node can decide "abort" earlier if some site votes for "abort".

An interesting aspect of this implementation is that the only role of the coordinator is to decide upon the group of sites, to create the ZooKeeper nodes, and to propagate the transaction to the corresponding sites. In fact, even propagating the transaction can be done through ZooKeeper by writing it in the transaction node.

There are two important drawbacks of the approach described above. One is the message complexity, which is O(n²). The second is the impossibility of detecting failures of sites through ephemeral nodes. To detect the failure of a site using ephemeral nodes, it is necessary that the site create the node.

To solve the first problem, you can have only the coordinator notified of changes to the transaction nodes, and then notify the sites once the coordinator reaches a decision. Note that this approach is scalable, but it is slower too, as it requires all communication to go through the coordinator.

To address the second problem, you can have the coordinator propagate the transaction to the sites, and have each site create its own ephemeral node.

Leader Election

A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", such that each client creates a child znode "/election/n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any previously appended to a child of "/election". The process that created the znode with the smallest appended sequence number is the leader.

That's not all, though. It is important to watch for failures of the leader, so that a new client arises as the new leader in case the current leader fails. A trivial solution is to have all application processes watch the current smallest znode, and check whether they are the new leader when the smallest znode goes away (note that the smallest znode will go away if the leader fails, because the node is ephemeral). But this causes a herd effect: upon failure of the current leader, all other processes receive a notification and execute getChildren on "/election" to obtain the current list of children of "/election". If the number of clients is large, it causes a spike in the number of operations that the ZooKeeper servers have to process. To avoid the herd effect, it is sufficient to watch the next znode down in the sequence of znodes. If a client receives a notification that the znode it is watching is gone, then it becomes the new leader if there is no smaller znode. Note that this avoids the herd effect by not having all clients watch the same znode.

Here's the pseudo code, with a Java sketch after it:

Let ELECTION be a path of choice of the application. To volunteer to be a leader:

  1. Create znode z with path "ELECTION/n_" with both SEQUENCE and EPHEMERAL flags;
  2. Let C be the children of "ELECTION", and i be the sequence number of z;
  3. Watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;

Upon receiving a notification of znode deletion:

  1. Let C be the new set of children of ELECTION;
  2. If z is the smallest node in C, then execute leader procedure;
  3. Otherwise, watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;
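
A compact Java rendering of the pseudo code, under the same assumptions as the earlier sketches (connected handle, no session-loss handling); leaderProcedure() is a placeholder for the application's own leader logic:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public static void volunteer(ZooKeeper zk, String election)
            throws KeeperException, InterruptedException {
        // Step 1: create z = ELECTION/n_<seq> with SEQUENCE|EPHEMERAL.
        String z = zk.create(election + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = z.substring(z.lastIndexOf('/') + 1);
        final Object monitor = new Object();
        Watcher watcher = event -> {
            synchronized (monitor) { monitor.notifyAll(); }
        };
        while (true) {
            // Step 2: C = children of ELECTION.
            List<String> children = zk.getChildren(election, false);
            Collections.sort(children);
            int i = children.indexOf(myName);
            if (i == 0) {               // no znode precedes ours: we lead
                leaderProcedure();      // placeholder for application logic
                return;
            }
            // Step 3: watch only n_j, the largest j < i (no herd effect).
            synchronized (monitor) {
                if (zk.exists(election + "/" + children.get(i - 1), watcher) != null) {
                    monitor.wait();     // deletion notification re-runs the loop
                }
            }
        }
    }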

Note that the znode having no preceding znode on the list of children does not imply that the creator of this znode is aware that it is the current leader. Applications may consider creating a separate znode to acknowledge that the leader has executed the leader procedure.
