Berkeley DB replication mechanism - election algorithm

repmgr_method.c, __repmgr_start_int()

Two elect threads are started initially.

repmgr_elect.c, __repmgr_init_election()

__repmgr_elect_thread()

__repmgr_elect_main()

lease, preferred master mode,

rep_elect.c, __rep_elect_int() (called from __repmgr_elect() in repmgr_elect.c)

__rep_elect_init()

lockout,

if (rep->egen != egen)  // egen has changed in the meantime: bail out

tiebreaker

/* Use the last commit record as the LSN in the vote */

__rep_write_egen

__rep_tally // tally our own vote

__rep_cmp_vote // record ourselves as the winner up front

__rep_send_vote() // send vote1 (our own vote), REP_VOTE1

phase1, wait...

if (rep->sites >= rep->nvotes) { // enough sites heard from: enter phase 2; otherwise the election ends here

rep->sites - sites heard from.

rep->nvotes - Number of votes needed.

send vote2; or, if we ourselves are the winner, cast a vote for ourselves

Did I win?

rep_record.c, __rep_process_message_int()

case REP_VOTE1:
ret = __rep_vote1(env, rp, rec, eid);
break;
case REP_VOTE2:
ret = __rep_vote2(env, rec, eid);

__rep_vote1()

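If we are the master ourselves, send REP_NEWMASTER and return

If the vote carries an older egen, send REP_ALIVE

If the vote carries a newer egen, abort the current election and update egen

* Ignore vote1's if we're in phase 2.

__rep_tally - record the vote; if it comes from a site not seen before, rep->sites++

__rep_cmp_vote // compare this vote1 with the winner we already have

Once vote1s from all sites have been received, enter phase 2

- if we are the winner, claim it; otherwise send vote2 for the other site

If needed (full election?, first vote1 received from this site), resend our vote1 to that site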

__rep_vote2()

/*
* Record this vote. In a VOTE2, the only valid entry
* in the vote information is the election generation.
*
* There are several things which can go wrong that we
* need to account for:
* 1. If we receive a latent VOTE2 from an earlier election,
* we want to ignore it.
* 2. If we receive a VOTE2 from a site from which we never
* received a VOTE1, we want to record it, because we simply
* may be processing messages out of order or its vote1 got lost,
* but that site got all the votes it needed to send it.
* 3. If we have received a duplicate VOTE2 from this election
* from the same site we want to ignore it.
* 4. If this is from the current election and someone is
* really voting for us, then we finally get to record it.
*/

__rep_tally - if the vote2 comes from a site not seen before, rep->votes++

#define I_HAVE_WON(rep, winner) \
((rep)->votes >= (rep)->nvotes && winner == (rep)->eid)

rep->sites - sites heard from.

rep->nvotes - Number of votes needed.

rep->votes - Number of votes for this site.

rep->nsites - Number of sites in group.

/*
* We need to check sites == nsites, not more than half
* like we do in __rep_elect and the VOTE2 code. The
* reason is that we want to process all the incoming votes
* and not short-circuit once we reach more than half. The
* real winner's vote may be in the last half.
*/
#define IS_PHASE1_DONE(rep) \
((rep)->sites >= (rep)->nsites && (rep)->winner != DB_EID_INVALID)

u_int32_t egen; /* Replication election generation. */

REP_NEWMASTER - I am the new master

REP_MASTER_REQ - who is the master?

rep_util.c, __rep_new_master() - synchronize with the new master

/*
* Election gen file name
* The file contains an egen number for an election this client has NOT
* participated in. I.e. it is the number of a future election. We
* create it when we create the rep region, if it doesn't already exist
* and initialize egen to 1. If it does exist, we read it when we create
* the rep region. We write it immediately before sending our VOTE1 in
* an election. That way, if a client has ever sent a vote for any
* election, the file is already going to be updated to reflect a future
* election, should it crash.
*/
#define REP_EGENNAME "__db.rep.egen"
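
The egen file rule from the comment can be sketched with plain stdio. This is not how Berkeley DB does its I/O; `load_egen()`, `write_egen()` and `persist_before_vote1()` are hypothetical helpers, and writing `egen + 1` is my reading of "the number of a future election".

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Overwrite the file with a single egen value. */
int write_egen(const char *path, uint32_t egen)
{
    FILE *fp;
    size_t n;

    assert(path != NULL);
    if ((fp = fopen(path, "wb")) == NULL)
        return -1;
    n = fwrite(&egen, sizeof(egen), 1, fp);
    if (fclose(fp) != 0 || n != 1)
        return -1;
    return 0;
}

/* At startup: read the file if it exists, else create it with egen 1. */
int load_egen(const char *path, uint32_t *egenp)
{
    FILE *fp;
    size_t n;

    assert(path != NULL && egenp != NULL);
    if ((fp = fopen(path, "rb")) == NULL) {
        *egenp = 1;
        return write_egen(path, 1);
    }
    n = fread(egenp, sizeof(*egenp), 1, fp);
    fclose(fp);
    return n == 1 ? 0 : -1;
}

/* Just before sending VOTE1 for election `egen`, record the next one. */
int persist_before_vote1(const char *path, uint32_t egen)
{
    return write_egen(path, egen + 1);
}
```

If the site crashes after voting in election 1, the next `load_egen()` returns 2, so it cannot vote again in an election it already took part in.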

typedef struct {
u_int32_t egen; /* Voter's election generation. */
int eid; /* Voter's ID. */
} REP_VTALLY;

rep_elect.c, __rep_tally()

* Ignore votes from earlier elections (i.e. we've heard
* from this site in this election, but its vote from an
* earlier election got delayed and we received it now).
* However, if we happened to hear from an earlier vote
* and we recorded it and we're now hearin

__rep_cmp_vote()

/* Make ourselves the winner to start. */

rep->winner records the best-known winner so far

__rep_elect_done()

- clear the elect flag, clear rep->votes, ..., rep->egen++

Date: 2024-10-09 16:52:25
