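MySQL 5.6 binlog group commit (1)

[MySQL 5.6] MySQL 5.6 group commit: performance test and internal implementation flow

Although MariaDB and Facebook fixed this notorious problem a long time ago, the official release did not fix it until MySQL 5.6. This article focuses on two points:

1. How MySQL 5.6 performs

2. The three-stage implementation flow of group commit in 5.6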

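New parameters

MySQL 5.6 provides two parameters to control binlog group commit:

binlog_max_flush_queue_time

In microseconds. This is the timeout for pulling transactions from the flush queue. Its main purpose is to keep the response time (RT) of some transactions from rising when transaction concurrency is very high.

Read the function MYSQL_BIN_LOG::process_flush_stage_queue to understand how it works.

binlog_order_commits

When set to 0, transactions may be committed in an order different from their order in the binlog. As the test below shows, this improves performance slightly, but not dramatically.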

Performance test

As usual, let's look at performance first.

sysbench, entirely in-memory workload, 5 sbtest tables with 1,000,000 rows each.

Base configuration:

innodb_flush_log_at_trx_commit=1

table_open_cache_instances=5

metadata_locks_hash_instances = 32

metadata_locks_cache_size=2048

performance_schema_instrument = '%=on'

performance_schema=ON

innodb_lru_scan_depth=8192

innodb_purge_threads = 4

Disable the Performance Schema consumers:

mysql> update setup_consumers set ENABLED = 'NO';

Query OK, 4 rows affected (0.02 sec)

Rows matched: 12  Changed: 4  Warnings: 0

sysbench/sysbench --debug=off --test=sysbench/tests/db/update_index.lua --oltp-tables-count=5 --oltp-point-selects=0 --oltp-table-size=1000000 --num-threads=1000 --max-requests=10000000000 --max-time=7200 --oltp-auto-inc=off --mysql-engine-trx=yes --mysql-table-engine=innodb --oltp-test-mod=complex --mysql-db=test --mysql-host=$HOST --mysql-port=3306 --mysql-user=xx run

update_index.lua (TPS)

threads   sync_binlog=0   sync_binlog=1   sync_binlog=1, binlog_order_commits=0
1         900             610             620
20        13,800          7,000           7,400
60        20,000          14,500          16,000
120       25,100          21,054          23,000
200       27,900          25,400          27,800
400       33,100          30,700          31,300
600       32,800          31,500          29,326
1000      20,400          20,200          20,500

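On my machine, the CPU was almost completely exhausted by the time the load reached 1000 concurrent threads.

As the numbers show, the higher the concurrency, the better group commit works. At 600 or more concurrent threads there is essentially no TPS difference between sync_binlog=1 and sync_binlog=0.

The problem is that our production workload rarely reaches this level of pressure. Under low load, sync_binlog=1 still adds overhead to each individual thread.

I also observed that setting binlog_max_flush_queue_time has no noticeable effect on TPS.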
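To make the comparison concrete, the short sketch below computes how much of the sync_binlog=0 throughput is retained with sync_binlog=1 at each thread count. The figures are copied from the update_index.lua results above; the names Row, kResults and retained_fraction are made up for this illustration. With these numbers the ratio is roughly 0.68 at 1 thread, 0.51 at 20 threads, 0.96 at 600 threads and 0.99 at 1000 threads.

```cpp
#include <cassert>

// One row of the update_index.lua results (TPS values from the article).
struct Row {
  int threads;
  double tps_sync0;  // sync_binlog = 0
  double tps_sync1;  // sync_binlog = 1
};

// Hypothetical table name; the data itself is taken from the test results.
static const Row kResults[] = {
    {1, 900, 610},       {20, 13800, 7000},   {60, 20000, 14500},
    {120, 25100, 21054}, {200, 27900, 25400}, {400, 33100, 30700},
    {600, 32800, 31500}, {1000, 20400, 20200},
};

// Fraction of the sync_binlog=0 throughput that remains with sync_binlog=1.
double retained_fraction(const Row &r) { return r.tps_sync1 / r.tps_sync0; }
```

A ratio close to 1 means the extra fsync is being amortized over a large group, which is what happens at the high-concurrency end of the table.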

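Implementation

As we know, since 5.1 the binlog and InnoDB have used something like a two-phase commit. For the full history of the group commit problem, read Mats's blog, which covers it in great detail. The comments there are quite interesting too.

What follows focuses on how the binlog does group commit in 5.6. In 5.6, the binlog commit is split into three stages: flush stage, sync stage, and commit stage. The 5.6 approach is similar to MariaDB's: both maintain a queue, and the first thread to enter the queue becomes the leader while later arrivals become followers. The leader collects the followers' transactions and performs the sync on their behalf. The followers wait for the leader to notify them that the operation has completed.

Each of the three stages maintains its own queue: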

Mutex_queue m_queue[STAGE_COUNTER];

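The THDs of different sessions are linked through thd->next_to_commit. In fact, although the three stages below maintain three separate queues, all the THDs in those queues are chained together through next_to_commit.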
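The queue mechanics described here can be sketched roughly as follows. This is a simplified illustration, not the MySQL source: Session stands in for THD, and StageQueue and fetch_and_empty are hypothetical names loosely modeled on Mutex_queue and fetch_queue_for. The point it shows is that appending to an empty queue is what makes a thread the leader, and that a whole chain already linked by next_to_commit can be moved into the next stage's queue in one step.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for MySQL's THD; only the link field is modeled.
struct Session {
  explicit Session(int i) : id(i) {}
  int id;
  Session *next_to_commit = nullptr;
};

// Simplified stage queue: an intrusive singly linked list behind a mutex.
class StageQueue {
 public:
  StageQueue() : first_(nullptr), last_(&first_) {}

  // Appends a chain of sessions. Returns true if the queue was empty,
  // which means the caller becomes the leader for this stage.
  bool append(Session *chain) {
    std::lock_guard<std::mutex> guard(lock_);
    const bool was_empty = (first_ == nullptr);
    *last_ = chain;
    // Walk to the end of the appended chain so the tail pointer stays valid.
    while (chain->next_to_commit != nullptr) chain = chain->next_to_commit;
    last_ = &chain->next_to_commit;
    return was_empty;
  }

  // Detaches and returns the whole chain, leaving the queue empty.
  Session *fetch_and_empty() {
    std::lock_guard<std::mutex> guard(lock_);
    Session *head = first_;
    first_ = nullptr;
    last_ = &first_;
    return head;
  }

 private:
  std::mutex lock_;
  Session *first_;
  Session **last_;  // points at the next_to_commit slot to fill next
};
```

Because the sessions carry their own link field, moving a group from one stage to the next does not copy or allocate anything; the leader simply hands over the head pointer.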

In the binlog's XA commit phase (MYSQL_BIN_LOG::commit), once the transaction's final Xid event has been written, execution enters MYSQL_BIN_LOG::ordered_commit and the three-stage flow begins:

int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit)
{
  DBUG_ENTER("MYSQL_BIN_LOG::ordered_commit");
  int flush_error= 0;
  my_off_t total_bytes= 0;
  bool do_rotate= false;

  /*
    These values are used while flushing a transaction, so clear
    everything.

    Notes:

    - It would be good if we could keep transaction coordinator
      log-specific data out of the THD structure, but that is not the
      case right now.

    - Everything in the transaction structure is reset when calling
      ha_commit_low since that calls st_transaction::cleanup.
  */
  thd->transaction.flags.pending= true;
  thd->commit_error= THD::CE_NONE;
  thd->next_to_commit= NULL;
  thd->durability_property= HA_IGNORE_DURABILITY;
  thd->transaction.flags.real_commit= all;
  thd->transaction.flags.xid_written= false;
  thd->transaction.flags.commit_low= !skip_commit;
  thd->transaction.flags.run_hooks= !skip_commit;
#ifndef DBUG_OFF
  /*
     The group commit Leader may have to wait for follower whose transaction
     is not ready to be preempted. Initially the status is pessimistic.
     Preemption guarding logics is necessary only when DBUG_ON is set.
     It won't be required for the dbug-off case as long as the follower won't
     execute any thread-specific write access code in this method, which is
     the case as of current.
  */
  thd->transaction.flags.ready_preempt= 0;
#endif

  DBUG_PRINT("enter", ("flags.pending: %s, commit_error: %d, thread_id: %lu",
                       YESNO(thd->transaction.flags.pending),
                       thd->commit_error, thd->thread_id));

  /*
    Stage #1: flushing transactions to binary log

    While flushing, we allow new threads to enter and will process
    them in due time. Once the queue was empty, we cannot reap
    anything more since it is possible that a thread entered and
    appointed itself leader for the flush phase.
  */
  if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log))
  {
    DBUG_PRINT("return", ("Thread ID: %lu, commit_error: %d",
                          thd->thread_id, thd->commit_error));
    DBUG_RETURN(finish_commit(thd));
  }

  THD *wait_queue= NULL;
  flush_error= process_flush_stage_queue(&total_bytes, &do_rotate, &wait_queue);

  my_off_t flush_end_pos= 0;
  if (flush_error == 0 && total_bytes > 0)
    flush_error= flush_cache_to_file(&flush_end_pos);

  /*
    If the flush finished successfully, we can call the after_flush
    hook. Being invoked here, we have the guarantee that the hook is
    executed before the before/after_send_hooks on the dump thread
    preventing race conditions among these plug-ins.
  */
  if (flush_error == 0)
  {
    const char *file_name_ptr= log_file_name + dirname_length(log_file_name);
    DBUG_ASSERT(flush_end_pos != 0);
    if (RUN_HOOK(binlog_storage, after_flush,
                 (thd, file_name_ptr, flush_end_pos)))
    {
      sql_print_error("Failed to run 'after_flush' hooks");
      flush_error= ER_ERROR_ON_WRITE;
    }

    signal_update();
    DBUG_EXECUTE_IF("crash_commit_after_log", DBUG_SUICIDE(););
  }

  /*
    Stage #2: Syncing binary log file to disk
  */
  bool need_LOCK_log= (get_sync_period() == 1);

  /*
    LOCK_log is not released when sync_binlog is 1. It guarantees that the
    events are not be replicated by dump threads before they are synced to disk.
  */
  if (change_stage(thd, Stage_manager::SYNC_STAGE, wait_queue,
                   need_LOCK_log ? NULL : &LOCK_log, &LOCK_sync))
  {
    DBUG_PRINT("return", ("Thread ID: %lu, commit_error: %d",
                          thd->thread_id, thd->commit_error));
    DBUG_RETURN(finish_commit(thd));
  }
  THD *final_queue= stage_manager.fetch_queue_for(Stage_manager::SYNC_STAGE);
  if (flush_error == 0 && total_bytes > 0)
  {
    DEBUG_SYNC(thd, "before_sync_binlog_file");
    std::pair<bool, bool> result= sync_binlog_file(false);
    flush_error= result.first;
  }

  if (need_LOCK_log)
    mysql_mutex_unlock(&LOCK_log);

  /*
    Stage #3: Commit all transactions in order.

    This stage is skipped if we do not need to order the commits and
    each thread have to execute the handlerton commit instead.

    Howver, since we are keeping the lock from the previous stage, we
    need to unlock it if we skip the stage.
   */
  if (opt_binlog_order_commits)
  {
    if (change_stage(thd, Stage_manager::COMMIT_STAGE,
                     final_queue, &LOCK_sync, &LOCK_commit))
    {
      DBUG_PRINT("return", ("Thread ID: %lu, commit_error: %d",
                            thd->thread_id, thd->commit_error));
      DBUG_RETURN(finish_commit(thd));
    }
    THD *commit_queue= stage_manager.fetch_queue_for(Stage_manager::COMMIT_STAGE);
    DBUG_EXECUTE_IF("semi_sync_3-way_deadlock",
                    DEBUG_SYNC(thd, "before_process_commit_stage_queue"););
    process_commit_stage_queue(thd, commit_queue);
    mysql_mutex_unlock(&LOCK_commit);
    /*
      Process after_commit after LOCK_commit is released for avoiding
      3-way deadlock among user thread, rotate thread and dump thread.
    */
    process_after_commit_stage_queue(thd, commit_queue);
    final_queue= commit_queue;
  }
  else
    mysql_mutex_unlock(&LOCK_sync);

  /* Commit done so signal all waiting threads */
  stage_manager.signal_done(final_queue);

  /*
    Finish the commit before executing a rotate, or run the risk of a
    deadlock. We don't need the return value here since it is in
    thd->commit_error, which is returned below.
  */
  (void) finish_commit(thd);

  /*
    If we need to rotate, we do it without commit error.
    Otherwise the thd->commit_error will be possibly reset.
   */
  if (do_rotate && thd->commit_error == THD::CE_NONE)
  {
    /*
      Do not force the rotate as several consecutive groups may
      request unnecessary rotations.

      NOTE: Run purge_logs wo/ holding LOCK_log because it does not
      need the mutex. Otherwise causes various deadlocks.
    */

    DEBUG_SYNC(thd, "ready_to_do_rotation");
    bool check_purge= false;
    mysql_mutex_lock(&LOCK_log);
    int error= rotate(false, &check_purge);
    mysql_mutex_unlock(&LOCK_log);

    if (error)
      thd->commit_error= THD::CE_COMMIT_ERROR;
    else if (check_purge)
      purge();
  }
  DBUG_RETURN(thd->commit_error);
}

###flush stage

int
MYSQL_BIN_LOG::process_flush_stage_queue(my_off_t *total_bytes_var,
                                         bool *rotate_var,
                                         THD **out_queue_var)
{
  DBUG_ASSERT(total_bytes_var && rotate_var && out_queue_var);
  my_off_t total_bytes= 0;
  int flush_error= 1;
  mysql_mutex_assert_owner(&LOCK_log);

  my_atomic_rwlock_rdlock(&opt_binlog_max_flush_queue_time_lock);
  const ulonglong max_udelay= my_atomic_load32(&opt_binlog_max_flush_queue_time);
  my_atomic_rwlock_rdunlock(&opt_binlog_max_flush_queue_time_lock);
  const ulonglong start_utime= max_udelay > 0 ? my_micro_time() : 0;

  /*
    First we read the queue until it either is empty or the difference
    between the time we started and the current time is too large.

    We remember the first thread we unqueued, because this will be the
    beginning of the out queue.
   */
  bool has_more= true;
  THD *first_seen= NULL;
  while ((max_udelay == 0 || my_micro_time() < start_utime + max_udelay) && has_more)
  {
    std::pair<bool,THD*> current= stage_manager.pop_front(Stage_manager::FLUSH_STAGE);
    std::pair<int,my_off_t> result= flush_thread_caches(current.second);
    has_more= current.first;
    total_bytes+= result.second;
    if (flush_error == 1)
      flush_error= result.first;
    if (first_seen == NULL)
      first_seen= current.second;
  }

  /*
    Either the queue is empty, or we ran out of time. If we ran out of
    time, we have to fetch the entire queue (and flush it) since
    otherwise the next batch will not have a leader.
   */
  if (has_more)
  {
    THD *queue= stage_manager.fetch_queue_for(Stage_manager::FLUSH_STAGE);
    for (THD *head= queue ; head ; head = head->next_to_commit)
    {
      std::pair<int,my_off_t> result= flush_thread_caches(head);
      total_bytes+= result.second;
      if (flush_error == 1)
        flush_error= result.first;
    }
    if (first_seen == NULL)
      first_seen= queue;
  }

  *out_queue_var= first_seen;
  *total_bytes_var= total_bytes;
  if (total_bytes > 0 && my_b_tell(&log_file) >= (my_off_t) max_size)
    *rotate_var= true;
  return flush_error;
}

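change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log)

|-->stage_manager.enroll_for(stage, queue, leave_mutex) //adds the current thread to m_queue[FLUSH_STAGE]. If it is the first thread in the queue it becomes the leader; otherwise it is a follower and sleeps here until the leader wakes it (m_cond_done)

|-->the leader thread takes the LOCK_log mutex and returns false from change_stage.

flush_error= process_flush_stage_queue(&total_bytes, &do_rotate, &wait_queue); //only the leader thread reaches this code

|-->first reads the queue until it is empty or the timeout expires (the timeout is controlled by binlog_max_flush_queue_time), calling flush_thread_caches on each thread it dequeues to write that thread's binlog data into the cache. Note that new sessions may be appended to the queue while it is being drained, which is exactly why the timeout exists

|-->if the timeout expired and sessions remain in the queue, takes the head of the whole queue and empties the original queue (fetch_queue_for), then runs flush_thread_caches on each of the fetched sessions

|-->if any bytes were written and the current binlog file position has reached max_size (the maximum binlog size), sets the rotate flag

flush_error= flush_cache_to_file(&flush_end_pos);

|-->writes the contents of the I/O cache to the file

signal_update()  //notifies the dump threads that there is new binlog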
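The two-phase drain in process_flush_stage_queue can be sketched in isolation. This is a simplified model under stated assumptions, not the MySQL code: sessions are plain ints in a std::deque, the clock is injected so the behavior is deterministic, and the names DrainResult and drain_flush_queue are invented for this example. Unlike the original, the sketch also tolerates an empty input queue.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

struct DrainResult {
  std::vector<int> flushed;  // session ids in the order they were flushed
  bool timed_out;            // true if the remainder was fetched in bulk
};

// max_delay_us == 0 means "no timeout", mirroring the check in the listing.
DrainResult drain_flush_queue(std::deque<int> &queue,
                              std::uint64_t max_delay_us,
                              const std::function<std::uint64_t()> &now_us) {
  DrainResult result;
  result.timed_out = false;
  const std::uint64_t start = max_delay_us > 0 ? now_us() : 0;
  bool has_more = !queue.empty();

  // Phase 1: pop sessions one at a time until the queue is empty or the
  // time budget is used up. New sessions could be appended concurrently.
  while ((max_delay_us == 0 || now_us() < start + max_delay_us) && has_more) {
    result.flushed.push_back(queue.front());
    queue.pop_front();
    has_more = !queue.empty();
  }

  // Phase 2: on timeout, take everything that is left in one go, so that
  // no queued session is left behind without a leader.
  if (has_more) {
    result.timed_out = true;
    while (!queue.empty()) {
      result.flushed.push_back(queue.front());
      queue.pop_front();
    }
  }
  return result;
}
```

Either way the leader ends up flushing every session that was queued when it stopped polling; the timeout only bounds how long it keeps picking up late arrivals one by one.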

###sync stage

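change_stage(thd, Stage_manager::SYNC_STAGE, wait_queue, &LOCK_log, &LOCK_sync)

|-->stage_manager.enroll_for(stage, queue, leave_mutex)  //adds the current thread to the m_queue[SYNC_STAGE] queue and releases LOCK_log (in the ordered_commit listing above, LOCK_log is instead held until after the sync when sync_binlog is 1). As before, the leader of the SYNC_STAGE queue returns immediately, while the other threads do a condition wait.

|-->the leader thread acquires LOCK_sync

final_queue= stage_manager.fetch_queue_for(Stage_manager::SYNC_STAGE);  //fetches the SYNC_STAGE queue and empties it; the result is used mainly in the commit stage

std::pair<bool, bool> result= sync_binlog_file(false);  //syncs the binlog file (if sync_binlog is set)

Put simply, the flush stage forms N batches of sessions, and in the sync stage a new leader emerges from those N batches to perform the most expensive operation, the sync.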

###commit stage

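The commit stage is governed by the binlog_order_commits parameter.

When binlog_order_commits is off, LOCK_sync is simply unlocked and each session enters the InnoDB commit phase on its own (in the finish_commit(thd) call that follows). This does not guarantee that the binlog order and the transaction commit order match. If you do not rely on the binlog information recorded in InnoDB's ibdata, you can turn this option off for a small performance gain.

The commit stage is entered only when binlog_order_commits is on, as described below.

change_stage(thd, Stage_manager::COMMIT_STAGE,final_queue, &LOCK_sync, &LOCK_commit)

|-->enters the new COMMIT_STAGE queue and releases LOCK_sync; the new leader acquires LOCK_commit and the other sessions wait

THD *commit_queue= stage_manager.fetch_queue_for(Stage_manager::COMMIT_STAGE);  //fetches and empties the COMMIT_STAGE queue

process_commit_stage_queue(thd, commit_queue, flush_error)

|-->iterates over all the threads and calls ha_commit_low->innobase_commit to commit them one by one in the InnoDB layer

After these steps, LOCK_commit is released.

stage_manager.signal_done(final_queue);

|-->sets the pending flag of every pending thread to false (thd->transaction.flags.pending= false) and broadcasts on m_cond_done to wake the pending threads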
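The wake-up handshake between leader and followers can be illustrated with standard C++ primitives. This is only a sketch of the pattern, assuming std::mutex and std::condition_variable in place of MySQL's own mutex and condition wrappers; Session and DoneSignal are hypothetical names, and only the pending flag and the next_to_commit link are modeled.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for THD with just the fields this sketch needs.
struct Session {
  bool pending = true;
  Session *next_to_commit = nullptr;
};

class DoneSignal {
 public:
  // Follower side: block until the leader has cleared our pending flag.
  void wait_until_done(Session *s) {
    std::unique_lock<std::mutex> lk(lock_done_);
    cond_done_.wait(lk, [s] { return !s->pending; });
  }

  // Leader side: clear the flag on every session in the chain under the
  // mutex, then broadcast so that all waiting followers re-check it.
  void signal_done(Session *queue) {
    {
      std::lock_guard<std::mutex> lk(lock_done_);
      for (Session *s = queue; s != nullptr; s = s->next_to_commit)
        s->pending = false;
    }
    cond_done_.notify_all();
  }

 private:
  std::mutex lock_done_;
  std::condition_variable cond_done_;
};
```

Waiting on a predicate rather than on the bare condition variable means a follower that arrives after the broadcast still sees pending == false and returns immediately, so no wake-up is lost.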

(void) finish_commit(thd);  //if binlog_order_commits is FALSE, the storage engine transaction is committed in this step; it also updates the GTID information

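InnoDB's group commit is similar to MariaDB's: there are only two syncs, one in the prepare phase and one for the binlog file (with the "double 1" configuration, that is, sync_binlog=1 and innodb_flush_log_at_trx_commit=1). To guarantee that, at rotate time, the redo log for every event in the previous binlog has been flushed to disk, the function new_file_impl calls the following code:
if (DBUG_EVALUATE_IF("expire_logs_always", 0, 1)
&& (error= ha_flush_logs(NULL)))
goto end;

ha_flush_logs calls the storage engine interface to flush the log files.

References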

http://dimitrik.free.fr/blog/archives/2012/06/mysql-performance-binlog-group-commit-in-56.html

http://mysqlmusings.blogspot.com/2012/06/binary-log-group-commit-in-mysql-56.html

MySQL 5.6.10 source code

Time: 2024-09-29 16:14:51
