VENUS: The Root Cause of MSMR Performance Issue

(BTW: I named this report VENUS as a wish that we reach our goal smoothly.)

High-level Description

My mentor Heming and I have found the root cause of the MSMR performance issue:

When the proxy module sends operations to the consensus module, the interaction between the Nagle algorithm and TCP Delayed Acknowledgement causes a delay of about 40ms. The consensus side holds its ACK for up to 40ms (the delayed-ACK timeout) because it expects to receive more data, but the proxy side sometimes sends only a small amount of data. On the proxy side, the remaining packets wait until that delayed ACK arrives, which takes about 40ms. Only after the proxy side receives the ACK can the next data be sent. (For details, see this link:
http://jerrypeng.me/2013/08/mythical-40ms-delay-and-tcp-nodelay/)
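
The sender-side half of this interaction can be sketched as a small decision function. This is only a simplified model of Nagle's rule, not the kernel's implementation; the name nagle_can_send_now and its parameters are hypothetical and do not come from msmr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of Nagle's rule (hypothetical helper, not kernel code):
 * a segment may go out immediately if it is full-sized, or if nothing
 * previously sent is still waiting for an ACK. Otherwise it is held back
 * until the outstanding ACK arrives. */
static bool nagle_can_send_now(size_t pending_bytes, size_t mss,
                               bool unacked_data_in_flight)
{
    assert(mss > 0);
    if (pending_bytes >= mss)
        return true;                 /* full segment: send right away */
    return !unacked_data_in_flight;  /* small segment: only if nothing is unACKed */
}
```

Under this model, a small write that follows another small, still-unacknowledged write is held, and with the receiver delaying its ACK the hold lasts until the delayed-ACK timer fires.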

In Details

First, through experiments, we confirmed that in some cases sending an operation from proxy to consensus takes too much time.

In our MSMR this happens when we increase concurrency, for example with apache_ab.cfg, server_count = 1, client_count = 1, ab -n3000 -c6.

I suppose that only a small amount of data is sent to the consensus side at first, while more data stays buffered in the output evbuffer of the proxy side. I don't know the deeper reason why only a small amount of data is sent, but it may be related to many connections (P_CONNECT, P_CLOSE, P_SEND) calling bufferevent_write concurrently.

The following experiment data illustrates this:

Warning: output evbuffer has 72 bytes left when P_CLOSE

Warning: P_CLOSE timestamp: 1422506138.861455

Warning: output evbuffer has 144 bytes left when P_CLOSE

Warning: P_CLOSE timestamp: 1422506138.861485

Warning: output evbuffer has 72 bytes left when P_CONNECT

Warning: P_CONNECT timestamp: 1422506138.861739

Warning: output evbuffer has 72 bytes left when P_CONNECT

Warning: P_CONNECT timestamp: 1422506138.861895

Warning: output evbuffer has 144 bytes left when P_CLOSE

Warning: P_CLOSE timestamp: 1422506138.861979

Warning: output evbuffer has 72 bytes left when P_CONNECT

Warning: P_CONNECT timestamp: 1422506138.862039

Warning: output evbuffer has 80 bytes left when P_SEND

Warning: P_SEND timestamp: 1422506138.862140

Warning: output evbuffer has 242 bytes left when P_SEND

Warning: P_SEND timestamp: 1422506138.862165

Warning: output evbuffer has 80 bytes left when P_SEND

Warning: P_SEND timestamp: 1422506138.862276

Warning: output evbuffer has 242 bytes left when P_SEND

Warning: P_SEND timestamp: 1422506138.862300

Warning: output evbuffer has 80 bytes left when P_SEND

Warning: P_SEND timestamp: 1422506138.862317

Warning: output evbuffer has 162 bytes left when P_SEND

Warning: P_SEND timestamp: 1422506138.862353

Warning: replica_on_read time 1422506138.899324

Warning: consensus input evbuffer has 1158 bytes left

Warning from proxy to consensus: 37889, timestamp: 1422506138.899344

Warning from proxy to consensus: 38062, timestamp: 1422506138.899547

Warning from proxy to consensus: 37874, timestamp: 1422506138.899613

Warning from proxy to consensus: 37819, timestamp: 1422506138.899714

Warning from proxy to consensus: 37798, timestamp: 1422506138.899777

Warning from proxy to consensus: 37796, timestamp: 1422506138.899835

Warning from proxy to consensus: 37753, timestamp: 1422506138.899893

Warning from proxy to consensus: 37849, timestamp: 1422506138.900014

Warning from proxy to consensus: 37868, timestamp: 1422506138.900144

Warning from proxy to consensus: 37960, timestamp: 1422506138.900260

Warning from proxy to consensus: 38084, timestamp: 1422506138.900401

Warning from proxy to consensus: 38165, timestamp: 1422506138.900518

We can see that all 12 abnormal operations spend about 40ms going from proxy to consensus (the logged values of roughly 38000 appear to be in microseconds). According to the recorded output evbuffer data, on the proxy side the 12 operations had already left the output evbuffer at the beginning of that roughly 40ms interval. On the consensus side, I set a timeout event and found that the operations which had left the output evbuffer of the proxy side had not yet arrived at the input evbuffer of the consensus side. The data above also shows only one line about the input evbuffer of the consensus side. So the problem lies at a lower level, below the application.
The reason is that the Nagle algorithm and TCP Delayed Acknowledgement together cause a 40ms delay.
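
The gap can be checked directly from the timestamps in the log. A minimal sketch, assuming the timestamps are seconds with a six-digit microsecond fraction; struct log_ts and elapsed_us are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A log timestamp split into whole seconds and microseconds. */
struct log_ts {
    int64_t sec;
    int64_t usec;
};

/* Return end - start in microseconds. Integer arithmetic avoids the
 * rounding that a double would introduce at this magnitude. */
static int64_t elapsed_us(struct log_ts start, struct log_ts end)
{
    assert(start.usec >= 0 && start.usec < 1000000);
    assert(end.usec >= 0 && end.usec < 1000000);
    return (end.sec - start.sec) * 1000000 + (end.usec - start.usec);
}
```

For the last P_SEND at 1422506138.862353 and replica_on_read at 1422506138.899324, this gives 36971 microseconds, which is about 37ms between the last write on the proxy side and the read on the consensus side.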

Solution

Use TCP_NODELAY to disable the Nagle algorithm.

You can get the updated msmr project from the bcmatrix branch of msmr on Heming's GitHub.

In proxy.c

change connect_consensus to:

//void consensus_on_read(struct bufferevent* bev,void*);
// needs <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h> (TCP_NODELAY) and <stdio.h> (perror)
void connect_consensus(proxy_node* proxy){
    // tom add 20150129: create the socket ourselves so we have an fd for setsockopt
    evutil_socket_t fd;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if(fd < 0){
        perror("Proxy-side: socket");
        return;
    }
    // a socket handed to bufferevent_socket_new should be non-blocking
    evutil_make_socket_nonblocking(fd);
    proxy->con_conn = bufferevent_socket_new(proxy->base,fd,BEV_OPT_CLOSE_ON_FREE);
    // end tom add
    // proxy->con_conn = bufferevent_socket_new(proxy->base,-1,BEV_OPT_CLOSE_ON_FREE);
    bufferevent_setcb(proxy->con_conn,NULL,NULL,consensus_on_event,proxy);
    bufferevent_enable(proxy->con_conn,EV_READ|EV_WRITE|EV_PERSIST);
    bufferevent_socket_connect(proxy->con_conn,(struct sockaddr*)&proxy->sys_addr.c_addr,proxy->sys_addr.c_sock_len);
    // tom add 20150129: disable Nagle on the proxy-to-consensus connection
    int enable = 1;
    if(setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (void*)&enable, sizeof(enable)) < 0)
        perror("Proxy-side: TCP_NODELAY SETTING ERROR");
    // end tom add

    return;
}

In replica.c

Add this code

    int enable = 1;
    if(setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (void*)&enable, sizeof(enable)) < 0)
        printf("Consensus-side: TCP_NODELAY SETTING ERROR!\n");
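
To confirm that the option really took effect on either side, it can be read back with getsockopt. A minimal sketch, assuming a POSIX socket API; enable_and_check_nodelay is a hypothetical helper and is not part of msmr.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP_NODELAY on fd and read the option back.
 * Returns 1 if Nagle is disabled, 0 if it is still enabled,
 * and -1 on error (errno is set by the failing call). */
static int enable_and_check_nodelay(int fd)
{
    int enable = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &enable, sizeof(enable)) < 0)
        return -1;

    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;

    return value != 0;
}
```

Calling it with the same fd used in the snippets above should return 1 once the option is set; a return of -1 means one of the socket calls failed.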

Results

apache_ab.cfg     s1c1 n3000c6    add TCP_NODELAY

apache_ab.cfg     s1c1 n3000c6    NO TCP_NODELAY

We can see that with TCP_NODELAY enabled, performance is good and stable.

Time: 2024-12-19 01:29:25
