The Performance Analysis of MENCODER with PARROT

1 ENVIRONMENTS

              Table 1  ENV A

CPU              Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz, 10-core
Memory           100 GB
Linux Version    3.11.0
Ubuntu Version   12.04

2 RESULTS

We find that MENCODER with PARROT is faster than MENCODER without PARROT over four threads.

3 EXPLANATIONS

We need to explain why MENCODER with PARROT is faster than MENCODER without PARROT over four threads. [1] tells us that the number of Synchronization Context Switches with PARROT is much smaller than without PARROT, because PARROT's round-robin scheduling reduces contention and its synchronizations use a more efficient wait that combines spin- and block-waits (a generic sketch of this wait pattern follows the step list below). So we take the following steps to find the detailed reasons.

(a) First, is Syn Context Switches the main factor that leads to the result?

(b) If it is, please see (c-e); otherwise jump to (f).

(c) What's the relationship between Syn Context Switches and thread contention (a.k.a. lock contention)?

(d) What causes the contention?

(e) How does PARROT's round-robin scheduling reduce contention?

(f) We need to find other factors.
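
As background for step (e), here is a generic sketch of the spin-then-block wait mentioned above. This is the common pattern, not PARROT's actual code: spin briefly in user space in case the wait is short, and only then fall back to a blocking wait so a long wait does not burn CPU.

#include <pthread.h>
#include <atomic>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static std::atomic<bool> ready(false);

void spin_then_block_wait() {
    // Phase 1: bounded spinning. If the flag flips quickly we avoid the
    // context switch entirely; this is the cheap common case.
    for (int i = 0; i < 4096; i++) {
        if (ready.load(std::memory_order_acquire)) return;
        __asm__ __volatile__("pause");   // x86 spin-loop hint
    }
    // Phase 2: the wait is long; block so the scheduler can run others.
    pthread_mutex_lock(&m);
    while (!ready.load(std::memory_order_acquire))
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
}

void signal_ready() {
    pthread_mutex_lock(&m);
    ready.store(true, std::memory_order_release);
    pthread_cond_broadcast(&c);
    pthread_mutex_unlock(&m);
}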

3.1 Preparations

We can't read through the source code of MENCODER, and we don't know exactly where PARROT takes effect. So I want to find a simple model that is similar to MENCODER. It is located in xtern/apps/mplayer/ and contains speedup-example.cpp (with PARROT) and speedup-example-mutex-only.cpp (without PARROT).
We call it EXAMPLE, and I wrote the following build script (mk) for speedup-example.cpp.

#!/bin/bash
cd $XTERN_ROOT/apps/mplayer
gcc speedup-example.cpp -o speedup-example -O2 -g \
    -I$XTERN_ROOT/include \
    -L$XTERN_ROOT/dync_hook -Wl,--rpath,$XTERN_ROOT/dync_hook \
    -lxtern-annot \
    -lpthread
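
To get the two configurations, I run the binaries in two ways (this follows the usual xtern convention; the exact library path may differ in other checkouts): with PARROT, the runtime is preloaded, e.g. LD_PRELOAD=$XTERN_ROOT/dync_hook/interpose.so ./speedup-example; the without-PARROT numbers come from running speedup-example-mutex-only directly.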

So we can run the model. Before that, however, we should argue that EXAMPLE can explain MENCODER. We can regard MENCODER as the common producer-consumer idiom: the producer reads blocks from the mp4 file, and the consumers convert them to avi. When many consumers strive for the same lock, performance is lost; this is the key point. If we want to explore the reasons for the speedup of MENCODER with PARROT, we can simplify it as above, and EXAMPLE is a good choice: it lets many threads access the same critical resource and uses PARROT to get a speedup (a toy sketch of this model follows). Additionally, EXAMPLE and MENCODER both enjoy big speedups. So if we can find the detailed reasons for the speedup of EXAMPLE with PARROT, the same reasons should also apply to MENCODER.
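
To make the analogy concrete, here is a toy producer-consumer sketch (hypothetical code, not MENCODER's source): one producer enqueues "blocks", several consumers dequeue them under a single mutex, so all consumers strive for one lock exactly as described above.

#include <pthread.h>
#include <queue>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static std::queue<int> blocks;   // stands in for the mp4 blocks
static bool done = false;

static void* producer(void*) {
    for (int i = 0; i < 1000; i++) {           // "read" 1000 blocks from mp4
        pthread_mutex_lock(&mu);
        blocks.push(i);
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&mu);
    }
    pthread_mutex_lock(&mu);
    done = true;                               // no more blocks will arrive
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&mu);
    return 0;
}

static void* consumer(void*) {
    for (;;) {
        pthread_mutex_lock(&mu);
        while (blocks.empty() && !done)
            pthread_cond_wait(&nonempty, &mu); // consumers pile up here
        if (blocks.empty()) {                  // done and drained: exit
            pthread_mutex_unlock(&mu);
            return 0;
        }
        int b = blocks.front();
        blocks.pop();
        pthread_mutex_unlock(&mu);
        (void)b;                               // "convert" block b to avi here
    }
}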

3.2 Is Syn Context Switches the main factor that leads to the result?

Now we can use EXAMPLE to explain MENCODER. We run EXAMPLE over four threads in ENV A. The results are as follows.

Table 2 Results of EXAMPLE over four threads, ENV A

                 WITHOUT PARROT   WITH PARROT
Elapsed Time     7.85s            4.36s

Very good: PARROT gets a high speedup. Now let's run EXAMPLE over a single thread without PARROT.

Table 3 Results of EXAMPLE without PARROT over 1 thread, ENV A

                 WITHOUT PARROT
Elapsed Time     1.38s

Now we try to explain the results. First, let’s see the source code.

speedup-example-mutex-only.cpp

void* thread_func(void* arg) {
    ...
    for (int j = 0; j < M; j++) {
        pthread_mutex_lock(&mutex);
        // This is the key of speedup for parrot: the mutex needs to be
        // a little bit congested.
        for (long i = 0; i < loops; i++)
            sum += i;
        pthread_mutex_unlock(&mutex);
        ...
    }
}

int main(int argc, char* argv[]) {
    ...
    for (unsigned i = 0; i < N; ++i) {
        ret = pthread_create(&th[i], NULL, thread_func, (void*)i);
        assert(!ret && "pthread_create() failed!");
    }
    ...
    exit(0);
}

According to Amdahl's law, since the loop body runs entirely inside the critical section, the execution is almost fully serialized, so the Elapsed Time without PARROT over four threads should be about four times that over a single thread. However, it is clearly more than four times, while the execution with PARROT enjoys a good speedup and stays within that four-fold bound.
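
To put numbers on this (assuming each thread performs the same amount of work as the single-thread run, so the fully serialized ideal scales linearly with the thread count): 4 × 1.38s = 5.52s. Without PARROT, the measured 7.85s is about 42% above that ideal, so something beyond pure serialization is being paid for; with PARROT, the measured 4.36s is within the ideal bound. Now let's use VTune to find out where the extra time goes.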

Fig. 1. VTune results: Threads=1, without PARROT

Fig. 2. VTune results: Threads=4, without PARROT

Fig. 3. VTune results: Threads=4, with PARROT

Fig. 1 and Fig. 2 show that as the number of threads increases, the number of Synchronization Context Switches grows. Fig. 2 and Fig. 3 show that PARROT reduces the number of Context Switches. In other words, the Elapsed Time increases as Synchronization Context Switches increase and decreases as they are reduced. From this we conclude that Syn Context Switches is the main factor behind the result we are trying to explain.
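
As a cheap cross-check of VTune's numbers, we can also read the context-switch counters the kernel keeps per process. A minimal sketch using getrusage(2) (my own helper, not part of EXAMPLE); calling it just before exit(0) prints the totals:

#include <cstdio>
#include <sys/resource.h>

// Voluntary switches happen when a thread gives up the CPU (e.g. blocking
// on a lock); involuntary ones happen when the scheduler preempts it. The
// voluntary count is the one that tracks synchronization behavior.
void print_ctxt_switches(const char* tag) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        std::printf("%s: voluntary=%ld involuntary=%ld\n",
                    tag, ru.ru_nvcsw, ru.ru_nivcsw);
}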

3.3 What’s the relationship between Syn Context Switches and threads contention?

We now know that Syn Context Switches is the main factor. We also know that thread contention grows with the number of threads. So what is the relationship between Syn Context Switches and thread contention? Contention not only causes serialization, but also brings the overhead of thread switches [2]. From VTune (Fig. 4) we can see that threads spend most of their Wait Time on the Mutex lock, which is caused by thread contention.

So if we can reduce the contention, or use a more efficient wait, we will get a higher speedup. To do either, we first need to know how the contention arises.

3.4 What causes the contention?

Let's look at the source code of EXAMPLE again. Every thread executes the following code:

pthread_mutex_lock(&mutex);
for (long i = 0; i < loops; i++) sum += i;
pthread_mutex_unlock(&mutex);

When many threads strive for the same lock, contention happens: only one thread can hold the lock while all the others must wait. So if we want to reduce contention, we should not let too many threads pile up on the same lock at once; we should adopt a scheduling policy instead of letting the threads race for the lock arbitrarily.
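
One simple way to see this contention directly, independent of VTune, is to count how often a thread fails to get the lock on its first attempt. A sketch (my own instrumentation, not in EXAMPLE):

#include <pthread.h>
#include <cstdio>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long contended = 0, total = 0;   // both updated while holding the lock

static void locked_work(long loops, long* sum) {
    int first_try_failed = (pthread_mutex_trylock(&mutex) != 0);
    if (first_try_failed)
        pthread_mutex_lock(&mutex);     // fall back to the blocking lock
    contended += first_try_failed;
    total += 1;
    for (long i = 0; i < loops; i++)    // same critical section as EXAMPLE
        *sum += i;
    pthread_mutex_unlock(&mutex);
}

// After joining all threads:
//     printf("%ld of %ld acquisitions contended\n", contended, total);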

3.5 How does PARROT’s round-robin scheduling reduce contention?

First, let’s see the Locks and Waits of EXAMPLE with PARROT and without PARROT from VTune.

(VTune screenshot: Locks and Waits, with PARROT)

(VTune screenshot: Locks and Waits, without PARROT)

Without PARROT, threads spend most of their Wait Time on the Mutex lock, while with PARROT they spend most of it on a Condition Variable instead. This is because PARROT adds

pthread_cond_wait(&cond, &mutex);                (2)

which adds the calling thread to a wait queue, and then uses

pthread_cond_broadcast(&cond);                   (3)

to wake all the waiting threads up together. Let's explain the benefit. Thread A finishes its access and unlocks the shared resource. Next, many threads would strive for the lock, A itself included. Instead, PARROT makes A wait via (2), and then makes B, C and D wait in turn, so the contention is reduced. When all four threads lie in the wait queue, (3) lets them run again, and over time the number of threads striving for the lock stays low instead of spiking after every unlock.
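
A minimal round-based sketch of this behavior (a simplification: PARROT's real scheduler enforces a deterministic round-robin order, while this is just a plain barrier built from (2) and (3)):

#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static const int N = 4;      // number of worker threads
static int waiting  = 0;     // threads that finished the current round
static int round_no = 0;     // current round number

void access_then_wait() {
    pthread_mutex_lock(&mutex);
    /* ... access the shared resource here ... */
    int my_round = round_no;
    if (++waiting == N) {
        // Last thread of the round: reset and release the whole batch,
        // corresponding to pthread_cond_broadcast in (3).
        waiting = 0;
        round_no++;
        pthread_cond_broadcast(&cond);
    } else {
        // Park instead of racing straight back to the lock; this is the
        // pthread_cond_wait in (2). The round check guards against
        // spurious wakeups.
        while (my_round == round_no)
            pthread_cond_wait(&cond, &mutex);
    }
    pthread_mutex_unlock(&mutex);
}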

So PARROT reduces contention by scheduling threads in order.

4 CONCLUSION

PARROT tells threads A, B, C and D: when you finish an access, wait for the others, because you are cooperators. As a result the enemy, Syn Context Switches, gradually diminishes. Although the threads have to wait, performance improves as a whole.

5 REFERENCES

[1] Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, Junfeng Yang, Garth Gibson, and Randy Bryant. "Parrot: a Practical Runtime for Deterministic, Stable, and Reliable Threads". Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13).

[2] Weiming Zhou, Multi-core Computing and Programming, Wuhan: Huazhong University of Science and Technology Press, 2009 (in Chinese).
