Performance analysis of our own full blown HTTP server

In the previous post, Let's do our own full blown HTTP server with Netty 4, you and I were excited to build our own web server. So far so good. But how good?
Given an ordinary notebook

cat /proc/cpuinfo | grep model\ name
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
cat /proc/meminfo | grep MemTotal
MemTotal:        3956836 kB

(OK, so there are only 4 cores with Hyper-Threading.)
We get the following numbers

build/default/weighttp -n 1000000 -k -c 100 http://localhost:9999/
weighttp - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 100 concurrent requests, 1000000 total requests
progress:  10% done
progress:  20% done
progress:  30% done
progress:  40% done
progress:  50% done
progress:  60% done
progress:  70% done
progress:  80% done
progress:  90% done
progress: 100% done

finished in 20 sec, 195 millisec and 236 microsec, 49516 req/s, 7251 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 149967800 bytes total, 145967800 bytes http, 4000000 bytes data

49516 req/s. Is this good? I'd say very good. But we need some baseline to compare against, and as that baseline we'll use nginx serving a small static file. 'Why would serving a file be faster than serving a piece of memory?' you may ask. With sendfile turned on, nginx, after sending the response header, just instructs the kernel to send the file to the socket; after the first hit the file sits in the Linux buffer cache, so the kernel pushes pages straight from memory to the network card. Zero-copy. If instead you send a dynamically generated response from memory, the kernel first has to copy that data from userspace to kernelspace. So the nginx setup below should be fast.
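For reference, the same zero-copy path is reachable from the JVM: FileChannel.transferTo maps to sendfile(2) on Linux, and Netty 4 wraps it as DefaultFileRegion. A minimal sketch, with class and method names of my own choosing:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {

    // Stream a file to a connected socket without copying it through userspace.
    static void sendFile(SocketChannel socket, String path) throws IOException {
        try (FileChannel file = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo() maps to sendfile(2): the kernel moves page-cache
                // pages straight to the socket, no userspace copy involved
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}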
Nginx config

worker_processes  4;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;
    sendfile        on;
    keepalive_timeout  65;

    server {
        listen       6666;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }
    }
}

The same test

build/default/weighttp -n 1000000 -k -c 100 http://localhost:6666/
weighttp - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 100 concurrent requests, 1000000 total requests
...
progress: 100% done

finished in 11 sec, 402 millisec and 913 microsec, 87696 req/s, 72705 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 848950295 bytes total, 236950295 bytes http, 612000000 bytes data

87696 req/s. Very well; we are not that far behind.
Let's draw another baseline. Python is considered slow, but it is not as slow as you might think. Add to your nginx config

http {
    upstream test {
        server unix:///home/adolgarev/uwsgi.sock;
    }
    ...
    server {
        location /test {
            uwsgi_pass  test;
            include     uwsgi_params;
        }
        ...
    }
}

The uwsgi protocol is far more efficient than HTTP, which is why we use uwsgi_pass instead of proxy_pass.
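To see why, here is a sketch of the uwsgi request framing as I read the protocol documentation (names are illustrative): a fixed 4-byte header followed by length-prefixed key/value pairs, so the backend parses no request line and no free-form headers.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class UwsgiPacketSketch {

    // A uwsgi packet: header (modifier1, little-endian uint16 body size,
    // modifier2) followed by length-prefixed key/value pairs.
    // modifier1 = 0 marks the body as WSGI request variables.
    static byte[] encode(Map<String, String> vars) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (Map.Entry<String, String> e : vars.entrySet()) {
            writeString(body, e.getKey());
            writeString(body, e.getValue());
        }
        ByteArrayOutputStream packet = new ByteArrayOutputStream();
        packet.write(0);                  // modifier1 = 0: WSGI request vars
        writeUint16(packet, body.size()); // datasize, little-endian
        packet.write(0);                  // modifier2 = 0
        packet.write(body.toByteArray(), 0, body.size());
        return packet.toByteArray();
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        writeUint16(out, b.length);
        out.write(b, 0, b.length);
    }

    static void writeUint16(ByteArrayOutputStream out, int n) {
        out.write(n & 0xff);        // low byte first
        out.write((n >> 8) & 0xff);
    }

    public static void main(String[] args) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("REQUEST_METHOD", "GET");
        vars.put("PATH_INFO", "/test");
        System.out.println(encode(vars).length + " bytes on the wire");
    }
}

Every string carries an explicit length, so the parser never has to scan for delimiters the way an HTTP parser does.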
Start uWSGI with a small WSGI test program

cat test.py
def application(env, start_response):
    # Python 2-style WSGI: native str doubles as bytes here;
    # under Python 3 the body would need to be [b"Hello World"]
    start_response('200 OK', [('Content-Type', 'text/html')])
    return ["Hello World"]
uwsgi --socket uwsgi.sock --wsgi-file test.py --master --processes 4 --threads 2

The same test

build/default/weighttp -n 100000 -k -c 100 http://localhost:6666/test
weighttp - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 100 concurrent requests, 100000 total requests
...
progress: 100% done

finished in 9 sec, 264 millisec and 26 microsec, 10794 req/s, 1844 kbyte/s
requests: 100000 total, 100000 started, 100000 done, 100000 succeeded, 0 failed, 0 errored
status codes: 100000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 17495280 bytes total, 15395280 bytes http, 2100000 bytes data

10794 req/s. I'd say that's good enough for Python.
Another thing to compare against is GlassFish Server Open Source Edition 4.0 with Servlets 3.1, powered by Grizzly, a competitor of Netty.
Small Servlet

package test;

import java.io.IOException;
import java.io.PrintWriter;
import javax.json.Json;
import javax.json.stream.JsonGenerator;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/test")
public class TestServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        resp.setContentType("text/plain");
        PrintWriter writer = resp.getWriter();
        try (JsonGenerator gen = Json.createGenerator(writer)) {
            gen.writeStartObject().write("res", "Ok").writeEnd();
        }
    }

}

And the same test

build/default/weighttp -n 1000000 -k -c 100 http://localhost:8080/WebApplication1/test
weighttp - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 100 concurrent requests, 1000000 total requests
...
progress: 100% done

finished in 41 sec, 745 millisec and 917 microsec, 23954 req/s, 6855 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 293074157 bytes total, 281074157 bytes http, 12000000 bytes data

23954 req/s, somewhere in between. But surprisingly, if we remove the JSON dependency it runs blazingly fast.

@WebServlet("/test")
public class TestServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        resp.setContentType("text/plain");
        try (PrintWriter writer = resp.getWriter()) {
            writer.print("Ok");
        }
    }

}

build/default/weighttp -n 1000000 -k -c 100 http://localhost:8080/WebApplication1/test
weighttp - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 100 concurrent requests, 1000000 total requests
...
progress: 100% done

finished in 10 sec, 944 millisec and 231 microsec, 91372 req/s, 25169 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 282074100 bytes total, 280074100 bytes http, 2000000 bytes data

An amazing 91372 req/s, faster than nginx. The figure makes sense: 280074100 HTTP bytes over 1000000 requests is roughly 280 bytes per response, so the whole response fits into a single memory page and even into a single packet with an MTU of 1500.
I'm still not satisfied. Let's profile our server.

We eliminate the most obvious hot spots.
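The post doesn't spell out which hot spots the profiler exposed, but a classic one in a Netty 4 HTTP handler is re-encoding the same response body on every request. A minimal sketch of the usual fix, assuming a constant payload (class and method names are mine, not the original server's):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

public final class CachedPayload {

    // Build the invariant body once; unreleasableBuffer() turns refCnt
    // bookkeeping into a no-op so the cached bytes survive every write.
    private static final ByteBuf CONTENT = Unpooled.unreleasableBuffer(
            Unpooled.copiedBuffer("Hello World", CharsetUtil.US_ASCII));

    static ByteBuf content() {
        // duplicate() shares the underlying bytes but gives each caller
        // independent reader/writer indices
        return CONTENT.duplicate();
    }
}

Handlers then write CachedPayload.content() instead of allocating and encoding a fresh buffer per request, which keeps allocation and garbage collection off the hot path. With fixes of this sort in place, the same test gives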

build/default/weighttp -n 1000000 -k -c 100 http://localhost:9999/
weighttp - a lightweight and simple webserver benchmarking tool

starting benchmark...
spawning thread #1: 100 concurrent requests, 1000000 total requests
...
progress: 100% done

finished in 9 sec, 298 millisec and 269 microsec, 107546 req/s, 10292 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 98000000 bytes total, 94000000 bytes http, 4000000 bytes data

107546 req/s, and we can do even better.

Conclusion? Our server is fast. But servlets are fast too; taking into account asynchronous support (Servlets 3.0) and non-blocking I/O (Servlets 3.1), they are a good choice for HTTP. Our server has one obvious advantage: one can change any piece of the pipeline and even add support for other protocols, as the sketch below shows.
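In Netty 4 the pipeline is just an ordered list of handlers, so swapping the HTTP codec for another protocol's codec, or inserting a TLS or compression stage, is a one-line change. A minimal sketch (the trailing handler is a placeholder for the application logic, not the original server's class):

import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;

public class PipelineSketch extends ChannelInitializer<SocketChannel> {

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          .addLast(new HttpServerCodec())               // swap this codec to speak another protocol
          .addLast(new HttpObjectAggregator(65536))     // optional stage: aggregate full messages
          .addLast(new ChannelInboundHandlerAdapter()); // placeholder for the application handler
    }
}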
