Why malloc+memset is slower than calloc?

https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc/

The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.
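
To see the difference yourself, here is a minimal timing sketch (my own code, not the asker's; it assumes a POSIX system with clock_gettime() and uses an arbitrary 256 MiB size):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Time one allocation strategy and return the elapsed seconds.
static double time_alloc(void *(*alloc)(size_t), size_t size, int use_memset)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *p = alloc(size);
    if (p && use_memset)
        memset(p, 0, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(p);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

// calloc() takes two arguments, so wrap it to match malloc()'s shape.
static void *calloc_wrapper(size_t size) { return calloc(1, size); }

int main(void)
{
    size_t size = 256 * 1024 * 1024;  // 256 MiB
    printf("malloc+memset: %f s\n", time_alloc(malloc, size, 1));
    printf("calloc:        %f s\n", time_alloc(calloc_wrapper, size, 0));
    return 0;
}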

Understanding this requires a short tour of the memory system.

Quick tour of memory

There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...

Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to hundreds of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask the kernel for more memory when the pool runs dry. However, since the program you're asking about is allocating a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.
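
On glibc specifically, that threshold is even tunable. A small sketch, assuming glibc's mallopt() and an illustrative 1 MiB threshold:

#include <malloc.h>   // glibc-specific: mallopt() and M_MMAP_THRESHOLD
#include <stdlib.h>

int main(void)
{
    // Ask glibc to serve any request of 1 MiB or more with mmap()
    // instead of carving it out of an existing pool.
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

    void *small = malloc(16);                 // comes from a pool
    void *large = malloc(256 * 1024 * 1024);  // goes straight to mmap()
    free(small);
    free(large);
    return 0;
}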

The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with each other's memory. This is called memory protection; it has been dirt common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory; instead, it asks the kernel for memory using a system call like mmap() or sbrk(). The kernel gives RAM to each process by modifying the page table.
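
For the large-allocation case, the library's request to the kernel looks roughly like this sketch (it assumes Linux's MAP_ANONYMOUS flag and skips most error handling):

#define _DEFAULT_SOURCE   // for MAP_ANONYMOUS on some systems
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 256 * 1024 * 1024;  // 256 MiB

    // Ask the kernel for anonymous, private memory not backed by any file.
    // The kernel guarantees that these pages read as zeroes.
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    // ...use the memory...
    munmap(p, size);
    return 0;
}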

The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.
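
The 4 KiB page size is the common case on x86, but it is a property of the system you can query; a trivial sketch for a POSIX system:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // Page size used by the kernel's page table, typically 4096 bytes on x86.
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}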

How it doesn't work

Here's how allocating 256 MiB does not work:

  1. Your process calls calloc() and asks for 256 MiB.
  2. The standard library calls mmap() and asks for 256 MiB.
  3. The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.
  4. The standard library zeroes the RAM with memset() and returns from calloc().
  5. Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.

How it actually works

The above process would work, but it just doesn't happen this way. There are three major differences.

  • When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.
  • There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in to assign RAM to those addresses and resume your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.
  • Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.

The final process looks more like this:

  1. Your process calls calloc() and asks for 256 MiB.
  2. The standard library calls mmap() and asks for 256 MiB.
  3. The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.
  4. The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.
  5. Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.

If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and it explains why calloc() is faster than malloc()+memset(). If you end up using the memory anyway, calloc() is still faster than malloc()+memset(), but the difference is not quite so ridiculous.
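
You can watch the lazy allocation happen by looking at the process's resident set size before and after touching the memory. A sketch for Linux, where getrusage() reports ru_maxrss in kilobytes; the exact numbers will vary:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

// Peak resident set size of this process, in kilobytes on Linux.
static long max_rss_kib(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;
}

int main(void)
{
    size_t size = 256 * 1024 * 1024;  // 256 MiB

    char *p = calloc(1, size);
    if (!p)
        return 1;
    printf("after calloc: %ld KiB resident\n", max_rss_kib());

    memset(p, 0, size);  // touches every page: page faults, RAM gets assigned
    printf("after memset: %ld KiB resident\n", max_rss_kib());

    free(p);
    return 0;
}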


This doesn't always work

Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.

This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data in it from old memory that was used and freed with free(), so calloc() could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.
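
A small sketch of the pool case (the output depends entirely on the allocator, so treat it as an illustration rather than a guarantee):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    // Dirty a small block, then free it back into the allocator's pool.
    char *a = malloc(64);
    memset(a, 0xAB, 64);
    free(a);

    // This malloc() may reuse the same pool block; its contents are
    // indeterminate, and often still hold the old 0xAB bytes.
    unsigned char *b = malloc(64);
    printf("malloc: first byte is 0x%02X (indeterminate)\n", b[0]);
    free(b);

    // calloc() of the same size must return zeroes, so it may need to
    // memset() the pool block itself.
    unsigned char *c = calloc(1, 64);
    printf("calloc: first byte is 0x%02X (always zero)\n", c[0]);
    free(c);
    return 0;
}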

Dispelling some wrong answers

Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need zeroed memory later. Linux does not zero memory ahead of time, and DragonFly BSD recently removed this feature from its kernel. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle time isn't enough to explain the large performance differences anyway.

The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:

#include <stddef.h>     // size_t
#include <stdint.h>     // uintptr_t
#include <emmintrin.h>  // SSE2 intrinsics

// Simplified sketch of a typical memset() for an SSE2-capable x86.
void *my_memset(void *dest, int c, size_t len)
{
    unsigned char *d = dest;
    // one byte at a time, until dest is 16-byte aligned...
    while (len > 0 && ((uintptr_t)d & 15)) {
        *d++ = (unsigned char)c;
        len -= 1;
    }
    // now write big chunks at a time (processor-specific)...
    // glibc uses SSE2 when available; this writes 16 aligned bytes per loop
    __m128i chunk = _mm_set1_epi8((char)c);
    while (len >= 16) {
        _mm_store_si128((__m128i *)d, chunk);
        d += 16;
        len -= 16;
    }
    // the tail is smaller than a chunk, so one byte at a time
    while (len > 0) {
        *d++ = (unsigned char)c;
        len -= 1;
    }
    return dest;
}

As you can see, memset() is very fast, and you're not really going to get anything better for large blocks of memory.

The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).

Party trick

Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.

What happens if you add memset()?
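
A sketch of that party trick, assuming a 64-bit Linux machine; the 256 MiB step size is arbitrary:

#include <stdio.h>
#include <stdlib.h>
// #include <string.h> if you add the memset()

int main(void)
{
    size_t chunk = 256 * 1024 * 1024;  // 256 MiB per allocation
    size_t total = 0;

    for (;;) {
        void *p = calloc(1, chunk);    // try malloc(chunk) too
        if (p == NULL)
            break;
        // memset(p, 0, chunk);        // now add this and compare
        total += chunk;
    }
    printf("allocated %zu MiB before failure\n", total / (1024 * 1024));
    return 0;
}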
