Storing 3 Billion Entries in a Java Map with 16 GB of RAM (repost)

While discussing how to deduplicate a data set, someone suggested building a b-tree on a direct buffer. That made me think there should already be an off-the-shelf solution, and I found a good one:

MapDB: http://www.mapdb.org/

The following is from kotek.net: http://kotek.net/blog/3G_map

3 billion items in Java Map with 16 GB RAM

One rainy evening I meditated about memory management in Java and how effectively Java collections utilise memory. I made a simple experiment: how many entries can I insert into a Java Map with 16 GB of RAM?

The goal of this experiment is to investigate the internal overhead of collections, so I decided to use small keys and small values. All tests were made on 64-bit Kubuntu 12.04 Linux. The JVM was 64-bit Oracle Java 1.7.0_09-b05 with HotSpot 23.5-b02. There is an option to use compressed pointers (-XX:+UseCompressedOops), which is on by default on this JVM.

First is a naive test with java.util.TreeMap. It inserts numbers into the map until it runs out of memory and ends with an exception. The JVM setting for this test was -Xmx15G.

import java.util.*;

// naive test: keep inserting incrementing long keys (with empty-string values)
// until the JVM throws OutOfMemoryError
Map m = new TreeMap();
for(long counter=0;;counter++){
  m.put(counter,"");
  if(counter%1000000==0) System.out.println(""+counter);
}

This example ended at 172 million entries. Near the end the insertion rate slowed down due to excessive GC activity. On a second run I replaced TreeMap with HashMap; it ended at 182 million.

Java's default collections are not the most memory-efficient option, so let's try a memory-optimized one. I chose LongHashMap from MapDB, which uses primitive long keys and is optimized for a small memory footprint. The JVM setting is again -Xmx15G.

import org.mapdb.*;

// heap-based LongHashMap from MapDB: primitive long keys, no boxing
LongMap m = new LongHashMap();
for(long counter=0;;counter++){
  m.put(counter,"");
  if(counter%1000000==0) System.out.println(""+counter);
}

This time the counter stopped at 276 million entries. Again, near the end the insertion rate slowed down due to excessive GC activity.
It looks like this is the limit for heap-based collections; garbage collection simply brings too much overhead.
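
A rough back-of-the-envelope check, assuming the 15 GB heap was essentially full when the OutOfMemoryError hit: 172 million TreeMap entries work out to roughly 90 bytes per entry, and 276 million LongHashMap entries to roughly 60 bytes per entry. The 8 bytes of actual key data are dwarfed by object headers, references and per-entry node objects (plus, for TreeMap and HashMap, the boxed Long keys).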

Now it is time to pull out the big gun :-). We can always go off-heap, where the GC cannot see our data. Let me introduce you to MapDB: it provides a concurrent TreeMap and HashMap backed by a database engine. It supports various storage modes; one of them is off-heap memory. (Disclaimer: I am the MapDB author.)
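
If "off-heap" sounds abstract, here is a minimal sketch of the underlying idea using a plain direct ByteBuffer. It says nothing about MapDB's actual storage layout; it only illustrates why the GC stops being a bottleneck once data lives outside the heap:

import java.nio.ByteBuffer;

// allocate 1 GB outside the Java heap; the GC only tracks the tiny
// ByteBuffer wrapper object, never the gigabyte of data behind it
ByteBuffer buf = ByteBuffer.allocateDirect(1 << 30);

// records are laid out manually as raw bytes at chosen offsets
buf.putLong(0, 42L);           // write an 8-byte key at offset 0
long key = buf.getLong(0);     // read it back

The price is that keys and values must be serialized into raw bytes and free space managed by hand, which is exactly the bookkeeping MapDB's storage engine takes over.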

So let's run the previous example, but now with an off-heap Map. First come a few lines to configure and open the database; they open a direct-memory store with transactions disabled. The next line creates a new Map within the db.

import org.mapdb.*;

// direct-memory (off-heap) store with transactions disabled
DB db = DBMaker
   .newDirectMemoryDB()
   .transactionDisable()
   .make();

Map m = db.getTreeMap("test");
for(long counter=0;;counter++){
  m.put(counter,"");
  if(counter%1000000==0) System.out.println(""+counter);
}

This is an off-heap Map, so we need different JVM settings: -XX:MaxDirectMemorySize=15G -Xmx128M. This test runs out of memory at 980 million records.

But MapDB can do better. The problem in the previous sample is record fragmentation: a b-tree node changes its size on each insert. The workaround is to hold b-tree nodes in a cache for a short moment before they are written, which reduces record fragmentation to a minimum (a conceptual sketch of this write-behind idea follows after the result below). So let's change the DB configuration:

DB db = DBMaker
     .newDirectMemoryDB()
     .transactionDisable()
     .asyncFlushDelay(100)   // hold dirty nodes in a write cache for 100 ms before storing them
     .make();

Map m = db.getTreeMap("test");

This time it runs out of memory at 1,738 million records. The speed is just amazing: 1.7 billion items are inserted within 31 minutes.
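
To make the write-behind trick concrete, here is a conceptual sketch. It is not MapDB's actual code; NodeStore and its update method are hypothetical stand-ins for the record store. The point is that many consecutive resizes of the same b-tree node collapse into a single write of the node in its final size:

import java.util.HashMap;
import java.util.Map;

class WriteBehindBuffer {
    interface NodeStore { void update(long recid, byte[] node); }   // hypothetical store API

    private final Map<Long, byte[]> dirty = new HashMap<Long, byte[]>(); // recid -> latest node bytes
    private final NodeStore store;

    WriteBehindBuffer(NodeStore store) { this.store = store; }

    // called on every b-tree node update; later versions simply overwrite earlier ones
    void nodeUpdated(long recid, byte[] serializedNode) {
        dirty.put(recid, serializedNode);
    }

    // called after a short delay (e.g. 100 ms): each dirty node is written exactly once
    void flush() {
        for (Map.Entry<Long, byte[]> e : dirty.entrySet())
            store.update(e.getKey(), e.getValue());
        dirty.clear();
    }
}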

MapDB can do even better. Let's increase the b-tree node size from 32 to 120 entries and enable transparent compression:

DB db = DBMaker
     .newDirectMemoryDB()
     .transactionDisable()
     .asyncFlushDelay(100)
     .compressionEnable()    // transparent record compression
     .make();

// 120 = maximal b-tree node size (default is 32)
Map m = db.createTreeMap("test",120, false, null, null, null);

This example runs out of memory at a whopping 3,315 million records. It is slower because of compression, but it still finishes within a few hours. I could probably make some optimizations (custom serializers etc.) and push the number of entries to somewhere around 4 billion.

Maybe you wonder how all those entries can fit there. The answer is delta-key compression. Also, inserting incremental keys (already ordered) into a B-Tree is the best-case scenario, and MapDB is slightly optimized for it (a minimal sketch of the delta-key idea follows after the next example). The worst-case scenario is inserting keys in random order:

UPDATE added later: there was a bit of confusion about compression. Delta-key compression is active by default in all examples; in the previous example I additionally activated zlib-style compression.

DB db = DBMaker
     .newDirectMemoryDB()
     .transactionDisable()
     .asyncFlushDelay(100)
     .make();

Map m = db.getTreeMap("test");

// worst case: keys arrive in random order
Random r = new Random();
for(long counter=0;;counter++){
    m.put(r.nextLong(),"");
    if(counter%1000000==0) System.out.println(""+counter);
}

But even with random order MapDB manages to store 651 million records, nearly 4 times more than heap-based collections.
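
Back to delta-key compression: 3,315 million ordered entries in at most 15 GB of direct memory means well under 5 bytes per entry on average, which is only possible because consecutive keys differ by tiny amounts. Here is a minimal, purely illustrative varint delta encoding; it is not MapDB's actual storage format:

import java.io.ByteArrayOutputStream;

// sorted keys: store each key as the difference to the previous one,
// encoded varint-style (7 bits per byte, high bit = "more bytes follow")
long[] keys = {1_000_000L, 1_000_001L, 1_000_002L, 1_000_005L};
ByteArrayOutputStream out = new ByteArrayOutputStream();

long prev = 0;
for (long k : keys) {
    long delta = k - prev;              // small for ordered, counter-like keys
    while ((delta & ~0x7FL) != 0) {
        out.write((int) ((delta & 0x7F) | 0x80));
        delta >>>= 7;
    }
    out.write((int) delta);
    prev = k;
}
System.out.println(out.size() + " bytes instead of " + keys.length * 8);  // prints 6 instead of 32

Random keys produce large deltas that do not compress this way, which is one reason the random-order run stores far fewer entries than the ordered one.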

This little exercise does not have much purpose; it is just one of many I do while optimizing MapDB. Perhaps most amazing is that insertion speed was actually very good, and MapDB can compete with memory-based collections.
