Wang Jialin on Spark Performance Optimization, Season 9: Spark Tungsten Memory Usage Thoroughly Demystified

Contents:

1. What exactly is a Page;

2. The two concrete implementations of a Page;

3. A source-code walkthrough of how Pages are used;

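==========What exactly is a Page in Tungsten============

1. Spark does not actually contain a class named Page! In essence, a Page is a data structure (comparable to a Stack or a List). From the OS point of view, a Page represents a block of memory in which data can be stored, and the OS holds many different Pages. To fetch data, you first determine which Page holds it; once that Page is found, you read the data out of it according to specific rules (for example, the data's offset and length);

2. So what exactly is a Page in Spark? Reading the source, starting from TaskMemoryManager.java and following it into MemoryBlock.java, shows that MemoryBlock is the Page: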
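The two-step lookup described above (find the Page, then use offset and length inside it) can be sketched in a few lines. This is only an illustration of the general idea, not Spark code: the class `PageStoreSketch` and its methods are hypothetical names, and a plain `byte[]` stands in for a memory block.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: fixed-size pages addressed by (page number, offset, length). */
final class PageStoreSketch {
    private final int pageSize;
    private final List<byte[]> pages = new ArrayList<>();

    PageStoreSketch(int pageSize) {
        this.pageSize = pageSize;
    }

    /** Allocates a new zero-filled page and returns its page number. */
    int allocatePage() {
        pages.add(new byte[pageSize]);
        return pages.size() - 1;
    }

    /** Copies data into the given page, starting at offset. */
    void write(int pageNumber, int offset, byte[] data) {
        System.arraycopy(data, 0, pages.get(pageNumber), offset, data.length);
    }

    /** Step 1: select the page. Step 2: copy out `length` bytes starting at `offset`. */
    byte[] read(int pageNumber, int offset, int length) {
        byte[] out = new byte[length];
        System.arraycopy(pages.get(pageNumber), offset, out, 0, length);
        return out;
    }
}
```

A caller keeps only the triple (page number, offset, length) for each record and never a direct reference to the record itself.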

public class MemoryBlock extends MemoryLocation {

  private final long length;

  /**
   * Optional page number; used when this MemoryBlock represents a page allocated by a
   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
   * which lives in a different package.
   */
  public int pageNumber = -1;

  public MemoryBlock(@Nullable Object obj, long offset, long length) {
    super(obj, offset);
    this.length = length;
  }

  /**
   * Returns the size of the memory block.
   */
  public long size() {
    return length;
  }

  /**
   * Creates a memory block pointing to the memory used by the long array.
   */
  public static MemoryBlock fromLongArray(final long[] array) {
    return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, array.length * 8);
  }
}

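MemoryBlock represents a Page whose data may live on-heap or off-heap. That is why the first constructor parameter above is nullable: an on-heap block has a backing object, while an off-heap block has none.

In on-heap mode, memory is allocated by HeapMemoryAllocator: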

/**
 * A simple {@link MemoryAllocator} that can allocate up to 16GB using a JVM long primitive array.
 */
public class HeapMemoryAllocator implements MemoryAllocator {

  @GuardedBy("this")
  private final Map<Long, LinkedList<WeakReference<MemoryBlock>>> bufferPoolsBySize =
    new HashMap<>();

  private static final int POOLING_THRESHOLD_BYTES = 1024 * 1024;

  /**
   * Returns true if allocations of the given size should go through the pooling mechanism and
   * false otherwise.
   */
  private boolean shouldPool(long size) {
    // Very small allocations are less likely to benefit from pooling.
    return size >= POOLING_THRESHOLD_BYTES;
  }

  @Override
  public MemoryBlock allocate(long size) throws OutOfMemoryError {
    if (shouldPool(size)) {
      synchronized (this) {
        final LinkedList<WeakReference<MemoryBlock>> pool = bufferPoolsBySize.get(size);
        if (pool != null) {
          while (!pool.isEmpty()) {
            final WeakReference<MemoryBlock> blockReference = pool.pop();
            final MemoryBlock memory = blockReference.get();
            if (memory != null) {
              assert (memory.size() == size);
              return memory;
            }
          }
          bufferPoolsBySize.remove(size);
        }
      }
    }
    // Round the requested byte count up to a whole number of 8-byte longs.
    long[] array = new long[(int) ((size + 7) / 8)];
    return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, size);
  }

  // (the rest of the class is omitted from this excerpt)
}
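The pooling and rounding logic in allocate() above can be tried out in isolation. The following is a simplified sketch under stated assumptions, not Spark's implementation: `PooledLongArrayAllocatorSketch`, `wordsFor`, and the `free(sizeBytes, array)` signature are hypothetical, and raw `long[]` arrays take the place of MemoryBlock. It keeps the same ideas: only allocations of at least 1 MB are pooled, pooled buffers are held through weak references so the GC may still reclaim them, and byte sizes are rounded up to whole 8-byte words.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified size-keyed pool of long[] buffers held via weak references. */
final class PooledLongArrayAllocatorSketch {
    static final long POOLING_THRESHOLD_BYTES = 1024 * 1024;

    private final Map<Long, Deque<WeakReference<long[]>>> poolsBySize = new HashMap<>();

    /** Number of 8-byte words needed to hold sizeBytes, rounded up. */
    static int wordsFor(long sizeBytes) {
        return (int) ((sizeBytes + 7) / 8);
    }

    synchronized long[] allocate(long sizeBytes) {
        if (sizeBytes >= POOLING_THRESHOLD_BYTES) {
            Deque<WeakReference<long[]>> pool = poolsBySize.get(sizeBytes);
            if (pool != null) {
                while (!pool.isEmpty()) {
                    long[] cached = pool.pop().get();
                    if (cached != null) {
                        return cached; // reused as-is; old contents are not cleared here
                    }
                }
                poolsBySize.remove(sizeBytes); // every reference had been cleared by the GC
            }
        }
        return new long[wordsFor(sizeBytes)];
    }

    /** Returns a buffer to the pool; small buffers are simply left to the GC. */
    synchronized void free(long sizeBytes, long[] array) {
        if (sizeBytes >= POOLING_THRESHOLD_BYTES) {
            poolsBySize.computeIfAbsent(sizeBytes, k -> new ArrayDeque<>())
                       .push(new WeakReference<>(array));
        }
    }
}
```

Because the pool holds only weak references, a freed buffer is reused only if it has not been collected in the meantime; otherwise a fresh array is allocated.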

In off-heap mode, memory is allocated by UnsafeMemoryAllocator:

/**
 * A simple {@link MemoryAllocator} that uses {@code Unsafe} to allocate off-heap memory.
 */
public class UnsafeMemoryAllocator implements MemoryAllocator {

  @Override
  public MemoryBlock allocate(long size) throws OutOfMemoryError {
    // No backing object: the "offset" is the absolute native address.
    long address = Platform.allocateMemory(size);
    return new MemoryBlock(null, address, size);
  }

  // (the rest of the class is omitted from this excerpt)
}
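Both allocators end up describing memory as a (base object, offset) pair, and the same accessor can serve both modes. The sketch below illustrates this with `sun.misc.Unsafe` obtained by reflection; it is an assumption-laden demo rather than Spark's Platform class, and `UnifiedAddressingSketch`, `onHeapDemo`, and `offHeapDemo` are hypothetical names. It relies on the documented Unsafe behavior that a null base makes the offset an absolute address. Recent JDKs may print deprecation warnings for these calls.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

/** Hypothetical demo: one (base, offset) addressing scheme for on-heap and off-heap memory. */
final class UnifiedAddressingSketch {
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    /** On-heap: base is the long[] itself, offset is relative to the start of the object. */
    static long onHeapDemo(long value) {
        long[] array = new long[4];
        long offset = UNSAFE.arrayBaseOffset(long[].class)
            + 2L * UNSAFE.arrayIndexScale(long[].class); // position of array[2]
        UNSAFE.putLong(array, offset, value);
        return array[2]; // the write is visible through the ordinary array
    }

    /** Off-heap: base is null, offset is the absolute native address. */
    static long offHeapDemo(long value) {
        long address = UNSAFE.allocateMemory(32);
        try {
            UNSAFE.putLong(null, address + 16, value);
            return UNSAFE.getLong(null, address + 16);
        } finally {
            UNSAFE.freeMemory(address); // off-heap memory is never reclaimed by the GC
        }
    }
}
```

The explicit freeMemory call is the main practical difference: off-heap blocks must be released by hand, whereas on-heap arrays are collected once unreachable.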

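==========How is a Page used?============

1. TaskMemoryManager wraps Pages in order to locate data. For on-heap data it first finds the backing object and then uses the offset within that object to reach the exact address; for off-heap data the address is located directly;

2. A key question is how a given piece of data is identified. This requires a concrete addressing algorithm.

TaskMemoryManager is already written with the expectation that a single machine may one day have 32 TB of memory.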
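One way to identify data, as the points above require, is to pack a page number and an offset within that page into a single 64-bit value. The sketch below assumes a 13-bit page number and a 51-bit offset, which is the split I understand Spark's TaskMemoryManager to use; that detail is not stated in this article, and the class `PageAddressSketch` and its method names are hypothetical.

```java
/** Hypothetical sketch: packing (page number, offset in page) into one 64-bit address. */
final class PageAddressSketch {
    static final int PAGE_NUMBER_BITS = 13;                   // assumed split, see lead-in
    static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;     // 51
    static final int PAGE_TABLE_SIZE = 1 << PAGE_NUMBER_BITS; // 8192 addressable pages
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    /** Upper 13 bits hold the page number, lower 51 bits hold the offset. */
    static long encode(int pageNumber, long offsetInPage) {
        if (pageNumber < 0 || pageNumber >= PAGE_TABLE_SIZE) {
            throw new IllegalArgumentException("page number out of range: " + pageNumber);
        }
        if (offsetInPage < 0 || offsetInPage > OFFSET_MASK) {
            throw new IllegalArgumentException("offset out of range: " + offsetInPage);
        }
        return (((long) pageNumber) << OFFSET_BITS) | offsetInPage;
    }

    /** Unsigned shift, so page numbers with the top bit set decode correctly. */
    static int decodePageNumber(long address) {
        return (int) (address >>> OFFSET_BITS);
    }

    static long decodeOffset(long address) {
        return address & OFFSET_MASK;
    }
}
```

For example, `encode(5, 1024L)` yields a value from which `decodePageNumber` returns 5 and `decodeOffset` returns 1024. A reader of the encoded address uses the page number to find the Page in a page table, then applies the offset to that Page's base object (on-heap) or base address (off-heap).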

Instructor Wang Jialin's contact card:

China's No. 1 Spark expert

Sina Weibo: http://weibo.com/ilovepains

WeChat official account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [email protected]

Time: 2024-08-10 23:28:48
