Our HDFS production environment runs Hadoop-0.21 on about 200 machines, with roughly 70 million blocks. After every few months of uptime the NameNode falls into frequent full GCs, and in the end we have no choice but to restart it. Suspecting a memory leak in the NameNode, we dumped object histograms of the NameNode process before and after a restart.
Before the restart on 07-10:
num #instances #bytes class name
----------------------------------------------
1: 59262275 3613989480 [Ljava.lang.Object;
...
10: 8549361 615553992 org.apache.hadoop.hdfs.server.namenode.BlockInfoUnderConstruction
11: 5941511 427788792 org.apache.hadoop.hdfs.server.namenode.INodeFileUnderConstruction
...
After the restart on 07-10:
num #instances #bytes class name
----------------------------------------------
1: 44188391 2934099616 [Ljava.lang.Object;
...
23: 721763 51966936 org.apache.hadoop.hdfs.server.namenode.BlockInfoUnderConstruction
24: 620028 44642016 org.apache.hadoop.hdfs.server.namenode.INodeFileUnderConstruction
...
The output shows that before the restart the objects occupying the most NameNode memory were [Ljava.lang.Object;, [C, org.apache.hadoop.hdfs.server.namenode.INodeFile, org.apache.hadoop.hdfs.server.namenode.BlockInfo, [B, org.apache.hadoop.hdfs.server.namenode.BlockInfoUnderConstruction$ReplicaUnderConstruction, and so on. Their reference relationships are as follows:
According to the NameNode's internal logic, INodeFileUnderConstruction and BlockInfoUnderConstruction are both intermediate states: once a file's write is closed, INodeFileUnderConstruction becomes INodeFile and BlockInfoUnderConstruction becomes BlockInfo. Since the cluster's file-write load cannot possibly be on the order of 1,000,000 files per second, the NameNode very likely has a memory leak.
When a file is closed, the client calls the NameNode's complete method, at which point the BlocksMap mapping changes from BlockInfoUnderConstruction -> BlockInfoUnderConstruction to BlockInfo -> BlockInfo (for short: oldBlock -> oldBlock is replaced by newBlock -> newBlock). BlocksMap handles this state transition as follows:
BlockInfo replaceBlock(BlockInfo newBlock) {
  BlockInfo currentBlock = map.get(newBlock);
  assert currentBlock != null : "the block if not in blocksMap";
  // replace block in data-node lists
  for (int idx = currentBlock.numNodes() - 1; idx >= 0; idx--) {
    DatanodeDescriptor dn = currentBlock.getDatanode(idx);
    Log.info("Replace Block[" + newBlock + "] to Block[" + currentBlock
        + "] in DataNode[" + dn + "]");
    dn.replaceBlock(currentBlock, newBlock);
  }
  // replace block in the map itself
  map.put(newBlock, newBlock);
  return newBlock;
}
Block overrides hashCode and equals so that newBlock and oldBlock share the same hashCode and newBlock.equals(oldBlock) is true.
The intent of the code above is to replace the map entry (oldBlock, oldBlock) with (newBlock, newBlock). However, when HashMap handles a put whose key already matches an existing key (matching meaning newKey.hashCode == oldKey.hashCode && (oldKey == newKey || oldKey.equals(newKey))), it replaces only the value. So oldBlock -> oldBlock becomes oldBlock -> newBlock: the map still holds a reference to oldBlock as the key, oldBlock is never freed, and that is the memory leak.
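The value-only replacement can be demonstrated with a minimal, self-contained sketch. The Block class below is our stand-in for HDFS's Block (identity-based equals/hashCode on a block id); the class and field names are ours, not Hadoop's:

```java
import java.util.HashMap;
import java.util.Map;

public class KeyLeakDemo {
    // Stand-in for Block: equality is based on id only, so two
    // distinct instances can be "the same key" to a HashMap.
    static final class Block {
        final long id;
        final String state; // extra payload, like the UnderConstruction data
        Block(long id, String state) { this.id = id; this.state = state; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Block && ((Block) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<Block, Block> map = new HashMap<>();
        Block oldBlock = new Block(1L, "UNDER_CONSTRUCTION");
        Block newBlock = new Block(1L, "COMPLETE");
        map.put(oldBlock, oldBlock);

        // put() finds the equal existing key and replaces only the value:
        map.put(newBlock, newBlock);

        Block storedValue = map.get(newBlock);
        Block storedKey = map.keySet().iterator().next();
        System.out.println("value is newBlock:     " + (storedValue == newBlock)); // true
        System.out.println("key is still oldBlock: " + (storedKey == oldBlock));   // true
    }
}
```

After the second put, the map's sole entry is oldBlock -> newBlock: the old key instance (and everything it references) stays reachable.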
See the HashMap source:
/**
 * Associates the specified value with the specified key in this map.
 * If the map previously contained a mapping for the key, the old
 * value is replaced.
 *
 * @param key key with which the specified value is to be associated
 * @param value value to be associated with the specified key
 * @return the previous value associated with <tt>key</tt>, or
 *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
 *         (A <tt>null</tt> return can also indicate that the map
 *         previously associated <tt>null</tt> with <tt>key</tt>.)
 */
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value; // only the value is replaced; e.key is left untouched
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
We suggest fixing BlocksMap as follows; the patch has been submitted, see: https://issues.apache.org/jira/browse/HDFS-7592
BlockInfo replaceBlock(BlockInfo newBlock) {
+   /**
+    * change to fix bug about memory leak of NameNode by huahua.xu
+    * 2013-08-17 15:20
+    */
    BlockInfo currentBlock = map.get(newBlock);
    assert currentBlock != null : "the block if not in blocksMap";
    // replace block in data-node lists
    for (int idx = currentBlock.numNodes() - 1; idx >= 0; idx--) {
      DatanodeDescriptor dn = currentBlock.getDatanode(idx);
      dn.replaceBlock(currentBlock, newBlock);
    }
    // replace block in the map itself: remove the stale entry first,
    // so oldBlock is no longer retained as the key
+   map.remove(newBlock);
    map.put(newBlock, newBlock);
    return newBlock;
}
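Why remove-then-put closes the leak can be sketched with the same stand-in Block class as before (class and method names here are ours, not Hadoop's):

```java
import java.util.HashMap;
import java.util.Map;

public class RemoveThenPutDemo {
    // Stand-in for Block: equality is based on id only.
    static final class Block {
        final long id;
        Block(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Block && ((Block) o).id == id;
        }
    }

    // Mirrors the patched replaceBlock logic: removing the entry first
    // evicts the old key object, so put() stores the new key instance.
    static Block replace(Map<Block, Block> map, Block newBlock) {
        map.remove(newBlock); // drops the entry whose key is oldBlock
        map.put(newBlock, newBlock);
        return newBlock;
    }

    public static void main(String[] args) {
        Map<Block, Block> map = new HashMap<>();
        Block oldBlock = new Block(1L);
        Block newBlock = new Block(1L);
        map.put(oldBlock, oldBlock);

        replace(map, newBlock);
        Block storedKey = map.keySet().iterator().next();
        System.out.println("key is newBlock: " + (storedKey == newBlock)); // true
    }
}
```

After remove-then-put, neither the key nor the value references oldBlock, so it becomes garbage-collectable.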
As of now, this patch has been deployed to our production cluster and has resolved the memory leak.
----------------------------------------------------------------------
This is a fairly serious bug we hit in our company's production environment; it has been submitted to the community: https://issues.apache.org/jira/browse/HDFS-7592 . Sharing it here with everyone. For questions, contact me by email via QQ 576072986.