Implementation and Application of BloomFilter in HBase

Storage in the HFile

BloomFilterChunk

  /** Bytes (B) in the array. This actually has to fit into an int. */
  protected long byteSize;
  /** Number of hash functions */
  protected int hashCount;
  /** Hash type */
  protected final int hashType;
  /** Hash Function */
  protected final Hash hash;
  /** Keys currently in the bloom */
  protected int keyCount;
  /** Max Keys expected for the bloom */
  protected int maxKeys;
  /** Bloom bits */
  protected ByteBuffer bloom;
  /** The type of bloom */
  protected BloomType bloomType;

  public void allocBloom() {
    if (this.bloom != null) {
      throw new IllegalArgumentException("can only create bloom once.");
    }
    this.bloom = ByteBuffer.allocate((int)this.byteSize);
    assert this.bloom.hasArray();
  }

  public void add(Cell cell) {
    /*
     * For faster hashing, use combinatorial generation
     * http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
     */
    int hash1;
    int hash2;
    HashKey<Cell> hashKey;
    if (this.bloomType == BloomType.ROW) {
      hashKey = new RowBloomHashKey(cell);
      hash1 = this.hash.hash(hashKey, 0);
      hash2 = this.hash.hash(hashKey, hash1);
    } else {
      hashKey = new RowColBloomHashKey(cell);
      hash1 = this.hash.hash(hashKey, 0);
      hash2 = this.hash.hash(hashKey, hash1);
    }
    setHashLoc(hash1, hash2);
  }

  private void setHashLoc(int hash1, int hash2) {
    for (int i = 0; i < this.hashCount; i++) {
      long hashLoc = Math.abs((hash1 + i * hash2) % (this.byteSize * 8));
      set(hashLoc);
    }

    ++this.keyCount;
  }

  void set(long pos) {
    int bytePos = (int)(pos / 8);
    int bitPos = (int)(pos % 8);
    byte curByte = bloom.get(bytePos);
    curByte |= BloomFilterUtil.bitvals[bitPos];
    bloom.put(bytePos, curByte);
  }

  static boolean get(int pos, ByteBuffer bloomBuf, int bloomOffset) {
    int bytePos = pos >> 3; //pos / 8
    int bitPos = pos & 0x7; //pos % 8
    // TODO access this via Util API which can do Unsafe access if possible(?)
    byte curByte = bloomBuf.get(bloomOffset + bytePos);
    curByte &= BloomFilterUtil.bitvals[bitPos];
    return (curByte != 0);
  }
  • A ByteBuffer serves as the actual backing store for the bit array, so both get and set must translate a bit position in two steps: first compute the index into the byte[], then the bit index within that byte.
  • Because the number of hash functions is configurable, the class uses a single hash function with two different seeds to derive hash1 and hash2, then loops over the configured hash count and computes the n hash locations with the formula hash1 + i * hash2 (see the sketch below).
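
To make the two bullet points concrete, here is a minimal, self-contained sketch of the same technique. It is not HBase's actual class; names such as ToyBloom, setBit, and testBit are invented for illustration, and the hash function is just a stand-in for HBase's seeded Hash implementations:

  import java.nio.ByteBuffer;

  public class ToyBloom {
    // Bit masks for positions 0..7 within a byte, mirroring BloomFilterUtil.bitvals.
    private static final byte[] BITVALS =
        {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, (byte) 0x80};

    private final ByteBuffer bits;
    private final long bitSize;
    private final int hashCount;

    ToyBloom(int byteSize, int hashCount) {
      this.bits = ByteBuffer.allocate(byteSize);
      this.bitSize = (long) byteSize * 8;
      this.hashCount = hashCount;
    }

    void add(byte[] key) {
      // Two base hashes; the second is seeded with the first, as in BloomFilterChunk.add.
      int hash1 = hash(key, 0);
      int hash2 = hash(key, hash1);
      for (int i = 0; i < hashCount; i++) {
        // Combinatorial generation: g_i(x) = hash1 + i * hash2 (mod m).
        long pos = Math.abs((hash1 + (long) i * hash2) % bitSize);
        setBit(pos);
      }
    }

    boolean mightContain(byte[] key) {
      int hash1 = hash(key, 0);
      int hash2 = hash(key, hash1);
      for (int i = 0; i < hashCount; i++) {
        long pos = Math.abs((hash1 + (long) i * hash2) % bitSize);
        if (!testBit(pos)) {
          return false; // at least one probe bit is clear: definitely absent
        }
      }
      return true; // all probe bits set: possibly present
    }

    private void setBit(long pos) {
      int bytePos = (int) (pos >> 3); // pos / 8: index into the byte[]
      int bitPos = (int) (pos & 0x7); // pos % 8: bit index within that byte
      bits.put(bytePos, (byte) (bits.get(bytePos) | BITVALS[bitPos]));
    }

    private boolean testBit(long pos) {
      int bytePos = (int) (pos >> 3);
      int bitPos = (int) (pos & 0x7);
      return (bits.get(bytePos) & BITVALS[bitPos]) != 0;
    }

    // Stand-in for HBase's Hash.hash(HashKey, seed); any seeded hash works for the demo.
    private static int hash(byte[] data, int seed) {
      int h = seed ^ 0x9747b28c;
      for (byte b : data) {
        h = h * 31 + b;
      }
      return h;
    }
  }

With, say, new ToyBloom(1024, 4), each added key sets at most four bits in an 8192-bit array, and a lookup that finds any of its four probe bits clear can report the key absent with certainty.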

Writing to the HFile during flush

During flush, every Cell is added to the Bloom filter; when the HFile is closed, the Bloom Blocks and the Bloom Meta are persisted into the HFile together. The relevant code is shown below:

StoreFileWriter:

  @Override
  public void append(final Cell cell) throws IOException {
    appendGeneralBloomfilter(cell);
    appendDeleteFamilyBloomFilter(cell);
    writer.append(cell);
    trackTimestamps(cell);
  }

  public void close() throws IOException {
    boolean hasGeneralBloom = this.closeGeneralBloomFilter();
    boolean hasDeleteFamilyBloom = this.closeDeleteFamilyBloomFilter();

    writer.close();

    // Log final Bloom filter statistics. This needs to be done after close()
    // because compound Bloom filters might be finalized as part of closing.
    if (LOG.isTraceEnabled()) {
      LOG.trace((hasGeneralBloom ? "" : "NO ") + "General Bloom and " +
        (hasDeleteFamilyBloom ? "" : "NO ") + "DeleteFamily" + " was added to HFile " +
        getPath());
    }

  }
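
For reference, appendGeneralBloomfilter essentially delegates to a BloomContext that wraps the chunked Bloom filter writer. The following is a paraphrased sketch, not verbatim HBase source:

  // Paraphrased sketch of StoreFileWriter.appendGeneralBloomfilter: the
  // bloomContext (a RowBloomContext or RowColBloomContext, chosen by BloomType)
  // skips keys identical to the previous one and forwards new keys to the
  // Bloom filter writer, which ultimately calls BloomFilterChunk.add.
  private void appendGeneralBloomfilter(final Cell cell) throws IOException {
    if (this.generalBloomFilterWriter != null) {
      bloomContext.writeBloom(cell);
    }
  }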

HFileWriterImpl:

  private void addBloomFilter(final BloomFilterWriter bfw,
      final BlockType blockType) {
    if (bfw.getKeyCount() <= 0)
      return;

    if (blockType != BlockType.GENERAL_BLOOM_META &&
        blockType != BlockType.DELETE_FAMILY_BLOOM_META) {
      throw new RuntimeException("Block Type: " + blockType.toString() +
          "is not supported");
    }
    additionalLoadOnOpenData.add(new BlockWritable() {
      @Override
      public BlockType getBlockType() {
        return blockType;
      }

      @Override
      public void writeToBlock(DataOutput out) throws IOException {
        bfw.getMetaWriter().write(out);
        Writable dataWriter = bfw.getDataWriter();
        if (dataWriter != null)
          dataWriter.write(out);
      }
    });
  }
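
Note that addBloomFilter only queues a BlockWritable; the actual write-out happens when the HFile writer finishes. Roughly, paraphrasing the tail of HFileWriterImpl.close (exact names may differ across versions):

  // Load-on-open data supplied by higher levels, e.g. Bloom filters: each queued
  // BlockWritable is serialized as its own block in the load-on-open section, so
  // a reader can locate the Bloom meta via the trailer when it opens the file.
  for (BlockWritable w : additionalLoadOnOpenData) {
    blockWriter.writeBlock(w, outputStream); // invokes w.writeToBlock(...) and flushes
  }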

Filtering scanners with the Bloom filter during get

During a get, HFiles are filtered according to the Bloom filter type. The relevant code is shown below:

StoreFileReader.checkGeneralBloomFilter

        // Whether the primary Bloom key is greater than the last Bloom key
        // from the file info. For row-column Bloom filters this is not yet
        // a sufficient condition to return false.
        boolean keyIsAfterLast = (lastBloomKey != null);
        // hbase:meta does not have blooms. So we need not have special interpretation
        // of the hbase:meta cells.  We can safely use Bytes.BYTES_RAWCOMPARATOR for ROW Bloom
        if (keyIsAfterLast) {
          if (bloomFilterType == BloomType.ROW) {
            keyIsAfterLast = (Bytes.BYTES_RAWCOMPARATOR.compare(key, lastBloomKey) > 0);
          } else {
            keyIsAfterLast = (CellComparator.getInstance().compare(kvKey, lastBloomKeyOnlyKV)) > 0;
          }
        }

        if (bloomFilterType == BloomType.ROWCOL) {
          // Since a Row Delete is essentially a DeleteFamily applied to all
          // columns, a file might be skipped if using row+col Bloom filter.
          // In order to ensure this file is included an additional check is
          // required looking only for a row bloom.
          Cell rowBloomKey = PrivateCellUtil.createFirstOnRow(kvKey);
          // hbase:meta does not have blooms. So we need not have special interpretation
          // of the hbase:meta cells.  We can safely use Bytes.BYTES_RAWCOMPARATOR for ROW Bloom
          if (keyIsAfterLast
              && (CellComparator.getInstance().compare(rowBloomKey, lastBloomKeyOnlyKV)) > 0) {
            exists = false;
          } else {
            exists =
                bloomFilter.contains(kvKey, bloom, BloomType.ROWCOL) ||
                bloomFilter.contains(rowBloomKey, bloom, BloomType.ROWCOL);
          }
        } else {
          exists = !keyIsAfterLast
              && bloomFilter.contains(key, 0, key.length, bloom);
        }
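
Which of these branches is taken depends on the column family's Bloom filter type. As a usage example, the type can be set when creating a table through the Java client API (the table and family names below are illustrative):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
  import org.apache.hadoop.hbase.regionserver.BloomType;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CreateTableWithBloom {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        // BloomType.ROW is the default; ROWCOL also hashes the column qualifier,
        // enabling the ROWCOL branch in checkGeneralBloomFilter above.
        admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("mytable"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                .setBloomFilterType(BloomType.ROWCOL)
                .build())
            .build());
      }
    }
  }

ROWCOL filters only help gets that specify explicit qualifiers; for row-only gets, ROW is usually the better trade-off, since ROWCOL Bloom filters grow with the number of cells per row.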

Notes

HBase version: 2.1.7

Original article: https://www.cnblogs.com/andyhe/p/11732136.html
