Flink and HBase Interaction

1. Overview of HBase Connection Methods

The main approaches are:

  1. Reading and writing HBase with the plain Java API;
  2. Reading and writing HBase from Spark;
  3. Reading and writing HBase from Flink;
  4. Reading and writing HBase through Phoenix.

The first is the fairly low-level but efficient access path that HBase itself provides; the second and third are the ways Spark and Flink integrate with HBase; the last goes through JDBC via the third-party Phoenix plugin, and that Phoenix JDBC path can also be called from Spark and Flink.

Note:

This article uses HBase 2.1.2, Flink 1.7.2, and Scala 2.12.
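
For reference, a minimal sbt dependency sketch matching these versions (a sketch only — the artifact names follow the standard Flink/HBase Maven coordinates with the Scala 2.12 suffix; adapt it to Maven if that is what you use):

// build.sbt — hypothetical sketch, verify the versions against your own cluster
scalaVersion := "2.12.8"

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala"           % "1.7.2",
  "org.apache.flink" %% "flink-streaming-scala" % "1.7.2",
  "org.apache.flink" %% "flink-hbase"           % "1.7.2", // ships TableInputFormat, built against HBase 1.4.3
  "org.apache.flink" %% "flink-connector-kafka" % "1.7.2",
  "org.apache.hbase" %  "hbase-client"          % "2.1.2"
)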

2. Reading and Writing HBase with Flink Streaming and Flink DataSet

There are two ways to read HBase data in Flink:

  • Extend RichSourceFunction and override its methods (Flink Streaming)
  • Implement a custom TableInputFormat (Flink Streaming and Flink DataSet)

And there are likewise two ways to write data to HBase from Flink:

  • Extend RichSinkFunction and override its methods (Flink Streaming)
  • Implement the OutputFormat interface (Flink Streaming and Flink DataSet)

Note:

① Flink Streaming supports both options in each direction; Flink DataSet batch jobs, however, can only read by implementing TableInputFormat and only write by implementing OutputFormat.

② TableInputFormat lives in flink-hbase_2.12 (1.7.2), and that jar is built against HBase 1.4.3, while this project uses HBase 2.1.2, so TableInputFormat has to be rewritten.

2.1 Two Ways for Flink to Read HBase

Note: before reading from HBase, you can first run the Flink DataSet batch write from section 2.2.2 (the OutputFormat approach) so that the HBase table test already contains some rows; the records written there are of the form rowkey,name,age (e.g. 103,zhangsan,20).
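
If the test table does not exist yet, it needs the column family cf1 used throughout the examples. A minimal creation sketch with the HBase 2.x Admin API (the ZooKeeper address, table name and column family are the same assumptions used in the code below):

import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

object CreateTestTable {
  def main(args: Array[String]): Unit = {
    val config = HBaseConfiguration.create()
    config.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
    config.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    val conn = ConnectionFactory.createConnection(config)
    val admin = conn.getAdmin
    val tableName = TableName.valueOf("test")
    if (!admin.tableExists(tableName)) {
      // one column family "cf1", matching the read/write examples in this article
      val descriptor = TableDescriptorBuilder.newBuilder(tableName)
        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
        .build()
      admin.createTable(descriptor)
    }
    admin.close()
    conn.close()
  }
}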

2.1.1 Extending RichSourceFunction and overriding its methods:

package cn.swordfall.hbaseOnFlink

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.RichSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.hadoop.hbase.{Cell, HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Scan, Table}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._
/**
  * @Author: Yang JianQiu
  * @Date: 2019/2/28 18:05
  *
  * Uses HBase as the data source:
  * fetches rows from HBase and emits them as a stream.
  *
  * Reading from HBase
  * Approach 1: extend RichSourceFunction and override its methods
  */
class HBaseReader extends RichSourceFunction[(String, String)]{

  private var conn: Connection = null
  private var table: Table = null
  private var scan: Scan = null

  /**
    * Open the HBase client connection in the open() method.
    * @param parameters
    */
  override def open(parameters: Configuration): Unit = {
    val config: org.apache.hadoop.conf.Configuration = HBaseConfiguration.create()

    config.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
    config.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    config.setInt(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, 30000)
    config.setInt(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, 30000)

    val tableName: TableName = TableName.valueOf("test")
    val cf1: String = "cf1"
    conn = ConnectionFactory.createConnection(config)
    table = conn.getTable(tableName)
    scan = new Scan()
    scan.withStartRow(Bytes.toBytes("100"))
    scan.withStopRow(Bytes.toBytes("107"))
    scan.addFamily(Bytes.toBytes(cf1))
  }

  /**
    * run() comes from the Java interface SourceFunction; IDEA's Ctrl + O cannot generate this override automatically, so it has to be written by hand.
    * @param sourceContext
    */
  override def run(sourceContext: SourceContext[(String, String)]): Unit = {
    val rs = table.getScanner(scan)
    val iterator = rs.iterator()
    while (iterator.hasNext){
      val result = iterator.next()
      val rowKey = Bytes.toString(result.getRow)
      val sb: StringBuffer = new StringBuffer()
      for (cell:Cell <- result.listCells().asScala){
        val value = Bytes.toString(cell.getValueArray, cell.getValueOffset, cell.getValueLength)
        sb.append(value).append("_")
      }
      val valueString = sb.replace(sb.length() - 1, sb.length(), "").toString
      sourceContext.collect((rowKey, valueString))
    }
  }

  /**
    * Required override (no-op here).
    */
  override def cancel(): Unit = {

  }

  /**
    * Close the HBase connection and the table.
    */
  override def close(): Unit = {
    try {
      if (table != null) {
        table.close()
      }
      if (conn != null) {
        conn.close()
      }
    } catch {
      case e:Exception => println(e.getMessage)
    }
  }
}

Using the HBaseReader class (extending RichSourceFunction) from a Flink Streaming job:

/**
  * Reading from HBase
  * Approach 1: extend RichSourceFunction and override its methods
  */
 def readFromHBaseWithRichSourceFunction(): Unit ={
   val env = StreamExecutionEnvironment.getExecutionEnvironment
   env.enableCheckpointing(5000)
   env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
   env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
   val dataStream: DataStream[(String, String)] = env.addSource(new HBaseReader)
   dataStream.map(x => println(x._1 + " " + x._2))
   env.execute()
 }
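
The driver methods in this article are shown without their enclosing object. A sketch of the imports they rely on (assumed — they are not part of the original listing; the Kafka-related ones are only needed by the write examples further down, and the DataSet batch drivers would instead use org.apache.flink.api.scala._ for ExecutionEnvironment and DataSet):

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.scala._   // DataStream, StreamExecutionEnvironment and the implicit TypeInformation
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer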

2.1.2 Implementing a custom TableInputFormat:

Because of the version mismatch, three classes from flink-hbase_2.12 (1.7.2) need to be rewritten: TableInputSplit, AbstractTableInputFormat, and TableInputFormat.

TableInputSplit rewritten as CustomTableInputSplit:

package cn.swordfall.hbaseOnFlink.flink172_hbase212;

import org.apache.flink.core.io.LocatableInputSplit;

/**
 * @Author: Yang JianQiu
 * @Date: 2019/3/19 11:50
 */
public class CustomTableInputSplit extends LocatableInputSplit {
    private static final long serialVersionUID = 1L;

    /** The name of the table to retrieve data from. */
    private final byte[] tableName;

    /** The start row of the split. */
    private final byte[] startRow;

    /** The end row of the split. */
    private final byte[] endRow;

    /**
     * Creates a new table input split.
     *
     * @param splitNumber
     *        the number of the input split
     * @param hostnames
     *        the names of the hosts storing the data the input split refers to
     * @param tableName
     *        the name of the table to retrieve data from
     * @param startRow
     *        the start row of the split
     * @param endRow
     *        the end row of the split
     */
    CustomTableInputSplit(final int splitNumber, final String[] hostnames, final byte[] tableName, final byte[] startRow,
                    final byte[] endRow) {
        super(splitNumber, hostnames);

        this.tableName = tableName;
        this.startRow = startRow;
        this.endRow = endRow;
    }

    /**
     * Returns the table name.
     *
     * @return The table name.
     */
    public byte[] getTableName() {
        return this.tableName;
    }

    /**
     * Returns the start row.
     *
     * @return The start row.
     */
    public byte[] getStartRow() {
        return this.startRow;
    }

    /**
     * Returns the end row.
     *
     * @return The end row.
     */
    public byte[] getEndRow() {
        return this.endRow;
    }
}

AbstractTableInputFormat rewritten as CustomAbstractTableInputFormat:

package cn.swordfall.hbaseOnFlink.flink172_hbase212;

import org.apache.flink.addons.hbase.AbstractTableInputFormat;
import org.apache.flink.api.common.io.LocatableInputSplitAssigner;
import org.apache.flink.api.common.io.RichInputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.InputSplitAssigner;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author: Yang JianQiu
 * @Date: 2019/3/19 11:16
 *
 * flink-hbase_2.12 (1.7.2) is built against HBase 1.4.3, while this project uses HBase 2.1.2, so the versions do not match.
 * AbstractTableInputFormat therefore has to be rewritten: it calls HBase 1.4.3 APIs
 * that have been removed in the newer HBase 2.1.2.
 */
public abstract class CustomAbstractTableInputFormat<T> extends RichInputFormat<T, CustomTableInputSplit> {
    protected static final Logger LOG = LoggerFactory.getLogger(AbstractTableInputFormat.class);

    // helper variable to decide whether the input is exhausted or not
    protected boolean endReached = false;

    protected transient HTable table = null;
    protected transient Scan scan = null;

    /** HBase iterator wrapper. */
    protected ResultScanner resultScanner = null;

    protected byte[] currentRow;
    protected long scannedRows;

    /**
     * Returns an instance of Scan that retrieves the required subset of records from the HBase table.
     *
     * @return The appropriate instance of Scan for this use case.
     */
    protected abstract Scan getScanner();

    /**
     * What table is to be read.
     *
     * <p>Per instance of a TableInputFormat derivative only a single table name is possible.
     *
     * @return The name of the table
     */
    protected abstract String getTableName();

    /**
     * HBase returns an instance of {@link Result}.
     *
     * <p>This method maps the returned {@link Result} instance into the output type {@link T}.
     *
     * @param r The Result instance from HBase that needs to be converted
     * @return The appropriate instance of {@link T} that contains the data of Result.
     */
    protected abstract T mapResultToOutType(Result r);

    /**
     * Creates a {@link Scan} object and opens the {@link HTable} connection.
     *
     * <p>These are opened here because they are needed in the createInputSplits
     * which is called before the openInputFormat method.
     *
     * <p>The connection is opened in this method and closed in {@link #closeInputFormat()}.
     *
     * @param parameters The configuration that is to be used
     * @see Configuration
     */
    @Override
    public abstract void configure(Configuration parameters);

    @Override
    public void open(CustomTableInputSplit split) throws IOException {
        if (table == null) {
            throw new IOException("The HBase table has not been opened! " +
                    "This needs to be done in configure().");
        }
        if (scan == null) {
            throw new IOException("Scan has not been initialized! " +
                    "This needs to be done in configure().");
        }
        if (split == null) {
            throw new IOException("Input split is null!");
        }

        logSplitInfo("opening", split);

        // set scan range
        currentRow = split.getStartRow();
       /* scan.setStartRow(currentRow);
        scan.setStopRow(split.getEndRow());*/
        scan.withStartRow(currentRow);
        scan.withStopRow(split.getEndRow());

        resultScanner = table.getScanner(scan);
        endReached = false;
        scannedRows = 0;
    }

    @Override
    public T nextRecord(T reuse) throws IOException {
        if (resultScanner == null) {
            throw new IOException("No table result scanner provided!");
        }
        try {
            Result res = resultScanner.next();
            if (res != null) {
                scannedRows++;
                currentRow = res.getRow();
                return mapResultToOutType(res);
            }
        } catch (Exception e) {
            resultScanner.close();
            //workaround for timeout on scan
            LOG.warn("Error after scan of " + scannedRows + " rows. Retry with a new scanner...", e);
            /*scan.setStartRow(currentRow);*/
            scan.withStartRow(currentRow);
            resultScanner = table.getScanner(scan);
            Result res = resultScanner.next();
            if (res != null) {
                scannedRows++;
                currentRow = res.getRow();
                return mapResultToOutType(res);
            }
        }

        endReached = true;
        return null;
    }

    private void logSplitInfo(String action, CustomTableInputSplit split) {
        int splitId = split.getSplitNumber();
        String splitStart = Bytes.toString(split.getStartRow());
        String splitEnd = Bytes.toString(split.getEndRow());
        String splitStartKey = splitStart.isEmpty() ? "-" : splitStart;
        String splitStopKey = splitEnd.isEmpty() ? "-" : splitEnd;
        String[] hostnames = split.getHostnames();
        LOG.info("{} split (this={})[{}|{}|{}|{}]", action, this, splitId, hostnames, splitStartKey, splitStopKey);
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return endReached;
    }

    @Override
    public void close() throws IOException {
        LOG.info("Closing split (scanned {} rows)", scannedRows);
        currentRow = null;
        try {
            if (resultScanner != null) {
                resultScanner.close();
            }
        } finally {
            resultScanner = null;
        }
    }

    @Override
    public void closeInputFormat() throws IOException {
        try {
            if (table != null) {
                table.close();
            }
        } finally {
            table = null;
        }
    }

    @Override
    public CustomTableInputSplit[] createInputSplits(final int minNumSplits) throws IOException {
        if (table == null) {
            throw new IOException("The HBase table has not been opened! " +
                    "This needs to be done in configure().");
        }
        if (scan == null) {
            throw new IOException("Scan has not been initialized! " +
                    "This needs to be done in configure().");
        }

        // Get the starting and ending row keys for every region in the currently open table
        final Pair<byte[][], byte[][]> keys = table.getRegionLocator().getStartEndKeys();
        if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
            throw new IOException("Expecting at least one region.");
        }
        final byte[] startRow = scan.getStartRow();
        final byte[] stopRow = scan.getStopRow();
        final boolean scanWithNoLowerBound = startRow.length == 0;
        final boolean scanWithNoUpperBound = stopRow.length == 0;

        final List<CustomTableInputSplit> splits = new ArrayList<CustomTableInputSplit>(minNumSplits);
        for (int i = 0; i < keys.getFirst().length; i++) {
            final byte[] startKey = keys.getFirst()[i];
            final byte[] endKey = keys.getSecond()[i];
            final String regionLocation = table.getRegionLocator().getRegionLocation(startKey, false).getHostnamePort();
            // Test if the given region is to be included in the InputSplit while splitting the regions of a table
            if (!includeRegionInScan(startKey, endKey)) {
                continue;
            }
            // Find the region on which the given row is being served
            final String[] hosts = new String[]{regionLocation};

            // Determine if regions contains keys used by the scan
            boolean isLastRegion = endKey.length == 0;
            if ((scanWithNoLowerBound || isLastRegion || Bytes.compareTo(startRow, endKey) < 0) &&
                    (scanWithNoUpperBound || Bytes.compareTo(stopRow, startKey) > 0)) {

                final byte[] splitStart = scanWithNoLowerBound || Bytes.compareTo(startKey, startRow) >= 0 ? startKey : startRow;
                final byte[] splitStop = (scanWithNoUpperBound || Bytes.compareTo(endKey, stopRow) <= 0)
                        && !isLastRegion ? endKey : stopRow;
                int id = splits.size();
                final CustomTableInputSplit split = new CustomTableInputSplit(id, hosts, table.getName().getName(), splitStart, splitStop);
                splits.add(split);
            }
        }
        LOG.info("Created " + splits.size() + " splits");
        for (CustomTableInputSplit split : splits) {
            logSplitInfo("created", split);
        }
        return splits.toArray(new CustomTableInputSplit[splits.size()]);
    }

    /**
     * Test if the given region is to be included in the scan while splitting the regions of a table.
     *
     * @param startKey Start key of the region
     * @param endKey   End key of the region
     * @return true, if this region needs to be included as part of the input (default).
     */
    protected boolean includeRegionInScan(final byte[] startKey, final byte[] endKey) {
        return true;
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(CustomTableInputSplit[] inputSplits) {
        return new LocatableInputSplitAssigner(inputSplits);
    }

    @Override
    public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
        return null;
    }
}

TableInputFormat rewritten as CustomTableInputFormat:

package cn.swordfall.hbaseOnFlink.flink172_hbase212;

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;

/**
 * @Author: Yang JianQiu
 * @Date: 2019/3/19 11:15
 * flink-hbase_2.12 (1.7.2) is built against HBase 1.4.3, while this project uses HBase 2.1.2, so the versions do not match
 * and the TableInputFormat in flink-hbase_2.12 (1.7.2) has to be rewritten.
 */
public abstract class CustomTableInputFormat<T extends Tuple> extends CustomAbstractTableInputFormat<T> {

    private static final long serialVersionUID = 1L;

    /**
     * Returns an instance of Scan that retrieves the required subset of records from the HBase table.
     * @return The appropriate instance of Scan for this usecase.
     */
    @Override
    protected abstract Scan getScanner();

    /**
     * What table is to be read.
     * Per instance of a TableInputFormat derivative only a single tablename is possible.
     * @return The name of the table
     */
    @Override
    protected abstract String getTableName();

    /**
     * The output from HBase is always an instance of {@link Result}.
     * This method is to copy the data in the Result instance into the required {@link Tuple}
     * @param r The Result instance from HBase that needs to be converted
     * @return The appropriate instance of {@link Tuple} that contains the needed information.
     */
    protected abstract T mapResultToTuple(Result r);

    /**
     * Creates a {@link Scan} object and opens the {@link HTable} connection.
     * These are opened here because they are needed in the createInputSplits
     * which is called before the openInputFormat method.
     * So the connection is opened in {@link #configure(Configuration)} and closed in {@link #closeInputFormat()}.
     *
     * @param parameters The configuration that is to be used
     * @see Configuration
     */
    @Override
    public void configure(Configuration parameters) {
        table = createTable();
        if (table != null) {
            scan = getScanner();
        }
    }

    /**
     * Creates an {@link HTable} instance and sets it into this format.
     *
     * <p>Note: HBase 2.x no longer exposes a public HTable constructor (HBase 1.x used
     * {@code new HTable(hConf, getTableName())} here), so this base implementation returns null.
     * Subclasses are expected to override {@link #configure(Configuration)} and obtain the table
     * through {@link org.apache.hadoop.hbase.client.ConnectionFactory} instead, as the
     * HBaseInputFormat example below does.
     */
    private HTable createTable() {
        LOG.info("Initializing HBaseConfiguration");
        // use the hbase-site.xml / hbase-default.xml found on the classpath
        org.apache.hadoop.conf.Configuration hConf = HBaseConfiguration.create();
        return null;
    }

    @Override
    protected T mapResultToOutType(Result r) {
        return mapResultToTuple(r);
    }
}

Extend the custom CustomTableInputFormat to handle the HBase connection and reads:

package cn.swordfall.hbaseOnFlink

import java.io.IOException

import cn.swordfall.hbaseOnFlink.flink172_hbase212.CustomTableInputFormat
import org.apache.flink.api.java.tuple.Tuple2
import org.apache.flink.configuration.Configuration
import org.apache.hadoop.hbase.{Cell, HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConverters._
/**
  * @Author: Yang JianQiu
  * @Date: 2019/3/1 1:14
  *
  * Reading from HBase
  * Approach 2: implement the TableInputFormat interface
  */
class HBaseInputFormat extends CustomTableInputFormat[Tuple2[String, String]]{

  // result Tuple, reused for every record
  val tuple2 = new Tuple2[String, String]

  /**
    * Establish the HBase connection.
    * @param parameters
    */
  override def configure(parameters: Configuration): Unit = {
    val tableName: TableName = TableName.valueOf("test")
    val cf1 = "cf1"
    var conn: Connection = null
    val config: org.apache.hadoop.conf.Configuration = HBaseConfiguration.create

    config.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
    config.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    config.setInt(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, 30000)
    config.setInt(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, 30000)

    try {
      conn = ConnectionFactory.createConnection(config)
      table = conn.getTable(tableName).asInstanceOf[HTable]
      scan = new Scan()
      scan.withStartRow(Bytes.toBytes("001"))
      scan.withStopRow(Bytes.toBytes("201"))
      scan.addFamily(Bytes.toBytes(cf1))
    } catch {
      case e: IOException =>
        e.printStackTrace()
    }
  }

  /**
    * Convert each fetched Result into the output Tuple.
    * @param result
    * @return
    */
  override def mapResultToTuple(result: Result): Tuple2[String, String] = {
    val rowKey = Bytes.toString(result.getRow)
    val sb = new StringBuffer()
    for (cell: Cell <- result.listCells().asScala){
      val value = Bytes.toString(cell.getValueArray, cell.getValueOffset, cell.getValueLength)
      sb.append(value).append("_")
    }
    val value = sb.replace(sb.length() - 1, sb.length(), "").toString
    tuple2.setField(rowKey, 0)
    tuple2.setField(value, 1)
    tuple2
  }

  /**
    * tableName
    * @return
    */
  override def getTableName: String = "test"

  /**
    * Return the Scan.
    * @return
    */
  override def getScanner: Scan = {
    scan
  }

}
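
One caveat: configure() above opens a Connection but nothing ever closes it. A minimal sketch of how it could be released, assuming conn is promoted from a local variable inside configure() to a field of HBaseInputFormat (a hypothetical addition, not part of the original example):

  private var conn: Connection = null      // field instead of a local in configure()

  override def closeInputFormat(): Unit = {
    try {
      super.closeInputFormat()             // lets the base class close the HTable
    } finally {
      if (conn != null) conn.close()       // release the underlying HBase connection
    }
  }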

Using the HBaseInputFormat class (based on CustomTableInputFormat) from a Flink Streaming job:

/**
 * Reading from HBase
 * Approach 2: implement the TableInputFormat interface
 */
 def readFromHBaseWithTableInputFormat(): Unit ={
   val env = StreamExecutionEnvironment.getExecutionEnvironment
   env.enableCheckpointing(5000)
   env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
   env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

   val dataStream = env.createInput(new HBaseInputFormat)
   dataStream.filter(_.f0.startsWith("10")).print()
   env.execute()
 }

And the Flink DataSet batch version:

/**
 * Reading from HBase: the TableInputFormat approach
 */
 def readFromHBaseWithTableInputFormat(): Unit ={
   val env = ExecutionEnvironment.getExecutionEnvironment

   val dataStream = env.createInput(new HBaseInputFormat)
   dataStream.filter(_.f1.startsWith("20")).print()
 }

2.2 Two Ways for Flink to Write to HBase

For the Flink Streaming writes, the data comes from Kafka. Start a single-node Kafka and use kafka-console-producer.sh to write records like the following into topic "test":

100,hello,20
101,nice,24
102,beautiful,26
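
If you would rather send the sample records from code than from the console producer, here is a small sketch using the plain Kafka producer API (the broker address and topic name are the same assumptions used by the consumers below):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProduceTestRecords {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "192.168.187.201:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // each value is "rowkey,name,age" — the format HBaseWriter below expects
    Seq("100,hello,20", "101,nice,24", "102,beautiful,26")
      .foreach(r => producer.send(new ProducerRecord[String, String]("test", r)))
    producer.flush()
    producer.close()
  }
}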

2.2.1 Extending RichSinkFunction and overriding its methods:

package cn.swordfall.hbaseOnFlink

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes

/**
  * @Author: Yang JianQiu
  * @Date: 2019/3/1 1:34
  *
  * Writing to HBase
  * Approach 1: extend RichSinkFunction and override its methods
  *
  * Note: Flink processes records one by one, so do not flush to HBase for every record;
  * that would put heavy pressure on HBase and spawn so many threads that the cluster could collapse,
  * so production jobs must control the flush frequency.
  *
  * Solution: define a counter in open() and flush, say, every 500 records, or collect records
  * in a list and flush once the list reaches a certain threshold.
  */
class HBaseWriter extends RichSinkFunction[String]{

  var conn: Connection = null
  var mutator: BufferedMutator = null
  var count = 0

  /**
    * Establish the HBase connection.
    * @param parameters
    */
  override def open(parameters: Configuration): Unit = {
    val config:org.apache.hadoop.conf.Configuration = HBaseConfiguration.create
    config.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
    config.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    config.setInt(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, 30000)
    config.setInt(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, 30000)
    conn = ConnectionFactory.createConnection(config)

    val tableName: TableName = TableName.valueOf("test")
    val params: BufferedMutatorParams = new BufferedMutatorParams(tableName)
    //buffer up to 1 MB; once the buffer fills, the data is flushed to HBase automatically
    params.writeBufferSize(1024 * 1024) //set the write buffer size
    mutator = conn.getBufferedMutator(params)
    count = 0
  }

  /**
    * Write each incoming record to HBase.
    * @param value
    * @param context
    */
  override def invoke(value: String, context: SinkFunction.Context[_]): Unit = {
    val cf1 = "cf1"
    val array: Array[String] = value.split(",")
    val put: Put = new Put(Bytes.toBytes(array(0)))
    put.addColumn(Bytes.toBytes(cf1), Bytes.toBytes("name"), Bytes.toBytes(array(1)))
    put.addColumn(Bytes.toBytes(cf1), Bytes.toBytes("age"), Bytes.toBytes(array(2)))
    mutator.mutate(put)
    //flush once every 2000 records
    if (count >= 2000){
      mutator.flush()
      count = 0
    }
    count = count + 1
  }

  /**
    * Flush any remaining buffered mutations and close the connection.
    */
  override def close(): Unit = {
    if (mutator != null) {
      mutator.flush()
      mutator.close()
    }
    if (conn != null) conn.close()
  }
}

Using the HBaseWriter class (extending RichSinkFunction) from a Flink Streaming job:

/**
  * Writing to HBase
  * Approach 1: extend RichSinkFunction and override its methods
  */
 def write2HBaseWithRichSinkFunction(): Unit = {
   val topic = "test"
   val props = new Properties
   props.put("bootstrap.servers", "192.168.187.201:9092")
   props.put("group.id", "kv_flink")
   props.put("enable.auto.commit", "true")
   props.put("auto.commit.interval.ms", "1000")
   props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   env.enableCheckpointing(5000)
   env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
   env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
   val myConsumer = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, props)
   val dataStream: DataStream[String] = env.addSource(myConsumer)
   //写入HBase
   dataStream.addSink(new HBaseWriter)
   env.execute()
 }

2.2.2 Implementing the OutputFormat interface:

package cn.swordfall.hbaseOnFlink

import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.configuration.Configuration
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes

/**
  * @Author: Yang JianQiu
  * @Date: 2019/3/1 1:40
  *
  * There are two ways to write to HBase.
  * Approach 2: implement the OutputFormat interface
  */
class HBaseOutputFormat extends OutputFormat[String]{

  val zkServer = "192.168.187.201"
  val port = "2181"
  var conn: Connection = null
  var mutator: BufferedMutator = null
  var count = 0

  /**
    * Configures this output format. This method is always called first on a newly instantiated output format.
    *
    * @param configuration
    */
  override def configure(configuration: Configuration): Unit = {

  }

  /**
    * Opens a parallel instance of the output format; this is where the HBase connection, configuration and table setup happen.
    *
    * @param i
    * @param i1
    */
  override def open(i: Int, i1: Int): Unit = {
    val config: org.apache.hadoop.conf.Configuration = HBaseConfiguration.create
    config.set(HConstants.ZOOKEEPER_QUORUM, zkServer)
    config.set(HConstants.ZOOKEEPER_CLIENT_PORT, port)
    config.setInt(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, 30000)
    config.setInt(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, 30000)
    conn = ConnectionFactory.createConnection(config)

    val tableName: TableName = TableName.valueOf("test")

    val params: BufferedMutatorParams = new BufferedMutatorParams(tableName)
    //buffer up to 1 MB; once the buffer fills, the data is flushed to HBase automatically
    params.writeBufferSize(1024 * 1024) //set the write buffer size
    mutator = conn.getBufferedMutator(params)
    count = 0
  }

  /**
    * Writes a record to the sink; this is where the HBase write API is called.
    *
    * @param it
    */
  override def writeRecord(it: String): Unit = {

    val cf1 = "cf1"
    val array: Array[String] = it.split(",")
    val put: Put = new Put(Bytes.toBytes(array(0)))
    put.addColumn(Bytes.toBytes(cf1), Bytes.toBytes("name"), Bytes.toBytes(array(1)))
    put.addColumn(Bytes.toBytes(cf1), Bytes.toBytes("age"), Bytes.toBytes(array(2)))
    mutator.mutate(put)
    //flush every 4 records; when this OutputFormat is used from a batch job, the threshold (4 here) must not exceed the total number of records in the batch, otherwise the buffered data never reaches HBase
    if (count >= 4){
      mutator.flush()
      count = 0
    }
    count = count + 1
  }

  /**
    * Flush any remaining buffered mutations and close the connection.
    */
  override def close(): Unit = {
    try {
      if (mutator != null) {
        mutator.flush()
        mutator.close()
      }
      if (conn != null) conn.close()
    } catch {
      case e: Exception => println(e.getMessage)
    }
  }
}

Using the HBaseOutputFormat class (implementing OutputFormat) from a Flink Streaming job:

/**
  * Writing to HBase
  * Approach 2: implement the OutputFormat interface
  */
 def write2HBaseWithOutputFormat(): Unit = {
   val topic = "test"
   val props = new Properties
   props.put("bootstrap.servers", "192.168.187.201:9092")
   props.put("group.id", "kv_flink")
   props.put("enable.auto.commit", "true")
   props.put("auto.commit.interval.ms", "1000")
   props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
   val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   env.enableCheckpointing(5000)
   env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
   env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
   val myConsumer = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, props)
   val dataStream: DataStream[String] = env.addSource(myConsumer)
   dataStream.writeUsingOutputFormat(new HBaseOutputFormat)
   env.execute()
 }

And the Flink DataSet batch version:

/**
  * Writing to HBase: the OutputFormat approach
  */
 def write2HBaseWithOutputFormat(): Unit = {
   val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

   //define the test data
   val dataSet: DataSet[String] = env.fromElements("103,zhangsan,20", "104,lisi,21", "105,wangwu,22", "106,zhaolilu,23")
   dataSet.output(new HBaseOutputFormat)
   //env.execute() is required here: for data sinks created via output(), the job only runs when execute() is called
   env.execute()
 }

Note:

  If the OutputFormat is called from a batch job, the flush threshold in HBaseOutputFormat's writeRecord method must not be larger than the total number of records in the batch; otherwise the buffered data is never written to HBase (flushing the mutator in close(), as shown above, also guards against this).

Original source: https://www.cnblogs.com/pengblog2020/p/12186583.html
