搜索引擎倒排索引表压缩：gamma编码

当你每天打开电脑，在百度搜索框中输入你要搜索的内容，按下回车之后，你可能不会意识到，有无数台主机在飞速运转，对比了数百万条记录，经过初步结果集生成、相关度打分、结果排序、摘要生成之后，才最终在你的屏幕上打出了你想要的结果。这一切仅仅发生在几毫秒之间。

是什么保证了如此迅速的检索速度呢？良好的索引构建是其中的要素之一。通常情况下，搜索引擎内部会为每个网页或文章分配一个数字id，用这个id代表这个网页或者文章。构建索引的时候，采用分词工具将这些网页或者文章分成一个个词，并网页id存储在称为倒排索引表的数据结构中。

由于网络空间巨大，对应的倒排索引表所占的空间也很大，对倒排索引表进行压缩显得非常必要。由于倒排索引表中存储的全部都是数字，对其进行压缩有着专门的方法，Gamma编码就是其中的一种。Gamma编码是一种基于位的变长编码，介绍它之前先说一下一元编码。

一元编码：将 n 表示成 n 个1和最后一个0，

比如： 3的一元码是 1110

40的一元码是 11111111111111111111111111111111111111110

Gamma将数G字表示成长度（length）和偏移（offset）两部分；

offset部分对应G的二进制编码，只不过将首部的1去掉。
例如 13 → 1101 → 101 = 偏移；

length部分采用一元编码，表示偏移部分的位数。
例如G=13(偏移101)，偏移长度为3，一元编码1110

G的?编码就是将长度部分和偏移部分两者联接起来得到的结果。

以下代码实现了对倒排索引表进行Gamma编码：

@java
import java.util.Arrays;
import java.util.Iterator;

public class GammaBit {
    public static void main(String[] args) {
        int[] testNumbers = {1, 2, 3, 4, 9, 13, 24, 511, 1025};
        GammaBit gammaBitBuf = new GammaBit();
        for(int i = 0; i < testNumbers.length; i++){
            int n = testNumbers[i];
            GammaBit gammaBit = GammaBit.getGammaBit(n);
            gammaBitBuf.append(gammaBit);
        }
        System.out.println("Gamma Code:" + gammaBitBuf);
        Iterator<Integer> iterator = gammaBitBuf.iterator();
        System.out.print("original numbers are:");
        while(iterator.hasNext())
            System.out.print(" " + iterator.next());
        System.out.println();
    }
    private byte[] bytes;
    private int bitLength; // length of bits used
    private byte[] bitMap = {
        0x1, 0x2, 0x4, 0x8,
        0x10, 0x20, 0x40, (byte)0x80
    };
    public GammaBit(byte[] bytes, int bitLength){
        int newSize = 1;
        while(newSize <= bytes.length)
            newSize <<= 1;
        this.bytes = Arrays.copyOf(bytes, newSize);
        this.bitLength = bitLength;
    }
    public GammaBit(){
        bytes = new byte[8];
        Arrays.fill(bytes, (byte)0);
        bitLength = 0;
    }

    public int getBitLength(){
        return bitLength;
    }
    public int getInnerBytesLength(){
        return bytes.length;
    }
    public void append(GammaBit gammaBit){
        int lengthAll = gammaBit.bitLength + this.bitLength;
        if(lengthAll < 0)
            throw new RuntimeException("bitLength is too large!");
        int bytesNeeded = (lengthAll + 7) >>> 3;
        if(bytesNeeded >= Math.pow(2, 31))
            throw new RuntimeException("bitLength is too large!");
        if(bytesNeeded > this.bytes.length){
            int newSize = 1;
            while(newSize <= bytesNeeded)
                newSize <<= 1;
//          System.out.println("resize from " + this.bytes.length + " to " + newSize);
            this.bytes = Arrays.copyOf(this.bytes, newSize);
        }
        for(int i = 0; i < gammaBit.bitLength; i++){
            boolean value = gammaBit.getABit(i);
            this.setABit(i + this.bitLength, value);
        }
        this.bitLength += gammaBit.bitLength;
    }
    public static GammaBit getGammaBit(int n){
        if(n <= 0)
            throw new RuntimeException("number is not bigger than zero: " + n);
        GammaBit gammaBit = new GammaBit();
        int lenOffset = 0;
        while((n >= (1 << ++lenOffset)));
        lenOffset -= 1;
        int bitBegin = gammaBit.bitLength;
        int bitEnd = bitBegin + lenOffset;
        for(int i = bitBegin; i < bitEnd; i++){
            gammaBit.setABit(i, true);
        }
        gammaBit.setABit(bitEnd, false);
        gammaBit.bitLength += lenOffset + 1;
        String binaryStr = Integer.toBinaryString(n);
        for(int i = 1; i < lenOffset + 1; i++){
            boolean value = (binaryStr.charAt(i) == ‘1‘);
            gammaBit.setABit(i + gammaBit.bitLength - 1, value);
        }
        gammaBit.bitLength += lenOffset;
        return gammaBit;
    }
    private void setABit(int p, boolean isValueOne){
        if(isValueOne)
            this.bytes[p / 8] |= bitMap[p % 8];
    }
    private boolean getABit(int p){
        return (this.bytes[p / 8] & bitMap[p % 8]) != 0;
    }
    public String toString(){
        StringBuilder builder = new StringBuilder();
        for(int i = 0; i < this.bitLength; i++){
            if(getABit(i))
                builder.append(‘1‘);
            else
                builder.append(‘0‘);
        }
        return builder.toString();
    }
    public Iterator<Integer> iterator(){
        return new GammaBitIterator();
    }
    /**
     * @return the bytes used by bits
     */
    public byte[] getBytes(){
        int bytesUsed = (bitLength + 7) >>> 3;
        return Arrays.copyOf(bytes, bytesUsed);
    }
    private class GammaBitIterator implements Iterator<Integer>{
        private int nowBitPosition = 0;
        public boolean hasNext(){
            return nowBitPosition < GammaBit.this.bitLength;
        }
        public Integer next(){
            int i = 0;
            for(; i + nowBitPosition < GammaBit.this.bitLength
                    && getABit(i + nowBitPosition); i++);
            nowBitPosition += i + 1;
            int length = i;
            int offsetValue = 1;
            for(i = 0; i < length; i++){
                offsetValue <<= 1;
                if(getABit(nowBitPosition + i))
                    offsetValue += 1;
            }
            nowBitPosition += length;
            return offsetValue;
        }
    }
}

时间： 2024-10-25 23:03:26

搜索引擎倒排索引表压缩：gamma编码

搜索引擎倒排索引表压缩：gamma编码的相关文章

优化系列 | InnoDB引擎数据表压缩特性测试

oracle 11G表压缩

【SQL Server性能优化】SQL Server 2008该表压缩

SQL Server 表压缩

mysql修改数据库编码(数据库字符集)和表的字符编码的方法

Oracle 表压缩(Table Compression)技术介绍

【SQL Server性能优化】SQL Server 2008之表压缩

Oracle压缩功能小结2—预估表压缩的效果

MySQL数据库、表的字符编码