当你每天打开电脑,在百度搜索框中输入你要搜索的内容,按下回车之后,你可能不会意识到,有无数台主机在飞速运转,对比了数百万条记录,经过初步结果集生成、相关度打分、结果排序、摘要生成之后,才最终在你的屏幕上打出了你想要的结果。这一切仅仅发生在几毫秒之间。
是什么保证了如此迅速的检索速度呢?良好的索引构建是其中的要素之一。通常情况下,搜索引擎内部会为每个网页或文章分配一个数字id,用这个id代表这个网页或者文章。构建索引的时候,采用分词工具将这些网页或者文章分成一个个词,并网页id存储在称为倒排索引表的数据结构中。
由于网络空间巨大,对应的倒排索引表所占的空间也很大,对倒排索引表进行压缩显得非常必要。由于倒排索引表中存储的全部都是数字,对其进行压缩有着专门的方法,Gamma编码就是其中的一种。Gamma编码是一种基于位的变长编码,介绍它之前先说一下一元编码。
一元编码:将 n 表示成 n 个1和最后一个0,
比如: 3的一元码是 1110
40的一元码是 11111111111111111111111111111111111111110
- Gamma将数G字表示成长度(length)和偏移(offset)两部分;
- offset部分对应G的二进制编码,只不过将首部的1去掉。
例如 13 → 1101 → 101 = 偏移;
- length部分采用一元编码,表示偏移部分的位数。
例如G=13(偏移101),偏移长度为3,一元编码1110
- G的?编码就是将长度部分和偏移部分两者联接起来得到的结果。
以下代码实现了对倒排索引表进行Gamma编码:
@java
import java.util.Arrays;
import java.util.Iterator;
public class GammaBit {
public static void main(String[] args) {
int[] testNumbers = {1, 2, 3, 4, 9, 13, 24, 511, 1025};
GammaBit gammaBitBuf = new GammaBit();
for(int i = 0; i < testNumbers.length; i++){
int n = testNumbers[i];
GammaBit gammaBit = GammaBit.getGammaBit(n);
gammaBitBuf.append(gammaBit);
}
System.out.println("Gamma Code:" + gammaBitBuf);
Iterator<Integer> iterator = gammaBitBuf.iterator();
System.out.print("original numbers are:");
while(iterator.hasNext())
System.out.print(" " + iterator.next());
System.out.println();
}
private byte[] bytes;
private int bitLength; // length of bits used
private byte[] bitMap = {
0x1, 0x2, 0x4, 0x8,
0x10, 0x20, 0x40, (byte)0x80
};
public GammaBit(byte[] bytes, int bitLength){
int newSize = 1;
while(newSize <= bytes.length)
newSize <<= 1;
this.bytes = Arrays.copyOf(bytes, newSize);
this.bitLength = bitLength;
}
public GammaBit(){
bytes = new byte[8];
Arrays.fill(bytes, (byte)0);
bitLength = 0;
}
public int getBitLength(){
return bitLength;
}
public int getInnerBytesLength(){
return bytes.length;
}
public void append(GammaBit gammaBit){
int lengthAll = gammaBit.bitLength + this.bitLength;
if(lengthAll < 0)
throw new RuntimeException("bitLength is too large!");
int bytesNeeded = (lengthAll + 7) >>> 3;
if(bytesNeeded >= Math.pow(2, 31))
throw new RuntimeException("bitLength is too large!");
if(bytesNeeded > this.bytes.length){
int newSize = 1;
while(newSize <= bytesNeeded)
newSize <<= 1;
// System.out.println("resize from " + this.bytes.length + " to " + newSize);
this.bytes = Arrays.copyOf(this.bytes, newSize);
}
for(int i = 0; i < gammaBit.bitLength; i++){
boolean value = gammaBit.getABit(i);
this.setABit(i + this.bitLength, value);
}
this.bitLength += gammaBit.bitLength;
}
public static GammaBit getGammaBit(int n){
if(n <= 0)
throw new RuntimeException("number is not bigger than zero: " + n);
GammaBit gammaBit = new GammaBit();
int lenOffset = 0;
while((n >= (1 << ++lenOffset)));
lenOffset -= 1;
int bitBegin = gammaBit.bitLength;
int bitEnd = bitBegin + lenOffset;
for(int i = bitBegin; i < bitEnd; i++){
gammaBit.setABit(i, true);
}
gammaBit.setABit(bitEnd, false);
gammaBit.bitLength += lenOffset + 1;
String binaryStr = Integer.toBinaryString(n);
for(int i = 1; i < lenOffset + 1; i++){
boolean value = (binaryStr.charAt(i) == ‘1‘);
gammaBit.setABit(i + gammaBit.bitLength - 1, value);
}
gammaBit.bitLength += lenOffset;
return gammaBit;
}
private void setABit(int p, boolean isValueOne){
if(isValueOne)
this.bytes[p / 8] |= bitMap[p % 8];
}
private boolean getABit(int p){
return (this.bytes[p / 8] & bitMap[p % 8]) != 0;
}
public String toString(){
StringBuilder builder = new StringBuilder();
for(int i = 0; i < this.bitLength; i++){
if(getABit(i))
builder.append(‘1‘);
else
builder.append(‘0‘);
}
return builder.toString();
}
public Iterator<Integer> iterator(){
return new GammaBitIterator();
}
/**
* @return the bytes used by bits
*/
public byte[] getBytes(){
int bytesUsed = (bitLength + 7) >>> 3;
return Arrays.copyOf(bytes, bytesUsed);
}
private class GammaBitIterator implements Iterator<Integer>{
private int nowBitPosition = 0;
public boolean hasNext(){
return nowBitPosition < GammaBit.this.bitLength;
}
public Integer next(){
int i = 0;
for(; i + nowBitPosition < GammaBit.this.bitLength
&& getABit(i + nowBitPosition); i++);
nowBitPosition += i + 1;
int length = i;
int offsetValue = 1;
for(i = 0; i < length; i++){
offsetValue <<= 1;
if(getABit(nowBitPosition + i))
offsetValue += 1;
}
nowBitPosition += length;
return offsetValue;
}
}
}
时间: 2024-10-25 23:03:26