Data Compression Category

Data compression is an approach to shrinking an original dataset to save space. According to reports in The Economist, the amount of digital data in the world is growing explosively: it increased from 1.2 zettabytes in 2010 to 1.8 zettabytes in 2011. How to compress data and manage storage cost-effectively is therefore a challenging and important task.

Traditionally, we use compression algorithms to achieve data reduction. The main idea of data compression is to "use the fewest number of bits to represent information as accurately as possible". Because the goal is to represent the original information as accurately as possible, rather than exactly, we are allowed to discard some unneeded information when encoding. Classical compression approaches can be classified into lossless compression and lossy compression. The difference between them is whether unnecessary information is lost.

Lossless compression reduces data by identifying and eliminating statistical redundancy in a reversible fashion. To remove redundant information, it can use statistical properties of the data to build a new encoding, as in Huffman coding, or it can use a dictionary model that replaces repeated strings by means of a sliding-window algorithm. What matters is that, with lossless compression, restoring the data yields the original data without losing any information.
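As a rough illustration of the statistical approach, here is a minimal Huffman coding sketch. The function names (`huffman_codes`, `encode`, `decode`) are made up for this example, and the bits are kept as a string of "0" and "1" characters for readability rather than packed into bytes as a real compressor would do.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code in them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    # Huffman codes are prefix-free, so the first match is always correct.
    reverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in reverse:
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_codes(text)
bits = encode(text, codes)
print(len(bits), "bits instead of", 8 * len(text))
print(decode(bits, codes) == text)
```

Frequent symbols receive short codes and rare symbols receive long ones, and decoding returns exactly the input, which is the reversibility that lossless compression requires.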

Lossy compression reduces data by identifying unnecessary information and irretrievably removing it. The "unnecessary" information does carry content of its own; it is simply not useful in a particular field, which is why removing it makes the compression lossy. In some fields we only need the useful information and can ignore the rest, so lossy compression methods work well for images, audio, and video. The consequence is that we cannot get the original data back once a lossy compression algorithm has been applied.
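One simple way to see why lossy compression is irreversible is uniform quantization, sketched below. This is an illustrative assumption rather than the method of any particular image, audio, or video codec; `quantize`, `dequantize`, and the step size of 16 are chosen only for the example.

```python
def quantize(samples, step):
    """Map each sample to the index of its nearest multiple of step."""
    return [round(s / step) for s in samples]

def dequantize(indices, step):
    """Rebuild approximate samples from the quantization indices."""
    return [i * step for i in indices]

samples = [3, 18, 130, 131, 250]   # e.g. 8-bit intensity values
step = 16                          # larger step: fewer distinct values, more loss

indices = quantize(samples, step)
restored = dequantize(indices, step)

print(indices)
print(restored)
```

The indices take fewer distinct values than the samples, so they need fewer bits, but 130 and 131 map to the same index and can no longer be told apart after restoring.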

For a lossless approach, eliminating statistical redundancy becomes unacceptably expensive as the data grows. A lossless approach needs statistics about the data, which means counting over all of it. For a large dataset, it must therefore trade off speed against compression ratio.
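The trade-off can be observed with Python's standard `zlib` module, whose compression level runs from 1 (fastest) to 9 (smallest output). The sketch below is only an illustration; the helper name `compress_at_levels` and the sample data are assumptions, and actual timings and sizes depend on the data and the machine.

```python
import time
import zlib

def compress_at_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed_bytes, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        results[level] = (compressed, time.perf_counter() - start)
    return results

data = b"compression ratio versus speed " * 1000

for level, (compressed, seconds) in compress_at_levels(data).items():
    # Every level is lossless: decompressing gives back the input exactly.
    assert zlib.decompress(compressed) == data
    print(f"level {level}: {len(compressed)} bytes in {seconds:.6f} s")
```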

There are two other methods for compressing data: delta compression and data deduplication.

Delta compression takes a different perspective: it compresses two very similar files. It compares two files, A and B, and calculates the delta between them, so that file B can be expressed as file A plus the delta, which saves space. Delta compression is generally used in source code version control and file synchronization.
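The idea of expressing B as A plus a delta can be sketched with `difflib.SequenceMatcher` from the Python standard library. The delta format here (a list of "copy" and "insert" operations) and the names `make_delta` and `apply_delta` are invented for illustration; real delta tools use their own, more compact encodings.

```python
from difflib import SequenceMatcher

def make_delta(a, b):
    """Describe b as ranges copied from a plus literal inserted text."""
    delta = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reuse a[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("insert", b[j1:j2]))   # new text only found in b
        # "delete" needs no entry: that part of a is simply not copied.
    return delta

def apply_delta(a, delta):
    """Rebuild b from a and the delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.append(a[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)

file_a = "line one\nline two\nline three\n"
file_b = "line one\nline 2\nline three\nline four\n"

delta = make_delta(file_a, file_b)
print(delta)
print(apply_delta(file_a, delta) == file_b)
```

Only the literal text in the "insert" operations has to be stored alongside A, which is why the approach pays off when the two files are very similar.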

Data deduplication targets large-scale systems and works at a coarse granularity (file level or chunk level, with chunks of about 8 KB). Chunk level is used instead of file level because it achieves better compression. In general, data deduplication splits the backup data into chunks and identifies each chunk by its cryptographically secure hash signature (for example, SHA-1). When chunks are identical, it removes the duplicates and stores only one copy, which is how it saves space. Only the unique chunks and the file metadata are stored, and the metadata is used to reconstruct the original file.
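A minimal sketch of chunk-level deduplication follows, assuming fixed-size 8 KB chunks and an in-memory dictionary as the chunk store. The names `dedup_store` and `restore` are hypothetical, and the "recipe" (an ordered list of fingerprints) stands in for the file metadata; production systems often use variable-size chunking and persistent indexes instead.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB fixed-size chunks (an assumption for this sketch)

def dedup_store(data, store, chunk_size=CHUNK_SIZE):
    """Split data into chunks, keep only unseen chunks, return the recipe."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)   # store each unique chunk once
        recipe.append(fingerprint)             # metadata to rebuild the file
    return recipe

def restore(recipe, store):
    """Reconstruct the original data from its recipe and the chunk store."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

store = {}
block_a, block_b, block_c = (bytes([v]) * CHUNK_SIZE for v in (65, 66, 67))
backup_1 = block_a + block_b + block_a
backup_2 = block_a + block_c + block_b

recipe_1 = dedup_store(backup_1, store)
recipe_2 = dedup_store(backup_2, store)

print(len(recipe_1) + len(recipe_2), "chunks referenced,", len(store), "stored")
print(restore(recipe_1, store) == backup_1, restore(recipe_2, store) == backup_2)
```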

Original article: https://www.cnblogs.com/wAther/p/11741973.html

Time: 2024-08-01 08:53:12
