Your Prediction Gets As Good As Your Data

May 5, 2015 by Kazem

In the past, we have seen software engineers and data scientists assume that they can keep increasing their prediction accuracy by improving their machine learning algorithm. Here, we approach the classification problem from a different angle: we recommend that data scientists first analyze the distribution of their data to measure how much information it contains. This approach gives us an upper bound on how far the accuracy of a predictive algorithm can be improved, and makes sure our optimization efforts are not wasted!

Entropy and Information

In information theory, mathematicians have developed useful measures such as entropy to quantify the amount of information in a random process. Let's think of a biased coin with a head probability of 1%.

If one flips such a coin, we get more information when we see the head event, since it is a rare event compared to the tail, which is much more likely to happen. We can formulate the amount of information carried by an event as the negative logarithm of its probability, which captures the intuition described above. Mathematicians also defined another measure called entropy, which captures the average information in a random process, measured in bits. Below is the entropy formula for a discrete random variable:

$$H(X) = -\sum_{x} P(x)\,\log_2 P(x)$$
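As a quick illustration of the negative-logarithm idea, here is a minimal Python sketch (not from the original post; the helper name self_information is just an illustrative choice) that computes the information carried by the head and tail events of the 1% coin:

```python
import math

def self_information(p):
    """Information (in bits) carried by an event with probability p."""
    return -math.log2(p)

# Coin with P(head) = 0.01 and P(tail) = 0.99
print(self_information(0.01))  # ~6.64 bits: the rare head event is very informative
print(self_information(0.99))  # ~0.01 bits: the common tail event tells us little
```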
For the first example, let's assume we have a coin with P(H) = 0% and P(T) = 100%. We can compute the entropy of the coin as follows (using the convention that 0 · log₂ 0 = 0):

$$H = -0 \cdot \log_2(0) - 1 \cdot \log_2(1) = 0 \text{ bits}$$
For the second example, let's consider a coin where P(H) = 1% and P(T) = 1 − P(H) = 99%. Plugging in the numbers, one finds that the entropy of such a coin is:

$$H = -0.01\,\log_2(0.01) - 0.99\,\log_2(0.99) \approx 0.08 \text{ bits}$$

Finally, if the coin has P(H) = P(T) = 0.5 (i.e. a fair coin), its entropy is calculated as follows:

$$H = -0.5\,\log_2(0.5) - 0.5\,\log_2(0.5) = 1 \text{ bit}$$
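To tie the three examples together, here is a small Python sketch (a sketch under the assumptions above, not code from the original post) that computes the entropy of a biased coin and reproduces the three values:

```python
import math

def coin_entropy(p_head):
    """Entropy (in bits) of a coin with head probability p_head.

    Uses the convention 0 * log2(0) = 0.
    """
    entropy = 0.0
    for p in (p_head, 1.0 - p_head):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

print(coin_entropy(0.0))   # 0.0  bits: fully predictable coin
print(coin_entropy(0.01))  # ~0.08 bits: still very predictable
print(coin_entropy(0.5))   # 1.0  bit:  fair coin, maximum entropy
```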

Entropy and Predictability

So, what do these examples tell us? If we have a coin with a head probability of zero, the coin's entropy is zero, meaning that the average information in the coin is zero. This makes sense because flipping such a coin always comes up tails. Thus, the prediction accuracy is 100%. In other words, when the entropy is zero, we have maximum predictability.

In the second example, the head probability is not zero but still very close to it, which again makes the coin very predictable and gives it a low entropy.

Finally, in the last example we have a 50/50 chance of seeing the head/tail events, which maximizes the entropy and consequently minimizes the predictability. Indeed, one can show that a fair coin has the maximum entropy of 1 bit, making any prediction no better than a random guess.
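To make the inverse relationship between entropy and predictability concrete, the following sketch (again an illustrative assumption, not from the original post) compares the entropy of each coin with the accuracy of the best possible single-flip predictor, which simply guesses the more likely side:

```python
import math

def coin_entropy(p_head):
    """Entropy (in bits) of a coin, with the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in (p_head, 1.0 - p_head) if p > 0)

def best_guess_accuracy(p_head):
    """Accuracy of always predicting the more likely outcome."""
    return max(p_head, 1.0 - p_head)

for p in (0.0, 0.01, 0.5):
    print(f"P(H)={p:.2f}  entropy={coin_entropy(p):.2f} bits  "
          f"best accuracy={best_guess_accuracy(p):.2f}")
# P(H)=0.00  entropy=0.00 bits  best accuracy=1.00
# P(H)=0.01  entropy=0.08 bits  best accuracy=0.99
# P(H)=0.50  entropy=1.00 bits  best accuracy=0.50
```

As the entropy rises from 0 to 1 bit, the best achievable accuracy falls from 100% down to the 50% of a random guess.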

Kullback–Leibler divergence

As a last example, it is worth showing another way we can borrow ideas from information theory, this time to measure the distance between two probability distributions. Let's assume we are modeling two random processes by their PMFs, P(.) and Q(.). One can use an entropy-like measure to compute the distance between the two PMFs as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log_2 \frac{P(x)}{Q(x)}$$

The distance function above is known as the KL divergence, and it measures how far Q's PMF is from P's PMF (it is not symmetric, so strictly speaking it is not a true distance). The KL divergence can come in handy in various problems, such as NLP tasks where we'd like to measure the distance between two sets of data (e.g. bags of words).
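As a rough illustration of how this could be applied to bag-of-words data, here is a small Python sketch (an assumed example, not from the original post; the vocabulary and probabilities are made up) that computes the KL divergence between two toy word distributions:

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) in bits between two discrete PMFs.

    p and q are dicts mapping outcomes to probabilities; every outcome
    with p[x] > 0 must also have q[x] > 0, otherwise the divergence is infinite.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Toy bag-of-words PMFs over a shared vocabulary (made-up numbers)
p = {"data": 0.5, "model": 0.3, "entropy": 0.2}
q = {"data": 0.4, "model": 0.4, "entropy": 0.2}

print(kl_divergence(p, q))  # small positive value: the two documents are similar
print(kl_divergence(p, p))  # 0.0: a distribution has zero divergence from itself
```

Note that the sketch assumes every word with non-zero probability under P also appears with non-zero probability under Q; in practice one usually smooths the distributions to avoid an infinite divergence.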

Wrap-up

In this post, we showed that entropy from information theory provides a way to measure how much information exists in our data. We also highlighted the inverse relationship between entropy and predictability. This suggests that we can use the entropy measure to calculate an upper bound for the accuracy of the prediction problem at hand.

Feel free to share any comments or questions in the comment section below.

You can also reach us at [email protected]
