- The dataset admissions.csv contains information on 1000 applicants, with the following features:
gre - Graduate Record Exam, a generalized test for prospective graduate students, continuous between 200 and 800.
gpa - Cumulative grade point average, continuous between 0.0 and 4.0.
admit - Binary variable, 0 or 1, where 1 means the applicant was admitted to the program.
Use Linear Regression To Predict Admission
- This is the original data; admit takes the value 0 or 1.
import pandas
import matplotlib.pyplot as plt
admissions = pandas.read_csv("admissions.csv")
plt.scatter(admissions["gpa"], admissions["admit"])
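- These are the admit values predicted by a linear regression model. The predictions span a wide range and even include negative values, which is not what we want from a 0/1 outcome.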
# The admissions DataFrame is in memory
# Import linear regression class
from sklearn.linear_model import LinearRegression
# Initialize a linear regression model
model = LinearRegression()
# Fit model
model.fit(admissions[['gre', 'gpa']], admissions["admit"])
# Prediction of admission
admit_prediction = model.predict(admissions[['gre', 'gpa']])
# Plot Estimated Function
plt.scatter(admissions["gpa"], admit_prediction)
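- We therefore want a model that outputs a probability of admission, with values in the range [0, 1]. We can then choose a suitable classification threshold using the approach from the article "Bank Credit Card Approval: Model Evaluation with ROC & AUC".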
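To illustrate why linear regression is a poor fit for a 0/1 target, here is a minimal sketch on synthetic data. The data, the variable names, and the random seed are assumptions made for this example and are not taken from admissions.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, not admissions.csv):
# one feature x and a binary label that is more likely to be 1 when x is large
x = rng.uniform(-3, 3, size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Fit an ordinary linear regression to the binary label
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
preds = model.predict(x.reshape(-1, 1))

# The fitted line is unbounded, so some predictions leave the [0, 1] range
print(preds.min(), preds.max())
print(((preds < 0) | (preds > 1)).mean())  # share of predictions outside [0, 1]
```

Predictions below 0 or above 1 cannot be read as probabilities, which is the motivation for the logistic model that follows.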
The Logit Function
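- The logit function is the basis of logistic regression. (Strictly speaking, the function used here is the logistic, or sigmoid, function, and the logit is its inverse; the name used in this post is kept below.) It has the form:
$$\sigma(t) = \frac{e^{t}}{1 + e^{t}}$$
- Let's look at the shape of the logit function: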
import numpy as np
# Logistic Function
def logit(x):
    # np.exp(x) raises e to the power x, i.e. e^x. e ~= 2.71828
    return np.exp(x) / (1 + np.exp(x))
# linspace is a NumPy function that produces evenly spaced numbers over a specified interval.
# Create an array with 50 values between -6 and 6 as t
t = np.linspace(-6, 6, 50, dtype=float)
# Get logistic fits
ylogit = logit(t)
# plot the logistic function
plt.plot(t, ylogit, label="logistic")
plt.title("Logistic Function")
a = logit(-10)
b = logit(10)
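As a small sanity check on the curve, the sketch below compares the function with its inverse, the log-odds. The names `logistic` and `log_odds` are introduced here for illustration only; `logistic` uses the same formula as `logit(x)` in the post.

```python
import numpy as np

def logistic(x):
    # Same formula as logit(x) in the post: e^x / (1 + e^x)
    return np.exp(x) / (1 + np.exp(x))

def log_odds(p):
    # Inverse of the logistic function: log(p / (1 - p))
    return np.log(p / (1 - p))

t = np.linspace(-6, 6, 50)
p = logistic(t)

# Every output lies strictly between 0 and 1
print(p.min() > 0, p.max() < 1)
# Applying the inverse recovers the original inputs
print(np.allclose(log_odds(p), t))
# The curve passes through 0.5 at x = 0
print(logistic(0.0))
```

Because the output always stays between 0 and 1, it can be interpreted as a probability, which a raw linear prediction cannot.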
The Logistic Regression
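- Logistic regression feeds the output of a linear regression into the logit function and treats the result as the final probability:
$$P(\text{admit} = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}$$
Here β0 is the intercept and the other βi are the slopes, that is, the coefficients of the features.
- As with the linear model, we want the values of βi that minimize the error between the predictions and the true values. The usual approach is maximum likelihood estimation, with the likelihood commonly optimized by an iterative method such as gradient descent.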
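The idea of fitting the βi by maximum likelihood with gradient descent can be sketched as follows. This is a minimal illustration on synthetic data, not the solver that sklearn uses; the data, the "true" coefficients, the learning rate, and the iteration count are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not admissions.csv): one feature, known coefficients
n = 500
x = rng.normal(size=n)
true_beta0, true_beta1 = -0.5, 2.0
p_true = 1 / (1 + np.exp(-(true_beta0 + true_beta1 * x)))
y = (rng.random(n) < p_true).astype(float)

# Design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), x])

def neg_log_likelihood(b):
    z = X @ b
    # sum of log(1 + e^z) - y*z, using logaddexp for numerical stability
    return np.sum(np.logaddexp(0, z) - y * z)

beta = np.zeros(2)
start_nll = neg_log_likelihood(beta)

learning_rate = 0.1  # assumed step size
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ beta)))
    gradient = X.T @ (p - y) / n  # gradient of the mean negative log-likelihood
    beta -= learning_rate * gradient

print(beta)                                 # estimates of (beta_0, beta_1)
print(start_nll, neg_log_likelihood(beta))  # the loss should have decreased
```

Minimizing the negative log-likelihood is the same as maximizing the likelihood, so the recovered coefficients should land near the values used to generate the data, up to sampling noise.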
Model Data
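- Next we run the logistic regression experiment. Before each train/test split, the samples should be shuffled so that the split is random. The final plot of gre against the predicted values shows that the higher the gre, the higher the predicted probability of admission, which matches what we would expect.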
from sklearn.linear_model import LogisticRegression
# Randomly shuffle our data for the training and test set
admissions = admissions.loc[np.random.permutation(admissions.index)]
# train with 700 and test with the following 300, split dataset
num_train = 700
data_train = admissions[:num_train]
data_test = admissions[num_train:]
# Fit Logistic regression to admit with gpa and gre as features using the training set
logistic_model = LogisticRegression()
logistic_model.fit(data_train[['gpa', 'gre']], data_train['admit'])
# Print the Models Coefficients
print(logistic_model.coef_)
[[ 0.38004023 0.00791207]]
# Predict the chance of admission from those in the training set
fitted_vals = logistic_model.predict_proba(data_train[['gpa', 'gre']])[:,1]
fitted_test = logistic_model.predict_proba(data_test[['gpa', 'gre']])[:,1]
plt.scatter(data_test["gre"], fitted_test)
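The manual shuffle-and-slice step can also be done in one call with sklearn's `train_test_split`. The sketch below is an alternative to the code above, not the post's own method, and it uses a synthetic stand-in for admissions.csv: the `demo` DataFrame, its generating formula, and the seeds are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for admissions.csv (hypothetical values, same column names)
n = 1000
gpa = rng.uniform(2.0, 4.0, size=n)
gre = rng.uniform(200, 800, size=n)
score = 1.5 * (gpa - 3.0) + 0.01 * (gre - 500)
admit = (rng.random(n) < 1 / (1 + np.exp(-score))).astype(int)
demo = pd.DataFrame({"gpa": gpa, "gre": gre, "admit": admit})

# Shuffle and split into 700 training rows and 300 test rows in one step
demo_train, demo_test = train_test_split(
    demo, train_size=700, shuffle=True, random_state=0
)

# max_iter is raised because the features are not scaled
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo_train[["gpa", "gre"]], demo_train["admit"])

# Probability of the positive class (second column of predict_proba)
demo_probs = demo_model.predict_proba(demo_test[["gpa", "gre"]])[:, 1]
print(len(demo_train), len(demo_test))
print(demo_probs.min(), demo_probs.max())
```

Fixing `random_state` makes the split reproducible, which the `np.random.permutation` approach does not do unless a seed is set separately.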
Predictive Power
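- One usage is worth pointing out here. In accuracy_train = (predicted == data_train['admit']).mean(), the expression predicted == data_train['admit'] yields a Boolean array. When mean() is computed, True counts as 1 and False as 0, and the average is taken. This does not work with a plain Python list, because list objects have no mean() method.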
# .predict() using a threshold of 0.50 by default
predicted = logistic_model.predict(data_train[['gpa', 'gre']])
# The average of the binary array will give us the accuracy
accuracy_train = (predicted == data_train['admit']).mean()
# Print the accuracy
print("Accuracy in Training Set = {s}".format(s=accuracy_train))
# This way of formatting output is also handy
Accuracy in Training Set = 0.7785714285714286
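The Boolean-mean idiom used for the accuracy can be seen in isolation with a tiny example. The arrays below are made-up values for illustration.

```python
import numpy as np

# Made-up predictions and true labels
predicted = np.array([1, 0, 1, 1])
actual = np.array([1, 0, 0, 1])

# Element-wise comparison gives a Boolean array
matches = predicted == actual
# True counts as 1 and False as 0, so the mean is the fraction of matches
print(matches.mean())  # 0.75

# A plain list has no mean() method
matches_list = matches.tolist()
print(hasattr(matches_list, "mean"))  # False

# Comparing two lists with == gives a single bool, not element-wise results
print([1, 0, 1, 1] == [1, 0, 0, 1])  # False

# The same fraction using built-ins only
print(sum(matches_list) / len(matches_list))  # 0.75
```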
# Percentage of those admitted
percent_admitted = data_test["admit"].mean() * 100
# Predicted to be admitted
predicted = logistic_model.predict(data_test[['gpa', 'gre']])
# What proportion of our predictions were true
accuracy_test = (predicted == data_test['admit']).mean()
- In sklearn, the predict() method of logistic regression uses a default threshold of 0.5.
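To use a threshold other than 0.5, the probabilities can be compared against a cut-off directly. A minimal sketch, where the probabilities and the 0.6 cut-off are hypothetical values chosen for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities, e.g. the second column of predict_proba
probs = np.array([0.10, 0.35, 0.48, 0.52, 0.70, 0.90])

# Labels at the default 0.5 cut-off
default_labels = (probs > 0.5).astype(int)
# Labels at a stricter, hypothetical cut-off of 0.6
strict_labels = (probs > 0.6).astype(int)

print(default_labels)  # [0 0 0 1 1 1]
print(strict_labels)   # [0 0 0 0 1 1]
```

Raising the cut-off admits fewer applicants, trading true positives for fewer false positives.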
Admissions ROC Curve
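- The predict_proba function of logistic regression returns class probabilities rather than class labels, which lets us set the threshold ourselves. First we plot the ROC curve to look for a suitable threshold: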
from sklearn.metrics import roc_curve, roc_auc_score
# Compute the predicted probabilities for the training and test set
# predict_proba returns probabilities for each class. We want the second column
train_probs = logistic_model.predict_proba(data_train[['gpa', 'gre']])[:,1]
test_probs = logistic_model.predict_proba(data_test[['gpa', 'gre']])[:,1]
# Compute auc for training set
auc_train = roc_auc_score(data_train["admit"], train_probs)
# Compute auc for test set
auc_test = roc_auc_score(data_test["admit"], test_probs)
# Difference in auc values
auc_diff = auc_train - auc_test
# Compute ROC Curves
roc_train = roc_curve(data_train["admit"], train_probs)
roc_test = roc_curve(data_test["admit"], test_probs)
# Plot false positives by true positives
plt.plot(roc_train[0], roc_train[1])
plt.plot(roc_test[0], roc_test[1])
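The post does not say how to read a threshold off the ROC curve. One common heuristic, shown here as an assumption rather than the author's method, is to pick the threshold that maximizes Youden's J statistic (true positive rate minus false positive rate). The labels and scores below are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)

# Hypothetical labels and scores: positives tend to score higher than negatives
y_true = rng.integers(0, 2, size=400)
scores = np.clip(
    0.5 + 0.25 * (y_true - 0.5) + rng.normal(scale=0.2, size=400), 0, 1
)

# roc_curve returns false positive rates, true positive rates, and thresholds
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Youden's J: distance of each ROC point above the diagonal
j = tpr - fpr
best = np.argmax(j)

print(auc)
print(thresholds[best], tpr[best], fpr[best])
```

Other criteria, such as weighting false positives and false negatives by their costs, may suit a given application better.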
Date: 2024-10-10 00:14:46