1. When we match a set of data with duplicated values in a column, and we want to use this column as an unify column which is sharing for each database. We are going to filter them into a DataFrame we want.
class_size = data["class_size"]
class_size= class_size[class_size["GRADE "] == "09-12" ]
class_size= class_size[class_size["PROGRAM TYPE"]=="GEN ED"]
2. Once we filtered the column ,we want to condence the duplicated column into one by using groupby() and agg function.
import numpy as np
group_by = class_size.groupby(‘DBN‘) #group_by is a special type of data called GroupBy
class_size = group_by.aggregate(np.mean) # we use aggregate function to deal with the GroupBy types of data .At his moment, the index of class_size will change to the grouped by value (DBN).
class_size.reset_index(inplace = True) # reset_index allows us to reset the index as a row number - 1
data[‘class_size‘] = class_size
3. Numeric all the number string by using pd.numeric() function:
cols = [‘AP Test Takers ‘, ‘Total Exams Taken‘, ‘Number of Exams with scores 3 4 or 5‘]
for col in cols:
data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col],errors = "coerce")
4. After cleanning each dataset, we could like to combine them together so that we can plot them. Normally we use merge() function to combine two dataset.
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"],how = "left")
combined = combined.merge(data["graduation"],how = "inner")
print(combined.shape)
5. At last, we want to extract some number form certain rows by using apply() function:
index = combined.index
def get_first_two_char(data):
return data[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_char)#usually once we need to use for loop in the DataFrame, we would like to use apply function to simplieze it.