Preparing Data for SQL:
Sometimes we would like to stroe data into SQL server. However , the dataset need to be cleaned before it is sent. So here we use pandas to deal with dataset(.csv)file.
1. read_csv, set encoding:
file = pd.read_csv("academy_awards.csv",encoding = ‘ISO-8859-1‘)
2. Use str function to read first 4 letters of all the strings in the column.
file["Year"] = file["Year"].str[0:4]
3. Use .isin function to get the target rows I need:
award_categories = ["Actor -- Leading Role","Actor -- Supporting Role",‘Actress -- Leading Role‘,‘Actress -- Supporting Role‘]
nominations = later_than_2000[later_than_2000.isin(award_categories)[‘Category‘]]
4. Use .map() function to replace all the element in the column as I need:
won_dic = {
‘NO‘ : 0,
‘YES‘: 1
}
nominations.is_copy = False #Attention, here we can not directly modify the copied dataframe, we have to run this line to make copied dataframe changable.
nominations["Won?"] = nominations["Won?"].map(won_dic)
5. Use .drop() function to get rid of columns I do not need:
final_nominations = nominations.drop(delete_list,axis = 1)
6. Use vectorized string method to modify each string in a column in the dataframe:
additional_info_one = final_nominations["Additional Info"].str.rstrip("‘}") #rstrip is to get rid of all the strings on the right side of the target string in the bracket.
additional_info_two = additional_info_one.str.split("{.")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
7. Use to_sql request to save the dataset into the sql:
final_nominations.to_sql("nominations",conn,index = False)