Data Cleaning 2

1. To combine several datasets, we need one column that every dataset shares and that holds a unique value in each row (DBN in the code below). When that column contains duplicated values, we first filter the dataset down to the rows we actually want.

  class_size = data["class_size"]
  # Note: the "GRADE " column name ends with a space.
  class_size = class_size[class_size["GRADE "] == "09-12"]
  class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
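The two filters above can also be written as a single boolean mask. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the real class_size file) to show how filtering reduces the number of duplicated DBN values:

```python
import pandas as pd

# Toy stand-in for the class_size data; all values are made up for illustration.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M292", "01M448"],
    "GRADE ": ["09-12", "09-12", "K", "09-12"],
    "PROGRAM TYPE": ["GEN ED", "SPEC ED", "GEN ED", "GEN ED"],
})

# Combine both conditions with & so the filter is applied in one step.
mask = (toy["GRADE "] == "09-12") & (toy["PROGRAM TYPE"] == "GEN ED")
filtered = toy[mask]

# Count how many rows repeat an earlier DBN, before and after filtering.
dup_before = toy["DBN"].duplicated().sum()
dup_after = filtered["DBN"].duplicated().sum()
print(dup_before, dup_after)
```

In this toy data the filter happens to remove every duplicate. In the real data several rows per DBN can remain after filtering, which is why the next step condenses them.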

2. Once we have filtered the rows, we condense the remaining duplicated DBN values into one row each by using groupby() and the aggregate() (agg) function.

  import numpy as np
  # groupby() returns a GroupBy object, not a DataFrame.
  group_by = class_size.groupby("DBN")
  # aggregate() reduces each group to a single row (here, the mean of each column).
  # At this point the grouping column (DBN) becomes the index of class_size.
  class_size = group_by.aggregate(np.mean)
  # reset_index() moves DBN back into a regular column and restores the default 0-based row index.
  class_size.reset_index(inplace=True)
  data["class_size"] = class_size
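To see what the group-and-aggregate step does to the index, here is a minimal sketch on made-up data. It calls mean(numeric_only=True) directly on the GroupBy object instead of aggregate(np.mean); that is a swapped-in variant, chosen because text columns such as the hypothetical SUBJECT column cannot be averaged:

```python
import pandas as pd

# Toy data; the column names and values are assumptions for illustration only.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "SUBJECT": ["ENGLISH", "MATH", "ENGLISH"],
    "AVERAGE CLASS SIZE": [20.0, 30.0, 25.0],
})

# Average the numeric columns within each DBN group; SUBJECT is skipped.
grouped = toy.groupby("DBN").mean(numeric_only=True)

# After grouping, DBN is the index. reset_index() turns it back into a column.
condensed = grouped.reset_index()
print(condensed)
```

After this step each DBN appears exactly once, so the column can safely be used as a merge key.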

3. Convert all the numeric strings to numbers by using the pd.to_numeric() function:

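  # Note: the "AP Test Takers " column name ends with a space.
  cols = ["AP Test Takers ", "Total Exams Taken", "Number of Exams with scores 3 4 or 5"]

  for col in cols:
      # errors="coerce" turns values that cannot be parsed into NaN instead of raising an error.
      data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")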
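The effect of errors="coerce" is easiest to see on a tiny Series. This is a minimal sketch with made-up values, where "s" stands in for any placeholder that is not a number:

```python
import pandas as pd

# Strings as they might appear in a raw CSV column; "s" is not a valid number.
raw = pd.Series(["39", "s", "255"])

# With errors="coerce", unparseable entries become NaN and the rest become floats.
nums = pd.to_numeric(raw, errors="coerce")
print(nums)
```

Without errors="coerce", the default behavior is to raise an error on the first value that cannot be parsed.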

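4. After cleaning each dataset, we would like to combine them so that we can plot them. Normally we use the merge() function to combine two datasets. A "left" merge keeps every row of the left dataset, while an "inner" merge keeps only the rows whose key appears in both.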

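  combined = data["sat_results"]

  # Join on the shared DBN column; naming the key explicitly avoids relying on merge()'s default.
  combined = combined.merge(data["ap_2010"], on="DBN", how="left")
  combined = combined.merge(data["graduation"], on="DBN", how="inner")
  print(combined.shape)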
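The difference between the two merge strategies shows up in the row counts. The sketch below uses three small made-up DataFrames (the sat_score and grad_rate columns and all values are assumptions for illustration) and passes on="DBN" to make the join key explicit:

```python
import pandas as pd

# Toy tables sharing a DBN key; all values are made up.
sat = pd.DataFrame({"DBN": ["01M292", "01M448", "01M450"],
                    "sat_score": [1122, 1172, 1149]})
ap = pd.DataFrame({"DBN": ["01M448", "01M450"],
                   "AP Test Takers ": [39, 19]})
grad = pd.DataFrame({"DBN": ["01M292", "01M448"],
                     "grad_rate": [0.6, 0.7]})

# Left merge: every row of sat is kept; schools missing from ap get NaN.
left = sat.merge(ap, on="DBN", how="left")

# Inner merge: only DBNs present in both tables survive.
inner = left.merge(grad, on="DBN", how="inner")

print(left.shape, inner.shape)
```

A left merge preserves rows at the cost of missing values, while an inner merge avoids new missing values at the cost of dropping rows.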

5. Finally, we want to extract part of each value in a column (here, the first two characters of DBN) by using the apply() function:

  index = combined.index

  def get_first_two_char(data):
    return data[0:2]

  # Whenever we would otherwise write a for loop over a DataFrame column, apply() is usually the simpler choice.
  combined["school_dist"] = combined["DBN"].apply(get_first_two_char)
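As a small illustration on made-up DBN values, apply() calls the function once per element. For plain string slicing, pandas also offers the .str accessor, shown here as an alternative rather than as the approach used above:

```python
import pandas as pd

# Made-up DBN codes; the first two characters are the district number.
dbn = pd.Series(["01M292", "02M047", "10X141"])

def first_two_chars(value):
    # Return the first two characters of a single string.
    return value[0:2]

# apply() runs the function on every element of the Series.
via_apply = dbn.apply(first_two_chars)

# Alternative: slice every string at once through the .str accessor.
via_str = dbn.str[:2]

print(via_apply.tolist())
```

Both forms produce the same values here. apply() is the more general tool because the function can contain any logic, not just slicing.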

Time: 2024-11-08 02:29:20
