Galaxy Classification

10.3 Data Preparation

After removing a large number of columns from the raw SDSS dataset, introducing a number of derived features, and generating two target features, Jocelyn generated an ABT containing 327 descriptive features and two target features. We list these features.

Once Jocelyn had populated the ABT, she generated a data quality report (the initial data quality report covered the data in the raw SDSS dataset only, so a second one was required that covered the actual ABT) and performed an in-depth analysis of the characteristics of each descriptive feature.

The magnitude of the maximum value of the FIBER2FLUXIVAR_U feature in comparison to the median and 3rd quartile values was unusual and suggested the presence of outliers. The difference between the mean and median values for the SKYIVAR_R feature also suggested the presence of outliers. Similarly, the difference between the mean and median values for the LNLSTAR_R feature suggested that the distribution of this feature was heavily skewed and also suggested the presence of outliers.

With Edwin's help, Jocelyn investigated the actual data in the ABT to determine whether the extreme values in the features displaying significant skew or the presence of outliers were due to valid outliers or invalid outliers. In all cases the extreme values were determined to be valid outliers. Jocelyn decided to use the clamp transformation to change the values of these outliers to something closer to the central tendency of the features. Any values beyond the 1st quartile value plus 2.5 times the inter-quartile range were reduced to this value. The standard value of 1.5 times the inter-quartile range was changed to 2.5 to slightly reduce the impact of this operation.
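The clamp transformation described above can be sketched in a few lines of Python. This is a minimal illustration rather than Jocelyn's actual procedure: it assumes the conventional form of the clamp (a lower threshold of the 1st quartile minus the multiplier times the inter-quartile range, and an upper threshold of the 3rd quartile plus the multiplier times the inter-quartile range), and the function name `clamp_outliers` and the sample values are hypothetical.

```python
import statistics

def clamp_outliers(values, multiplier=2.5):
    """Clamp values that lie more than multiplier * IQR outside the quartiles."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr  # assumed lower threshold
    upper = q3 + multiplier * iqr  # assumed upper threshold
    # Values inside [lower, upper] are unchanged; the rest move to the nearest threshold.
    return [min(max(v, lower), upper) for v in values]

# Hypothetical feature values containing one extreme (but valid) outlier.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]
clamped = clamp_outliers(sample)
```

A larger multiplier widens the band of values left untouched, which is why moving from 1.5 to 2.5 makes the operation gentler.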

Jocelyn also made the decision to normalize all the descriptive features into standard scores. The differences in the ranges of values of the descriptive features in the ABT were huge. For example, DEVAB_R had a range as small as [0.05, 1.00] while APERFLUX7IVAR_U had a range as large as [-265,862, 15,274]. Standardizing the descriptive features in this way was likely to improve the accuracy of the final predictive models. The only drawback to standardization is that the models become less interpretable. Interpretability, however, was not particularly important for the SDSS scenario (the model built would be added to the existing SDSS pipeline and process thousands of galaxy objects per day), so standardization was appropriate.
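As a rough sketch of what normalizing into standard scores involves, each value has the feature's mean subtracted and is then divided by the feature's standard deviation. The code below assumes the sample standard deviation is used; the function name `standardize` and the example values (chosen only to fall inside the two ranges quoted above) are hypothetical.

```python
import statistics

def standardize(values):
    """Convert raw feature values into standard scores (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    if sd == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Hypothetical values on very different scales.
devab_r = [0.05, 0.40, 0.75, 1.00]
aperflux7ivar_u = [-265862.0, -1200.0, 350.0, 15274.0]

z_small = standardize(devab_r)
z_large = standardize(aperflux7ivar_u)
```

After the transformation both features have a mean of 0 and a standard deviation of 1, so neither dominates a distance calculation or a weight update simply because of its units.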

Jocelyn also performed a simple first-pass feature selection using the 3-level target feature to see which features might stand out as predictive of galaxy morphology. Jocelyn used the information gain measure to rank the predictiveness of the different features in the dataset (for this analysis, missing values were simply omitted). The columns identified as being most predictive of galaxy morphology were expRad_g (0.3908), expRad_r (0.3649), deVRad_g (0.3607), expRad_i (0.3509), deVRad_r (0.3467), expRad_z (0.3457), and mRrCc_g (0.3365). Jocelyn generated histograms for all these features compared to the target feature; for example, we show the histograms for the EXPRAD_R feature. It was encouraging that in many cases distinct distributions for each galaxy type were apparent in the histograms. We show small multiple box plots divided by galaxy type for a selection of features from the ABT. The differences between the three box plots in each plot give an indication of the likely predictiveness of each feature. The presence of large numbers of outliers can also be seen.
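To make the ranking step concrete, the following sketch computes information gain for a single continuous feature. The text does not say how continuous values were partitioned, so equal-width binning is an assumption here, as are the function names and the toy data; missing values (represented as `None`) are omitted, as in the analysis above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of target levels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels, bins=4):
    """Information gain of a continuous feature after equal-width binning."""
    # Omit instances where the feature value is missing.
    pairs = [(v, y) for v, y in zip(feature_values, labels) if v is not None]
    values = [v for v, _ in pairs]
    kept_labels = [y for _, y in pairs]
    low, high = min(values), max(values)
    width = ((high - low) / bins) or 1.0  # guard against a constant feature
    partitions = {}
    for v, y in pairs:
        index = min(int((v - low) / width), bins - 1)
        partitions.setdefault(index, []).append(y)
    # Weighted entropy remaining after splitting on the binned feature.
    remainder = sum(len(p) / len(kept_labels) * entropy(p) for p in partitions.values())
    return entropy(kept_labels) - remainder

# Toy data: the first feature separates the two levels, the second does not.
targets = ["elliptical", "elliptical", "elliptical", "spiral", "spiral", "spiral"]
separating = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
uninformative = [1.0, 9.0, 1.0, 9.0, 1.0, 9.0]

gain_separating = information_gain(separating, targets)
gain_uninformative = information_gain(uninformative, targets)
```

Ranking every descriptive feature by this score and inspecting the top of the list corresponds to the rank-based first pass described above.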

10.4 Modeling

The descriptive features in the SDSS dataset are primarily continuous. For this reason, Jocelyn considered trying a similarity-based model, the k nearest neighbor, and two error-based models, the logistic regression model and the support vector machine. Jocelyn began by constructing a simple baseline model using the 3-level target feature.

10.4.1 Baseline Models

Because of the size of the ABT, Jocelyn decided to split the dataset into a training set and a large hold-out test set. Subsets of the training set would also be used for validation during the model building process. The training set consisted of a portion of the data in the ABT (approximately 200,000 instances), and the test set consisted of the remainder (approximately 450,000 instances). Using the training set, Jocelyn performed a 10-fold cross validation experiment on models trained to use the full set of descriptive features to predict the 3-level target. These would act as baseline performance scores that she would try to improve upon. The classification accuracies achieved during the cross validation experiment were , , and by the k nearest neighbor, logistic regression, and support vector machine models, respectively.
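The evaluation design above (a hold-out split followed by 10-fold cross validation on the training portion) might be sketched as follows. Everything here is illustrative: the helper names, the 100-instance toy dataset, the 0.3 training fraction (chosen only to roughly echo the 200,000 / 450,000 split), and the majority-class stand-in model are assumptions, not details from the case study.

```python
import random
from collections import Counter

def holdout_split(n_instances, train_fraction, seed=0):
    """Shuffle instance indices and split them into training and test sets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(n_instances * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]

def cross_validated_accuracy(indices, labels, fit, predict, k=10):
    """Average classification accuracy over k validation folds."""
    scores = []
    for training, validation in k_fold_indices(indices, k):
        model = fit(training)
        correct = sum(predict(model, idx) == labels[idx] for idx in validation)
        scores.append(correct / len(validation))
    return sum(scores) / k

# Toy, imbalanced 3-level target (hypothetical proportions).
labels = ["elliptical"] * 60 + ["spiral"] * 30 + ["other"] * 10
train_idx, test_idx = holdout_split(len(labels), train_fraction=0.3)

# Stand-in model: always predict the most common level seen in training.
fit = lambda training: Counter(labels[i] for i in training).most_common(1)[0][0]
predict = lambda model, idx: model

baseline_accuracy = cross_validated_accuracy(train_idx, labels, fit, predict)
```

The test indices are never touched during cross validation, which keeps the hold-out set available for a final, unbiased performance estimate.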

These initial baseline results were promising; however, one key issue did emerge. It was clear that the performance of the models trained using the SDSS data was severely affected by the target level imbalance in the data: there were many more examples of the elliptical target level than of either the spiral or, especially, the other target level.

10.4.2 Feature Selection

In the SDSS dataset, many of the features are presented multiple times, once for each of the five different photometric bands, and this made Jocelyn suspect that many of these features might be redundant and so ripe for removal from the dataset. Feature selection approaches that search through subsets of features (known as wrapper approaches) are better at removing redundant features than rank and prune approaches because they consider groups of features together. For this reason, Jocelyn chose to use a step-wise sequential search for feature selection for each of the three model types. In all cases overall classification accuracy was used as the fitness function that drove the search. After feature selection, the classification accuracies of the models on the test set were , , and for the k nearest neighbor, logistic regression, and support vector machine models, respectively. In all cases performance of the models improved with feature selection. The best performing model was the logistic regression model. For this model, just 31 out of the total 327 features were selected. This was not surprising given the large amount of redundancy within the feature set.
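A wrapper-style sequential search can be sketched as below. The text does not say whether the search ran forward or backward, so the forward variant shown here is an assumption, as are the function names. The `toy_fitness` function is a hypothetical stand-in for "train the model on this subset and return its classification accuracy": it rewards covering a new measurement type and slightly penalizes every extra feature, mimicking redundancy across photometric bands.

```python
def forward_sequential_selection(features, fitness):
    """Greedily add the feature that most improves fitness until nothing helps."""
    selected = []
    best_score = fitness(selected)
    remaining = list(features)
    while remaining:
        # Evaluate each candidate together with the features already selected.
        scored = [(fitness(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the fitness function
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

def toy_fitness(subset):
    """Hypothetical accuracy: new measurement types help, redundant bands do not."""
    measurement_types = {name.split("_")[0] for name in subset}
    return 0.5 + 0.2 * len(measurement_types) - 0.01 * len(subset)

candidates = ["expRad_g", "expRad_r", "deVRad_g", "deVRad_r"]
chosen, score = forward_sequential_selection(candidates, toy_fitness)
```

Because each candidate is scored in the context of the features already chosen, a second band of the same measurement adds nothing and is left out, which is the behavior that makes wrapper approaches effective against redundancy.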

Based on these results, Jocelyn determined that the logistic regression model trained using the reduced set of features was the best model to use for galaxy classification. This model gave the best prediction accuracy and offered the potential for very fast classification times, which was attractive for integration into the SDSS pipeline. Logistic regression models also produce confidences along with the predictions, which was attractive to Edwin as it meant that he could build tests into the pipeline that would redirect galaxies with low confidence classifications for manual confirmation of the predictions made by the automated system.
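The kind of pipeline test Edwin had in mind could look something like the sketch below. The function name `route_prediction`, the 0.8 threshold, and the example probabilities are all hypothetical; the text does not specify how low-confidence classifications would be detected.

```python
def route_prediction(class_probabilities, threshold=0.8):
    """Return (label, confidence, route) for one galaxy's predicted probabilities."""
    # The predicted level is the one with the highest model confidence.
    label, confidence = max(class_probabilities.items(), key=lambda item: item[1])
    route = "automatic" if confidence >= threshold else "manual_review"
    return label, confidence, route

# Hypothetical model outputs for two galaxies.
confident = route_prediction({"elliptical": 0.92, "spiral": 0.05, "other": 0.03})
uncertain = route_prediction({"elliptical": 0.45, "spiral": 0.40, "other": 0.15})
```

In practice the threshold would be tuned against the volume of manual review the team could absorb.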
