CDMC 2016 Data Mining Competition Tasks: Android Malware Classification

http://www.csmining.org/cdmc2016/

Data Mining Tasks Description

Task 1: 2016 e-News categorisation

This year, the dataset is sourced from six online news outlets:

The New Zealand Herald (www.nzherald.co.nz), Reuters (www.reuters.com), The Times (www.timesonline.co.uk), Yahoo News (news.yahoo.com), BBC (www.bbc.co.uk) and The Press (www.stuff.co.nz).

The five selected news categories are business, entertainment, sport, technology, and travel. Each document in the dataset was labelled manually by skimming the text and determining its category. In the provided data files, each news item is formatted as one line of plain text whose last character is the class label (for training data); all punctuation and symbols were removed when the data was prepared.
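As an illustration of the line format described above, the following minimal sketch separates a training line into its text and its label. It assumes only what the description states (the label is the final character of the line); the function name `split_news_line` and the sample line are hypothetical, and whether a space precedes the label in the real files is not specified, so trailing whitespace is simply stripped from the text.

```python
def split_news_line(line: str) -> tuple[str, str]:
    """Split one training line into (text, label).

    Assumes the class label is the last character of the line
    (ignoring the line terminator), as the task description states.
    """
    stripped = line.rstrip("\r\n")
    if not stripped:
        raise ValueError("empty line has no label")
    text = stripped[:-1].rstrip()  # drop any separator before the label
    label = stripped[-1]
    return text, label


# Made-up line standing in for an encrypted news item.
sample_text, sample_label = split_news_line("qzv kdw plmx rrt 3\n")
print(sample_label, "|", sample_text)
```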

Note that the dataset text is encrypted for fair play, and this task is not intended as a decryption exercise. Any use of decryption techniques is therefore prohibited and must be avoided in the methods used for the competition. The results of any participant found to have engaged in this misconduct will be declared void.

The statistical information of the training dataset is summarised below:

Topic No. of News
Business 361
Entertainment 343
Sport 363
Technology 356
Travel 362

Task 2: UniteCloud Operation Log for Anomaly Detection

UniteCloud is a resilient private cloud infrastructure created at Unitec Institute of Technology in New Zealand, using OpenNebula for cloud orchestration and KVM for virtualization.

This dataset consists of operational data captured from a live UniteCloud server, sampled at 1-minute intervals. Each sample has 243 features, which correspond to operational measurements from 243 sensors on the UniteCloud servers. The file is labelled according to the anomalous events and anomaly categories determined from the collected log data. The supplied training dataset contains 57,654 samples, each with 243 sensor values, and the non-zero labels in the last column indicate the seven anomalous events.

The goal of this task is to identify various abnormal events accurately from ranges of sensor log files without high computational costs.
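The layout described for the training file (243 sensor values followed by a label in the last column, with 0 meaning normal) can be read with a sketch like the one below. The delimiter is not stated in the task description, so comma-separated values are an assumption here, and the helper name `load_labelled_samples` and the three-feature demo rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def load_labelled_samples(handle, n_features=243):
    """Read rows of n_features numeric values followed by one integer label."""
    features, labels = [], []
    for row in csv.reader(handle):  # assumption: comma-separated values
        if not row:
            continue  # skip blank lines
        if len(row) != n_features + 1:
            raise ValueError(f"expected {n_features + 1} columns, got {len(row)}")
        features.append([float(value) for value in row[:-1]])
        labels.append(int(float(row[-1])))
    return features, labels


# Tiny made-up sample with 3 features instead of 243, for illustration only.
demo = io.StringIO("0.1,0.2,0.3,0\n1.5,0.0,2.5,4\n0.2,0.2,0.1,0\n")
X, y = load_labelled_samples(demo, n_features=3)
label_counts = Counter(y)                          # samples per class
anomalous = sum(1 for label in y if label != 0)    # non-zero labels are anomalies
print(label_counts, anomalous)
```

Counting the labels first is a cheap way to see how imbalanced the normal and anomalous classes are before choosing a classifier.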

The statistical information of this dataset is summarized as:

No. of Samples No. of Features No. of Classes No. of Training No. of Testing
82,363 243 8 57,654 24,709

Task 3: Android Malware Classification

This dataset was created from a set of APK (application package) files collected from the Opera Mobile Store between January and September 2014. Just as Windows (PC) systems use .exe files to install software, Android uses APK files to install software on the Android operating system.

The permission system restricts access to privileged system resources and is considered the first barrier to malware. Application developers have to declare the permissions explicitly in the AndroidManifest.xml file contained in the APK. All official Android permissions are categorized into four types: Normal, Dangerous, Signature and SignatureOrSystem. As dangerous permissions grant access to restricted resources and can have a negative impact if used incorrectly, they require the user's approval at installation.

To serve as input to a machine-learning algorithm, permissions are commonly coded as binary variables, i.e., each element of the vector can take only two values: 1 for a requested permission and 0 otherwise. The number of possible Android permissions varies with the version of the OS. In this task, for each APK file under consideration, we provide a list of the permissions declared in its AndroidManifest.xml file. The class label of the APK file (+1 if it is regarded as malicious and -1 otherwise) is determined by the detection results of security appliances hosted by VirusTotal. Note that adware was not counted as malware in our setting. Participants in the CDMC 2016 competition are invited to design a classifier that best matches this result.
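The binary coding described above can be sketched as follows. This is only an illustration under assumptions: the format of the supplied permission files is not described here, so each app is represented as a plain Python list of permission names, and the helper names `build_vocabulary` and `encode_permissions` are hypothetical.

```python
def build_vocabulary(permission_lists):
    """Map every permission seen in the data to a fixed vector position."""
    vocabulary = sorted({perm for perms in permission_lists for perm in perms})
    return {perm: position for position, perm in enumerate(vocabulary)}


def encode_permissions(permissions, index):
    """Return a 0/1 vector: 1 where the app requests the permission."""
    vector = [0] * len(index)
    for perm in permissions:
        position = index.get(perm)
        if position is not None:  # permissions unseen in training are ignored
            vector[position] = 1
    return vector


# Made-up apps for illustration; the permission names are real Android ones.
apps = [
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    ["android.permission.INTERNET"],
]
index = build_vocabulary(apps)
vectors = [encode_permissions(perms, index) for perms in apps]
print(vectors)
```

Building the vocabulary from the training set only, and ignoring unseen permissions at test time, keeps the vector length fixed even though the set of possible permissions varies with the OS version.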

The statistical information of the dataset is summarized as:

No. of APK files No. of Permissions No. of Classes No. of Training No. of Testing
61,730 up to 583 2 30,920 30,810

MD5 hashes are also provided in case you need to verify the files:
CDMC2016_AndroidPermissions.Train, md5(473f64d9e650e82325b1ce0216cc50c9)
CDMC2016_AndroidLabels.Train, md5(784b2ce7da61ff2935dca770c4bcbfb3)
CDMC2016_AndroidPermissions.Test, md5(192c70a8489c41fa95f5b95732fcdfb1)
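A downloaded file can be checked against the hashes listed above with Python's standard `hashlib` module. The sketch below assumes the files sit in the current working directory under the names given above; the helper names `md5_of_file` and `verify_file` are hypothetical.

```python
import hashlib
import os

# Hashes copied from the list above.
EXPECTED_MD5 = {
    "CDMC2016_AndroidPermissions.Train": "473f64d9e650e82325b1ce0216cc50c9",
    "CDMC2016_AndroidLabels.Train": "784b2ce7da61ff2935dca770c4bcbfb3",
    "CDMC2016_AndroidPermissions.Test": "192c70a8489c41fa95f5b95732fcdfb1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()


for name, expected in EXPECTED_MD5.items():
    if os.path.exists(name):  # assumption: files are in the working directory
        print(name, "OK" if verify_file(name, expected) else "MISMATCH")
    else:
        print(name, "not found")
```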

Time: 2024-12-23 22:23:42
