Machine learning sample labeling: example code

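Goal: label samples according to the distribution of each field's values (for example, the top 10 srcIP and dstIP values) together with other features, so that every sample ends up in one of the classes black, white, ddos, mddos, cdn, or unknown.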
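As a rough illustration of what "the distribution of a field" means here, the sketch below counts the most frequent values of one field across a handful of records. It assumes `^`-separated records with 1-based field numbers (sourceIP is field 3 and destIP is field 4 in the field table used by the code further down). The helper `top_values` and the sample records are hypothetical, and the sketch uses only the standard library rather than pandas.

```python
from collections import Counter


def top_values(lines, field_no, n=10):
    """Return the n most common values of a 1-based field in '^'-separated records."""
    counts = Counter(line.rstrip("\n").split("^")[field_no - 1] for line in lines)
    return counts.most_common(n)


# Made-up records: only fields 3 (sourceIP) and 4 (destIP) carry real-looking values.
records = [
    "a^b^10.0.0.1^8.8.8.8",
    "a^b^10.0.0.1^8.8.4.4",
    "a^b^10.0.0.2^8.8.8.8",
]
print(top_values(records, 3))
print(top_values(records, 4, n=1))
```

A field whose top value accounts for most of the records is the kind of skew a person labeling the data would look at first.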

Sample output:

-------------choose one--------------
sub domain: DNSQueryName(N)
ip: srcip(S) or dstip(D)
length: DNSRequestLength(R1) or DNSReplyLength(R2)
length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)
port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)
code: DNSReplyCode(C2) or DNSRequestRRType(C1)
other: DNSRRClass(RR) or DNSReplyIPv4(V)
-------------label or quit------------
black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)
next(Q) or exit(E)?
***************************************
domain: workgroup. flow count: 206
***************************************
------------srcip-----------------
count                 206
unique                  9
top       162.105.129.122
freq                  150
Name: sourceIP, dtype: object
--------------destip---------------
count             206
unique             12
top       199.7.83.42
freq               82
Name: destIP, dtype: object
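The four rows in each block above are what pandas `Series.describe()` reports for a non-numeric column: the number of values, the number of distinct values, the most common value, and how often that value occurs. The sketch below recomputes the same four numbers with the standard library for a made-up list of addresses. `summarize` is a hypothetical helper, not part of the script.

```python
from collections import Counter


def summarize(values):
    """Mimic the count/unique/top/freq rows shown for a non-numeric column."""
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]
    return {"count": len(values), "unique": len(counts), "top": top, "freq": freq}


# Made-up addresses; the real sourceIP column in the sample has 206 values.
source_ips = ["162.105.129.122"] * 3 + ["10.0.0.1", "10.0.0.2"]
summary = summarize(source_ips)
print(summary)
# Share of flows that come from the most common address.
print(summary["freq"] / summary["count"])
```

Read this way, the sample output says that 150 of the 206 flows for this domain (about 73%) come from a single source address.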

Code:

import sys
import json
import os
import pandas as pd
import tldextract

# 1-based field numbers of the "^"-separated metadata records we care about.
metadata_field = '''
3 = sourceIP
4 = destIP
5 = sourcePort
6 = destPort
7 = protocol
12 = flowStartSeconds
13 = flowEndSecond
54 = DNSReplyCode
55 = DNSQueryName
56 = DNSRequestRRType
57 = DNSRRClass
58 = DNSDelay
59 = DNSReplyTTL
60 = DNSReplyIPv4
61 = DNSReplyIPv6
62 = DNSReplyRRType
77 = DNSReplyName
81 = payload
88 = DNSRequestLength
89 = DNSRequestErrLength
90 = DNSReplyLength
91 = DNSReplyErrLength
'''

metadata_field_num = []
metadata_field_info = []
for line in metadata_field.split("\n"):
    if not line:
        continue
    num, info = line.split(" = ")
    metadata_field_num.append(int(num) - 1)  # 1-based field number -> 0-based column index
    metadata_field_info.append(info)
print(metadata_field_num)
print(metadata_field_info)

BASE_DIR = "/home/bonelee/latest_metadata_sample"
HISTORY_FILE = "history_op.json"

# Menu key -> output directory for that label.
LABEL_DIRS = {
    "B": "labeled_black",
    "W": "labeled_white",
    "L": "labeled_white_like",
    "CDN": "labeled_cdn",
    "DDOS": "labeled_ddos",
    "M": "labeled_mddos",
    "U": "labeled_unknown",
}

# Menu key -> column whose value distribution should be printed.
DIST_FIELDS = {
    "R1": "DNSRequestLength",
    "R2": "DNSReplyLength",
    "R3": "DNSRequestErrLength",
    "R4": "DNSReplyErrLength",
    "P1": "sourcePort",
    "P2": "destPort",
    "T": "DNSReplyTTL",
    "C2": "DNSReplyCode",
    "C1": "DNSRequestRRType",
    "RR": "DNSRRClass",
    "V": "DNSReplyIPv4",
    "S": "sourceIP",
    "D": "destIP",
    "N": "DNSQueryName",
}


def extract_domain(domain):
    """Return "<domain>.<suffix>" for a query name, or the suffix alone if there is no domain part."""
    try:
        ext = tldextract.extract(domain)
        if ext.domain == "":
            return ext.suffix
        # Attribute access avoids depending on the result being a tuple.
        # A name without a known suffix keeps a trailing dot, e.g. "workgroup.".
        return ".".join((ext.domain, ext.suffix))
    except Exception as e:
        print("extract_domain error:", e)
        return "unknown"


def parse_metadata(path):
    """Load one metadata file and group its flows by registered domain."""
    df = pd.read_csv(path, sep="^", header=None)
    dns_df = df.iloc[:, metadata_field_num].copy()
    dns_df.columns = metadata_field_info
    dns_df["mdomain"] = dns_df["DNSQueryName"].apply(extract_domain)
    return dns_df.groupby("mdomain")


def get_data_dist(df, col="sourceIP"):
    """Print the number of flows per value of `col`, then the 10 most common values."""
    size = df.groupby(by=col).size()
    print(size)
    print("-----------top 10-------------")
    print(size.nlargest(10))


def move_to(srcpath, domain, dst_path):
    """Copy every record of `domain` from srcpath into dst_path."""
    with open(dst_path, "w") as w, open(srcpath) as r:
        for line in r:
            # DNSQueryName is field 55 (1-based). The caller passes a lowercased
            # domain, so lowercase this side of the comparison as well.
            if extract_domain(line.split("^")[55 - 1]).lower() == domain:
                w.write(line)


def save_history(history_op):
    """Persist the domain -> label decisions so they can be suggested next time."""
    with open(HISTORY_FILE, "w") as f:
        json.dump(history_op, f)
    print("saved history_op.json")


def print_menu():
    print("-------------choose one--------------")
    print("sub domain: DNSQueryName(N)")
    print("ip: srcip(S) or dstip(D)")
    print("length: DNSRequestLength(R1) or DNSReplyLength(R2)")
    print("length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)")
    print("port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)")
    print("code: DNSReplyCode(C2) or DNSRequestRRType(C1)")
    print("other: DNSRRClass(RR) or DNSReplyIPv4(V)")
    print("-------------label or quit------------")
    print("black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)")
    print("next(Q) or exit(E)?")


def label_domain(path, day, hour, domain, group, history_op):
    """Interactively label the flows of one domain; returns when the user moves on."""
    domain = domain.lower()
    has_judged = False
    need_break = False  # True once a label was chosen automatically or from history
    while True:
        print_menu()
        if domain.endswith(("win", "site", "vip")):
            check = "U"
            need_break = True
        elif any(key in domain for key in ("lan", "local", "dhcp", "workgroup", "home")):
            check = "DDOS"
            need_break = True
        elif "cdn" in domain:
            check = "CDN"
            need_break = True
        elif domain in history_op and not has_judged:
            print("found history op:", history_op[domain])
            if not input("OK(Enter for Y)?"):
                check = history_op[domain]
                need_break = True
            else:
                check = input("Input:")
        else:
            check = input("Input:")
        has_judged = True

        if check == "Q":
            print(path, "next OK!")
            break
        elif check == "E":
            print(path, "Exit!")
            save_history(history_op)
            sys.exit()
        elif check in LABEL_DIRS:
            dst_path = "%s/%s/2017-8-%d-%d-%s.txt" % (BASE_DIR, LABEL_DIRS[check], day, hour, domain)
            move_to(path, domain, dst_path)
            history_op[domain] = check
            print("Saved OK!")
            if need_break:
                break
        elif check in DIST_FIELDS:
            get_data_dist(group, DIST_FIELDS[check])
        else:
            print("unknown input! Choose one of the following:")


def main():
    history_op = {}
    if os.path.exists(HISTORY_FILE):
        with open(HISTORY_FILE) as h:
            history_op = json.load(h)
            print(history_op)
    for day in range(15, 17):
        for hour in range(0, 24):
            path = "%s/black_all/black-medata_wanted-2017-08-%d-%d.txt" % (BASE_DIR, day, hour)
            print(path, "running...")
            try:
                domains_info = parse_metadata(path)
            except IOError as e:
                print(e)
                continue
            for domain, group in domains_info:
                print("***************************************")
                print("domain:", domain, "flow count:", len(group))
                print("***************************************")
                print("------------srcip-----------------")
                print(group["sourceIP"].describe())
                print("--------------destip---------------")
                print(group["destIP"].describe())
                label_domain(path, day, hour, domain, group, history_op)
            print("*******************************")
            print(path, "check over...")
            print("*******************************")
    # Also keep the decisions when every file has been processed without an explicit exit.
    save_history(history_op)


if __name__ == "__main__":
    main()
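
The automatic shortcuts in the labeling loop (the suffix and keyword checks that assign a label without prompting) are easier to check when pulled out into a pure function. The sketch below is one hypothetical way to do that. `auto_label` is not part of the script above, and its rules are copied from the script's `if`/`elif` chain.

```python
def auto_label(domain):
    """Return the label assigned without prompting, or None if the user would be asked."""
    domain = domain.lower()
    # Suffix check: names ending in these strings are labeled unknown ("U").
    if domain.endswith(("win", "site", "vip")):
        return "U"
    # Keyword check: names that look like local-network names are labeled "DDOS".
    if any(key in domain for key in ("lan", "local", "dhcp", "workgroup", "home")):
        return "DDOS"
    if "cdn" in domain:
        return "CDN"
    return None


for name in ("example.win", "workgroup.", "static.mycdn.net", "example.com", "planet.com"):
    print(name, "->", auto_label(name))
```

Because the keyword test is a substring match, a name such as planet.com also matches "lan" and is labeled DDOS, so these rules are worth reviewing before the labels are used for training.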
Time: 2024-10-09 19:30:57
