Text analysis with awk + sort + uniq

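1. The uniq command
uniq - report or omit repeated lines
Overview: uniq reads a text file or standard input and detects repeated lines, comparing each line only with the line before it. It is commonly used in system troubleshooting and log analysis.
Command format:
uniq [OPTION]... [File1 [File2]]
uniq removes adjacent duplicate lines from File1 and writes the result to standard output or to File2. It is usually used as a filter in a pipeline.
Because only adjacent lines are compared, the input must be sorted first if every duplicate is to be removed. Run without options, uniq prints one copy of each run of identical adjacent lines.
Common options:
-c, --count prefix lines by the number of occurrences (deduplicate, then show how many times each line occurred)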
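
The two calling forms described above, with an explicit output file and as a pipeline filter, can be sketched as follows. This is a minimal illustration rather than part of the original walkthrough: the file names input.txt, sorted.txt and deduped.txt and the sample lines are hypothetical, and everything is written to a temporary directory.

```shell
# Work in a throwaway directory so no existing file is overwritten.
workdir=$(mktemp -d)

# Hypothetical sample data: the duplicates are not adjacent yet.
printf '%s\n' b a b a c > "$workdir/input.txt"

# Sort first so that identical lines become adjacent.
sort "$workdir/input.txt" > "$workdir/sorted.txt"

# Form "uniq File1 File2": read sorted.txt, write the unique lines to deduped.txt.
uniq "$workdir/sorted.txt" "$workdir/deduped.txt"

# Filter form: the same data through a pipeline, this time with -c for counts.
counts=$(sort "$workdir/input.txt" | uniq -c)

cat "$workdir/deduped.txt"
printf '%s\n' "$counts"
```

The width of the count column printed by uniq -c differs between implementations, so scripts that parse it are safer reading the fields with awk than relying on fixed columns.
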
2. Hands-on practice

Test data:

# cat uniq.txt
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.9

a. Given a file and no options, uniq removes only adjacent duplicate lines:

# uniq uniq.txt
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.8
10.0.0.9

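b. Use sort to make duplicate lines adjacent, then pipe the result to uniq to remove every duplicate (sort -u achieves the same in one step):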

# sort uniq.txt
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.8
10.0.0.9
10.0.0.9
# sort -u uniq.txt
10.0.0.7
10.0.0.8
10.0.0.9
# sort uniq.txt | uniq
10.0.0.7
10.0.0.8
10.0.0.9

c. Combine sort with uniq -c to deduplicate and count:

# sort uniq.txt | uniq -c
      2 10.0.0.7
      3 10.0.0.8
      2 10.0.0.9
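
A natural next step is to rank the counted lines, for example to find the most frequent address. The sketch below recreates the test data in a temporary file (the variable names data and top are only for illustration) and appends a second, numeric reverse sort on the count column.

```shell
# Recreate the test data from this section in a temporary file.
data=$(mktemp)
printf '%s\n' 10.0.0.9 10.0.0.8 10.0.0.7 10.0.0.7 10.0.0.8 10.0.0.8 10.0.0.9 > "$data"

# Count each address, sort numerically (-n) in reverse (-r) on the leading
# count, and keep only the first line, which is the most frequent address.
top=$(sort "$data" | uniq -c | sort -rn | head -n 1)
printf '%s\n' "$top"
```

The first sort groups identical lines so that uniq -c can count them; the second sort orders the counts. Lines with equal counts may appear in either order, so only the clear winner is reliable here.
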

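3. Real-world case
Process the file below: extract the domain name from each line, count how often each domain occurs, and sort the result by count (an interview question at Baidu and Sohu).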

# cat access.log
http://www.etiantian.org/index.html
http://www.etiantian.org/1.html
http://post.etiantian.org/index.html
http://mp3.etiantian.org/index.html
http://www.etiantian.org/3.html
http://post.etiantian.org/2.html

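Answer:
Analysis: this is one of the most common kinds of task in operations work. The same pattern extends to log analysis, counting connections in each TCP state, ranking IPs by number of connections, and so on.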

# awk -F '[/]+' '{print $2}' access.log | sort | uniq -c | sort -rn -k1
      3 www.etiantian.org
      2 post.etiantian.org
      1 mp3.etiantian.org
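
The analysis above mentions counting connections per TCP state and ranking IPs by connection count. The following sketch applies the same awk, sort, uniq -c, sort -rn pattern to both. The input is hypothetical sample text shaped like "netstat -ant" output, not a capture from a real host, and the IP extraction assumes IPv4 addresses in "ip:port" form.

```shell
# Hypothetical sample resembling "netstat -ant" output; on a real host this
# text would come from netstat or ss instead.
sample='tcp 0 0 10.0.0.7:80 10.0.0.9:51234 ESTABLISHED
tcp 0 0 10.0.0.7:80 10.0.0.8:51235 TIME_WAIT
tcp 0 0 10.0.0.7:80 10.0.0.8:51236 ESTABLISHED
tcp 0 0 10.0.0.7:22 10.0.0.8:51237 ESTABLISHED
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN'

# Connections per TCP state: the state is the last field of each tcp line.
states=$(printf '%s\n' "$sample" | awk '/^tcp/ {print $NF}' | sort | uniq -c | sort -rn)

# Connections per remote IP: field 5 is "ip:port", so split on ":" and keep
# the IP part. The LISTEN line is skipped because it has no real peer.
peers=$(printf '%s\n' "$sample" |
  awk '/^tcp/ && $NF != "LISTEN" {split($5, a, ":"); print a[1]}' |
  sort | uniq -c | sort -rn)

printf '%s\n' "$states"
printf '%s\n' "$peers"
```

Only the awk stage changes between the two questions; the counting and ranking stages stay identical, which is why this pipeline is worth memorizing.
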

Date: 2024-10-16 01:06:51
