How To Use the AWK language to Manipulate Text in Linux

https://www.digitalocean.com/community/tutorials/how-to-use-the-awk-language-to-manipulate-text-in-linux

Introduction

Linux utilities often follow the Unix philosophy of design. Tools are encouraged to be small, use plain text files for input and output, and operate in a modular manner. Because of this legacy, we have great text processing functionality with tools like sed and awk.

In this guide, we will discuss awk. Awk is both a programming language and text processor that can be used to manipulate text data in very useful ways. We will be discussing this on an Ubuntu 12.04 VPS, but it should operate the same on any modern Linux system.

Basic Syntax

The awk command is included by default on virtually all modern Linux systems, so we do not need to install it to begin using it.

Awk is most useful when handling text files that are formatted in a predictable way. For instance, it is excellent at parsing and manipulating tabular data. It operates on a line-by-line basis and iterates through the entire file.

By default, it uses whitespace (spaces, tabs, etc.) to separate fields. Luckily, many configuration files on your Linux system use this format.

The basic format of an awk command is:

awk '/search_pattern/ { action_to_take_on_matches; another_action; }' file_to_parse

You can omit either the search portion or the action portion from any awk command. By default, the action taken if the "action" portion is not given is "print". This simply prints all lines that match.

If the search portion is not given, awk performs the action listed on each line.

If both are given, awk uses the search portion to decide whether the current line matches the pattern, and then performs the actions on the lines that match.
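As a small sketch of these three cases, the commands below run awk against a made-up three-line input. The sample text and the shell variable names are assumptions for illustration only and are not part of the original tutorial.

```shell
# Hypothetical sample input: three short lines of text.
sample='apple red
banana yellow
cherry red'

# Pattern only: the default action prints every matching line.
pattern_only=$(printf '%s\n' "$sample" | awk '/red/')

# Action only: the action runs on every line.
action_only=$(printf '%s\n' "$sample" | awk '{ print $1 }')

# Pattern and action: the action runs only on matching lines.
both=$(printf '%s\n' "$sample" | awk '/red/ { print $1 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```

The first command prints the two full lines containing "red", the second prints the first word of every line, and the third prints the first word of only the "red" lines.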

Simple Uses

In its simplest form, we can use awk like cat to simply print all lines of a text file out to the screen.

Let's print out our server's fstab file, which lists the filesystems that it knows about:

awk '{print}' /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
#
proc            /proc           proc    nodev,noexec,nosuid 0       0
# / was on /dev/vda1 during installation
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd /               ext4    noatime,errors=remount-ro 0       1

This isn't very useful. Let's try out awk's search filtering capabilities:

awk '/UUID/' /etc/fstab
# device; this may be used with UUID= as a more robust way to name devices
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd /               ext4    noatime,errors=remount-ro 0       1

As you can see, awk now only prints the lines that have "UUID" in them. We can get rid of the extraneous comment line by specifying that UUID must be located at the very beginning of the line:

awk '/^UUID/' /etc/fstab
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd /               ext4    noatime,errors=remount-ro 0       1

Similarly, we can use the action section to specify which pieces of information we want to print. For instance, to print only the first column, we can type:

awk '/^UUID/ {print $1;}' /etc/fstab
UUID=b96601ba-7d51-4c5f-bfe2-63815708aabd

We can reference every column (as delimited by whitespace) through a variable named after its column number. The first column can be referenced by $1, for instance. The entire line can be referenced by $0.
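To make the field variables concrete, here is a minimal sketch that feeds awk a single made-up line. The record contents and the shell variable names are assumptions chosen for illustration.

```shell
# Hypothetical one-line record with three whitespace-separated fields.
line='UUID=1234 / ext4'

# $1 is the first field.
first=$(printf '%s\n' "$line" | awk '{ print $1 }')

# Fields can be printed in any order; a comma in print inserts a space.
swapped=$(printf '%s\n' "$line" | awk '{ print $3, $1 }')

# $0 is the whole line, unchanged.
whole=$(printf '%s\n' "$line" | awk '{ print $0 }')

printf '%s\n' "$first" "$swapped" "$whole"
```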

Awk Internal Variables and Expanded Format

Awk uses some internal variables to assign certain pieces of information as it processes a file.

The internal variables that awk uses are:

  • FILENAME: References the current input file.
  • FNR: References the number of the current record relative to the current input file. For instance, if you have two input files, this would tell you the record number of each file instead of as a total.
  • FS: The current field separator used to denote each field in a record. By default, this is set to whitespace.
  • NF: The number of fields in the current record.
  • NR: The number of the current record.
  • OFS: The field separator for the outputted data. By default, this is a single space.
  • ORS: The record separator for the outputted data. By default, this is a newline character.
  • RS: The record separator used to distinguish separate records in the input file. By default, this is a newline character.

We can change the values of these variables at will to match the needs of our files. Usually we do this during the initialization phase of our awk processing.
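The sketch below shows a few of these variables in action. It assumes two small throwaway files created in a temporary directory; the file names and contents are made up for the example.

```shell
# Two hypothetical input files created in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/one.txt"
printf 'f\n' > "$tmpdir/two.txt"

# NR counts records across all files, FNR restarts for each file,
# and NF is the number of fields in the current record.
report=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
printf '%s\n' "$report"

# Setting OFS in BEGIN changes the separator that print places
# between its comma-separated arguments.
joined=$(awk 'BEGIN { OFS="-" } { print $1, $2 }' "$tmpdir/one.txt")
printf '%s\n' "$joined"

rm -r "$tmpdir"
```

For the single line in two.txt, NR is 3 because it is the third record overall, while FNR is 1 because it is the first record of that file.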

This brings us to another important concept. Awk syntax is actually slightly more complex than what we showed initially. There are also optional BEGIN and END blocks that can contain commands to execute before and after the file processing, respectively.

This makes our expanded syntax look something like this:

awk 'BEGIN { action; }
/search/ { action; }
END { action; }' input_file

The BEGIN and END keywords are actually just specific sets of conditions just like the search parameters. They match before and after the document has been processed.

This means that we can change some of the internal variables in the BEGIN section. For instance, the /etc/passwd file is delimited with colons (:) instead of whitespace. If we wanted to print out the first column of this file, we could type:

sudo awk 'BEGIN { FS=":"; }
{ print $1; }' /etc/passwd
root
daemon
bin
sys
sync
games
man
. . .

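As a further sketch of the three-part structure, the example below uses two made-up records in the style of /etc/passwd instead of the real file, so the result does not depend on the system it runs on. The records and variable names are assumptions for illustration.

```shell
# Hypothetical colon-delimited records in the style of /etc/passwd.
records='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# BEGIN runs before any input is read, the middle block runs once per
# record, and END runs after the last record, when NR holds the total.
summary=$(printf '%s\n' "$records" | awk 'BEGIN { FS=":"; print "Users:" }
{ print $1 }
END { print "Total: " NR }')
printf '%s\n' "$summary"

# The standard -F option is a shorthand for setting FS in a BEGIN block.
short=$(printf '%s\n' "$records" | awk -F: '{ print $1 }')
printf '%s\n' "$short"
```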
We can use the BEGIN and END blocks to print simple information about the fields we are printing:

sudo awk 'BEGIN { FS=":"; print "User\t\tUID\t\tGID\t\tHome\t\tShell\n--------------"; }
{print $1,"\t\t",$3,"\t\t",$4,"\t\t",$6,"\t\t",$7;}
END { print "---------\nFile Complete" }' /etc/passwd
User        UID     GID     Home        Shell
--------------
root         0       0       /root       /bin/bash
daemon       1       1       /usr/sbin       /bin/sh
bin          2       2       /bin        /bin/sh
sys          3       3       /dev        /bin/sh
sync         4       65534       /bin        /bin/sync
. . .
---------
File Complete

As you can see, we can format things quite nicely by taking advantage of some of awk's features.
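Tab-separated columns can drift out of alignment when field lengths vary. One possible alternative, not used in the original tutorial, is awk's printf statement, which pads each value to a fixed width. The sketch below assumes two made-up passwd-style records and arbitrary column widths of 8 and 5 characters.

```shell
# Hypothetical colon-delimited records in the style of /etc/passwd.
records='root:x:0:0:root:/root:/bin/bash
sync:x:4:65534:sync:/bin:/bin/sync'

# %-8s left-justifies a string in 8 columns; %5s right-justifies in 5.
# Unlike print, printf does not add a newline, so "\n" is written explicitly.
table=$(printf '%s\n' "$records" | awk 'BEGIN { FS=":"; printf "%-8s %5s %5s\n", "User", "UID", "GID" }
{ printf "%-8s %5s %5s\n", $1, $3, $4 }')

printf '%s\n' "$table"
```

Because every row uses the same format string, the UID and GID columns line up regardless of how long each value is, as long as it fits within the chosen width.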

Each of the expanded sections is optional. In fact, the main action section itself is optional if another section is defined. We can do things like this:

awk 'BEGIN { print "We can use awk like the echo command"; }'
We can use awk like the echo command

Awk Field Searching and Compound Expressions

In one of the examples above, we printed the line in the /etc/fstab file that began with "UUID". This was easy because we were looking for the beginning of the entire line.

What if we wanted to find out if a search pattern matched at the beginning of a field instead?

We can create a favorite_food.txt file which lists an item number and the favorite foods of a group of friends:

echo "1 carrot sandy
2 wasabi luke
3 sandwich brian
4 salad ryan
5 spaghetti jessica" > favorite_food.txt

If we want to find all foods from this file that begin with "sa", we might begin by trying something like this:

awk '/sa/' favorite_food.txt
1 carrot sandy
2 wasabi luke
3 sandwich brian
4 salad ryan

Here, we are matching any instance of "sa" anywhere in the line. This does not exclude things like "wasabi", which has the pattern in the middle, or "sandy", which is not in the column we want. We are only interested in words beginning with "sa" in the second column.

We can tell awk to only match at the beginning of the second column by using this command:

awk '$2 ~ /^sa/' favorite_food.txt
3 sandwich brian
4 salad ryan

As you can see, this allows us to only search at the beginning of the second column for a match.

The "^" character tells awk to limit its search to the beginning of the field. The "$2 ~" part specifies that the pattern should only be tested against the second column.
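The same operator works on any field and can be paired with an action. As a small sketch, the commands below reuse the favorite foods data as an inline string (so no file is needed) and print only the matching field; the shell variable names are assumptions for illustration.

```shell
# The same data as favorite_food.txt, held in a shell variable.
foods='1 carrot sandy
2 wasabi luke
3 sandwich brian
4 salad ryan
5 spaghetti jessica'

# Test the third field (the name) instead of the second.
names=$(printf '%s\n' "$foods" | awk '$3 ~ /^sa/ { print $3 }')

# Test the second field and print only the food, not the whole line.
sa_foods=$(printf '%s\n' "$foods" | awk '$2 ~ /^sa/ { print $2 }')

printf '%s\n' "$names" "$sa_foods"
```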

We can just as easily search for things that do not match by including the "!" character before the tilde (~). This command will return all lines that do not have a food that starts with "sa":

awk '$2 !~ /^sa/' favorite_food.txt
1 carrot sandy
2 wasabi luke
5 spaghetti jessica

If we decide later on that we are only interested in lines where the above is true and the item number is less than 5, we could use a compound expression like this:

awk '$2 !~ /^sa/ && $1 < 5' favorite_food.txt
1 carrot sandy
2 wasabi luke

This introduces two new things. The first is the && operator, which adds a further requirement that a line must meet in order to match. Using it, you can combine an arbitrary number of conditions.

The second is the numeric comparison: here, $1 < 5 checks that the value of the first column is less than 5.
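Conditions can also be combined with the || (or) operator, which the tutorial does not cover. The sketch below is an illustration under that assumption, again using the favorite foods data as an inline string with made-up variable names.

```shell
# The same data as favorite_food.txt, held in a shell variable.
foods='1 carrot sandy
2 wasabi luke
3 sandwich brian
4 salad ryan
5 spaghetti jessica'

# ||: keep lines where the food starts with "sa" OR the item number is 1.
either=$(printf '%s\n' "$foods" | awk '$2 ~ /^sa/ || $1 == 1')

# &&: keep item numbers from 2 through 4 and print only the food.
range=$(printf '%s\n' "$foods" | awk '$1 >= 2 && $1 <= 4 { print $2 }')

printf '%s\n' "$either" "$range"
```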

Conclusion

By now, you should have a basic understanding of how awk can manipulate, format, and selectively print text files. Awk is a much larger topic though, and is actually an entire programming language complete with variable assignment, control structures, built-in functions, and more. It can be used in scripts to easily format text in a reliable way.

To learn more about how to work with awk, check out the great online resources for awk, and more relevantly, gawk, the GNU version of awk present on modern Linux distributions.

By Justin Ellingwood

Original address: https://www.cnblogs.com/zjbfvfv/p/10337876.html
