The Power of AWK


This next command is very powerful, and yet I don’t fully understand its power.

The command is awk.

Awk is really good at processing lines of a file.
This might sound familiar. After all, we did this using the sed command.

However, awk works by scanning each line of a file and matching it against a program that you specify, to decide what to do.

The type of file that awk works well with is data.

Data that has a pattern, like TSV, CSV and log files.
Any file that has a separator between the values (tabs for TSV and commas for CSV).

This article makes use of a collection of random files. These files were put together to help you “Tweak Your Terminal”. Check out our guide to setting up the random files.

An example of a TSV file is salaries.tsv.
This file has rows of data that could easily be imported into a spreadsheet.

cat data/salaries.tsv
Name Dept Salary
Stacey sales 34698.22
Sidney payroll 39117.12
Leola payroll 31314.98

Note: You will find these files in random-files-master/data/

At its simplest, we can make awk simulate the cat command with a program of just the {print} action.

awk '{print}' data/salaries.tsv | head
Name Dept Salary
Stacey sales 34698.22
Sidney payroll 39117.12
Leola payroll 31314.98

The program consists of a pattern and an {action}.
You can specify either or both.
However, if you don’t specify either, then nothing will be output.

awk '' data/salaries.tsv

AWK Actions

Actions do something with the data.
Mostly it is outputting it, but it could also perform a change to it.

We’ve seen one action so far, print.
The action '{print}' will print out a whole line.

When we used '{print}', this was shorthand for '{print $0}'.
The $0 is a variable that is equivalent to the whole line.

We can also refer to each column’s value individually.

By default, awk will separate the values on any white-space.
The Name column would be $1, the Dept column $2, and so on.
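
For example, a quick sketch (using the same sample file as above) that prints just the Name column:

awk '{ print $1 }' data/salaries.tsv | head -n 4
Name
Stacey
Sidney
Leola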

The earlier {print $0} would therefore be the equivalent of { print $1 "\t" $2 "\t" $3 }

We are taking each of the individual pieces of data and separating them with a single tab.

Note: If we didn’t include a character between the variables, then awk would output the variables without any white-space between them, which makes reading them difficult.
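
For instance, dropping the separator (again on the sample file) squashes the columns together:

awk '{ print $1 $2 }' data/salaries.tsv | head -n 4
NameDept
Staceysales
Sidneypayroll
Leolapayroll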

If we wanted to reorder the columns, as Name, Salary and Dept, then we can.

awk '{ print $1 "\t" $3 "\t" $2 }' data/salaries.tsv | head
Name Salary Dept
Stacey 34698.22 sales
Sidney 39117.12 payroll
Leola 31314.98 payroll

Note: We need to add a separator in, otherwise the columns will not be separated.

AWK Patterns

If we specify a pattern, then the action will default to print.

In this example, we will specify the pattern, so we won’t need the action.

awk '1 == 1' data/salaries.tsv | head
Name Dept Salary
Stacey sales 34698.22
Sidney payroll 39117.12
Leola payroll 31314.98

As you can imagine, this isn’t very useful.
What could be useful is being able to output the rows where the salary is less than 20,000.

awk '$3 < 20000' data/salaries.tsv
George payroll 17352.3
Mervin sales 14141.15
Alexzander marketing 13590.12

Note: You can use any of the comparison operators here (>, <, >=, <= and ==).

Or we could show only the salaries of employees in the marketing department.

awk '$2 == "marketing"' data/salaries.tsv
Brett marketing 39154.19
Alexzander marketing 13590.12
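
Patterns can also be combined with && (and) or || (or). For example, to show payroll employees earning less than 20,000 (going by the earlier output, only George should match):

awk '$2 == "payroll" && $3 < 20000' data/salaries.tsv
George payroll 17352.3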

Now, you can also change the data, or add new columns, in the pattern if you wanted to.
You may have noticed this accidentally, if you only used one equals sign instead of two.

In this example, I’m using multiple statements and separating them with the semicolon ( ; ) symbol.

awk '$3 = 20000; $4 = "Y"' data/salaries.tsv
Name Dept 20000
Name Dept 20000 Y
Stacey sales 20000
Stacey sales 20000 Y
Sidney payroll 20000
Sidney payroll 20000 Y

Oh. That kind of worked, but not quite how we wanted.

What has happened is that each assignment is treated as a pattern that matches, so both of them let each row through (hence the duplicate lines). If you want to change the data, then use the action.

awk '{$3 = 20000; $4 = "Y"; print}' data/salaries.tsv
Name Dept 20000 Y
Stacey sales 20000 Y
Sidney payroll 20000 Y
Leola payroll 20000 Y

This isn’t quite what we wanted, but it’s close enough for the moment.
Later we will learn how to remove the header.

AWK Conditionals

Suppose that a new law says the minimum wage is going up, and that all full-time employees have to be paid a minimum of 20,000.

We need to be able to update the list:
if $3 is less than 20,000, then change it to 20,000.

awk '{ if ($3 < 20000) $3 = 20000; print} ' data/salaries.tsv
Name Dept Salary
Stacey sales 34698.22
Sidney payroll 39117.12
Leola payroll 31314.98
George payroll 20000

If we wanted to add an extra column for the new salary, then we can use an if, and also supply an alternative with an else for when the condition doesn’t match.

awk '{ if ($3 < 20000) $4 = 20000; else $4 = $3; print} ' data/salaries.tsv
Name Dept Salary Salary
Stacey sales 34698.22 34698.22
Sidney payroll 39117.12 39117.12
Leola payroll 31314.98 31314.98
George payroll 17352.3 20000

Note: If the action that you want to perform in the “if” is more than one statement, then you will need to use curly braces to group them together.  e.g. awk '{ if ($3 < 20000) {$4 = 20000;} else {$4 = $3;} print} ' data/salaries.tsv

Everything looks good, except the headers are now misleading, as both of them say “Salary”.

We need to be able to target the first row and this is possible, knowing what we know so far.

awk '{ if ($1 == "Name") {$4 = "New"$3; $3 = "Old"$3; } else if ($3 < 20000) $4 = 20000; else $4 = $3; print} ' data/salaries.tsv
Name Dept OldSalary NewSalary
Stacey sales 34698.22 34698.22
Sidney payroll 39117.12 39117.12

Whilst this technically works, I’m not happy about using “Name” to figure out if we are on the first row.
What happens if “Name” was renamed to “FirstName”?

AWK Variables

Awk has a variable that keeps track of which row we are on, called NR.
NR is a simple number that increments for every row of the input.
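
You can print NR alongside each line to see it counting up (again using the sample file):

awk '{ print NR, $0 }' data/salaries.tsv | head -n 4
1 Name Dept Salary
2 Stacey sales 34698.22
3 Sidney payroll 39117.12
4 Leola payroll 31314.98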

To find the first row, we simply have to check if NR is 1.

awk '{ if (NR == 1) {$4 = "New"$3; $3 = "Old"$3; } else if ($3 < 20000) $4 = 20000; else $4 = $3; print} ' data/salaries.tsv
Name Dept OldSalary NewSalary
Stacey sales 34698.22 34698.22
Sidney payroll 39117.12 39117.12

There is the modulo symbol (%), which gives us the remainder when dividing one number by another.

For example, dividing 5 by 2 would give us 1 as the remainder.
You would represent this as 5 % 2.
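
You can check this with awk itself, using a BEGIN block (which runs before any input is read):

awk 'BEGIN { print 5 % 2 }'
1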

We can use this symbol to only show us every other row.  Either the even-numbered rows, or the odd.

Whilst this might not be entirely useful for salary information, it could be useful for getting a sample of data from the whole. 

Note: I use this technique to filter data that is captured every two seconds, and reduce it down to one entry every 60 seconds.
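
As a rough sketch of that, assuming a hypothetical metrics.log with one reading every two seconds, keeping every 30th row leaves you with one reading per minute:

awk 'NR % 30 == 0' data/metrics.log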

awk 'NR % 2 == 0 { if (NR == 1) {$4 = "New"$3; $3 = "Old"$3; } else if ($3 < 20000) $4 = 20000; else $4 = $3; print} ' data/salaries.tsv
Stacey sales 34698.22 34698.22
Leola payroll 31314.98 31314.98
Alanis payroll 24373.73 24373.73

Oh, we’ve lost our header, because it was on an odd-numbered row, and we are showing the even ones.

We could change the pattern to “NR % 2 == 1” to show the odd rows, or we can cheat and add a second header row to make sure it gets included.

Note: You’d need to include the header as many times as the number you are dividing NR by.  If you do NR % 5, then you’d need 5 copies of the header.

Another approach is to use the head command to capture the heading line, and output it before the awk output.

head -n 1 data/salaries.tsv && awk 'NR % 2 == 0 { if (NR == 1) {$4 = "New"$3; $3 = "Old"$3; } else if ($3 < 20000) $4 = 20000; else $4 = $3; print} ' data/salaries.tsv
Name Dept Salary
Stacey sales 34698.22 34698.22
Leola payroll 31314.98 31314.98
Alanis payroll 24373.73 24373.73

Note: The && symbol is a way to run more than one command.  The second command is only run if the first was successful.
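
For example, the echo here only runs because the head command before it succeeded:

head -n 1 data/salaries.tsv && echo "head worked"
Name Dept Salary
head worked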

Using Other Formats With AWK

You may not have noticed that in the sample data, I didn’t use a single space character.  I could have easily used a person’s full name, rather than just their first.

Note: The separator between the fields was a tab.

As you might imagine, this was deliberate, as awk will split each column on any white-space character (tabs and spaces).

We can change the symbol that awk uses to split the data, with the -F option, by specifying either a simple string or a regular expression.

awk -F "\t" 'NR == 1' data/salaries.tsv
Name Dept Salary

This separator symbol could be a simple letter.

awk -F "a" 'NR == 1 { print $1 " "  $2 " " $3 " " $4}' data/salaries.tsv
N me Dept S l ry

However, you can also use this to help read in other file formats, for example csv.

awk -F "," '{ print $1 " "  $2 " " $3 " " $4}' data/salaries.csv
Name Dept Salary 
Stacey sales 34698.22 
Sidney payroll 39117.12 
Leola payroll 31314.98 
George payroll 17352.3 
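
Because -F also accepts a regular expression, you can even write one command that copes with either file. Depending on your awk, something like splitting on “a comma or a tab” should work for both:

awk -F "[,\t]" 'NR == 1 { print $1 " "  $2 " " $3 }' data/salaries.csv
Name Dept Salary
awk -F "[,\t]" 'NR == 1 { print $1 " "  $2 " " $3 }' data/salaries.tsv
Name Dept Salary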

Conclusion

The awk command is an invaluable tool in the realm of text processing in Unix-like operating systems. With its exceptional capability to scan, manipulate, and generate formatted reports from text files, awk is truly a Swiss Army knife for any command-line enthusiast or system administrator.

While it’s primarily used for simple tasks such as field extraction and arithmetic operations, the true strength of awk lies in its scripting capabilities. Complex programs can be written in awk, incorporating loops, conditionals, and functions. This makes awk not just a command, but a full-fledged scripting language tailor-made for processing structured text data.
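
As a small taste of that (a sketch, assuming the same salaries.tsv as before), this one-liner keeps a running total of the Salary column, skipping the header, and prints it from an END block once all the rows have been read:

awk 'NR > 1 { total += $3 } END { print "Total salary:", total }' data/salaries.tsv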

As we’ve seen, the power of awk can be harnessed for a multitude of tasks, from parsing TSV and processing text files to generating reports and performing calculations. However, the depth of awk extends far beyond what we’ve touched upon in this guide. The awk command is vast and flexible, and the more you delve into it, the more useful you will find it.

In closing, whether you’re a system admin managing server logs, a programmer handling complex text data, or a curious learner exploring the expanses of your system’s command line, awk offers you a robust toolkit that can simplify your tasks and increase your productivity. Don’t hesitate to explore man awk or awk --help to unlock even more of its potential.

Daniel
