AWK programming language
AWK is a programming language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems. When written in all lowercase letters, as
awk
, it refers to the Unix or Plan 9 program that runs scripts written in the AWK programming language.
SOME AWK COMMANDS AND USES:
to find all the lines with the string 'foo'
awk '/foo/ {print}' file
to print lines 1 to 2
awk ‘NR == 1, NR == 2 {print}’ file
to print from line 15000 to the end of the file
awk 'NR >= 15000' test.csv
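A quick check of the same idea on a tiny made-up input:

```shell
# Print from line 3 to the end of a 5-line sample
printf '1\n2\n3\n4\n5\n' | awk 'NR >= 3'
```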
to print 2nd and last columns
awk '{print $2,$NF;}' employee.txt
In the print statement, ',' inserts the output field separator (OFS, a space by default); placing expressions side by side with no comma concatenates them.
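The difference is easy to see on a two-field line:

```shell
# A comma inserts OFS (a space by default); juxtaposition concatenates
echo "a b" | awk '{print $1, $2}'   # prints "a b"
echo "a b" | awk '{print $1 $2}'    # prints "ab"
```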
Printf >>
A dash ('-') in the format specifier means left-aligned
awk -F ' ' '{printf "%-10s%-10s%-10s\n", $2,$4,$6}'
Sprintf >>
Replace the first field by its formatted representation and output the line.
awk '{$1 = sprintf("%4d", $1); print}' infile > outfile.txt
To use bash variables inside an awk command >>
If the bash variable is the pattern to match >
awk -v ref="$pattern" 'match($0, ref) {print $2}'
If the bash variable is to be printed as output >
awk -v var="$printthis" '/patt/ {print var}'
for ff in out*; do sed -n '/^1.*/p' "$ff" | awk -F':' -v v="$ff" '{printf("%s %e\n", v, $2)}'; done
to get the number of lines >>
awk 'END {print NR}' file_name
Syntax >>
BEGIN { Actions }   # Actions before the file is read
{ Actions }         # Actions for every line in the file
END { Actions }     # Actions after the file is read
The quoted program is followed by the file name: awk ' ... ' file_name
example >>
awk 'BEGIN {print "Name\tDesignation\tDepartment\tSalary";}
{print $2,"\t",$3,"\t",$4,"\t",$NF;}
END{print "Report Generated\n--------------";
}' file_name
Get average of 1st column
awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
to print lines where the column 1 value is greater than 200 (a pattern with no action defaults to printing the whole line)
awk '$1 > 200' file_name
The ~ operator matches against a regular expression. To print all the lines which have "text" in the 4th column
awk '$4 ~/text/' file_name
To count the number of rows with "text" in the 4th column
awk 'BEGIN { count=0;}
$4 ~ /text/ { count++; }
END { print "Number of lines =",count;}' file_name
to find average of entries which satisfy a condition.
awk -F, 'BEGIN {s=0; c=0;} {if($1>1952) {s=s+$2; c++;}} END {print "avg =",s/c}' input_file
Conditions >>
Check if line starts with ## and print second column
awk -F ' ' '(/^##/) {print $2}' file_name
If statement>>
awk -F ' ' ' {if(/^##/) print $2}' file_name
Combined conditions >>
awk '($1>100) && ($1<200) {print $1;}' rgb-only
If statement>>
awk '{if($1>100 && $1<200) print $1;}' rgb-only
If-else statement>>
awk -v vv="$myline" '{if(/^region/) print vv; else print $0}' in.comp.32
Nested if statements >>
If the line begins with Totaltime, increment q and print when q%2 equals 1
awk 'BEGIN{q=0;}{if(/^Totaltime=.*/){q++; if(q%2==1) print;}}' Log1
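The effect is to print every other matching line (the 1st, 3rd, 5th, ...); a small made-up log shows it:

```shell
# Print every other line that matches ^Totaltime=
printf 'Totaltime=1\nother\nTotaltime=2\nTotaltime=3\n' |
    awk '/^Totaltime=/ { q++; if (q % 2 == 1) print }'
```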
to denote field separator use -F
awk -F, '$2>=1950 {print $2}' file_name
Using .awk files>>
Content of the grade.awk file ::
{
total=$3+$4+$5;
avg=total/3;
if ( avg >= 90 ) grade="A";
else if ( avg >= 80) grade ="B";
else if (avg >= 70) grade ="C";
else grade="D";
print $0,"=>",grade;
}
Command to run ::
awk -f grade.awk student-marks
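A self-contained run of the same script; the student-marks data here is made up (name, ID, then three marks in fields 3-5):

```shell
# Hypothetical sample data: name, ID, three marks
cat > student-marks <<'EOF'
Alice 01 95 90 85
Bob 02 70 65 60
EOF
cat > grade.awk <<'EOF'
{
    total = $3 + $4 + $5
    avg = total / 3
    if (avg >= 90) grade = "A"
    else if (avg >= 80) grade = "B"
    else if (avg >= 70) grade = "C"
    else grade = "D"
    print $0, "=>", grade
}
EOF
awk -f grade.awk student-marks
# Alice averages 90 (grade A), Bob averages 65 (grade D)
```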
http://www.hcs.harvard.edu/~dholland/computers/awk.html
lab2 -Data viz >>
get data from line 4 to the end of the file.
sed -n '4,$p' nation.1751_2010.csv >> onlydata
get lines whose 2nd field is 1950 or greater. with "-F," fields are
separated by commas.
awk -F, '$2>=1950' onlydata >> onlydata.4m1950
to sort according to the numerical value of column 3. "-t," indicates the
field separator.
sort -nt, -k 3,3 onlydata.4m1950 >>sorted.byemission
To get all the lines starting with the string "UNITED KINGDOM".
awk '/^UNITED KINGDOM,/' onlydata.4m1950 >>uk.csv
to replace commas with tabs while extracting columns 2, 3, and 9.
awk -F, '{print $2"\t"$3"\t"$9}' ../japan.csv >>japan.tsv
remove lines with "CHINA" at the beginning, then sort according to the 3rd column
awk '!/^CHINA/ {print}' trimed_data | sort -nt, -k 3,3
remove lines with all the names given and sort.
awk '!/^CHINA|^UNITED|^INDIA|^JAPAN/ {print}' trimed_data | sort -nt, -k 3,3
awk -F'\t' '$2 >=1999 {print}'
to use bash variables in awk, use "'${var_name}'"
year=1950
awk -F'\t' '$2 >="'${year}'" {print}' all.others.4m1950
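Both interpolation styles can be checked on a small made-up tab-separated file; the -v form shown earlier is generally safer:

```shell
year=1950
# Hypothetical sample: country, year (tab-separated)
printf 'UK\t1949\nUS\t1951\n' > demo.tsv
# Preferred: pass the shell variable with -v
awk -F'\t' -v y="$year" '$2 >= y {print $1}' demo.tsv      # prints US
# Quote-splicing also works, but breaks if the variable contains quotes or spaces
awk -F'\t' '$2 >= "'${year}'" {print $1}' demo.tsv         # prints US
```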
awk 'BEGIN{x=0;} $2==1999 {x=x+$3} END {print x}' all.others.4m1950
to add values in third column to a single value if the year matches
##################
#!/bin/bash
for i in `seq 1950 2010`
do
x=`awk 'BEGIN{x=0;} $2=="'${i}'" {x=x+$3} END {print x}' all.others.4m1950`
echo "$i,$x"
done
##################
to get the mean of columns 2, 3, 4, and 5 (excluding the 1st row) from several files and print it with the file name
#!/bin/bash
for file in 60 80 120 158 182 200 216
do
printf "$file\t"
awk 'NR > 1 { x1+=$2; x2+=$3; x3+=$4; x4+=$5; count++ }
     END { print x1/count "\t" x2/count "\t" x3/count "\t" x4/count }' \
./$file
done
to do the same with sed (this is only for one file at a time)
sed -n '2,$p' 20 | awk 'BEGIN{x=0} {x+=$2} END{print x/NR}'
Combine two files with matching values >>
File2 contains a unique ID in column 1 ($1) and values in columns 2 and 3 that should be appended to the end of each line in file1 whose 1st column matches that ID.
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
Explanation ::
FNR==NR > true while the current file's record number (FNR) equals the total number of records read (NR). This holds only for the first file (i.e. file2 in this case).
{a[$1]=$2 FS $3;next} > builds an associative array keyed by $1, with $2 and $3 (joined by the field separator FS) as the value. This runs only for the first file, since it is the body of the condition above. next makes awk jump to the next line without executing the rest.
{ print $0, a[$1]} > prints the whole line ($0) and the value stored in the array a[] under the key $1
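The join can be verified end to end on two small made-up files:

```shell
# Hypothetical sample files: file2 maps IDs to extra values, file1 holds the base rows
printf 'id1 10 20\nid2 30 40\n' > file2
printf 'id1 foo\nid2 bar\n' > file1
awk 'FNR==NR {a[$1]=$2 FS $3; next} {print $0, a[$1]}' file2 file1
# -> id1 foo 10 20
#    id2 bar 30 40
```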
Subtract a value in a column from value in a column of a different file >>
Here, values in the 3rd column of 'bb' are subtracted from values in the 3rd column of file 'tt'
awk '{c1 = $1; c2=$2; c3=$3; getline<"bb"; print c1,c2,c3,$3, c3-$3;}' <tt
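Each getline < "bb" call reads the next line of bb, so the two files are walked in lockstep; a run on small made-up files:

```shell
# Hypothetical three-column sample files
printf 'a 1 10\nb 2 20\n' > tt
printf 'a 1 4\nb 2 5\n' > bb
awk '{c1 = $1; c2 = $2; c3 = $3; getline < "bb"; print c1, c2, c3, $3, c3 - $3}' < tt
# -> a 1 10 4 6
#    b 2 20 5 15
```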