1 Appendix

1.1 Appendix I. Reformatting Lab Data with Sed and Awk

Sed and awk are two Unix editing tools that are particularly useful for processing data taken in lab. Awk is programmable, and its syntax resembles that of C and shell programming.
If you have little or no programming experience and find it difficult to use the support programs included in the manual, try using awk to process your data. It is easy to learn, and there are many awk scripts throughout the manual. Awk is available as a free download for Win32 machines from the physics department website, and is part of any Linux installation.
As an example we will process data for the Nuclear Counting Statistics experiment.
Suppose that we count nuclear decays for 10 second periods, obtaining the following 20 measurements:

 
1016  
1207  
1186  
1244  
1110  
1099  
1099  
1185  
1220  
1286  
1117  
1280  
1190  
1083  
1177  
1189  
1200  
1201  
1188  
1291

which we enter into a file called list, one number per line as shown above, so that the command

cat list

produces the output above. We need to find the largest and smallest values. This is easy for 20 data points, but what if we had hundreds? We can sort the data using the Unix program sort, with the -n switch telling sort to order the lines by numeric value.

sort -n < list > sorted_list

The resulting file sorted_list looks like this

1016  
1083  
1099  
1099  
1110  
1117  
1177  
1185  
1186  
1188  
1189  
1190  
1200  
1201  
1207  
1220  
1244  
1280  
1286  
1291

This file may have a first line that is blank, a feature of some versions of sort. In that case the second line of the file sorted_list and the last line are the smallest and largest entries in our data file. Check for this behaviour when running sort for the first time; if there is no blank first line, the smallest entry is the first line rather than the second. We can pipe the output of sort into an awk script, min_max.awk, that will read the sorted data and report the second and last lines to the console. We need to tell awk that it will read a file consisting of a single column, one number per line. Setting the record separator RS to the empty string makes awk treat blank lines as record separators, so the whole column is read as a single record; setting the field separator FS to a newline character makes each line of that record a field. Because awk ignores completely empty lines at the start of the input when RS is empty, it is worth checking whether $1 or $2 holds the smallest value on your system.
The normal mode of operation of awk is to act on each line of a file in turn, treating each line as a record with several data fields separated by a user-defined field separator. The fields are accessed by referring to them by the names $1, $2, and so forth. The last data field is called $NF. Here is the awk script min_max.awk

#min_max.awk  
BEGIN { FS = "\n"; RS = ""}  
{ print "smallest is " , $2, " " , " largest is", $NF}
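The difference between the two ways of reading input can be seen in a minimal sketch. The sample lines fed in with printf are made up for illustration, and the variable names line_mode and para_mode are hypothetical; NR and NF are the standard awk counters for the record number and the number of fields.

```shell
# Default mode: each line is a record (NR counts them), split into
# whitespace-separated fields $1 ... $NF.
line_mode=$(printf '0 1016\n7 1207\n' | awk '{ print NR, NF, $1, $NF }')
printf '%s\n' "$line_mode"

# Paragraph mode, as set up in min_max.awk: the whole column is one
# record and each line of it is one field.
para_mode=$(printf '1016\n1083\n1099\n' |
  awk 'BEGIN { FS = "\n"; RS = "" } { print NR, NF, $1, $NF }')
printf '%s\n' "$para_mode"
```

In the first case awk reports two records of two fields each; in the second it reports one record of three fields, which is the layout min_max.awk relies on.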

We can obtain the largest and smallest numbers in our data set by running sort on the file list and piping the output into this awk script

sort -n < list | awk -f min_max.awk

which produces the output line

smallest is  1016    largest is 1291
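If relying on the position of a blank line feels fragile, the extremes can also be found in a single pass without sorting. The following is a sketch rather than the manual's method: the variable names counts, min_max, n, min and max are chosen here for illustration, and a blank first line is added on purpose to show that it is skipped.

```shell
# The 20 measurements from the file "list", kept in a shell variable.
counts='1016 1207 1186 1244 1110 1099 1099 1185 1220 1286
1117 1280 1190 1083 1177 1189 1200 1201 1188 1291'

# $counts is left unquoted so the shell splits it into one value per line.
min_max=$({ echo; printf '%s\n' $counts; } | awk '
NF > 0 {                            # skip blank lines
    v = $1 + 0                      # force a numeric comparison
    if (n == 0 || v < min) min = v
    if (n == 0 || v > max) max = v
    n++
}
END { print "smallest is", min, "largest is", max }')
printf '%s\n' "$min_max"
```

With a real data file the same awk program can be run as awk '...' list, and the result does not depend on how sort treats blank lines.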

The next stage in processing the data for this experiment is to sort the data into bins of a given width in count-space. The range of counts is

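Nmax - Nmin = 1291 - 1016 = 275

We can sort our data into 11 bins of width 25 counts, since 11 x 25 = 275, by running the data in list through the following awk script called bin_sort.awk

#bin_sort.awk  
# Sort counts into 11 bins of width 25 starting at 1016.  
{  
i=0  
while ( i<11 ) {  
# Bin i runs from 1016+25*i up to, but not including, 1016+25*(i+1), so a  
# count on a bin edge lands in one bin only; bin 10 also takes the maximum, 1291.  
if ( $1 >= 1016+25*i && ( $1 < 1016+25+25*i || ( i==10 && $1 == 1291 ) ) )  
print i, " ", $1  
++i}  
}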

This will run through the loop for each line of the file list in turn, assign each data point to a bin, and print the bin first, then the data point separated by a blank space. The output of

awk -f bin_sort.awk list

is

 
0   1016  
7   1207  
6   1186  
9   1244  
3   1110  
3   1099  
3   1099  
6   1185  
8   1220  
10   1286  
4   1117  
10   1280  
6   1190  
2   1083  
6   1177  
6   1189  
7   1200  
7   1201  
6   1188  
10   1291

We can sort this data by running

awk -f bin_sort.awk list | sort -n

which will produce the sorted output below. Note that sort will perform numeric sorting based on the first field of each data line.

0   1016  
2   1083  
3   1099  
3   1099  
3   1110  
4   1117  
6   1177  
6   1185  
6   1186  
6   1188  
6   1189  
6   1190  
7   1200  
7   1201  
7   1207  
8   1220  
9   1244  
10   1280  
10   1286  
10   1291
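The bin number can also be computed directly instead of being searched for in a loop, since bin i starts at 1016 + 25*i. The sketch below assumes the same minimum, width and number of bins as above, passed in with -v under the hypothetical names min, width and nbins; it is an alternative formulation, not the script used in the rest of this appendix.

```shell
# The 20 measurements from the file "list", kept in a shell variable.
counts='1016 1207 1186 1244 1110 1099 1099 1185 1220 1286
1117 1280 1190 1083 1177 1189 1200 1201 1188 1291'

# $counts is left unquoted so the shell splits it into one value per line.
binned=$(printf '%s\n' $counts | awk -v min=1016 -v width=25 -v nbins=11 '
{
    i = int(($1 - min) / width)         # integer part gives the bin number
    if (i > nbins - 1) i = nbins - 1    # the top edge, 1291, joins the last bin
    print i, $1
}' | sort -k1,1n -k2,2n)                # order by bin, then by count
printf '%s\n' "$binned"
```

Passing the parameters with -v means the same script can be reused for another data set by changing only the command line.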

We can do even better by telling bin_sort.awk to print only the bin number of each data point and piping the output through sort into the uniq command. With the -c switch, uniq counts the number of occurrences of each bin label in the list and outputs two columns, bin population followed by bin number. Bins that contain no data points (here bins 1 and 5) do not appear in the output.

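#bin_sort.awk  
# Print only the bin number (0 to 10) of each count.  
{  
i=0  
while ( i<11 ) {  
# Bin i runs from 1016+25*i up to, but not including, 1016+25*(i+1), so a  
# count on a bin edge lands in one bin only; bin 10 also takes the maximum, 1291.  
if ( $1 >= 1016+25*i && ( $1 < 1016+25+25*i || ( i==10 && $1 == 1291 ) ) )  
print i  
++i}  
}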

We run the command

awk -f bin_sort.awk list | sort -n | uniq -c

which produces output

      1 0  
      1 2  
      3 3  
      1 4  
      6 6  
      3 7  
      1 8  
      1 9  
      3 10
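The counting can also be done inside awk with an array indexed by bin number, which makes it easy to list the empty bins as well. This is a sketch under the same assumptions as above (minimum 1016, width 25, 11 bins); the names counts, bin_pop, pop, min, width and nbins are chosen here for illustration.

```shell
# The 20 measurements from the file "list", kept in a shell variable.
counts='1016 1207 1186 1244 1110 1099 1099 1185 1220 1286
1117 1280 1190 1083 1177 1189 1200 1201 1188 1291'

# $counts is left unquoted so the shell splits it into one value per line.
bin_pop=$(printf '%s\n' $counts | awk -v min=1016 -v width=25 -v nbins=11 '
{
    i = int(($1 - min) / width)         # bin number of this count
    if (i > nbins - 1) i = nbins - 1    # the top edge, 1291, joins the last bin
    pop[i]++                            # one more entry in bin i
}
END {
    for (i = 0; i < nbins; i++)
        print pop[i] + 0, i             # "+ 0" prints an unset count as 0
}')
printf '%s\n' "$bin_pop"
```

The output has the same two columns as uniq -c, population then bin number, with extra lines for the empty bins 1 and 5. The pipeline with uniq -c remains the simpler choice when empty bins do not matter, and its output is what the following steps use.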

We can save this data to a file called bin_pop with

awk -f bin_sort.awk list | sort -n | uniq -c >bin_pop

and print out the contents of the file in a nice format that can be imported into a LaTeX lab report as a table

awk '{ print $2, "&", $1, "\\\\", "\\hline"}' bin_pop

which produces

0 & 1 \\ \hline  
2 & 1 \\ \hline  
3 & 3 \\ \hline  
4 & 1 \\ \hline  
6 & 6 \\ \hline  
7 & 3 \\ \hline  
8 & 1 \\ \hline  
9 & 1 \\ \hline  
10 & 3 \\ \hline

We now add a few lines to this and we have our processed data in the form of a nice LaTeX table.

\begin{tabular}{|c|c|}\hline  
Bin label & Bin population \\  
\hline  
0 & 1 \\ \hline  
2 & 1 \\ \hline  
3 & 3 \\ \hline  
4 & 1 \\ \hline  
6 & 6 \\ \hline  
7 & 3 \\ \hline  
8 & 1 \\ \hline  
9 & 1 \\ \hline  
10 & 3 \\ \hline  
\end{tabular}

and this prints as seen below



Bin label   Bin population  
0           1  
2           1  
3           3  
4           1  
6           6  
7           3  
8           1  
9           1  
10          3



We can include another column for the center of the bin with

awk -f table.awk bin_pop

using the script

#table.awk  
{ print $2, "&", $1, "&", 1016+12+25*$2, "\\\\", "\\hline"}

which produces the output

 
0 & 1 & 1028 \\ \hline  
2 & 1 & 1078 \\ \hline  
3 & 3 & 1103 \\ \hline  
4 & 1 & 1128 \\ \hline  
6 & 6 & 1178 \\ \hline  
7 & 3 & 1203 \\ \hline  
8 & 1 & 1228 \\ \hline  
9 & 1 & 1253 \\ \hline  
10 & 3 & 1278 \\ \hline
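The surrounding LaTeX lines can also be written by awk itself, using a BEGIN block for the table header and an END block for the closing line. This is a sketch: the column headings are taken from the table in this appendix, the bin populations are typed in directly rather than read from the file bin_pop, and the variable name table is hypothetical.

```shell
# Lines as stored in bin_pop: population first, then bin number.
table=$(printf '%s\n' '1 0' '1 2' '3 3' '1 4' '6 6' '3 7' '1 8' '1 9' '3 10' | awk '
BEGIN {
    # Printed once, before any data line is read.
    print "\\begin{tabular}{|c|c|c|}\\hline"
    print "Bin & Bin pop. & Bin center \\\\ \\hline"
}
{ print $2, "&", $1, "&", 1016+12+25*$2, "\\\\", "\\hline" }
END { print "\\end{tabular}" }          # printed once, after the last line
')
printf '%s\n' "$table"
```

Redirecting the output to a file gives a complete tabular environment that can be pulled into the lab report with \input. Either way, the rows end up wrapped in a tabular environment.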

After adding a few LaTeX lines this becomes the table




Bin   Bin pop.   Bin center  
0     1          1028  
2     1          1078  
3     3          1103  
4     1          1128  
6     6          1178  
7     3          1203  
8     1          1228  
9     1          1253  
10    3          1278




Awk and sed can be used not only to process huge data files for analysis, but also to turn the processed data into convenient reports.