UNIX/Linux

In order to proceed with our analysis of large data sets, we will require a more sophisticated computing environment. The “operating system” is the set of instructions that allows users to interact with the hardware; it is basically a set of computer programs. Examples are DOS, Windows, Windows XP, MacOS and Unix. Note that an operating system is not the same thing as a window system. Unix comes in several varieties; today the most widely used Unix-like systems are based on Linux. We will use the Ubuntu distribution of Linux. Why this OS?

  • open source (lots of free stuff)

  • networking

  • multiprocessing

  • security

  • process-level interaction

  • file sharing

  • etc.

While the operating system is based on Ubuntu Linux, the window (desktop) system is based on KDE. When you first log in, you should get a desktop with a menu accessed via the small mouse icon. Here you can access all the different programs, and this works much the same way as Windows and Mac desktops (i.e., you can configure, change preferences, customize, etc.). For this class we will mainly use two client tools: a terminal window and a text editor. The former is simply “terminal”, while the latter is called “leafpad”. Note that you could also use a terminal-based editor such as “vi” or “emacs”, but these are less straightforward.

A. Background

The UNIX [1] operating system was developed at AT&T's Bell Labs beginning around 1970, when machine memory was at a premium. As a result, you will find that most of the commands are short and take optional arguments (usually single letters) preceded by a dash (-).

First, it is important to note that everything on a UNIX system is either a file or a directory (think folder). UNIX systems rely on a hierarchy of safeguards that partition, and in some cases prohibit, various tasks and capabilities to certain users. On most systems you will be required to log in using your username. This at once defines you as a “user” belonging to a “group”.

Each file carries permissions for three classes of users: the file's owner (“user”), a specific group of users (“group”), and everyone else (“all”; the chmod command calls this class “other”). Further, for each class there are three possible actions on files and directories: reading, writing and executing.

As an example, a user “jimp” might belong to a group “soest”. Further, “jimp” might have a file called “my_data.dat”. The file will then have a status whereby it can be viewed (“readable”) and/or modified (“writeable”) and/or run as a program (“executable”). These permissions can then be applied independently to the user (“jimp”), the group (“soest”) and all other users. This will be described in more detail later, along with commands to change permissions.

Next, UNIX file systems are organized around a directory structure. Directories are demarcated with a forward slash (“/”). The uppermost (aka root) directory is simply “/”, while user directories normally are organized under a subdirectory with individual names. For example, on a Mac, users have directories in a subdirectory of “/” called “Users”. The series of subdirectories that lead to individual files is referred to as the path.

For commands in UNIX to be “found”, i.e., recognized by the operating system, they must either be in the user’s path, or such a path must be entered explicitly. If there is a program called xplot in the directory /usr/local/bin, to run this program, a user can enter the command xplot. However, if /usr/local/bin is not in the user’s path, it will not be found (the operating system will return “command not found”). In this case the entire path must be entered:

installation@kilo:~$ xplot

xplot: Command not found

installation@kilo:~$ /usr/local/bin/xplot

The default path should include /bin and /usr/bin where most commands are found. To expand or change this, shell commands must be entered (shown later).
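As a rough sketch of how the path can be inspected and extended, the bash commands below list the directories in PATH and then add one more for the current session. The directory and the script name hello_demo are hypothetical and are created only for this illustration; the change is lost when the shell exits.

```shell
# Show the directories currently searched for commands, one per line.
echo "$PATH" | tr ':' '\n'

# Ask the shell where it finds a command (ls is used as a safe example).
command -v ls

# Create a throwaway directory holding a tiny script named "hello_demo"
# (a hypothetical name, chosen only for this illustration).
demo_bin=$(mktemp -d)
printf '#!/bin/sh\necho "hello from demo"\n' > "$demo_bin/hello_demo"
chmod +x "$demo_bin/hello_demo"

# Append the new directory to PATH for this shell session only.
PATH="$PATH:$demo_bin"
export PATH

# The script is now found by name, without typing its full path.
path_result=$(hello_demo)
echo "$path_result"
```

Appending (rather than prepending) the new directory means that the standard commands in /bin and /usr/bin are still found first.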

There are many more useful commands that will come up with practice and experience. More information about commands is found with the on-line manual pages, accessed with the command man followed by the command of interest. Of course to use man you’d have to already know the command; apropos will look for commands that may be relevant. As an example, to get information on how to use the ls command, “man ls”; to look for all commands that may involve file listing, “apropos list”.

B. File system navigation

The first set of commands that we will cover involves “file system navigation”, or how to move around the directory structure and show files. Once logged in, the first UNIX command to try is “pwd”, which prints the current working directory. Immediately after login this is the user’s home directory, the top-level directory for all of the user’s files. For example, if you log in as user “installation” (the default instructor account):

installation@kilo:~$ pwd

returns:

/homelocal/installation

This means that the home directory of the user “installation” is a subdirectory called “installation”, under a main directory called “homelocal”. There are likely home directories for other users alongside it. Most of the students in this class are in the GES program, and their home directories are in /home/iniki/ges/username.

UNIX has a few shorthand characters for directories. These include tilde (~) for the home directory, a single period (.) for the current directory and a double-period (..) for the directory “above” the current working directory in the hierarchical sense.

The command to change into different directories is cd (change directory). The syntax is

installation@kilo:~$ cd new_dir

where “new_dir” is the directory where you want to go. Specific to the cd command, cd with no argument will change to the user’s home directory. As an example, if user jimp has a home directory in /home/users/jimp, then when jimp is logged in these three methods will all change directory into jimp’s home:

installation@kilo:~$ cd /home/users/jimp

installation@kilo:~$ cd

installation@kilo:~$ cd ~

The use of . and .. is a little more subtle but can sometimes be quite useful. The important point is to keep track of the path. For example, let’s say there are two subdirectories in /home/users/jimp, one called homework and another called scripts. If your current directory is homework, and you want to change to scripts, you can’t simply do

installation@kilo:~$ cd scripts

since there is no directory within homework called scripts; scripts sits next to homework, one level up. So, you can do either of these:

installation@kilo:~$ cd /home/users/jimp/scripts

installation@kilo:~$ cd ../scripts
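To make the sibling-directory case concrete, here is a minimal sketch that builds a practice tree in a temporary location and moves between homework and scripts. The temporary directory is an assumption made so that the example does not touch any real home directory.

```shell
# Build a small practice tree: <tmp>/jimp/homework and <tmp>/jimp/scripts.
top=$(mktemp -d)
mkdir -p "$top/jimp/homework" "$top/jimp/scripts"

# Start in homework, as in the text.
cd "$top/jimp/homework"
pwd        # prints a path ending in /jimp/homework

# scripts is a sibling of homework: go up one level (..), then down.
cd ../scripts
nav_result=$(basename "$(pwd)")
echo "$nav_result"

# "." is the current directory, so this leaves us where we are.
cd .
pwd        # prints a path ending in /jimp/scripts
```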

The command “ls” will list all the files and directories in a directory. This command has a large number of optional arguments, and some of the more useful ones are “F” for a formatted listing and “l” for a long listing. For example:

installation@kilo:~$ ls -lF

returns:

total 96

-rwxr--r-- 1 pi pi 1270 Jun 14 20:57 blink.py

-rw-r--r-- 1 pi pi 340 Jun 16 02:09 button_test.py

drwxr-xr-x 2 pi pi 4096 Mar 18 08:45 Desktop/

-rw-r--r-- 1 pi pi 5969 Jun 14 19:48 dht11.py

-rw-r--r-- 1 pi pi 4501 Jun 14 19:48 dht11.pyc

drwxr-xr-x 5 pi pi 4096 Mar 18 08:45 Documents/

drwxr-xr-x 2 pi pi 4096 Mar 18 08:58 Downloads/

-rwxr-xr-x 1 pi pi 7868 Mar 18 20:27 example*

-rwxr--r-- 1 pi pi 402 Jun 16 04:37 io_test.py

-rw-r--r-- 1 pi pi 387 Jun 17 20:56 motion_sensor.py

drwxr-xr-x 2 pi pi 4096 Mar 18 08:58 Music/

-rw-r--r-- 1 pi pi 370 Jun 16 04:16 output_test.py

drwxr-xr-x 2 pi pi 4096 Mar 18 08:58 Pictures/

drwxr-xr-x 2 pi pi 4096 Mar 18 08:58 Public/

This is a list of all the files in the “current working directory”. On a color terminal the files may be shown in different colors (for example, plain files in black, directories in blue, “programs” in green). Whether this happens depends on how the system is set up (more on this later). The files are listed by name, and since the option -l was given, a long listing is shown, which gives information about the file permissions described later.

The first character identifies the entry as either a directory (d) or a plain file (-, i.e., “not a directory”). The next nine characters form three groups of three, each made up of r (“readable”), w (“writable”), x (“executable”) or - (“not”). The first group is specific for the user, the second for the group, and the third for everyone else. The file owner and group are given next.

For example, the first file (blink.py) is owned by user pi, who belongs to group pi. The leading -rwxr--r-- indicates:

  • this is a file (-) not a directory,

  • the user (pi) can read, modify and execute the file as a program (rwx),

  • the user’s group (coincidentally also called pi) can read this file but not modify or execute it (r--), and

  • everyone else can also read but not modify or execute this (r--).

Next in the long listing is the file size in bytes (1270) and modification date (Jun 14 20:57). Since we also gave the ls command the option -F for formatting, the display shows a “/” after directories and a “*” after files that are executable.

The files that are listed as a result of “ls” are those in the current working directory. We could also optionally specify a directory in order to show the files there instead. For example, to show the files in another user’s directory (e.g., “jimp”), we could enter:

installation@kilo:~$ ls -lF /home/jimp

Or, to show all the subdirectories under “home”, “ls -lF /home”, and so on. UNIX makes special use of files that begin with a period (.). These can be anything, but the convention is to name configuration files with a leading period. The listing command will not show these files by default, but with the option -a (list all files) they will be shown. For example, the file .bashrc is a configuration file for the bash shell (more on this later). This file will not appear using ls -lF, but will using ls -laF.
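A quick way to see the effect of the -a option is to make one ordinary file and one “dot” file in a scratch directory and list it both ways. The file names here are hypothetical, and the option -A (like -a, but leaving out the . and .. entries) is used only to make the counting simple.

```shell
# Scratch directory with one visible file and one hidden (dot) file.
demo=$(mktemp -d)
touch "$demo/visible.txt" "$demo/.hidden_config"

ls "$demo"       # shows only visible.txt
ls -a "$demo"    # also shows ., .. and .hidden_config

# Count the entries each way; the arithmetic expansion strips any
# padding that wc may add to its output.
plain_count=$(( $(ls "$demo" | wc -l) ))
all_count=$(( $(ls -A "$demo" | wc -l) ))
echo "without -a: $plain_count entry, with -A: $all_count entries"
```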

C. Viewing files

Now that we can see the files and directories, the commands “cat” (for concatenate) and “more” will display the contents of a file. The difference is cat will stream the entire contents without stopping, while more will show one page at a time. Unlike ls, which can be entered without specifying a directory path (defaulting to the current directory), more and cat require a file name as an argument. This name can either be an absolute path or a relative path. For example, to display the contents of the file blink.py in the home directory of installation:

installation@kilo:~$ cat blink.py

This is an example of specifying the filename using a relative path (i.e., relative to where I currently am, show the contents of file blink.py). Alternatively, to use the absolute path,

installation@kilo:~$ cat /homelocal/installation/blink.py

Recall from the earlier discussion that all files have permissions set that allow one to read/write/execute files. If a user does not have permission to read a file, the commands cat and more will return “permission denied”. Similar to cat and more, the commands head and tail will show the top and bottom, respectively, of files. These two commands have optional arguments to show specific lines. For example, head -5 will show the top five lines.
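As a small illustration of head and tail, the sketch below makes a 20-line sample file and shows its first and last few lines. It uses the -n form of the line-count option (head -n 5), which is the more portable spelling of head -5; the sample file is an assumption created just for this example.

```shell
# Make a 20-line sample file containing the numbers 1 to 20.
sample=$(mktemp)
seq 1 20 > "$sample"

head -n 5 "$sample"    # the first five lines
tail -n 3 "$sample"    # the last three lines

# Keep just the very first and very last line for a quick check.
first=$(head -n 1 "$sample")
last=$(tail -n 1 "$sample")
echo "first line: $first, last line: $last"
```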

D. Modifying files

The next set of commands that we will cover includes creating (or changing or removing) files. Typically, files are created using a text editor or some piece of software. Files can also be created with the commands touch (creates an empty file), cp (copy) and mv (move). The syntax for touch is simply:

installation@kilo:~$ touch sample_file.txt

This will create a file of zero size (nothing in it) called sample_file.txt. The other two commands require two filenames, a source and a target file. For example, if jimp has a file called sample_file.txt in his home directory and wants to copy this to a new file called sample2_file.txt:

jimp@kilo:~$ cd ~

jimp@kilo:~$ cp sample_file.txt sample2_file.txt

There will now be two identical files with different names. On the other hand, to rename the file from sample_file.txt to sample2_file.txt

jimp@kilo:~$ mv sample_file.txt sample2_file.txt

will rename (move) the original file. As before, paths become very important here and both relative and absolute pathnames can be used. Recall as well the shorthand notation for current directory (.) and upper-level directory (..). In the following example, assume there is a file called file1.txt in jimp’s home directory (/home/jimp). After changing into that directory with cd, the following three cp commands will all do the same thing, i.e., make a copy of file1.txt in the subdirectory called data, and name the copy file2.txt:

installation@kilo:~$ cd ~jimp

installation@kilo:~$ cp file1.txt /home/jimp/data/file2.txt

installation@kilo:~$ cp file1.txt ~jimp/data/file2.txt

installation@kilo:~$ cp file1.txt ./data/file2.txt

The first makes use of the absolute path to the new directory (and filename); the second and third are relative paths (one relative to jimp’s home, the other to the current working directory).

While cp and mv can be used for both files and directories, the command mkdir is used to create a new directory. Again, the argument following mkdir is the directory name, and can include a path.

Files are removed using rm, while rmdir is used to remove empty directories. As always, file (and directory) permissions will determine whether files or directories can be removed.
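The commands in this section can be tried safely in a scratch directory. The following is only a sketch: the directory comes from mktemp and the file names are hypothetical, so nothing in a real home directory is touched.

```shell
# Work in a throwaway directory.
work=$(mktemp -d)
cd "$work"

mkdir data                             # create a subdirectory
touch file1.txt                        # create an empty file

cp file1.txt ./data/file2.txt          # copy using a relative path
cp file1.txt "$work/data/file3.txt"    # copy using an absolute path
mv file1.txt renamed.txt               # rename (move) the original

ls -lF . data                          # inspect the result

# Clean up: remove the files first, then the now-empty directory.
rm data/file2.txt data/file3.txt
rmdir data
```

Note that rmdir succeeds here only because the directory was emptied first.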

There are additional “wildcards” in UNIX that can be used to make file handling more efficient. These include text replacements ? (question mark) and * (asterisk). These represent, respectively, any single character or any set of characters. For example, if there were files Sep15.dat, Sep.txt, Oct15.dat, Nov1.dat, Nov2.dat, and Nov15.dat, the following command would delete all files that end with .dat:

installation@kilo:~$ rm *.dat

The character * replaced any text, so essentially the command issued is to remove any files that have any series of text followed by .dat. Similarly,

installation@kilo:~$ rm Sep*

will remove both Sep15.dat and Sep.txt (any files that start with Sep). The question mark works the same way but replaces a single character:

installation@kilo:~$ rm Nov?.dat

will remove Nov1.dat and Nov2.dat but not Nov15.dat since the first two fit the criteria starting with Nov, followed by a single character, followed by .dat while the third file (Nov15.dat) does not.

The use of wildcards should be done with extreme caution. For example, rm * will delete all files in a directory. Once deleted, it is impossible to retrieve files.
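One cautious habit, sketched here in a scratch directory, is to preview what a wildcard matches with echo before handing the same pattern to rm. The file names are the ones from the example above, recreated as empty files.

```shell
# Recreate the example files in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
touch Sep15.dat Sep.txt Oct15.dat Nov1.dat Nov2.dat Nov15.dat

# Preview: the shell expands the pattern, echo just prints the matches.
preview=$(echo Nov?.dat)
echo "would remove: $preview"

# Only after checking the preview, remove the matching files.
rm Nov?.dat

ls    # Nov15.dat and the other files are still present
```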

E. Permissions

A final useful command is chmod, used to change the permissions on files. There are two ways to use chmod, one with numerics and the other with symbols. The numerical method uses a 1 for execute, a 2 for write and a 4 for read. Since there are three levels (user, group, all), three such numbers are needed. This command:

installation@kilo:~$ chmod 644 file1.txt

will cause file1.txt to be readable and writable (but not executable) by the user (2+4=6), and readable (but not writable or executable) by the group (4) and all (4). Zero can also be used for none. Thus, to prevent others outside of a user’s group from viewing files, use 0 as the last digit. Instead of numbers, symbols may be used, e.g., +r-w to enable read privilege but disable write. Related commands, chown and chgrp, can be used to change the owner or group, respectively.
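The two styles of chmod can be compared on a scratch file, as in the sketch below. The symbolic forms use u, g and o for user, group and other; the file is hypothetical, and only the first ten characters of the ls -l output (the type and permission string) are kept.

```shell
# Scratch file to experiment on.
perm_dir=$(mktemp -d)
touch "$perm_dir/file1.txt"

# Numeric form: 6 = read+write (4+2) for user, 4 = read for group and all.
chmod 644 "$perm_dir/file1.txt"
mode_numeric=$(ls -l "$perm_dir/file1.txt" | cut -c1-10)
echo "$mode_numeric"     # -rw-r--r--

# Symbolic form: take read permission away from everyone else (o-r).
chmod o-r "$perm_dir/file1.txt"
mode_private=$(ls -l "$perm_dir/file1.txt" | cut -c1-10)
echo "$mode_private"     # -rw-r-----

# Symbolic form: let the owner execute the file (u+x).
chmod u+x "$perm_dir/file1.txt"
mode_exec=$(ls -l "$perm_dir/file1.txt" | cut -c1-10)
echo "$mode_exec"        # -rwxr-----
```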

F. Special characters

At this point it should become clear that certain characters should not be used in file names. For example, if you name a file my*data.dat, it could cause problems when trying to delete, rename or otherwise use this file (since * is a wildcard). So wildcards (*,?) should not be used. Similarly, we’ve seen ~, ., .. and / used for directory addressing, so these should be avoided in file names (apart from the single period that separates a name from its suffix). Spaces are okay in Linux filenames, but it would be much better to avoid them so there’s no confusion between a single file and two separate ones. In short, it’s best to restrict characters used in file names to letters, numbers and underscores.

G. File I/O

The shell and many Linux commands take their input from standard input (stdin), write output to standard output (stdout), and write error output to standard error (stderr). By default, standard input is connected to the terminal keyboard and standard output and error to the terminal screen.

Redirection of I/O, for example to a file, is accomplished by specifying the destination on the command line using a redirection metacharacter followed by the desired destination. These characters depend on the type of shell being used.

C Shell Family

Some of the forms of redirection for the C shell family are:

Character    Action

>            Redirect standard output
>&           Redirect standard output and standard error
<            Redirect standard input
>!           Redirect standard output; overwrite file if it exists
>&!          Redirect standard output and standard error; overwrite file if it exists
|            Redirect standard output to another command (pipe)
>>           Append standard output
>>&          Append standard output and standard error

Bourne Shell Family

The Bourne shell uses a different format for redirection which includes numbers. The numbers refer to the file descriptor numbers (0 standard input, 1 standard output, 2 standard error). For example, 2> redirects file descriptor 2, or standard error. &n is the syntax for redirecting to a specific open file. For example 2>&1 redirects 2 (standard error) to 1 (standard output); if 1 has been redirected to a file, 2 goes there too. Other file descriptor numbers are assigned sequentially to other open files, or can be explicitly referenced in the shell scripts. Some of the forms of redirection for the Bourne shell family are:

Character    Action

>            Redirect standard output
2>           Redirect standard error
2>&1         Redirect standard error to standard output
<            Redirect standard input
|            Pipe standard output to another command
>>           Append standard output
2>&1|        Pipe standard output and standard error to another command

Note that < and > assume standard input and output, respectively, as the default, so the numbers 0 and 1 can be left off.

The form of a command with standard input and output redirection is:

command -[options] [arguments] < input file > output file

As an example, we’ve seen the ls command will list files in a directory. The output of the command goes to standard out, i.e., the terminal. To instead redirect to a file:

installation@kilo:~$ ls -lF > file1.txt

will send the output to the file file1.txt.

If you are using csh and do not have the noclobber variable set, using > and >& to redirect output will overwrite any existing file of that name. Setting noclobber prevents this. Using >! and >&! always forces the file to be overwritten. Use >> and >>& to append output to existing files.

Redirection may fail under some circumstances: 1) if you have the variable noclobber set and you attempt to redirect output to an existing file without forcing an overwrite, 2) if you redirect output to a file you don’t have write access to, and 3) if you redirect output to a directory.

Input redirection can be useful if you have written a program which expects input from the terminal and you want to provide it from a file. In the following example, myprog, which was written to read standard input and write standard output, is redirected to read myin and write myout.

installation@kilo:~$ myprog < myin > myout

To redirect standard error and output to different files (note that grouping is not necessary in the Bourne shell):

(csh) installation@kilo:~$ (cat myfile > myout) >& myerror

(bash) installation@kilo:~$ cat myfile > myout 2> myerror
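The Bourne-style forms can be tried out with a small stand-in program. In the bash sketch below, report is a hypothetical shell function that writes one line to standard output and one to standard error, so each redirection can be checked by looking at the resulting files.

```shell
# Work in a throwaway directory.
out_dir=$(mktemp -d)
cd "$out_dir"

# Hypothetical stand-in for a real program: one line to stdout,
# one line to stderr (file descriptor 2).
report() {
    echo "result line"
    echo "warning line" >&2
}

report > myout 2> myerror    # stdout and stderr to different files
report > combined 2>&1       # both streams into the same file
report >> myout 2> /dev/null # append stdout, discard stderr

cat myout       # two copies of "result line"
cat myerror     # "warning line"
cat combined    # both lines
```

The order matters in the second form: standard output is pointed at the file first, and 2>&1 then sends standard error to the same place.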

It should also be noted at this point that long-running programs, particularly those whose output is redirected to a file, are typically run in the “background” so that the terminal remains available for other work. Appending “&” to a command places it in the background. (Whether a background program keeps running after the terminal is closed or the user logs out depends on the shell settings; the nohup command is commonly used to ensure that it does.) To bring a background program back to the foreground, the command “fg” is used (fg %# if more than one is running, where # is the job number reported by “jobs”); to return it to the background, suspend it with Ctrl-Z and then enter bg. There are numerous other commands to help with managing jobs.

The pipe (|) is used to redirect the output of one command into another command. A built-in pattern-matching command in Linux is grep, used for example as:

installation@kilo:~$ grep jimp file1.txt

This will display all the lines in file1.txt that contain the string jimp. Thus, to list all files whose names contain the string jimp, one could list all files, then pipe the output of that command to grep, rather than searching through the listing by eye. In this case only the lines of the listing that contain jimp are displayed (with the long listing, this also includes files whose owner or group is named jimp):

installation@kilo:~$ ls -lF | grep jimp
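A self-contained version of this pipe can be run against a scratch directory, as sketched below. The file names are hypothetical; a plain ls is used instead of the long listing so that only file names are searched, and grep -c is used to count the matching lines.

```shell
# Scratch directory with two matching names and one non-matching name.
pipe_dir=$(mktemp -d)
touch "$pipe_dir/jimp_notes.txt" "$pipe_dir/jimp_data.dat" "$pipe_dir/other.txt"

# Send the listing through grep; only names containing "jimp" survive.
ls "$pipe_dir" | grep jimp

# Pipes can be chained or combined with options: count the matches.
match_count=$(ls "$pipe_dir" | grep -c jimp)
echo "matching files: $match_count"
```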

H. Inter-computer communication

Most computers are connected to the internet and therefore can be accessed remotely. This interaction can happen via file transfer, e.g., copying files from one machine to another, or via direct login. The older ways of doing this include telnet (for remote login) and ftp (for file transfer). With heightened security concerns these have been replaced with “secure” versions: ssh for remote login (secure shell) and scp for file copy (secure copy).

The syntax is like other Linux commands, with scp working somewhat like cp. The main difference is scp can work over the internet. As an example, a user has a file file1.txt on a computer in the lab (comp1.soest.hawaii.edu) and wants to copy it to a remote machine (comp2.rr.com). Further, let’s assume that the user login ID is “jim” for the remote machine. The appropriate command is:

installation@comp1:~$ scp file1.txt jim@comp2.rr.com:

More generally, the syntax is scp, followed by the local filename (and path if needed), followed by the remote user name, @, the remote machine, :, and finally the remote path (the directory where the file will be placed). The colon is required; if nothing follows it, the file is placed in the remote user’s home directory. This works both for copying from a local machine to a remote machine and for copying from a remote machine to a local machine, e.g.:

jim@comp2:~$ scp pi@comp1.soest.hawaii.edu:/home/jimp/file1.txt /home/jimp

The secure shell is a little more straightforward. Here the command takes the remote machine name as an argument. Additional options that are very common include the user name (preceded with -l) on the remote machine and an optional “tunneling” for graphics. Following the above example, if user jim wants to access comp1.soest.hawaii.edu from a remote machine,

jim@comp2:~$ ssh -l pi -X comp1.soest.hawaii.edu

Here, the username on the remote machine (comp1.soest.hawaii.edu) is “pi”, so this is specified following the option -l. Without this option, ssh will assume that the username on the remote machine is the same as on the local machine (jim).

The tunneling options are typically -X and/or -Y for X11 graphics. Many graphics-based programs (GUIs) or programs that can create graphics (e.g., Matlab) will require tunneling so that these graphics can be displayed. In the above example, if user jim does an ssh into the remote machine comp1, and then runs some program that makes a display, it doesn’t make sense (usually) to have the display show up on the remote machine. Instead, the -X flag specifies in a sense that X Windows graphic displays are to be displayed on comp2. Some use -X, others -Y (for trusted X11 forwarding), but it doesn’t hurt to use both.

I. System level commands

We will not typically be running system commands on these computers (leave that to RCF). However, sometimes these commands can help diagnose problems. Linux allows complete system administration access (given the right user permissions). System level tools are also numerous, but the main ones include ways to check and kill processes, installing software, and managing system performance.

The command “top” will show the main jobs or programs running, in real-time, on the system. The “q” key will escape out of top. An example top response is (note this was done on a Mac):

Processes: 177 total, 2 running, 11 stuck, 164 sleeping, 831 threads 18:23:28

Load Avg: 1.02, 1.05, 1.19 CPU usage: 1.46% user, 2.93% sys, 95.59% idle

SharedLibs: 11M resident, 11M data, 0B linkedit.

MemRegions: 69456 total, 2532M resident, 56M private, 628M shared.

PhysMem: 6643M used (1573M wired), 9737M unused.

VM: 438G vsize, 1064M framework vsize, 6523591(0) swapins, 8411222(0) swapouts.

Networks: packets: 33086991/44G in, 6542916/693M out.

Disks: 3162666/428G read, 3646933/509G written.

PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP PPID

4815 top 1.7 00:00.44 1/1 0 19 2288K 0B 0B 4815 1786

4810 com.apple.iC 0.0 00:00.06 2 0 45 1984K 0B 0B 4810 1

4803 mdworker 0.0 00:00.01 4 0 49 1588K 0B 0B 4803 1

4794 ocspd 0.0 00:00.03 1 0 19 1208K 0B 0B 4794 1

4787- Microsoft AU 0.0 00:00.07 2 0 96 3604K 0B 0B 4787 1

4781- Microsoft Wo 0.3 00:36.35 4 1 153 150M 0B 0B 4781 1

4772 mdworker 0.0 00:00.12 4 0 59 7876K 0B 0B 4772 1

4770 mdworker 0.0 00:00.05 4 0 59 7840K 0B 0B 4770 1

4769 CalNCService 0.0 00:00.34 2 0 62 8256K 0B 0B 4769 1

4765 mdworker 0.0 00:00.30 4 0 59 9036K 0B 0B 4765 1

4751 distnoted 0.0 00:00.02 2 0 29 636K 0B 0B 4751 1

4721 coresymbolic 0.0 00:00.02 2 1 23 904K 0B 0B 4721 1

4719 bird 0.0 00:00.12 5 0 93 4048K 0B 0B 4719 1

4710 com.apple.GS 0.0 00:00.01 3 2 22 980K 0B 0B 4710 1

4707 deleted 0.0 00:00.04 2 1 33 1436K 0B 0B 4707 1

A similar process-oriented command is “ps”, for “process status”. This command will show the processes that are running, along with their process ID, time running, and user (note that the more useful options are -ef, so typically the command is ps -ef). Processes can be terminated (if the user has sufficient permission) using the “kill” command with the appropriate process ID.
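The steps of starting a background job, finding its process ID and terminating it can be sketched with a harmless stand-in program. Here sleep 300 (a command that simply waits 300 seconds) plays the role of a long-running job; $! is the shell variable holding the process ID of the most recent background command.

```shell
# Start a harmless long-running process in the background.
sleep 300 &
sleep_pid=$!
echo "started sleep with process ID $sleep_pid"

# Show just that process; ps -ef would list every process on the system.
ps -p "$sleep_pid"

# Terminate it and wait for the shell to collect it.
kill "$sleep_pid"
wait "$sleep_pid" 2>/dev/null || true

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$sleep_pid" 2>/dev/null; then
    still_running=yes
else
    still_running=no
fi
echo "still running: $still_running"
```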

Maintaining system software can be done by installing from source (i.e., compiling the code) or by using a package management system. The Debian family of Linux distributions (e.g., Ubuntu) uses “apt” (Advanced Packaging Tool). There are two main commands: apt-get to install, upgrade and remove packages, and apt-cache to query information about packages. (Note that other Linux variants use yum or rpm or something similar.) Below is a list of the more common command options:

  • apt-cache pkgnames – list available packages

  • apt-cache show *packname* – provide details about a certain package (specify with packname)

  • apt-cache showpkg *packname* – show the dependencies for packname (some packages require extra libraries or additional programs in order to work)

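  • apt-get update – to refresh the list of available packages and versions (this does not install anything)

  • apt-get upgrade – to upgrade all installed packages to the newest available versions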

  • apt-get install *packname* – to install the package packname (note there are many options here, including to update to a specific version, to use wild cards, to add multiple packages at once, and so on)

  • apt-get remove *packname* – to remove package packname

  • apt-get purge *packname* – similar to remove, but in addition all configuration files will go

J. File types

We already discussed certain characters to avoid in naming files as they may be misinterpreted by the Linux shell. At this point it would be useful to discuss certain file conventions. In short, there are two types of files that we need to know: binary and ASCII. Binary files are “machine readable”, i.e., the computer (or program) will know what to do with them, but are not readable text. Files of this type include certain image file formats, executable programs, etc. If you try one of our commands to view a binary file, e.g., “cat myprog.exe”, you’ll see unrecognizable output on the screen.

ASCII files, on the other hand, are human readable. For these, we can use a standard text editor to manipulate them. It should be noted that some files may seem to be binary but are actually ASCII (e.g., PostScript files; PDF files also start with a readable text header, although most contain binary data as well), while others might be thought to be ASCII but are actually binary (e.g., .doc files).

There is a Linux command “file” that will report (if known) what the file type is. For example:

installation@kilo:~$ file mat1.m

mat1.m: ASCII text

installation@kilo:~$ file graph1.jpg

graph1.jpg: JPEG image data, JFIF standard 1.01

installation@kilo:~$ file test2.mat

test2.mat: Matlab v5 mat-file (little endian) version 0x0100

There is a standard convention for file names that we will try to adhere to in class. Usually file names consist of some name and a suffix, with a dot (.) between the two. The suffix should help identify the file type (although you should not rely on this). For example, you are likely familiar with files ending in .doc or .docx (Microsoft Word files), .xls or .xlsx (Microsoft Excel files), and .pdf (Portable Document Format) files.

In this class we will work with ASCII data files usually named file.dat or file.txt. Script files (again ASCII) for Matlab will end with .m, for Python .py and for shell scripts, .s (or .csh and .sh). Binary files will usually end in .bin (in a generic sense), .mat for Matlab binary, and .nc for NetCDF files. GIS files (e.g., shapefiles) have their own convention that we’ll discuss later.

1

Technically speaking, UNIX is a trademark (now held by The Open Group), and commercial UNIX systems were sold by big computing firms like Sun and HP; Linux on the other hand was developed as a UNIX clone. I will just be using UNIX throughout this document, but it’s probably more appropriate to use Linux.