Tide Gauges¶
Lecture notes from Friday, Jan 11
We are now ready to work with some data, and our example will be tide gauge data from the UH Sea Level Center (http://uhslc.soest.hawaii.edu).
As with most applications, we start by actually getting the data. There are many different ways to do this, but in a broad sense we can either get the data and store it on a local machine/directory, or try to access the data directly from a "data server", e.g., a remote web site. This latter option is not always available, so it's good to learn both methods. It is also sometimes useful to have the data on your local machine, for example so you can still work with it while not on the Internet.
1. Getting data onto local machine¶
The basic process is to visit some web site and download a data set. If you click on something in a browser, the browser will try to open it with some known program. It should also allow you to "save as".
In this example, go to http://uhslc.soest.hawaii.edu –> data –> legacy data. You should see a page like this:
There are three columns for getting the data: "Data", "CSV" and "NetCDF". You should get the file from either the CSV or NetCDF column. If you left-click on the data of interest, e.g., Honolulu hourly under CSV, you should get a popup that asks to "save the data":
You now have the data on your local machine. It's fine to use a local python/jupyter instance to work with the data, but if you want to use the class machine (sunrise) you'll have to move the file over. This can be done with scp; recall the syntax is copy-file-from, space, copy-file-to. For example, if you download the file h001.csv into a directory called "Downloads", and you want it in your home directory for class, from your local computer you would open a terminal session and enter:
scp Downloads/h001.csv name@ocn463.soest.hawaii.edu:/home/name/jupyter/data
where name is your login name on the class machine.
Recall that you can check if the file is there by opening a terminal session on the jupyterhub, then using the linux ls command:
ls -l h001.csv
and you can view the file with
more h001.csv
Now let's look at some examples of reading the data.
[1]:
# First we import needed packages. Note these three will pretty much be the
# defaults for this class (we'll always use them), and let's stick to the
# convention of using numpy as np, pandas as pd, and matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Accessing the data with python¶
There are many different ways to access data within python, and here we'll go through just a few examples. In a broad sense we will always follow the same process: read the data in, have a look at it, and print out the type and shape to see if they make sense. It's the details that will vary from case to case.
Note it is always useful to look at the data once downloaded to make sure they make sense. For example, if you read in a dataset and expect hundreds of rows, but instead see only one, something went wrong.
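A quick way to do this is to print the first few raw lines of the file before any parsing (a minimal sketch, assuming the file was saved to ./data/h001.csv as in the examples below):
[ ]:
# peek at the first few raw lines of the file before parsing it
with open('./data/h001.csv') as f:
    for _ in range(3):
        print(f.readline().rstrip())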
[2]:
# 1. First example: load using numpy "loadtxt"
# the syntax is filename, delimiter, comments
# in this case we have comma-separated data, with
# no comment/header lines, but it doesn't hurt
# to keep this.
data = np.loadtxt('./data/h001.csv',delimiter=',',comments='#')
[4]:
data.shape
[4]:
(175698, 5)
[3]:
# We now have a variable called "data" that should
# have all the sea level data in it. Let's print
# the type, shape and look at the data just to make
# sure. NOTE: type is a built-in function, so the
# syntax is type(data), while shape is an attribute of
# the object data, so its syntax is data.shape
print( 'Data are ', type(data), ' shape is ', data.shape )
print(data)
Data are <class 'numpy.ndarray'> shape is (175698, 5)
[[2001. 12. 16. 6. 1333.]
[2001. 12. 16. 7. 1130.]
[2001. 12. 16. 8. 867.]
...
[2021. 12. 31. 21. 520.]
[2021. 12. 31. 22. 596.]
[2021. 12. 31. 23. 833.]]
[5]:
# data are now loaded into a variable called "data";
# note the object is a numpy array, and there are
# 175,698 rows and five columns. Here we either
# check the web site or just assume the columns are:
# year, month, day, hour, sea level
# Next, we assign the columns to individual variables.
# NOTE: since our object "data" is a numpy array, we
# use the syntax of [row,column], with ":" meaning all
# so "year = data[:,0]" means the variable "year" will
# be set to "all rows", "first column" of "data":
year = data[:,0]
month = data[:,1]
day = data[:,2]
hour = data[:,3]
sea_level = data[:,4]
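As a quick sanity check, the first entry of each variable should match the first row of the printout above:
[ ]:
# first row should match the printout above: 2001. 12. 16. 6. 1333.
print(year[0], month[0], day[0], hour[0], sea_level[0])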
[6]:
# Finally, to check what we have, let's make a quick
# plot of sea_level
plt.plot(sea_level)
[6]:
[<matplotlib.lines.Line2D at 0x7f80fa157a90>]

[6]:
# 2. Second example: load using Pandas "read_table"
# the syntax is filename, delimiter (in this case
# sep for separator); again in this case we have
# comma-separated
data = pd.read_table('./data/h001.csv',sep=',')
[7]:
# Again, let's check what we have:
print( 'Data are ', type(data), ' shape is ', data.shape )
print(data)
Data are <class 'pandas.core.frame.DataFrame'> shape is (175697, 5)
2001 12 16 6 1333
0 2001 12 16 7 1130
1 2001 12 16 8 867
2 2001 12 16 9 553
3 2001 12 16 10 308
4 2001 12 16 11 177
... ... .. .. .. ...
175692 2021 12 31 19 584
175693 2021 12 31 20 493
175694 2021 12 31 21 520
175695 2021 12 31 22 596
175696 2021 12 31 23 833
[175697 rows x 5 columns]
[8]:
# This time we get a "pandas dataframe"
# We will see this is a much better format
# with which to work. More in the next
# example
[7]:
# 3. Third example: load using pandas read_csv.
# This is very similar to read_table, but
# presumes a comma-separated data set
data = pd.read_csv('./data/h001.csv')
print( 'Data are ', type(data), ' shape is ', data.shape )
print(data)
Data are <class 'pandas.core.frame.DataFrame'> shape is (175697, 5)
2001 12 16 6 1333
0 2001 12 16 7 1130
1 2001 12 16 8 867
2 2001 12 16 9 553
3 2001 12 16 10 308
4 2001 12 16 11 177
... ... .. .. .. ...
175692 2021 12 31 19 584
175693 2021 12 31 20 493
175694 2021 12 31 21 520
175695 2021 12 31 22 596
175696 2021 12 31 23 833
[175697 rows x 5 columns]
[10]:
# Again, we get a pandas data frame. Let's just
# look at a few features. For example, we can
# look at the top of it (data.head()) or bottom
# (data.tail())
data.head()
[10]:
2001 | 12 | 16 | 6 | 1333 | |
---|---|---|---|---|---|
0 | 2001 | 12 | 16 | 7 | 1130 |
1 | 2001 | 12 | 16 | 8 | 867 |
2 | 2001 | 12 | 16 | 9 | 553 |
3 | 2001 | 12 | 16 | 10 | 308 |
4 | 2001 | 12 | 16 | 11 | 177 |
[9]:
# Notice we have a nicely formatted table, with
# a row index (starting at zero) and column
# headings. However, since this data set had
# no headings, pandas used the first row as the
# headings. That row is actually real data, so we
# want to supply our own headings. read_csv lets
# us do that:
# define a list that contains our desired headings
column_headings = ['year', 'month', 'day', 'hour', 'sea level']
# now add that to the read_csv
data = pd.read_csv('./data/h001.csv',names=column_headings)
data.head()
[9]:
year | month | day | hour | sea level | |
---|---|---|---|---|---|
0 | 2001 | 12 | 16 | 6 | 1333 |
1 | 2001 | 12 | 16 | 7 | 1130 |
2 | 2001 | 12 | 16 | 8 | 867 |
3 | 2001 | 12 | 16 | 9 | 553 |
4 | 2001 | 12 | 16 | 10 | 308 |
[10]:
# So now we have a nice looking table. NOTE: since
# this is a DataFrame, we access the rows/columns
# differently than with the arrays (e.g., remember
# before we had data[:,0]). Now we simply address
# the column by its heading. So, to make a plot:
plt.plot(data['sea level']);

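As an aside, a DataFrame also supports numpy-style positional indexing through .iloc, so the same column could be pulled out by position (a quick sketch):
[ ]:
# positional indexing: all rows, fifth column; equivalent to data['sea level']
sl = data.iloc[:, 4]
plt.plot(sl);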
[13]:
# 4. Fourth example: load using netCDF. We now will
# use the netCDF4 package to read a netCDF file.
# netCDF files are very common in ocean/atmo data,
# and contain metadata within the file, so it's
# a great format to use.
# NOTE: you'll have to repeat the above download
# but this time get the netcdf file and not the CSV.
[39]:
from netCDF4 import Dataset
fin = Dataset('./data/h001.nc')
[40]:
# We now have an object called "fin" that has the
# data inside. Since it's a netCDF we can do
# different queries. For example,
# show all the metadata
#fin.variables
# show the variable names
fin.variables.keys()
[40]:
dict_keys(['sea_level', 'time', 'lat', 'lon', 'station_name', 'station_country', 'station_country_code', 'record_id', 'uhslc_id', 'gloss_id', 'ssc_id', 'last_rq_date'])
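Because the metadata live inside the file, you can also inspect a single variable's dimensions, shape, and attributes (a sketch; the exact attributes depend on the file):
[ ]:
# print the dimensions, shape and attributes of one variable
print(fin['sea_level'])
# list just the attribute names
print(fin['sea_level'].ncattrs())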
[16]:
# The data can be extracted in a similar way,
# where we specify the input variable name.
# For example, we can assign the dataset variable
# "lon" to a variable called longitude:
time = fin['time'][:]
latitude = fin['lat']
longitude = fin['lon']
sea_level = fin['sea_level'][0]
[17]:
# NOTE: in the above we added an extra constraint
# on time [:] and on sea_level [0]; this will become
# clear later, but essentially we are getting all
# time and all sea level, but the file stores these
# with different dimensions (check this with type).
# Again, we make a quick plot. This time we can use
# time as the x-axis and sea_level as y. Notice we
# don't have problems with missing values!
plt.plot(time,sea_level)
[17]:
[<matplotlib.lines.Line2D at 0x7fcbe7938b20>]

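To see the dimension difference mentioned in the comment, print the shapes directly (a sketch; in these files sea_level appears to carry an extra leading dimension, which is why we indexed it with [0]):
[ ]:
# the two variables are stored with different dimensions,
# hence the different indexing above ([:] vs [0])
print(fin['time'].shape)
print(fin['sea_level'].shape)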
Repeat without downloading¶
It is important to note that both the CSV and netCDF files can be imported into your python script without having to download the data. Instead you can just get the link to the data and use that (note the server has to be properly set up to do this, as the Sea Level Center's is). To get the URL, right-click on the data of interest to get a pop-up like the one below. Select "copy link location" and paste that URL into your code.
[17]:
import pandas as pd
url = 'http://uhslc.soest.hawaii.edu/data/csv/fast/daily/d057.csv'
df = pd.read_csv(url)
[15]:
df
[15]:
1905 | 1 | 2 | 1263 | |
---|---|---|---|---|
0 | 1905 | 1 | 3 | 1264 |
1 | 1905 | 1 | 4 | 1269 |
2 | 1905 | 1 | 5 | 1294 |
3 | 1905 | 1 | 6 | 1311 |
4 | 1905 | 1 | 7 | 1330 |
... | ... | ... | ... | ... |
42758 | 2022 | 1 | 27 | 1439 |
42759 | 2022 | 1 | 28 | 1432 |
42760 | 2022 | 1 | 29 | 1427 |
42761 | 2022 | 1 | 30 | 1430 |
42762 | 2022 | 1 | 31 | 1452 |
42763 rows × 4 columns
[19]:
# define end-point URL to the data set of interest (see above)
# Note that the URL contains the site number (001 below) and the
# resolution (e.g., daily). So, you could actually change this
# URL without having to go back to the site and right-click/save
# get daily data from Pohnpei (UH id 001)
URL = 'http://uhslc.soest.hawaii.edu/data/csv/fast/daily/d001.csv'
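Since the URL encodes both the resolution and the station id (zero-padded to three digits), you could even build it programmatically (a sketch based on the URL patterns above; note the filename prefix matches the resolution):
[ ]:
# build a UHSLC fast-delivery URL from station id and resolution;
# the filename prefix is 'd' for daily files and 'h' for hourly ones
station_id = 1          # Pohnpei, as above
resolution = 'daily'    # or 'hourly'
prefix = {'daily': 'd', 'hourly': 'h'}[resolution]
URL = f'http://uhslc.soest.hawaii.edu/data/csv/fast/{resolution}/{prefix}{station_id:03d}.csv'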
[20]:
# read data and have a look
# Note this will come in as a pandas dataframe
data = pd.read_csv(URL)
data.head()
[20]:
2001 | 12 | 17 | 652 | |
---|---|---|---|---|
0 | 2001 | 12 | 18 | 650 |
1 | 2001 | 12 | 19 | 654 |
2 | 2001 | 12 | 20 | 662 |
3 | 2001 | 12 | 21 | 644 |
4 | 2001 | 12 | 22 | 605 |
[21]:
# The data set has no column headings, so pandas just uses the top data line
# Let's change this to be year, month, day and sea level
column_names = ['year', 'month', 'day', 'sea level']
data = pd.read_csv(URL, names = column_names)
data.head()
[21]:
year | month | day | sea level | |
---|---|---|---|---|
0 | 2001 | 12 | 17 | 652 |
1 | 2001 | 12 | 18 | 650 |
2 | 2001 | 12 | 19 | 654 |
3 | 2001 | 12 | 20 | 662 |
4 | 2001 | 12 | 21 | 644 |
Working with the data¶
At this point we have read in the data and set up some variables. There are a couple of things we'd like to do:
Convert the time variable to something more sensible, and
Remove the missing data which appears as some big negative number
Fortunately numpy and pandas will provide the solution. First, we can use pandas to_datetime to “fix” the dates.
[19]:
# We have an index in the DataFrame that identifies each row (reminder: this
# starts at zero). Since we have one hour per row, the index amounts to "hours
# since", and the first time value is May 14, 1993. Thus, the first row is "zero
# hours since 05/14/93", the second row is "1 hour since 05/14/93", and so on.
data.head()
[19]:
year | month | day | hour | sea level | |
---|---|---|---|---|---|
0 | 1993 | 5 | 14 | 1 | 1092 |
1 | 1993 | 5 | 14 | 2 | 939 |
2 | 1993 | 5 | 14 | 3 | 804 |
3 | 1993 | 5 | 14 | 4 | 710 |
4 | 1993 | 5 | 14 | 5 | 697 |
[20]:
# there is a function in pandas to make dates easily: pd.to_datetime().
# It takes a "date", an "origin", and a unit, so if we specify the
# index as "hours since May-14-1993", each row will get the proper
# time (note this is just one way to do it).
# a) to get columns, e.g., "hour", we use square brackets: data['hour']
# but now we want the row, or index; to get this use data.index
# b) datetime will read seconds, so we convert from hours to seconds
# (*3600)
# c) finally, we specify the start date as the origin
date = pd.to_datetime(data.index*3600.0, origin = '05-14-1993', unit='s')
[21]:
# Now we can plot with a more "sensible" time
plt.plot(date,data['sea level'])
[21]:
[<matplotlib.lines.Line2D at 0x7fcbe7886370>]

[22]:
# Another way to convert the date, which is currently stored as
# year, month, day in separate columns, to a single entry. This
# involves two steps: make a new variable year-month-day (e.g.,
# 2001-12-31) and then add it as a new column
# 1. make new variable
measurement_date = data['year'].astype(str) + '-' + data['month'].astype(str) + '-' + data['day'].astype(str)
# 2. add to dataframe
data['date'] = measurement_date
data.head()
[22]:
year | month | day | sea level | date | |
---|---|---|---|---|---|
0 | 2001 | 12 | 17 | 652 | 2001-12-17 |
1 | 2001 | 12 | 18 | 650 | 2001-12-18 |
2 | 2001 | 12 | 19 | 654 | 2001-12-19 |
3 | 2001 | 12 | 20 | 662 | 2001-12-20 |
4 | 2001 | 12 | 21 | 644 | 2001-12-21 |
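As an aside, pandas can also assemble datetimes directly from columns named year, month and day, which collapses the two steps into one (a sketch, assuming a reasonably recent pandas):
[ ]:
# to_datetime can assemble dates from columns named year/month/day
data['date'] = pd.to_datetime(data[['year', 'month', 'day']])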
[24]:
# Now we use the pandas function "to_datetime" to make
# a date that is more recognizable; note we specify
# our format: %Y=4 character year; %m=2 character month
# %d=2 character day
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
[25]:
data.head()
[25]:
year | month | day | sea level | date | |
---|---|---|---|---|---|
0 | 2001 | 12 | 17 | 652 | 2001-12-17 |
1 | 2001 | 12 | 18 | 650 | 2001-12-18 |
2 | 2001 | 12 | 19 | 654 | 2001-12-19 |
3 | 2001 | 12 | 20 | 662 | 2001-12-20 |
4 | 2001 | 12 | 21 | 644 | 2001-12-21 |
[26]:
# We notice that missing values are set to some extreme
# negative number (turns out to be -32768). We want to
# remove these to make our plot look better. Note there
# are several ways to do this, but here we just replace
# all negative sea level values with "nan", or "not a number";
# .loc restricts the change to the sea level column only
data.loc[data['sea level'] < 0, 'sea level'] = np.nan
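An equivalent alternative that also touches only the sea level column is the Series method where, which keeps values where the condition is True and puts NaN elsewhere (a sketch):
[ ]:
# keep non-negative sea levels, set the rest to NaN
data['sea level'] = data['sea level'].where(data['sea level'] >= 0)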
[27]:
plt.figure(figsize=(16,8))
plt.plot(data['date'],data['sea level'])
plt.grid(linestyle='dashed')
plt.title('Sea level at Pohnpei, FSM')
plt.xlabel('Date')
plt.ylabel('sea level (mm)');

[22]:
# Now how to do this with netCDF data? Here we can use
# the same function, and it's a little easier. The netCDF
# variable "time" is defined as "days since 1800-01-01",
# so we just use that (and then plot as before):
date = pd.to_datetime(time,origin='1800-01-01 00:00:00',unit='d')
plt.plot(date,sea_level)
[22]:
[<matplotlib.lines.Line2D at 0x7f2e9f4821d0>]

Do some analysis¶
A quick thing we can do is plot the data over a specific range. Let's pull data for Honolulu and Midway, then make a plot over late May 1960 to see if we can observe the tsunami generated by the Chile earthquake.
[30]:
import datetime as dt
# read data from URL; specify URL here
HI_URL = 'http://uhslc.soest.hawaii.edu/data/csv/fast/hourly/h057.csv'
MW_URL = 'http://uhslc.soest.hawaii.edu/data/csv/fast/hourly/h050.csv'
# define column headings
column_headings = ['year', 'month', 'day', 'hour', 'sea level']
# read data into DataFrame called "data"
HI_data = pd.read_csv(HI_URL,names=column_headings)
MW_data = pd.read_csv(MW_URL,names=column_headings)
# convert time into a reasonable date format
HI_date = pd.to_datetime(HI_data.index*3600.0, origin = '01-02-1905', unit='s')
MW_date = pd.to_datetime(MW_data.index*3600.0, origin = '02-09-1947', unit='s')
# plot
plt.plot(HI_date,HI_data['sea level'],label='Honolulu')
plt.plot(MW_date,MW_data['sea level'],label='Midway')
# set the x-axis to run from specific start/end
plt.xlim([dt.datetime(1960,5,24),dt.datetime(1960,5,25)])
# set y-axis limit
plt.ylim(400,2000)
plt.grid()
plt.legend();

[33]:
# fit a linear trend to the data
#
# here we take advantage of the numpy function "polyfit",
# which fits a polynomial to data. It uses least squares
# to get the best fit, and you can specify the order as
# well (1: linear, 2: quadratic, 3: cubic, etc.)
#
# One important note: polyfit can't deal with missing
# values, so we need to somehow fill these. We will do
# this by filling with the mean. In other words, if a
# value is missing, we fill it with the mean value. This
# is not scientifically advisable, but we use it here as
# an example.
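To see the polyfit/polyval API in isolation first, here is a tiny sketch with made-up data; the station fit below follows the same pattern:
[ ]:
# minimal polyfit/polyval illustration with synthetic data
x = np.arange(10)
y = 2.0 * x + 1.0                # a perfect line: slope 2, intercept 1
coeffs = np.polyfit(x, y, 1)     # order 1 -> [slope, intercept]
print(coeffs)                    # approximately [2., 1.]
fit = np.polyval(coeffs, x)      # evaluate the fitted line at x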
[31]:
# read data from URL; specify URL here
HI_URL = 'http://uhslc.soest.hawaii.edu/data/csv/fast/hourly/h057.csv'
# define column headings
column_headings = ['year', 'month', 'day', 'hour', 'sea level']
# read data into DataFrame called "data"
HI_data = pd.read_csv(HI_URL,names=column_headings)
# convert time into a reasonable date format
HI_date = pd.to_datetime(HI_data.index*3600.0, origin = '01-02-1905', unit='s')
plt.plot(HI_date,HI_data['sea level']);

[32]:
# replace missing values with mean
# find values equal to -32767 and replace with nan
HI_data2 = HI_data.replace(-32767,np.nan)
# replace these nan values with the mean
HI_data3 = HI_data2.fillna(np.nanmean(HI_data2))
# compute the linear trend
B = np.polyfit(HI_data3.index.values,HI_data3['sea level'],1)
ssh_trend = np.polyval(B,HI_data.index.values)
plt.plot(HI_date,ssh_trend,'r-')
[32]:
[<matplotlib.lines.Line2D at 0x7f80f971dfd0>]

[33]:
B
[33]:
array([1.75693716e-04, 1.27824609e+03])
[38]:
# show the trend as text on the graph
# the syntax here is: 1) slope is the first
# value of B, i.e., B[0]; 2) this is in
# millimeters per hour, so we change this to
# centimeters per decade; 3) we use the
# matplotlib function "text" and place the
# string at x=1910, y = 2100
trend = str(B[0] * 24.0 * 365.0 * 10.0 / 10.0)
plt.plot(HI_date,HI_data2['sea level'])
plt.plot(HI_date,ssh_trend,'r-')
plt.text(dt.date(1910,1,1),2100,'trend is: ' + trend + ' cm/decade');

[36]:
# check the numerical value of the trend (in cm/decade)
B[0] * 24.0 * 365.0 * 10.0 / 10.0
[36]:
1.539076951910953
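As an aside, an f-string would let us round the annotation to a couple of digits (a sketch):
[ ]:
# same unit conversion, formatted to two decimals for the plot label
trend = f'{B[0] * 24.0 * 365.0 * 10.0 / 10.0:.2f}'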
Summary¶
[40]:
# read data from URL; specify URL here
HI_URL = 'http://uhslc.soest.hawaii.edu/data/csv/fast/hourly/h057.csv'
# define column headings
column_headings = ['year', 'month', 'day', 'hour', 'sea level']
# read data into DataFrame called "data"
HI_data = pd.read_csv(HI_URL,names=column_headings)
# convert time into a reasonable date format
HI_date = pd.to_datetime(HI_data.index*3600.0, origin = '01-02-1905', unit='s')
# find values equal to -32767 and replace with nan
HI_data2 = HI_data.replace(-32767,np.nan)
# replace these nan values with the mean
HI_data3 = HI_data2.fillna(np.nanmean(HI_data2))
# compute the linear trend
B = np.polyfit(HI_data3.index.values,HI_data3['sea level'],1)
ssh_trend = np.polyval(B,HI_data.index.values)
trend = str(B[0] * 24.0 * 365.0 * 10.0 / 10.0)
# plot
plt.plot(HI_date,HI_data2['sea level'])
plt.plot(HI_date,ssh_trend,'r-')
plt.text(dt.date(1910,1,1),2100,'trend is: ' + trend + ' cm/decade')
plt.grid()
plt.title('Tide gauge record at Honolulu')
plt.xlabel('Date')
plt.ylabel('Sea Level (mm)');

Now repeat the same thing, but using netCDF data
[39]:
# import netCDF4 package, only "Dataset" and "chartostring"
from netCDF4 import Dataset, chartostring
import matplotlib.pyplot as plt
import pandas as pd
fin = Dataset('./data/h001.nc')
print(fin.variables.keys())
latitude = fin['lat']
longitude = fin['lon']
time = fin['time'][:]
h = fin['sea_level'][0]
date = pd.to_datetime(time,origin='1800-01-01 00:00:00',unit='d')
plt.plot(date,h)
#print(fin.variables)
name = chartostring(fin['station_name'][0])
place = chartostring(fin['station_country'][0])
plt.title('Sea level at ' + str(name) + ', ' + str(place));
dict_keys(['sea_level', 'time', 'lat', 'lon', 'station_name', 'station_country', 'station_country_code', 'record_id', 'uhslc_id', 'gloss_id', 'ssc_id', 'last_rq_date'])
