Working with HOT data (cont’d)


Lecture notes from Friday, Feb 18


We saw in the previous classes how to deal with data of a single variable, sea level as a function of time, and how to make some line plots. We will now deal with data in two dimensions: time and space (in this case depth). To visualize these types of data we will use contour plots, with values contoured (or color-shaded) on a grid with depth along one axis and time on the other.

But first, let’s continue to work with line plots and see more features of pandas DataFrames. A data set containing a summary of surface values from HOT is avaiable at https://hahana.soest.hawaii.edu/hot/products/HOT_surface_CO2.txt We will use this to do some line plotting.

The next set of data come from one of the two HOT sites: https://hahana.soest.hawaii.edu/hot/ in the data archive for CTD: ftp://ftp.soest.hawaii.edu/hot/ctd/year1_31ctd.dat and we will use this to look at contour plots.

First, as usual, we import our needed functions:

[2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as sp

1. Line plots

The first data set is called “HOT_surface_CO2.txt” and can be found at http://hahana.soest.hawaii.edu/hot/products/products.html. The dataset was by John Dore (jdore@montana.edu) and can be cited as:

Dore, J.E., R. Lukas, D.W. Sadler, M.J. Church, and D.M. Karl. 2009. Physical and biogeochemical modulation of ocean acidification in the central North Pacific. Proc Natl Acad Sci USA 106:12235-12240.

There is also documentation (HOT_surface_CO2_readme.pdf) at http://hahana.soest.hawaii.edu/hot/products/products.html

The data are laid out in an ASCII table, so we can use pandas read_csv to read them in and put them in a DataFrame. Note that the file has header information at the top, so we will skip that (header=7 to skip the top seven lines). Note also that the file has column headings, so we can read them directly. Remember the tide gauge data had none, so we manually created it; here we just read it and the DataFrame will use it as the column headings. Finally, this file is not comma separated, but white space separated (data columns are separated by a different number of white spaces).

[3]:
URL = 'https://hahana.soest.hawaii.edu/hot/products/HOT_surface_CO2.txt'
HOT_surface = pd.read_csv(URL,header=7, delim_whitespace=True)
[4]:
HOT_surface.head()
[4]:
cruise days date temp sal phos sil DIC TA nDIC ... pHmeas_insitu pHcalc_25C pHcalc_insitu pCO2calc_insitu pCO2calc_20C aragsatcalc_insitu calcsatcalc_insitu freeCO2_insitu carbonate_insitu notes
0 1 30 31-Oct-88 26.283 35.186 0.08 0.71 1963.91 2319.5 1953.5 ... -999.0 8.1292 8.1097 330.9 256.3 3.98 6.02 9.06 250.2 abc
1 2 62 02-Dec-88 25.659 34.984 0.09 0.99 1958.94 2304.9 1959.8 ... -999.0 8.1193 8.1092 330.6 262.6 3.87 5.86 9.20 243.5 bc
2 3 99 08-Jan-89 24.610 35.028 0.07 0.93 1963.77 2305.0 1962.2 ... -999.0 8.1113 8.1168 324.3 268.7 3.80 5.77 9.27 240.1 c
3 4 148 26-Feb-89 23.479 34.883 0.09 0.88 1957.80 2295.5 1964.4 ... -999.0 8.1091 8.1316 310.9 269.7 3.74 5.69 9.15 237.2 cd
4 5 177 27-Mar-89 24.278 34.735 0.12 2.01 1946.33 2283.0 1961.2 ... -999.0 8.1113 8.1218 317.7 266.8 3.74 5.69 9.17 236.4 ac

5 rows × 22 columns

It appears that the data file uses -999.0 as the missing values, so let’s replace that now, and make a quick plot to see what it looks like.

[5]:
HOT_surface = HOT_surface.replace(-999.0, np.nan)
plt.plot(HOT_surface['days'], HOT_surface['temp']);
../_images/notebooks_0223_HOT_6_0.png
[6]:
# convert to DateTime
HOT_surface['date'] = pd.to_datetime(HOT_surface['date'])
plt.plot(HOT_surface['date'], HOT_surface['temp']);
../_images/notebooks_0223_HOT_7_0.png
[7]:
# Let's zoom into a three year period
from datetime import datetime
plt.figure(figsize=(14,10))
plt.plot(HOT_surface['date'],HOT_surface['temp'],'bo-')
plt.xlim(datetime(2000,1,1),datetime(2003,12,31))
plt.grid()
../_images/notebooks_0223_HOT_8_0.png
[8]:
# Example 1 three panel plot
plt.subplots(nrows=3,ncols=1,figsize=(14,10))
plt.subplot(3,1,1)
plt.plot(HOT_surface['date'],HOT_surface['temp'])
plt.title('HOT Surface Temperature')
plt.grid()
plt.subplot(3,1,2)
plt.plot(HOT_surface['date'],HOT_surface['sal'])
plt.title('HOT Surface Salinity')
plt.grid()
plt.subplot(3,1,3)
plt.plot(HOT_surface['date'],HOT_surface['TA'])
plt.title('HOT Surface Total Alkalinity')
plt.grid()
../_images/notebooks_0223_HOT_9_0.png
[9]:
# Example 2 three panel plot
fig, ax = plt.subplots(figsize = (14,10), nrows = 3)

ax[0].plot(HOT_surface['date'],HOT_surface['temp'])
ax[0].set_title('HOT Surface Temperature')
ax[0].grid()

ax[1].plot(HOT_surface['date'],HOT_surface['sal'])
ax[1].set_title('HOT Surface Salinity')
ax[1].grid()

ax[2].plot(HOT_surface['date'],HOT_surface['TA'])
ax[2].set_title('HOT Surface Total Alkalinity')
ax[2].grid()
../_images/notebooks_0223_HOT_10_0.png
[10]:
# Now we try repeat L et al 2001
fig, ax = plt.subplots( figsize = (10,14), nrows = 2 )
ax[0].scatter(HOT_surface['sal'], HOT_surface['TA'])
ax[0].set_xlabel('Salinity')
ax[0].set_ylabel('TALK')
ax[1].scatter(HOT_surface['temp'], HOT_surface['DIC'])
ax[1].set_xlabel('Temperature')
ax[1].set_ylabel('DIC');
../_images/notebooks_0223_HOT_11_0.png

reproduce the HOT figure:

9a0fec997d26491ab7553102738a3d7d

[11]:
# Get the CO2 data from NOAA
URL = 'https://www.esrl.noaa.gov/gmd/aftp/data/trace_gases/co2/in-situ/surface/mlo/co2_mlo_surface-insitu_1_ccgg_MonthlyData.txt'
CO2_data = pd.read_csv(URL,delim_whitespace=True,header=150)
CO2_data.head()

# make datetime; note a little more tricky this time
year = (CO2_data['time_decimal'].astype(int))
doy = ( CO2_data['time_decimal'] - CO2_data['time_decimal'].astype(int) ) * 365.25
CO2_date = pd.to_datetime(year*1000+doy+1, format = '%Y%j')
CO2_data = CO2_data.replace(-999.99, np.nan)
CO2_data['date'] = CO2_date
plt.plot(CO2_data['date'],CO2_data['value'])
plt.plot(HOT_surface['date'],HOT_surface['pHmeas_insitu']);
../_images/notebooks_0223_HOT_13_0.png
[12]:
fig, ax1 = plt.subplots( figsize = (12,12) )
plt.title('CO2 Time Series in the North Pacific Ocean')

# 1. make first plot, CO2 (atm and ocean) as ax1
ax1.set_xlabel('Year')
ax1.set_ylabel('CO2 (ppm)')
ax1.plot(CO2_data['date'], CO2_data['value'], marker = 'd', color='red',label='Mauna Loa CO2 (ppmv)')
#ax1.tick_params(axis='y', labelcolor='red')
ax1.plot(HOT_surface['date'], HOT_surface['pCO2calc_insitu'],
         marker = 'd', color='blue',label='ALOHA Seawater pCO2')
ax1.set_ylim(275, 400)
ax1.legend()

[12]:
<matplotlib.legend.Legend at 0x7f5118813040>
../_images/notebooks_0223_HOT_14_1.png
[13]:
# We have a figure, but we have quite a few parameters to change:
# 1.  add titles and labels
# 2.  use same x-axis but different y-axis so we can see the variability
#     - we will define two axis objects, ax1 and ax2 and reference these
# 3.  add legend
#     - use plot object to get this

fig, ax1 = plt.subplots( figsize = (12,12) )
plt.title('CO2 Time Series in the North Pacific Ocean')

# 1. make first plot, CO2 (atm and ocean) as ax1
ax1.set_xlabel('Year')
ax1.set_ylabel('CO2 (ppm)')
ax1.plot(CO2_data['date'], CO2_data['value'], marker = 'd', color='red',label='Mauna Loa CO2 (ppmv)')
#ax1.tick_params(axis='y', labelcolor='red')
ax1.plot(HOT_surface['date'], HOT_surface['pCO2calc_insitu'],
         marker = 'd', color='blue',label='ALOHA Seawater pCO2')
ax1.set_ylim(275, 400)
ax1.legend()

# 2. instantiate a second axes that shares the same x-axis
ax2 = ax1.twinx()
ax2.set_ylabel('pH')
ax2.plot(HOT_surface['date'], HOT_surface['pHmeas_insitu'],
         marker = 'd', color='cyan',label='ALOHA Seawater pH')
#ax2.tick_params(axis='y', labelcolor='cyan')
ax2.set_ylim(8.03, 8.38)
ax2.legend();
../_images/notebooks_0223_HOT_15_0.png
[14]:
CO2_data.head()
[14]:
site_code year month day hour minute second time_decimal value value_std_dev nvalue latitude longitude altitude elevation intake_height qcflag date
0 MLO 1973 1 1 0 0 0 1973.000000 NaN -99.99 0 19.536 -155.576 3437.0 3397.0 40.0 *.. 1973-01-01
1 MLO 1973 2 1 0 0 0 1973.084932 NaN -99.99 0 19.536 -155.576 3437.0 3397.0 40.0 *.. 1973-02-01
2 MLO 1973 3 1 0 0 0 1973.161644 NaN -99.99 0 19.536 -155.576 3437.0 3397.0 40.0 *.. 1973-03-01
3 MLO 1973 4 1 0 0 0 1973.246575 NaN -99.99 0 19.536 -155.576 3437.0 3397.0 40.0 *.. 1973-04-01
4 MLO 1973 5 1 0 0 0 1973.328767 NaN -99.99 0 19.536 -155.576 3437.0 3397.0 40.0 *.. 1973-05-01
[15]:
x1  = np.array(CO2_data[CO2_data['value'].notna()]['date'].dropna(), dtype=float)
y1  = np.array(CO2_data['value'].dropna().values, dtype=float)
[17]:
slope1, intercept1, r_value1, p_value1, std_err1 = sp.linregress(x1,y1)
print(slope1,intercept1,r_value1,p_value1,std_err1)
5.67182538057028e-17 318.2202763702599 0.9914657958383869 0.0 3.1599881771584806e-19
[18]:
# Now we need to fit some lines; to do this, we will use the scipy linear regression
# module
# define variables; note we only use rows where the variable is non a NAN
# look for all rows that have NaN in the column headed "value" and skip these
# rows
x1  = np.array(CO2_data[CO2_data['value'].notna()]['date'].dropna(), dtype=float)
y1  = np.array(CO2_data['value'].dropna().values, dtype=float)
slope1, intercept1, r_value1, p_value1, std_err1 = sp.linregress(x1,y1)
xf  = np.linspace(min(x1), max(x1), 100)
xf1 = xf.copy()
xf1 = pd.to_datetime(xf1)
yf1 = (slope1*xf) + intercept1
print('r = ', r_value1, '\n', 'p = ', p_value1, '\n', 's = ', std_err1)

x2  = np.array(HOT_surface[HOT_surface['pCO2calc_insitu'].notna()]['date'].dropna(), dtype=float)
y2  = np.array(HOT_surface['pCO2calc_insitu'].dropna().values, dtype=float)
slope2, intercept2, r_value2, p_value2, std_err2 = sp.linregress(x2,y2)
xf  = np.linspace(min(x2), max(x2), 100)
xf2 = xf.copy()
xf2 = pd.to_datetime(xf2)
yf2 = (slope2*xf) + intercept2
print('r = ', r_value2, '\n', 'p = ', p_value2, '\n', 's = ', std_err2)

x3  = np.array(HOT_surface[HOT_surface['pHmeas_insitu'].notna()]['date'].dropna(), dtype=float)
y3  = np.array(HOT_surface['pHmeas_insitu'].dropna().values, dtype=float)
slope3, intercept3, r_value3, p_value3, std_err3 = sp.linregress(x3,y3)
xf  = np.linspace(min(x3),max(x3),100)
xf3 = xf.copy()
xf3 = pd.to_datetime(xf3)
yf3 = (slope3*xf)+intercept3
print('r = ', r_value3, '\n', 'p = ', p_value3, '\n', 's = ', std_err3)
r =  0.9914657958383869
 p =  0.0
 s =  3.1599881771584806e-19
r =  0.7972626084939435
 p =  4.707211641265142e-69
 s =  2.562318459034268e-18
r =  -0.7225105066828504
 p =  1.1381023220096529e-35
 s =  3.2893110009056287e-21
[19]:
fig, ax1 = plt.subplots( figsize = (12,12) )
plt.title('CO2 Time Series in the North Pacific Ocean')

# 1. make first plot, CO2 (atm and ocean) as ax1
ax1.set_xlabel('Year')
ax1.set_ylabel('CO2 (ppm)')
ax1.plot(CO2_data['date'], CO2_data['value'], marker = 'd', color='red', label='Mauna Loa CO2 (ppmv)')
ax1.plot(xf1, yf1, color='black')
#ax1.tick_params(axis='y', labelcolor='red')
ax1.plot(HOT_surface['date'], HOT_surface['pCO2calc_insitu'], marker = 'd', color='blue', label='ALOHA Seawater pCO2')
ax1.plot(xf2, yf2,color='black')
ax1.set_ylim(275,400)
ax1.legend()

# 2. instantiate a second axes that shares the same x-axis
ax2 = ax1.twinx()
ax2.set_ylabel('pH')
ax2.plot(HOT_surface['date'], HOT_surface['pHmeas_insitu'], marker = 'd', color='cyan',label='ALOHA Seawater pH')
ax2.plot(xf3, yf3, color='black')
#ax2.tick_params(axis='y', labelcolor='cyan')
ax2.set_ylim(8.03, 8.38)
ax2.legend();
../_images/notebooks_0223_HOT_20_0.png
[22]:
# All looks good except legend is broken in two parts

fig, ax1 = plt.subplots( figsize = (12,12) )
plt.title('CO2 Time Series in the North Pacific Ocean')

# 1. make first plot, CO2 (atm and ocean) as ax1
ax1.set_xlabel('Year')
# note here we can color the axis label the same as the line
ax1.set_ylabel('CO2 (ppm)', color='red')
ax1.tick_params(axis='y', labelcolor='red')

# save the plot object as lns1 and lns2
lns1 = ax1.plot(CO2_date, CO2_data['value'], marker = 'd', color='red',label='Mauna Loa CO2 (ppmv)')
lns2 = ax1.plot(HOT_surface['date'], HOT_surface['pCO2calc_insitu'],marker='d',color='blue',label='ALOHA Seawater pCO2')
ax1.plot(xf1, yf1,color='black')
ax1.plot(xf2, yf2,color='black')
ax1.set_ylim(275,400)

# 2. instantiate a second axes that shares the same x-axis
ax2 = ax1.twinx()
ax2.set_ylabel('pH', color='cyan')
ax2.tick_params(axis='y', labelcolor='cyan', size=16)

# save the third plot object as lns3
lns3 = ax2.plot(HOT_surface['date'], HOT_surface['pHmeas_insitu'], marker='d',color='cyan',label='ALOHA Seawater pH')
ax2.plot(xf3, yf3,color='black')
ax2.set_ylim(8.03, 8.38)

# finally, show the legend based on lns1, lns2 and lns3
lns = lns1+lns2+lns3
labs = [l.get_label() for l in lns]
ax1.legend(lns, labs, loc=0);
../_images/notebooks_0223_HOT_21_0.png

2. Data with two dimensions (depth and time)

Next, we will look at how to visualize data with an added dimension, depth in this case. We will use contour/color shaded plots. These are a common way to visualize two-dimensional data. You could have, for example, a map of temperature so the x-axis is longitude, the y-axis is latitude, and then the temperature is color-shaded.

There are a few different ways to make such plots in Python, including “contour”, “contourf”, “pcolor” and “imshow”. You can find the difference from the Python Matplotlib help pages on the web. First we will go through some generic examples, then try make one from actual HOT data.

2a. Synthetic data

Let’s start with a simple example. We define a variable, A, that is a function of y and z defined as follows:

y = sin(x)

z = cos(x)

[23]:
x = np.array([2, 4, 6])
y = np.array([1,3,5])
[26]:
z = y[:,np.newaxis]*x
[27]:
print(z)
[[ 2  4  6]
 [ 6 12 18]
 [10 20 30]]
[ ]:

[28]:
# Recall how we can create a sine/cosine 0 to 2 pi:
x = np.arange(0, 2*np.pi+0.01, 0.01)
y = np.sin(x)
z = np.cos(x)

# We now make a two dimensional (axis), array
A = y[:, np.newaxis] * z
[32]:
fig = plt.subplots(figsize=(8,8))
plt.pcolormesh(x,x,A)
[32]:
<matplotlib.collections.QuadMesh at 0x7f511429cd30>
../_images/notebooks_0223_HOT_28_1.png
[27]:
# Now we make a four panel plot, 2 rows and 2 columns showing different methods
fig, ax = plt.subplots(2,2, figsize = (10,10))
ax[0,0].contour(x, x, A)
ax[0,0].set_title('contour')
ax[0,1].contourf(x, x, A)
ax[0,1].set_title('contourf')
ax[1,0].pcolormesh(x, x, A, shading='auto')
ax[1,0].set_title('colormesh')
ax[1,1].imshow(A)
ax[1,1].set_title('imshow');
../_images/notebooks_0223_HOT_29_0.png
[33]:
# color levels in plot
clevs  = np.linspace(-0.75, 0.75, 100)
clevs1 = np.linspace(-0.75, 0.75, 10)

fig, ax = plt.subplots(2,3, figsize = (14,4))

# contourf cuts out data beyond limits normally
cs0 = ax[0,0].contourf(x, x, A, levels=clevs)
plt.colorbar(cs0, ax = ax[0,0])

# this can be fixed by extending the color range,
# which automatically extends the color bar as well when
# it is called
cs1 = ax[1,0].contourf(x, x, A, levels=clevs, extend='both')
plt.colorbar(cs1, ax=ax[1,0])

# contour behaves as contourf does
cs2 = ax[0,1].contour(x, x, A, levels=clevs1)
plt.colorbar(cs2, ax=ax[0,1])

# in some cases, it makes more sense to add labels directly
# to contours instead of a colorbar
cs3 = ax[1,1].contour(x, x, A, levels=clevs1)
ax[1,1].clabel(cs3, cs3.levels, inline=True)

# pcolormesh takes limits only, not color levels;
# it does not cut off data beyond the given limits,
# but it does not automatically extend the colorbar
cs4 = ax[0,2].pcolormesh(x, x, A, vmin=-0.75, vmax=0.75, shading='auto')
plt.colorbar(cs4, ax=ax[0,2])

# that must be done manually
cs5 = ax[1,2].pcolormesh(x, x, A, vmin=-0.75, vmax=0.75, shading='auto')
plt.colorbar(cs5, ax=ax[1,2], extend = 'both')
[33]:
<matplotlib.colorbar.Colorbar at 0x7f510fac2430>
../_images/notebooks_0223_HOT_30_1.png

2b. HOT data

The second data set we will use is somewhat similar to HOT_surface_CO2: it is an ASCII table with columns separated by white spaces. However, in this case there are no column names, so we will have to add those. The data are available at ftp://ftp.soest.hawaii.edu/hot/ctd/year1_31ctd.dat and there is a copy in our data directory.

Note the data are aligned in an ASCII table, so we can use Pandas read_csv. The columns are day since Jan 1, 1988; depth (m); density; temperature (C); salinity; oxygen. There is a space between the columns, so we will need to specify that as the delimiter.

[35]:
# Download the data from the web site, and read it in.  Note we need to
# specify that the column separater is white space (not commas)
# Like last time, we can specify the column titles since there are none
# in the file itself
column_names = ['day', 'depth', 'density', 'temperature', 'salinity', 'oxygen']
hot_ctd = pd.read_csv('../data/HOTyears01_31ctd.dat', header = None, delim_whitespace = True, names = column_names)
[36]:
hot_ctd.head()
[36]:
day depth density temperature salinity oxygen
0 305 0 -23.125200 26.269300 35.2335 224.00
1 305 -2 -23.118516 26.270650 35.2252 226.20
2 305 -4 -23.104949 26.272301 35.2079 224.75
3 305 -6 -23.105681 26.269251 35.2076 215.70
4 305 -8 -23.107977 26.266502 35.2095 218.00
[37]:
# The time is given as "days since January 1, 1988", but we can easily
# change this to a more proper date with datatime
date = pd.to_datetime(hot_ctd['day'], origin = '01-01-1988', unit='d')
[38]:
# Let's now add this variable to the dataframe, and make it the index
hot_ctd['date'] = date
hot_ctd = hot_ctd.set_index('date')
[39]:
# Okay, what do we have
hot_ctd.head()
[39]:
day depth density temperature salinity oxygen
date
1988-11-01 305 0 -23.125200 26.269300 35.2335 224.00
1988-11-01 305 -2 -23.118516 26.270650 35.2252 226.20
1988-11-01 305 -4 -23.104949 26.272301 35.2079 224.75
1988-11-01 305 -6 -23.105681 26.269251 35.2076 215.70
1988-11-01 305 -8 -23.107977 26.266502 35.2095 218.00
[45]:
#Z = hot_ctd.pivot_table(index='date', columns='depth', values='temperature').T.values
Z = hot_ctd.pivot_table(index='date', columns='depth', values='oxygen').values
X_unique = np.sort(hot_ctd.depth.unique())
Y_unique = np.sort(hot_ctd.day.unique())
X, Y = np.meshgrid(X_unique, Y_unique)
print(X.shape, Y.shape, Z.shape)
(312, 721) (312, 721) (312, 721)
[46]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,10))
#ax = fig.add_subplot(111)
plt.contourf(Y, X, Z, alpha = 0.7, cmap = plt.cm.jet);
plt.colorbar()
[46]:
<matplotlib.colorbar.Colorbar at 0x7f510d89ca00>
../_images/notebooks_0223_HOT_38_1.png
[47]:
x = hot_ctd['temperature'][0:100]
y = hot_ctd['depth'][0:100]
plt.plot(x,y)
[47]:
[<matplotlib.lines.Line2D at 0x7f510d7b6280>]
../_images/notebooks_0223_HOT_39_1.png
[ ]: