Reading data
Data input
Getting data into and out of our programs is a key component of our scripts. There are several ways to do this (yet another good/bad aspect of Python). For class we will rely on a few key Python packages:

numpy: contains many numerical functions; together with matplotlib and scipy it provides much of the functionality of MATLAB

scipy: scientific analysis packages

pandas: package specifically for loading data and creating data frames

matplotlib: provides many of the MATLAB plotting functions

Let's start with a simple set of examples. If we assume we have an ASCII dataset, for example the Honolulu tide gauge data from past classes, here are a few ways to read the data into a script.
ASCII
open and readline (this makes variables that are strings, not arrays)
# Open file
f = open('sample.dat', 'r')

# Read and ignore header lines
header1 = f.readline()
header2 = f.readline()

# Loop over lines and extract variables of interest
for line in f:
    line = line.strip()
    columns = line.split()
    month = columns[0]
    temp = float(columns[1])
    print(month, temp)
f.close()
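For delimited text, the standard-library csv module sits between a raw readline loop and numpy. A minimal sketch (the file contents here are invented stand-ins for the tide-gauge data):

```python
import csv
import io

# Stand-in for a data file: two header lines, then comma-separated rows
sample = io.StringIO(
    "# station 002\n"
    "# hourly sea level\n"
    "12,2063\n"
    "13,2048\n"
)

reader = csv.reader(sample)
next(reader)  # skip first header line
next(reader)  # skip second header line

# Each row arrives as a list of strings; convert fields as needed
rows = [(int(row[0]), float(row[1])) for row in reader]
```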
numpy loadtxt (this makes an m by n numpy array)

data = np.loadtxt('sample.dat', delimiter=',', comments='#')
numpy fromfile (returns a flat 1-D array, so it works even if the number of columns is not consistent)
data = np.fromfile('sample2.dat', dtype=float, sep='\t', count=-1)
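Because fromfile does no row/column bookkeeping, ragged lines are not a problem. A self-contained sketch using an invented temporary file:

```python
import os
import tempfile

import numpy as np

# Write ragged tab-separated text: 3 values on one line, 2 on the next
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False) as tmp:
    tmp.write("1\t2\t3\n4\t5\n")
    path = tmp.name

# Whitespace characters in sep match any whitespace, so the newline is
# skipped too and everything lands in one flat 1-D array
data = np.fromfile(path, dtype=float, sep='\t')
os.remove(path)
```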
numpy fromregex (this makes a structured numpy array, with one field per regex group)

data = np.fromregex('sample.dat', r'(\d+),\s(\d+)', [('col1', float), ('col2', float)])
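A self-contained fromregex sketch (the field names and in-memory data are invented); each capture group becomes one field of the structured result, and text that does not match the pattern is ignored:

```python
import io

import numpy as np

# Invented comma-separated text matching the pattern below
text = io.StringIO("12, 2063\n13, 2048\n")

# One dtype field per capture group
data = np.fromregex(text, r'(\d+),\s(\d+)',
                    [('hour', np.int64), ('clev', np.int64)])
```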
numpy genfromtxt (this makes an m by n numpy array)
data = np.genfromtxt('sample.dat', delimiter=',', skip_header=2)

# or, if the columns have different types:
# 1 2.0000 buckle_my_shoe
# 3 4.0000 margery_door
data = np.genfromtxt('filename', dtype=None)
# data = [(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')]
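The mixed-type case can be run as-is against an in-memory file; with dtype=None and no names supplied, numpy falls back to default field names f0, f1, f2:

```python
import io

import numpy as np

text = io.StringIO("1 2.0000 buckle_my_shoe\n3 4.0000 margery_door\n")

# dtype=None guesses a type per column; encoding=None gives Python
# strings instead of bytes for the text column
data = np.genfromtxt(text, dtype=None, encoding=None)
```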
pandas read_table (this makes a pandas.core.frame.DataFrame; note that the first row will be taken as the column headers, so you may need to specify this)
data = pd.read_table('sample.dat', sep=',')
pandas read_csv (this makes a pandas.core.frame.DataFrame; note that the first row will be taken as the column headers, so you may need to specify this)
data = pd.read_csv('sample.dat', header=1)
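To spell out the header caveat, here is a hedged sketch (file contents and column names invented) showing how to stop pandas from treating the first data row as headers:

```python
import io

import pandas as pd

# Stand-in file: two comment lines, then comma-separated data
sample = io.StringIO(
    "# station 002\n"
    "# hourly sea level\n"
    "12,2063\n"
    "13,2048\n"
)

# header=None: do not take any row as column names;
# skiprows=2: drop the two comment lines;
# names=...: supply our own (invented) column names
df = pd.read_csv(sample, header=None, skiprows=2, names=['hour', 'clev'])
```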
sound (wav) files
librosa
import librosa
x, sr = librosa.load('sample.wav')
wave
import wave
wf = wave.open('sample.wav', 'rb')
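Unlike librosa, the standard-library wave module hands back raw bytes that you unpack yourself. A self-contained sketch that writes a tiny invented wav into memory and reads it back:

```python
import io
import struct
import wave

# Write four 16-bit mono samples at 8 kHz into an in-memory buffer
buf = io.BytesIO()
with wave.open(buf, 'wb') as wf:
    wf.setnchannels(1)     # mono
    wf.setsampwidth(2)     # 2 bytes = 16-bit samples
    wf.setframerate(8000)  # 8 kHz sample rate
    wf.writeframes(struct.pack('<4h', 0, 1000, -1000, 0))

# Read it back: readframes() returns bytes, not numbers
buf.seek(0)
with wave.open(buf, 'rb') as wf:
    n = wf.getnframes()
    samples = struct.unpack('<%dh' % n, wf.readframes(n))
```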
NetCDF
netCDF4
from netCDF4 import Dataset

fh = Dataset('sample.nc', mode='r')
time = fh.variables['time'][:]
lon = fh.variables['lon'][:,:]
lat = fh.variables['lat'][:,:]
temp = fh.variables['temp'][:,:]
xarray

import xarray as xr
ds = xr.open_dataset('sample.nc')
df = ds.to_dataframe()
OPeNDAP
netCDF4 – see above (works just like a local file; pass the URL endpoint instead of a filename)
pydap
from pydap.client import open_url

# set URL from PO.DAAC
dataset = open_url("http://opendap-uat.jpl.nasa.gov/thredds/dodsC/ncml_aggregation/OceanTemperature/ghrsst/aggregate__ghrsst_DMI_OI-DMI-L4-GLOB-v1.0.ncml")
lat = dataset.lat[:]
lon = dataset.lon[:]
time = dataset.time[:]
sst = dataset.analysed_sst.array[0]
matlab binary
scipy loadmat
from scipy.io import loadmat

fin1 = loadmat('sample.mat', squeeze_me=True)
mtime = fin1['mday']
Tair = fin1['ta_h']
Press = fin1['bpr']
shapefile
geopandas
import geopandas as gpd
shape_gpd = gpd.read_file('sample.shp')
salem

import salem
shpf = salem.get_demo_file('sample.shp')
gdf = salem.read_shapefile(shpf)
[4]:
# Open file
# filename is sample_clev.dat, open as read only 'r'
f = open('../jupyter-gesteach/data/sample_clev.dat', 'r')

# Read and ignore header lines
header1 = f.readline()
header2 = f.readline()

# Loop over lines and extract variables of interest
for line in f:
    line = line.strip()
    columns = line.split()
    year = columns[0]
    month = columns[1]
    day = columns[2]
    hour = columns[3]
    clev = columns[4]
    print(hour, clev)
f.close()
1 2063
2 1997
3 1846
4 1689
5 1517
6 1441
7 1412
8 1472
9 1590
10 1718
11 1883
12 1988
13 2048
14 1965
15 1830
16 1639
17 1461
18 1348
19 1290
20 1343
21 1480
22 1662
23 1890
0 2059
1 2172
2 2198
3 2060
4 1843
5 1631
6 1476
7 1315
8 1271
9 1382
10 1504
11 1728
12 1976
13 2070
14 2044
15 1997
16 1822
17 1559
18 1327
19 1232
20 1194
21 1202
22 1471
23 1747
0 2000
1 2222
2 2295
3 2207
4 2028
5 1775
6 1526
7 1271
8 1135
9 1130
10 1261
11 1527
12 1817
[5]:
import numpy as np

data = np.loadtxt('data/sample_clev.dat', delimiter=' ', comments='#')
print(data)
[[1.992e+03 1.200e+01 4.000e+00 1.000e+00 2.063e+03]
 [1.992e+03 1.200e+01 4.000e+00 2.000e+00 1.997e+03]
 [1.992e+03 1.200e+01 4.000e+00 3.000e+00 1.846e+03]
 [1.992e+03 1.200e+01 4.000e+00 4.000e+00 1.689e+03]
 [1.992e+03 1.200e+01 4.000e+00 5.000e+00 1.517e+03]
 [1.992e+03 1.200e+01 4.000e+00 6.000e+00 1.441e+03]
 [1.992e+03 1.200e+01 4.000e+00 7.000e+00 1.412e+03]
 [1.992e+03 1.200e+01 4.000e+00 8.000e+00 1.472e+03]
 [1.992e+03 1.200e+01 4.000e+00 9.000e+00 1.590e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.000e+01 1.718e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.100e+01 1.883e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.200e+01 1.988e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.300e+01 2.048e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.400e+01 1.965e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.500e+01 1.830e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.600e+01 1.639e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.700e+01 1.461e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.800e+01 1.348e+03]
 [1.992e+03 1.200e+01 4.000e+00 1.900e+01 1.290e+03]
 [1.992e+03 1.200e+01 4.000e+00 2.000e+01 1.343e+03]
 [1.992e+03 1.200e+01 4.000e+00 2.100e+01 1.480e+03]
 [1.992e+03 1.200e+01 4.000e+00 2.200e+01 1.662e+03]
 [1.992e+03 1.200e+01 4.000e+00 2.300e+01 1.890e+03]
 [1.992e+03 1.200e+01 5.000e+00 0.000e+00 2.059e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.000e+00 2.172e+03]
 [1.992e+03 1.200e+01 5.000e+00 2.000e+00 2.198e+03]
 [1.992e+03 1.200e+01 5.000e+00 3.000e+00 2.060e+03]
 [1.992e+03 1.200e+01 5.000e+00 4.000e+00 1.843e+03]
 [1.992e+03 1.200e+01 5.000e+00 5.000e+00 1.631e+03]
 [1.992e+03 1.200e+01 5.000e+00 6.000e+00 1.476e+03]
 [1.992e+03 1.200e+01 5.000e+00 7.000e+00 1.315e+03]
 [1.992e+03 1.200e+01 5.000e+00 8.000e+00 1.271e+03]
 [1.992e+03 1.200e+01 5.000e+00 9.000e+00 1.382e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.000e+01 1.504e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.100e+01 1.728e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.200e+01 1.976e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.300e+01 2.070e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.400e+01 2.044e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.500e+01 1.997e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.600e+01 1.822e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.700e+01 1.559e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.800e+01 1.327e+03]
 [1.992e+03 1.200e+01 5.000e+00 1.900e+01 1.232e+03]
 [1.992e+03 1.200e+01 5.000e+00 2.000e+01 1.194e+03]
 [1.992e+03 1.200e+01 5.000e+00 2.100e+01 1.202e+03]
 [1.992e+03 1.200e+01 5.000e+00 2.200e+01 1.471e+03]
 [1.992e+03 1.200e+01 5.000e+00 2.300e+01 1.747e+03]
 [1.992e+03 1.200e+01 6.000e+00 0.000e+00 2.000e+03]
 [1.992e+03 1.200e+01 6.000e+00 1.000e+00 2.222e+03]
 [1.992e+03 1.200e+01 6.000e+00 2.000e+00 2.295e+03]
 [1.992e+03 1.200e+01 6.000e+00 3.000e+00 2.207e+03]
 [1.992e+03 1.200e+01 6.000e+00 4.000e+00 2.028e+03]
 [1.992e+03 1.200e+01 6.000e+00 5.000e+00 1.775e+03]
 [1.992e+03 1.200e+01 6.000e+00 6.000e+00 1.526e+03]
 [1.992e+03 1.200e+01 6.000e+00 7.000e+00 1.271e+03]
 [1.992e+03 1.200e+01 6.000e+00 8.000e+00 1.135e+03]
 [1.992e+03 1.200e+01 6.000e+00 9.000e+00 1.130e+03]
 [1.992e+03 1.200e+01 6.000e+00 1.000e+01 1.261e+03]
 [1.992e+03 1.200e+01 6.000e+00 1.100e+01 1.527e+03]
 [1.992e+03 1.200e+01 6.000e+00 1.200e+01 1.817e+03]]
[6]:
# now use Pandas
import pandas as pd

data = pd.read_table('data/sample_clev.dat', sep=' ')
[7]:
data
[7]:
                 #     002  \t113  \tTarawa  Unnamed: 4  Bairiki  \tKiribati  \t1.33200  \t173.01300
0                #  hourly    sea    level        from    UHSLC        NaN        NaN         NaN
1             1992      12      4        1        2063      NaN        NaN        NaN         NaN
2             1992      12      4        2        1997      NaN        NaN        NaN         NaN
3             1992      12      4        3        1846      NaN        NaN        NaN         NaN
4             1992      12      4        4        1689      NaN        NaN        NaN         NaN
..             ...     ...    ...      ...         ...      ...        ...        ...         ...
56            1992      12      6        8        1135      NaN        NaN        NaN         NaN
57            1992      12      6        9        1130      NaN        NaN        NaN         NaN
58            1992      12      6       10        1261      NaN        NaN        NaN         NaN
59            1992      12      6       11        1527      NaN        NaN        NaN         NaN
60            1992      12      6       12        1817      NaN        NaN        NaN         NaN

61 rows × 9 columns