Reading data

Data input

Getting data into and out of our programs is a key component of our scripts. There are several ways to do this (yet another good/bad aspect of Python). For class we will rely on a few key Python packages:

  1. numpy: contains many numerical functions; together with matplotlib and scipy it provides many of the same features as MATLAB

  2. scipy: scientific analysis packages

  3. pandas: package specifically for loading data and creating data frames

  4. matplotlib: provides many of the MATLAB plotting functions

Let’s start with a simple set of examples. If we assume we have an ASCII dataset, for example the Honolulu tide gauge data from past classes, here are a few ways to read the data into a script.

  1. ASCII

    1. open and readline (this makes variables that are strings, not arrays)

             # Open file
             f = open('sample.dat', 'r')
             # Read and ignore header lines
             header1 = f.readline()
             header2 = f.readline()
             # Loop over lines and extract variables of interest
             for line in f:
                 line = line.strip()
                 columns = line.split()
                 month = columns[0]
                 temp = float(columns[1])
                 print(month, temp)
             f.close()
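The same pattern reads more safely with a with-block, which closes the file automatically even if an error occurs mid-loop. A minimal sketch; the file contents below are invented so the example is runnable:

```python
# Write a small two-header-line sample file so the sketch is self-contained
with open('sample.dat', 'w') as f:
    f.write('# station: example\n')
    f.write('# month temp\n')
    f.write('Jan 22.1\n')
    f.write('Feb 22.8\n')

months, temps = [], []
# The with-statement closes the file when the block ends
with open('sample.dat') as f:
    f.readline()  # skip header line 1
    f.readline()  # skip header line 2
    for line in f:
        columns = line.split()
        months.append(columns[0])
        temps.append(float(columns[1]))

print(months, temps)
```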
            
    2. numpy loadtxt (this makes an m by n numpy array)

             data = np.loadtxt('sample.dat', delimiter=',', comments='#')

    3. numpy fromfile (if the number of columns is not consistent)

             data = np.fromfile('sample2.dat', dtype=float, sep='\t', count=-1)

    4. numpy fromregex (this makes an m by n numpy array)

             data = np.fromregex('sample.dat', r'(\d+),\s(\d+)', dtype=float)

    5. numpy genfromtxt (this makes an m by n numpy array)

             data = np.genfromtxt('sample.dat', delimiter=',', skip_header=2)
             # or, if the columns have different types:
             # 1   2.0000  buckle_my_shoe
             # 3   4.0000  margery_door
             data = np.genfromtxt('filename', dtype=None)
             # data = [(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')]

    6. pandas read_table (this makes a pandas.core.frame.DataFrame; note that the first row is used for the column headers by default, so you may need to specify the header row)

             data = pd.read_table('sample.dat', sep=',')

    7. pandas read_csv (this makes a pandas.core.frame.DataFrame; note that the first row is used for the column headers by default, so you may need to specify the header row)

             data = pd.read_csv('sample.dat', header=1)
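To see how loadtxt and genfromtxt behave, here is a small sketch that feeds the same in-memory text (modeled loosely on the tide gauge format) to both readers; both skip `#` comment lines by default:

```python
import io
import numpy as np

# Simulated file contents: two comment lines, then comma-separated numbers
text = """# hourly sea level
# year,month,day,hour,level
1992,12,4,1,2063
1992,12,4,2,1997
"""

# loadtxt skips '#' comment lines automatically
a = np.loadtxt(io.StringIO(text), delimiter=',')

# genfromtxt does the same here, and also handles missing values
b = np.genfromtxt(io.StringIO(text), delimiter=',')

print(a.shape)  # each row of the file becomes a row of the array
```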
            
  2. sound (WAV) files

    1. librosa

            x, sr = librosa.load('sample.wav')
           
    2. wave

             wf = wave.open('sample.wav', 'rb')
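The wave module is in the standard library, so a complete round trip is easy to run; a minimal sketch, where the sample values below are invented:

```python
import wave
import struct

# Write a tiny mono 16-bit WAV file so the read example has input
with wave.open('sample.wav', 'wb') as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(8000)   # 8 kHz
    wf.writeframes(struct.pack('<4h', 0, 1000, -1000, 0))

# Read it back: readframes returns raw bytes, so unpack them to integers
with wave.open('sample.wav', 'rb') as wf:
    n = wf.getnframes()
    raw = wf.readframes(n)
    samples = struct.unpack('<%dh' % n, raw)

print(n, samples)
```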
           
  3. NetCDF

    1. netCDF4

                from netCDF4 import Dataset
                fh = Dataset('sample.nc', mode='r')
                time = fh.variables['time'][:]
                lon = fh.variables['lon'][:,:]
                lat = fh.variables['lat'][:,:]
                temp = fh.variables['temp'][:,:]
             
    2. xarray

             import xarray as xr
             ds = xr.open_dataset('sample.nc')
             df = ds.to_dataframe()
      
  4. OPeNDAP

    1. netCDF4 – see above (just like a local file, but pass the URL endpoint)

    2. pydap

             from pydap.client import open_url
             # set URL from PO.DAAC
             dataset = open_url("http://opendap-uat.jpl.nasa.gov/thredds/dodsC/ncml_aggregation/OceanTemperature/ghrsst/aggregate__ghrsst_DMI_OI-DMI-L4-GLOB-v1.0.ncml")
             lat = dataset.lat[:]
             lon = dataset.lon[:]
             time = dataset.time[:]
             sst = dataset.analysed_sst.array[0]
        
  5. matlab binary

    1. scipy loadmat

             from scipy.io import loadmat
             fin1 = loadmat('sample.mat', squeeze_me=True)
             mtime = fin1['mday']
             Tair = fin1['ta_h']
             Press = fin1['bpr']
                  
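The loadmat call needs an existing .mat file; here is a runnable round-trip sketch, assuming scipy is installed. The variable names mirror the example, but the values are invented:

```python
import numpy as np
from scipy.io import savemat, loadmat

# Write a small .mat file so loadmat has something to read
savemat('sample.mat', {'mday': np.arange(3.0),
                       'ta_h': np.array([25.1, 25.3, 24.9])})

# squeeze_me=True drops the singleton dimensions MATLAB adds to 1-D data
fin1 = loadmat('sample.mat', squeeze_me=True)
mtime = fin1['mday']
Tair = fin1['ta_h']

print(mtime.shape, Tair[0])
```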
          
  6. shapefile

    1. geopandas

             import geopandas as gpd
             shape_gpd = gpd.read_file('sample.shp')

    2. salem

             import salem
             shpf = salem.get_demo_file('sample.shp')
             gdf = salem.read_shapefile(shpf)
            
        [4]:
        
        # Open file,
        # filename is sample_clev.dat, open as read only 'r'
        f = open('../jupyter-gesteach/data/sample_clev.dat', 'r')
        # Read and ignore header lines
        header1 = f.readline()
        header2 = f.readline()
        # Loop over lines and extract variables of interest
        for line in f:
            line = line.strip()
            columns = line.split()
            year = columns[0]
            month = columns[1]
            day = columns[2]
            hour = columns[3]
            clev = columns[4]
            print(hour, clev)
        f.close()
        
        1 2063
        2 1997
        3 1846
        4 1689
        5 1517
        6 1441
        7 1412
        8 1472
        9 1590
        10 1718
        11 1883
        12 1988
        13 2048
        14 1965
        15 1830
        16 1639
        17 1461
        18 1348
        19 1290
        20 1343
        21 1480
        22 1662
        23 1890
        0 2059
        1 2172
        2 2198
        3 2060
        4 1843
        5 1631
        6 1476
        7 1315
        8 1271
        9 1382
        10 1504
        11 1728
        12 1976
        13 2070
        14 2044
        15 1997
        16 1822
        17 1559
        18 1327
        19 1232
        20 1194
        21 1202
        22 1471
        23 1747
        0 2000
        1 2222
        2 2295
        3 2207
        4 2028
        5 1775
        6 1526
        7 1271
        8 1135
        9 1130
        10 1261
        11 1527
        12 1817
        
        [5]:
        
        import numpy as np
        data = np.loadtxt('data/sample_clev.dat', delimiter=' ', comments='#')
        print(data)
        
        [[1.992e+03 1.200e+01 4.000e+00 1.000e+00 2.063e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.000e+00 1.997e+03]
         [1.992e+03 1.200e+01 4.000e+00 3.000e+00 1.846e+03]
         [1.992e+03 1.200e+01 4.000e+00 4.000e+00 1.689e+03]
         [1.992e+03 1.200e+01 4.000e+00 5.000e+00 1.517e+03]
         [1.992e+03 1.200e+01 4.000e+00 6.000e+00 1.441e+03]
         [1.992e+03 1.200e+01 4.000e+00 7.000e+00 1.412e+03]
         [1.992e+03 1.200e+01 4.000e+00 8.000e+00 1.472e+03]
         [1.992e+03 1.200e+01 4.000e+00 9.000e+00 1.590e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.000e+01 1.718e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.100e+01 1.883e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.200e+01 1.988e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.300e+01 2.048e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.400e+01 1.965e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.500e+01 1.830e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.600e+01 1.639e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.700e+01 1.461e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.800e+01 1.348e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.900e+01 1.290e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.000e+01 1.343e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.100e+01 1.480e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.200e+01 1.662e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.300e+01 1.890e+03]
         [1.992e+03 1.200e+01 5.000e+00 0.000e+00 2.059e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.000e+00 2.172e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.000e+00 2.198e+03]
         [1.992e+03 1.200e+01 5.000e+00 3.000e+00 2.060e+03]
         [1.992e+03 1.200e+01 5.000e+00 4.000e+00 1.843e+03]
         [1.992e+03 1.200e+01 5.000e+00 5.000e+00 1.631e+03]
         [1.992e+03 1.200e+01 5.000e+00 6.000e+00 1.476e+03]
         [1.992e+03 1.200e+01 5.000e+00 7.000e+00 1.315e+03]
         [1.992e+03 1.200e+01 5.000e+00 8.000e+00 1.271e+03]
         [1.992e+03 1.200e+01 5.000e+00 9.000e+00 1.382e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.000e+01 1.504e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.100e+01 1.728e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.200e+01 1.976e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.300e+01 2.070e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.400e+01 2.044e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.500e+01 1.997e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.600e+01 1.822e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.700e+01 1.559e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.800e+01 1.327e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.900e+01 1.232e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.000e+01 1.194e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.100e+01 1.202e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.200e+01 1.471e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.300e+01 1.747e+03]
         [1.992e+03 1.200e+01 6.000e+00 0.000e+00 2.000e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.000e+00 2.222e+03]
         [1.992e+03 1.200e+01 6.000e+00 2.000e+00 2.295e+03]
         [1.992e+03 1.200e+01 6.000e+00 3.000e+00 2.207e+03]
         [1.992e+03 1.200e+01 6.000e+00 4.000e+00 2.028e+03]
         [1.992e+03 1.200e+01 6.000e+00 5.000e+00 1.775e+03]
         [1.992e+03 1.200e+01 6.000e+00 6.000e+00 1.526e+03]
         [1.992e+03 1.200e+01 6.000e+00 7.000e+00 1.271e+03]
         [1.992e+03 1.200e+01 6.000e+00 8.000e+00 1.135e+03]
         [1.992e+03 1.200e+01 6.000e+00 9.000e+00 1.130e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.000e+01 1.261e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.100e+01 1.527e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.200e+01 1.817e+03]]
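Since loadtxt returns an m by n array, individual variables come out by column index rather than by name; a quick sketch with a few rows in the same five-column layout (year, month, day, hour, sea level):

```python
import io
import numpy as np

# A few rows in the same layout as sample_clev.dat
text = """1992 12 4 1 2063
1992 12 4 2 1997
1992 12 4 3 1846
"""
data = np.loadtxt(io.StringIO(text))

# Slice out the hour and sea-level columns
hour = data[:, 3]
clev = data[:, 4]
print(hour, clev)
```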
        
        [6]:
        
        # now use Pandas
        import pandas as pd
        data = pd.read_table('data/sample_clev.dat', sep=' ')
        
        [7]:
        
        data
        
        [7]:
        
        # 002 \t113 \tTarawa Unnamed: 4 Bairiki \tKiribati \t1.33200 \t173.01300
        0 # hourly sea level from UHSLC NaN NaN NaN
        1 1992 12 4 1 2063 NaN NaN NaN NaN
        2 1992 12 4 2 1997 NaN NaN NaN NaN
        3 1992 12 4 3 1846 NaN NaN NaN NaN
        4 1992 12 4 4 1689 NaN NaN NaN NaN
        ... ... ... ... ... ... ... ... ... ...
        56 1992 12 6 8 1135 NaN NaN NaN NaN
        57 1992 12 6 9 1130 NaN NaN NaN NaN
        58 1992 12 6 10 1261 NaN NaN NaN NaN
        59 1992 12 6 11 1527 NaN NaN NaN NaN
        60 1992 12 6 12 1817 NaN NaN NaN NaN

        61 rows × 9 columns
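The garbled headers above come from pandas treating the station metadata line as column names. A common fix is to skip the header lines and pass explicit column names; a sketch using an in-memory stand-in for the file (the column names here are my own choice):

```python
import io
import pandas as pd

# Simulated tide-gauge file: metadata line, comment line, then whitespace-separated data
text = """002 113 Tarawa
# hourly sea level from UHSLC
1992 12 4 1 2063
1992 12 4 2 1997
"""

# Skip the two header lines and supply our own column names
data = pd.read_csv(io.StringIO(text), sep=r'\s+', skiprows=2,
                   header=None, names=['year', 'month', 'day', 'hour', 'clev'])
print(data)
```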
