Reading data

Data input

Getting data into and out of our programs is a key component of our scripts. There are several ways to do this (yet another good/bad aspect of Python). For class we will rely on a few key Python packages:

  1. numpy: contains many numerical functions; together with matplotlib and scipy it provides many of the same features as MATLAB

  2. scipy: scientific analysis packages

  3. pandas: package specifically for loading data and creating data frames

  4. matplotlib: provides many of the MATLAB plotting functions

Let’s start with a simple set of examples. If we assume we have an ASCII dataset, for example the Honolulu tide gauge data from past classes, here are a few ways to read the data into a script.

  1. ASCII

    1. open and readline (this makes variables that are strings, not arrays)

             # Open file
             f = open('sample.dat', 'r')
             # Read and ignore header lines
             header1 = f.readline()
             header2 = f.readline()
             # Loop over lines and extract variables of interest
             for line in f:
                 line = line.strip()
                 columns = line.split()
                 month = columns[0]
                 temp = float(columns[1])
                 print(month, temp)
             f.close()
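The same pattern reads more safely with a with-block, which closes the file automatically even if an error occurs mid-loop. A minimal sketch; the file contents below are invented so the example is runnable:

```python
# Write a small two-header-line sample file so the sketch is self-contained
with open('sample.dat', 'w') as f:
    f.write('# station: example\n')
    f.write('# month temp\n')
    f.write('Jan 22.1\n')
    f.write('Feb 22.8\n')

months, temps = [], []
# The with-statement closes the file when the block ends
with open('sample.dat') as f:
    f.readline()  # skip header line 1
    f.readline()  # skip header line 2
    for line in f:
        columns = line.split()
        months.append(columns[0])
        temps.append(float(columns[1]))

print(months, temps)
```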
            
    2. numpy loadtxt (this makes an m by n numpy array)

             data = np.loadtxt('sample.dat', delimiter=',', comments='#')

    3. numpy fromfile (if the number of columns is not consistent)

             data = np.fromfile('sample2.dat', dtype=float, sep='\t', count=-1)

    4. numpy fromregex (this makes an m by n numpy array)

             data = np.fromregex('sample.dat', r'(\d+),\s(\d+)', dtype=float)

    5. numpy genfromtxt (this makes an m by n numpy array)

             data = np.genfromtxt('sample.dat', delimiter=',', skip_header=2)
             # or, if the columns have different types:
             # 1   2.0000  buckle_my_shoe
             # 3   4.0000  margery_door
             data = np.genfromtxt('filename', dtype=None)
             # data = [(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')]

    6. pandas read_table (this makes a pandas.core.frame.DataFrame; note that the first row is used for the column headers by default, so you may need to specify the header row)

             data = pd.read_table('sample.dat', sep=',')

    7. pandas read_csv (this makes a pandas.core.frame.DataFrame; note that the first row is used for the column headers by default, so you may need to specify the header row)

             data = pd.read_csv('sample.dat', header=1)
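To see how loadtxt and genfromtxt behave, here is a small sketch that feeds the same in-memory text (modeled loosely on the tide gauge format) to both readers; both skip `#` comment lines by default:

```python
import io
import numpy as np

# Simulated file contents: two comment lines, then comma-separated numbers
text = """# hourly sea level
# year,month,day,hour,level
1992,12,4,1,2063
1992,12,4,2,1997
"""

# loadtxt skips '#' comment lines automatically
a = np.loadtxt(io.StringIO(text), delimiter=',')

# genfromtxt does the same here, and also handles missing values
b = np.genfromtxt(io.StringIO(text), delimiter=',')

print(a.shape)  # each row of the file becomes a row of the array
```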
            
  2. sound (WAV) files

    1. librosa

            x, sr = librosa.load('sample.wav')
           
    2. wave

             wf = wave.open('sample.wav', 'rb')
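The wave module is in the standard library, so a complete round trip is easy to run; a minimal sketch, where the sample values below are invented:

```python
import wave
import struct

# Write a tiny mono 16-bit WAV file so the read example has input
with wave.open('sample.wav', 'wb') as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(8000)   # 8 kHz
    wf.writeframes(struct.pack('<4h', 0, 1000, -1000, 0))

# Read it back: readframes returns raw bytes, so unpack them to integers
with wave.open('sample.wav', 'rb') as wf:
    n = wf.getnframes()
    raw = wf.readframes(n)
    samples = struct.unpack('<%dh' % n, raw)

print(n, samples)
```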
           
  3. NetCDF

    1. netCDF4

                from netCDF4 import Dataset
                fh = Dataset('sample.nc', mode='r')
                time = fh.variables['time'][:]
                lon = fh.variables['lon'][:,:]
                lat = fh.variables['lat'][:,:]
                temp = fh.variables['temp'][:,:]
             
    2. xarray

             import xarray as xr
             ds = xr.open_dataset('sample.nc')
             df = ds.to_dataframe()
      
  4. OPeNDAP

    1. netCDF4 – see above (just like a local file, but pass the URL endpoint)

    2. pydap

             from pydap.client import open_url
             # set URL from PO.DAAC
             dataset = open_url("http://opendap-uat.jpl.nasa.gov/thredds/dodsC/ncml_aggregation/OceanTemperature/ghrsst/aggregate__ghrsst_DMI_OI-DMI-L4-GLOB-v1.0.ncml")
             lat = dataset.lat[:]
             lon = dataset.lon[:]
             time = dataset.time[:]
             sst = dataset.analysed_sst.array[0]
        
  5. matlab binary

    1. scipy loadmat

             from scipy.io import loadmat
             fin1 = loadmat('sample.mat', squeeze_me=True)
             mtime = fin1['mday']
             Tair = fin1['ta_h']
             Press = fin1['bpr']
                  
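The loadmat call needs an existing .mat file; here is a runnable round-trip sketch, assuming scipy is installed. The variable names mirror the example, but the values are invented:

```python
import numpy as np
from scipy.io import savemat, loadmat

# Write a small .mat file so loadmat has something to read
savemat('sample.mat', {'mday': np.arange(3.0),
                       'ta_h': np.array([25.1, 25.3, 24.9])})

# squeeze_me=True drops the singleton dimensions MATLAB adds to 1-D data
fin1 = loadmat('sample.mat', squeeze_me=True)
mtime = fin1['mday']
Tair = fin1['ta_h']

print(mtime.shape, Tair[0])
```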
          
  6. shapefile

    1. geopandas

             import geopandas as gpd
             shape_gpd = gpd.read_file('sample.shp')

    2. salem

             import salem
             shpf = salem.get_demo_file('sample.shp')
             gdf = salem.read_shapefile(shpf)
            
        [4]:
        
        # Open file,
        # filename is sample_clev.dat, open as read only 'r'
        f = open('../jupyter-gesteach/data/sample_clev.dat', 'r')
        # Read and ignore header lines
        header1 = f.readline()
        header2 = f.readline()
        # Loop over lines and extract variables of interest
        for line in f:
            line = line.strip()
            columns = line.split()
            year = columns[0]
            month = columns[1]
            day = columns[2]
            hour = columns[3]
            clev = columns[4]
            print(hour, clev)
        f.close()
        
        1 2063
        2 1997
        3 1846
        4 1689
        5 1517
        6 1441
        7 1412
        8 1472
        9 1590
        10 1718
        11 1883
        12 1988
        13 2048
        14 1965
        15 1830
        16 1639
        17 1461
        18 1348
        19 1290
        20 1343
        21 1480
        22 1662
        23 1890
        0 2059
        1 2172
        2 2198
        3 2060
        4 1843
        5 1631
        6 1476
        7 1315
        8 1271
        9 1382
        10 1504
        11 1728
        12 1976
        13 2070
        14 2044
        15 1997
        16 1822
        17 1559
        18 1327
        19 1232
        20 1194
        21 1202
        22 1471
        23 1747
        0 2000
        1 2222
        2 2295
        3 2207
        4 2028
        5 1775
        6 1526
        7 1271
        8 1135
        9 1130
        10 1261
        11 1527
        12 1817
        
        [5]:
        
        import numpy as np
        data = np.loadtxt('data/sample_clev.dat', delimiter=' ', comments='#')
        print(data)
        
        [[1.992e+03 1.200e+01 4.000e+00 1.000e+00 2.063e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.000e+00 1.997e+03]
         [1.992e+03 1.200e+01 4.000e+00 3.000e+00 1.846e+03]
         [1.992e+03 1.200e+01 4.000e+00 4.000e+00 1.689e+03]
         [1.992e+03 1.200e+01 4.000e+00 5.000e+00 1.517e+03]
         [1.992e+03 1.200e+01 4.000e+00 6.000e+00 1.441e+03]
         [1.992e+03 1.200e+01 4.000e+00 7.000e+00 1.412e+03]
         [1.992e+03 1.200e+01 4.000e+00 8.000e+00 1.472e+03]
         [1.992e+03 1.200e+01 4.000e+00 9.000e+00 1.590e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.000e+01 1.718e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.100e+01 1.883e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.200e+01 1.988e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.300e+01 2.048e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.400e+01 1.965e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.500e+01 1.830e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.600e+01 1.639e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.700e+01 1.461e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.800e+01 1.348e+03]
         [1.992e+03 1.200e+01 4.000e+00 1.900e+01 1.290e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.000e+01 1.343e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.100e+01 1.480e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.200e+01 1.662e+03]
         [1.992e+03 1.200e+01 4.000e+00 2.300e+01 1.890e+03]
         [1.992e+03 1.200e+01 5.000e+00 0.000e+00 2.059e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.000e+00 2.172e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.000e+00 2.198e+03]
         [1.992e+03 1.200e+01 5.000e+00 3.000e+00 2.060e+03]
         [1.992e+03 1.200e+01 5.000e+00 4.000e+00 1.843e+03]
         [1.992e+03 1.200e+01 5.000e+00 5.000e+00 1.631e+03]
         [1.992e+03 1.200e+01 5.000e+00 6.000e+00 1.476e+03]
         [1.992e+03 1.200e+01 5.000e+00 7.000e+00 1.315e+03]
         [1.992e+03 1.200e+01 5.000e+00 8.000e+00 1.271e+03]
         [1.992e+03 1.200e+01 5.000e+00 9.000e+00 1.382e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.000e+01 1.504e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.100e+01 1.728e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.200e+01 1.976e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.300e+01 2.070e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.400e+01 2.044e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.500e+01 1.997e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.600e+01 1.822e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.700e+01 1.559e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.800e+01 1.327e+03]
         [1.992e+03 1.200e+01 5.000e+00 1.900e+01 1.232e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.000e+01 1.194e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.100e+01 1.202e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.200e+01 1.471e+03]
         [1.992e+03 1.200e+01 5.000e+00 2.300e+01 1.747e+03]
         [1.992e+03 1.200e+01 6.000e+00 0.000e+00 2.000e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.000e+00 2.222e+03]
         [1.992e+03 1.200e+01 6.000e+00 2.000e+00 2.295e+03]
         [1.992e+03 1.200e+01 6.000e+00 3.000e+00 2.207e+03]
         [1.992e+03 1.200e+01 6.000e+00 4.000e+00 2.028e+03]
         [1.992e+03 1.200e+01 6.000e+00 5.000e+00 1.775e+03]
         [1.992e+03 1.200e+01 6.000e+00 6.000e+00 1.526e+03]
         [1.992e+03 1.200e+01 6.000e+00 7.000e+00 1.271e+03]
         [1.992e+03 1.200e+01 6.000e+00 8.000e+00 1.135e+03]
         [1.992e+03 1.200e+01 6.000e+00 9.000e+00 1.130e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.000e+01 1.261e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.100e+01 1.527e+03]
         [1.992e+03 1.200e+01 6.000e+00 1.200e+01 1.817e+03]]
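Since loadtxt returns an m by n array, individual variables come out by column index rather than by name; a quick sketch with a few rows in the same five-column layout (year, month, day, hour, sea level):

```python
import io
import numpy as np

# A few rows in the same layout as sample_clev.dat
text = """1992 12 4 1 2063
1992 12 4 2 1997
1992 12 4 3 1846
"""
data = np.loadtxt(io.StringIO(text))

# Slice out the hour and sea-level columns
hour = data[:, 3]
clev = data[:, 4]
print(hour, clev)
```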
        
        [6]:
        
        # now use Pandas
        import pandas as pd
        data = pd.read_table('data/sample_clev.dat', sep=' ')
        
        [7]:
        
        data
        
        [7]:
        
        # 002 \t113 \tTarawa Unnamed: 4 Bairiki \tKiribati \t1.33200 \t173.01300
        0 # hourly sea level from UHSLC NaN NaN NaN
        1 1992 12 4 1 2063 NaN NaN NaN NaN
        2 1992 12 4 2 1997 NaN NaN NaN NaN
        3 1992 12 4 3 1846 NaN NaN NaN NaN
        4 1992 12 4 4 1689 NaN NaN NaN NaN
        ... ... ... ... ... ... ... ... ... ...
        56 1992 12 6 8 1135 NaN NaN NaN NaN
        57 1992 12 6 9 1130 NaN NaN NaN NaN
        58 1992 12 6 10 1261 NaN NaN NaN NaN
        59 1992 12 6 11 1527 NaN NaN NaN NaN
        60 1992 12 6 12 1817 NaN NaN NaN NaN

        61 rows × 9 columns
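The garbled headers above come from pandas treating the station metadata line as column names. A common fix is to skip the header lines and pass explicit column names; a sketch using an in-memory stand-in for the file (the column names here are my own choice):

```python
import io
import pandas as pd

# Simulated tide-gauge file: metadata line, comment line, then whitespace-separated data
text = """002 113 Tarawa
# hourly sea level from UHSLC
1992 12 4 1 2063
1992 12 4 2 1997
"""

# Skip the two header lines and supply our own column names
data = pd.read_csv(io.StringIO(text), sep=r'\s+', skiprows=2,
                   header=None, names=['year', 'month', 'day', 'hour', 'clev'])
print(data)
```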
