{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Readng data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data input\n", "Getting data into an out of our programs will be a key component of our scripts. There are several ways to do this (yet another good/bad aspect of python). For class we will rely on a few, key python packages:\n", "
    \n", "
  1. numpy: contains many numerical functions; together with matplotlib and scipy, it provides many of the same features as MATLAB\n", "
  2. scipy: scientific analysis packages\n", "
  3. pandas: package specifically for loading data and creating data frames\n", "
  4. matplotlib: provides many of the MATLAB plotting functions\n", "
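\n",
"These packages are conventionally imported under short aliases; the snippets below assume this standard preamble:\n",
"\n",
"    import numpy as np\n",
"    import scipy\n",
"    import pandas as pd\n",
"    import matplotlib.pyplot as plt\n",
"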
\n", "Let's start with a simple set of examples. If we assume we have an ascii dataset, for example, the Honolulu tidegauge data from past classes, here are a few ways to read the data into a script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
    \n", "
  1. ASCII\n", "
      \n", "
    1. open and readline (this makes variables that are strings, not arrays)\n", "
      \n",
          "       # Open file\n",
          "       f = open('sample.dat', 'r')\n",
          "       # Read and ignore header lines\n",
          "       header1 = f.readline()\n",
          "       header2 = f.readline()\n",
          "       # Loop over lines and extract variables of interest\n",
          "       for line in f:\n",
          "         line = line.strip()\n",
          "         columns = line.split()\n",
          "         month = columns[0]\n",
          "         temp = float(columns[1])\n",
          "       print(month, temp)\n",
          "       f.close()\n",
          "      
      \n", "
    2. numpy loadtxt (this makes an m by n numpy array)\n", "
      \n",
          "       data = np.loadtxt('sample.dat', delimiter=',', comments='#')\n",
          "      
      \n", "
    3. numpy fromfile (if the number of columns is not consistent)\n", "
      \n",
          "       data = np.fromfile('sample2.dat', dtype=float, sep='\\t', count=-1)\n",
          "      
      \n", "
    4. numpy fromregex (this makes an m by n numpy array)\n", "
      \n",
          "       data = np.fromregex('sample.dat', r'(\\d+),\\s(\\d+)', np.float)\n",
          "      
      \n", "
    5. numpy genfromtxt (this makes an m by n numpy array)\n", "
      \n",
          "       data = np.genfromtxt('sample.dat',delimiter=',',skiprows=2)\n",
          "       # or, if the columns have different types:\n",
          "       #1   2.0000  buckle_my_shoe\n",
          "       #3   4.0000  margery_door\n",
          "       data = np.genfromtxt('filename', dtype= None)\n",
          "       # data = [(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')]\n",
          "      
      \n", "
    6. pandas read_table (this makes a pandas.core.frame.DataFrame; note that the first row becomes the column headers, so you may need to specify this)\n", "
      \n",
          "       data = pd.read_table('sample.dat', sep=',')\n",
          "      
      \n", "
    7. pandas read_csv (this makes a pandas.core.frame.DataFrame; note that the first row becomes the column headers, so you may need to specify this)\n", "
      \n",
          "       data = pd.read_csv('sample.dat', header=1)\n",
          "      
      \n", "
    \n", "
  2. sound (WAV) files\n", "
      \n", "
    1. librosa\n", "
      \n",
          "      x, sr = librosa.load('sample.wav')\n",
          "     
      \n", "
    2. wave\n", "
      \n",
          "       wf = wave.open(('sample.wav'), 'rb')\n",
          "     
      \n", "
    \n", "
  3. NetCDF\n", "
      \n", "
    1. netCDF4\n", "
      \n",
          "          from netCDF4 import Dataset\n",
          "          fh = Dataset('sample.nc', mode='r')\n",
          "          time = fh.variables['time'][:]\n",
          "          lon = fh.variables['lon'][:,:]\n",
          "          lat = fh.variables['lat'][:,:]\n",
          "          temp = fh.variables['temp'][:,:]\n",
          "       
      \n", "
    2. xarray\n", "
      \n",
          "          import xarray as xr\n",
          "          ds = xr.open_dataset('sample.nc')\n",
          "          df = ds.to_dataframe()\n",
          "       
      \n", "
    \n", "
  4. OPeNDAP\n", "
      \n", "
    1. netCDF4: works just like a local file, but pass the OPeNDAP URL instead of a filename (see the sketch below)\n", "
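      \n",
"          # a minimal sketch, assuming netCDF4 was built with OPeNDAP support\n",
"          # (the URL below is hypothetical):\n",
"          from netCDF4 import Dataset\n",
"          fh = Dataset('http://server.example.org/thredds/dodsC/sample.nc', mode='r')\n",
"      \n",
"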
    2. pydap\n", "
      \n",
          "            from pydap.client import open_url\n",
          "            import numpy as np\n",
          "            from numpy import *\n",
          "            # set ULR from PO.DAAC\n",
          "            dataset = open_url(\"http://opendap-uat.jpl.nasa.gov/thredds/dodsC/ncml_aggregation/OceanTemperature/ghrsst/aggregate__ghrsst_DMI_OI-DMI-L4-GLOB-v1.0.ncml\")\n",
          "            lat = dataset.lat[:]\n",
          "            lon = dataset.lon[:]\n",
          "            time = dataset.time[:]\n",
          "            sst = dataset.analysed_sst.array[0]\n",
          "         
      \n", "
    \n", "
  5. MATLAB binary (.mat files)\n", "
      \n", "
    1. scipy loadmat\n", "
      \n",
          "           from scipy.io import loadmat\n",
          "           fin1 = loadmat('sample.mat',squeeze_me=True)\n",
          "           mtime = fin1['mday']\n",
          "           Tair = fin1['ta_h']\n",
          "           Press = fin1['bpr']\n",
          "        
      \n", "
    \n", "
  6. shapefile\n", "
      \n", "
    1. geopandas\n", "
      \n",
          "          import geopandas as gpd\n",
          "          shape_gpd = gpd.read_file('sample.shp')\n",
          "       
      \n", "
    2. salem\n", "
      \n",
          "          shpf = salem.get_demo_file('sample.shp')\n",
          "          gdf = salem.read_shapefile(shpf)\n",
          "   
    \n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 2063\n", "2 1997\n", "3 1846\n", "4 1689\n", "5 1517\n", "6 1441\n", "7 1412\n", "8 1472\n", "9 1590\n", "10 1718\n", "11 1883\n", "12 1988\n", "13 2048\n", "14 1965\n", "15 1830\n", "16 1639\n", "17 1461\n", "18 1348\n", "19 1290\n", "20 1343\n", "21 1480\n", "22 1662\n", "23 1890\n", "0 2059\n", "1 2172\n", "2 2198\n", "3 2060\n", "4 1843\n", "5 1631\n", "6 1476\n", "7 1315\n", "8 1271\n", "9 1382\n", "10 1504\n", "11 1728\n", "12 1976\n", "13 2070\n", "14 2044\n", "15 1997\n", "16 1822\n", "17 1559\n", "18 1327\n", "19 1232\n", "20 1194\n", "21 1202\n", "22 1471\n", "23 1747\n", "0 2000\n", "1 2222\n", "2 2295\n", "3 2207\n", "4 2028\n", "5 1775\n", "6 1526\n", "7 1271\n", "8 1135\n", "9 1130\n", "10 1261\n", "11 1527\n", "12 1817\n" ] } ], "source": [ "# Open file,\n", "# filename is sample_clev.dat, open as read only 'r'\n", "f = open('../jupyter-gesteach/data/sample_clev.dat', 'r')\n", "# Read and ignore header lines\n", "header1 = f.readline()\n", "header2 = f.readline()\n", "# Loop over lines and extract variables of interest\n", "for line in f:\n", " line = line.strip()\n", " columns = line.split()\n", " year = columns[0]\n", " month = columns[1]\n", " day = columns[2]\n", " hour = columns[3]\n", " clev = columns[4]\n", " print(hour, clev)\n", "f.close()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1.992e+03 1.200e+01 4.000e+00 1.000e+00 2.063e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.000e+00 1.997e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 3.000e+00 1.846e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 4.000e+00 1.689e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 5.000e+00 1.517e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 6.000e+00 1.441e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 7.000e+00 1.412e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 8.000e+00 1.472e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 9.000e+00 1.590e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.000e+01 1.718e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.100e+01 1.883e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.200e+01 1.988e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.300e+01 2.048e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.400e+01 1.965e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.500e+01 1.830e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.600e+01 1.639e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.700e+01 1.461e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.800e+01 1.348e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.900e+01 1.290e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.000e+01 1.343e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.100e+01 1.480e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.200e+01 1.662e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.300e+01 1.890e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 0.000e+00 2.059e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.000e+00 2.172e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.000e+00 2.198e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 3.000e+00 2.060e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 4.000e+00 1.843e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 5.000e+00 1.631e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 6.000e+00 1.476e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 7.000e+00 1.315e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 8.000e+00 1.271e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 9.000e+00 1.382e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.000e+01 1.504e+03]\n", " [1.992e+03 1.200e+01 
5.000e+00 1.100e+01 1.728e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.200e+01 1.976e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.300e+01 2.070e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.400e+01 2.044e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.500e+01 1.997e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.600e+01 1.822e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.700e+01 1.559e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.800e+01 1.327e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.900e+01 1.232e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.000e+01 1.194e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.100e+01 1.202e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.200e+01 1.471e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.300e+01 1.747e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 0.000e+00 2.000e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.000e+00 2.222e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 2.000e+00 2.295e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 3.000e+00 2.207e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 4.000e+00 2.028e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 5.000e+00 1.775e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 6.000e+00 1.526e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 7.000e+00 1.271e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 8.000e+00 1.135e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 9.000e+00 1.130e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.000e+01 1.261e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.100e+01 1.527e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.200e+01 1.817e+03]]\n" ] } ], "source": [ "import numpy as np\n", "data = np.loadtxt('data/sample_clev.dat', delimiter=' ', comments='#')\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# now use Pandas\n", "import pandas as pd\n", "data = pd.read_table('data/sample_clev.dat',sep=' ')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#002\\t113\\tTarawaUnnamed: 4Bairiki\\tKiribati\\t1.33200\\t173.01300
0#hourlysealevelfromUHSLCNaNNaNNaN
1199212412063NaNNaNNaNNaN
2199212421997NaNNaNNaNNaN
3199212431846NaNNaNNaNNaN
4199212441689NaNNaNNaNNaN
..............................
56199212681135NaNNaNNaNNaN
57199212691130NaNNaNNaNNaN
581992126101261NaNNaNNaNNaN
591992126111527NaNNaNNaNNaN
601992126121817NaNNaNNaNNaN
\n", "

61 rows × 9 columns

\n", "
" ], "text/plain": [ " # 002 \\t113 \\tTarawa Unnamed: 4 Bairiki \\tKiribati \\t1.33200 \\\n", "0 # hourly sea level from UHSLC NaN NaN \n", "1 1992 12 4 1 2063 NaN NaN NaN \n", "2 1992 12 4 2 1997 NaN NaN NaN \n", "3 1992 12 4 3 1846 NaN NaN NaN \n", "4 1992 12 4 4 1689 NaN NaN NaN \n", ".. ... ... ... ... ... ... ... ... \n", "56 1992 12 6 8 1135 NaN NaN NaN \n", "57 1992 12 6 9 1130 NaN NaN NaN \n", "58 1992 12 6 10 1261 NaN NaN NaN \n", "59 1992 12 6 11 1527 NaN NaN NaN \n", "60 1992 12 6 12 1817 NaN NaN NaN \n", "\n", " \\t173.01300 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", ".. ... \n", "56 NaN \n", "57 NaN \n", "58 NaN \n", "59 NaN \n", "60 NaN \n", "\n", "[61 rows x 9 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }