{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Readng data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data input\n", "Getting data into an out of our programs will be a key component of our scripts. There are several ways to do this (yet another good/bad aspect of python). For class we will rely on a few, key python packages:\n", "
    \n", "
  1. numpy: contains many numerical functions; together with matplotlib and scipy, it provides many of the same features as MATLAB\n", "
  2. scipy: scientific analysis packages\n", "
  3. pandas: package specifically for loading data and creating data frames\n", "
  4. matplotlib: provides many of the MATLAB plotting functions\n", "
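\n",
"These packages are conventionally imported under short aliases; the snippets below assume this standard preamble:\n",
"\n",
"    import numpy as np\n",
"    import scipy\n",
"    import pandas as pd\n",
"    import matplotlib.pyplot as plt\n",
"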
\n", "Let's start with a simple set of examples. If we assume we have an ascii dataset, for example, the Honolulu tidegauge data from past classes, here are a few ways to read the data into a script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
    \n", "
  1. ASCII\n", "
      \n", "
    1. open and readline (this makes variables that are strings, not arrays)\n", "
      \n",
          "       # Open file\n",
          "       f = open('sample.dat', 'r')\n",
          "       # Read and ignore header lines\n",
          "       header1 = f.readline()\n",
          "       header2 = f.readline()\n",
          "       # Loop over lines and extract variables of interest\n",
          "       for line in f:\n",
          "         line = line.strip()\n",
          "         columns = line.split()\n",
          "         month = columns[0]\n",
          "         temp = float(columns[1])\n",
          "       print(month, temp)\n",
          "       f.close()\n",
          "      
      \n", "
    2. numpy loadtxt (this makes an m by n numpy array)\n", "
      \n",
          "       data = np.loadtxt('sample.dat', delimiter=',', comments='#')\n",
          "      
      \n", "
    3. numpy fromfile (if the number of columns is not consistent)\n", "
      \n",
          "       data = np.fromfile('sample2.dat', dtype=float, sep='\\t', count=-1)\n",
          "      
      \n", "
    4. numpy fromregex (this makes an m by n numpy array)\n", "
      \n",
          "       data = np.fromregex('sample.dat', r'(\\d+),\\s(\\d+)', np.float)\n",
          "      
      \n", "
    5. numpy genfromtxt (this makes an m by n numpy array)\n", "
      \n",
          "       data = np.genfromtxt('sample.dat',delimiter=',',skiprows=2)\n",
          "       # or, if the columns have different types:\n",
          "       #1   2.0000  buckle_my_shoe\n",
          "       #3   4.0000  margery_door\n",
          "       data = np.genfromtxt('filename', dtype= None)\n",
          "       # data = [(1, 2.0, 'buckle_my_shoe'), (3, 4.0, 'margery_door')]\n",
          "      
      \n", "
    6. pandas read_table (this makes a pandas.core.frame.DataFrame; note that the first row becomes the column headers, so you may need to specify this)\n", "
      \n",
          "       data = pd.read_table('sample.dat', sep=',')\n",
          "      
      \n", "
    7. pandas read_csv (this makes a pandas.core.frame.DataFrame; note that the first row becomes the column headers, so you may need to specify this)\n", "
      \n",
          "       data = pd.read_csv('sample.dat', header=1)\n",
          "      
      \n", "
    \n", "
  2. sound (WAV) files\n", "
      \n", "
    1. librosa\n", "
      \n",
          "      x, sr = librosa.load('sample.wav')\n",
          "     
      \n", "
    2. wave\n", "
      \n",
          "       wf = wave.open(('sample.wav'), 'rb')\n",
          "     
      \n", "
    \n", "
  3. NetCDF\n", "
      \n", "
    1. netCDF4\n", "
      \n",
          "          from netCDF4 import Dataset\n",
          "          fh = Dataset('sample.nc', mode='r')\n",
          "          time = fh.variables['time'][:]\n",
          "          lon = fh.variables['lon'][:,:]\n",
          "          lat = fh.variables['lat'][:,:]\n",
          "          temp = fh.variables['temp'][:,:]\n",
          "       
      \n", "
    2. xarray\n", "
      \n",
          "          import xarray as xr\n",
          "          ds = xr.open_dataset('sample.nc')\n",
          "          df = ds.to_dataframe()\n",
          "       
      \n", "
    \n", "
  4. OPeNDAP\n", "
      \n", "
    1. netCDF4: works just like a local file, but pass the OPeNDAP URL instead of a filename (see the sketch below)\n", "
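      \n",
"          # a minimal sketch, assuming netCDF4 was built with OPeNDAP support\n",
"          # (the URL below is hypothetical):\n",
"          from netCDF4 import Dataset\n",
"          fh = Dataset('http://server.example.org/thredds/dodsC/sample.nc', mode='r')\n",
"      \n",
"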
    2. pydap\n", "
      \n",
          "            from pydap.client import open_url\n",
          "            import numpy as np\n",
          "            from numpy import *\n",
          "            # set ULR from PO.DAAC\n",
          "            dataset = open_url(\"http://opendap-uat.jpl.nasa.gov/thredds/dodsC/ncml_aggregation/OceanTemperature/ghrsst/aggregate__ghrsst_DMI_OI-DMI-L4-GLOB-v1.0.ncml\")\n",
          "            lat = dataset.lat[:]\n",
          "            lon = dataset.lon[:]\n",
          "            time = dataset.time[:]\n",
          "            sst = dataset.analysed_sst.array[0]\n",
          "         
      \n", "
    \n", "
  5. MATLAB binary (.mat files)\n", "
      \n", "
    1. scipy loadmat\n", "
      \n",
          "           from scipy.io import loadmat\n",
          "           fin1 = loadmat('sample.mat',squeeze_me=True)\n",
          "           mtime = fin1['mday']\n",
          "           Tair = fin1['ta_h']\n",
          "           Press = fin1['bpr']\n",
          "        
      \n", "
    \n", "
  6. shapefile\n", "
      \n", "
    1. geopandas\n", "
      \n",
          "          import geopandas as gpd\n",
          "          shape_gpd = gpd.read_file('sample.shp')\n",
          "       
      \n", "
    2. salem\n", "
      \n",
          "          shpf = salem.get_demo_file('sample.shp')\n",
          "          gdf = salem.read_shapefile(shpf)\n",
          "   
    \n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 2063\n", "2 1997\n", "3 1846\n", "4 1689\n", "5 1517\n", "6 1441\n", "7 1412\n", "8 1472\n", "9 1590\n", "10 1718\n", "11 1883\n", "12 1988\n", "13 2048\n", "14 1965\n", "15 1830\n", "16 1639\n", "17 1461\n", "18 1348\n", "19 1290\n", "20 1343\n", "21 1480\n", "22 1662\n", "23 1890\n", "0 2059\n", "1 2172\n", "2 2198\n", "3 2060\n", "4 1843\n", "5 1631\n", "6 1476\n", "7 1315\n", "8 1271\n", "9 1382\n", "10 1504\n", "11 1728\n", "12 1976\n", "13 2070\n", "14 2044\n", "15 1997\n", "16 1822\n", "17 1559\n", "18 1327\n", "19 1232\n", "20 1194\n", "21 1202\n", "22 1471\n", "23 1747\n", "0 2000\n", "1 2222\n", "2 2295\n", "3 2207\n", "4 2028\n", "5 1775\n", "6 1526\n", "7 1271\n", "8 1135\n", "9 1130\n", "10 1261\n", "11 1527\n", "12 1817\n" ] } ], "source": [ "# Open file,\n", "# filename is sample_clev.dat, open as read only 'r'\n", "f = open('../jupyter-gesteach/data/sample_clev.dat', 'r')\n", "# Read and ignore header lines\n", "header1 = f.readline()\n", "header2 = f.readline()\n", "# Loop over lines and extract variables of interest\n", "for line in f:\n", " line = line.strip()\n", " columns = line.split()\n", " year = columns[0]\n", " month = columns[1]\n", " day = columns[2]\n", " hour = columns[3]\n", " clev = columns[4]\n", " print(hour, clev)\n", "f.close()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1.992e+03 1.200e+01 4.000e+00 1.000e+00 2.063e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.000e+00 1.997e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 3.000e+00 1.846e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 4.000e+00 1.689e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 5.000e+00 1.517e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 6.000e+00 1.441e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 7.000e+00 1.412e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 8.000e+00 1.472e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 9.000e+00 1.590e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.000e+01 1.718e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.100e+01 1.883e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.200e+01 1.988e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.300e+01 2.048e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.400e+01 1.965e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.500e+01 1.830e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.600e+01 1.639e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.700e+01 1.461e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.800e+01 1.348e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 1.900e+01 1.290e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.000e+01 1.343e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.100e+01 1.480e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.200e+01 1.662e+03]\n", " [1.992e+03 1.200e+01 4.000e+00 2.300e+01 1.890e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 0.000e+00 2.059e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.000e+00 2.172e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.000e+00 2.198e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 3.000e+00 2.060e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 4.000e+00 1.843e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 5.000e+00 1.631e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 6.000e+00 1.476e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 7.000e+00 1.315e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 8.000e+00 1.271e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 9.000e+00 1.382e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.000e+01 1.504e+03]\n", " [1.992e+03 1.200e+01 
5.000e+00 1.100e+01 1.728e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.200e+01 1.976e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.300e+01 2.070e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.400e+01 2.044e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.500e+01 1.997e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.600e+01 1.822e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.700e+01 1.559e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.800e+01 1.327e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 1.900e+01 1.232e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.000e+01 1.194e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.100e+01 1.202e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.200e+01 1.471e+03]\n", " [1.992e+03 1.200e+01 5.000e+00 2.300e+01 1.747e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 0.000e+00 2.000e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.000e+00 2.222e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 2.000e+00 2.295e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 3.000e+00 2.207e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 4.000e+00 2.028e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 5.000e+00 1.775e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 6.000e+00 1.526e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 7.000e+00 1.271e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 8.000e+00 1.135e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 9.000e+00 1.130e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.000e+01 1.261e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.100e+01 1.527e+03]\n", " [1.992e+03 1.200e+01 6.000e+00 1.200e+01 1.817e+03]]\n" ] } ], "source": [ "import numpy as np\n", "data = np.loadtxt('data/sample_clev.dat', delimiter=' ', comments='#')\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# now use Pandas\n", "import pandas as pd\n", "data = pd.read_table('data/sample_clev.dat',sep=' ')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#002\\t113\\tTarawaUnnamed: 4Bairiki\\tKiribati\\t1.33200\\t173.01300
0#hourlysealevelfromUHSLCNaNNaNNaN
1199212412063NaNNaNNaNNaN
2199212421997NaNNaNNaNNaN
3199212431846NaNNaNNaNNaN
4199212441689NaNNaNNaNNaN
..............................
56199212681135NaNNaNNaNNaN
57199212691130NaNNaNNaNNaN
581992126101261NaNNaNNaNNaN
591992126111527NaNNaNNaNNaN
601992126121817NaNNaNNaNNaN
\n", "

61 rows × 9 columns

\n", "
" ], "text/plain": [ " # 002 \\t113 \\tTarawa Unnamed: 4 Bairiki \\tKiribati \\t1.33200 \\\n", "0 # hourly sea level from UHSLC NaN NaN \n", "1 1992 12 4 1 2063 NaN NaN NaN \n", "2 1992 12 4 2 1997 NaN NaN NaN \n", "3 1992 12 4 3 1846 NaN NaN NaN \n", "4 1992 12 4 4 1689 NaN NaN NaN \n", ".. ... ... ... ... ... ... ... ... \n", "56 1992 12 6 8 1135 NaN NaN NaN \n", "57 1992 12 6 9 1130 NaN NaN NaN \n", "58 1992 12 6 10 1261 NaN NaN NaN \n", "59 1992 12 6 11 1527 NaN NaN NaN \n", "60 1992 12 6 12 1817 NaN NaN NaN \n", "\n", " \\t173.01300 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", ".. ... \n", "56 NaN \n", "57 NaN \n", "58 NaN \n", "59 NaN \n", "60 NaN \n", "\n", "[61 rows x 9 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }