Water Data Boot Camp: Fall 2019

Unit 1: Task 1
Comparing Monthly Stream Flows Below Falls Lake

Here we apply the Analytical Workflow introduced in the previous document. The general question posed to us is: How has the construction of Falls Lake altered monthly stream flow? Let's explore how we might tackle this question using our framework.


1. Clarify the question into an actionable task

The initial question - "How has the construction of Falls Lake altered stream flow?" - is fairly specific, but it still includes a few vague aspects we need to clarify before making it actionable. We know we'll be looking at a specific location (somewhere downstream of Falls Lake) and a specific time range (before and after construction of Falls Lake in 1980-1984). However, it's up to us to define the units used to measure stream flow (cfs? cms? mgd?) and what counts as "altered stream flow".

Ideally, you'd consult your team to clarify these issues, but often the answer is to "figure that out yourself and get back to me". Since the issues seem somewhat minor, we'll take the approach of making decisions as we go. Perhaps data availability will dictate what we measure streamflow in, and perhaps we'll provide some plots of streamflow that will allow others to interpret whether Falls Lake "altered" streamflow...


2. Obtain, explore, and clean the data

As mentioned above, we've determined that we need streamflow data for some site below Falls Lake dam, and those data are provided via the USGS' National Water Information System. The following steps outline the process of finding the appropriate gaging site and then the data we need to run our analysis.

Finding the streamflow data

  1. Open the NWIS mapper in a browser: https://maps.waterdata.usgs.gov/mapper/index.html

  2. Locate Falls Lake on the map. (Tip: Search for Falls Lake, NC)

  3. On the map, visually follow the stream flowing out of Falls Lake dam until you find a gage site. Click on it, and it should reveal Site #: 02087500, NEUSE RIVER NEAR CLAYTON, NC. This is the site we'll use for our analysis.


  4. Click on the Access Data link. This brings up the interface for selecting which data you want to grab.

  5. On this page, from the dropdown menu near the top, select Time-series: Daily Data and hit GO. This will open the form for selecting the current and historical stream flow data we want.

  6. In this form:

    - Check the box next to 00060 Discharge;

    - Select Tab-separated data and Display in browser as the output format;

    - Set the From date to 1930-10-01 and the End date to 2017-09-30 (note: the "water year" goes from Oct. to Sept.);

    - And finally hit Go. This will call up a page with all the data you just requested.

    → If needed, this LINK will take you directly to the data...
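The same request can also be composed against the USGS NWIS daily-values web service rather than clicked through the form. Here's a minimal sketch that builds the request URL; the parameter names (sites, startDT, endDT, parameterCd) reflect our reading of the service documentation, so verify them against the NWIS docs before relying on this:

```python
from urllib.parse import urlencode

# USGS NWIS daily-values service endpoint
BASE = "https://waterservices.usgs.gov/nwis/dv/"

def nwis_dv_url(site, start, end, param="00060"):
    """Build an RDB (tab-separated) daily-values request URL."""
    query = urlencode({
        "format": "rdb",       # tab-separated output, like the browser form
        "sites": site,         # gage site number
        "startDT": start,      # ISO dates; recall the water year runs Oct.-Sep.
        "endDT": end,
        "parameterCd": param,  # 00060 = discharge, cubic feet per second
    })
    return f"{BASE}?{query}"

url = nwis_dv_url("02087500", "1930-10-01", "2017-09-30")
print(url)
```

Fetching that URL (e.g. with urllib.request) should return the same tab-separated payload the browser displays.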

Pause and look at what's provided. The first several lines begin with a '#' and display the dataset's metadata, i.e. data about the data. The first line without the '#' holds the column headers, but the line after that looks different from the actual data values below it. (This line indicates each field's length and type: 5s indicates a string that is 5 characters long. The other type codes include d for date and n for numeric.)
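That file structure (comment lines, a header line, a format line, then data) can also be parsed with a short script. A sketch, using a small made-up payload in the same shape as the real file:

```python
def parse_rdb(text):
    """Split a USGS RDB (tab-separated) payload into header and data rows.

    Lines starting with '#' are metadata; the first non-comment line holds
    the column names; the line after it gives field lengths/types
    (e.g. '5s', '20d', '14n') and is skipped.
    """
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    header = lines[0].split("\t")
    rows = [ln.split("\t") for ln in lines[2:]]  # skip the format line
    return header, rows

# Hypothetical miniature payload, same layout as the real download:
sample = (
    "# metadata about the data\n"
    "agency_cd\tsite_no\tdatetime\tdischarge_cfs\n"
    "5s\t15s\t20d\t14n\n"
    "USGS\t02087500\t1930-10-01\t325\n"
)
header, rows = parse_rdb(sample)
print(header)   # column names
print(rows[0])  # first data row
```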

 

Getting the data into Excel

*Before embarking on our work with Excel, be sure you are clear what each of the following refers to with respect to Excel: workbook, worksheet, row, column, cell, menu options.

Now that we have found the data, we need to get it into Excel so that each value is in its own cell. We'll discuss two somewhat imperfect methods to do this, each requiring a bit of manual editing. We also present a third, more automated method. However, while this third method may seem easier, it is less transparent and a bit unstable with the current release of Excel.

Method 1

This works, but not perfectly. Notice the data (starting in row 33) are in columns now, but the column headers don’t match the data fields after 2004 (row 27,062), when minimum and maximum discharge began to be collected. Thus, we need to watch for these kinds of errors when using this method. Let's look at an alternative method and see whether it works better...
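One way to catch the kind of mid-file column change described above is to count the fields in each row and flag wherever the count shifts. A small sketch with hypothetical rows:

```python
def field_count_changes(rows):
    """Return (row_index, n_fields) at each point where the number of
    fields differs from the previous row -- a quick flag for a mid-file
    change in the columns being reported."""
    changes = []
    prev = None
    for i, row in enumerate(rows):
        n = len(row)
        if prev is not None and n != prev:
            changes.append((i, n))
        prev = n
    return changes

# Hypothetical rows: extra min/max columns appear partway through.
rows = [["USGS", "325"], ["USGS", "410"], ["USGS", "388", "min", "max"]]
print(field_count_changes(rows))  # [(2, 4)]
```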

 

Method 2

 

Method 3

(Note, this method is somewhat buggy and may take a bit longer to run...)

Once you have your data imported, rename the worksheet to Raw. We'll be adding more worksheets to your Excel workbook and this will help identify which contains the original, raw data.

Fig1_1_RawData

Cleaning the data

While we now have the data we need, we should take a moment and tidy it up to facilitate analysis. (Hadley Wickham has an excellent paper on tidy data here: http://vita.had.co.nz/papers/tidy-data.html). We'll begin by creating a tidier version of the raw data, but keeping the raw worksheet intact in case we make a mistake and need to revert to it. And with this copy, we'll remove the metadata and other extraneous items so that we have a clean, efficient dataset, ready for quick analysis.

Fig1_2_EDAsheet

Exploring the data: Plots

While we now have our data in Excel, we should explore and inspect the data before we dive into our analysis. This may expose irregularities in the data, stemming either from our import process or in the dataset itself. Plots and summary statistics are a quick way to reveal any data gaps or outliers. We'll begin by looking at scatterplots of the data.

Create a scatterplot of discharge over time:
Adjust the aesthetics of your plot

Fig1_3_StreamflowPlot

Task: Altering the units of your data

You remembered that scientists prefer the metric system, so you need to convert the data from cubic feet per second to cubic meters per second. You'll need to create a new column with the discharge values in these units and then create a new plot. Here are the steps:
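The conversion itself is a single multiplication: since 1 ft = 0.3048 m exactly, 1 cfs = 0.3048³ ≈ 0.0283 cms. A sketch of the same new-column computation in Python:

```python
CFS_TO_CMS = 0.3048 ** 3  # 1 ft = 0.3048 m exactly, so 1 cfs ~= 0.0283168 cms

def cfs_to_cms(q_cfs):
    """Convert discharge from cubic feet per second to cubic meters per second."""
    return q_cfs * CFS_TO_CMS

# Hypothetical discharge column, converted value by value:
flows_cfs = [325.0, 1000.0]
flows_cms = [cfs_to_cms(q) for q in flows_cfs]
print([round(q, 2) for q in flows_cms])  # [9.2, 28.32]
```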

Fig1_4_StreamflowPlot_cms


○ EXERCISE 1: Altering units and plotting

Your project manager looks at the chart but doesn’t like the metric system. Add a new column to convert discharge to millions of gallons per day. Make a new plot and display it side by side with the "cfs" or "cms" plot. (1 cfs ≈ 0.6463 MGD)
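The cfs-to-MGD factor can be derived rather than memorized: 1 ft³ = 7.48052 US gallons and a day has 86,400 seconds, so 1 cfs ≈ 646,317 gallons per day ≈ 0.6463 MGD. A quick check of that arithmetic:

```python
GAL_PER_FT3 = 7.48052   # US gallons in one cubic foot
SEC_PER_DAY = 86_400

# 1 cfs expressed in million gallons per day
CFS_TO_MGD = GAL_PER_FT3 * SEC_PER_DAY / 1_000_000

def cfs_to_mgd(q_cfs):
    """Convert discharge from cubic feet per second to million gallons per day."""
    return q_cfs * CFS_TO_MGD

print(round(CFS_TO_MGD, 4))  # ~0.6463
```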


 

Exploring the data: Grouping and summarizing data

Know your data. Whenever you get data, it’s good practice to inspect the raw values to make sure they look reasonable and to understand something about their quality and content. One of the first things to assess in the metadata is how confident you can be in the quality of the data being used in your analysis. Is the data fit for purpose?

How confident are we in the data?

Our data included data-value qualification codes, an indication of the confidence we can have in the values reported. Let's examine our data in terms of those codes. We deleted these codes from the table, so we'll have to add them back into the data in our EDA worksheet.

There's a lot going on in that VLOOKUP formula, so let's unpack it.

First, the VLOOKUP function matches a given value against the first column of a lookup table, specified as a range of cells, and returns a value from another column of the matching row. In our example, the value to match is in cell B2, i.e. the date 10/1/1930. VLOOKUP searches the range of cells specified in the second argument, Raw!$C$33:$I$31809, for that value and returns the value in the 7th column of that range.

What still may be confusing is how the range of cells is defined by Raw!$C$33:$I$31809. Here, Raw! tells Excel that the range of values is in the worksheet named "Raw", not the current worksheet. The range of cells in the "Raw" worksheet is C33 to I31809, but the $ indicates that, as we copy this formula to other cells, this range should remain static. Otherwise, if we copied the formula to the cell below it, the range would dynamically update from C33:I31809 to C34:I31810; in our case, we want the lookup range to be locked in.
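In programming terms, an exact-match VLOOKUP is just a keyed lookup: build a mapping from the table's first column to the column you want, then look values up by key. A sketch with hypothetical rows standing in for the Raw!$C$33:$I$31809 range:

```python
# Hypothetical stand-ins for the Raw worksheet rows: first column is the
# date, and the 7th column holds the qualification code we want back.
raw = [
    ("1930-10-01", "-", "-", "-", "-", "-", "A"),
    ("1930-10-02", "-", "-", "-", "-", "-", "A:e"),
]

lookup = {row[0]: row[6] for row in raw}  # first column -> 7th column

def vlookup_exact(value, table=lookup):
    """Exact-match lookup; returns None where VLOOKUP would return #N/A."""
    return table.get(value)

print(vlookup_exact("1930-10-01"))  # A
```

Unlike a copied Excel formula, the dictionary never shifts as you "fill down", which is exactly the behavior the $ anchors buy us in the spreadsheet.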

Now, let's add into our worksheet a table listing how many records are associated with each confidence value. A look at the metadata indicates three data-value qualification codes: A, P, and e. Let's first confirm what values are contained in our dataset. We can do this quickly by setting up a data filter:

You'll now see that the header cell has a dropdown arrow. Click this arrow and it will list all the unique values. You'll see that the values actually present in our data differ slightly from those listed in the metadata: they are A and A:e.

Now let's create the table listing how many records are associated with these values. The COUNTIF function is useful here.
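COUNTIF over the qualification-code column is simply a frequency count. The same tally in Python, using hypothetical codes standing in for the real column:

```python
from collections import Counter

# Hypothetical qualification-code column:
codes = ["A", "A", "A:e", "A", "A:e"]

# Counter(codes)[value] plays the role of COUNTIF(range, value)
counts = Counter(codes)
print(counts["A"], counts["A:e"])  # 3 2
```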

With this table you can interpret the results. What proportion of the data is reliable? Would a plot be helpful?

Fig1_6_ConfidenceTable

 

Examine Summary Statistics of our Data

Summary statistics provide another quick way to examine our data for peculiarities. Here, we'll outline the process for computing the min, max, mean, and percentiles of our data in Excel. We'll simplify this procedure by creating a name for the range of Mean flow (cms) data cells.

Now that we know how to compute summary statistics, let's examine whether building Falls Lake had a noticeable impact on these values. We can do this by computing summary statistics for subsets of the discharge data, with one subset containing the discharge before 1980, when construction of the lake started, and another containing the discharge after 1984, when the dam was completed.
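The same before/after comparison can be sketched in a few lines of Python using the standard library's statistics module; the (year, discharge) records below are hypothetical placeholders for the real series:

```python
from statistics import mean, quantiles

def summarize(flows):
    """Min, max, mean, and quartiles of a flow series."""
    q1, q2, q3 = quantiles(flows, n=4)  # 25th, 50th, 75th percentiles
    return {"min": min(flows), "max": max(flows), "mean": mean(flows),
            "q25": q1, "median": q2, "q75": q3}

# Hypothetical (year, discharge) records standing in for the real data:
records = [(1975, 900.0), (1976, 1200.0), (1978, 1500.0),
           (1986, 700.0), (1990, 650.0), (1995, 800.0)]

pre  = summarize([q for yr, q in records if yr < 1980])  # before construction
post = summarize([q for yr, q in records if yr > 1984])  # after completion

print(pre["max"], post["max"])
```

Comparing the two dictionaries side by side mirrors comparing the two summary-statistics columns in the worksheet.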

What do you notice, particularly with regard to minimum and maximum streamflow? Does that make sense given what you know about reservoirs?

Is there seasonal variation in streamflow?

To examine seasonal (or monthly) variation in streamflow, we need to extract the months from our date values. We'll also convert calendar years into water years, which run from October through September.
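The water-year rule is easy to express in code: dates in October through December belong to the next calendar year's water year. A minimal sketch:

```python
from datetime import date

def water_year(d):
    """Water year runs Oct.-Sep.: Oct-Dec dates belong to the NEXT year's water year."""
    return d.year + 1 if d.month >= 10 else d.year

print(water_year(date(1930, 10, 1)))  # 1931 (first day of water year 1931)
print(water_year(date(1931, 9, 30)))  # 1931 (last day of water year 1931)
```

The month itself comes straight from d.month, the Python analog of Excel's MONTH() function.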

What do you observe? Would a plot of the results facilitate interpretation?

Monthly Plots


○ Exercise 2: Compute total streamflow by water year

Repeat the above, using SUMIF to compute total annual streamflow by water year.
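For reference, SUMIF(year_range, year, flow_range) is a group-and-sum operation. The Python analog, with hypothetical (water_year, daily_flow) records:

```python
from collections import defaultdict

# Hypothetical (water_year, daily_flow) records:
records = [(1931, 100.0), (1931, 150.0), (1932, 200.0), (1932, 50.0)]

# Accumulate each day's flow into its water year, as SUMIF would per year.
totals = defaultdict(float)
for wy, flow in records:
    totals[wy] += flow

print(dict(totals))  # {1931: 250.0, 1932: 250.0}
```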

AnnualPlot


END