Water Data Boot Camp: Fall 2019
Here we apply the Analytical Workflow introduced in the previous document. The general question posed to us is: How has the construction of Falls Lake altered monthly stream flow? Let's explore how we might tackle this question using our framework.
The initial question - "How has the construction of Falls Lake altered stream flow?" - is fairly specific, but it still includes a few vague aspects we need to clarify before making it actionable. We know we'll be looking at a specific location (somewhere downstream of Falls Lake) and a specific time range (before and after construction of Falls lake in 1980-1984). However, it's up to us to define the metric used to measure stream flow (cfs? cms? mgd?) and what defines "altered stream flow".
Ideally, you'd consult your team to clarify these issues, but often the answer is to "figure that out yourself and get back to me". Since the issues seem somewhat minor, we'll take the approach of making decisions as we go. Perhaps data availability will dictate what we measure streamflow in, and perhaps we'll provide some plots of streamflow that will allow others to interpret whether Falls Lake "altered" streamflow...
As mentioned above, we've determined that we need streamflow data for some site below Falls Lake dam, and those data are provided via the USGS' National Water Information System. The following steps outline the process of finding the appropriate gaging site and then the data we need to run our analysis.
Open the NWIS mapper in a browser: https://maps.waterdata.usgs.gov/mapper/index.html
Locate Falls Lake on the map. (Tip: Search for Falls Lake, NC
)
On the map, visually follow the stream flowing out of Falls Lake dam until you find a gage site. Click on it, and it should reveal Site #: 02087500
, NEUSE RIVER NEAR CLAYTON, NC
. This is the site we'll use for our analysis.
Click on the Access Data link. This brings up the interface for selecting which data you want to grab.
On this page, from the dropdown menu near the top, select Time-series: Daily Data
and hit GO
. This will open the form for selecting the current and historical stream flow data we want.
In this form:
- Check the box next to 00060 Discharge
;
- Select Tab-separated data
and Display in browser
as the output format;
- Set the From date to 1930-10-01
and the End date to 2017-09-30
(note: the "water year" goes from Oct. to Sept.);
- And finally hit Go
. This will call up a page with all the data you just requested.
→ If needed, this LINK will take you directly to the data...
Pause and look at what's provided. The first several lines begin with a '#' and display the dataset's metadata, i.e. data about the data. The first line without the '#' appears to be column headers, but the line after that looks different than the actual data values below that. (As it happens, this line indicates the data length and type: 5s
indicates a s
tring that is 5
characters long. The other data types include d
for date, and n
for numeric.)
*Before embarking on our work with Excel, be sure you are clear what each of the following refers to with respect to Excel:
workbook
,worksheet
,row
,column
,cell
,menu options
.
Now that we have found the data, we need to get it into Excel so that each value is in its own cell. We'll discuss two, somewhat imperfect, methods to do this, each requiring a bit of manual editing. We also present a third, more automated method. However, while this third method may seem easier, it is less transparent and a bit unstable with the current release of Excel.
Select the entire contents of the web page containing the discharge data. (Tip: use ctrl
-a
)
Copy (ctrl
-c
) the contents from the browser and paste (ctrl
-v
) them into a new Excel worksheet.
Notice that the contents are lumped into a single column, which prevents us from properly working with the data. To fix this, you can use the Text to Columns
command .
To convert text to data, first select the cells containing text you want to convert into columns. For us, its the entire first column, which you can select by click the header of Column A
.
From the Data
menu, click the Text to Columns
command in the Data Tools
panel.
In the wizard, specify that your data are delimited
by a space
, and then click Finish
.
This works, but not perfectly. Notice the data (starting in row 33) are in columns now, but the column headers don’t match the data fields until 2004 (row 27,062) when minimum and maximum discharge were collected. Thus, we need to be careful for these types of errors using this method. Let's look at an alternative method and see whether it works better...
Paste Special...
from the context menu.text
and hit OK
. Notice that the data are in the correct columns!Raw
" and save your workbook.
(Note, this method is somewhat buggy and may take a bit longer to run...)
Data
menu, select Get External Data
and then From Web
. New Web Query
window, copy and paste the NWIS data web page's URL into the Address:
box and click Go
.import
button. Once you have your data imported, rename the worksheet to Raw
. We'll be adding more worksheets to your Excel workbook and this will help identify which contains the original, raw data.
While we now have the data we need, we should take a moment and tidy it up to facilitate analysis. (Hadley Wickham has an excellent paper on tidy data here: http://vita.had.co.nz/papers/tidy-data.html). We'll begin by creating a tidier version of the raw data, but keeping the raw worksheet intact in case we make a mistake and need to revert to it. And with this copy, we'll remove the metadata and other extraneous items so that we have a clean, efficient dataset, ready for quick analysis.
Create a copy of your Raw
worksheet and rename it EDA
(for exploratory data analysis):
Right click the Raw
worksheet tab, and select Move or Copy...
Be sure, create a copy
is checked and click OK
.
Right click on the Raw (2)
worksheet tab and select Rename
Type in EDA
as the new name
---All further work will be done in this EDA worksheet ---
Delete the metadata rows and the data type specification row (rows 1-31 and 33):
shift-click
to make continuous selections and ctrl
-click
to make disjunct selections.Delete
. As we are only interested in mean discharge, we can remove the other columns:
Delete all columns but site_no
, datetime
, and mean discharge, currently labeled 85235_00060_00003
.
Rename the 85235_00060_00003
as Mean flow (cfs)
* Note: It's always good to include the units in a field name!
While we now have our data in Excel, we should explore and inspect the data before we dive into our analysis. This may expose irregularities in the data, stemming either from our import process or in the dataset itself. Plots and summary statistics are a quick way to reveal any data gaps or outliers. We'll begin by looking at scatterplots of the data.
Choose the data you want to plot:
ctrl
+shift
+↓
to select all data cells beneath your initial selection.Create a scatter plot with straight lines from your selected data:
Notice that the X-axis contains an odd sampling of dates and lots of dead space on either side. Different plot types behave differently in Excel, and scatterplots are finicky with dates, so we need to either manually adjust the x axis or chose a different plot type. We'll examine both ways...
Format the X-axis to eliminate empty data space:
1930-10-01
and 2017-09-30
. Or, you can convert your scatterplot to a 2-d line plot.
Design
tab active, select Change Chart Type
to switch your scatterplot to a line plot. Notice you can preview the design of a chart before committing to it.Double-click the chart title and change it to something meaningful, such as “Neuse Streamflow near Clayton, NC”
Add a y-axis label: Design -> Add Chart Element -> Axis Titles -> Primary Vertical
Change the y-axis bounds:
Play with the colors, font sizes, borders, etc.
You remembered that scientists like the metric system and you need to convert the data from cubic feet per second to cubic meters per second. You'll need to create a new column with the discharge values in these units and re-create a new plot. Here are the steps:
Add a header in column D
: "Mean flow (cms)"
Compute cms values from your cfs ones:
D2
, enter the formula =C2*0.028316847
(0.028316847 is the conversion rate from cfs to cms). D2
cell.Check that the new values appear correct:
First, lock the header row so it doesn't disappear when scrolling
A1
(tip: try clickingctrl
-home
)1
.Freeze Top Row
Re-plot the data in cubic meters per second.
Select your existing plot and click on the data line. That will indicate the columns of data on which the plot was based.
Click on the side of the blue rectangle and drag it so that it covers the column D, not C.
Change the y-axis label and bounds.
Reformat other aesthetics as needed...
Your project manager looks at the chart but doesn’t like the metric system. Add a new column to convert discharge to millions of gallons per day. Make a new plot and show side by side to the "cfs" or "cms" plot. (1
CFS = 0.53817
MGD)
Know your data. Whenever you get data, it’s good practice to look at and understand the raw data to make sure it looks reasonable and you understand something about its quality and content. One of the first things to assess in the metadata is how confident you are in the quality of the data being used in your analysis. Is the data fit for purpose?
Our data included data-value qualification codes, and indication of the confidence we can have in the values reported. Let's examine our data in terms of those codes. We deleted these data from the table, so we'll have to add them back in to the data in our EDA worksheet.
Add a new column header to the table in the EDA worksheet; name it Confidence
.
We could copy and paste the data back into our table, as we haven't moved things around. However, it's useful to know how the vertical lookup, or VLOOKUP
function works to join data from on table to another using a common joining field.
Under the Confidence
header you just created, enter the formula: =VLOOKUP(B2,Raw!$C$33:$I$31809,7,FALSE)
There's a lot going on in that VLOOKUP formula, so lets explain it.
First, the VLOOKUP function looks up a value by matching a given value in one table to a given value in another table, specified by a range of cells. In our example, this matching value is in cell B2
, i.e. the date 10/1/1930
. VLOOKUP searches the range of cells specified in the second argument, Raw!$C$33:$I$31809
for that value, and returns the value in the 7
th column of that range.
What still may be confusing is how the range of cells is defined by Raw!$C$33:$I$31809
. Here, Raw!
tells Excel that the range of values is in the worksheet named "Raw", not the current worksheet. The range of cells in the "Raw" worksheet is C33
to 31809
, but the $
indicates that, as we copy this formula to other cells, this range should remain static. Otherwise, if we copied the formula to the cell below it the range would dynamical update from C33:I31809
to C34:I31810
, but in our case, we want the lookup range to be locked in.
Now, let's add into our worksheet a table listing how many records are associated with each confidence value. A look at the metadata indicate three data-value qualification codes: A, P, and e. Let's first confirm what values are contained in our dataset. We can do this quickly by setting up a data filter:
Create a Filter for the Confidence
column to reveal a list of unique values
You'll now see that the header cell has a dropdown arrow. Click this arrow and it will list all the unique values, and you'll see that our data indeed has three values, but they are slightly different than what was listed in the metadata. They are: A and A:e.
Now to create the table listing how many records are associated with these three values. The countif
tool is useful here.
Create a table listing the number of records associated with each Confidence code using COUNTIF
Confidence code
and Count
.Confidence Code
header cell, enter three label cells, one for each confidence code: A
and A:e
.=countif(
shift
+ ctrl
+↓
. (It should result in E2:E31778
)A
should be 31601
.$
to your range so that it remains locked in: Edit the range in your formula from E2:E31778
to E$2:E$31778
.With this table you can interpret the results. What proportion of the data is reliable? Would a plot be helpful?
Summary statistics provide another quick way to examine our data for peculiarities. Here, we'll outline the process for computing min, max, mean, and percentiles from our data in Excel. We'll simplify this procedure by creating a name for the range of Mean flow (cms)
data cell.
Assign CMS
as the name of the range of Mean flow (cms)
cells:
Mean flow (cms)
headerctrl
-F3
to open the Name Manager and click New
to add a new named range of cells.OK
.CMS
in any formula and it will refer to that range of cells. We'll see below how this works. Create a skeleton table with labels: Min
, P10
, P25
, P75
, P90
, Median
, Average
, and Max
Insert the appropriate formula for each, using the range of Mean flow (cms)
cells for each.
min: =MIN(CMS)
P10: =PERCENTILE.INC(CMS,0.1)
INC
means inclusive: where the percentile value must be between 0 and 1EXC
requires the percentile to be between 1/n and 1-1/n, where n is the number of elements median: =MEDIAN(CMS)
average: =AVERAGE(CMS)
max: =MAX(CMS)
Now that we know how to compute summary statistics, let's examine whether building Falls Lake had a noticeable impact on these values. We can do this by computing summary statistics for subsets of the discharge data, with one subset containing the discharge before 1980, when construction of the lake started, and after 1984, after the dam was completed.
Create named ranges for discharge data before 1980 and after 1984
shift
+ctrl
+↑
. With these records selected, assign a name as we did above. shift
+ctrl
+↓
), and name that range. Compute summary stats for these new ranges as you did above. (Tip: you can use ctrl
+H
to replace the named range CMS
with the new names you created for easy update.)
What do you notice? Particularly with regards to min and max streamflow. Does that make sense given what you know about reservoirs?
To examine seasonal (or monthly) variation in streamflow, we need to extract the months from our date values. We'll also convert calendar years into water years, which run from October thru September.
Parse out the Date to include Year and Month
Insert
three times.year
, month
, and water year
=Year()
function to extract the year from the corresponding cell in the datetime
column. The result will look weird, but we'll fix that in a moment.=Month()
. Number
from the dropdown list in the Number panel in the Home menu. Water year runs from October to September. We adjust the year column to account for this information using IF
IF
function to assign the value to the cell in the year
column if the month value is < 10
, otherwise, set the value of the year column plus 1: =IF(D2<10,C2,C2+1)
.Create a table skeleton of average streamflow by month (1-12) over the different time periods (period of record, pre1980, and post1984)
Use the AVERAGEIF()
function to compute the average discharge of records where the month equals the specified month. Here, the range will be the range of cells in the month column, the criteria will be the month (e.g. '1'), and the average range will be the range of discharge values.
What do you observe? Would a plot of the results facilitate interpretation?
Repeat the above using SUMIF
for annual streamflow based on the water year.
END