L. Patterson & John Fay
Spring 2018
In this notebook, we examine the two datasets generated from our query of the Water Quality Data portal for nutrient data in HUC 03030002 from 1970 through 2017. The first dataset (station.csv) contains data on the locations where samples were collected, and the second dataset (result.csv) contains data on the nutrient samples collected at those sites.
Our goals here are to upload and prepare (tidy, explore, and merge) these data, and ultimately to provide some visualizations that reveal the state of water quality in Jordan Lake. We'll begin with the sites dataset, examining a few attributes of these data, and then move to the results dataset, where we filter for 'clean' records and tidy the data for analysis.
#Import modules
import pandas as pd
import folium
from matplotlib import pyplot as plt
#Enable notebook plotting
%matplotlib inline
We'll first load the sites data into a dataframe, doing a few checks to ensure the data imported OK and to get a feel for what we're looking at. The data are located in the data folder as the file station.csv.
#Read in the data into the 'sites' dataframe
sites = pd.read_csv('../data/station.csv',
dtype={'HUCEightDigitCode':'str'})
#Display the dimensions of the data frame
sites.shape
#Display the columns and their data types
sites.dtypes
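It also helps to peek at the first few rows to get a feel for the attributes (a quick, standard check):
#Preview the first few records
sites.head()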
Plot a map to see where the water quality data from this portal were collected.
First we'll plot all the points, then we'll remove sites with small drainage areas and see how many remain...
#Plot all sites on a map
#Find center coordinates from medians of lat and long columns
medLat = sites['LatitudeMeasure'].median()
medLng = sites['LongitudeMeasure'].median()
#Create the initial map
m = folium.Map(location=[medLat,medLng],
zoom_start=9,
tiles='stamenterrain')
#Loop through all features and add them to the map as markers
for row in sites.itertuples():
    #Get info for the record
    lat = row.LatitudeMeasure
    lng = row.LongitudeMeasure
    name = row.MonitoringLocationName
    #Create the marker object, adding them to the map object
    folium.CircleMarker(location=[lat,lng],
                        popup=name,
                        color='red',
                        fill=True,
                        fill_opacity=0.6,
                        radius=3,
                        stroke=False).add_to(m)
#Show map
m
#Plot sites with an area > 25 sq mi
sites2 = sites[sites['DrainageAreaMeasure/MeasureValue'] > 25]
sites2.shape
Only 10 records! This seems small. Let's plot these in blue on top of the map created above to see where these 10 sites occur.
#Loop through all features and add them to the map as markers
for row in sites2.itertuples():
    #Get info for the record
    lat = row.LatitudeMeasure
    lng = row.LongitudeMeasure
    name = row.MonitoringLocationName
    #Create the marker object, adding them to the map object
    folium.CircleMarker(location=[lat,lng],
                        popup=name,
                        color='blue',
                        fill=True,
                        fill_opacity=0.8,
                        radius=3,
                        stroke=False).add_to(m)
#Show map
m
We take a closer look at the drainage measurement values and see there are numerous NA values. Clearly, filtering by drainage area will not work. The main point I want to make here is **don't be afraid to play with the data.**
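As a quick check of that claim, we can tally the missing drainage-area values directly (a minimal sketch using the same column we filtered on above):
#Count missing vs. non-missing drainage area values
sites['DrainageAreaMeasure/MeasureValue'].isna().value_counts()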
Let's replot the site map, but now change the popup to show the site identifier so we can see which sites are located within each branch of Jordan Lake.
#Re-create the base map
m = folium.Map(location=[medLat,medLng],
zoom_start=9,
tiles='stamenterrain')
#Loop through all features and add them to the map as markers
for row in sites.itertuples():
    #Get info for the record
    lat = row.LatitudeMeasure
    lng = row.LongitudeMeasure
    name = row.MonitoringLocationIdentifier #<-- Change the field to show as the popup
    #Create the marker object, adding them to the map object
    folium.CircleMarker(location=[lat,lng],
                        popup=name,
                        color='red',
                        fill=True,
                        fill_opacity=0.6,
                        radius=3,
                        stroke=False).add_to(m)
#Show map
m
Finally, let's cull some unused columns from our dataset
keepColumns = [0,2,3,7,8,11,12]
sites = sites.iloc[:,keepColumns]
sites.columns
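Selecting columns by position works, but it breaks silently if the portal ever reorders its fields. A name-based selection is a more robust alternative; here is a minimal sketch (the names listed are assumptions based on the fields used elsewhere in this notebook):
#Keep columns by name rather than position (names are illustrative; adjust as needed)
keepNames = ['MonitoringLocationIdentifier','MonitoringLocationName',
             'LatitudeMeasure','LongitudeMeasure']
sites[[c for c in keepNames if c in sites.columns]].head()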
The second dataset downloaded from the Water Quality Data portal is the water quality results, stored in the result.csv file. Let's now import this [large] file and tidy it up for analysis.
#Read in the results.csv
results = pd.read_csv('../data/result.csv',
                      low_memory=False #Read the file in one pass; avoids mixed-dtype guessing on this large file
                      )
#Check the dimensions of the dataframe: it's BIG with close to 100k records!
results.shape
Once we load in the measurements data, we need to do some cleaning. The first step is to filter the data down to what we want to use. For example:
We want to restrict our analysis to streams, filtering out all other media.
We want to make sure we are using routine samples and not extreme-event sampling (which is biased toward specific occasions and not suited for estimating annual average load). So we'd like to filter out events like storms, droughts, floods, and spring breakups from our analysis.
We need to determine what type of nitrogen we want to use. From the literature we found that regulations for nitrogen include nitrate, nitrite, ammonia, and organic forms. Doing some reading about the WQX standards, you learn that "Nitrogen, mixed forms" incorporates all of the above forms of nitrogen.
We also want to make sure we are looking at total nitrogen, so we want to keep only records whose ResultSampleFractionText is Total.
It's handy to know what each field in our data refers to and what its values are. An easy way to inspect the latter is with the Pandas unique() function, which lists the unique values found in a given column. Or, better yet, the value_counts() function, which lists each unique value along with the number of records having that value. Or, better still, we could plot those value_counts() values...
#List the unique values in the HydrologicEvent column
results['HydrologicEvent'].unique()
#List the unique values in the HydrologicEvent column AND COUNT THE NUMBER OF RECORDS IN EACH
results['HydrologicEvent'].value_counts()
#List the unique values in the CharacteristicName column and count the records in each
results['CharacteristicName'].value_counts()
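As suggested above, those counts can be easier to scan as a quick bar chart; a minimal sketch (the top-10 cutoff is just for readability):
#Plot the ten most common CharacteristicName values as a horizontal bar chart
results['CharacteristicName'].value_counts().head(10).plot(kind='barh',figsize=(8,4))
plt.xlabel('Number of records')
plt.show()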
Filtering in Pandas can be done by creating boolean (i.e. True/False) "masks" and then combining them with logical operators.
#Step 1 in filtering: creating a mask for each criteria
mediaMask = results['ActivityMediaSubdivisionName'] == "Surface Water"
hydroMask = ~results['HydrologicEvent'].isin(("Storm","Flood","Spring breakup","Drought"))
charMask = results['CharacteristicName'] == "Nitrogen, mixed forms (NH3), (NH4), organic, (NO2) and (NO3)"
sampFracMask = results['ResultSampleFractionText'] == 'Total'
#Step 2 in filtering: Applying the masks using logical combinations
nitrogen = results[mediaMask & hydroMask & charMask & sampFracMask]
nitrogen.shape
#Check that the filters worked
nitrogen['ActivityMediaSubdivisionName'].unique()
nitrogen['HydrologicEvent'].unique()
nitrogen['CharacteristicName'].unique()
nitrogen['ResultSampleFractionText'].unique()
#Subset columns
usecols=[0,6,21,30,31,32,33,34,58,59,60]
nitrogen = nitrogen.iloc[:,usecols]
nitrogen.columns
You may have noticed that many samples are reported as "not detected". This is important information that is not currently being represented. We'll create a new column and set its value equal to the result, unless the result is below the detection limit, in which case we set it equal to 1/2 of the detection limit.
You may also have noted that total nitrogen is sometimes reported as mg/l and sometimes as mg/l NO3. We want mg/l as N. The atomic weight of nitrogen is 14.0067 g/mole and the molar mass of the nitrate anion (NO3) is 62.0049 g/mole. Therefore, to convert between units: $mg/l \; as \; N = mg/l \; as \; NO_3 \times \frac{14.0067}{62.0049} \approx 0.2259 \times mg/l \; as \; NO_3$
♦ Set data below the detection limit equal to 1/2 the detection limit
#Step 1: Set default values as equal to the ResultMeasureValue
nitrogen['TotalN'] = nitrogen['ResultMeasureValue']
#Step 2: Create a mask of values that are "Not Detected" and update those values
ndMask = nitrogen['ResultDetectionConditionText'] == 'Not Detected'
#Step 3: Apply the mask to select only the 'Not Detected' rows
# and calculate the TotalN values to be 1/2 the detection value
nitrogen.loc[ndMask,"TotalN"] = nitrogen['DetectionQuantitationLimitMeasure/MeasureValue'] / 2
#Finally: have a look at the mean Total N value
nitrogen['TotalN'].mean()
♦ Convert mg/l as NO3 to mg/l as N
#Create a mask of rows measuring NO3
no3Mask = nitrogen['DetectionQuantitationLimitMeasure/MeasureUnitCode'] == 'mg/l NO3'
#Convert the TotalN values in those rows from NO3 to N
nitrogen.loc[no3Mask,'TotalN'] = nitrogen['TotalN'] * 0.2259
Now that we have explored and tidied our sites and results datasets pulled from the Water Quality Data portal, we'll merge them together and generate our visualizations. Recalling that our objective is to reveal trends in water quality, particularly in response to the water quality rules issued in 2009, we'll want to constrain (i.e. filter) our analyses to sites with data collected both before and after 2009. For these remaining sites we'll construct plots of total nitrogen: line plots of N over time, box plots by year, and box plots by month. We'll also plot our concentrations on a map.
Additionally, we'll convert our N concentrations to annual loads (lbs/year), which will require pulling in discharge data (as we did in previous sessions). We'll do this for a few sites, comparing the findings to allowable levels to answer: are these sites in compliance with their Total Maximum Daily Loads (TMDLs)?
Currently, the site and measurement data are not connected. However, we may want to show the nitrogen values on a map. To do this, we merge the data together based on a unique identifier shared between the two datasets; in this case, it is the MonitoringLocationIdentifier column.
#Merge the two datasets, joining the sites to the results (as sites may have >1 result)
nitrodata = pd.merge(left=sites,
right=nitrogen,
how='right',
on='MonitoringLocationIdentifier')
nitrodata.shape
Convert the ActivityStartDate from a string to an actual datetime object. Then we can extract year and month to new columns.
#Convert the ActivityStartDate to a datetime object (currently a string)
nitrodata['ActivityStartDate'] = pd.to_datetime(nitrodata['ActivityStartDate'],format="%Y-%m-%d")
#Create year values from the ActivityStartDate column
nitrodata['Year'] = pd.DatetimeIndex(nitrodata['ActivityStartDate']).year
nitrodata['Month'] = pd.DatetimeIndex(nitrodata['ActivityStartDate']).month
#List the counts of records by month (easily changed to show counts by year)
nitrodata['Month'].value_counts().sort_index()
With the data merged, we now want to know how many sites are collecting data of interest. We find there are 22 sites (using the nunique() function below). Of those sites, we want to know which were still collecting data after the Jordan Lake Rules were passed, and how much data were collected at each site.
Based on this, we see several sites stopped collecting data prior to 2009, when the first iteration of the Jordan Lake Rules was passed. We also see that some of these sites collected only a few years of data. We'll remove those sites and plot the remaining sites on the map.
#Tally the number of unique sites
nitrodata['MonitoringLocationIdentifier'].nunique()
#Group data by site
siteGroup = nitrodata.groupby('MonitoringLocationIdentifier')['Year']
#Compute min and max years from our group
siteInfo = siteGroup.agg(['min','max'])
#Rename columns from min and max to StartYear and EndYear
siteInfo.columns = ['StartYear','EndYear']
#Compute Range
siteInfo['Range'] = siteInfo['EndYear'] - siteInfo['StartYear']
#Show our result
siteInfo
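The question above also asks how much data each site collected; a minimal sketch that adds a record count to siteInfo (the new column name is just illustrative):
#Add the number of nitrogen records per site
siteInfo['nRecords'] = siteGroup.count()
siteInfo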
#Limit data to those with more than 10 years data and still operating
siteInfo = siteInfo.query('Range >= 10 & EndYear >= 2017')
siteInfo
We're down to 10 sites that fit our criteria! Now let's pull the water quality data for just those sites into a new dataframe called dfSubSites and then see where those sites fall on a map.
#Filter our merged sites-results dataframe to contain only the sites identified above
siteMask = nitrodata['MonitoringLocationIdentifier'].isin(siteInfo.index)
dfSubSites = nitrodata[siteMask]
dfSubSites.shape
We see there are 2,366 water quality records for those 10 sites. Now we'll plot them.
#Find center coordinates from medians of lat and long columns
medLat = dfSubSites['LatitudeMeasure'].median()
medLng = dfSubSites['LongitudeMeasure'].median()
#Create the initial map
m = folium.Map(location=[medLat,medLng],
zoom_start=10,
tiles='stamenterrain')
#Group and map these sites; this will speed mapping by removing duplicate locations
dfSubSites2 = dfSubSites.groupby('MonitoringLocationIdentifier').first()
#Loop through all features and add them to the map as markers
for row in dfSubSites2.itertuples():
    #Get info for the record
    lat = row.LatitudeMeasure
    lng = row.LongitudeMeasure
    name = row.MonitoringLocationName
    #Create the marker object, adding them to the map object
    folium.CircleMarker(location=[lat,lng],
                        popup=name,
                        color='red',
                        fill=True,
                        fill_opacity=0.8,
                        radius=5,
                        stroke=False).add_to(m)
#Show map
m
Now that we've seen what sites we're dealing with, let's move on to create some plots for these sites. For each of the 10 sites, we want to display a line plot of total N over time, a box plot of total N by month, and a box plot of total N by year.
This gets a bit convoluted, but the approach is as follows: iterate through the site IDs, subset the records for each site, and draw the three plots on a single row of subplots.
#Create list of siteIDs
siteIDs = dfSubSites['MonitoringLocationIdentifier'].unique().tolist()
siteIDs
#Iterate through each siteID
for i,siteID in enumerate(siteIDs):
    #Create the figure for the current site
    fig, ax = plt.subplots(1,3)
    fig.set_figheight(3)
    fig.set_figwidth(15)
    #Subset the data for the given site
    siteMask = dfSubSites['MonitoringLocationIdentifier'] == siteID
    dfSite = dfSubSites[siteMask].sort_index()
    #Create sub plots
    dfSite['TotalN'].plot(title=siteID,ax=ax[0])
    dfSite.boxplot(column='TotalN',by='Month',rot=90,ax=ax[1])
    dfSite.boxplot(column='TotalN',by='Year',rot=90,ax=ax[2])
Here we plot the 10 sites on a map, but this time we size the markers to reveal mean total nitrogen for year 2017 records...
#Select records for 2017
df2017 = dfSubSites[dfSubSites.Year == 2017]
#Group by site & compute mean TotalN
df2017sites = df2017.groupby('MonitoringLocationIdentifier').mean()
df2017sites.TotalN
#Find center coordinates from medians of lat and long columns
medLat = df2017sites['LatitudeMeasure'].median()
medLng = df2017sites['LongitudeMeasure'].median()
#Create the initial map
m = folium.Map(location=[medLat,medLng],
zoom_start=10,
tiles='stamenterrain')
#Loop through all features and add them to the map as markers
for row in df2017sites.itertuples():
    #Get info for the record
    lat = row.LatitudeMeasure
    lng = row.LongitudeMeasure
    avgN = row.TotalN
    #Create the marker object, adding them to the map object
    folium.CircleMarker(location=[lat,lng],
                        popup=str(avgN),
                        color='red',
                        fill=True,
                        fill_opacity=0.8,
                        radius=5*avgN, #<-- we set the marker radius with avgN
                        stroke=False).add_to(m)
#Show map
m
The water quality data report nitrogen as $mg/l$. In order to convert to an annual load ($lbs/yr$), we need to know the volume of water flowing past each site. You could go to the NWIS Mapper to find which USGS gauges are closest to the Haw River arm (1 site) and the New Hope arm (3 sites), but we'll give them to you here: Haw River arm: 02096960; New Hope arm: {02097314, 02097517, 0209741955}.
Now our task is to extract the data for these sites, for which we'll create and apply a Python function called getNWISData.
#Function to retrieve discharge data for the specified site
def getNWISData(siteNo):
    #Import libraries used within the function
    import requests, io
    #Construct the service URL and parameters
    url = 'https://waterservices.usgs.gov/nwis/dv'
    params = {'sites':siteNo,          # <--The site number provided
              'parameterCd':'00060',   # <--Discharge
              'statCd':'00003',        # <--Mean
              'startDT':'1930-10-01',  # <--Start date
              'endDT':'2017-12-31',    # <--End date
              'format':'rdb',          # <--Output as table
              'siteStatus':'all'       # <--Get all sites (active and inactive)
             }
    response_raw = requests.get(url,params)
    response_clean = response_raw.content.decode('utf-8')
    #**Convert the data into a data frame**
    #Build a list of line numbers to skip (comment lines and the rdb format line)
    rowsToSkip = []  # <--Create an empty list to hold line numbers
    lineNumber = 0   # <--Initialize a line number variable
    #Iterate through lines, adding the line number to the list for comment lines
    for lineNumber, lineString in enumerate(response_clean.split("\n")):
        if lineString.startswith('#'):
            rowsToSkip.append(lineNumber)
        else:
            break
    #Also skip the rdb field-format line, which sits two lines past the last comment (just after the header)
    dataTypeLineNumber = rowsToSkip[-1] + 2
    rowsToSkip.append(dataTypeLineNumber)
    #Create a dataframe from the downloaded data
    dfSite = pd.read_csv(io.StringIO(response_clean),
                         skiprows=rowsToSkip,
                         delimiter='\t',
                         dtype={'site_no':'str'})
    #Convert datatype for datetime
    dfSite['datetime'] = pd.to_datetime(dfSite['datetime'])
    #Rename the columns
    dfSite.columns = ['agency_cd','site_no','datetime','meanflow_cfs','confidence']
    #Add year and month columns
    dfSite['Year'] = dfSite['datetime'].map(lambda x: x.year)
    dfSite['Month'] = dfSite['datetime'].map(lambda x: x.month)
    #Set the index as the datetime column
    dfSite.index = dfSite.datetime
    #Coerce any non-numeric flow values to NaN
    dfSite['meanflow_cfs'] = pd.to_numeric(dfSite['meanflow_cfs'],errors='coerce')
    return dfSite
#Function to get site info for the specified site
def getNWISSiteData(siteNo):
    #Construct the inventory service URL for the site
    url = ('https://waterdata.usgs.gov/nwis/inventory?' +
           'search_site_no={}'.format(siteNo) +
           '&search_site_no_match_type=exact' +
           '&group_key=NONE' +
           '&format=sitefile_output' +
           '&sitefile_output_format=rdb' +
           '&column_name=dec_lat_va' +
           '&column_name=dec_long_va' +
           '&column_name=drain_area_va' +
           '&list_of_search_criteria=search_site_no')
    #Read the rdb output, skipping the header lines
    dfSiteX = pd.read_csv(url, skiprows=range(26),
                          sep='\t',
                          names=['lat','lng','agcy','datum','d_area'])
    dfSiteX.index = [siteNo]
    return dfSiteX[['lat','lng','d_area']]
Get the NWIS Discharge data for the Haw
#Get data for the Haw River (siteNo = '02096960') using the function above
dfHaw_Discharge = getNWISData('02096960')
dfHaw_Discharge.head()
#Get the site info for the Haw River site
dfHaw_SiteInfo = getNWISSiteData('02096960')
dfHaw_SiteInfo
Get the Water Quality data for the Haw
#Filter the water quality data for the Haw River monitoring location
dfHawWQ = dfSubSites[dfSubSites['MonitoringLocationIdentifier'] == 'USGS-0209699999']
To calculate the annual load, we first need to convert cfs to MGD. Then we group the daily flows by year to calculate the total annual flow at the Haw River gauge. Next, we calculate the average nitrogen concentration of the samples taken during each year. This is a rough proxy; a finer analysis could be undertaken by summarizing monthly flow and water quality, aggregating to the year as the last step (a minimal sketch of that approach appears at the end of this Haw River analysis). The annual load is then: $Annual \; Load \; (lbs) = Total \; Flow \; (MG) \times avg \; N \; (mg/l) \times 8.34 \; lbs/gal$
We can then plot the annual load against the threshold of 2.567 million pounds per year.
#Add the MGD column to the dfHaw table
dfHaw_Discharge['flow_MGD'] = dfHaw_Discharge['meanflow_cfs'] * 0.64631688969744
#Group data by year and compute the total flow and count of records
dfAnnualMGD = dfHaw_Discharge.groupby('Year')['flow_MGD'].agg(['sum','count'])
#Remove years with fewer than 350 daily records
dfAnnualMGD = dfAnnualMGD[dfAnnualMGD['count'] > 350]
#Rename the 'sum' column
dfAnnualMGD.columns = ['AnnualFlow_MGD','Count']
#Compute the annual average nitrogen concentration
wqYear = dfHawWQ.groupby('Year')['TotalN'].mean()
dfAnnualN = pd.DataFrame(wqYear)
haw = pd.merge(left=dfAnnualN,right=dfAnnualMGD,how='inner',left_index=True,right_index=True)
haw.columns
#Compute the annual load in million lbs: Annual Flow (MG) * avg N (mg/l) * 8.34 lbs/gal, divided by 1,000,000
haw['M_lbs'] = haw["AnnualFlow_MGD"] * haw['TotalN'] * 8.34 / 1000000
#Construct the figure object
fig = plt.figure()
#Create an axis object, which is what we plot on
ax = plt.axes()
#Create a line plot of M_lbs against our index (years), set markers to be points
plt.plot(haw['M_lbs'],marker='o')
#Add aesthetics
plt.title("Haw River Branch")
plt.ylabel("Annual Nitrogen Load (million lbs)")
#Add a line at the TMDL limit, color it red and make it dashed
plt.axhline(y=2.567,color='red',linestyle='--')
#Finally, reveal the plot
plt.show()
#Write out data to csv
haw.to_csv('./haw_python.csv',index=True)
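As mentioned above, a finer-grained alternative is to pair flow and concentration by month and roll up to the year only at the end. Here is a minimal sketch of that approach using the dfHaw_Discharge and dfHawWQ dataframes already loaded; the month-level inner join (which drops months without a nitrogen sample) and the variable names are illustrative assumptions, not part of the original workflow:
#Monthly flow totals (MG) at the Haw gauge
flowMonthly = dfHaw_Discharge.groupby(['Year','Month'])['flow_MGD'].sum()
#Monthly mean nitrogen concentration (mg/l) at the Haw WQ site
nMonthly = dfHawWQ.groupby(['Year','Month'])['TotalN'].mean()
#Keep months having both flow and a sample; compute the monthly load (lbs)
monthly = pd.concat([flowMonthly,nMonthly],axis=1,join='inner')
monthly['load_lbs'] = monthly['flow_MGD'] * monthly['TotalN'] * 8.34
#Roll the monthly loads up to an annual load (million lbs)
monthly.groupby(level='Year')['load_lbs'].sum() / 1e6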
There are 3 upstream gauges for the New Hope arm, and we repeat the above analysis for each of them. The three gauge site IDs are 02097314, 02097517, and 0209741955. We use the Total N values from the water quality site USGS-0209768310.
We use the getNWISData() function already created to pull NWIS data for each site.
#Get the data
dfQ1 = getNWISData('02097314')
dfQ2 = getNWISData('02097517')
dfQ3 = getNWISData('0209741955')
dfS1 = getNWISSiteData('02097314')
dfS2 = getNWISSiteData('02097517')
dfS3 = getNWISSiteData('0209741955')
dfWQ = dfSubSites[dfSubSites.MonitoringLocationIdentifier == 'USGS-0209768310']
#Add MGD columns
dfQ1['flow_MGD'] = dfQ1['meanflow_cfs'] * 0.64631688969744
dfQ2['flow_MGD'] = dfQ2['meanflow_cfs'] * 0.64631688969744
dfQ3['flow_MGD'] = dfQ3['meanflow_cfs'] * 0.64631688969744
#Compute annual flow for each dataset
dfQ1_annual = dfQ1.groupby('Year').agg({'flow_MGD':('sum','count')})
dfQ1_annual.columns = dfQ1_annual.columns.droplevel()
dfQ1_annual = dfQ1_annual[dfQ1_annual['count'] > 350]
dfQ1_annual.columns = ['AnnualFlow_MGD','Count']
dfQ2_annual = dfQ2.groupby('Year').agg({'flow_MGD':('sum','count')})
dfQ2_annual.columns = dfQ2_annual.columns.droplevel()
dfQ2_annual = dfQ2_annual[dfQ2_annual['count'] > 350]
dfQ2_annual.columns = ['AnnualFlow_MGD','Count']
dfQ3_annual = dfQ3.groupby('Year').agg({'flow_MGD':('sum','count')})
dfQ3_annual.columns = dfQ3_annual.columns.droplevel()
dfQ3_annual = dfQ3_annual[dfQ3_annual['count'] > 350]
dfQ3_annual.columns = ['AnnualFlow_MGD','Count']
#Compute the annual average nitrogen concentration
wqYear = dfWQ.groupby('Year')['TotalN'].mean()
dfWQ_annual = pd.DataFrame(wqYear)
#Join the tables
df1 = pd.merge(left=dfWQ_annual,right=dfQ1_annual,how='inner',left_index=True,right_index=True)
df2 = pd.merge(left=dfWQ_annual,right=dfQ2_annual,how='inner',left_index=True,right_index=True)
df3 = pd.merge(left=dfWQ_annual,right=dfQ3_annual,how='inner',left_index=True,right_index=True)
#Compute the annual load in million lbs from annual flow (MG) and TotalN (mg/l)
df1['M_lbs-Site1'] = df1["AnnualFlow_MGD"] * df1['TotalN'] * 8.34 / 1000000
df2['M_lbs-Site2'] = df2["AnnualFlow_MGD"] * df2['TotalN'] * 8.34 / 1000000
df3['M_lbs-Site3'] = df3["AnnualFlow_MGD"] * df3['TotalN'] * 8.34 / 1000000
#Plot the data
fig = plt.figure(figsize=(20,8))
ax = plt.axes()
plt.plot(df1['M_lbs-Site1'],marker='o',color='black',label='Site 02097314')
plt.plot(df2['M_lbs-Site2'],marker='o',color='green',label='Site 02097517')
plt.plot(df3['M_lbs-Site3'],marker='o',color='blue',label='Site 0209741955')
plt.title("New Hope Reaches")
plt.ylabel("Annual Nitrogen Load (million lbs)")
plt.axhline(y=2.567,color='red',linestyle='--')
plt.legend()
plt.show()