Image('NYC.jpg', width = 1000)
This project examines the different types of air pollutants across the neighborhoods of New York City over several years, and how they relate to health cases attributed to those pollutants. I used seaborn and matplotlib to visualize how each pollutant's concentration varies by neighborhood, and folium to map health cases and pollution across the city. The working hypothesis is that the most polluted areas of New York City do not necessarily have the most pollution-related health cases.
To accomplish this, I combined the string values of three columns into a single new column, dropped unnecessary rows and columns, and merged the result with the neighborhood dataset on neighborhood name. I then wrote functions to create bar graphs, line graphs, and scatterplots, and finished with two folium maps that give an overall view of pollution and health cases in New York City.
This dataset consists of air pollutant measurements and health cases related to those pollutants. It also contains the location, measure, and time period of each pollutant or health case. I later merge this dataset with the NYC UHF 42 dataset.
This dataset contains all 42 UHF neighborhoods of New York City, along with their boroughs and locations. I also used the geojson file from this website for the folium maps.
import warnings
from IPython.display import Image
import pandas as pd
import seaborn as sns
import geopandas as gpd
import folium
import matplotlib.pyplot as plt
%matplotlib inline
from google.colab import files
warnings.filterwarnings('ignore')
uploaded = files.upload()
airquality = pd.read_csv("Air_Quality.csv")
airquality
The Name column of the air quality dataset holds both air pollutant names and health statistics attributable to those pollutants. The health statistics deserve their own columns, since we will be comparing them. We won't need traffic density, as it only measures the distance traveled by vehicles within an area, although it could help explain patterns in the visualizations as a possible cause. There are no health-related measures for the 2005-2007 time period, so we remove it from the data. Finally, because we want to create a folium map of neighborhoods, we exclude citywide and borough-level rows, which are too general. This leaves only the UHF42 rows.
#Removes rows whose Name starts with 'Traffic Density' and rows for the 2005-2007 period, then keeps only rows with geo type UHF42.
airquality = airquality[~airquality['Name'].str.startswith('Traffic Density')]
airquality = airquality[~airquality['Time Period'].str.contains('2005-2007')]
airquality = airquality[airquality['Geo Type Name'] == 'UHF42'].reset_index()
airquality
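The filtering step can be tried on a small table first. The sketch below uses invented rows (not the real data) with the same three column names, and combines the three rules into one boolean mask. The variable names `toy`, `not_traffic`, `not_2005_2007`, and `is_uhf42` are made up for this illustration.

```python
import pandas as pd

# Toy rows that mimic the layout of Air_Quality.csv; the values are made up.
toy = pd.DataFrame({
    "Name": ["Traffic Density- Annual Vehicle Miles Traveled", "Ozone (O3)",
             "Sulfur Dioxide (SO2)", "Ozone (O3)"],
    "Time Period": ["2005", "2005-2007", "Winter 2008-09", "Summer 2009"],
    "Geo Type Name": ["UHF42", "UHF42", "Borough", "UHF42"],
})

# Build one boolean mask per rule, then combine them with & (logical AND).
not_traffic = ~toy["Name"].str.startswith("Traffic Density")
not_2005_2007 = ~toy["Time Period"].str.contains("2005-2007", regex=False)
is_uhf42 = toy["Geo Type Name"] == "UHF42"

# drop=True discards the old index instead of keeping it as an 'index' column.
filtered = toy[not_traffic & not_2005_2007 & is_uhf42].reset_index(drop=True)
print(filtered)
```

Only the last toy row passes all three rules. Note that the notebook calls `reset_index()` without `drop=True`, so it keeps the old index as an `index` column and drops it later during the merge.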
As seen in the dataset, separate columns hold the name, time period, and measure of each pollutant or health case. To get at specific values, we combine the strings of these three columns into one label, so that later each label can become its own column holding that measure for that pollutant or health case in that time period.
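#Concatenates the three columns into one label and strips commas.
#Only .str.replace edits substrings; a plain .replace chained after it is Series.replace,
#which matches whole values and would leave these labels unchanged. Spaces and hyphens are
#kept on purpose, because the later column filters match on them (e.g. 'Fine Particulate Matter').
airquality['Air Pollutant'] = (
    airquality['Name']
    + ' ' + airquality['Time Period']
    + ' ' + airquality['Measure']
).str.replace(',', '', regex=False)
airquality.head()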
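A detail worth knowing when cleaning label strings: pandas has two different `replace` methods. `Series.str.replace` edits substrings inside each string, while `Series.replace` (without `.str`) by default only swaps values that match exactly. The sketch below shows the difference on two made-up strings.

```python
import pandas as pd

# Two made-up values: a label with a comma and spaces, and a lone hyphen.
labels = pd.Series(["Ozone (O3), Summer 2009", "-"])

# .str.replace works on substrings inside every string.
substring = labels.str.replace(",", "", regex=False)

# Series.replace (no .str) only swaps values that are exactly equal to the target.
whole_value = labels.replace("-", "none")

# No value is exactly a single space, so this call changes nothing.
unchanged = labels.replace(" ", "_")

print(substring.tolist())
print(whole_value.tolist())
print(unchanged.tolist())
```

To edit substrings in several passes, each pass needs its own `.str` accessor, for example `s.str.replace(',', '', regex=False).str.replace(' ', '_', regex=False)`.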
From the air quality dataset, we see that the Message column is not needed, since it consists entirely of NaN values.
#Drops the message column
airquality.drop('Message', axis=1, inplace=True)
The New York City UHF 42 dataset, containing the 42 neighborhoods of New York, is imported.
uploaded = files.upload()
UNF = pd.read_csv('uhf_42_dohmh_2009.csv').dropna()
UNF
The neighborhoods of New York City will play an important part when displaying the maps.
UNF = UNF.set_index('uhf_neigh')
UNF
#Groups the rows by geo place name; .groups returns a dictionary (pandas displays it as a PrettyDict) whose keys are the 42 neighborhood names.
neighborhood_groups = airquality.groupby('Geo Place Name').groups
Here the keys of the grouped dictionary (the 42 neighborhood names) are compared with the UNF index. Both lists are sorted and paired by position, and wherever the two strings differ, the geo place name is replaced with the corresponding UNF index value, which has the correct spelling and spacing. This assumes the two sorted lists line up one-to-one. The step matters because the airquality and UNF datasets are merged on the 42 neighborhood names.
airquality = airquality.replace({
    x: y
    for x, y in zip(sorted(neighborhood_groups.keys()), sorted(UNF.index))
    if x != y
})
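Pairing two sorted lists by position only works if every name sorts into the same slot in both lists. A quick sanity check, sketched below with invented spellings and the standard-library `difflib`, is to measure how similar each replaced pair is and flag pairs that look unrelated. The names, the `similarity` helper, and the 0.8 cutoff are all assumptions made for this illustration.

```python
from difflib import SequenceMatcher

# Hypothetical spellings; the real lists come from the two datasets.
air_names = sorted(["Bedford Stuyvesant - Crown Heights", "Upper East Side", "Willowbrook"])
uhf_names = sorted(["Bedford Stuyvesant-Crown Heights", "Upper East Side", "Willowbrook"])

def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Pair names by sorted position and keep only the pairs that differ.
mapping = {a: u for a, u in zip(air_names, uhf_names) if a != u}

# Flag any pair that looks too different to be the same neighborhood.
# The 0.8 cutoff is an arbitrary choice for this sketch.
suspicious = {a: u for a, u in mapping.items() if similarity(a, u) < 0.8}

print(mapping)
print(suspicious)
```

An empty `suspicious` dictionary suggests the positional pairing lined up. It is also worth checking that both lists have the same length, since `zip` silently stops at the shorter one.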
#Groups the rows by air pollutant/health case label (name, time period, and measure); the labels are the dictionary keys.
airpollutants = airquality.groupby('Air Pollutant').groups
#Collects the dictionary keys (the pollutant/health case labels) into a list
pollutants = list(airpollutants)
#Merges the airquality and UNF datasets on neighborhood name, adding one new column per air pollutant/health case label
for n in airpollutants:
    UNF = pd.concat(
        [UNF,
         airquality.loc[airpollutants[n]]
         .drop('index', axis=1)
         .drop_duplicates()
         .set_index('Geo Place Name')['Data Value']],
        axis = 1).reindex(UNF.index).rename({'Data Value': n}, axis = 1)
UNF
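The loop above turns long-format rows (one row per neighborhood and label) into wide-format columns (one column per label). As an alternative sketch, not the approach used in this notebook, the same reshaping can be expressed with `pivot_table` followed by an index join. The toy values and the `neighborhoods` frame are invented, and unlike the loop, duplicate rows here are averaged rather than dropped.

```python
import pandas as pd

# Made-up long-format rows: one row per (neighborhood, measurement label).
long_df = pd.DataFrame({
    "Geo Place Name": ["Astoria", "Astoria", "Harlem", "Harlem"],
    "Air Pollutant": ["Ozone Summer 2009", "SO2 Winter 2008-09"] * 2,
    "Data Value": [30.0, 5.0, 28.0, 7.0],
})

# One row per neighborhood, one column per label; duplicates are averaged.
wide = long_df.pivot_table(index="Geo Place Name",
                           columns="Air Pollutant",
                           values="Data Value",
                           aggfunc="mean")

# Hypothetical neighborhood table indexed by name, like UNF after set_index.
neighborhoods = pd.DataFrame(
    {"borough": ["Queens", "Manhattan"]},
    index=pd.Index(["Astoria", "Harlem"], name="uhf_neigh"),
)

# join aligns on the index values, so each neighborhood picks up its own row.
merged = neighborhoods.join(wide)
print(merged)
```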
Now we can see a column for each specific pollutant and health case, with its own time period and measure. Each column holds the measurement for each of the 42 neighborhoods.
For the visualization part, creating maps with the use of folium will help analyze what pollutants affect which areas of New York City the most and the least. The same goes for the health cases related to the air pollutants. The best way to accomplish this is to first split the map - one with the concentration of the air pollutants and the other with the health cases related to these pollutants.
#Iterates through pollutants to separate pollutants columns and health case columns
pollCont = [pol for pol in pollutants if 'Attributable' not in pol]
healthEff = [pol for pol in pollutants if pol not in pollCont and 'Ozone' not in pol]
#Iterates through pollCont for any columns that contain benzene and formaldehyde
BenzeneAvgs = [bz for bz in pollCont if 'Benzene' in bz]
FormaldehydeAvgs = [f for f in pollCont if 'Formaldehyde' in f]
#Iterates through pollCont for any columns containing Boiler Emission and each of the pollutants associated with it
BoilerEmissionAvgs = [be for be in pollCont if 'Boiler' in be]
BoilerEmissionSO2Avgs = [be for be in pollCont if 'Boiler' in be and 'SO2' in be]
BoilerEmissionPMAvgs = [be for be in pollCont if 'Boiler' in be and 'PM' in be]
#Iterates through pollCont for each of the air pollutants, including annual, summer, and winter averages
PMAnnualAvgs = [pm for pm in pollCont if 'Fine Particulate Matter' in pm]
OZAnnualAvgs = [oz for oz in pollCont if 'Ozone' in oz]
NOAnnualAvgs = [no for no in pollCont if 'Nitrogen Dioxide' in no]
SO2AnnualAvgs = [so for so in pollCont if 'Sulfur Dioxide' in so]
#Iterates through healthEff for each of the health cases correlated with the two air pollutants
PMHealthCases = [ast for ast in healthEff if 'PM2.5-Attributable' in ast]
OZHealthCases = [ast for ast in healthEff if 'O3' in ast]
UNF.reset_index(inplace=True)
The air pollutant and health case values can span very different ranges. To make them comparable, each column is min-max normalized to the range 0 to 1, which puts every measurement on the same scale for the maps and graphs.
#Function to min-max normalize the values of a column to the range 0 to 1
def simplify(df):
    avg = (df - df.min()) / (df.max() - df.min())
    return avg
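To see what this normalization does, here is a small worked example on made-up numbers. The function name `min_max` is a stand-in chosen for this sketch; it applies the same formula, (value - min) / (max - min).

```python
import pandas as pd

def min_max(series):
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

values = pd.Series([10.0, 15.0, 20.0])
scaled = min_max(values)
print(scaled.tolist())  # → [0.0, 0.5, 1.0]

# Edge case: a column where every value is equal divides 0 by 0 and yields NaN.
constant = min_max(pd.Series([3.0, 3.0]))
print(constant.isna().all())  # → True
```

The edge case matters when a column has the same value in every neighborhood: its normalized values become NaN and would carry through any sum that includes them.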
#We apply the function so that all 42 values in each column are normalized to between 0 and 1.
#The normalized columns are then summed and divided by the number of columns, giving the average for that pollutant/health case group.
column_groups = {
    'HealthCaseAvg': healthEff,
    'ConcentrationAvg': pollCont,
    'PM 2.5 Annual Avgs': PMAnnualAvgs,
    'OZ Annual Avgs': OZAnnualAvgs,
    'NO Annual Avgs': NOAnnualAvgs,
    'SO2 Annual Avgs': SO2AnnualAvgs,
    'PM 2.5 Health Cases': PMHealthCases,
    'OZ Health Cases': OZHealthCases,
    'Benzene Avgs': BenzeneAvgs,
    'Formaldehyde Avgs': FormaldehydeAvgs,
    'Boiler Emission Avgs': BoilerEmissionAvgs,
    'Boiler Emission PM 2.5 Avgs': BoilerEmissionPMAvgs,
    'Boiler Emission SO2 Avgs': BoilerEmissionSO2Avgs,
}
for new_column, columns in column_groups.items():
    UNF[new_column] = sum(simplify(UNF[column]) for column in columns) / len(columns)
UNF
We see these normalized values in the newly added columns. All are between 0 and 1, which keeps the visualizations on a common scale.
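The averaging of normalized columns can also be checked by hand on a tiny example. The sketch below uses two invented columns, `a` and `b`, normalizes both at once, and takes the row-wise mean. It uses `DataFrame.mean(axis=1)`, which differs from the sum-and-divide approach in one respect: it skips NaN values by default instead of propagating them.

```python
import pandas as pd

# Two made-up measurement columns on very different scales.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# Min-max normalize every column at once (operations broadcast column by column).
normalized = (toy - toy.min()) / (toy.max() - toy.min())

# Row-wise mean of the normalized columns gives one composite score per row.
composite = normalized.mean(axis=1)

print(normalized)
print(composite.tolist())  # → [0.0, 0.75, 0.75]
```

Row 0 has the lowest value in both columns, so its composite is 0. Rows 1 and 2 each combine a 0.5 and a 1.0, giving 0.75.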
#Bar plot for the first 22 neighborhoods
def barPlotFirst22(df, col1, col2, name, col3):
    plt.figure()  #start a new figure so consecutive plots do not overlap
    #Slice the rows once so x, y, and hue all cover the same 22 neighborhoods
    sns.barplot(data = df.iloc[:22], x = col1, y = col2, hue = col3)
    plt.title(f"{name} Averages in NYC neighborhoods First 22 UHF")
#Bar plot for the remaining 20 neighborhoods
def barPlotLast22(df, col1, col2, name, col3):
    plt.figure()
    sns.barplot(data = df.iloc[22:], x = col1, y = col2, hue = col3)
    plt.title(f"{name} Averages in NYC neighborhoods Last 20 UHF")
barPlotFirst22(UNF,'Benzene Avgs', 'uhf_neigh', 'Benzene', 'borough')
barPlotLast22(UNF,'Benzene Avgs', 'uhf_neigh', 'Benzene', 'borough')
We see that benzene affects the neighborhoods of Manhattan the most, followed by Brooklyn. Benzene comes from gasoline fumes and crude oil, and is released into the air by automobile exhaust. Manhattan happens to have the most traffic in New York City.
barPlotFirst22(UNF,'Formaldehyde Avgs', 'uhf_neigh', 'Formaldehyde', 'borough')
barPlotLast22(UNF,'Formaldehyde Avgs', 'uhf_neigh', 'Formaldehyde', 'borough')
The Bronx and Manhattan have the most neighborhoods affected by formaldehyde. Formaldehyde is produced during the combustion of fuels in vehicles and buildings, and it also forms in the atmosphere from other pollutants. Manhattan has the most buildings and vehicles among the boroughs, and the Bronx also has many apartment buildings.
barPlotFirst22(UNF,'PM 2.5 Annual Avgs', 'uhf_neigh', 'PM 2.5', 'borough')
barPlotLast22(UNF,'PM 2.5 Annual Avgs', 'uhf_neigh', 'PM 2.5', 'borough')
Fine particulate matter (PM 2.5) comes primarily from automobile exhaust. As with benzene and formaldehyde, some neighborhoods of Manhattan, the Bronx, and Brooklyn have higher PM 2.5 levels than the rest of the city.
barPlotFirst22(UNF,'OZ Annual Avgs', 'uhf_neigh', 'Ozone (O3)', 'borough')
barPlotLast22(UNF,'OZ Annual Avgs', 'uhf_neigh', 'Ozone (O3)', 'borough')
Ozone pollution is highest in Queens and Staten Island. Ozone is not emitted directly; it forms when nitrogen oxides and volatile organic compounds, emitted by cars, power plants, and industrial boilers, react in sunlight. There is a power plant in Astoria, which may contribute to the concentration in Queens.
barPlotFirst22(UNF,'NO Annual Avgs', 'uhf_neigh', 'Nitrogen Dioxide (NO2)', 'borough')
barPlotLast22(UNF,'NO Annual Avgs', 'uhf_neigh', 'Nitrogen Dioxide (NO2)', 'borough')
Nitrogen dioxide is readily produced by combustion in stoves, water heaters, furnaces, and boilers, as well as by vehicles and power plants. Brooklyn and the Bronx are more urban, so more homes there contribute to nitrogen dioxide levels.
barPlotFirst22(UNF,'SO2 Annual Avgs', 'uhf_neigh', 'Sulfur Dioxide (SO2)', 'borough')
barPlotLast22(UNF,'SO2 Annual Avgs', 'uhf_neigh', 'Sulfur Dioxide (SO2)', 'borough')
Sulfur dioxide comes from electric utilities, mostly the ones that burn coal. Also, boiler emissions produce sulfur dioxide, as well as other industrial facilities. Sulfur dioxide is mostly present within the neighborhoods of Manhattan and the Bronx.
#Line plot for the first 22 neighborhoods
def LinePlotFirst22(df, col1, col2, name):
    x = df[col1][:22]
    y = df[col2][:22]
    plt.figure()  #start a new figure so the two line plots are drawn separately
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title(f"{name} Averages in NYC neighborhoods First 22 UHF")
    plt.xticks(rotation=90)
    plt.plot(x, y, color = 'black',
             linestyle = 'solid', marker = 'o',
             markerfacecolor = 'red', markersize = 12)
#Line plot for the remaining 20 neighborhoods
def LinePlotLast22(df, col1, col2, name):
    x = df[col1][22:]
    y = df[col2][22:]
    plt.figure()
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title(f"{name} Averages in NYC neighborhoods Last 20 UHF")
    plt.xticks(rotation=90)
    plt.plot(x, y, color = 'black',
             linestyle = 'solid', marker = 'o',
             markerfacecolor = 'red', markersize = 12)
LinePlotFirst22(UNF,'uhf_neigh', 'Boiler Emission Avgs', 'Boiler Emission')
LinePlotLast22(UNF,'uhf_neigh', 'Boiler Emission Avgs', 'Boiler Emission')
Boiler emissions are common in homes and buildings that burn gas. From the visualizations, some neighborhoods in Manhattan and the Bronx have more pollutants from boiler emissions. This could be because Manhattan and the Bronx have more buildings, and therefore more boilers.
def ScatterPlot(df, col1, col2):
    sns.scatterplot(x = df[col1], y = df[col2])
ScatterPlot(UNF, 'PM 2.5 Annual Avgs', 'PM 2.5 Health Cases')
The health cases correlate with the average PM 2.5 concentrations: cases increase as the average concentration increases. There are two outliers, neighborhoods where the highest PM 2.5 concentrations did not come with as many health cases.
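A scatter plot shows the trend visually; a correlation coefficient puts a number on it. The sketch below uses invented values (not the real data) and `Series.corr`, which computes the Pearson coefficient by default. The column names `concentration` and `health_cases` are placeholders for this illustration.

```python
import pandas as pd

# Made-up normalized values for five neighborhoods (not the real data).
toy = pd.DataFrame({
    "concentration": [0.1, 0.3, 0.5, 0.7, 0.9],
    "health_cases": [0.2, 0.1, 0.6, 0.5, 1.0],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear
# relation, and -1 is a perfect decreasing line.
r = toy["concentration"].corr(toy["health_cases"])
print(round(r, 2))
```

On the notebook's data, the equivalent call would be `UNF['PM 2.5 Annual Avgs'].corr(UNF['PM 2.5 Health Cases'])`. A coefficient describes how closely the points follow a line; it does not by itself show that one quantity causes the other.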
ScatterPlot(UNF, 'Boiler Emission PM 2.5 Avgs', 'PM 2.5 Health Cases')
As seen before, boiler emissions are more common in Manhattan and the Bronx. Boiler-emitted PM 2.5 is somewhat correlated with the health cases, but not strongly, so it is better seen as a minor contributor. The PM 2.5 health cases may be more closely tied to PM 2.5 from car exhaust.
ScatterPlot(UNF, 'Boiler Emission PM 2.5 Avgs', 'PM 2.5 Annual Avgs')
This shows a correlation, consistent with boiler emissions being one of the main sources of PM 2.5. The correlation is most likely driven by Manhattan and the Bronx.
ScatterPlot(UNF, 'Boiler Emission SO2 Avgs', 'SO2 Annual Avgs')
Boiler SO2 emissions and sulfur dioxide concentrations are strongly correlated: both increase together. Boiler emissions can therefore be seen as one of the main sources of sulfur dioxide in New York City's air.
Here, the geojson file of the UHF42 of New York City is imported. It was obtained from the same link as the UHF42 dataset.
uploaded = files.upload()
geo_data = gpd.read_file('uhf_42_dohmh_2009.geojson').dropna()
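A choropleth only colors the areas whose key in the GeoJSON matches a key in the data table, so it can help to compare the two sets of names before drawing. The sketch below uses a tiny invented GeoJSON-like dictionary and plain Python sets. The property name `uhf_neigh` follows the `key_on` setting used for the maps in this notebook; the neighborhood names are made up.

```python
# Toy GeoJSON-like structure; the real file has one feature per UHF 42 area.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"uhf_neigh": "Astoria"}, "geometry": None},
        {"type": "Feature", "properties": {"uhf_neigh": "Harlem"}, "geometry": None},
    ],
}

# Hypothetical names from the data table; one is spelled differently on purpose.
table_names = {"Astoria", "Harlem - Morningside"}

# Collect the key that the choropleth would join on from every feature.
geo_names = {f["properties"]["uhf_neigh"] for f in toy_geojson["features"]}

# Names on one side with no partner on the other would not be colored from the data.
missing_from_table = sorted(geo_names - table_names)
missing_from_map = sorted(table_names - geo_names)

print(missing_from_table)  # → ['Harlem']
print(missing_from_map)    # → ['Harlem - Morningside']
```

With a GeoDataFrame such as `geo_data`, the set of map names could be built with `set(geo_data['uhf_neigh'])`, assuming the file exposes that property as a column.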
Here is the map of the concentrations of the air pollutants all across New York City.
#Base map centered on New York City; named pollution_map to avoid shadowing Python's built-in map
pollution_map = folium.Map([40.7678,-73.9645], zoom_start = 10, tiles = "cartodbpositron")
tiles = ['stamenwatercolor', 'cartodbpositron', 'openstreetmap', 'stamenterrain']
for tile in tiles:
    folium.TileLayer(tile).add_to(pollution_map)
legendTitle = "Pollution in NYC"
#Choropleth shading each UHF 42 neighborhood by its normalized concentration average
pollution_map.add_child(folium.Choropleth(
    geo_data = geo_data,
    name = 'Pollution in NYC',
    data = UNF,
    columns = ['uhf_neigh', 'ConcentrationAvg'],
    key_on = 'properties.uhf_neigh',
    fill_color = 'YlOrRd',
    bins = [0,0.2,0.4,0.6,0.8,1],  #bins replaces the deprecated threshold_scale
    fill_opacity = 0.9,
    line_opacity = 0.4,
    legend_name = legendTitle,
    highlight = True
))
folium.LayerControl().add_to(pollution_map)
pollution_map
Also, here we have a map of the health cases relating to the air pollutants across New York City.
#Base map centered on New York City; named health_map to avoid shadowing Python's built-in map
health_map = folium.Map([40.7678,-73.9645], zoom_start = 10, tiles = "cartodbpositron")
tiles = ['stamenwatercolor', 'cartodbpositron', 'openstreetmap', 'stamenterrain']
for tile in tiles:
    folium.TileLayer(tile).add_to(health_map)
LegendTitle = "Health Cases in NYC"
#Choropleth shading each UHF 42 neighborhood by its normalized health case average
health_map.add_child(folium.Choropleth(
    geo_data = geo_data,
    name = 'Health Cases in NYC',
    data = UNF,
    columns = ['uhf_neigh', 'HealthCaseAvg'],
    key_on = 'properties.uhf_neigh',
    legend_name = LegendTitle,
    fill_color = 'BuPu',
    bins = [0,0.2,0.4,0.6,0.8,1],  #bins replaces the deprecated threshold_scale
    fill_opacity = 0.9,
    line_opacity = 0.4,
    highlight = True
))
folium.LayerControl().add_to(health_map)
health_map
Overall, we see that pollution mainly affects Manhattan, while the health cases are concentrated mostly in the Bronx. One possible explanation is that people commute to work in Manhattan and then return to the borough where they live. A solution would be to burn less fossil fuel, for example by investing in solar panels and electric cars, since vehicle exhaust affects many New Yorkers through the pollutants it releases.
!jupyter nbconvert --to html AirQualityControlinNewYorkCity.ipynb