EDA & Feature Engineering
Before conducting the analysis, it is important to make some edits to
the variables in the data set. First, the variable Day is currently a
character variable, so we cannot use it directly in a model. To make it
more useful, it needs to be changed to a factor variable, where each day
of the week is its own level: “Monday”, “Tuesday”, etc. Next, we combine
the variables HighTemp and LowTemp to create a new variable called
AvgTemp, which is the average of the high and low temperature. Finally,
we recode Precipitation into a binary variable called NewPrecip, which
equals 1 on days with any precipitation and 0 otherwise, and drop the
three original weather variables.
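# Work on a copy so the original data frame is left untouched
bridge.new = bridge
# Day: character -> factor (one level per day of the week)
bridge.new$Day = as.factor(bridge$Day)
# AvgTemp: midpoint of the daily high and low temperatures
bridge.new$AvgTemp = (bridge$HighTemp + bridge$LowTemp)/2
# NewPrecip: 1 if any precipitation was recorded that day, 0 otherwise
bridge.new$NewPrecip <- NA
for(i in 1:nrow(bridge)) {
  if (bridge$Precipitation[i] > 0) {
    bridge.new$NewPrecip[i] <- 1
  } else if (bridge$Precipitation[i] == 0) {
    bridge.new$NewPrecip[i] <- 0
  }
}
# Drop the raw variables that AvgTemp and NewPrecip replace
bridge.final = subset(bridge.new, select = -c(HighTemp, LowTemp, Precipitation))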
Now we have our final data set, bridge.final, which still consists of
the same 30 observations but now has only 6 variables, which are as
follows.
1. Date (chr): Used as the ID variable; the date on which the
   observation was recorded.
2. Day (factor): The day of the week on which the observation was
   recorded.
3. QueensboroBridge (int): The total number of cyclists who entered and
   exited the Queensboro Bridge on the date of the observation.
4. Total (int): The total number of cyclists who entered and exited any
   of the four bridges on the date of the observation.
5. AvgTemp (num): The average of the HighTemp and LowTemp variables.
6. NewPrecip (int): A binary categorical variable. If NewPrecip equals
   0, there was no precipitation on that day. If NewPrecip equals 1,
   there was precipitation on that day.
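The recoding steps above can also be written without an explicit loop. The sketch below uses a small made-up data frame (`toy`, with invented values) in place of the real `bridge` data; only the column names are taken from the report, and the vectorized recode is an alternative, not the code used for the results here.

```r
# Hypothetical three-row stand-in for the bridge data (values are invented).
toy <- data.frame(
  Day           = c("Monday", "Tuesday", "Wednesday"),
  HighTemp      = c(60.1, 55.0, 48.9),
  LowTemp       = c(45.0, 41.0, 39.9),
  Precipitation = c(0, 0.25, 0),
  stringsAsFactors = FALSE
)

toy.new <- toy
toy.new$Day       <- as.factor(toy$Day)                 # character -> factor
toy.new$AvgTemp   <- (toy$HighTemp + toy$LowTemp) / 2   # midpoint of high and low
toy.new$NewPrecip <- as.integer(toy$Precipitation > 0)  # 1 = any precipitation, 0 = none

# Drop the columns that the new variables replace.
toy.final <- subset(toy.new, select = -c(HighTemp, LowTemp, Precipitation))
str(toy.final)
```

Using `as.integer()` on the logical comparison stores NewPrecip as an integer column, which matches the (int) type listed above.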
Methodology & Analysis
library(knitr)  # provides kable()
# Primary model: Poisson regression without an offset
pois.model.1 = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip,
                   family = poisson(link="log"), data = bridge.final)
yhat.1 = pois.model.1$fitted.values
# Pearson residuals: (observed - fitted) / sqrt(fitted)
pearson.resid.1 = (bridge.final$QueensboroBridge - yhat.1)/sqrt(yhat.1)
# Dispersion estimates: Pearson chi-square and deviance, each over residual df
Pearson.disp.1 = sum(pearson.resid.1^2)/pois.model.1$df.residual
Deviance.disp.1 = (pois.model.1$deviance)/pois.model.1$df.residual
disp = cbind(Pearson.disp.1 = Pearson.disp.1, Deviance.disp.1 = Deviance.disp.1)
kable(disp, caption="Dispersion parameter for Primary Model", align = 'c')
Dispersion parameter for Primary Model
| Pearson.disp.1 | Deviance.disp.1 |
|:--------------:|:---------------:|
|    154.3321    |    158.8547     |
As seen above, the Poisson model resulted in a Pearson dispersion of
154.3321 and a deviance dispersion of 158.8547. Considering that a
Poisson model assumes a dispersion of 1, this model shows severe
overdispersion and heavily violates the model assumptions. We therefore
have to look into using a different model. Following the models used in
the previous assignment, the next model accounts for the total number of
cyclists that day by including log(Total) as an offset.
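The dispersion check described above can be wrapped in a small helper so the same calculation is reused for every candidate model. This is only a sketch: the function name `dispersion.check` is made up for illustration, and the example runs on simulated overdispersed counts rather than the bridge data.

```r
# Pearson and deviance dispersion estimates for a fitted Poisson GLM.
# A value near 1 is consistent with the Poisson assumption Var(Y) = E(Y).
dispersion.check <- function(fit) {
  pearson <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
  dev     <- deviance(fit) / df.residual(fit)
  c(Pearson = pearson, Deviance = dev)
}

# Simulated overdispersed counts (negative binomial), not the bridge data.
set.seed(1)
sim <- data.frame(y = rnbinom(200, mu = 50, size = 2))
fit.sim <- glm(y ~ 1, family = poisson(link = "log"), data = sim)
dispersion.check(fit.sim)   # both values should be well above 1
```

Applied to a fitted model such as pois.model.1, the helper should give the same two quantities as the hand-written calculation.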
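# Secondary model: same predictors, with log(Total) as an offset so the
# model describes Queensboro counts relative to the all-bridge total
pois.model.2 = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total),
                   family = poisson(link = "log"), data = bridge.final)
yhat.2 = pois.model.2$fitted.values
# Pearson residuals: (observed - fitted) / sqrt(fitted)
pearson.resid.2 = (bridge.final$QueensboroBridge - yhat.2)/sqrt(yhat.2)
# Dispersion estimates: Pearson chi-square and deviance, each over residual df
Pearson.disp.2 = sum(pearson.resid.2^2)/pois.model.2$df.residual
Deviance.disp.2 = (pois.model.2$deviance)/pois.model.2$df.residual
disp = cbind(Pearson.disp.2 = Pearson.disp.2, Deviance.disp.2 = Deviance.disp.2)
kable(disp, caption="Dispersion parameter for Secondary Model", align = 'c')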
Dispersion parameter for Secondary Model
| Pearson.disp.2 | Deviance.disp.2 |
|:--------------:|:---------------:|
|    7.059385    |    6.980342     |
As seen above, the secondary model, which uses log(Total) as an offset,
produced much better dispersion parameters than the primary model. The
Pearson dispersion for the model is 7.059385, and the deviance
dispersion is 6.980342. While these are still not ideal, the previous
model had dispersion parameters in the 150s, so these single-digit
values clearly show that the second model is the better of the two. To
adjust the standard errors for the remaining overdispersion, we fit a
quasi-Poisson model when summarizing the inferential statistics.
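The standard-error adjustment mentioned above works as follows: a quasi-Poisson fit keeps the Poisson coefficient estimates and multiplies each standard error by the square root of the estimated dispersion. The sketch below checks this on simulated data; the data frame `sim` and its columns `x` and `total` are invented for illustration and are not the bridge data.

```r
# Simulated overdispersed counts with an exposure term (not the bridge data).
set.seed(2)
n     <- 100
x     <- rnorm(n)
total <- sample(500:1500, n, replace = TRUE)            # hypothetical exposure
y     <- rnbinom(n, mu = total * exp(-1 + 0.1 * x), size = 5)
sim   <- data.frame(y, x, total)

fit.pois  <- glm(y ~ x + offset(log(total)), family = poisson,      data = sim)
fit.quasi <- glm(y ~ x + offset(log(total)), family = quasipoisson, data = sim)

phi      <- summary(fit.quasi)$dispersion               # Pearson-based estimate
se.pois  <- summary(fit.pois)$coefficients[, "Std. Error"]
se.quasi <- summary(fit.quasi)$coefficients[, "Std. Error"]

# Coefficients are identical; standard errors are inflated by sqrt(phi).
all.equal(coef(fit.pois), coef(fit.quasi))
all.equal(se.quasi, se.pois * sqrt(phi))
```

With a dispersion of about 7, as in the secondary model, this scaling widens each standard error by a factor of roughly sqrt(7), or about 2.7.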
The next step will be to summarize the inferential statistics in the
table below.
# Same formula and offset as the secondary model, refit as quasi-Poisson
quasi.model = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total),
                  family = quasipoisson, data = bridge.new)
summary(quasi.model)
##
## Call:
## glm(formula = QueensboroBridge ~ Day + AvgTemp + NewPrecip, family = quasipoisson,
## data = bridge.new, offset = log(Total))
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.226290   0.066584 -18.417 1.94e-14 ***
## DayMonday    -0.030650   0.030089  -1.019  0.31997
## DaySaturday  -0.020246   0.031527  -0.642  0.52771
## DaySunday    -0.046129   0.031043  -1.486  0.15215
## DayThursday  -0.001740   0.031739  -0.055  0.95679
## DayTuesday   -0.001009   0.031764  -0.032  0.97495
## DayWednesday -0.016404   0.031554  -0.520  0.60859
## AvgTemp      -0.003966   0.001048  -3.783  0.00109 **
## NewPrecip     0.054764   0.020018   2.736  0.01239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 7.059392)
##
## Null deviance: 412.64 on 29 degrees of freedom
## Residual deviance: 146.59 on 21 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
SE.quasi.pois = summary(quasi.model)$coef
kable(SE.quasi.pois, caption = "Summary statistics of quasi-poisson regression model")
Summary statistics of quasi-poisson regression model
|              |   Estimate | Std. Error |     t value | Pr(>\|t\|) |
|:-------------|-----------:|-----------:|------------:|-----------:|
| (Intercept)  | -1.2262901 |  0.0665841 | -18.4171596 |  0.0000000 |
| DayMonday    | -0.0306504 |  0.0300895 |  -1.0186431 |  0.3199652 |
| DaySaturday  | -0.0202462 |  0.0315274 |  -0.6421772 |  0.5277070 |
| DaySunday    | -0.0461286 |  0.0310428 |  -1.4859659 |  0.1521474 |
| DayThursday  | -0.0017402 |  0.0317393 |  -0.0548269 |  0.9567946 |
| DayTuesday   | -0.0010095 |  0.0317640 |  -0.0317805 |  0.9749472 |
| DayWednesday | -0.0164039 |  0.0315536 |  -0.5198750 |  0.6085879 |
| AvgTemp      | -0.0039665 |  0.0010484 |  -3.7833396 |  0.0010893 |
| NewPrecip    |  0.0547639 |  0.0200177 |   2.7357770 |  0.0123852 |
The estimated dispersion index based on the Pearson residuals is
7.06. In the quasi-Poisson regression above, none of the levels of the
Day variable is statistically significant, with p-values ranging from
0.152 to 0.975. Since none of the levels is significant, we refit the
quasi-Poisson model without the variable Day.
# Final model: Day dropped, offset retained
quasi.model.new = glm(QueensboroBridge ~ AvgTemp + NewPrecip, offset = log(Total),
                      family = quasipoisson, data = bridge.final)
kable(summary(quasi.model.new)$coef,
      caption = "Inferential statistics of the Quasi-Poisson regression coefficients in the final model.")
Inferential statistics of the Quasi-Poisson regression
coefficients in the final model.
|             |   Estimate | Std. Error |    t value | Pr(>\|t\|) |
|:------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | -1.2462434 |  0.0604994 | -20.599270 |  0.0000000 |
| AvgTemp     | -0.0039704 |  0.0009886 |  -4.016345 |  0.0004238 |
| NewPrecip   |  0.0619572 |  0.0171914 |   3.603953 |  0.0012491 |
The new model shown above, refitted without the variable Day, will be
used as the final model. Since the response variable of the model is on
a log scale, and there is additionally an offset of log(Total), the
regression coefficients of the Poisson model do not have a simple or
straightforward interpretation.
In context, the coefficient for NewPrecip is 0.0619572. This is the
estimated Poisson regression coefficient comparing days with and without
precipitation, with AvgTemp held constant. Because of the offset, the
model describes the Queensboro Bridge count relative to Total: the log
of Queensboro's share of total cyclists is expected to be 0.0619572
higher on days with precipitation than on days without precipitation,
holding AvgTemp constant.
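Exponentiating the coefficients moves this interpretation from the log scale to a multiplicative one. The short sketch below does that arithmetic using the two estimates copied from the final model table; the object names `b`, `rate.ratio`, and `pct.change` are illustrative only.

```r
# Coefficients copied from the final quasi-Poisson model table.
b <- c(AvgTemp = -0.0039704, NewPrecip = 0.0619572)

rate.ratio <- exp(b)                  # multiplicative effect on Queensboro's share of Total
pct.change <- 100 * (rate.ratio - 1)  # the same effect as a percentage change
round(cbind(rate.ratio, pct.change), 3)
```

Under this reading, days with precipitation have an expected Queensboro share about 6.4% higher than dry days, and each additional degree of AvgTemp lowers the expected share by about 0.4%.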
Visual Analysis
To conduct this visual analysis, we make one additional breakdown of
the data. A new variable called HighLowTemp is created. This is a binary
categorical variable that equals 0 if AvgTemp is lower than the mean
temperature across the entire month of April and 1 if it is greater than
or equal to the mean temperature. The mean of AvgTemp for the month of
April is 57.23 degrees Fahrenheit, as calculated below.
mean(bridge.new$AvgTemp)
## [1] 57.23167
# HighLowTemp: 1 if AvgTemp is at or above the April mean (57.23), 0 otherwise
for(i in 1:nrow(bridge)) {
  if (bridge.new$AvgTemp[i] >= 57.23) {
    bridge.final$HighLowTemp[i] <- 1
  } else if (bridge.new$AvgTemp[i] < 57.23) {
    bridge.final$HighLowTemp[i] <- 0
  }
}
# Refit of the final model; note that the formula still uses AvgTemp
quasi.model.new = glm(QueensboroBridge ~ AvgTemp + NewPrecip, offset = log(Total),
                      family = quasipoisson, data = bridge.final)
kable(summary(quasi.model.new)$coef,
      caption = "Inferential statistics of the Quasi-Poisson regression coefficients in the final model.")
Inferential statistics of the Quasi-Poisson regression
coefficients in the final model.
|             |   Estimate | Std. Error |    t value | Pr(>\|t\|) |
|:------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | -1.2462434 |  0.0604994 | -20.599270 |  0.0000000 |
| AvgTemp     | -0.0039704 |  0.0009886 |  -4.016345 |  0.0004238 |
| NewPrecip   |  0.0619572 |  0.0171914 |   3.603953 |  0.0012491 |
The refit above still uses AvgTemp rather than HighLowTemp in the
formula, so it reproduces the final model exactly. AvgTemp remains
statistically significant (p-value = 0.0004), and we can move forward
with the visual analysis that is planned.
The goal of the visualization is to show how the explanatory
variables in the final model affect the predicted number of cyclists on
the Queensboro Bridge.
For this visualization, all days are classified into one of the
following two groups, by precipitation.
day.0 = days with no precipitation
day.1 = days with precipitation
Next, we exponentiate the predicted log count of cyclists on the
Queensboro Bridge to obtain the predicted number of cyclists. We then
create a graph of predicted cyclists against average temperature for
each of the groups above.
# Grid of temperatures spanning the observed range of AvgTemp (steps of 1 degree)
temp = range(bridge.final$AvgTemp)[1]:range(bridge.final$AvgTemp)[2]
# Linear predictors from the final model; log(Total) is the day-specific
# offset, so it is a vector with one value per observation, not a constant
log.total = log(bridge.final$Total)
day.0 = -1.2462434 - 0.0039704*temp + log.total
day.1 = -1.2462434 + 0.0619572 - 0.0039704*temp + log.total
# Back-transform to the count scale and plot both groups
plot(temp, exp(day.0), ylim=c(0,6000),
     type = "l",
     col = "red",
     lty = 1,
     ylab = "Queensboro Bridge Cyclists",
     xlab = "Average Temperature",
     main = "Factors That Impact Cyclist Turnout")
lines(temp, exp(day.1), col = "blue", lty = 2)
legend("topleft", c("no precipitation", "precipitation"),
       col=c("red", "blue"), lty=1:2, bty="n", cex=0.8)

The graph does not show a large difference between days with no
precipitation and days with precipitation. Of the two, there does appear
to be slightly lower cyclist turnout on days without precipitation,
which is consistent with the positive NewPrecip coefficient.
Additionally, the graph does not show a consistent increase or decrease
in cyclist turnout as average temperature changes. Instead, the plotted
turnout looks sporadic, with high spikes and low dips. Because the
plotted values include the day-specific offset log(Total), much of this
variation reflects differences in Total from day to day rather than the
effect of temperature alone.
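One way to isolate the temperature and precipitation effects is to hold Total fixed at a single reference value when computing the curves. The sketch below does this with the final-model coefficients; the temperature grid `40:75` and the reference value `total.ref = 20000` are hypothetical placeholders (the mean of bridge.final$Total would be a natural choice), and `expected.count` is a made-up helper name.

```r
# Coefficients copied from the final quasi-Poisson model table.
b0       <- -1.2462434
b.temp   <- -0.0039704
b.precip <-  0.0619572

# Expected Queensboro count for a given temperature, precipitation flag, and Total.
expected.count <- function(temp, precip, total) {
  exp(b0 + b.temp * temp + b.precip * precip + log(total))
}

temp      <- 40:75    # hypothetical temperature grid (degrees Fahrenheit)
total.ref <- 20000    # hypothetical fixed Total, used only for illustration
dry <- expected.count(temp, precip = 0, total = total.ref)
wet <- expected.count(temp, precip = 1, total = total.ref)

plot(temp, dry, type = "l", col = "red", lty = 1, ylim = range(c(dry, wet)),
     xlab = "Average Temperature", ylab = "Expected Queensboro Bridge Cyclists")
lines(temp, wet, col = "blue", lty = 2)
legend("topright", c("no precipitation", "precipitation"),
       col = c("red", "blue"), lty = 1:2, bty = "n", cex = 0.8)
```

With Total held constant, both curves decline smoothly as temperature rises, and the precipitation curve sits above the dry curve by a constant factor of exp(0.0619572).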
General Discussions
It is important to acknowledge the shortcomings of this model,
starting with the dispersion parameters. While a dispersion parameter of
approximately 7 looked great in comparison to the first model's
dispersion parameter of over 150, a value of 7 is still outside the
preferred range. Since getting a dispersion parameter any closer to 1
seems highly unlikely, especially considering that this project had
required variables, realistically a different model should have been
used, most likely a negative binomial. However, since this project
requested that a Poisson model be used, we stuck with the model that had
the least egregious violation of the model assumptions.
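For reference, a negative binomial regression with the same kind of offset could be fit along the following lines. This is a sketch under stated assumptions, not part of the analysis: it uses `glm.nb()` from the MASS package on simulated data, and the data frame `sim` with columns `x` and `total` is invented for illustration.

```r
library(MASS)  # provides glm.nb()

# Simulated overdispersed counts with an exposure term (not the bridge data).
set.seed(3)
n     <- 200
x     <- rnorm(n)
total <- sample(500:1500, n, replace = TRUE)            # hypothetical exposure
y     <- rnbinom(n, mu = total * exp(-1 + 0.1 * x), size = 5)
sim   <- data.frame(y, x, total)

# Negative binomial regression with log(total) entered as an offset.
nb.fit <- glm.nb(y ~ x + offset(log(total)), data = sim)
summary(nb.fit)$coefficients
nb.fit$theta   # estimated size parameter; smaller theta means more overdispersion
```

The negative binomial model estimates the extra variance through theta instead of rescaling the standard errors after the fact, which is why it is a common alternative when the Poisson dispersion is far from 1.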
This project also required the creation and use of two new variables
in our model, AvgTemp and NewPrecip. These new variables were meant to
replace the variables HighTemp, LowTemp and Precipitation, and to be
used in the model building with the variable Day. In our previous
analysis, we saw that some levels of the variable Day had some
statistical significance, and we even talked about making an argument
for grouping them in further analysis. Unfortunately, in this analysis,
the new model with the new variables did not show any level of the
variable Day to be statistically significant, making grouping a moot
point. Since none of the levels were statistically significant, we had
to drop the variable Day, leaving us with only two variables, which is
less than ideal. Analyzing only two variables gives us fewer
opportunities to examine the relationship between other variables and
the response, so ideally it would have been better to use a model that
showed at least some of the levels of Day to be significant so that the
variable did not have to be dropped.
Finally, as mentioned earlier and in the previous report, this is a
subset of a much larger data set. This subset only covers one month for
one bridge, from a data set that spans one year and covers four
different bridges. Without being able to compare this to other sections
of the data set, we are missing crucial information. How does the
Queensboro Bridge compare in cyclist count to other bridges? How does
the month of April compare in cyclist counts to other months? Is the
Queensboro Bridge or the month of April an outlier, with a significantly
higher or lower number of cyclists than other bridges or months? This is
information we cannot know without being able to analyze the full data
set, or at the very least additional subsets of it. For the conclusions
drawn regarding this subset to be meaningful and hold proper importance,
they need to be interpreted in comparison to the larger data set so that
both those conducting the analysis and those reading the report have the
best understanding of the context surrounding the data set.
Overall, while this is not the perfect model and these are not the
perfect circumstances for analysis, we have to draw the best conclusions
we can from the information we have. This model indicates that AvgTemp
and NewPrecip have a statistically significant effect on cyclist turnout
for the Queensboro Bridge in April.
