1 Introduction

This assignment continues the analysis of daily bicycle counts recorded on several New York City bridges, where each count is the number of cyclists entering and leaving a bridge. It fits a dispersed (quasi-) Poisson regression to the cyclist count for one bridge, using day of the week and weather as predictors, and it first modifies some of the variables to better suit that model.

2 Materials

2.1 Data Set

Daily total bike counts were recorded, month by month, on four bridges in New York: the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. The count data was collected by the Traffic Information Management System (TIMS). Each recorded response is the total number of cyclists who entered and left a given bridge in a 24-hour period.

url="https://ChloeWinters79.github.io/STA321/Data/QueensboroBridge.csv"
bridge = read.csv(url, header = TRUE)

Each project was randomly assigned one subset of the data: a single bridge over one calendar month, from the first day of the month to the last. The subset assigned to this project is the Queensboro Bridge for the month of April. The data set is called bridge and consists of 7 variables and 30 observations. When the data set is first read in, the variables are as follows.

  1. Date: (chr) Used as the ID variable, the date that the observation was recorded
  2. Day (chr) The day of the week the observation was recorded
  3. HighTemp (num) The high temperature on the date of the observation
  4. LowTemp (num) The low temperature on the date of the observation
  5. Precipitation (num) The amount of precipitation received on the date of the observation
  6. QueensboroBridge (int) The total amount of cyclists who entered and exited the Queensboro Bridge on the date of the observation
  7. Total (int) The total amount of cyclists who entered and exited any of the four bridges on the date of the observation

3 EDA & Feature Engineering

Before conducting the analysis, some of the variables in the data set need to be modified. First, Day is read in as a character variable, so it is converted to a factor in which each day of the week (“Monday”, “Tuesday”, etc.) is its own level. Next, HighTemp and LowTemp are combined into a new variable called AvgTemp, the average of the high and low temperature. Finally, Precipitation is recoded as a binary variable called NewPrecip, which equals 1 if any precipitation was recorded that day and 0 otherwise. The original HighTemp, LowTemp, and Precipitation columns are then dropped.

bridge.new = bridge
bridge.new$Day = as.factor(bridge$Day)
bridge.new$AvgTemp = (bridge$HighTemp + bridge$LowTemp)/2
bridge.new$NewPrecip <- NA

for(i in 1:nrow(bridge)) {
  if (bridge$Precipitation[i] > 0) {
    bridge.new$NewPrecip[i] <- 1  
  } else if (bridge$Precipitation[i] == 0) {
    bridge.new$NewPrecip[i] <- 0  
  }
}

bridge.final = subset(bridge.new, select = -c(HighTemp, LowTemp, Precipitation))

Now we have our final data set, bridge.final, which still consists of the same 30 observations but now has only 6 variables, which are as follows.

  1. Date: (chr) Used as the ID variable, the date that the observation was recorded
  2. Day (factor) The day of the week the observation was recorded
  3. QueensboroBridge (int) The total number of cyclists who entered and exited the Queensboro Bridge on the date of the observation
  4. Total (int) The total number of cyclists who entered and exited any of the four bridges on the date of the observation
  5. AvgTemp (num) The average of the HighTemp and LowTemp variables
  6. NewPrecip (num) A binary categorical variable: 0 means there was no precipitation on that day, and 1 means there was precipitation on that day

4 Methodology & Analysis

pois.model.1 = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, 
                 family = poisson(link="log"), data =bridge.final)  

yhat.1 = pois.model.1$fitted.values
pearson.resid.1 = (bridge.final$QueensboroBridge - yhat.1)/sqrt(yhat.1)
Pearson.disp.1 = sum(pearson.resid.1^2)/pois.model.1$df.residual

Deviance.disp.1 = (pois.model.1$deviance)/pois.model.1$df.residual

disp = cbind(Pearson.disp.1 = Pearson.disp.1, Deviance.disp.1 = Deviance.disp.1)
kable(disp, caption="Dispersion parameter for Primary Model", align = 'c')
Dispersion parameter for Primary Model

Pearson.disp.1   Deviance.disp.1
154.3321         158.8547

As seen above, the Poisson model has a Pearson dispersion of 154.3321 and a deviance dispersion of 158.8547. Under the Poisson assumption that the variance equals the mean, both values should be close to 1, so this model is severely overdispersed. Because the assumption is so heavily violated, a different model is needed. Following the models used in the previous assignment, the next model accounts for the total number of cyclists on all four bridges that day.

pois.model.2 = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total), 
                   family = poisson(link = "log"), data = bridge.final)  

yhat.2 = pois.model.2$fitted.values
pearson.resid.2 = (bridge.final$QueensboroBridge - yhat.2)/sqrt(yhat.2)
Pearson.disp.2 = sum(pearson.resid.2^2)/pois.model.2$df.residual

Deviance.disp.2 = (pois.model.2$deviance)/pois.model.2$df.residual

disp = cbind(Pearson.disp.2 = Pearson.disp.2, Deviance.disp.2 = Deviance.disp.2)
kable(disp, caption="Dispersion parameter for Secondary Model", align = 'c')
Dispersion parameter for Secondary Model

Pearson.disp.2   Deviance.disp.2
7.059385         6.980342

As seen above, the secondary model, which includes log(Total) as an offset, produces much better dispersion estimates than the primary model: the Pearson dispersion is 7.059385 and the deviance dispersion is 6.980342. These values are still well above 1, but compared with values in the 150s for the previous model they show that the second model is the better of the two. To correct the standard errors for the remaining overdispersion, we fit a quasi-Poisson model when summarizing the inferential statistics.

The next step will be to summarize the inferential statistics in the table below.

quasi.model = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total),
                 family = quasipoisson, data =bridge.new)  
summary(quasi.model)
## 
## Call:
## glm(formula = QueensboroBridge ~ Day + AvgTemp + NewPrecip, family = quasipoisson, 
##     data = bridge.new, offset = log(Total))
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.226290   0.066584 -18.417 1.94e-14 ***
## DayMonday    -0.030650   0.030089  -1.019  0.31997    
## DaySaturday  -0.020246   0.031527  -0.642  0.52771    
## DaySunday    -0.046129   0.031043  -1.486  0.15215    
## DayThursday  -0.001740   0.031739  -0.055  0.95679    
## DayTuesday   -0.001009   0.031764  -0.032  0.97495    
## DayWednesday -0.016404   0.031554  -0.520  0.60859    
## AvgTemp      -0.003966   0.001048  -3.783  0.00109 ** 
## NewPrecip     0.054764   0.020018   2.736  0.01239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.059392)
## 
##     Null deviance: 412.64  on 29  degrees of freedom
## Residual deviance: 146.59  on 21  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3
SE.quasi.pois = summary(quasi.model)$coef
kable(SE.quasi.pois, caption = "Summary statistics of quasi-poisson regression model")
Summary statistics of quasi-poisson regression model

Term           Estimate     Std. Error   t value       Pr(>|t|)
(Intercept)    -1.2262901   0.0665841    -18.4171596   0.0000000
DayMonday      -0.0306504   0.0300895    -1.0186431    0.3199652
DaySaturday    -0.0202462   0.0315274    -0.6421772    0.5277070
DaySunday      -0.0461286   0.0310428    -1.4859659    0.1521474
DayThursday    -0.0017402   0.0317393    -0.0548269    0.9567946
DayTuesday     -0.0010095   0.0317640    -0.0317805    0.9749472
DayWednesday   -0.0164039   0.0315536    -0.5198750    0.6085879
AvgTemp        -0.0039665   0.0010484    -3.7833396    0.0010893
NewPrecip       0.0547639   0.0200177     2.7357770    0.0123852

The estimated dispersion index based on the Pearson residuals is 7.06. In the quasi-Poisson regression above, none of the levels of Day differs significantly from the reference level (Friday), with p-values ranging from 0.15215 to 0.97495. Since no level is statistically significant, we refit the quasi-Poisson model without the variable Day.

quasi.model.new = glm(QueensboroBridge ~ AvgTemp + NewPrecip, offset = log(Total),
                 family = quasipoisson, data =bridge.final)
kable(summary(quasi.model.new)$coef, caption = "Inferential statistics of 
the Quasi-Poisson regression coefficients in the final model.")
Inferential statistics of the Quasi-Poisson regression coefficients in the final model.

Term          Estimate     Std. Error   t value      Pr(>|t|)
(Intercept)   -1.2462434   0.0604994    -20.599270   0.0000000
AvgTemp       -0.0039704   0.0009886    -4.016345    0.0004238
NewPrecip      0.0619572   0.0171914     3.603953    0.0012491

The refitted model above, without the variable Day, is used as the final model. Because the model uses a log link and includes log(Total) as an offset, its coefficients describe changes in the log of the Queensboro Bridge share of total cyclists rather than changes in the raw count, so interpreting them takes some care.

In context, the coefficient for NewPrecip is 0.0619572. Holding AvgTemp constant, the log of the ratio of Queensboro Bridge cyclists to total cyclists is expected to be 0.0619572 higher on days with precipitation than on days without. Exponentiating gives exp(0.0619572) ≈ 1.064, so the Queensboro Bridge share of all cyclists is estimated to be about 6.4% higher on days with precipitation.

4.1 Visual Analysis

For the visual analysis, a new variable called HighLowTemp is created. It is a binary categorical variable that equals 0 if AvgTemp is below the mean temperature for the month of April and 1 if it is equal to or above the mean. The mean of AvgTemp for April, computed below, is 57.23 degrees Fahrenheit.

mean(bridge.new$AvgTemp)
## [1] 57.23167
for(i in 1:nrow(bridge)) {
  if (bridge.new$AvgTemp[i] >= 57.23) {
    bridge.final$HighLowTemp[i] <- 1  
  } else if (bridge.new$AvgTemp[i] < 57.23) {
    bridge.final$HighLowTemp[i] <- 0  
  }
}
quasi.model.new = glm(QueensboroBridge ~ AvgTemp + NewPrecip, offset = log(Total),
                 family = quasipoisson, data =bridge.final)
kable(summary(quasi.model.new)$coef, caption = "Inferential statistics of 
the Quasi-Poisson regression coefficients in the final model.")
Inferential statistics of the Quasi-Poisson regression coefficients in the final model.

Term          Estimate     Std. Error   t value      Pr(>|t|)
(Intercept)   -1.2462434   0.0604994    -20.599270   0.0000000
AvgTemp       -0.0039704   0.0009886    -4.016345    0.0004238
NewPrecip      0.0619572   0.0171914     3.603953    0.0012491

Note that the model refit above still uses the continuous AvgTemp rather than HighLowTemp, so its coefficients are identical to those of the final model. AvgTemp remains statistically significant, and these coefficients are used in the visual analysis that follows.

The goal of the visualization is to show how the explanatory variables in the final model impact the actual number of cyclists on the Queensboro Bridge.

For this visualization, all days are classified into one of the following two groups by precipitation.

day.0 = days with no precipitation

day.1 = days with precipitation

Next, the linear predictor, which is on the log scale, is exponentiated to obtain the expected number of cyclists on the Queensboro Bridge. The graph then shows how the expected number of cyclists varies with average temperature in each of the two groups above.

temp =  range(bridge.final$AvgTemp)[1]:range(bridge.final$AvgTemp)[2]
day.0 = -1.2462434 - 0.0039704*temp + (offset = log(bridge.final$Total))
day.1 = -1.2462434 + 0.0619572  - 0.0039704*temp + (offset = log(bridge.final$Total))

##
plot(temp, exp(day.0), ylim=c(0,6000),
     type = "l",
     col = "red",
     lty = 1,
     ylab = "Queensboro Bridge Cyclists",
     xlab = "Average Temperature",
     main = "Factors That Impact Cyclist Turnout")
lines(temp, exp(day.1), col = "blue", lty = 2)
legend("topleft", c("no precipitation", "precipitation"), 
       col=c("red", "blue"),  lty=1:2, bty="n", cex=0.8)

The graph shows no large difference between days with and without precipitation, although the curve for days without precipitation is slightly lower. The graph also shows no consistent increase or decrease in cyclist turnout as average temperature changes. Instead, the curves are jagged, with high spikes and low dips. Because each plotted value includes a different day's log(Total) as its offset, this jaggedness mostly reflects day-to-day variation in total traffic rather than an effect of temperature.

5 Results & Conclusion

The best model for the number of cyclists on the Queensboro Bridge is a quasi-Poisson model with the log of total cyclists as an offset and NewPrecip and AvgTemp as predictor variables. The variable Day was dropped because none of its levels was statistically significant. Both remaining predictors are statistically significant, but their estimated effects are small: the fitted curves for days with and without precipitation lie close together, and the plotted relationship between AvgTemp and the Queensboro Bridge count is dominated by day-to-day variation in total traffic rather than by a clear temperature trend.

6 General Discussions

It is important to acknowledge the shortcomings of this model, starting with the dispersion parameters. While a dispersion parameter of approximately 7 looks good compared with the first model's value of over 150, it is still well outside the preferred range. Since the set of variables was fixed by the project requirements, getting the dispersion much closer to 1 seems unlikely, and a different model, most likely a negative binomial, would realistically have been a better choice. However, since this project required a Poisson model, we kept the model that violated the assumptions least.

This project also required the creation and use of two new variables, AvgTemp and NewPrecip. These variables were meant to replace HighTemp, LowTemp, and Precipitation, and to be used in model building together with the variable Day. In our previous analysis, some levels of Day showed statistical significance, and we discussed the case for grouping them in further analysis. In this analysis, none of the levels of Day was statistically significant in the model with the new variables, which makes grouping a moot point. We therefore dropped Day, leaving only two predictors, which is less than ideal. Analyzing only two variables gives us fewer opportunities to study how other factors relate to the response, so a model in which at least some levels of Day were significant would have been preferable.

Finally, as mentioned earlier and in the previous report, this is a subset of a much larger data set. It covers one month for one bridge, out of a data set that spans one year and four bridges. Without comparing this subset with the rest of the data, we are missing crucial information. How does the Queensboro Bridge compare in cyclist count with the other bridges? How does April compare with other months? Is the Queensboro Bridge or the month of April an outlier, with significantly more or fewer cyclists than other bridges or months? We cannot answer these questions without analyzing the full data set, or at the very least additional subsets of it. For the conclusions drawn from this subset to be meaningful, they need to be interpreted against the larger data set, so that both those conducting the analysis and those reading the report understand the context surrounding the data.

Overall, while neither the model nor the circumstances of the analysis are perfect, we have to draw the best conclusions we can from the information we have. This model indicates that AvgTemp and NewPrecip have a statistically significant association with cyclist turnout on the Queensboro Bridge in April.

---
title: "Dispersed Poisson Regression for Cyclist Count"
author: "Chloé Winters"
date: "11/10/2024"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
  word_document:
    toc: yes
    toc_depth: '4'
---
```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}

#
# specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,      
                      warning = FALSE,   
                      message = FALSE,  
                      results  = TRUE     
                      )   

library(knitr)
library(pander)
library(mlbench)
library(MASS)
library(openxlsx)
```

# Introduction

This assignment continues the analysis of daily bicycle counts recorded on several New York City bridges, where each count is the number of cyclists entering and leaving a bridge. It fits a dispersed (quasi-) Poisson regression to the cyclist count for one bridge, using day of the week and weather as predictors, and it first modifies some of the variables to better suit that model.



# Materials

## Data Set

Daily total bike counts were recorded, month by month, on four bridges in New York: the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. The count data was collected by the Traffic Information Management System (TIMS). Each recorded response is the total number of cyclists who entered and left a given bridge in a 24-hour period.


```{r}
url="https://ChloeWinters79.github.io/STA321/Data/QueensboroBridge.csv"
bridge = read.csv(url, header = TRUE)
```


Each project was randomly assigned one subset of the data: a single bridge over one calendar month, from the first day of the month to the last. The subset assigned to this project is the Queensboro Bridge for the month of April. The data set is called bridge and consists of 7 variables and 30 observations. When the data set is first read in, the variables are as follows.

1.  Date: (chr) Used as the ID variable, the date that the observation was recorded
2.  Day (chr) The day of the week the observation was recorded
3.  HighTemp (num) The high temperature on the date of the observation
4.  LowTemp (num) The low temperature on the date of the observation
5.  Precipitation (num) The amount of precipitation received on the date of the observation
6.  QueensboroBridge (int) The total amount of cyclists who entered and exited the Queensboro Bridge on the date of the observation
7. Total (int) The total amount of cyclists who entered and exited any of the four bridges on the date of the observation

# EDA & Feature Engineering

Before conducting the analysis, some of the variables in the data set need to be modified. First, Day is read in as a character variable, so it is converted to a factor in which each day of the week ("Monday", "Tuesday", etc.) is its own level. Next, HighTemp and LowTemp are combined into a new variable called AvgTemp, the average of the high and low temperature. Finally, Precipitation is recoded as a binary variable called NewPrecip, which equals 1 if any precipitation was recorded that day and 0 otherwise. The original HighTemp, LowTemp, and Precipitation columns are then dropped.

```{r}
bridge.new = bridge
bridge.new$Day = as.factor(bridge$Day)
bridge.new$AvgTemp = (bridge$HighTemp + bridge$LowTemp)/2
bridge.new$NewPrecip <- NA

for(i in 1:nrow(bridge)) {
  if (bridge$Precipitation[i] > 0) {
    bridge.new$NewPrecip[i] <- 1  
  } else if (bridge$Precipitation[i] == 0) {
    bridge.new$NewPrecip[i] <- 0  
  }
}

bridge.final = subset(bridge.new, select = -c(HighTemp, LowTemp, Precipitation))
```
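
The recoding steps above can also be written without an explicit loop. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the bridge data), assuming Precipitation is numeric with no missing values:

```r
# Hypothetical toy data standing in for the bridge data frame
toy <- data.frame(
  Day           = c("Monday", "Tuesday", "Wednesday"),
  HighTemp      = c(60, 55, 70),
  LowTemp       = c(40, 45, 50),
  Precipitation = c(0, 0.25, 0)
)

toy$Day       <- as.factor(toy$Day)                 # character -> factor
toy$AvgTemp   <- (toy$HighTemp + toy$LowTemp) / 2   # daily average temperature
toy$NewPrecip <- as.integer(toy$Precipitation > 0)  # 1 = any precipitation, 0 = none

toy[, c("Day", "AvgTemp", "NewPrecip")]
```

The comparison `Precipitation > 0` is evaluated for all rows at once, so no pre-allocated column or `if`/`else` branch is needed.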

Now we have our final data set, bridge.final, which still consists of the same 30 observations but now has only 6 variables, which are as follows.

1.  Date: (chr) Used as the ID variable, the date that the observation was recorded
2.  Day (factor) The day of the week the observation was recorded
3.  QueensboroBridge (int) The total amount of cyclists who entered and exited the Queensboro Bridge on the date of the observation
4. Total (int) The total amount of cyclists who entered and exited any of the four bridges on the date of the observation
5. AvgTemp (num) The average of the HighTemp and LowTemp variables. 
6. NewPrecip (int) This is a binary categorical variable, if NewPrecip equals 0 that means there was no precipitation on that day. If NewPrecip equals 1 that means there was precipitation on that day.

# Methodology & Analysis

```{r}
pois.model.1 = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, 
                 family = poisson(link="log"), data =bridge.final)  

yhat.1 = pois.model.1$fitted.values
pearson.resid.1 = (bridge.final$QueensboroBridge - yhat.1)/sqrt(yhat.1)
Pearson.disp.1 = sum(pearson.resid.1^2)/pois.model.1$df.residual

Deviance.disp.1 = (pois.model.1$deviance)/pois.model.1$df.residual

disp = cbind(Pearson.disp.1 = Pearson.disp.1, Deviance.disp.1 = Deviance.disp.1)
kable(disp, caption="Dispersion parameter for Primary Model", align = 'c')
```


As seen above, the Poisson model has a Pearson dispersion of 154.3321 and a deviance dispersion of 158.8547. Under the Poisson assumption that the variance equals the mean, both values should be close to 1, so this model is severely overdispersed. Because the assumption is so heavily violated, a different model is needed. Following the models used in the previous assignment, the next model accounts for the total number of cyclists on all four bridges that day.
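
To illustrate how the two dispersion estimates are obtained, here is a minimal sketch on simulated overdispersed counts (the data, `x`, `y`, and `fit` are hypothetical and are not the bridge counts). It uses the extractor functions `residuals()`, `deviance()`, and `df.residual()` in place of the hand-computed residuals:

```r
set.seed(1)
# Simulated overdispersed counts (hypothetical data, not the bridge counts)
x <- runif(100, 40, 75)
y <- rnbinom(100, mu = exp(5 + 0.01 * x), size = 2)

fit <- glm(y ~ x, family = poisson(link = "log"))

# Pearson dispersion: sum of squared Pearson residuals over residual df
pearson.disp  <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
# Deviance dispersion: residual deviance over residual df
deviance.disp <- deviance(fit) / df.residual(fit)

c(Pearson = pearson.disp, Deviance = deviance.disp)
```

Both values come out far above 1 here because the counts were simulated with more variance than a Poisson distribution allows.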

```{r}
pois.model.2 = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total), 
                   family = poisson(link = "log"), data = bridge.final)  

yhat.2 = pois.model.2$fitted.values
pearson.resid.2 = (bridge.final$QueensboroBridge - yhat.2)/sqrt(yhat.2)
Pearson.disp.2 = sum(pearson.resid.2^2)/pois.model.2$df.residual

Deviance.disp.2 = (pois.model.2$deviance)/pois.model.2$df.residual

disp = cbind(Pearson.disp.2 = Pearson.disp.2, Deviance.disp.2 = Deviance.disp.2)
kable(disp, caption="Dispersion parameter for Secondary Model", align = 'c')
```

As seen above, the secondary model, which includes log(Total) as an offset, produces much better dispersion estimates than the primary model: the Pearson dispersion is 7.059385 and the deviance dispersion is 6.980342. These values are still well above 1, but compared with values in the 150s for the previous model they show that the second model is the better of the two. To correct the standard errors for the remaining overdispersion, we fit a quasi-Poisson model when summarizing the inferential statistics.
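
Written out, the offset model has the standard log-link form sketched below. The symbols are introduced here for illustration only: $\mu_i$ is the expected Queensboro Bridge count on day $i$, $T_i$ is that day's Total, and $\mathbf{x}_i$ holds the predictors.

```latex
\log(\mu_i) = \log(T_i) + \mathbf{x}_i^\top \boldsymbol{\beta}
\quad\Longleftrightarrow\quad
\log\!\left(\frac{\mu_i}{T_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}
```

The offset enters with its coefficient fixed at 1, so the regression coefficients describe the Queensboro Bridge share of total cyclists rather than the raw count.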

The next step will be to summarize the inferential statistics in the table below. 

```{r}
quasi.model = glm(QueensboroBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total),
                 family = quasipoisson, data =bridge.new)  
summary(quasi.model)
```
```{r}
SE.quasi.pois = summary(quasi.model)$coef
kable(SE.quasi.pois, caption = "Summary statistics of quasi-poisson regression model")
```

The estimated dispersion index based on the Pearson residuals is 7.06.
In the quasi-Poisson regression above, none of the levels of Day differs significantly from the reference level (Friday), with p-values ranging from 0.15215 to 0.97495. Since no level is statistically significant, we refit the quasi-Poisson model without the variable Day.
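
The individual p-values compare each day with the reference level only. One way to check the factor as a whole, which the report does not do, is an F test between the nested models with and without it. A minimal sketch on simulated data (all names and values are hypothetical) could look like this:

```r
set.seed(2)
# Simulated data (hypothetical): 60 days, a 7-level day factor with no true effect
n     <- 60
day   <- factor(rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), length.out = n))
temp  <- runif(n, 40, 75)
total <- round(runif(n, 10000, 25000))
y     <- rnbinom(n, mu = total * exp(-1.2 - 0.004 * temp), size = 50)

full    <- glm(y ~ day + temp, offset = log(total), family = quasipoisson)
reduced <- glm(y ~ temp,       offset = log(total), family = quasipoisson)

# F test comparing the two nested quasi-Poisson fits: tests all day levels jointly
day.test <- anova(reduced, full, test = "F")
day.test
```

The F test is used here because the quasi-Poisson family estimates its dispersion from the data.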

```{r}
quasi.model.new = glm(QueensboroBridge ~ AvgTemp + NewPrecip, offset = log(Total),
                 family = quasipoisson, data =bridge.final)
kable(summary(quasi.model.new)$coef, caption = "Inferential statistics of 
the Quasi-Poisson regression coefficients in the final model.")
```

The refitted model above, without the variable Day, is used as the final model. Because the model uses a log link and includes log(Total) as an offset, its coefficients describe changes in the log of the Queensboro Bridge share of total cyclists rather than changes in the raw count, so interpreting them takes some care.

In context, the coefficient for NewPrecip is 0.0619572. Holding AvgTemp constant, the log of the ratio of Queensboro Bridge cyclists to total cyclists is expected to be 0.0619572 higher on days with precipitation than on days without. Exponentiating gives exp(0.0619572) ≈ 1.064, so the Queensboro Bridge share of all cyclists is estimated to be about 6.4% higher on days with precipitation.
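
Coefficients on the log scale are often easier to read after exponentiating them. The sketch below does this for the two estimates reported in the final-model table; the object names `beta`, `rate.ratio`, and `pct.change` are introduced here for illustration:

```r
# Coefficients reported in the final-model table above
beta <- c(AvgTemp = -0.0039704, NewPrecip = 0.0619572)

rate.ratio <- exp(beta)               # multiplicative change in Queensboro's share
pct.change <- 100 * (rate.ratio - 1)  # the same change expressed in percent

round(cbind(rate.ratio, pct.change), 4)
```

Under these estimates, each additional degree of average temperature corresponds to a decrease of roughly 0.4% in the Queensboro Bridge share of total cyclists, with NewPrecip held constant.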


## Visual Analysis 

For the visual analysis, a new variable called HighLowTemp is created. It is a binary categorical variable that equals 0 if AvgTemp is below the mean temperature for the month of April and 1 if it is equal to or above the mean. The mean of AvgTemp for April, computed below, is 57.23 degrees Fahrenheit.

```{r}
mean(bridge.new$AvgTemp)

for(i in 1:nrow(bridge)) {
  if (bridge.new$AvgTemp[i] >= 57.23) {
    bridge.final$HighLowTemp[i] <- 1  
  } else if (bridge.new$AvgTemp[i] < 57.23) {
    bridge.final$HighLowTemp[i] <- 0  
  }
}

```

```{r}
quasi.model.new = glm(QueensboroBridge ~ AvgTemp + NewPrecip, offset = log(Total),
                 family = quasipoisson, data =bridge.final)
kable(summary(quasi.model.new)$coef, caption = "Inferential statistics of 
the Quasi-Poisson regression coefficients in the final model.")

```


Note that the model refit above still uses the continuous AvgTemp rather than HighLowTemp, so its coefficients are identical to those of the final model. AvgTemp remains statistically significant, and these coefficients are used in the visual analysis that follows.
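
For comparison, a model that actually uses the binary temperature indicator would put HighLowTemp in the formula. The following is a sketch on simulated data only (`sim` and its values are hypothetical stand-ins for bridge.final), so its output says nothing about the real bridge counts:

```r
set.seed(3)
# Simulated stand-in for bridge.final (hypothetical values)
sim <- data.frame(
  AvgTemp   = runif(30, 40, 75),
  NewPrecip = rbinom(30, 1, 0.3),
  Total     = round(runif(30, 10000, 25000))
)
sim$QueensboroBridge <- rnbinom(30, mu = sim$Total * exp(-1.2 - 0.004 * sim$AvgTemp),
                                size = 50)

# 1 = at or above the monthly mean of AvgTemp, 0 = below it
sim$HighLowTemp <- as.integer(sim$AvgTemp >= mean(sim$AvgTemp))

# Same structure as the final model, with HighLowTemp in place of AvgTemp
fit.hl <- glm(QueensboroBridge ~ HighLowTemp + NewPrecip, offset = log(Total),
              family = quasipoisson, data = sim)
summary(fit.hl)$coef
```

Computing the threshold with `mean(sim$AvgTemp)` avoids hard-coding a rounded value such as 57.23.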

The goal of the visualization is to show how the explanatory variables in the final model impact the actual number of cyclists on the Queensboro Bridge. 

For this visualization, all days are classified into one of the following two groups by precipitation.

day.0 = days with no precipitation

day.1 = days with precipitation

Next, the linear predictor, which is on the log scale, is exponentiated to obtain the expected number of cyclists on the Queensboro Bridge. The graph then shows how the expected number of cyclists varies with average temperature in each of the two groups above.

```{r fig.align='center', fig.width=6, fig.height=4}
temp =  range(bridge.final$AvgTemp)[1]:range(bridge.final$AvgTemp)[2]
day.0 = -1.2462434 - 0.0039704*temp + (offset = log(bridge.final$Total))
day.1 = -1.2462434 + 0.0619572  - 0.0039704*temp + (offset = log(bridge.final$Total))

##
plot(temp, exp(day.0), ylim=c(0,6000),
     type = "l",
     col = "red",
     lty = 1,
     ylab = "Queensboro Bridge Cyclists",
     xlab = "Average Temperature",
     main = "Factors That Impact Cyclist Turnout")
lines(temp, exp(day.1), col = "blue", lty = 2)
legend("topleft", c("no precipitation", "precipitation"), 
       col=c("red", "blue"),  lty=1:2, bty="n", cex=0.8)
```
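
In the chunk above, each plotted value picks up a different day's log(Total), so the curves mix the temperature effect with daily variation in total traffic. One alternative, sketched here under assumptions, holds Total fixed at a single reference value. The value `total.ref = 20000` and the temperature grid are hypothetical choices for illustration, while the coefficients are those reported in the final-model table:

```r
# Coefficients from the final-model table; total.ref is a hypothetical
# reference value for Total (an assumption, e.g. a typical daily total)
b0 <- -1.2462434; b.temp <- -0.0039704; b.precip <- 0.0619572
total.ref <- 20000

temp  <- seq(40, 75, by = 1)   # assumed temperature grid in degrees Fahrenheit
day.0 <- exp(b0 + b.temp * temp + log(total.ref))             # no precipitation
day.1 <- exp(b0 + b.precip + b.temp * temp + log(total.ref))  # precipitation

plot(temp, day.0, type = "l", col = "red", lty = 1,
     ylim = range(c(day.0, day.1)),
     xlab = "Average Temperature", ylab = "Expected Queensboro Bridge Cyclists",
     main = "Expected Cyclists at a Fixed Total")
lines(temp, day.1, col = "blue", lty = 2)
legend("topright", c("no precipitation", "precipitation"),
       col = c("red", "blue"), lty = 1:2, bty = "n", cex = 0.8)
```

With Total held constant, both curves are smooth and differ only by the constant factor implied by the NewPrecip coefficient.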


The graph shows no large difference between days with and without precipitation, although the curve for days without precipitation is slightly lower. The graph also shows no consistent increase or decrease in cyclist turnout as average temperature changes. Instead, the curves are jagged, with high spikes and low dips. Because each plotted value includes a different day's log(Total) as its offset, this jaggedness mostly reflects day-to-day variation in total traffic rather than an effect of temperature.


# Results & Conclusion

The best model for the number of cyclists on the Queensboro Bridge is a quasi-Poisson model with the log of total cyclists as an offset and NewPrecip and AvgTemp as predictor variables. The variable Day was dropped because none of its levels was statistically significant. Both remaining predictors are statistically significant, but their estimated effects are small: the fitted curves for days with and without precipitation lie close together, and the plotted relationship between AvgTemp and the Queensboro Bridge count is dominated by day-to-day variation in total traffic rather than by a clear temperature trend.

# General Discussions

It is important to acknowledge the shortcomings of this model, starting with the dispersion parameters. While a dispersion parameter of approximately 7 looks good compared with the first model's value of over 150, it is still well outside the preferred range. Since the set of variables was fixed by the project requirements, getting the dispersion much closer to 1 seems unlikely, and a different model, most likely a negative binomial, would realistically have been a better choice. However, since this project required a Poisson model, we kept the model that violated the assumptions least.
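
As a rough illustration of the negative binomial alternative mentioned above, the sketch below fits one with `glm.nb()` from the MASS package on simulated data. The data and object names are hypothetical, and this model was not fitted to the bridge data in this report:

```r
library(MASS)  # provides glm.nb()

set.seed(4)
# Simulated overdispersed counts with an exposure total (hypothetical data)
n     <- 200
temp  <- runif(n, 40, 75)
total <- round(runif(n, 10000, 25000))
y     <- rnbinom(n, mu = total * exp(-1.2 - 0.004 * temp), size = 20)

# Negative binomial regression with the same log(total) offset
nb.fit <- glm.nb(y ~ temp + offset(log(total)))
nb.fit$theta   # estimated size parameter; smaller values mean more overdispersion
coef(nb.fit)
```

The offset is written inside the formula with `offset()`, which keeps the exposure term attached to the model specification.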

This project also required the creation and use of two new variables, AvgTemp and NewPrecip. These variables were meant to replace HighTemp, LowTemp, and Precipitation, and to be used in model building together with the variable Day. In our previous analysis, some levels of Day showed statistical significance, and we discussed the case for grouping them in further analysis. In this analysis, none of the levels of Day was statistically significant in the model with the new variables, which makes grouping a moot point. We therefore dropped Day, leaving only two predictors, which is less than ideal. Analyzing only two variables gives us fewer opportunities to study how other factors relate to the response, so a model in which at least some levels of Day were significant would have been preferable.

Finally, as mentioned earlier and in the previous report, this is a subset of a much larger data set. It covers one month for one bridge, out of a data set that spans one year and four bridges. Without comparing this subset with the rest of the data, we are missing crucial information. How does the Queensboro Bridge compare in cyclist count with the other bridges? How does April compare with other months? Is the Queensboro Bridge or the month of April an outlier, with significantly more or fewer cyclists than other bridges or months? We cannot answer these questions without analyzing the full data set, or at the very least additional subsets of it. For the conclusions drawn from this subset to be meaningful, they need to be interpreted against the larger data set, so that both those conducting the analysis and those reading the report understand the context surrounding the data.

Overall, while neither the model nor the circumstances of the analysis are perfect, we have to draw the best conclusions we can from the information we have. This model indicates that AvgTemp and NewPrecip have a statistically significant association with cyclist turnout on the Queensboro Bridge in April.