In this assignment you will implement simple linear regression and analyze multiple linear regression. Check out Seeing Theory for help understanding these concepts!
You can click here to get the stencil code for Homework 4. Reference this guide for more information about Github and Github Classroom.
The data is located in the data folder. To ensure compatibility with the autograder, you should not
modify the stencil unless instructed otherwise. For this assignment, please write your solutions in
the respective .py
files. Failing to do so may hinder with the autograder and result in
a low grade.
Write Code: Use whatever editer or IDE you like.
Execute Code: First,
ssh
into the department machine by running
ssh [cs login]@ssh.cs.brown.edu
and typing your password when
prompted. Then, navigate to
the assignment directory and activate the
course virtual environment by running
source /course/cs1951a/venv/bin/activate
. You can now run your code for the
assignment. To deactivate this virtual environment,
simply type deactivate
.
Python requirements: Python 3.7.x. Our Gradescope Autograder uses Python 3.7.9. Using some other version of Python might lead to issues installing the dependencies for the assignment.
Virtual environment:
create_venv.sh
script in Homework 1, activate your virtual environment using cs1951a_venv
or
source ~/cs1951a_venv/bin/activate
anywhere from your terminal. If you have
installed your virtual environment elsewhere, activate your environment with
source PATH_TO_YOUR_ENVIRONMENT/bin/activate
.
Write Code: Use whatever editor or IDE you like.
Execute Code: Activate the virtual environment and run your program in the command line.
You will modify util.py
, simple_regression.py
, and multiple_regression.py
using
bike_sharing.csv
. You will be implementing most parts of the regressions from scratch.
Here are some restrictions on the use of Python libraries for this assignment:
written_questions.txt
with your answers to the questions
for Part 3.
Regression is used to predict numerical values given inputs. Linear regression attempts to model the line of best fit that explains the relationship between two or more variables by fitting a linear equation to observed data. One or more variables are considered to be the independent (or explanatory) variables, and one variable is considered to be a dependent (or response) variable. In simple linear regression, there is one independent variable and in multiple linear regression, there are two or more independent variables.
Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a linear relationship between the variables of interest. This does not necessarily imply that one variable causes the other. Remember, correlation does not necessarily imply causation. For example, moderate alcohol consumption is correlated to longevity. This doesn’t necessarily mean that moderate alcohol consumption causes longevity. Another independent characteristic of moderate drinkers (extrovert lifestyle) might be responsible for longevity. Regardless, there is some association between the two variables. The goal in linear regression is to tease apart whether a correlation is likely to be a causal relationship, by controlling for these other variables/characteristics that might be confounding the analysis.
In simple linear regression, scatter plot can be a helpful tool in visually determining the strength of the relationship between two variables. The independent variable is typically plotted on the x-axis and the dependent variable is typically plotted on the y-axis. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatter plot does not indicate any clear increasing or decreasing trends), then fitting a linear regression model to the data probably will likely not provide a useful model. (Note this is not always the case--just as confounding variables might lead a correlation to appear stronger than it is, it might also lead a relationship to appear weaker than it is.)
Shown in the picture below, in simple regression, we have a single feature x and weights w0 for the intercept and w1 for slope of the line. Our goal is to find the line that minimizes the vertical offsets, otherwise known as residuals. In other words, we define the best-fitting line as the line that minimizes the residual sum of squares (SSR or RSS) between our target variable y and our predicted output over all samples i in our training examples n.
A valuable numerical measure of association between two variables is the correlation coefficient , which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables. A simple linear regression line has an equation of the form Y = a + bX, where X is the independent (explanatory) variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0). A multiple linear regression line has an equation of the form Y = a + b_1X_1 + b_2 X_2 + … + b_n X_n for n independent variables.
Another useful metric is the R-squared value. This tells us how much of the variation in Y can be explained by the variation in X. The value for R-squared ranges from 0 to 1, and the closer to 1, the greater proportion of the variability in Y is explained by the variability in X. For more explanation on R-squared and how to calculate it, read here (this will be very helpful in Part 2!)
Do not confuse correlation and regression. Correlation is a numerical value that quantifies the degree to which two variables are related. Regression is a type of predictive analysis that uses a best fit line (Y = a + bX) that predicts Y given the value of X).
Inside the data folder, bike_sharing.csv
contains data of a two-year historical log corresponding to years 2011
and 2012
from Capital Bikeshare system, Washington D.C., USA. There are 11 different independent variables that
you are attempting to predict
and one dependent variable (the Count of Use of Bike). More information on the dataset can be found in
README_data.txt
.
The data functions described below are in util.py
and
can be used for both simple and multilple linear regressions.
In util.py
, first implement load_data
, which should load the data into X, and y variables using pandas.
Make sure to load the correct dependent and independent variables when implementing and calling this function.
Then, fill in the functions split_data_randomly
and train_test_split
to randomly split the data
into train and test datasets.
You should create the following variables:X_train
, X_test
,
y_train
, y_test
.
A common convention in data science is to make 80% of the data the training data and 20% the test data.
In train_test_split
, we have provided test_pct = 0.2
; use this in your split_data_randomly
function to obtain this 80/20 split.
You will train your model on the training data and test it on the test data.
Note that we expect you
to distribute the data randomly, look into
how you can use Python random.random to do
this. In simple_regression.py
and multiple_regression.py
, we set random.seed(1)
to guarrantee that we obtain the same
split on every run. Please do not change this.
Lastly, when splitting data make sure that X,y pairs remain intact, that is, every X and its corresponding Y should both be in the same list(either test or train) and at the same index to make sure your model is accurate. To do this, we recommend using Python zip to form pairs before splitting the data into train and test sets.
Note: calculate_r_squared
function is for the next section, and you don't need to implement it at this stage.
The management team of Capital Bikeshare system wants to know how much weather conditions can
predict bikesharing use. Please open simple_regression.py
and implement simple
linear regression model that predicts the number of total rental bikes per day (cnt) given the
weather conditions (weathersit).
Fill in the helper functions for mean, variance and covariance. When calculating variance and covariance, you should use (n-1) as the denominator to make the estimator
unbiased. (Note: n is the sample size.)
You cannot use numpy functions for these
calculations. Use this in your train function to calculate the (a,b) of your regression model.
Then, you need to go back to util.py
and implement calculate_r_squared
,
which will be used for testing both your simple and multiple linear regressions.
You should use r-squared = 1 - SSE / SSTO,
and refer to this page for an explanation on
how to calculate SSE and SSTO.
Lastly, fill in the test function which is used to evaluate your model. Here you should calculate and
return your training
and testing mean squarred error, and your testing r-squared.
Note that you can use StatsModels and numpy to calculate the MSE.
Wait, the management team realized that there are many more variables that affect bikesharing use in addition to weather conditions. They are concerned that by ignoring these variables, your simple linear regression might be over or underestimating the effect of weather on useage.
Please open multiple_regression.py
and using StatsModels ,
implement a multiple linear regression. From the given 11 different independent variables, your job is
to select the ones that could help the prediction model. See how the R-squared value changes as you
include different independent variables to your regression model.
Calling function summary
on the results returned by the regression model will print out the
report containing the value of each coefficient and its corresponding p-value. See the StatsModels docs
for more details. At the end, please return your training-MSE, testing-MSE, and testing R-squared value.
Answer the following in written_questions.txt
Your responses should be thoughtful, provide justification for your claims, and be concise but complete. See the response guide for more guidance.
After finishing the assignment, run python3 zip_assignment.py
in the command line from your
assignment directory, and fix any issues brought up by the script.
After the script has been run successfully, you should find the file
stats1-submission-1951A.zip
in your assignment directory. Please submit this zip file on
Gradescope under the respective assignment.
Adapted from the previous TA staff for CS1951a.
Bike Sharing Dataset: Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.