Simple Linear Regression using R Programming

Hello Data Experts,

Let me continue from my last blog https://outstandingoutlier.blogspot.in/2017/08/z-and-t-distribution-values-using-r.html “Z and T distribution values using R”, where I covered when to choose the Z distribution and when to opt for the T distribution. I also touched upon 3 key formulas to derive those values:

Confidence Interval = Sample mean ± Margin of Error

Z Distribution = Sample mean ± Z_(1-α/2) value * (SD / square root of (sample size))
T Distribution = Sample mean ± T_(1-α/2, n-1) value * (SD / square root of (sample size))
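
As a quick refresher, here is a minimal sketch of both interval formulas in R. The sample mean, SD and sample size are hypothetical numbers picked only for illustration, and I am assuming a two-sided interval, so the critical values are taken at 1 - α/2 using the base R functions qnorm() and qt().

```r
# Hypothetical sample summary values, chosen only for illustration
sample_mean <- 50
sample_sd   <- 8
n           <- 25

conf_level <- 0.95
alpha      <- 1 - conf_level

# Standard error = SD / square root of (sample size)
std_error <- sample_sd / sqrt(n)

# Critical values for a two-sided interval
z_crit <- qnorm(1 - alpha / 2)
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Confidence Interval = Sample mean +/- Margin of Error
z_interval <- sample_mean + c(-1, 1) * z_crit * std_error
t_interval <- sample_mean + c(-1, 1) * t_crit * std_error

print(z_interval)
print(t_interval)
```

For the same confidence level, the T interval comes out slightly wider than the Z interval, because the T critical value is larger for a small sample.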

In this blog, I will cover the Simple Linear Regression (SLR) model using R. Let us first understand what an SLR model is and when to go for it. Before that, we should understand two basic concepts: the “Correlation Coefficient”, i.e., r, and the “Coefficient of Determination”, i.e., R^2.

Correlation Coefficient measures how strongly the Input Variable and the Output Variable are linearly related, i.e., whether there is a linear relationship between the two variables. Key characteristics of this are:

It can have a value from -1 to 1.

Perfect correlation will have a value of -1 or 1; otherwise the absolute value lies between 0 and 1, and a value close to 0 indicates no clear linear relationship.

Positive correlation will have a positive value, whereas negative correlation will have a negative value.

This coefficient helps us draw 3 inferences.

Direction: The relationship is positive if the output variable increases as the input variable increases, and negative if the output variable decreases as the input variable increases. A value near 0 means there is no clear linear relationship.

Strength: If there is a positive or negative relationship, how consistently does the output variable change along with the input variable? The more closely the data points follow a straight line, the closer |r| is to 1 and the stronger the relationship.

Linearity: Is the relationship a straight line, or is it curvilinear, sinusoidal, etc.? Since r captures only the linear part of a relationship, a low |r| does not rule out a curved one, so confirm the shape with a scatter plot.
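
To make these characteristics concrete, here is a small sketch with made-up data (the vectors below are my own illustration, not real measurements). It shows a perfect positive line, a perfect negative line, and a curved relationship where r is close to 0 even though the two variables are clearly related.

```r
# Made-up input values, symmetric around zero
x <- -10:10

# Perfect positive linear relationship
r_positive <- cor(x, 2 * x + 1)

# Perfect negative linear relationship
r_negative <- cor(x, -3 * x + 5)

# Curved (quadratic) relationship: strongly related, but not linearly
r_curved <- cor(x, x^2)

print(c(positive = r_positive, negative = r_negative, curved = r_curved))
```

This is why it helps to look at a scatter plot alongside r before deciding that two variables are unrelated.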

Coefficient of Determination:

Statistically, it tells how closely the values calculated by the model (the fitted line) match the real values. It is represented as R Squared and is the proportion of variation in the output variable that the model explains. Key characteristics of R Squared are:

It can have a value from 0 to 1.

The higher the value, the better the model fits the data.
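
Here is a rough sketch of where R Squared comes from. I am assuming R's built-in cars dataset (speed and stopping distance) purely as example data, and I fit the line with lm(), which is covered in Step 3 below. R Squared is computed by hand from the residuals and compared with the value R reports.

```r
# Built-in example data (an assumption for illustration): speed vs. stopping distance
fit <- lm(dist ~ speed, data = cars)

# Variation left unexplained by the fitted line
ss_residual <- sum(residuals(fit)^2)

# Total variation of the output variable around its mean
ss_total <- sum((cars$dist - mean(cars$dist))^2)

# R Squared = proportion of variation explained by the model
r_squared_manual <- 1 - ss_residual / ss_total
r_squared_model  <- summary(fit)$r.squared

# With a single input variable, R Squared equals the square of r
r <- cor(cars$speed, cars$dist)

print(c(manual = r_squared_manual, model = r_squared_model, r_squared = r^2))
```

All three numbers agree, which is a handy sanity check when there is only one input variable.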

Before we start building Simple Linear Regression, let us understand when we should go for an SLR model. If we have a SINGLE CONTINUOUS input variable and a CONTINUOUS output variable, we should go for Simple Linear Regression. For example, how can we derive an SLR model for a car’s mileage based on speed as the input? Both Speed and Mileage are continuous variables and Speed is the only input variable, hence SLR is the right statistical model for this scenario.

It’s time for us to work on SLR modeling now. We should follow the 5 steps listed below to arrive at an SLR model:

Step: 1 > We should perform Exploratory Data Analysis, where we determine if the data is normal and fit for the model. For that we should complete all 4 moments of statistics:

1. Measure of Central Tendency

2. Measure of Dispersion

3. Measure of Skewness

4. Measure of Peakedness (Kurtosis)

To confirm fitness of the data, we should also draw a Scatter Plot, Box Plot, Bar chart and Histogram to get a visual check of fitness. If the data is not fit, transform it using the Log, Square Root or Cube Root of the data.
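
A minimal sketch of Step 1, again assuming the built-in cars dataset as example data. Base R has no skewness or kurtosis function, so the small helpers below (skewness_of and kurtosis_of are names I made up) use the simple moment-based formulas. The plots are drawn only in an interactive session.

```r
# Example output variable (an assumption for illustration)
x <- cars$dist

# 1. Measure of Central Tendency
center <- c(mean = mean(x), median = median(x))

# 2. Measure of Dispersion
spread <- c(sd = sd(x), variance = var(x), range = diff(range(x)))

# 3. and 4. Moment-based skewness and kurtosis (hypothetical helper functions)
skewness_of <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_of <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2   # close to 3 for normally distributed data
}

moments <- c(skewness = skewness_of(x), kurtosis = kurtosis_of(x))

# Skewness after a square root transformation, for comparison
skewness_sqrt <- skewness_of(sqrt(x))

print(center)
print(spread)
print(moments)
print(skewness_sqrt)

# Visual checks, drawn only when running interactively
if (interactive()) {
  plot(cars$speed, cars$dist, main = "Scatter Plot")
  boxplot(x, main = "Box Plot")
  hist(x, main = "Histogram")
}
```

A skewness well above 0 points to a long right tail. Comparing it with the skewness of the transformed data is one way to judge whether a transformation is worth applying.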

Note: to know more about Exploratory Data Analysis https://outstandingoutlier.blogspot.in/2017/08/exploratory-data-analysis-using-r.htm and Normality Test https://outstandingoutlier.blogspot.in/2017/08/normality-test-for-data-using-r.html, please revisit my other blogs.

Step: 2 > Let us determine the Correlation Coefficient value to understand the Direction, Strength and Linearity of the data points. Let us assume r = -.75; it can be interpreted as a moderate to strong negative relationship, where the output variable’s value tends to decrease as the input variable’s value increases. If the absolute value |r| is above .75, it reflects a strong relationship.

Using R, it can be derived by executing cor(x, y)
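
For example, a small sketch using the built-in cars dataset (my choice of example data, with speed as the input and stopping distance as the output). The .75 cut-off is the rule of thumb from this step.

```r
# Input and output variables from the built-in example data
x <- cars$speed
y <- cars$dist

# Correlation Coefficient
r <- cor(x, y)

# Direction from the sign of r
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "no clear direction"

# Strength using the .75 rule of thumb from this step
strength <- if (abs(r) > 0.75) "strong" else "moderate or weak"

print(r)
print(paste(strength, direction, "relationship"))
```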

Step: 3 > Now that we have checked Normality and determined the Correlation Coefficient value, let us build the Linear Regression model.

Using R, execute the lm(y ~ x) command, where y is the output variable and x is the input variable. It will give you output of the following form:

Coefficients:
(Intercept)            x
         β0           β1

Once we get both the Intercept (β0) and Slope (β1) values, we have the SLR model ready. The Simple Linear Regression equation will be Y = β0 + β1(x)
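
Here is a minimal sketch of Step 3, assuming the built-in cars dataset as example data, so y is dist and x is speed. The function predict_dist is a hypothetical helper that just writes out Y = β0 + β1(x) by hand, and it is checked against R's own predict().

```r
# Fit the Simple Linear Regression model: dist is the output, speed is the input
fit <- lm(dist ~ speed, data = cars)
print(fit)

# Pull out the Intercept and Slope
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["speed"]]

# Y = b0 + b1 * x written out by hand (hypothetical helper)
predict_dist <- function(speed) b0 + b1 * speed

# Compare the hand-written equation with predict() for a speed of 15
manual_value  <- predict_dist(15)
builtin_value <- predict(fit, newdata = data.frame(speed = 15))

print(c(manual = manual_value, builtin = unname(builtin_value)))
```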

Step: 4 > Let us validate the model. We could stop here, assuming the predictive model we have created is perfect, and start using it for predictions. However, we should first validate the model before predicting outcomes with it. To validate, we run a hypothesis test on the coefficients and look at the resulting probability values (p-values).

Execute summary() on the Linear Regression model from Step 3. The output will have various sections:

Residuals

Coefficients

Residual standard error

Multiple R-squared and Adjusted R-squared

F-statistic and p-value

In the Coefficients section, look out for the Estimate, Std. Error, t value and Pr(>|t|) columns.

If Pr(>|t|), i.e., the probability value (p-value), is < .05, it means that if the true coefficient were zero, there would be less than a 5% chance of seeing an estimate this far from zero, so the coefficient is statistically significant. If the probability value for both β0 and β1 is less than .05, then Y = β0 + β1(x) sounds good.

Similarly, if Multiple R-squared is greater than .75, the model sounds good for predicting the outcome variable based on the input, since the residuals will be relatively small.
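
A short sketch of these checks, assuming the same built-in cars dataset as example data. Instead of reading the printed summary by eye, the values are pulled out of the summary object. The .05 and .75 cut-offs are the rules of thumb from this step.

```r
# Fit the model and build its summary
fit <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# Coefficients table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
print(coef_table)

# Check the probability value of the Intercept and the Slope against .05
p_values    <- coef_table[, "Pr(>|t|)"]
significant <- p_values < 0.05

# Check Multiple R-squared against the .75 rule of thumb
r_squared   <- fit_summary$r.squared
good_enough <- r_squared > 0.75

print(significant)
print(c(r_squared = r_squared, adjusted = fit_summary$adj.r.squared))
print(good_enough)
```

With this example data, both coefficients pass the .05 check while Multiple R-squared stays below .75, so the two checks do not always agree and both are worth looking at.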

Step: 5 > Once the SLR model is defined, as a Data Architect it is important to define the expected confidence level, so that a range from the Lower Confidence Level (LCL) to the Upper Confidence Level (UCL) can be derived from the SLR model. With this we will have 3 values for any Predictor or Input variable: the estimate, the LCL and the UCL.

Execute confint(Linear Regression model from Step 3, level = Confidence Level)

If we are looking for a confidence level of 95%, then Confidence Level will be replaced by .95.

Output will be

                  2.5 %       97.5 %
(Intercept)    β0 (LCL)     β0 (UCL)
x              β1 (LCL)     β1 (UCL)

Once we get two values each for the intercept and the slope, we will form 2 more equations of the same form as in Step 3. These will be the LCL and UCL equations. LCL and UCL will vary based on the confidence level we would like to obtain. The range between LCL and UCL will broaden as the confidence level increases.
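
A minimal sketch of Step 5, again assuming the built-in cars dataset as example data. The functions lcl_line and ucl_line are hypothetical helpers that build the two extra equations from the lower and upper coefficient limits, as described above.

```r
# Fit the model as in Step 3
fit <- lm(dist ~ speed, data = cars)

# Lower and upper limits of the Intercept and Slope at two confidence levels
ci_95 <- confint(fit, level = 0.95)
ci_99 <- confint(fit, level = 0.99)
print(ci_95)
print(ci_99)

# LCL and UCL equations built from the 95% limits (hypothetical helpers)
lcl_line <- function(speed) ci_95["(Intercept)", 1] + ci_95["speed", 1] * speed
ucl_line <- function(speed) ci_95["(Intercept)", 2] + ci_95["speed", 2] * speed

# The 3 values for an input of speed = 15: LCL, estimate, UCL
estimate <- unname(predict(fit, newdata = data.frame(speed = 15)))
print(c(LCL = lcl_line(15), estimate = estimate, UCL = ucl_line(15)))

# The range broadens as the confidence level increases
width_95 <- ci_95[, 2] - ci_95[, 1]
width_99 <- ci_99[, 2] - ci_99[, 1]
print(width_99 > width_95)
```

As a side note, predict() also accepts interval = "confidence", which gives a lower and upper value for the fitted line directly. That is a different calculation from combining the coefficient limits, so the two ranges will not match exactly.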

Thank you for sparing time and going through this blog. I hope it helped you understand how to build a Simple Linear Regression model with confidence using R. Kindly share your valuable feedback and opinion. Please do not forget to suggest what you would like to understand and hear from me in my future blogs around R Programming.

Thank you...

Outstanding Outliers:: "AG".
