Hello Data Experts,
Let me continue from my last blog, “Z and T distribution values using R” (https://outstandingoutlier.blogspot.in/2017/08/z-and-t-distribution-values-using-r.html), where I covered when to choose the Z distribution and when to opt for the T distribution. I also touched upon 3 key formulas to derive those values:
Confidence Interval = Sample mean ± Margin of Error
Z Distribution = Sample mean ± Z(1-α) value * (SD / square root of (sample size))
T Distribution = Sample mean ± T(1-α, n-1) value * (SD / square root of (sample size))
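As a quick refresher, the formulas above can be sketched in R roughly as follows. This is only a minimal sketch: the sample is simulated (it is not data from the earlier blog), and I am assuming a two-sided interval, so α is split across both tails by using 1 - α/2 in qnorm() and qt().

```r
# Hypothetical sample: 25 simulated measurements, for illustration only
set.seed(42)
x <- rnorm(25, mean = 50, sd = 5)

n     <- length(x)
xbar  <- mean(x)            # sample mean
s     <- sd(x)              # sample standard deviation
alpha <- 0.05               # 95% confidence level

# Critical values for a two-sided interval (assumption: alpha split over both tails)
z_crit <- qnorm(1 - alpha / 2)
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Margin of error = critical value * (SD / square root of sample size)
se <- s / sqrt(n)
z_interval <- c(lower = xbar - z_crit * se, upper = xbar + z_crit * se)
t_interval <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)

print(z_interval)
print(t_interval)
```

For a small sample like this one, the T interval comes out a little wider than the Z interval, because the T critical value is larger.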
In this blog, I will cover the Simple Linear Regression (SLR) model using R. Let us first understand what the SLR model is and when to go for it. Before that, we should understand two basic concepts: the “Correlation Coefficient”, i.e., r, and the “Coefficient of Determination”, i.e., R^2.
Correlation Coefficient: It describes how the input variable and the output variable are correlated, i.e., whether there is any linear relationship between the two variables. Key characteristics of r are:
It can have a value from -1 to 1.
A perfect correlation has a value of -1 or 1; otherwise the absolute value lies between 0 and 1.
A positive correlation has a positive value, whereas a negative correlation has a negative value.
This coefficient helps us draw 3 inferences.
Direction: The relationship is positive if the output variable increases as the input variable increases, and negative if the output variable decreases as the input variable increases. If there is no clear relationship, r is close to 0.
Strength: If there is a positive or negative relationship, how closely does the output variable follow the input variable? The more consistently both variables change together, the closer |r| is to 1 and the stronger the relationship.
Linearity: Is the relationship a straight line, or is it curvilinear, sinusoidal, etc.? Since r captures only the straight-line part of a relationship, a scatter plot helps to confirm the shape.
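To make Direction, Strength and Linearity a little more concrete, here is a small sketch using simulated data. All variable names and noise levels below are my own assumptions, chosen only for illustration.

```r
set.seed(1)
x <- seq(1, 100)

# Direction: output rises or falls as input rises
y_pos <- 2 * x + rnorm(100, sd = 5)
y_neg <- -2 * x + rnorm(100, sd = 5)

# Strength: same slope as y_pos, but far more noise around the line
y_weak <- 2 * x + rnorm(100, sd = 200)

# Linearity: a perfect curve that a straight line cannot describe
x_sym   <- seq(-50, 50)
y_curve <- x_sym^2

r_pos   <- cor(x, y_pos)          # close to +1
r_neg   <- cor(x, y_neg)          # close to -1
r_weak  <- cor(x, y_weak)         # much closer to 0 than r_pos
r_curve <- cor(x_sym, y_curve)    # essentially 0, despite a perfect relationship

print(round(c(pos = r_pos, neg = r_neg, weak = r_weak, curve = r_curve), 3))
```

The last case is the reason to look at a scatter plot as well: r is about zero even though y_curve is completely determined by x_sym.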
Coefficient of Determination: Statistically, it tells how closely the calculated (fitted) outcome values/line match the real (observed) values/line. It is represented as R Squared. Key characteristics of R Squared are:
It can have a value from 0 to 1.
The higher the value, the better the model fits the data.
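As a rough illustration of what R Squared measures, the sketch below computes it by hand and compares it with the value reported by R. I am using R's built-in cars dataset (speed and stopping distance) purely as a stand-in, since it ships with R; it is not the mileage data discussed in this blog.

```r
# Built-in dataset used as a stand-in: speed (input) and dist (output)
fit <- lm(dist ~ speed, data = cars)

# R Squared = 1 - (unexplained variation / total variation)
ss_res <- sum(residuals(fit)^2)                   # observed minus fitted, squared
ss_tot <- sum((cars$dist - mean(cars$dist))^2)    # observed minus mean, squared
r2_manual <- 1 - ss_res / ss_tot

# The same value as reported by summary()
r2_summary <- summary(fit)$r.squared

# For Simple Linear Regression, R Squared equals r squared
r2_from_r <- cor(cars$speed, cars$dist)^2

print(c(manual = r2_manual, summary = r2_summary, from_r = r2_from_r))
```

All three numbers agree, which shows how r and R Squared are linked when there is a single input variable.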
Before we start building Simple Linear Regression, let us understand when we should go for the SLR model. If we have a SINGLE CONTINUOUS input variable and a CONTINUOUS output variable, we should go for Simple Linear Regression. For example, how can we derive an SLR model for a car’s mileage with speed as its input? Both Speed and Mileage are continuous variables and Speed is the only input variable, hence SLR is the right statistical model for this scenario.
It’s time for us to work on SLR modeling now. We should follow the 5 steps listed below to arrive at an SLR model:
Step: 1 > We should perform Exploratory Data Analysis, where we determine whether the data is normal and fit for the model. For that, we should complete all 4 moments of statistics:
1. Measure of Central Tendency
2. Measure of Dispersion
3. Measure of Skewness
4. Measure of Peakedness (Kurtosis)
To confirm fitness of the data, we should also draw a Scatter Plot, Box Plot, Bar Chart and Histogram to get a visual reflection of the measures of fitness. If the data is not fit, transform it using the Log, Square Root or Cube Root of the data.
Note: To know more about Exploratory Data Analysis (https://outstandingoutlier.blogspot.in/2017/08/exploratory-data-analysis-using-r.htm) and the Normality Test (https://outstandingoutlier.blogspot.in/2017/08/normality-test-for-data-using-r.html), please revisit my other blogs.
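Step 1 could look roughly like the sketch below. It is only an outline under my own assumptions: the built-in cars dataset stands in for real data, skewness and kurtosis are computed by hand because base R has no built-in functions for them, and shapiro.test() is just one possible normality check.

```r
x <- cars$speed   # stand-in input variable from R's built-in cars dataset

# 1. Central tendency
central <- c(mean = mean(x), median = median(x))

# 2. Dispersion
dispersion <- c(variance = var(x), sd = sd(x), range = diff(range(x)))

# 3 and 4. Skewness and kurtosis from standardized values (population formulas)
z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
skewness <- mean(z^3)   # near 0 for symmetric data
kurtosis <- mean(z^4)   # near 3 for normally distributed data

print(central)
print(dispersion)
print(c(skewness = skewness, kurtosis = kurtosis))

# Visual checks
hist(x, main = "Histogram of speed")
boxplot(x, main = "Box plot of speed")
plot(cars$speed, cars$dist, main = "Scatter plot: dist vs speed")

# One possible normality test; a p-value above .05 gives no evidence against normality
normality <- shapiro.test(x)
print(normality$p.value)

# If the data is not fit, try a transformation such as log, sqrt or cube root
log_dist  <- log(cars$dist)
sqrt_dist <- sqrt(cars$dist)
cbrt_dist <- cars$dist^(1 / 3)
```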
Step: 2 > Let us determine the Correlation Coefficient value to understand the Direction, Strength and Linearity of the data points. Let us assume r = -.75. It can be interpreted as a moderate to strong negative relationship, where the output variable’s value tends to decrease as the input variable’s value increases. If the absolute value |r| is above .75, it reflects a strong relationship.
Using R, it can be derived by executing cor(x, y).
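For instance, with the built-in cars dataset as a stand-in (speed as input, stopping distance as output, so not the mileage example above), Step 2 could be run as sketched below. The cor.test() call is an optional extra that I am adding to check whether r differs significantly from zero.

```r
x <- cars$speed   # input variable
y <- cars$dist    # output variable

r <- cor(x, y)    # Pearson correlation coefficient by default
print(r)

# Apply the .75 rule of thumb from this step
strength  <- if (abs(r) > 0.75) "strong" else "moderate or weak"
direction <- if (r > 0) "positive" else "negative"
cat("Relationship is", strength, "and", direction, "\n")

# Optional: test whether the correlation differs from zero
ct <- cor.test(x, y)
print(ct$p.value)
```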
Step: 3 > Now that we have determined Normality and the Correlation Coefficient value, let us try to build the Linear Regression model.
Using R, execute the lm(y ~ x) command, where y is the output variable and x is the input variable.
It will give you the sample outcome below:
Coefficients:
(Intercept)            x
         β0           β1
Once we get both the Intercept (β0) and Slope (β1) values, we have the SLR model ready. The Simple Linear Regression equation will be Y = β0 + β1(x).
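Here is a minimal sketch of Step 3, again assuming the built-in cars dataset as a stand-in. The names model, b0, b1 and new_speed are my own and only illustrative.

```r
# Fit output (dist) against input (speed)
model <- lm(dist ~ speed, data = cars)
print(model)   # shows the Coefficients section with intercept and slope

# Pull out the intercept and the slope
b0 <- coef(model)[["(Intercept)"]]
b1 <- coef(model)[["speed"]]

# Use Y = b0 + b1 * x by hand for a hypothetical speed of 15
new_speed <- 15
manual <- b0 + b1 * new_speed

# predict() applies the same equation for us
via_predict <- predict(model, newdata = data.frame(speed = new_speed))

print(c(manual = manual, predict = unname(via_predict)))
```

Both approaches give the same prediction; predict() is simply more convenient once there are many new input values.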
Step: 4 > Let us validate the model. We could stop here, assume that the predictive model we have created is perfect and start leveraging it for predictions. However, we should first validate the model before predicting outcomes with it. To validate, we should run a Hypothesis Test on the model by getting the probability value (p-value) of each coefficient.
Execute summary() on the Linear Regression result from Step 3, e.g., summary(lm(y ~ x)).
The output will have various sections:
Residuals
Coefficients
Residual standard error
Multiple R-squared and Adjusted R-squared
F-statistic and p-value
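In the Coefficients section, look at the Estimate, Std. Error, t value and Pr(>|t|) columns. If Pr(>|t|), i.e., the probability value, is < .05, the coefficient is statistically significant at the 5% level: if the true coefficient were zero, there would be less than a 5% chance of seeing an estimate this far from zero. If the probability value for both β0 and β1 is less than .05, then Y = β0 + β1(x) sounds good.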
Similarly, if Multiple R-squared is greater than .75, the model sounds good for predicting the output variable based on the input, as the residuals will be small relative to the overall variation in the output.
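The checks in Step 4 can also be done programmatically instead of reading the printed summary. The sketch below assumes the same stand-in model on the built-in cars dataset; the .05 and .75 thresholds are the rules of thumb from this blog.

```r
model <- lm(dist ~ speed, data = cars)
s <- summary(model)
print(s)   # full report: Residuals, Coefficients, R-squared, F-statistic

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

# Are both the intercept and the slope significant at the 5% level?
significant <- p_values < 0.05
print(significant)

# Multiple and Adjusted R-squared
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
cat("Multiple R-squared:", round(r2, 4), " Adjusted:", round(adj_r2, 4), "\n")

# Apply the .75 rule of thumb
cat("R-squared above .75?", r2 > 0.75, "\n")
```

With this stand-in data both coefficients are significant, but R-squared stays below .75, so by the rule of thumb above the model would need a closer look before being used for predictions.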
Step: 5 > Once the SLR model is defined, as a Data Architect it is important to define the expected confidence level. The range, i.e., the Lower Confidence Level (LCL) and the Upper Confidence Level (UCL), will then be derived from the SLR model. With this we will have 3 values (the estimate, the LCL and the UCL) for any Predictor or Input variable.
Execute confint(Linear Regression result from Step 3, level = Confidence Level). If we are looking for a confidence level of 95%, then Confidence Level will be replaced by .95.
The output will be:
              2.5 %        97.5 %
(Intercept)   β0 (lower)   β0 (upper)
x             β1 (lower)   β1 (upper)
Once we get two values each for the intercept and slope, we will form 2 more equations like the one formed in Step 3. These will be the LCL and UCL equations. LCL and UCL will vary based on the confidence level we would like to obtain. The LCL to UCL range will broaden as the confidence level increases.
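Step 5 can be sketched as below, once more assuming the stand-in model on the built-in cars dataset. The helper names lcl, ucl and width are hypothetical and only for illustration. The LCL and UCL equations follow the approach described above; as an alternative that R offers out of the box, predict() with interval = "confidence" returns a lower and upper bound for the fitted line directly.

```r
model <- lm(dist ~ speed, data = cars)

# Lower and upper bounds for intercept and slope at 95% confidence
ci95 <- confint(model, level = 0.95)
print(ci95)

# LCL and UCL equations built from the bounds, as described in this step
lcl <- function(x) ci95["(Intercept)", 1] + ci95["speed", 1] * x
ucl <- function(x) ci95["(Intercept)", 2] + ci95["speed", 2] * x
print(c(LCL = lcl(15), UCL = ucl(15)))

# The range broadens as the confidence level increases
width <- function(level) {
  ci <- confint(model, level = level)
  ci[, 2] - ci[, 1]
}
print(width(0.90))
print(width(0.99))

# Alternative: interval for the fitted value at a hypothetical speed of 15
band <- predict(model, newdata = data.frame(speed = 15),
                interval = "confidence", level = 0.95)
print(band)
```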
Thank you for sparing time and going through this blog. I hope it helped you understand how to build a Simple Linear Regression model with confidence using R. Kindly share your valuable feedback and opinion. Please do not forget to suggest what you would like to understand and hear from me in my future blogs around R Programming.
Thank you...
Outstanding Outliers::
"AG".