
Is today's world all about creativity and ideation?

Are they the seeds to be nurtured to bring in automation, innovation and transformation? There is a saying: necessity is the mother of invention. I would say innovation is an amalgamation of creativity and necessity. We need to understand the ecosystem to apply creativity and identify the ideas that bring in change. We need to stay competent in a changing ecosystem and think beyond the possible. What is the biggest challenge in doing this? "Unlearning and Learning": we tend to think the current ecosystem is the best. Be it health, finserv, agriculture or the mechanical domain, we need to empathize with the stakeholders to come up with a strategy to drive change. A very evident example is that the quality of life is changing every millisecond. A few decades back a phone connection was limited to a few, but today almost every millennial has a mobile phone. Now the phone is not just a medium to talk; it is such a powerful device that an innovative solution can be developed on it....

Simple Linear Regression using R Programming

Hello Data Experts,
Let me continue from my last blog https://outstandingoutlier.blogspot.in/2017/08/z-and-t-distribution-values-using-r.html “Z and T distribution values using R”, where I covered when to choose the Z distribution and when to opt for the T distribution. Finally, I also touched upon 3 key formulas to derive those values:
Confidence Interval          =          Sample mean ± Margin of Error
Z Distribution               =          Sample mean ± Z(1-α/2) value * (SD / square root of (sample size))
T Distribution               =          Sample mean ± T(1-α/2, n-1) value * (SD / square root of (sample size))
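As a quick refresher, here is a minimal sketch of both intervals in R. The sample mean, SD and sample size below are made-up numbers used only for illustration, and the sketch assumes a two-sided interval, so the critical value is taken at 1 - α/2.

```r
# Hypothetical sample summary, chosen only for illustration
sample_mean <- 50
sample_sd   <- 10
n           <- 25
alpha       <- 0.05                      # 95% confidence level

# Standard error: SD / square root of (sample size)
se <- sample_sd / sqrt(n)

# Z-based interval: critical value from the standard normal distribution
z_crit <- qnorm(1 - alpha / 2)
z_ci   <- sample_mean + c(-1, 1) * z_crit * se

# T-based interval: critical value from the t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)
t_ci   <- sample_mean + c(-1, 1) * t_crit * se

print(z_ci)
print(t_ci)
```

For the same data the T interval comes out wider than the Z interval, because the T critical value is larger for small samples.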
In this blog, I will cover the Simple Linear Regression (SLR) model using R. Let us first understand what an SLR model is and when to go for it. Before that, we should understand certain basic concepts around the “Correlation Coefficient”, i.e., r, and the “Coefficient of Determination”, i.e., R^2.
The Correlation Coefficient describes how the input variable and the output variable are correlated, i.e., whether there is any relationship between the two variables. Its key characteristics are:
It can have a value from -1 to 1
Perfect correlation will have a value of -1 or 1; otherwise the absolute value lies between 0 and 1.
Positive correlation will have a positive value whereas negative correlation will have a negative value.
This coefficient helps us draw 3 inferences.
Direction: The relationship is positive if the output variable increases as the input variable increases, and negative if the output variable decreases as the input variable increases. A value close to 0 means there is no clear linear relationship.
Strength: If there is a positive or negative relationship, how closely does the output variable follow the input variable? The closer |r| is to 1, the more tightly the data points follow a straight line and the stronger the relationship.
Linearity: Is the relationship a straight line, or is it curvilinear, sinusoidal etc.? r measures only the straight-line association, so a curved relationship can give a low r even when the variables are clearly related.
Coefficient of Determination:
It states, statistically, how closely the values calculated by the model (the fitted line) match the real observed values. It is represented as R Squared. Key characteristics of R Squared are:
It can have a value from 0 to 1.
The higher the value, the better the model fits the data.
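To make R Squared concrete, here is a small sketch that computes it by hand and compares it with r. It assumes R's built-in mtcars dataset as a stand-in (car weight as input, mileage as output), which is my choice for illustration and not part of the original example. It also borrows lm(), which is introduced in Step 3 further down.

```r
# Stand-in data from the built-in mtcars dataset (an assumption for illustration)
x <- mtcars$wt    # input: car weight
y <- mtcars$mpg   # output: mileage

fit   <- lm(y ~ x)       # fitted straight line
y_hat <- fitted(fit)     # values calculated by the model

# R Squared = 1 - (unexplained variation / total variation)
ss_res    <- sum((y - y_hat)^2)
ss_tot    <- sum((y - mean(y))^2)
r_squared <- 1 - ss_res / ss_tot

# For a simple linear regression, R Squared equals the square of r
r <- cor(x, y)
print(c(r = r, r_squared = r_squared, r_times_r = r^2))
```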
Before we start building Simple Linear Regression, let us understand when we should go for an SLR model. If we have a SINGLE CONTINUOUS input variable and a CONTINUOUS output variable, we should go for Simple Linear Regression. For example, how can we derive an SLR model for a car’s mileage based on speed as the input? Both Speed and Mileage are continuous variables and Speed is the only input variable, hence SLR is the right statistical model for this scenario.
It’s time for us to work on SLR modeling now. We should follow the 5 steps listed below to arrive at an SLR model:
Step: 1 > We should perform Exploratory Data Analysis, where we determine if the data is normal and fit for the model. For that we should compute all 4 moments of statistics:
1.      Measure of Central Tendency
2.      Measure of Dispersion.
3.      Measure of Skewness
4.      Measure of Peakedness (Kurtosis).
To confirm fitness of the data, we should also plot a Scatter Plot, Box Plot, Bar Chart and Histogram to get a visual reflection of its fitness. If the data is not fit, transform it using the Log, Square Root or Cube Root of the data.
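Step 1 can be sketched in base R as below. The mileage column of the built-in mtcars dataset is used as stand-in data (an assumption for illustration), and skewness and kurtosis are computed from their moment definitions so that no extra package is needed.

```r
# Stand-in data from the built-in mtcars dataset (an assumption for illustration)
y <- mtcars$mpg

# 1. Measure of Central Tendency
center <- c(mean = mean(y), median = median(y))

# 2. Measure of Dispersion
spread <- c(sd = sd(y), variance = var(y), range = diff(range(y)))

# 3 and 4. Skewness and Kurtosis from standardized values (population moments)
z        <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
skewness <- mean(z^3)   # close to 0 for symmetric data
kurtosis <- mean(z^4)   # close to 3 for normally distributed data

print(center)
print(spread)
print(c(skewness = skewness, kurtosis = kurtosis))

# Candidate transformations if the data does not look fit
y_log  <- log(y)
y_sqrt <- sqrt(y)
y_cbrt <- y^(1 / 3)
```

For the visual checks, plot(x, y) draws the scatter plot, boxplot(y) the box plot and hist(y) the histogram.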
Step: 2 > Let us determine the Correlation Coefficient value to understand the Direction, Strength and Linearity of the data points. Let us assume r = -.75; it can be interpreted as a moderate to strong negative relationship, where the output variable’s value tends to decrease as the input variable’s value increases. If the absolute value |r| is above .75, it reflects a strong relationship.
Using R, it can be derived by executing cor(x, y)
Step: 3 > Now that we have checked Normality and determined the Correlation Coefficient value, let us build the Linear Regression model.
Using R, execute the lm(y ~ x) command, where y is the output variable and x is the input variable. It will give you the sample outcome below
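Here is a minimal sketch of Step 2, again assuming the built-in mtcars dataset as stand-in data (car weight as input, mileage as output). The .75 cut-off is the rule of thumb from this blog, not a universal threshold.

```r
# Stand-in data from the built-in mtcars dataset (an assumption for illustration)
x <- mtcars$wt    # input: car weight
y <- mtcars$mpg   # output: mileage

r <- cor(x, y)    # Pearson correlation coefficient

# Direction comes from the sign, strength from the absolute value
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "none"
strength  <- if (abs(r) > 0.75) "strong" else "moderate or weak"

cat("r =", round(r, 3), "->", strength, direction, "relationship\n")
```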
Coefficients:
(Intercept)            x 
     β0                      β1 
Once we get both the Intercept (β0) and Slope (β1) values, we have the SLR model ready. The Simple Linear Regression equation will be Y = β0 + β1(x)
Step: 4 > Let us validate the model. We could stop here, assume the predictive model we have created is perfect, and start leveraging it for predictions. However, we should first validate the model before predicting outcomes with it. To validate, we run a Hypothesis Test on the coefficients and look at the resulting probability value (p-value).
Execute summary(Linear Regression command from Step 3). The output will have various sections:
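Step 3 can be tried out with the sketch below. It assumes the built-in mtcars dataset as stand-in data, and the weight value of 3 used for the prediction is an arbitrary illustrative input.

```r
# Fit mileage (mpg) against weight (wt) on the built-in mtcars dataset
fit <- lm(mpg ~ wt, data = mtcars)
print(fit)                           # shows the Coefficients block

# Pull out the intercept and the slope
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["wt"]]

# Apply Y = b0 + b1 * x by hand for an illustrative input value
new_x <- 3
y_manual <- b0 + b1 * new_x

# predict() applies the same equation for us
y_predict <- unname(predict(fit, newdata = data.frame(wt = new_x)))

print(c(manual = y_manual, predict = y_predict))
```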
Residuals
Coefficients
Residual standard error
Multiple R-squared and Adjusted R-squared:   
F-statistic and p-value
Look out for
Estimate                Std. Error               t value                    Pr(>|t|) value.
If Pr(>|t|), i.e., the probability value (p-value), is < .05, it means that if the true coefficient were zero there would be less than a 5% chance of seeing an estimate this far from zero, so the coefficient is statistically significant. If the probability value for both β0 and β1 is less than .05, then Y = β0 + β1(x) sounds good.
Similarly, if Multiple R-squared is greater than .75, the model sounds good for predicting the outcome variable based on the input, assuming the residuals are small.
Step: 5 > Once the SLR model is defined, as a Data Architect it is important to define the expected confidence level, so a range from the Lower Confidence Limit (LCL) to the Upper Confidence Limit (UCL) will be defined using the SLR model. With this we will have 3 values (LCL, estimate and UCL) for any Predictor or Input variable.
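The checks in Step 4 can also be read programmatically instead of by eye. This is a minimal sketch on the built-in mtcars dataset (stand-in data, an assumption for illustration); the .05 and .75 thresholds are the rules of thumb used in this blog.

```r
# Fit the model on the built-in mtcars dataset and summarise it
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
print(coef_table)

# Probability values for the intercept and the slope
p_values <- coef_table[, "Pr(>|t|)"]
coefficients_ok <- all(p_values < 0.05)

# Multiple and Adjusted R-squared
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
fit_ok <- r2 > 0.75

cat("Coefficients significant:", coefficients_ok, "\n")
cat("Multiple R-squared:", round(r2, 4), " Adjusted:", round(adj_r2, 4), "\n")
cat("R-squared above .75:", fit_ok, "\n")
```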
Execute confint(Linear Regression command from Step 3, level = Confidence Level)
For a confidence level of 95%, replace Confidence Level with .95.
Output will be
               2.5 %        97.5 %
(Intercept)    β0 (lower)   β0 (upper)
x              β1 (lower)   β1 (upper)
Once we get the two values for the intercept and the slope, we will form 2 more equations like the one formed in Step 3. These will be the LCL and UCL equations. LCL and UCL will vary based on the confidence level we would like to obtain. The range between LCL and UCL will broaden as the confidence level increases.
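Step 5 can be sketched as below, once more assuming the built-in mtcars dataset as stand-in data; the weight value of 3 is an arbitrary illustrative input. The sketch also calls predict() with interval = "confidence", which computes the bounds of the fitted value directly; those bounds generally differ from what you get by plugging the coefficient limits into the equation, so treat the two as separate views.

```r
# Fit the model on the built-in mtcars dataset
fit <- lm(mpg ~ wt, data = mtcars)

# Lower and upper limits of the intercept and the slope
ci95 <- confint(fit, level = 0.95)
ci99 <- confint(fit, level = 0.99)
print(ci95)

# The range broadens as the confidence level increases
width95 <- ci95[, 2] - ci95[, 1]
width99 <- ci99[, 2] - ci99[, 1]
print(rbind(width95, width99))

# LCL and UCL equations formed from the coefficient limits, as described above
new_x <- 3
lcl <- ci95["(Intercept)", 1] + ci95["wt", 1] * new_x
ucl <- ci95["(Intercept)", 2] + ci95["wt", 2] * new_x
print(c(LCL = lcl, UCL = ucl))

# Built-in alternative: confidence bounds for the fitted value itself
band <- predict(fit, newdata = data.frame(wt = new_x),
                interval = "confidence", level = 0.95)
print(band)
```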
Thank you for sparing time and going through this blog. I hope it helped you understand how to build a Simple Linear Regression model with confidence using R. Kindly share your valuable feedback and opinion. Please do not forget to suggest what you would like to understand and hear from me in my future blogs around R Programming.
Thank you...
Outstanding Outliers:: "AG".
 
