
Is today's world all about creativity and ideation?

Are they the seeds to be nurtured to bring in automation, innovation and transformation? There is a saying that necessity is the mother of invention. I would say innovation is an amalgamation of creativity and necessity. We need to understand the ecosystem in order to apply creativity and identify the ideas that bring change. We need to keep pace with a changing ecosystem and think beyond the possible. What is the biggest challenge in doing this? "Unlearning and Learning": we tend to think the current ecosystem is the best. Be it health, financial services, agriculture or the mechanical domain, we need to empathize with the stakeholders to come up with a strategy to drive change. The most evident example is how quickly the quality of life is changing. A few decades back a phone connection was limited to a few, but today almost every millennial has a mobile phone. A phone is no longer just a medium to talk; it is a device so powerful that innovative solutions can be developed on it.

Simple Linear Regression using R Programming

Hello Data Experts,
Let me continue from my last blog, “Z and T distribution values using R” (https://outstandingoutlier.blogspot.in/2017/08/z-and-t-distribution-values-using-r.html), where I covered when to choose the Z distribution and when to opt for the T distribution. I also touched upon 3 key formulas to derive those values (written here for a two-sided interval at confidence level 1-α):
Confidence Interval       =          Sample mean ± Margin of Error
Z Distribution            =          Sample mean ± Z(1-α/2) value * (SD / square root of (sample size))
T Distribution            =          Sample mean ± T(1-α/2, n-1) value * (SD / square root of (sample size))
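As a quick recap, the formulas above can be sketched in R as follows. The sample values are made up for illustration, and the critical values are taken at 1 - α/2 on the assumption that a two-sided interval is wanted.

```r
# Hypothetical sample; the values are invented for illustration only.
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)

n     <- length(x)
xbar  <- mean(x)                   # sample mean
s     <- sd(x)                     # sample standard deviation
alpha <- 0.05                      # 95% confidence level

# Two-sided critical values
z_crit <- qnorm(1 - alpha / 2)            # Z distribution
t_crit <- qt(1 - alpha / 2, df = n - 1)   # T distribution, n - 1 degrees of freedom

se <- s / sqrt(n)                  # SD / square root of sample size

# Sample mean +/- margin of error
z_interval <- c(lower = xbar - z_crit * se, upper = xbar + z_crit * se)
t_interval <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)

print(z_interval)
print(t_interval)
```

With a sample this small the T interval comes out wider than the Z interval, which is the reason for preferring it when the population SD is unknown.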
In this blog, I will cover the Simple Linear Regression (SLR) model using R. Let us first understand what an SLR model is and when to go for it. Before that, we should understand two basic concepts: the “Correlation Coefficient”, i.e. r, and the “Coefficient of Determination”, i.e. R^2.
The Correlation Coefficient describes how the input variable and the output variable are correlated, i.e., whether there is a linear relationship between the two variables. Its key characteristics are:
It can take any value from -1 to 1.
A perfect correlation has the value -1 or 1; otherwise the absolute value lies between 0 and 1, and a value near 0 means there is no clear linear relationship.
A positive correlation has a positive value, whereas a negative correlation has a negative value.
This coefficient helps us draw 3 inferences:
Direction: The relationship is positive if the output variable increases as the input variable increases, and negative if the output variable decreases as the input variable increases. If there is no clear relationship, r will be close to 0.
Strength: If there is a positive or negative relationship, how consistently does the output variable change with the input variable? The more tightly the data points follow a straight line, the higher the strength and the closer |r| is to 1.
Linearity: Is the relationship a straight line, or is it curvilinear, sinusoidal, etc.? Since r only measures straight-line association, a scatter plot is the best way to check this.
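Here is a small sketch of how direction, strength and linearity show up in r. The data is simulated purely for illustration; the slopes, noise level and seed are arbitrary choices of mine.

```r
set.seed(42)                       # arbitrary seed so the run is repeatable
x <- seq(-3, 3, length.out = 61)   # input values, symmetric around 0

y_pos   <- 2 * x + 1                          # perfect positive straight line
y_neg   <- -0.5 * x + 4                       # perfect negative straight line
y_noisy <- 2 * x + rnorm(length(x), sd = 3)   # same direction, weaker strength
y_curve <- x^2                                # strong but U-shaped relationship

r_pos   <- cor(x, y_pos)     # direction: positive, strength: perfect
r_neg   <- cor(x, y_neg)     # direction: negative, strength: perfect
r_noisy <- cor(x, y_noisy)   # positive, but |r| drops below 1 because of noise
r_curve <- cor(x, y_curve)   # essentially 0 although y depends fully on x

print(c(pos = r_pos, neg = r_neg, noisy = r_noisy, curve = r_curve))
```

The last case is the reason to look at a scatter plot as well: r near 0 only rules out a straight-line relationship, not a relationship as such.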
Coefficient of Determination:
It measures, statistically, how closely the values calculated by the model (the fitted line) match the real values. It is represented as R Squared. Key characteristics of R Squared are:
It can have a value from 0 to 1.
The higher the value, the larger the share of variation in the output variable that the model explains, and the better the fit.
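To make R Squared concrete, here is a minimal sketch that computes it by hand and compares it with what R reports. The data is simulated for illustration only, and lm() is the model fitting command covered in Step 3.

```r
# Hypothetical data, generated only to illustrate the calculation.
set.seed(1)
x <- 1:30
y <- 5 + 0.8 * x + rnorm(30, sd = 4)

fit   <- lm(y ~ x)                 # fitted straight line
y_hat <- fitted(fit)               # values calculated by the model

ss_res <- sum((y - y_hat)^2)       # variation the line does not explain
ss_tot <- sum((y - mean(y))^2)     # total variation around the mean
r_squared <- 1 - ss_res / ss_tot   # share of variation explained

print(r_squared)
print(summary(fit)$r.squared)      # value reported by R
print(cor(x, y)^2)                 # with a single input variable, R^2 equals r squared
```

All three printed values agree, which is why R Squared in an SLR model is simply the square of the correlation coefficient.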
Before we start building a Simple Linear Regression model, let us understand when we should go for it. If we have a SINGLE CONTINUOUS input variable and a CONTINUOUS output variable, we should go for Simple Linear Regression. For example, how can we derive an SLR model for a car’s mileage with speed as its input? Both Speed and Mileage are continuous variables and Speed is the only input variable, hence SLR is the right statistical model for this scenario.
It’s time for us to work on SLR modeling now. We should follow the 5 steps listed below to arrive at an SLR model:
Step: 1 > We should perform Exploratory Data Analysis, where we determine whether the data is normal and fit for the model. For that, we should complete all 4 moments of statistics:
1.      Measure of Central Tendency
2.      Measure of Dispersion
3.      Measure of Skewness
4.      Measure of Peakedness (Kurtosis)
To confirm the fitness of the data, we should also draw a Scatter Plot, Box Plot, Bar Chart and Histogram to get a visual reflection of that fitness. If the data is not fit, transform it using the Log, Square Root or Cube Root of the data.
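Step 1 could look roughly like this in base R. R's built-in cars dataset (speed and stopping distance) is used here only as a stand-in for the speed and mileage example, and the skewness and kurtosis formulas are the simple moment-based versions, written out by hand because base R has no ready-made functions for them.

```r
data(cars)                         # built-in dataset: speed and dist (stopping distance)
x <- cars$speed                    # variable under study

# 1. Central tendency and 2. dispersion
central    <- c(mean = mean(x), median = median(x))
dispersion <- c(variance = var(x), sd = sd(x), range = diff(range(x)))

# 3. Skewness and 4. kurtosis from the central moments
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)
skewness <- m3 / m2^1.5            # about 0 for symmetric data
kurtosis <- m4 / m2^2              # about 3 for normally distributed data

normality <- shapiro.test(x)       # formal normality test; a small p-value suggests non-normal data

print(central)
print(dispersion)
print(c(skewness = skewness, kurtosis = kurtosis, shapiro_p = normality$p.value))

# Visual checks
hist(x, main = "Histogram of speed")
boxplot(x, main = "Box plot of speed")
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist", main = "Scatter plot")

# Candidate transformation if the data does not look normal
x_log <- log(x)
```

The same checks would be repeated for the output variable and, if a transformation is applied, for the transformed values.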
Step: 2 > Let us determine the Correlation Coefficient value to understand the Direction, Strength and Linearity of the data points. Let us assume r = -.75; it can be interpreted as a moderate to strong negative relationship, where the output variable’s value tends to decrease as the input variable’s value increases. As a rule of thumb, if the absolute value |r| is above .75, it reflects a strong relationship.
Using R, it can be derived by executing cor(x, y)
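As an illustration, here is cor(x, y) on R's built-in cars dataset, with speed as the input and stopping distance as the output. The dataset is my stand-in, not the mileage data from the example. cor.test() is an optional extra that also reports a p-value and a confidence interval for r.

```r
data(cars)
x <- cars$speed                    # input variable
y <- cars$dist                     # output variable

r <- cor(x, y)                     # Pearson correlation coefficient (the default method)
print(r)

# Optional: significance test and confidence interval for r
ct <- cor.test(x, y)
print(ct$p.value)
print(ct$conf.int)
```

Here r is positive and above .75 in absolute value, so by the rule of thumb above the relationship is a strong positive one.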
Step: 3 > Now that we have checked Normality and determined the correlation coefficient value, let us build the Linear Regression model.
Using R, execute the lm(y ~ x) command, where y is the output variable and x is the input variable. It will give you output like the sample below:
Coefficients:
(Intercept)            x
         β0           β1
Once we get both the Intercept (β0) and Slope (β1) values, we have the SLR model ready. The Simple Linear Regression equation will be Y = β0 + β1(x)
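A minimal sketch of this step, again assuming the built-in cars dataset as stand-in data, with dist as Y and speed as x. The speed value of 20 used for the prediction is an arbitrary example.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)   # y ~ x: dist is the output, speed the input
print(fit)                             # prints the Coefficients section

b0 <- coef(fit)[["(Intercept)"]]       # intercept, β0
b1 <- coef(fit)[["speed"]]             # slope, β1

# Prediction with the equation Y = β0 + β1(x), for a hypothetical speed of 20
new_speed <- 20
y_manual  <- b0 + b1 * new_speed

# The same prediction using R's predict()
y_predict <- predict(fit, newdata = data.frame(speed = new_speed))

print(c(manual = y_manual, predict = unname(y_predict)))
```

Both routes give the same number; predict() is simply more convenient once there are many new input values.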
Step: 4 > Let us validate the model. We could stop here, assume the predictive model we have created is perfect and start using it for predictions; however, we should first validate the model before predicting outcomes with it. To validate, we run a hypothesis test on each coefficient and look at its probability value (p-value).
Execute summary(Linear Regression command from Step 3). The output will have various sections:
Residuals
Coefficients
Residual standard error
Multiple R-squared and Adjusted R-squared
F-statistic and p-value
Look out for these columns in the Coefficients section:
Estimate                Std. Error               t value                    Pr(>|t|)
Pr(>|t|) is the probability of seeing a t value at least this extreme if the true coefficient were zero. If it is < .05, the coefficient is statistically significant at the 5% level, i.e. there is strong evidence that it differs from zero. If the probability value for both β0 and β1 is less than .05, then Y = β0 + β1(x) looks sound.
Similarly, as a rule of thumb, if Multiple R-squared is greater than .75 the model looks good for predicting the output variable from the input, since the residuals will be relatively small.
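The two checks can also be read off programmatically rather than by eye. This is a sketch on the built-in cars dataset (my stand-in data); the .05 and .75 cut-offs are the rules of thumb from the text.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)
print(s)                               # full output with all sections

coef_table <- s$coefficients           # columns: Estimate, Std. Error, t value, Pr(>|t|)
p_values   <- coef_table[, "Pr(>|t|)"] # one p-value each for β0 and β1

r_squared     <- s$r.squared           # Multiple R-squared
adj_r_squared <- s$adj.r.squared       # Adjusted R-squared

coefficients_ok <- all(p_values < 0.05)
r_squared_ok    <- r_squared > 0.75

print(c(coefficients_ok = coefficients_ok, r_squared_ok = r_squared_ok))
```

On this data both coefficients pass the .05 check, but Multiple R-squared is about .65, so the model falls short of the .75 rule of thumb. That is a useful reminder that the two checks answer different questions.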
Step: 5 > Once the SLR model is defined, it is important for a Data Architect to define the expected confidence level, so that a range with a Lower Confidence Level (LCL) and an Upper Confidence Level (UCL) can be derived from the SLR model. With this we will have 3 values for any input (predictor) value: the estimate, the lower limit and the upper limit.
Execute confint(Linear Regression command from Step 3, level = Confidence Level)
If we are looking for a confidence level of 95%, then Confidence Level is replaced by .95.
Output will be
                   2.5 %        97.5 %
(Intercept)     β0 (LCL)      β0 (UCL)
x               β1 (LCL)      β1 (UCL)
Once we get the two values each for the intercept and the slope, we can form 2 more equations like the one in Step 3. These are the LCL and UCL equations. LCL and UCL vary with the confidence level we would like to obtain: the range broadens as the confidence level increases. Note that confint gives limits for the coefficients themselves; for limits around a predicted value of Y, R's predict command with interval = "confidence" is the more usual tool.
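A short sketch of this step on the built-in cars dataset (stand-in data). The speeds 10 and 20 are arbitrary example inputs, and the predict() call is included as the standard way to get limits around predicted values.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

ci95 <- confint(fit, level = 0.95)     # LCL and UCL for β0 and β1 at 95%
ci99 <- confint(fit, level = 0.99)     # same at 99%
print(ci95)
print(ci99)

# The range broadens as the confidence level increases
width95 <- ci95[, 2] - ci95[, 1]
width99 <- ci99[, 2] - ci99[, 1]
print(width99 > width95)

# Estimate plus lower and upper limits for predicted values
new_data <- data.frame(speed = c(10, 20))   # hypothetical input values
pred <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)
print(pred)                                 # columns: fit, lwr, upr
```

The columns fit, lwr and upr of the last output are the 3 values per input mentioned in this step.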
Thank you for sparing the time to go through this blog. I hope it helped you understand how to build a Simple Linear Regression model with confidence using R. Kindly share your valuable feedback and opinion. Please do not forget to suggest what you would like to understand and hear from me in my future blogs around R Programming.
Thank you...
Outstanding Outliers:: "AG".
 
