
Is today's world all about creativity and ideation?

Are they the seeds to be nurtured to bring in automation, innovation and transformation? There is a saying that necessity is the mother of invention. I would say innovation is an amalgamation of creativity and necessity. We need to understand the ecosystem in order to apply creativity and identify the ideas that bring in change. We need to keep pace with a changing ecosystem and think beyond the possible. What is the biggest challenge in doing this? "Unlearning and Learning": we tend to think the current ecosystem is the best. Be it health, financial services, agriculture or the mechanical domain, we need to empathize with the stakeholders to come up with a strategy to drive change. A very evident example is that the quality of life is changing every millisecond. A few decades back a phone connection was limited to a few, but today all millennials have a mobile phone. The phone is no longer just a medium to talk; it is a device so powerful that innovative solutions can be developed on it.

Z and T distribution values using R

Hello Data Experts,

Let me continue from my last blog, “Normality test using R as part of advanced Exploratory Data Analysis”, where I covered the four moments of statistics and the key concepts around probability distributions, the normal distribution and the standard normal distribution. Finally, I also touched upon how to transform data to run a normality test. As a recap, here are those four moments of statistics:

  • The first moment covers Mean, Median and Mode; it is a measure of central tendency.
  • The second moment covers Variance, Standard Deviation and Range; it is a measure of dispersion.
  • The third moment covers Skewness; it is a measure of asymmetry.
  • The fourth moment covers Kurtosis; it is a measure of peakedness.
To get standardized data, use the “scale” command in R, and run the “pnorm” command to get the cumulative probability (at or below a value) under the Z distribution. To check whether data follows a normal distribution, we can execute the “qqnorm” and “qqline” commands in R.
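As a quick refresher, the three commands mentioned above can be sketched as follows. This is only a minimal illustration: the data is simulated, and the names orders, z_scores, p_below and p_above are placeholders of mine, not values from the earlier blog.

```r
# Illustrative data only: 100 simulated values (an assumption, not real data)
set.seed(42)
orders <- rnorm(100, mean = 50, sd = 10)

# scale() centers the data to mean 0 and rescales it to SD 1;
# it returns a one-column matrix, so convert it back to a plain vector
z_scores <- as.numeric(scale(orders))

# pnorm() gives P(Z <= q) under the standard normal distribution
p_below <- pnorm(1.96)       # probability of a Z value at or below 1.96
p_above <- 1 - pnorm(1.96)   # probability of a Z value above 1.96

# Q-Q plot: points lying close to the reference line suggest approximate normality
qqnorm(orders)
qqline(orders)
```

After standardizing, mean(z_scores) should be essentially 0 and sd(z_scores) should be 1, which is a handy sanity check before reading probabilities off the Z distribution.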

We have learned thus far that, for a continuous distribution, the probability of any single exact value is always zero, but we can get the probability of being less than or greater than a value from the standard normal distribution using pnorm. Generally, in the industry, 95% is the starting benchmark for confidence that the expected outcome will fall within a given range. In statistical terms, this is called the confidence level. Put simply, if we drew many samples and built an interval from each, about 95% of those intervals would contain the population mean.

We will touch upon the Z distribution and T distribution techniques. There is always an open query about when to use which technique. As a matter of experience and usage, I follow the guiding principle below: if the sample size is < 30 (a sample of fewer than 30 is categorized as small in the statistical world) and the standard deviation of the population is unknown, the T distribution should be the first choice, whereas if the sample size is large, i.e., > 30, and the SD of the population is known, the Z distribution should be the technique. As the sample size increases, the two techniques give increasingly similar results.
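To see how the two techniques trend closer as the sample grows, here is a small sketch that compares the 97.5% T critical value with the Z critical value for a few degrees of freedom. The particular degrees of freedom are my own picks for illustration.

```r
# Z critical value for 95% two-sided confidence
z_crit <- qnorm(0.975)

# T critical values for a few illustrative degrees of freedom (n - 1)
df <- c(5, 10, 29, 199, 1000)
t_crit <- qt(0.975, df)

# Side-by-side view: the gap to Z shrinks as degrees of freedom grow
comparison <- data.frame(
  df       = df,
  t_crit   = round(t_crit, 4),
  gap_vs_z = round(t_crit - z_crit, 4)
)
print(comparison)
```

The T value always sits above the Z value, so a T-based interval is a little wider, but the difference becomes negligible for large samples.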

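Confidence Interval    =          Sample mean ± Margin of Error
Z Distribution               =          Sample mean ± Z(1-α/2) value * (Population SD / square root of (sample size))
T Distribution               =          Sample mean ± T(1-α/2, n-1) value * (Sample SD / square root of (sample size))
Here α = 1 - confidence level (α = 0.05 for 95% confidence) and n is the sample size.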

Let us consider an e-retailer who has 10,500 registered customers and wants to launch a new offer to them, but before doing so she would like to know how confident she can be of success. Before going for a full launch, she chose 200 customers and granted them access to the new promotion; on average, 5 new products were purchased during this selective launch, with a standard deviation of 6. The e-retailer typically launches a new promotion every month, hence she has the SD from the last launch to the full population, which is 5.5. Before the new full launch, she wants a 95% confidence level to go full scale.

Here we have a sample size > 30 (a big sample) and the population SD is also known, so the Z distribution is the appropriate option.

We can take the manual route and use a Z table to come up with the Z score for the 95% confidence level and then calculate the confidence interval, but in R it is simple to get the Z score by executing the “qnorm” command.

# for 95% confidence, the value to pass is 97.5% (for easy remembrance follow 95+(100-95)/2 = 97.5%).
qnorm(.975)
# result will be 1.959964

Once we have the Z value (1.959964), the sample mean (5), the population SD (5.5) and the sample size (200), applying the formula gives the confidence interval.

5 - (1.959964*(5.5/Square root (200))) to 5 + (1.959964*(5.5/Square root (200)))

The pilot launch tells the e-retailer with 95% confidence that the average sale will fall in the range from 4.24 to 5.76.
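The Z-based calculation can be reproduced end to end in R. The sketch below uses the figures from the e-retailer example (sample mean 5, population SD 5.5, sample size 200); the variable names are my own.

```r
# Inputs from the e-retailer example
sample_mean <- 5     # average products purchased in the pilot
pop_sd      <- 5.5   # population SD known from the previous full launch
n           <- 200   # pilot sample size

# Z critical value for 95% two-sided confidence
z_crit <- qnorm(0.975)

# Margin of error = Z * (population SD / sqrt(n))
margin_z <- z_crit * pop_sd / sqrt(n)

# Confidence interval = sample mean -/+ margin of error
ci_z <- c(lower = sample_mean - margin_z, upper = sample_mean + margin_z)
round(ci_z, 2)
```

Rounded to two decimals, this gives 4.24 and 5.76, matching the range above.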

Let us assume there was no earlier launch and hence the e-retailer is launching a promotion for the first time, so the population SD is unknown. In this case, the only change is that instead of the population SD, it is recommended to use the sample SD together with the degrees of freedom. The degrees of freedom can be taken as n-1 because, once the sample mean and n-1 of the values are known, the last value is fixed.

We can take the manual route and use a T table to come up with the T score for 95% confidence and 199 degrees of freedom and then calculate the confidence interval, but in R it is simple to get the T score by executing the “qt” command.

# for 95% confidence, the value to pass is 97.5% (for easy remembrance follow 95+(100-95)/2 = 97.5%), whereas the degrees of freedom will be 199, i.e., sample size minus 1
qt(.975, 199)
# result will be 1.971957

Once we have the T value (1.971957), the sample mean (5), the sample SD (6) and the sample size (200), applying the formula gives the confidence interval.

5 - (1.971957*(6/Square root (200))) to 5 + (1.971957*(6/Square root (200)))

The pilot launch tells the e-retailer with 95% confidence that the average sale will fall in the range from 4.16 to 5.84.
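The T-based version differs only in the critical value and in using the sample SD. Here is a minimal sketch with the same example figures (sample mean 5, sample SD 6, sample size 200); again, the variable names are my own.

```r
# Inputs from the e-retailer example
sample_mean <- 5    # average products purchased in the pilot
sample_sd   <- 6    # SD observed in the pilot sample
n           <- 200  # pilot sample size

# T critical value for 95% two-sided confidence with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Margin of error = T * (sample SD / sqrt(n))
margin_t <- t_crit * sample_sd / sqrt(n)

# Confidence interval = sample mean -/+ margin of error
ci_t <- c(lower = sample_mean - margin_t, upper = sample_mean + margin_t)
round(ci_t, 2)
```

Rounded to two decimals, this gives 4.16 and 5.84. The interval is a bit wider than the Z-based one, mainly because the sample SD (6) is larger than the population SD (5.5) and partly because the T critical value is slightly larger.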

If we know the benchmark confidence level, we can proceed with the range, but if we would like to find the confidence level that corresponds to a given LCL or UCL (lower or upper confidence limit), we can use
pt(1.971957, 199)
result will be .975, i.e., 2.5% in each tail, which corresponds to 95% two-sided confidence.
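To make the step from .975 to 95% explicit, the cumulative probability can be converted into a two-sided confidence level as sketched below. The variable names are illustrative.

```r
# T score and degrees of freedom from the example
t_value <- 1.971957
df      <- 199

# pt() gives the cumulative probability P(T <= t_value)
cum_prob <- pt(t_value, df)

# Two-sided confidence: remove the tail beyond +t_value and the matching tail below -t_value
two_sided_conf <- 2 * cum_prob - 1
round(two_sided_conf, 4)
```

This is simply the reverse of the “qt” step: qt turns a probability into a T score, and pt turns a T score back into a probability.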

I hope this topic was helpful in understanding the Z and T distribution concepts and how to derive the Z score and T score using R. Sample size and the standard deviation of the population play a key role in deciding which technique to opt for.

Thank you for going through this blog. I hope it helped you build a sound foundation in Z and T distributions using R. Kindly share your valuable opinion, and please do not forget to suggest what you would like to understand and hear from me in my future blogs.

Thank you...
Outstanding Outliers:: "AG".  



