Z and T distribution values using R

Hello Data Experts,

Let me continue from my last blog, “Normality test using R as part of advanced Exploratory Data Analysis” (http://outstandingoutlier.blogspot.in/2017/08/normality-test-for-data-using-r.html), where I covered the four moments of statistics and the key concepts around probability distributions, the normal distribution and the standard normal distribution. Finally, I also touched upon how to transform data to run a normality test. Let me recap those four moments of statistics.

First moment covers Mean, Median and Mode; it is a measure of central tendency.
Second moment covers Variance, Standard Deviation and Range; it is a measure of dispersion.
Third moment covers Skewness; it is a measure of asymmetry.
Fourth moment covers Kurtosis; it is a measure of peakedness.
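As a quick refresher, the four moments can be computed in base R as in the sketch below. The data vector is made up for illustration, and the skewness and kurtosis lines use the simple population formulas (dividing by n), which is an assumption on my part; packages such as "moments" or "e1071" offer other variants.

```r
# Hypothetical sample data, made up purely for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)
m <- mean(x)

# First moment: central tendency
# (base R has no statistical mode function, so take the most frequent value)
mode_value <- as.numeric(names(which.max(table(x))))
central <- c(mean = m, median = median(x), mode = mode_value)

# Second moment: dispersion (var() and sd() divide by n - 1)
dispersion <- c(variance = var(x), sd = sd(x), range = diff(range(x)))

# Third and fourth moments, simple population formulas (divide by n)
m2 <- mean((x - m)^2)
skewness <- mean((x - m)^3) / m2^1.5   # > 0 means a longer right tail
kurtosis <- mean((x - m)^4) / m2^2     # about 3 for a normal distribution

print(central)
print(dispersion)
print(c(skewness = skewness, kurtosis = kurtosis))
```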

To standardize data, use the “scale” command in R; to get the probability of falling below a value under the Z distribution, run the “pnorm” command. To check whether data follows a normal distribution, we can execute the “qqnorm” and “qqline” commands in R.
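Here is a small sketch of those commands on a made-up vector (the numbers are hypothetical, not from any real data set). To keep it runnable as a script, the Q-Q step only computes the quantile pairs; the plotting calls are shown in a comment.

```r
# Hypothetical sample values, made up for illustration
x <- c(12, 15, 9, 20, 14, 11, 17, 13)

# scale() centres on the sample mean and divides by the sample SD;
# it returns a one-column matrix, so convert it back to a plain vector
z <- as.vector(scale(x))
print(round(z, 3))

# Probability of a value below 15 under a normal model that uses the
# sample mean and SD (equivalent to pnorm() of the z-score of 15)
p_below_15 <- pnorm(15, mean = mean(x), sd = sd(x))
print(p_below_15)

# Q-Q pairs without drawing; in an interactive session run
# qqnorm(x); qqline(x) to see the plot with its reference line
qq <- qqnorm(x, plot.it = FALSE)
print(cor(qq$x, qq$y))  # values close to 1 are consistent with normality
```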

We have learned thus far that, for a continuous distribution, the probability of any single exact value is always zero, but we can get the probability of being less than or greater than a value from the standard normal distribution using pnorm. Generally, in the industry, 95% is the starting benchmark for confidence that the expected outcome will fall within a given range. In statistical terms this is called the confidence level. In simple terms, it means that if we drew many samples and built an interval from each, about 95% of those intervals would contain the population mean.
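The "less than" and "greater than" probabilities can be read off pnorm as in this minimal sketch. The cut-off 1.96 is just the familiar rounded Z value for 95%, used here as an example.

```r
z <- 1.96

p_below  <- pnorm(z)                      # P(Z < 1.96)
p_above  <- pnorm(z, lower.tail = FALSE)  # P(Z > 1.96)
p_within <- pnorm(z) - pnorm(-z)          # P(-1.96 < Z < 1.96), about 0.95

print(round(c(below = p_below, above = p_above, within = p_within), 4))
```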

We will touch upon the Z distribution and the T distribution. There is always an open question about when to use which. As a matter of experience and usage, I follow this guiding principle: if the sample size is < 30 (a sample of fewer than 30 is categorized as small in the statistical world) and the standard deviation of the population is unknown, the T distribution should be the first choice, whereas if the sample size is large, i.e., > 30, and the SD of the population is known, the Z distribution should be the technique. As the sample size increases, the two give closer and closer results.
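The convergence of the two distributions can be checked directly by comparing critical values. The sketch below uses an arbitrary set of degrees of freedom chosen for illustration.

```r
# 97.5th percentile (two-sided 95%) for several degrees of freedom
df <- c(5, 10, 29, 199, 1000)
t_crit <- qt(0.975, df)
z_crit <- qnorm(0.975)

# The T critical value shrinks towards the Z critical value as df grows
print(data.frame(df = df,
                 t_crit = round(t_crit, 4),
                 z_crit = round(z_crit, 4)))
```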

Confidence Interval = Sample mean ± Margin of Error

Z Distribution = Sample mean ± Z_(1-α/2) value * (Population SD / square root of (sample size))
T Distribution = Sample mean ± T_(1-α/2, n-1) value * (Sample SD / square root of (sample size))

Here α = 1 - confidence level, so for 95% confidence α = 0.05 and 1-α/2 = 0.975.

Let us consider an e-retailer with 10,500 registered customers who wants to launch a new offer, but before doing so she would like to gauge how confident she can be of success. Before the full launch, she chose 200 customers and granted them access to the new promotion; on average 5 new products were purchased during this selective launch, with a standard deviation of 6. The e-retailer typically launches a new promotion every month, hence she has an SD from the last launch to the full population, which is 5.5. Before the new full launch she wants a 95% confidence level to go full scale.

Here we have a sample size > 30 (a large sample) and the population SD is also known, thus the Z distribution is the appropriate option.

We can take the manual route and use a Z table to find the Z score for a 95% confidence level and then calculate the confidence interval, but in R it is simple to get the Z score by executing the “qnorm” command.

# for 95% confidence (two-sided), the probability to pass is 0.975 (for easy remembrance follow 95+(100-95)/2 = 97.5%).

qnorm(.975)
result will be 1.959964

Once we have the Z value (1.959964), the sample mean (5), the population SD (5.5) and the sample size (200), applying the formula gives the confidence interval.

5 - (1.959964*(5.5/Square root (200))) to 5 + (1.959964*(5.5/Square root (200)))

The pilot launch tells the e-retailer, with 95% confidence, that the average sale will fall in the range from 4.24 to 5.76.
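The whole Z calculation can be reproduced in a few lines of R. This is a sketch using the figures from the example (mean 5, population SD 5.5, n = 200); the variable names are my own.

```r
sample_mean <- 5
pop_sd      <- 5.5   # known population SD from the previous launch
n           <- 200
conf_level  <- 0.95

# Two-sided critical value: qnorm(0.975) for 95% confidence
z_crit <- qnorm(1 - (1 - conf_level) / 2)

# Margin of error = Z * (population SD / sqrt(n))
margin <- z_crit * pop_sd / sqrt(n)

ci_z <- c(lower = sample_mean - margin, upper = sample_mean + margin)
print(round(ci_z, 2))  # lower 4.24, upper 5.76
```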

Let us assume there was no earlier launch and the e-retailer is launching a promotion for the first time, so the population SD is unknown. In this case, the only change is that instead of the population SD we use the sample SD together with the degrees of freedom. The degrees of freedom can be taken as n-1, because once the mean and n-1 of the values are known, the last value is fixed.

We can take the manual route and use a T table to find the T score for 95% confidence and 199 degrees of freedom and then calculate the confidence interval, but in R it is simple to get the T score by executing the “qt” command.

# for 95% confidence (two-sided), the probability to pass is 0.975 (for easy remembrance follow 95+(100-95)/2 = 97.5%), whereas the degrees of freedom will be 199, i.e., sample size minus 1

qt(.975, 199)
result will be 1.971957

Once we have the T value (1.971957), the sample mean (5), the sample SD (6) and the sample size (200), applying the formula gives the confidence interval.

5 - (1.971957*(6/Square root (200))) to 5 + (1.971957*(6/Square root (200)))

The pilot launch tells the e-retailer, with 95% confidence, that the average sale will fall in the range from 4.16 to 5.84.
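The T version differs only in the critical value and the SD used. Again, this is a sketch with the example's figures (mean 5, sample SD 6, n = 200) and variable names of my own choosing.

```r
sample_mean <- 5
sample_sd   <- 6     # SD measured on the 200 pilot customers
n           <- 200
conf_level  <- 0.95

# Two-sided critical value with n - 1 = 199 degrees of freedom
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)

# Margin of error = T * (sample SD / sqrt(n))
margin <- t_crit * sample_sd / sqrt(n)

ci_t <- c(lower = sample_mean - margin, upper = sample_mean + margin)
print(round(ci_t, 2))  # lower 4.16, upper 5.84
```

When the raw observations are available rather than just summary figures, base R's t.test(x)$conf.int returns the same kind of interval directly.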

If we know the benchmark confidence level, we can proceed with the range as above, but if we would like to find the confidence level that corresponds to an LCL or UCL, we can use the “pt” command.

pt(1.971957, 199)
result will be .975, which for a two-sided interval corresponds to 95% confidence (2 * .975 - 1 = .95).
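The step from the cumulative probability to a two-sided confidence level can be sketched as follows; pnorm plays the same role for a Z score.

```r
# Cumulative probability up to the T score, with 199 degrees of freedom
p_lower <- pt(1.971957, df = 199)

# Two-sided confidence level: remove the tail on both sides
conf_t <- 2 * p_lower - 1
print(round(conf_t, 4))

# Same idea for the Z score from the earlier example
conf_z <- 2 * pnorm(1.959964) - 1
print(round(conf_z, 4))
```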

I hope this topic was helpful in understanding Z and T distribution concepts and how to derive the Z score and T score using R. The sample size and the standard deviation of the population play a key role in deciding which technique to opt for.

Thank you for going through this blog. I hope it helped you build a sound foundation in Z and T distributions using R. Kindly share your valuable opinion, and please do not forget to suggest what you would like to hear from me in my future blogs.

Thank you...

Outstanding Outliers:: "AG".
