Dataset using R



Hello Data Experts,

Let me continue from my last blog, “Basic R Programming” (http://outstandingoutlier.blogspot.in/2017/08/basic-r-programing.html), where we discussed assignment and data operators.

Let us move forward and understand how to load datasets. First, let us create a .csv file with some data in it to use for this session. Copy the data below, paste it into Notepad, and save it as “Plasma.csv”.

**************
Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0
1,89,66,23
0,137,40,35
5,116,74,0
3,78,50,32
10,115,0,0
2,197,70,45
8,125,96,0
4,110,92,0
10,168,74,0
10,139,80,0
1,189,60,23
**************
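If you would rather not create the file by hand, the same file can be written from R itself. This is only a sketch: saving into tempdir() and the names csv_lines and csv_path are my assumptions for illustration, so point the path at your own folder as needed.

```r
# Header and rows copied from the data listed above
csv_lines <- c(
  "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness",
  "6,148,72,35",
  "1,85,66,29",
  "8,183,64,0",
  "1,89,66,23",
  "0,137,40,35",
  "5,116,74,0",
  "3,78,50,32",
  "10,115,0,0",
  "2,197,70,45",
  "8,125,96,0",
  "4,110,92,0",
  "10,168,74,0",
  "10,139,80,0",
  "1,189,60,23"
)

# Assumption: a temporary folder is used here; replace with your own directory
csv_path <- file.path(tempdir(), "Plasma.csv")

# Write one line of text per element of csv_lines
writeLines(csv_lines, csv_path)

# Confirm that the file now exists
file.exists(csv_path)
```

The file has one header line followed by 14 data lines.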
As a best practice, one should set the working directory using the setwd() command. First, let us find out which working directory is set by default by executing the command below.
getwd()

If the result of the above command does not match your source directory location, set the working directory path by executing either of the commands below.
setwd("C:/ABC/XYZ")
setwd("C:\\ABC\\XYZ")

Kindly note that a Windows file path uses "\" by default, whereas R expects "/". Because "\" is the escape character inside R strings, the other way to write the same path is to double it as “\\”.
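To see that both spellings describe the same path, here is a small sketch. The path C:/ABC/XYZ is only the placeholder used above, so it is treated as plain text here, and tempdir() stands in for a real folder (my assumption) when actually changing the directory.

```r
# The doubled backslash is how a single "\" is typed inside an R string
win_path <- "C:\\ABC\\XYZ"
nchar(win_path)   # 10 characters: each "\\" counts as one

# Convert backslashes to forward slashes (fixed = TRUE means no regular expression)
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path            # "C:/ABC/XYZ"

# Change the working directory safely: remember the old one and restore it
old_dir <- getwd()
setwd(tempdir())  # assumption: any existing folder would do here
getwd()
setwd(old_dir)
```

Saving the old directory first makes it easy to return to where you started.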

To load the data from a CSV file, we should use the read.csv() command. Let us load the Plasma.csv data into an object: create an object named “Plasma” and assign to it the result of read.csv() called with the file path.
Plasma <- read.csv("Plasma.csv")

In the Global Environment window, we can notice a new row added for Plasma with the details below:
Observations are the same as the number of rows
Variables are the same as the number of columns
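The same counts can be checked from the console with dim(). In this sketch the data is read from a text string through the text argument of read.csv(), purely so the example runs without the file; the name plasma_csv is my own.

```r
# Same content as Plasma.csv, held in a string so no file is needed
plasma_csv <- "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0
1,89,66,23
0,137,40,35
5,116,74,0
3,78,50,32
10,115,0,0
2,197,70,45
8,125,96,0
4,110,92,0
10,168,74,0
10,139,80,0
1,189,60,23"

Plasma <- read.csv(text = plasma_csv)

# dim() returns rows first, then columns:
# 14 observations (rows) and 4 variables (columns)
dim(Plasma)
```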

Now, to see the data that got loaded into the Plasma object, we can execute the View command.
View(Plasma)

Another quick way to achieve the same is to click on that row in the Global Environment window; the data will open in the data viewer on another tab. Once the data is loaded into the object, let us get the list of columns by executing one of the commands below:
names(Plasma)
colnames(Plasma)

Output will be as below
"Number.of.times.Pregnant"  "Plasma.glucose.concentration" "Diastolic.blood.pressure"   
"Triceps.Skin.fold.thickness"

Similarly, to get the names of the rows, execute the command below.
rownames(Plasma)

Output will be as below:
"1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"

Let us now understand the structure of loaded data by executing below command. It will help us understand number of observations and variables in the dataset, along with the datatype for each variable.
str(Plasma)

Output will be as shown below:
'data.frame':         14 obs. of  4 variables:
$ Number.of.times.Pregnant    : int  6 1 8 1 0 5 3 10 2 8 ...
$ Plasma.glucose.concentration: int  148 85 183 89 137 116 78 115 197 125 ...
$ Diastolic.blood.pressure    : int  72 66 64 66 40 74 50 0 70 96 ...
$ Triceps.Skin.fold.thickness : int  35 29 0 23 35 0 32 0 45 0 ...
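The dots in the column names above appear because read.csv() converts the headers into valid R names by default. As a hedged sketch, the check.names argument of read.csv() can be switched off to keep the original header text; only the first two data rows are used here to keep the example short.

```r
# Header plus the first two data rows only, to keep the example short
csv_text <- "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29"

# Default behaviour: headers are passed through make.names()
default_names <- names(read.csv(text = csv_text))
default_names[1]   # "Number.of.times.Pregnant"

# check.names = FALSE keeps the header text as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
raw_names[1]       # "Number of times Pregnant"

# make.names() shows the conversion on its own
make.names("Number of times Pregnant")
```

Names with spaces have to be wrapped in backticks when used with $, which is one reason the default dotted names are often more convenient.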

Once the dataset is loaded, how can we retrieve values from it?

To get the data for the second row, along with the header, execute the command below.
Plasma[2,]

To get the data for the second column, execute the command below.
Plasma[,2]

We can also execute the command below. Since it does not explicitly state a row or a column, R treats the single index as a column and returns that column for all rows, as a one-column data frame.
Plasma[2]
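The two column commands above do not return the same kind of object, which matters once the result is passed to other functions. A minimal sketch, using only the first three rows of the data and variable names of my own choosing:

```r
# First three rows of the data only, to keep the example short
Plasma <- read.csv(text = "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

col_vector <- Plasma[, 2]               # plain vector of values
col_frame  <- Plasma[2]                 # data frame with one column
col_kept   <- Plasma[, 2, drop = FALSE] # drop = FALSE also keeps a data frame

class(col_vector)  # "integer"
class(col_frame)   # "data.frame"
class(col_kept)    # "data.frame"
```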

To retrieve a specific value from the dataset when we know its coordinates, specify the row number followed by the column number.
Plasma[2,3]

To retrieve the values of the first 2 rows, but only from the third column, execute the command below.
Plasma[1:2,3]

We can also combine data operators with the retrieved data, as shown below.
sum(Plasma[1:2,3])
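For reference, here is what the three coordinate-based commands above return for this data. The sketch loads only the first three rows, which is enough for these indices.

```r
# First three rows of the data only; enough for the indices used here
Plasma <- read.csv(text = "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

Plasma[2, 3]         # 66: row 2, column 3 (diastolic blood pressure)
Plasma[1:2, 3]       # 72 66: rows 1 and 2 of column 3
sum(Plasma[1:2, 3])  # 138: 72 + 66
```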

It is not always easy to retrieve data by coordinates; in that case we should be able to get values by column and row name. This is how we can reference a column by name to retrieve all of its values.
Plasma$Number.of.times.Pregnant
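Names can also be used inside the square brackets, in place of the numeric coordinates. A short sketch on the first three rows; the variable names bp, one_value, two_cols and one_row are mine.

```r
# First three rows of the data only, to keep the example short
Plasma <- read.csv(text = "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

# Whole column by name, same values as Plasma$Diastolic.blood.pressure
bp <- Plasma[["Diastolic.blood.pressure"]]

# Single value: row 2 of the named column
one_value <- Plasma[2, "Diastolic.blood.pressure"]   # 66

# Several columns by name
two_cols <- Plasma[, c("Number.of.times.Pregnant", "Plasma.glucose.concentration")]

# A row by its row name (a character string, not a number)
one_row <- Plasma["2", ]
```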

As we move on to advanced statistical programming, we will feel the need to add a new calculated column. This is how we can add one.
Plasma$NewCol <- 1 + Plasma$Number.of.times.Pregnant

We can always retrieve a subset of the dataset by specifying rows and columns as below. The outcome will be all rows with only the first 4 columns.
Plasma[,1:4]

To retrieve the number of rows in a dataset, we can execute the following command.
length(Plasma$Number.of.times.Pregnant)
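One thing to watch for: length() counts rows only when it is given a single column. Given the whole data frame it counts columns, so nrow() is a safer way to ask for the number of rows. A sketch on the first three rows of the data:

```r
# First three rows of the data only, so the row count here is 3, not 14
Plasma <- read.csv(text = "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

length(Plasma$Number.of.times.Pregnant)  # 3: number of values in one column
nrow(Plasma)                             # 3: number of rows, asked directly
length(Plasma)                           # 4: number of columns, not rows
```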

A filtered dataset can be created based on certain criteria, as shown below.
Plasma[Plasma$Number.of.times.Pregnant == 10 & Plasma$Diastolic.blood.pressure > 0,]
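As an alternative sketch, base R's subset() function expresses the same filter without repeating Plasma$ for every column. The data is again read from a string so the example runs on its own; the names by_bracket and by_subset are mine.

```r
Plasma <- read.csv(text = "Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0
1,89,66,23
0,137,40,35
5,116,74,0
3,78,50,32
10,115,0,0
2,197,70,45
8,125,96,0
4,110,92,0
10,168,74,0
10,139,80,0
1,189,60,23")

# The filter from above, written with square brackets
by_bracket <- Plasma[Plasma$Number.of.times.Pregnant == 10 &
                       Plasma$Diastolic.blood.pressure > 0, ]

# The same filter with subset(); column names are used directly
by_subset <- subset(Plasma,
                    Number.of.times.Pregnant == 10 & Diastolic.blood.pressure > 0)

# Both keep rows 12 and 13; row 8 is dropped because its blood pressure is 0
rownames(by_bracket)
rownames(by_subset)
```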

I hope this blog helped you understand how to load a dataset and execute commands to work on it. Now that we understand the various data operations and dataset-related commands, we should be able to move forward with some advanced aspects of R programming. In my next blog, I will cover “Basic statistical programming using R Studio”.

Thank you for staying with me and reading through this blog; I hope it was insightful. Kindly share your valuable opinion. Please do not forget to suggest what you would like to understand and hear from me in my future blogs.

Thank you...

Outstanding Outliers:: "AG".  

