Dataset Concept
Different fields use different names for the rows and columns of a dataset. Statisticians call them observations and variables, database analysts call them records and fields, and researchers in data mining and machine learning call them examples and attributes.
data structure
R has many object types for storing data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in the type of data they store, how they are created, the complexity of their structure, and the notation used to locate and access individual elements within them.
In R, an object is anything that can be assigned to a variable, including constants, data structures, functions, and even graphs.
We start with vectors and then explore each data structure in turn.
vector
A vector is a one-dimensional array used to store numeric, character, or logical data. The combine function c() can be used to create a vector. The various types of vectors are shown in the following example:
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
# View the type of the vector.
print(class(a))
Here, a is a numeric vector, b is a character vector, and c is a logical vector. Note that the data in a single vector must all have the same type, or mode (numeric, character, or logical). Data of different modes cannot be mixed in the same vector.
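As a brief illustration of the single-mode rule, the sketch below shows what happens when values of different modes are passed to c(): R does not raise an error but converts everything to one common mode. The variable names mixed_chr and mixed_num are hypothetical and not part of the original example.

```r
# Mixing modes in c() does not fail; R coerces the values to one common mode.
mixed_chr <- c(1, "two", TRUE)   # character wins over numeric and logical
mixed_num <- c(1, TRUE, FALSE)   # logical values are promoted to numeric

print(mixed_chr)
print(class(mixed_chr))
print(mixed_num)
print(class(mixed_num))
```

Checking class() after building a vector is a simple way to notice an unintended conversion.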
Elements in a vector are accessed by giving the numerical value of the element's position in square brackets. For example, a[c(2, 4)] is used to access the second and fourth elements in the vector a. More examples are as follows:
> a <- c("k", "j", "h", "a", "c", "m")
> a[3]
[1] "h"
> a[c(1, 3, 5)]
[1] "k" "h" "c"
> a[2:6]
[1] "j" "h" "a" "c" "m"
The colon operator used in the last statement generates a sequence of numbers. For example, a <- c(2:6) is equivalent to a <- c(2, 3, 4, 5, 6).
scalar
Strictly speaking, there are no scalars in R. A "scalar" can be understood as a vector with only one element, for example f <- 3, g <- "US", and h <- TRUE. They are used to hold constants.
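The following minimal sketch checks this claim using the three constants from the text. It only inspects the objects and assumes nothing beyond base R.

```r
f <- 3
g <- "US"
h <- TRUE

# Each "scalar" is really a vector of length one.
print(is.vector(f))
print(length(g))

# Indexing the first (and only) element returns the same value.
print(identical(h[1], h))
```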
matrix
A matrix is a two-dimensional array in which every element has the same mode (numeric, character, or logical). A matrix can be created with the function matrix(). The general format is:
mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames, char_vector_colnames))
where vector contains the elements of the matrix, nrow and ncol specify the number of rows and columns, and dimnames contains optional row and column names stored as character vectors. The option byrow indicates whether the matrix should be filled by row (byrow=TRUE) or by column (byrow=FALSE). The default is to fill by column.
The following example creates a 5x4 matrix, then a 2x2 matrix with row and column labels that is filled by rows, and finally a 2x2 matrix that is filled by columns.
> y <- matrix(1:20, nrow=5, ncol=4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 26
R2 24 68
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68
You can use subscripts and square brackets to select rows, columns, or elements of a matrix. X[i,] refers to row i of matrix X, X[,j] refers to column j, and X[i, j] refers to the element in row i and column j. To select multiple rows or columns, the subscripts i and j can be numeric vectors. The following example shows the use of matrix subscripts.
First, a 2x5 matrix containing the numbers 1 to 10 is created. By default, matrices are filled by columns. Then the elements of the second row are selected, followed by the elements of the second column. Next, the element in the first row and fourth column is selected. Finally, the elements in the fourth and fifth columns of the first row are selected. Matrices are two-dimensional and, like vectors, can contain only one data type.
> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2,]
[1]  2  4  6  8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
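The listing above uses a vector subscript for columns only. As a small extension, the sketch below passes numeric vectors for both rows and columns. It also shows a detail not covered in the text: selecting a single row returns a plain vector rather than a one-row matrix. The names sub and row2 are hypothetical.

```r
x <- matrix(1:10, nrow = 2)

# Numeric vectors as subscripts select several rows and columns at once.
sub <- x[c(1, 2), c(2, 4)]
print(sub)

# Selecting a single row returns a plain vector, not a matrix.
row2 <- x[2, ]
print(is.matrix(row2))
```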
array
Arrays are similar to matrices, but can have dimensions greater than 2. Arrays can be created with the array function in the following form:
myarray <- array(vector, dimensions, dimnames)
where vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension labels. The following example creates a three-dimensional (2x3x4) numeric array:
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24
Arrays are a natural extension of matrices. They can be useful when writing new statistical methods. Like matrices, the data in an array can have only one mode. Elements are selected from an array in the same way as from a matrix. In the above example, the element z[1,2,3] is 15.
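To make the selection rule concrete, the following sketch rebuilds the same array and picks out the element discussed above by position and by dimension name. Selecting by name and taking a whole slice are shown as illustrations and are not part of the original listing.

```r
z <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

print(dim(z))               # extent of each dimension
print(z[1, 2, 3])           # element selected by position
print(z["A1", "B2", "C3"])  # the same element, selected by dimension names
print(z[, , "C1"])          # a whole 2x3 slice
```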
data frame
Since different columns can contain data in different modes (numeric, character, etc.), the concept of a data frame is more general than that of a matrix. It is similar to datasets seen in SAS, SPSS and Stata. A data frame will be the most commonly handled data structure in R. A data frame can be created with the function data.frame():
mydata <- data.frame(col1, col2, col3,...)
The column vectors col1, col2, col3, and so on can be of any type (such as character, numeric, or logical). The name of each column can be set with the function names(). The following example creates a data frame:
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
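The text mentions that column names can be set with names() but does not show it. A minimal sketch, using a shortened version of the data frame and a hypothetical new column name ageYears, might look like this:

```r
patientdata <- data.frame(patientID = c(1, 2, 3, 4),
                          age = c(25, 34, 28, 52))

print(names(patientdata))            # current column names

# Assigning to names() renames columns in place.
names(patientdata)[2] <- "ageYears"  # hypothetical new name, for illustration
print(names(patientdata))
```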
Each column must have a single mode, but columns of different modes can be combined to form a data frame. Because data frames are close to what analysts typically think of as datasets, the terms column and variable are used interchangeably when discussing data frames.
There are several ways to select elements of a data frame. You can use the subscript notation described earlier (as for matrices), or you can specify column names directly. The following example selects elements of a data frame:
> patientdata[1:2]
  patientID age
1         1  25
2         2  34
3         3  28
4         4  52
> patientdata[c("diabetes", "status")]
  diabetes    status
1    Type1      Poor
2    Type2  Improved
3    Type1 Excellent
4    Type1      Poor
> patientdata$age
[1] 25 34 28 52
The $ notation in the example is new. It is used to select a specific variable from a given data frame. For example, to generate a contingency table of the diabetes type variable diabetes against the condition variable status, use the following code:
> table(patientdata$diabetes, patientdata$status)

        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0
Typing patientdata$ before every variable name can get tiresome, so shortcuts are available. You can use the functions attach() and detach() together, or the function with() alone, to simplify the code.
The function attach() adds a data frame to R's search path. When R encounters a variable name, it checks the data frames in the search path to locate the variable.
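The two shortcuts can be sketched as follows, using the patient data from the earlier example. with() evaluates an expression inside the data frame, while attach() and detach() add the data frame to the search path and remove it again. The names tab and mean_age are hypothetical. This sketch assumes that no other object named age exists in the workspace, because such an object would take precedence over the attached column.

```r
patientdata <- data.frame(
  age = c(25, 34, 28, 52),
  diabetes = c("Type1", "Type2", "Type1", "Type1"),
  status = c("Poor", "Improved", "Excellent", "Poor")
)

# with() evaluates an expression using the data frame's columns as variables.
tab <- with(patientdata, table(diabetes, status))
print(tab)

# attach() puts the data frame on the search path; detach() removes it again.
attach(patientdata)
mean_age <- mean(age)
detach(patientdata)
print(mean_age)
```

with() is often the safer choice because it cannot leave a data frame attached by accident.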
factor
Variables can be categorized as nominal, ordinal, or continuous.
Nominal variables are categorical variables without order. Diabetes type (Type1, Type2) is an example of a nominal variable. Even if Type1 is coded as 1 and Type2 as 2 in the data, no order is implied.
Ordinal variables imply order but not amount. Status (poor, improved, excellent) is a good example of an ordinal variable. A patient with poor status is not doing as well as a patient with improved status, but it is not known by how much.
Continuous variables can take on any value within a range, and both order and amount are implied. Age is a continuous variable that can take values such as 14.5 or 22.8 and any value in between. Clearly, a 15-year-old is one year older than a 14-year-old.
Categorical (nominal) and ordered categorical (ordinal) variables are called factors in R. Factors are very important in R because they determine how data is analyzed and how it is presented visually. The function factor() stores the categorical values as a vector of integers in the range [1…k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) is mapped to these integers.
For example, suppose you have the vector:
diabetes <- c("Type1", "Type2", "Type1", "Type1")
The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it internally with 1=Type1 and 2=Type2 (the assignment is alphabetical). Any analysis performed on the vector diabetes will treat it as a nominal variable and select statistical methods appropriate for this level of measurement.
To represent an ordered variable, you need to specify the parameter ordered=TRUE for the function factor(). Given a vector:
status <- c("Poor", "Improved", "Excellent", "Poor")
The statement status <- factor(status, ordered=TRUE) encodes the vector as (3, 2, 1, 3) and internally associates the values as 1=Excellent, 2=Improved, and 3=Poor. In addition, any analysis performed on this vector will treat it as an ordinal variable and automatically select the appropriate statistical method.
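The internal coding described above can be inspected directly. The sketch below uses as.integer() to reveal the stored integer codes and levels() to show the labels they map to. It assumes the default alphabetical ordering of levels.

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))
status <- factor(c("Poor", "Improved", "Excellent", "Poor"), ordered = TRUE)

print(levels(diabetes))      # level labels, alphabetical by default
print(as.integer(diabetes))  # underlying integer codes
print(levels(status))
print(as.integer(status))
print(is.ordered(status))    # TRUE only for ordered factors
```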
For character vectors, factor levels are created in alphabetical order by default. This works for the factor status because the alphabetical order "Excellent", "Improved", "Poor" happens to match the logical order. If "Poor" had been coded as "Ailing", there would be a problem, because the order would be "Ailing", "Excellent", "Improved". A similar problem arises if the desired order is "Poor", "Improved", "Excellent". The default alphabetical ordering of factors is rarely sufficient. The default can be overridden by specifying the levels option. For example:
status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))
The levels will be assigned as 1=Poor, 2=Improved, 3=Excellent. Make sure that the specified levels match the actual values in the data, because any value that is present in the data but not listed in levels will be set to missing.
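Both points can be checked with a short sketch: explicit levels change the integer codes and the ordering, and a value left out of levels becomes NA. The names ok and partial are hypothetical.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Explicit levels override the alphabetical default.
ok <- factor(status, ordered = TRUE,
             levels = c("Poor", "Improved", "Excellent"))
print(as.integer(ok))
print(ok[1] < ok[3])   # compares Poor with Excellent under this ordering

# A value absent from levels ("Excellent" here) is set to missing.
partial <- factor(status, ordered = TRUE, levels = c("Poor", "Improved"))
print(is.na(partial))
```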
Numeric variables can be encoded as factors using the levels and labels parameters. If males are coded as 1 and females as 2, the following statement:
sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
converts the variable to an unordered factor. Note that the order of the labels must match the order of the levels. In this example, sex will be treated as a categorical variable, the labels "Male" and "Female" will appear in the output instead of 1 and 2, and any value of sex other than 1 or 2 will be set to missing. The following example shows the use of factors:
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame':   4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
> summary(patientdata)
   patientID         age         diabetes       status
 Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1
 1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1
 Median :2.50   Median :31.00             Poor     :2
 Mean   :2.50   Mean   :34.75
 3rd Qu.:3.25   3rd Qu.:38.50
 Max.   :4.00   Max.   :52.00
First, the data is entered as vectors. Then diabetes and status are specified as a regular factor and an ordered factor, respectively. Finally, the data is combined into a data frame. The function str(object) provides information about an object in R (in this case a data frame). It clearly shows that diabetes is a factor and status is an ordered factor, along with how they are coded internally. Note that the function summary() treats the variables differently. It provides the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency counts for each level of the categorical variables diabetes and status.
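The numeric coding with levels and labels described earlier is not exercised in the listing above. The sketch below applies it to a small made-up vector; the data values, including the invalid code 3, and the name sex_f are hypothetical.

```r
sex <- c(1, 2, 2, 1, 3)   # hypothetical codes; 3 is not a valid code

# levels lists the valid codes, labels gives the text shown in output.
sex_f <- factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
print(sex_f)

# useNA = "ifany" makes table() report the value that became missing.
print(table(sex_f, useNA = "ifany"))
```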
list
Lists are one of the most complex data types in R. In general, a list is an ordered collection of objects (components). A list allows you to gather several (possibly unrelated) objects under a single object name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists. A list can be created using the function list():
mylist <- list(object1, object2, ...)
The objects in it can be any of the structures discussed so far. You can also name objects in the list:
mylist <- list(name1=object1, name2=object2, ...)
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"

$ages
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one"   "two"   "three"

> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
This example creates a list with four components: a string, a numeric vector, a matrix, and a character vector. You can combine any number of objects and save them as a list. You can access a component of a list by giving its number or name in double square brackets. In this example, both mylist[[2]] and mylist[["ages"]] refer to the same four-element vector. For named components, mylist$ages also works. Lists are an important data structure in R for two reasons. First, they allow disparate information to be organized and recalled in a simple way. Second, the results of many R functions are returned as lists. It is up to the analyst to extract the components that are needed.
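The access forms above, and the point about function results, can be sketched as follows. The list is a shortened version of the one in the example. The single-bracket form and the use of lm() with the built-in cars dataset are illustrations chosen here, not part of the original text.

```r
mylist <- list(title = "My First List",
               ages = c(25, 26, 18, 39),
               c("one", "two", "three"))

# [[ ]] and $ return the component itself; [ ] returns a one-component list.
print(mylist[["ages"]])
print(mylist$ages)
print(class(mylist["ages"]))

# Many functions return lists; lm() is used here only as an example.
fit <- lm(dist ~ speed, data = cars)
print(is.list(fit))
print(head(names(fit)))
```

Calling names() on a result is a quick way to see which components are available for extraction.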
Some aspects of the R language are unusual. Features to be aware of include the following:
❑ The period (.) in an object name has no special meaning, but the dollar sign ($) has a meaning similar to that of the period in other languages: it identifies a part of a data frame or list. For example, A$x refers to the variable x in data frame A.
❑ R does not provide multi-line or block comments. You must start each line of a multi-line comment with #. For debugging purposes, you can also put code that you want the interpreter to ignore inside the statement if(FALSE){…}. Changing FALSE to TRUE allows this block of code to execute.
❑ When you assign a value to a nonexistent element in a vector, matrix, array, or list, R will automatically expand the data structure to accommodate the new value. For example, consider the following code:
> x <- c(8, 6, 4)
> x[7] <- 10
> x
[1]  8  6  4 NA NA NA 10
By assignment, the vector x is extended from three elements to seven elements. x <- x[1:3] reduces it back down to three elements.
❑ There are no scalars in R. Scalars are represented as one-element vectors.
❑ Subscripts in R do not start from 0, but from 1. In the above vector, the value of x[1] is 8.
❑ Variables cannot be declared. They come into existence when they are first assigned a value.
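Several of the points in this list can be verified in a few lines. The sketch below is only an illustration; the names ran and my_new_variable are hypothetical, and the last check assumes that no object called my_new_variable already exists in the session.

```r
# if (FALSE) { ... } acts like a block comment: the body is parsed but never run.
ran <- FALSE
if (FALSE) {
  ran <- TRUE
}
print(ran)

# Assigning past the end of a vector extends it, padding with NA.
x <- c(8, 6, 4)
x[7] <- 10
print(length(x))
print(x[1])   # subscripts start at 1

# A variable comes into existence on first assignment.
print(exists("my_new_variable"))
my_new_variable <- 42
print(exists("my_new_variable"))
```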
Common functions for working with data
length(object)              # Number of elements/components in an object
dim(object)                 # Dimensions of an object
str(object)                 # Structure of an object
class(object)               # Class or type of an object
mode(object)                # Mode of an object
names(object)               # Names of the components in an object
c(object, object, ...)      # Combine objects into a vector
cbind(object, object, ...)  # Combine objects as columns
rbind(object, object, ...)  # Combine objects as rows
object                      # Print an object
head(object)                # List the first part of an object
tail(object)                # List the last part of an object
ls()                        # List the current objects
rm(object, object, ...)     # Delete one or more objects. The statement
                            # rm(list = ls()) removes almost all objects from
                            # the current working environment; hidden objects
                            # whose names start with a period are not affected
newobject <- edit(object)   # Edit object and save it as newobject
fix(object)                 # Edit an object in place
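As a rough usage sketch, the following applies several of these functions to a small data frame. The data and the names scores, more, and wide are made up for illustration. The interactive editors edit() and fix() are left out because they require a user at the console.

```r
scores <- data.frame(id = 1:3, score = c(90, 85, 70))  # hypothetical data
more <- rbind(scores, data.frame(id = 4, score = 60))  # append a row
wide <- cbind(more, passed = more$score >= 65)         # append a column

print(dim(wide))
print(length(wide))   # for a data frame, the number of columns
print(names(wide))
print(class(wide))
print(mode(wide))
str(wide)
print(head(wide, 2))
print(tail(wide, 1))

rm(scores, more)      # delete the intermediate objects
print(ls())
```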