R Language Programming, Lecture 1: Variables and Assignment

This series systematically introduces the theory and practice of the R language. R is one of the most popular open-source languages for applied statistics and data analysis, and it combines features of functional programming and object-oriented programming. The barrier to entry is very low: if you only want to estimate a specific model, you just call a package, pass in the inputs, and read the outputs. But someone has to write and optimize those packages, so it is worth learning R programming systematically before relying on it.

This series will first introduce some fundamentals of R programming, such as variables, vectors, vector slices, branches and loops, functions, and visualization. It will then cover functional programming and object-oriented programming in R, and finally metaprogramming and other programming skills. I hope this series can be completed in the next academic year!

Variable names in R

  1. Variable names can contain letters, numbers, underscores (_), and dots (.)
  2. Variable names can only start with a letter or a dot, and a leading dot cannot be followed by a number. Letters are case sensitive
  3. Variable names cannot be reserved words, such as if, for, etc.
  4. Wrapping any string of symbols in backticks (` `) bypasses all of these rules

Example 1 strange but correct variable name

> .c = 3
> `_fff` = 5
> `for` = 3
> for = 3
Error: unexpected '=' in "for ="
> _fff = 5
Error: unexpected input in "_"
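
The naming rules can also be checked programmatically. The sketch below assumes that comparing a name with the output of base R's make.names() is an adequate validity test; is_valid_name is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: a name is treated as valid if make.names()
# leaves it unchanged (make.names() rewrites invalid names and
# appends a dot to reserved words).
is_valid_name <- function(name) {
  name == make.names(name)
}

candidates <- c("x1", ".c", "_fff", "1x", "for", ".2way")
valid <- is_valid_name(candidates)
names(valid) <- candidates
print(valid)

# Backticks let us bind and read a name that breaks the rules.
`_fff` <- 5
print(`_fff` + 1)  # 6
```

Only x1 and .c pass the check; the other four need backticks to be used as variable names.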

The difference between the assignment symbols <- and =

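<- and = are both very common assignment symbols in R. As top-level assignments they do the same thing:

  1. x <- Obj evaluates Obj and makes the variable x point to the resulting object;
  2. y = Obj evaluates Obj and makes the variable y point to the resulting object in exactly the same way;

The two symbols differ in syntax, not in how they bind: = has lower precedence than <-, and inside a function call = names an argument instead of assigning. Whether two variables end up sharing one object depends on the right-hand side (an existing variable versus a newly created value), as the next example shows.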

Example 2 difference between <- and =

library(lobstr) # Using lobstr package, you can analyze the details of R language operation

x <- 1
y <- x
obj_addr(x)
obj_addr(y)

a = 1
b = 1
obj_addr(a)
obj_addr(b)

The first block creates an object with a value of 1 and makes the variable x point to it. It then makes the variable y point to the address that x points to, so y points to the same object. The function obj_addr() returns the address of an object, and the output is

obj_addr(x)
[1] "0x2e164f19ae0"
obj_addr(y)
[1] "0x2e164f19ae0"

This indicates that variable x and variable y point to the same address, that is, the address of the object with value 1.

The second block assigns 1 to variable a and then assigns 1 to variable b. These two 1s are two different objects created separately, so the output of obj_addr() is

obj_addr(a)
[1] "0x2e1663493b8"
obj_addr(b)
[1] "0x2e166349348"

This indicates that variable a and variable b do not point to the same address.

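You can try it yourself. Change y <- x in the first block to y = x, and x and y still share one address. The two blocks behave differently because the first binds y to an existing object while the second creates a new object for each variable, not because of the assignment symbol used.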
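
A quick way to check this, assuming the lobstr package is installed: bind a second name to an existing variable with each symbol and compare addresses. The second half of the sketch shows one place where the two symbols really do behave differently, namely inside a function call, where = names an argument instead of assigning. The names y1, y2, m1 and m2 are illustrative.

```r
library(lobstr)

x <- 1
y1 <- x   # bind y1 to the object x points to
y2 = x    # same binding, written with =

# Both names share the address of x, whichever symbol was used.
same1 <- obj_addr(x) == obj_addr(y1)
same2 <- obj_addr(x) == obj_addr(y2)
print(c(same1, same2))

# Inside a call, = names an argument while <- assigns in the caller.
m1 <- mean(x = c(2, 4, 6))   # argument named x; the variable x is untouched
print(x)                     # still 1
m2 <- mean(x <- c(2, 4, 6))  # assigns the variable x, then passes it on
print(x)                     # now 2 4 6
```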

More details of the assignment symbol <-

Copy on modify and modify in place

  1. Copy on modify: when multiple variables point to the same object and one of them is modified, R copies the object to another memory address, modifies the new object, and finally makes the variable point to the new memory address;
  2. Modify in place: when only one variable points to an object, modifying this variable changes the object directly;

Example 3 copy on modify vs modify in place

x <- c(1,2,3,4)
y <- x
y[[2]] <- 1
x
obj_addr(x)
y
obj_addr(y)

Let's analyze this code first. First, we create a vector object with c(), and let the variable x point to this object, and then let the variable y also point to this object; Next, change the value of the second position in the object pointed to by the variable y to 1, and the magic thing happens! Let's take a look at the output

> x
[1] 1 2 3 4
> obj_addr(x)
[1] "0x2e16fcc0558"

The output of these two lines indicates that the value of the object pointed to by x has not changed;

> y
[1] 1 1 3 4
> obj_addr(y)
[1] "0x2e16fcc0648"

The output of these two lines indicates that y points to a new object, and the value of the new object has changed. This phenomenon is called copy on modify. It is caused by a protection mechanism in R: because variables x and y point to the same object, R avoids changing x when y is modified by copying the object, modifying the copy, and then making y point to the new object. The original object stays bound to x; an object is only collected by the garbage collector, freeing its memory, once no variable points to it.

The advantage of this protection mechanism is that modifying one variable never silently changes another variable that points to the same object. The disadvantage is that it increases the time required for the operation (copy + modify + redirect). You can try it yourself: changing y[[2]] <- 1 to y[[2]] = 1 also leads to copy on modify.

x <- c(1,2,3,4)
cat(tracemem(x), "\n")
y <- x
y[[2]] <- 1
untracemem(x)

This code can help us see the change of address more clearly. tracemem() marks an object so that R prints a message whenever the object is copied, and untracemem() stops the tracking. cat() prints its arguments, and "\n" adds a line break. Let's analyze the first output:

> x <- c(1,2,3,4)
> cat(tracemem(x), "\n")
<000002E16F72B890> 

This is the memory address of the vector object created by the first statement. Now let's look at the second output:

> y <- x
> y[[2]] <- 1
tracemem[0x000002e16f72b890 -> 0x000002e16f728310]: 

This output indicates that when executing y[[2]] <- 1, R copies the object at 0x000002e16f72b890 to 0x000002e16f728310, then changes the second element of the object at 0x000002e16f728310 to 1, and finally makes y point to 0x000002e16f728310.

x <- c(1,2,3,4)
y <- x
obj_addr(y)
y[[2]] <- 1
obj_addr(y)
y[[3]] <- 1
obj_addr(y)

The following is an analysis of this code. If you modify the value of variable y again after the first modification, you will find that the address of y does not change after the second modification:

> x <- c(1,2,3,4)
> y <- x
> obj_addr(y)
[1] "0x2e16f724ca0"
> y[[2]] <- 1
> obj_addr(y)
[1] "0x2e16f724e30"
> y[[3]] <- 1
> obj_addr(y)
[1] "0x2e16f724e30"

The reason is very simple. Now only the variable y points to the object at 0x2e16f724e30, so the value of y is modified directly at that memory location. This phenomenon is called modify in place.

Copy on modify takes more time than modify in place. When optimizing code, you can try to avoid using different variables to point to the same object.

Function call

When a variable is passed to a function, a pointer to the object, not a copy of its value, is passed to the function.

Example 4 pointer passing during R language function call

f <- function(a){
  a
}
x <- c(1,2,3,4)
cat(tracemem(x), "\n")
z1 <- f(x)
z2 = f(x)
untracemem(x)

y = c(1,2,3,4)
cat(tracemem(y), "\n")
w <- f(y)
untracemem(y)

When the first block is executed, tracemem() does not report any copy, so function f receives a pointer to the object behind x rather than a copy of its value. This holds regardless of how the function's return value is assigned to another variable (<- or =). Executing the second block leads to the same conclusion, which shows that the behavior also does not depend on how the referenced object was assigned.
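
Passing a pointer does not mean that a function can change the caller's variable: copy on modify still applies inside the function. A minimal base R sketch (set_second is a hypothetical helper, not part of the original example):

```r
# Hypothetical helper: changes the second element of its argument.
set_second <- function(v, value) {
  v[[2]] <- value  # v is copied here, because x still points to the original
  v
}

x <- c(1, 2, 3, 4)
z <- set_second(x, 0)

print(x)  # 1 2 3 4 (unchanged)
print(z)  # 1 0 3 4
```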

List

Lists in R are more complex than the values and vectors above. For example,

L1 <- list(1,2,3,4)

This statement actually performs three steps:

  1. Create four objects: 1, 2, 3 and 4;
  2. Create a list object, and fill in the addresses of the four objects 1, 2, 3 and 4 in each position;
  3. Let the variable L1 point to the list object

In other words, the list stores addresses, not values! Therefore, the variable L1 points to one address, and its four elements point to four different addresses:

> obj_addr(L1)
[1] "0x143ec3855a0"
> obj_addr(L1[[1]])
[1] "0x143ed747f08"
> obj_addr(L1[[2]])
[1] "0x143ed747ed0"
> obj_addr(L1[[3]])
[1] "0x143ed747e98"
> obj_addr(L1[[4]])
[1] "0x143ed747e60"

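When you modify the value of an element in the list, you essentially create a new object and then change the address stored at the corresponding position in the list. A vector defined by c() is different: it stores its values directly in one block of memory. Calling obj_addr() on x[[i]] still prints a different address for each element, but these are the addresses of temporary length-one vectors created by x[[i]], not of objects stored inside x: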

> x <- c(1,2,3,4)
> obj_addr(x)
[1] "0x143ec843630"
> obj_addr(x[[1]])
[1] "0x143ed8be8d0"
> obj_addr(x[[2]])
[1] "0x143ed8be6a0"
> obj_addr(x[[3]])
[1] "0x143ed8be470"
> obj_addr(x[[4]])
[1] "0x143ed8be240"
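
The list behavior described above can be checked with lobstr (assumed to be installed). In this sketch, L2 is an illustrative second name for the same list; after one element is replaced, the list itself is copied, but the untouched elements are still shared between the two lists:

```r
library(lobstr)

L1 <- list(1, 2, 3, 4)
L2 <- L1          # both names point to the same list
L2[[3]] <- 5      # copy on modify: only the list of pointers is copied

list_copied    <- obj_addr(L1) != obj_addr(L2)
element_shared <- obj_addr(L1[[1]]) == obj_addr(L2[[1]])
element_new    <- obj_addr(L1[[3]]) != obj_addr(L2[[3]])

print(c(list_copied, element_shared, element_new))
```

All three checks should print TRUE, which is why modifying one element of a long list is cheaper than copying every element.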

Data frame

The data frame structure in R is more complex than the list, but we can simply understand it as a list of vectors: each position of the list stores the address of the corresponding column.

> D1 <- data.frame(x=c(1,2,3,4),y=c(1,2,3,4))
> obj_addr(D1)
[1] "0x143ec2e3950"
> obj_addr(D1$x)
[1] "0x143ecc4e148"
> obj_addr(D1$y)
[1] "0x143ecc4de28"
> obj_addr(D1$x[[1]])
[1] "0x143ed911550"
> obj_addr(D1$y[[1]])
[1] "0x143ed9112b0"

When D1 is created, the following actions are performed:

  1. Two numeric vectors holding 1, 2, 3 and 4 are created, each storing its four values directly;
  2. A list with two positions is created, and the positions named x and y store the addresses of the two vectors;
  3. The list is marked as a data frame (class and row names are attached), and D1 points to it;

The addresses printed for D1$x[[1]] and D1$y[[1]] above belong to temporary length-one vectors created by [[, not to objects stored inside the columns.
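
Base R can confirm the overall shape of this structure. The sketch below only inspects types and attributes and makes no claim about memory addresses:

```r
D1 <- data.frame(x = c(1, 2, 3, 4), y = c(1, 2, 3, 4))

print(typeof(D1))             # "list": a data frame is built on a list
print(length(D1))             # 2: one position per column
print(typeof(D1$x))           # "double": each column is an ordinary vector
print(names(attributes(D1)))  # the attributes that turn the list into a data frame
```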

Character

R has a global string pool in which each unique string is stored once. Therefore, each element of a character vector or list is a pointer to a string in the global string pool.

> x <- c("a", "a", "abc", "d")
> ref(x, character = TRUE)
o [1:0x143ed294f28] <chr> 
+-[2:0x143e34cd7a0] <string: "a"> 
+-[2:0x143e34cd7a0] 
+-[3:0x143e62f5830] <string: "abc"> 
\-[4:0x143e369d0a0] <string: "d"> 

With character = TRUE, the ref() function also shows the address in the global string pool of each string referenced by the character vector. From the above results, the two "a" elements are actually the same object in the global string pool, and different strings are stored separately in the pool.
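
One visible consequence of the string pool is that repeating a string many times costs far less memory than the same number of separate copies would. A small sketch, assuming lobstr is installed; the string s and the variable names are illustrative:

```r
library(lobstr)

s <- "bananas bananas bananas"
one  <- as.numeric(obj_size(s))            # size of a one-element vector
many <- as.numeric(obj_size(rep(s, 100)))  # size of a 100-element vector

print(one)
print(many)
print(many < 100 * one)  # TRUE: the string itself is stored only once
```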

Storage space

Use obj_size() to view the size of an object. Because R assignment works through pointers, referring to the same object many times does not significantly increase memory usage:

> x <- runif(1e6)
> obj_size(x)
8,000,048 B
> y <- list(x, x, x)
> obj_size(y)
8,000,128 B
> obj_size(list(NULL, NULL, NULL))
80 B

When creating y, the pointer to x is essentially repeated three times, so the extra memory is only that of a three-element list (80 B). The size calculation also follows a neat rule:

  1. If x is included in y, obj_size(x, y)=obj_size(y);
  2. If x and y do not intersect, obj_size(x, y)=obj_size(x)+obj_size(y);

> obj_size(x,y)
8,000,128 B
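
For comparison, base R's object.size() does not detect shared elements, so it counts x three times. A small sketch using only base R (size_x and size_y are illustrative names):

```r
x <- runif(1e6)
y <- list(x, x, x)  # three pointers to the same vector

size_x <- as.numeric(object.size(x))
size_y <- as.numeric(object.size(y))

print(size_x)
print(size_y)
print(size_y >= 3 * size_x)  # TRUE: the shared vector is counted three times
```

This is one reason to prefer obj_size() when the objects being measured share memory.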

Another interesting thing about R storage is that a sequence with a step size of 1 created with : is stored compactly (since R 3.5.0): only the first and last items are saved, so the size is the same no matter how long the sequence is:

> obj_size(1:3)
680 B
> obj_size(1:300)
680 B
> obj_size(1:30000)
680 B
> obj_size(1:3000000)
680 B
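
The compact storage can be contrasted with an ordinary vector of the same length. This sketch assumes R 3.5.0 or later and the lobstr package; the variable names are illustrative:

```r
library(lobstr)

short_seq <- as.numeric(obj_size(1:3))
long_seq  <- as.numeric(obj_size(1:3000000))
full_vec  <- as.numeric(obj_size(runif(3000000)))  # every value is stored

print(short_seq == long_seq)  # TRUE: both sequences have the same size
print(full_vec > long_seq)    # TRUE: an ordinary vector grows with its length
```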

Tags: R Language

Posted by SpasePeepole on Sun, 22 May 2022 12:23:39 +0300