The place where the dream begins - C language preprocessing + compilation process

Compilation (preprocessing) of C language programs

The C language standard specifies that in any implementation of C, there are two distinct environments.

The first is the translation environment, in which source code is converted into executable machine instructions. The second is the execution environment, in which the code is actually executed.

1. Compile and link

Suppose there is a source file test.c. It has to be compiled and then linked to become an executable program: compile --> link --> executable program

If there are multiple .c source files, each of them is compiled separately, and the linker then combines the results into a single executable program

  • Each source file that makes up a program is converted into object code by compilation
  • The object files are bundled together by the linker to form a single, complete executable program
  • The linker also pulls in any C standard library functions used by the program. It can also search the programmer's own libraries and link the functions it needs into the program.

1) Several stages of compilation

Note: At this time, my environment is the gcc compiler of Centos7.6

code:

test.c

#include <stdio.h>
#define MAX 666666
//Declare external functions
extern int add(int x, int y);
int main()
{
	int a = 10;
	int b = 20;
	int tmp = MAX;
	printf("%d\n", a + b);

	return 0;
}

add.c

int add(int x, int y)
{
	return x + y;
}

The compilation in the translation environment can be divided into three stages

  • Precompilation (preprocessing)
  • Compilation
  • Assembly

Precompilation stage

During precompilation the preprocessor does several things

  1. Header file inclusion
  2. Deletion of comments
  3. Replacement of symbols defined with #define
  4. Handling of the other preprocessing directives
  5. ...

On Linux, gcc -E test.c > test.i precompiles test.c and stops right after precompilation; the result is saved in test.i so that it is easy to inspect

Looking at test.i, we find that our comments are gone, the #include line is gone, and in its place there is a long list of declarations.

What we found

  1. The #include <stdio.h> line is gone, replaced by the contents of the header
  2. The comments we wrote were deleted
  3. MAX, defined by #define, was replaced by its value

We can verify the header inclusion. On my Linux system stdio.h is stored at /usr/include/stdio.h. Opening it shows that the declarations in it are indeed what we saw in test.i. So the precompilation stage copies the contents of the header file into the source file.

Compilation stage

In the compilation stage the C code is translated into assembly code, and several things are done

  1. Lexical analysis
  2. Syntax analysis
  3. Semantic analysis
  4. Symbol summary

Lexical analysis: split the C source text into tokens (keywords, identifiers, constants, operators, and so on)

Syntax analysis: check whether the tokens form grammatically valid code and build a syntax tree from them

Semantic analysis: check what the code means under the rules of the C language (for example, types and declarations), so that it can then be translated into the corresponding assembly code

Compile test.c with gcc -S test.c; gcc stops after the compilation stage and saves the result in test.s. What is saved in test.s is the assembly code.

Symbol summary is a very important process in the compilation phase.

Symbol summary means collecting the global symbols of the file, such as the names of functions and global variables

Let's simply modify the test.c file

#include <stdio.h>
#define MAX 666666
//Declare external functions
extern int add(int x, int y);
int count = 0;
void print()
{
    
}
int main()
{
	int a = 10;
	int b = 20;
	int tmp = MAX;
	printf("%d\n", a + b);

	return 0;
}

On Linux, the command gcc -c test.c generates the object file test.o, which corresponds to a .obj file on Windows

Inspecting it with the readelf -s test.o command shows that only global functions and variables are recorded, such as main, print, and count; local variables such as a, b, and tmp do not appear

Now look at add.c: it contains only one function, add

So the symbols of each file are summarized separately: test.c contributes symbols such as main, print, and count, and add.c contributes add.

Assembly stage

On Linux, the command gcc -c test.s converts the assembly code in test.s into binary instructions and generates the binary file test.o

The assembly stage also forms the symbol table; the compilation stage only summarized the symbols. The assembly stage generates a .o file (.obj on Windows), records the previously summarized symbols in a symbol table, and assigns each of them an address.

Note: the add in test.c is only a declaration, so the address assigned to it there is meaningless. It acts as a placeholder; whether the function really exists depends on whether add is defined in some other file


2) Link

On Linux, gcc test.o links the object file and generates the executable a.out (the counterpart of a .exe file on Windows)

Two main things happen during linking

  1. Merging of segment tables
  2. Merging and relocation of symbol tables

Simply put, the linker joins multiple .o object files. A project usually produces several object files, and after compilation these files know nothing about each other. The linker connects them by matching the symbols one file references with the definitions another file provides.

If a function is referenced but not defined anywhere, linking fails with an error (an unresolved external symbol)

Merging segment tables and symbol tables can be understood like this: sections of the same kind from the different object files are merged into one, and for each symbol only the entry with a valid address is kept. For example, the merged symbol table keeps the add symbol with the address of its definition. During linking, the linker also checks that every external function and symbol that is referenced has a valid definition.

After linking, an executable program is generated.

Execution environment

  1. The program must be loaded into memory. In an environment with an operating system, this is generally done by the operating system. In a freestanding environment, loading must be arranged manually, possibly by placing the executable code into read-only memory
  2. Execution of the program starts, and the main function is called
  3. The program code is executed. The program uses a runtime stack, where each function call gets a stack frame that stores the function's local variables and return address. The program can also use static memory; variables stored in static memory keep their values throughout the execution of the program
  4. The program terminates, either normally by returning from main or unexpectedly

2. Preprocessing

1) Predefined symbols

C has some predefined symbols that hold the information listed below. They are also replaced directly in the preprocessing stage.

__FILE__ //name of the source file being compiled
__LINE__ //current line number in the file
__DATE__ //date the file was compiled
__TIME__ //time the file was compiled
__STDC__ //1 if the compiler conforms to ANSI C, otherwise undefined

#include <stdio.h>

int main()
{
	printf("source files for compilation: %s\n",__FILE__);
	printf("the current line number of the file: %d\n",__LINE__);
	printf("The date the file was compiled: %s\n",__DATE__);
	printf("The time the file was compiled: %s\n",__TIME__);

	return 0;
}

In VS2019 with its default settings, __STDC__ is not defined at all, so using it there is an error; this indicates that VS2019 does not strictly follow ANSI C by default. With gcc on Linux, printing __STDC__ outputs 1, indicating that the compiler conforms to the C standard.

2) #define

Identifiers can be defined through #define, and macros can also be defined

Defining identifiers with #define

#include <stdio.h>
#define MAX 100000
#define STR "hello"
#define PRINTLN printf("\n")
int main()
{
	printf("%d", MAX);
	PRINTLN;
	printf("%s", STR);
	
	

	return 0;
}

operation result

100000
hello

When defining a symbol with #define, should a semicolon be added at the end?

The recommendation is not to add one. The semicolon becomes part of the replacement text and is pasted in at every use, so write the semicolon at the place of use when it is needed.

Defining macros with #define

The #define mechanism includes a provision that allows parameters to be substituted into the replacement text. This kind of definition is usually called a macro.

Write a macro to calculate the sum of two numbers

#include <stdio.h>
#define ADD(x,y) x+y
int main()
{
	int a = 10;
	int b = 20;
	printf("%d\n", ADD(a, b));
	
	

	return 0;
}

This definition has a problem, which shows up in code like the following

#include <stdio.h>
#define ADD(x,y) x+y
int main()
{
	int a = 10;
	int b = 20;
	printf("%d\n", ADD(a, b)*ADD(a,b));



	return 0;
}

print result

230

This is not the result we want, because the replacement runs into an operator precedence problem

printf("%d\n", ADD(a, b)*ADD(a,b));
	//Equivalent to printf("%d\n", 10+20*10+20);

The solution is to add parentheses to each parameter of the macro, and add parentheses to the whole

#include <stdio.h>
#define ADD(x,y) ((x)+(y))
int main()
{
	int a = 10;
	int b = 20;
	printf("%d\n", ADD(a, b)*ADD(a,b));


	return 0;
}

Therefore, when a macro computes a numeric expression, put parentheses around each parameter and around the whole body, so that operators in the arguments or next to the macro cannot change the result.

Rules for #define replacement

Expanding #define symbols and macros in a program involves the following steps

  1. When a macro is invoked, its arguments are first checked for symbols defined by #define. If there are any, they are replaced first
  2. The replacement text is then inserted into the program in place of the original text. For macros, parameter names are replaced by the argument values
  3. Finally, the result is scanned again to see whether it contains any symbols defined by #define. If it does, the above process is repeated

Notice

  1. Other #define symbols can appear in macro arguments and in #define definitions. A macro, however, cannot be recursive
  2. When the preprocessor searches for symbols defined by #define, the contents of string constants are not searched

For example, the following way of writing is no problem

#define MAX 1000
#define add(x,y) ((x)+(y))*MAX

Using symbols defined by #define inside a macro like this is fine, but a macro cannot invoke itself

3) # and ##

How to insert parameters into the string?

Before answering this question, let's take a look at another string writing method in C language

#include <stdio.h>

int main()
{
	char* str = "hello" "world;";
	printf("%s\n", str);
	printf("123" "abc\n");
	return 0;
}

print result

helloworld;
123abc

Write two strings together, and they will be automatically concatenated into one string at compile time.

Now we want to print the name and the value of each variable inside one string. Writing this by hand means repeating almost the same printf for every variable. Macros can help here.

#include <stdio.h>

int main()
{
	float f = 4.5f;
	printf("the value of f is %f\n", f);

	int a = 10;
	printf("the value of a is %d\n", a);

	int b = 20;
	printf("the value of b is %d\n", b);

	return 0;
}

With a macro definition the code can be written like this; it has the same effect as the code above and avoids the repetition.

#include <stdio.h>
#define PRINT(data, format) printf("the value of "#data" is %"#format"\n",data)
int main()
{
	float f = 4.5f;
	PRINT(f, f);

	int a = 10;
	PRINT(a, d);


	int b = 20;
	PRINT(b,d);

	return 0;
}

#data turns the argument passed for data into a string literal. For PRINT(a, d), #data becomes "a" and #format becomes "d" during precompilation.

The ## operator

## combines the symbols on both sides of it into one symbol. It allows macro definitions to create identifiers from separate text fragments.

#include <stdio.h>
#define APPEND(str,number) str##number
int main()
{
	int day100 = 2022;
	printf("%d\n", APPEND(day,100));

	return 0;
}

operation result

2022

4) Macro parameters with side effects

When a macro parameter appears more than once in the macro definition and the argument has side effects, using the macro can be dangerous and lead to unpredictable results.

For example, the following code has side effects

#include <stdio.h>
#define MAX(x,y) ((x)>(y)?(x):(y))
int main()
{
	int a = 10;
	int b = 20;
	printf("%d\n", MAX(a++, b++));
	printf("a=%d b=%d\n", a, b);

	return 0;
}

Here b++ is executed twice, because the macro call is replaced by the following expression, so the program prints 21 and then a=11 b=22

printf("%d\n", ((a++) > (b++) ? (a++) : (b++)));

This is what is meant by macro parameters with side effects

5) Comparison of macros and functions

Macros are usually used for simple operations, such as finding the sum of two numbers

#define ADD(x,y) ((x)+(y))

So why not use a function for this task?

int add(int x, int y)
{
    return x + y;
}

Advantages of macros over functions

  1. The code for calling a function and returning from it can take more time than the small computation itself, so for such small jobs macros beat functions in both program size and speed

    Compare the assembly code generated for the two versions: the macro version of the sum compiles to only 7 lines of assembly, while the function version takes more than a dozen lines.

    The macro is replaced by its body during precompilation and the calculation is then done in place, while the function goes through three steps: call + operation + return.

  2. More importantly, the parameters of a function must be declared with specific types, so a function can only be used with expressions of the appropriate type. The macro, in contrast, works for integer, long integer, floating-point, and any other type that can be used with the + operator. Macros are type-independent

    The ADD macro above can calculate sums of various types, while the add function can only calculate sums of integers.

    Another example is the commonly used malloc function, which allocates memory. We can write a macro that takes a type as a parameter to allocate memory; a function cannot take a type as a parameter.

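    #include <stdio.h>
    #include <stdlib.h>
    //The type is a macro parameter; size is parenthesized so that expressions work
    #define MALLOC(size,type) ((type*)malloc(sizeof(type)*(size)))
    
    int main()
    {
    	int* arr = MALLOC(10, int);
    	int i = 0;
    	//malloc returns NULL when the allocation fails
    	if (arr == NULL)
    	{
    		return 1;
    	}
    	for (i = 0; i < 10; i++)
    	{
    		arr[i] = i;
    	}
    	for (i = 0; i < 10; i++)
    	{
    		printf("%d ", arr[i]);
    	}
    	//Release the memory when it is no longer needed
    	free(arr);
    	return 0;
    }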
    

Disadvantages of Macros vs. Functions

  1. Every time a macro is used, a copy of the code defined by the macro is inserted into the program. Unless the macro is relatively short, it may significantly increase the length of the program
  2. Macros cannot be debugged
  3. Macros are not rigorous enough because they are type-independent
  4. Macros may introduce operator precedence issues, making programs prone to errors

Comparative summary

| Attribute | #define macro | Function |
| --- | --- | --- |
| Code length | The macro code is inserted into the program each time it is used. Except for very small macros, the program can grow substantially | The function code appears in only one place; every use of the function calls that same code |
| Execution speed | Faster | Slower, because of the additional overhead of the function call and return |
| Operator precedence | Macro arguments are evaluated in the context of the surrounding expression. Unless parentheses are added, the precedence of adjacent operators can produce unexpected results, so macros should be written with plenty of parentheses | Function arguments are evaluated once when the function is called and the resulting values are passed to the function, so the result of the expression is more predictable |
| Side effects | A parameter may be substituted in several places in the macro body, so arguments with side effects can produce unpredictable results | Function arguments are evaluated only once when they are passed, so the result is easier to control |
| Parameter types | Macro parameters are type-independent. As long as the operations on the arguments are legal, the macro can be used with any argument type | Function parameters are tied to their types. Different argument types require different functions, even if the tasks they perform are the same |
| Debugging | Macros are inconvenient to debug | Functions can be debugged statement by statement |
| Recursion | Macros cannot be recursive | Functions can be recursive |

3. Common preprocessing commands

1) #undef

This directive removes a macro definition

#include <stdio.h>
#define MAX 1000

int main()
{
	int tmp = MAX;
#undef MAX
	int ret = MAX;//error: MAX is no longer defined here

	
	return 0;
}

2) Command line definition

Many C compilers provide the ability to define symbols on the command line when starting the compilation. This is useful, for example, when we want to compile different versions of a program from the same source file.

#include <stdio.h>


int main()
{
	int arr[SIZE] = { 0 };
	int i = 0;
	for (i = 0; i < SIZE; i++)
	{
		arr[i] = i;
	}
	for (i = 0; i < SIZE; i++)
	{
		printf("%d ", arr[i]);
	}

	
	return 0;
}

On 64-bit Linux, compile test.c with the command gcc -D SIZE=10 test.c to generate a.out. When run, the program uses an array of size 10

[root@aliyun code]# ./a.out 
0 1 2 3 4 5 6 7 8 9

3) Conditional compilation

When compiling a program, it is convenient to be able to compile or discard a statement (or a group of statements). Conditional compilation directives make this possible.
For example:
Debugging code is a pity to delete but gets in the way if kept, so we can compile it selectively

#include <stdio.h>
#define DEBUG 1

int main()
{
	printf("hello world!\n");
#ifdef DEBUG
	printf("test");
#endif // DEBUG


	
	return 0;
}

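Note that #ifdef only tests whether DEBUG is defined, not what its value is. Even if DEBUG is defined as 0, the line that prints test is still compiled; to leave it out, remove the #define DEBUG line or test the value with #if DEBUG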

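Of course, multiple branches are also possible. The conditions are evaluated by the preprocessor, so they must be constant expressions; an identifier that is not a macro, such as the variable a below, is treated as 0, and this program therefore prints haha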

#include <stdio.h>
#define DEBUG 0

int main()
{
	int a = 0;
	printf("hello world!\n");
#if DEBUG
	printf("test");
#elif a
	printf("false");
#else
	printf("haha");
#endif // DEBUG

	return 0;
}

Nested conditional compilation

#include <stdio.h>
#define DEBUG 0

int main()
{
	int a = 0;
	printf("hello world!\n");
#if defined(DEBUG)
	#if 0
	printf("0");
	#elif a-1
	printf("0");
	#else
	printf("1");
	#endif

#endif



	
	return 0;
}

4) File inclusion

We already know that the #include directive causes another file to be compiled as if its contents appeared at the place of the #include directive.
The substitution works in a simple way: the preprocessor removes the directive and replaces it with the contents of the included file. If such a file is included 10 times, it is actually compiled 10 times.

How header files are included

  • Local file inclusion

    #include "add.h"
    

    Search method: the compiler first searches for add.h in the directory where the source file is located. If the header is not found there, the compiler searches the standard locations, just as it does for library headers. If it still cannot be found, a compilation error is reported.

    Standard header file path in the Linux environment: /usr/include/

  • Library file inclusion

    #include <stdio.h>
    

    The header is searched for directly in the standard path. If it cannot be found, a compilation error is reported.

    Library headers can also be included with "", but the search is then less efficient, and it is harder to tell whether a header is a local file or a library file.

Avoid duplication of header files

Once there are many header files, the same header may end up being included more than once.

For example, including it several times like this

#include "add.h"
#include "add.h"
#include "add.h"
int main()
{	
	return 0;
}

Suppose the content of add.h is like this

int Add(int a, int b);

Then the precompiled file looks like this; the repeated inclusion leads to redundant code.

# 1 "test.c"
# 1 "add.h" 1
int Add(int a, int b);
# 2 "test.c" 2
# 1 "add.h" 1
int Add(int a, int b);
# 3 "test.c" 2
# 1 "add.h" 1
int Add(int a, int b);
# 4 "test.c" 2
int main()
{
 return 0;
}

So how can this be avoided?

With conditional compilation

#ifndef checks whether ADD_FUNC has already been defined. If it has, the content of the header is skipped; if not, ADD_FUNC is defined and the content is kept

#ifndef ADD_FUNC
#define ADD_FUNC
int Add(int a, int b);
#endif

There is also a simpler way to achieve the same effect: #pragma once, which is not part of the C standard but is supported by most compilers.

#pragma once
int Add(int a, int b);

5) Implement offsetof

offsetof can be simulated with a macro

Cast 0 to a pointer to the structure, that is, pretend the structure is located at address 0. Then take the address of the member and convert it to an integer; that value is the offset.

This is similar to subtracting the start address of the structure from the address of the member, but since the start address here is 0, no subtraction is needed.

#include <stdio.h>
#include <stddef.h>
#define OFFSETOF(structName,member) (size_t)(&(((structName*)0)->member))
struct S
{
	char c;
	int i;
	double d;
};
int main()
{	
	//Both expressions have type size_t, so %zu is the matching format
	printf("%zu\n", offsetof(struct S,i));
	printf("%zu\n",OFFSETOF(struct S, i));
	
	return 0;
}

Tags: C C++ Linux

Posted by yakk0 on Sat, 10 Dec 2022 11:57:31 +0300