Compilation (preprocessing) of C language programs
The C language standard specifies that any implementation of C has two distinct environments.
The first is the translation environment, in which source code is converted into executable machine instructions. The second is the execution environment, in which the code actually runs.
1. Compile and link
Suppose there is a source file test.c. It goes through compile --> link --> executable program
If there are multiple .c source files, each one is compiled separately, and the linker then combines the results into one executable program
- Each source file that makes up a program is converted into object code by compilation
- The object files are bundled together by the linker to form a single, complete executable program
- The linker also pulls in any C standard library functions used by the program. It can also search the programmer's own libraries and link the functions the program needs.
1) Several stages of compilation
Note: my environment here is the gcc compiler on CentOS 7.6
code:
test.c
    #include <stdio.h>
    #define MAX 666666
    //Declare external functions
    extern int add(int x, int y);
    int main()
    {
        int a = 10;
        int b = 20;
        int tmp = MAX;
        printf("%d\n", a + b);
        return 0;
    }
add.c
    int add(int x, int y)
    {
        return x + y;
    }
Translation can be divided into three stages
- Preprocessing (precompilation)
- Compilation
- Assembly
Preprocessing stage
During preprocessing the compiler does several things
- Header file inclusion
- Deletion of comments
- Replacement of symbols defined by #define
- Handling of other preprocessing directives
- ...
We use gcc -E test.c > test.i to preprocess test.c on Linux. The compiler stops right after preprocessing, and the result is saved in test.i for easy viewing
After preprocessing, the comments we wrote are gone, the #include line is gone, and in its place there is a long run of function declarations.
What we found
- The #include <stdio.h> line is missing
- The comments we wrote were deleted
- MAX defined by #define has been replaced with its value
We can verify the header inclusion. On my Linux system stdio.h is stored at /usr/include/stdio.h. Opening it shows the same declarations we saw in test.i, so preprocessing really does copy the contents of the header file into the source file.
Compilation stage
In the compilation stage, the C code is translated into assembly code, and several things are done
- Lexical analysis
- Syntax analysis
- Semantic analysis
- Symbol summary
Lexical analysis: splits the C source text into tokens (keywords, identifiers, constants, operators) one by one
Syntax analysis: checks whether the code has grammatical errors and builds a syntax tree from the tokens
Semantic analysis: simply put, it works out what the code means under the rules of C (types, declarations and so on), so that the C code can be converted into the corresponding assembly code
Compile test.c with gcc -S test.c. The compiler stops after compilation and saves the result in test.s, which holds the assembly code.
Symbol summary is a very important step in the compilation stage.
It extracts the important symbols in the file
Let's simply modify the test.c file
    #include <stdio.h>
    #define MAX 666666
    //Declare external functions
    extern int add(int x, int y);
    int count = 0;
    void print()
    {
    }
    int main()
    {
        int a = 10;
        int b = 20;
        int tmp = MAX;
        printf("%d\n", a + b);
        return 0;
    }
On Linux, generate the object file test.o with the command gcc -c test.c. It corresponds to the .obj file on Windows
Then inspect it with readelf -s test.o. Only global functions and variables such as count, print and main are recorded; local variables such as a, b and tmp do not appear
Now look at add.c. It contains only one function, add, so add is the only symbol summarized from it
In other words, each source file has its own global symbols summarized separately, test.c and add.c alike.
Assembly stage
On Linux, use the command gcc -c test.s to convert the assembly code in test.s into binary instructions and generate the binary file test.o
The assembly stage is also where the symbol table is formed; the compilation stage only summarized the symbols. The assembly stage generates a .o file (.obj on Windows), records the previously summarized symbols in a symbol table, and assigns each of them an address.
Note: in test.c, add is only a declaration. The address assigned to it is a placeholder with no real meaning. Whether the function actually exists depends on whether add is defined in some other file, which is settled at link time
2) Linking
On Linux, link the object file with gcc test.o to generate the executable a.out (the counterpart of an .exe file on Windows)
Linking does two main things
- Merging segment tables
- Merging and relocating symbol tables
Simply put, the linker joins multiple .o object files. A project produces several object files after compilation, and at that point they know nothing about each other. The linker associates them by matching each declared function with its definition.
If a function is used but never defined, linking fails (an "unresolved external symbol" error in Visual Studio, "undefined reference" with gcc)
As for merging the segment tables and symbol tables, a simple way to think about it is that matching segments from the object files are combined into one and duplicate symbol entries are reduced to one. For example, the merged symbol table keeps the add entry with its real address. During linking, the linker also checks that every external function and symbol has a valid definition.
After linking, an executable program is generated.
Graphical process
Execution environment
- Once the executable has been built, it must be loaded into memory. In an environment with an operating system, the operating system generally does this. In a freestanding environment, loading must be arranged manually, possibly by placing the executable code in read-only memory
- Execution begins, and the main function is then called
- The program code starts executing. The program uses a runtime stack: each function call gets a stack frame that stores the function's local variables and return address. The program can also use static memory, and variables stored there retain their values throughout execution
- The program terminates, either normally when main returns or unexpectedly
2. Preprocessing
1) Predefined symbols
C has some predefined symbols that hold the following information. They too are replaced directly during preprocessing.
    __FILE__    //the source file being compiled
    __LINE__    //the current line number in the file
    __DATE__    //the date the file was compiled
    __TIME__    //the time the file was compiled
    __STDC__    //1 if the compiler conforms to ANSI C, otherwise undefined
    #include <stdio.h>
    int main()
    {
        printf("source files for compilation: %s\n", __FILE__);
        printf("the current line number of the file: %d\n", __LINE__);
        printf("The date the file was compiled: %s\n", __DATE__);
        printf("The time the file was compiled: %s\n", __TIME__);
        return 0;
    }
In VS2019, __STDC__ is not defined by default, so with its default settings the compiler does not claim strict ANSI C conformance. On Linux, gcc prints 1 for it, which indicates that gcc declares conformance to the C standard.
2) #define
#define can define identifiers (symbolic constants) as well as macros
Defining identifiers with #define
    #include <stdio.h>
    #define MAX 100000
    #define STR "hello"
    #define PRINTLN printf("\n")
    int main()
    {
        printf("%d", MAX);
        PRINTLN;
        printf("%s", STR);
        return 0;
    }
Output
    100000
    hello
When writing a #define, should you add a semicolon ; at the end?
The advice is not to. The semicolon becomes part of the replacement text, so add it at the point of use when needed.
Defining macros with #define
The #define mechanism includes a provision that allows parameters to be substituted into the text. This is usually called a macro, or a define macro.
Write a macro to calculate the sum of two numbers
    #include <stdio.h>
    #define ADD(x,y) x+y
    int main()
    {
        int a = 10;
        int b = 20;
        printf("%d\n", ADD(a, b));
        return 0;
    }
This way of writing has a problem, which shows up in code like the following
    #include <stdio.h>
    #define ADD(x,y) x+y
    int main()
    {
        int a = 10;
        int b = 20;
        printf("%d\n", ADD(a, b)*ADD(a,b));
        return 0;
    }
Output
    230
This is not the result we want, because operator precedence changes the meaning after replacement
    printf("%d\n", ADD(a, b)*ADD(a,b));
    //Equivalent to
    printf("%d\n", 10+20*10+20);
The solution is to add parentheses to each parameter of the macro, and add parentheses to the whole
    #include <stdio.h>
    #define ADD(x,y) ((x)+(y))
    int main()
    {
        int a = 10;
        int b = 20;
        printf("%d\n", ADD(a, b)*ADD(a,b));
        return 0;
    }
So whenever a macro evaluates a numerical expression like this, parenthesize every parameter and the whole body to avoid precedence surprises.
Rules for #define replacement
Expanding #define symbols and macros in a program involves the following steps
- When a macro is invoked, its arguments are first checked for symbols defined by #define. If there are any, they are replaced first
- The replacement text is then inserted into the program in place of the original text. For macros, parameter names are replaced by their argument values
- Finally, the resulting text is scanned again to see if it contains any symbols defined by #define. If yes, repeat the above process
Note
- Other #define symbols can appear in macro arguments and in #define definitions. A macro, however, cannot be recursive
- When the preprocessor searches for symbols defined by #define, the contents of string constants are not searched
For example, the following is fine
    #define MAX 1000
    #define add(x,y) (((x)+(y))*MAX)
Using a #define symbol inside a macro like this is fine, but a macro cannot invoke itself
3) # and ##
How can a macro argument be inserted into a string?
Before answering, let's look at another way of writing strings in C
    #include <stdio.h>
    int main()
    {
        char* str = "hello" "world;";
        printf("%s\n", str);
        printf("123" "abc\n");
        return 0;
    }
Output
    helloworld;
    123abc
When two string literals are written next to each other, they are automatically concatenated into one string at compile time.
Now suppose we want to print the name and value of a variable, with the name inserted into the string. Without macros this is not easy to do, because each variable needs its own hand-written printf.
    #include <stdio.h>
    int main()
    {
        float f = 4.5f;
        printf("the value of f is %f\n", f);
        int a = 10;
        printf("the value of a is %d\n", a);
        int b = 20;
        printf("the value of b is %d\n", b);
        return 0;
    }
With a macro definition the code can be written like this. It has the same effect as the code above and avoids the redundancy.
    #include <stdio.h>
    #define PRINT(data, format) printf("the value of "#data" is %"#format"\n",data)
    int main()
    {
        float f = 4.5f;
        PRINT(f, f);
        int a = 10;
        PRINT(a, d);
        int b = 20;
        PRINT(b,d);
        return 0;
    }
#data turns the macro argument into a string literal: with the argument f, #data becomes "f". The replacement happens during preprocessing.
The role of ##
## combines the symbols on either side of it into one symbol. It lets macro definitions create identifiers from separate text fragments.
    #include <stdio.h>
    #define APPEND(str,number) str##number
    int main()
    {
        int day100 = 2022;
        printf("%d\n", APPEND(day,100));
        return 0;
    }
Output
    2022
4) Macro parameters with side effects
When a macro parameter appears more than once in the macro definition and the argument has side effects, using the macro can be dangerous and lead to unpredictable results.
For example, the following code has side effects
    #include <stdio.h>
    #define MAX(x,y) ((x)>(y)?(x):(y))
    int main()
    {
        int a = 10;
        int b = 20;
        printf("%d\n", MAX(a++, b++));
        printf("a=%d b=%d\n", a, b);
        return 0;
    }
Here **b++** is executed twice, because after replacement the expression is
    printf("%d\n", ((a++) > (b++) ? (a++) : (b++)));
So the program prints 21 and then a=11 b=22. This is what a macro parameter with side effects looks like
5) Comparison of macros and functions
Macros are usually used for simple operations, such as finding the sum of two numbers
#define ADD(x,y) ((x)+(y))
So why not use a function for this task?
int add(int x, int y) { return x + y; }
Advantages of macros over functions
- Calling a function and returning from it can take more time than the small computation itself, so for tiny jobs like this macros beat functions in program size and speed
Let's look at a code comparison
This is the assembly code to calculate the sum of two numbers through a macro
Then look at the amount of assembly code generated when a function calculates the sum of two numbers
We find that the macro version takes only 7 lines of assembly, while the function version takes more than a dozen.
A macro is replaced by its body during preprocessing and the computation is done in place, while a function goes through three steps: call + computation + return.
- More importantly, the parameters of a function must be declared with specific types, so a function can only be used on expressions of the appropriate type. A macro, by contrast, can be applied to integers, long integers, floating-point values and any other type for which the operation it uses (for example + or >) is valid. Macros are type-independent: the ADD macro above can sum values of various types, while the add function can only sum integers.
As another example, the commonly used malloc function allocates space. We can write a macro that takes a type as a parameter to allocate space, which a function cannot do.
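    #include <stdio.h>
    #include <stdlib.h>
    // size is parenthesized so that arguments such as n + 1 work correctly
    #define MALLOC(size,type) ((type*)malloc(sizeof(type)*(size)))
    int main()
    {
        int* arr = MALLOC(10, int);
        int i = 0;
        if (arr == NULL)
        {
            return 1;
        }
        for (i = 0; i < 10; i++)
        {
            arr[i] = i;
        }
        for (i = 0; i < 10; i++)
        {
            printf("%d ", arr[i]);
        }
        free(arr); // release the allocated space
        return 0;
    }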
Disadvantages of macros compared with functions
- Every time a macro is used, a copy of the code defined by the macro is inserted into the program. Unless the macro is relatively short, it may significantly increase the length of the program
- Macros cannot be stepped through in a debugger
- Macros are less rigorous because they are type-independent, so no type checking is done
- Macros may introduce operator precedence issues, making programs prone to errors
Comparative summary
Attribute | #define macro | Function |
---|---|---|
Code length | The macro code is inserted into the program each time it is used. Except for very small macros, the program can grow substantially | The function code appears in only one place; every use of the function calls that same code |
Execution speed | Faster | Function calls and returns add overhead, so it is relatively slower |
Operator precedence | Macro arguments are evaluated in the context of the surrounding expression. Unless parentheses are added, the precedence of adjacent operators may give unexpected results, so use parentheses generously when writing macros | Function arguments are evaluated once when the function is called and the resulting values are passed to the function, so the result of an expression is more predictable |
Side effects | An argument may be substituted in several places in the macro body, so an argument with side effects may be evaluated more than once and give unpredictable results | Function arguments are evaluated only once when they are passed, so the result is easier to control |
Parameter types | Macro parameters are type-independent: as long as the operation on the arguments is legal, the macro works with any argument type | Function parameters are typed. If the argument types differ, different functions are needed, even if they perform the same task |
Debugging | Macros are inconvenient to debug | Functions can be debugged statement by statement |
Recursion | Macros cannot be recursive | Functions can be recursive |
3. Common preprocessing directives
1) #undef
This directive removes a macro definition
    #include <stdio.h>
    #define MAX 1000
    int main()
    {
        int tmp = MAX;
    #undef MAX
        int ret = MAX; // error: MAX is no longer defined here
        return 0;
    }
2) Command line definition
Many C compilers let you define symbols on the command line when starting the compilation. This is useful when we want to build different versions of a program from the same source file, for example with different array sizes.
    #include <stdio.h>
    int main()
    {
        int arr[SIZE] = { 0 };
        int i = 0;
        for (i = 0; i < SIZE; i++)
        {
            arr[i] = i;
        }
        for (i = 0; i < SIZE; i++)
        {
            printf("%d ", arr[i]);
        }
        return 0;
    }
In a 64-bit Linux environment, compile test.c with the command gcc -D SIZE=10 test.c to generate a.out. When run, the array has a size of 10
    [root@aliyun code]# ./a.out
    0 1 2 3 4 5 6 7 8 9
3) Conditional compilation
Conditional compilation directives make it easy to compile or discard a statement (or a group of statements) when building a program.
For example:
Debugging code is a pity to delete but gets in the way if kept, so we can compile it selectively
    #include <stdio.h>
    #define DEBUG 1
    int main()
    {
        printf("hello world!\n");
    #ifdef DEBUG
        printf("test");
    #endif // DEBUG
        return 0;
    }
#ifdef only checks whether DEBUG is defined, so setting DEBUG to 0 is not enough. The line that prints test is left out only when the #define DEBUG line is removed, or when the check is written as #if DEBUG and DEBUG is 0
Of course, multiple branches are also possible
    #include <stdio.h>
    #define DEBUG 0
    int main()
    {
        int a = 0;
        printf("hello world!\n");
    #if DEBUG
        printf("test");
    #elif a // a is a variable, not a macro, so the preprocessor treats it as 0
        printf("false");
    #else
        printf("haha");
    #endif // DEBUG
        return 0;
    }
Nested conditionals
    #include <stdio.h>
    #define DEBUG 0
    int main()
    {
        int a = 0;
        printf("hello world!\n");
    #if defined(DEBUG) // true: DEBUG is defined, even though its value is 0
    #if 0
        printf("0");
    #elif a-1 // a is not a macro, so this is 0-1, which is nonzero (true)
        printf("0");
    #else
        printf("1");
    #endif
    #endif
        return 0;
    }
4) File inclusion
We already know that the #include directive causes another file to be compiled, just as if its contents appeared at the place of the #include directive.
The substitution is simple: the preprocessor removes the directive and replaces it with the contents of the included file. If such a file is included 10 times, it is actually compiled 10 times.
How header files are included
- Local file inclusion
    #include "add.h"
Search strategy: first look for add.h in the directory where the source file is located. If it is not found there, the compiler searches the standard locations, just as it does for library headers. If it still cannot be found, a compilation error is reported.
The standard header path on Linux is /usr/include/
- Library file inclusion
    #include <stdio.h>
The header is searched for directly in the standard paths. If it is not found, a compilation error is reported.
Library headers can also be included with "", but the search is less efficient and it becomes harder to tell whether a file is local or from a library.
Avoiding repeated header inclusion
Once there are many header files, the same header may be included more than once.
For example, including it several times like this
    #include "add.h"
    #include "add.h"
    #include "add.h"
    int main()
    {
        return 0;
    }
Suppose add.h looks like this
    int Add(int a, int b);
Then the preprocessed file looks like this, and the repeated inclusion leads to redundant code.
    # 1 "test.c"
    # 1 "add.h" 1
    int Add(int a, int b);
    # 2 "test.c" 2
    # 1 "add.h" 1
    int Add(int a, int b);
    # 3 "test.c" 2
    # 1 "add.h" 1
    int Add(int a, int b);
    # 4 "test.c" 2
    int main()
    {
        return 0;
    }
So how do we avoid this?
With conditional compilation.
#ifndef checks whether the macro ADD_FUNC has already been defined. If it has, the contents of the header are skipped and not declared again
    #ifndef ADD_FUNC
    #define ADD_FUNC
    int Add(int a, int b);
    #endif
There is also a simpler way: #pragma once, a non-standard but widely supported directive, achieves the same effect.
    #pragma once
    int Add(int a, int b);
5) Implement offsetof
offsetof can be imitated with a macro
Cast 0 to a pointer to the structure, treating 0 as the address of the structure. Then take the address of the member and convert it to an integer; that value is the offset.
This is like pointer subtraction, but since the start address is 0 there is nothing to subtract.
    #include <stdio.h>
    #include <stddef.h>
    #define OFFSETOF(structName,member) ((size_t)&(((structName*)0)->member))
    struct S
    {
        char c;
        int i;
        double d;
    };
    int main()
    {
        // size_t values are printed with %zu
        printf("%zu\n", offsetof(struct S,i));
        printf("%zu\n", OFFSETOF(struct S, i));
        return 0;
    }