efcore tips - there may be tips you don't know

foreword

The .net environment has been developing steadily in recent years. In the process of development, dealing with the database is essential. The early developers all started with DbHelper, and now various ORM frameworks are popular. No one can tell which is better and which is worse. This is what Wen Wu No. 1 and Wu Wu No. 2 say. Nothing is best, only what is best for you.

I also started with DbHelper, during which I used SugarSql and then EFCODE. With the original intention of learning and sharing, I share some tips I have summarized in my work, hoping to help more developers and hope to achieve common progress. If there are mistakes in the text, please feel free to let me know.

1. DbContext configuration

In asp.net, usually, the ef service is injected in the ConfigureServices method of the Startup class.
The sample code is as follows:

services.AddDbContext<DemoDbContext>(opt=>opt.UseMySql("server=.;Database=demo;Uid=root;Pwd=123;Port=3306;"));

The above code indicates the use of MySql database. If you use the SqlServer database, you can change UseMySql to UseSqlServer, and the use of other databases is also selected by calling different methods. But you need to install the corresponding extension method package, such as Microsoft.EntityFrameworkCore.SqlServer or Microsoft.EntityFrameworkCore.Sqlite.

In addition, the UseMySql method also includes a nullable Action type parameter, which can be used to perform some personalized configuration, such as configuring the retry mechanism. As follows:

services.AddDbContext<DemoDbContext>(opt => opt.UseMySql("server=.;Database=demo;Uid=root;Pwd=123456;Port=3306;",
                provideropt => provideropt.EnableRetryOnFailure(3,TimeSpan.FromSeconds(10),new List<int>(){0} )));

This retry mechanism is useful in some scenarios. For example, a momentary connection timeout due to network fluctuations or traffic. If the retry mechanism is not set, an exception will be triggered directly. After the timeout is set, it will be retried according to the set time interval and the number of retries. The last parameter of the EnableRetryOnFailure method is used to set the error code. Only the error that sets the error code will trigger a retry. There are many ways to obtain the error code. Personally, it is recommended to obtain it through exception information. For example, when using MySql data, the type of exception triggered is MySqlException. The value of the Number property of this class is the Number required by the EnableRetryOnFailure method.

2. DbContext thread problem

efcore does not support running multiple parallel operations on the same DbContext instance, this includes parallel execution of asynchronous queries and any explicit concurrent use from multiple threads. Therefore, always await asynchronous calls, or use a separate DbContext instance for operations performed in parallel.
When EF Core detects a parallel operation or when multiple threads attempt to use a DbContext instance at the same time, you will see an InvalidOperationException with a message similar to the following:

A second operation started on this context before a previous operation completed. Any instance members are not guaranteed to be thread safe.

It means that a new operation is started before the previous operation is completed, so there is no guarantee that the thread is safe.

Here's a sample code that is wrong and can trigger this exception:

So, always await asynchronous calls. If you use DbContext in multiple threads, you need to ensure that the instance of DbContext for each thread is unique.

3. The database uses connection pooling

The throughput of using services.AddDbContextPool is improved by 10~20 percentage points than using services.AddDbContext (unofficial statement, the performance improvement data is the result obtained after my test).
It should be noted that the connection pool size is not the bigger the better.

4. Logging

When using ef, most of the interaction with the database is realized through linq, and then ef translates linq into corresponding sql statements. When troubleshooting problems, when developing or troubleshooting problems, you often need to pay attention to the final execution. sql script, so you need to view it through the log.
In the efcore2.x version, the default is to inject the log service, so you can view the corresponding sql script without additional operations. However, the version of efcore3.x removes the log service by default. For the specific reasons, please refer to: https://docs.microsoft.com/zh-cn/ef/core/what-is-new/ef-core-3.0/breaking-changes#adddbc.
The log task can be injected by customizing the DbContext. The sample code is as follows:

public static readonly ILoggerFactory MyLoggerFactory
            = LoggerFactory.Create(builder => { builder.AddConsole(); });
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    base.OnConfiguring(optionsBuilder);
    optionsBuilder.UseLoggerFactory(MyLoggerFactory);
}

When the ef code is executed, the related sql script can be viewed in the console, as shown in the following figure:

5. Increase

Common scenarios for inserting data into the database are: ordinary single-form row insertion, multi-table cascading insertion, and batch insertion.
Ordinary single-form row insertion is relatively simple, and the example code is as follows:

var student = new Student {CreateTime = DateTime.Now, Name = "zjjjjjj"};
await _context.Students.AddAsync(student);
await _context.SaveChangesAsync();

For multi-table cascading inserts, attribute navigation needs to be configured in entity mapping.
For example, the relationship between the Blog table and Post is a one-to-many relationship. Then in the entity of Blog, define a property of type List. The sample code is as follows:

[Table("blog")]
public class Blog 
{
    [Column("id")]
    public long Id { get; set; }
    [Column("title")]
    public string Title { get; set; }
    public List<Post> Posts { get; set; }
    [Column("create_date")]
    public DateTime CreateDate { get; set; }
}

The corresponding insert statement is as follows:

var blog = new Blog
{
    Title = "test title",
    Posts = new List<Post>
    {
        new Post{Content = "Comment 1"},
        new Post{Content = "Comment 2"},
        new Post{Content = "Comment 3"},
    }
};
await _context.Blog.AddAsync(blog);
await _context.SaveChangesAsync();

Executing this code will generate the following log:

It can be seen from the log that the effect of cascading insertion is achieved in this way.

There are two implementations of batch insertion. One is the default implementation of EF, which is suitable for situations where there are few data sources. On the other hand, based on EF, we develop a service for inserting a large amount of data in batches, which is suitable for scenarios with more than 1000 data sources. In terms of data volume of 10,000 and above, there is a very obvious improvement in the performance of EF's default batch insertion. Specific reference: https://www.cnblogs.com/fulu/p/13370335.html

EF default implementation:

var list = new List<Student>();
for (int i = 0; i < num; i++)
{
    list.Add(new Student { CreateTime = DateTime.Now, Name = "zjjjjjj" });
}

await _context.Students.AddRangeAsync(list);
await _context.SaveChangesAsync();

ISqlBulk implementation:

var list = new List<Student>();
for (int i = 0; i < 100000; i++)
{
    list.Add(new Student { CreateTime = DateTime.Now, Name = "zjjjjjj" });
}
await _bulk.InsertAsync(list);

auto increment OR GUID

The advantages of int auto-increment:

1,Requires very little data storage space, only 4 byte . 

2,insert and update use during operation INT performance ratio GUID ok, so use int Will improve the performance of the application.

3,index and Join operate, int the best performance.

4,Easy to remember.

Disadvantages of int auto-increment:

1,use INT The data range is limited. If there is a large amount of data, it may exceed INT range of values.

2,Difficult to deal with distributed storage of data tables.

Advantages of GUID as primary key:

1,uniqueness.

2,Suitable for insert and update operations in large amounts of data.

3,Data merging across servers is very convenient.

Disadvantages of GUID as primary key:

1,Large storage space (16 byte),So it will take up more disk size.

2,It's hard to remember. join Operational performance ratio int to be low.

3,There is no built-in function to get the latest generated guid primary key.

4,EF generated by default GUID It is unordered and will affect data insertion performance.

in conclusion:

In scenarios with a relatively small amount of data, it is recommended to use int auto-increment, such as classification. For large data volumes, ordered GUID s are recommended. Because the GUID generated by .net is unordered by default, and the primary key in the database is the clustered index by default, and the physical storage of the clustered index is ordered. When inserting data, if the unordered GUID is inserted, it may be When it comes to moving data, it affects the performance of insertion, especially when the amount of data is millions, the performance impact is more obvious. References: https://www.cnblogs.com/CameronWu/p/guids-as-fast-primary-keys-under-multiple-database.html

Other options:

After personal understanding, the distributed id generation algorithm commonly used on the market is similar to the Snowflake algorithm released by Twitter. I have also used the Snowflake algorithm in the project. Interested friends can find related content in the blog garden. However, the snowflake algorithm packaged in .net is generally relatively basic, and it is difficult to simply use it in the docker or k8s environment. Therefore, under this notice, the version I wrote based on the snowflake algorithm that can be used in the k8s environment will be open source, so stay tuned.

6. Check

EF uses Linq to query data in the database, and Linq can be used to write strongly typed queries. When the command is executed, EF first converts the Linq expression into a sql script, and then submits it to the database for execution. The generated sql script can be viewed in the log.

Query by condition:
await _context.Blog.Where(x=>x.Id>0).ToListAsync();

The sql script generated when the above code is executed is as follows:

SELECT `x`.`id`, `x`.`create_date`, `x`.`title`
      FROM `blog` AS `x`
      WHERE `x`.`id` > 0
get a single entity

The ways to obtain a single entity are First, FirstOrDefault, Single, SingleOrDefault
Among them, the sql scripts generated when First and FirstOrDefault are executed are as follows:

 SELECT `x`.`id`, `x`.`create_date`, `x`.`title`
      FROM `blog` AS `x`
      WHERE `x`.`id` > 10
      LIMIT 1

Single, the sql script generated when SingleOrDefault is executed is as follows:

 SELECT `x`.`id`, `x`.`create_date`, `x`.`title`
      FROM `blog` AS `x`
      WHERE `x`.`id` > 10
      LIMIT 2

If you are careful, you should have discovered the difference between the two. Single needs to query two pieces of data. When there is more than one piece of data returned, the Single and SingleOrDefault methods will report Source sequence contains more than one element. exception. Therefore, the Single method is only applicable to scenarios where there is only one piece of data corresponding to the query condition, such as querying the value of the primary key. As follows:

await _context.Blog.SingleOrDefaultAsync(x => x.Id==100);

The difference between the suffix with OrDefault and without the suffix is ​​that when the sql script executes the query and cannot query the data, the suffix will return a null value, and the suffix without the suffix will directly report an exception.

Check if the database exists

The presence or absence of data can be achieved through the Any() and Count() methods. The sample code is as follows:

await _context.Blog.AnyAsync(x => x.Id > 100);

await _context.Blog.CountAsync(x => x.Id > 100)>0;

The generated sql script corresponds to the following:

SELECT CASE
          WHEN EXISTS (
              SELECT 1
              FROM `blog` AS `x`
              WHERE `x`.`id` > 100)
          THEN TRUE ELSE FALSE
      END
SELECT COUNT(*)
      FROM `blog` AS `x`
      WHERE `x`.`id` > 100

At first glance, the script generated by the Any method seems to be more complicated, but in fact, the performance of the Any method is much higher than that of the Count method under large data volumes. So when judging whether it exists, please use the Any method.

connection query

Join query is the most important query in relational database, mainly including inner join, outer join (left join, outer join) and cross join. Multiple table queries can be implemented through the join operator. This article mainly explains the commonly used inner join and left join.
The sample code for inner join is as follows:

var query = from post in _context.Post
            join blog in _context.Blog on post.BlogId equals blog.Id
    where blog.Id > 0
    select new {blog, post};

The sample code for left join is as follows:

var query = from post in _context.Post
                        join blog in _context.Blog on post.BlogId equals blog.Id
                        into pbs
                        from pb in pbs.DefaultIfEmpty()
                where pb.Id>0 && post.Content.Contains("1")
                        select new {post,pb.Title};
Cascading query

In many scenarios, it may involve querying the sub-table data associated with the parent table. In such a scenario, some people will first find out the data in the main table, and then query the data in the sub-table based on the primary key of the main table. The author also used this method in the early days of using ef. But with the help of the Include method, we can more easily solve the problem of cascading query between parent and child tables. The sample code is as follows:

var result = await _context.Blog.Include(b => b.Posts) .SingleOrDefaultAsync(x=>x.Id==157);

If there are more levels, you can use ThenInclude to query.

Sometimes, there is such a scenario: we are not simply querying the data of the sub-table, but need to query the data that meets the specified conditions, then we are required to pass in parameters when calling the Include method. The sample code is as follows:

 var filteredBlogs = await _context.Blogs
        .Include(blog => blog.Posts
            .Where(post => post.BlogId == 1)
            .OrderByDescending(post => post.Title)
            .Take(5))
        .ToListAsync();

Note: The above method is only supported in .net5. Therefore, efcore is also in the process of development. With the update of time and version, the functions will gradually become more perfect. For related content, please refer to: https://docs.microsoft.com/zh-cn/ef/core/querying/related-data

7. Change

Anyone who has used EF should understand the concept of tracking and non-tracking of queries (Nani? You haven't heard of it, let me show you a clear way: https://docs.microsoft.com/zh-cn/ef/core/querying/tracking).

Generally speaking, the update process is like this: query the data, modify the value of some fields, call the Update method, and then call the SaveChange method. It looks flawless, but if you have carefully observed the generated sql script, maybe you should have a better way, let's take a look at the sample code:

var school = await _context.Schools.FirstAsync(x => x.Id > 0);
school.Name = "6666";
_context.Schools.Update(school);
await _context.SaveChangesAsync();

As shown in the figure below is the update sql statement generated by executing the above code. We found that only the Name was re-assigned in the code, but the generated script updated all the fields of this record. Obviously, this is not what we want. the result of.

In fact, if the entity is obtained by tracking the query, you can directly call the SaveChage method instead of calling the Update method redundantly. At this time, EF will automatically determine which fields have been updated, so that only sql statements with changed values ​​are generated.

Conclusion: When tracking is turned on for the entity to be updated, when updating, it is not necessary to call the Update method, but directly call the SaveChange method, and then update the field whose value has changed. If you call Update first and then SaveChange, regardless of whether the fields of the entity are updated, the generated sql script will still update all fields, sacrificing performance. If your entity is not obtained through the database tracking query, you only need to call the Update method when you call it.

Produced by Flow ICH. Architecture

Author: Fowles

August 2020

Tags: .NET ef efcore

Posted by mbh23 on Fri, 20 May 2022 14:43:01 +0300