How git diff is generated: myers diff algorithm

Overview

Reference for this article: https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/

Every developer has been exposed to the diff function of git to some extent, as shown below:

This function can display the modification information of a certain file, which is very convenient for the daily development of developers.
This article will discuss the principle behind the realization of this function.

what myers diff does

cite an example

a = ABCABBA
b = CBABAC

Now to find the diff between ab. Thinking about it carefully, we will find that there are actually countless diff situations between the two. For example, we can find the following situations:

1.  - A       2.  - A       3.  + C
    - B           + C           - A
      C             B             B
    - A           - C           - C
      B             A             A
    + A             B             B
      B           - B           - B
      A             A             A
    + C           + C           + C

But when we actually use it, what we expect to see is an optimal situation, and it is best to meet the following two criteria:

  1. diff has the least content
  2. Deleted content precedes added content

According to these two criteria, we can obtain the optimal solution in this example, as follows:

- A
- B
  C
+ B
  A
  B
- B
  A
+ C

myers diff ideas

In the myers diff algorithm, the process of finding diff is represented as a graph search, as shown below:

       A     B     C     A     B     B     A

    o-----o-----o-----o-----o-----o-----o-----o   0
    |     |     | \   |     |     |     |     |
C   |     |     |  \  |     |     |     |     |
    |     |     |   \ |     |     |     |     |
    o-----o-----o-----o-----o-----o-----o-----o   1
    |     | \   |     |     | \   | \   |     |
B   |     |  \  |     |     |  \  |  \  |     |
    |     |   \ |     |     |   \ |   \ |     |
    o-----o-----o-----o-----o-----o-----o-----o   2
    | \   |     |     | \   |     |     | \   |
A   |  \  |     |     |  \  |     |     |  \  |
    |   \ |     |     |   \ |     |     |   \ |
    o-----o-----o-----o-----o-----o-----o-----o   3
    |     | \   |     |     | \   | \   |     |
B   |     |  \  |     |     |  \  |  \  |     |
    |     |   \ |     |     |   \ |   \ |     |
    o-----o-----o-----o-----o-----o-----o-----o   4
    | \   |     |     | \   |     |     | \   |
A   |  \  |     |     |  \  |     |     |  \  |
    |   \ |     |     |   \ |     |     |   \ |
    o-----o-----o-----o-----o-----o-----o-----o   5
    |     |     | \   |     |     |     |     |
C   |     |     |  \  |     |     |     |     |
    |     |     |   \ |     |     |     |     |
    o-----o-----o-----o-----o-----o-----o-----o   6

    0     1     2     3     4     5     6     7

Finding the shortest path from the upper left to the lower right in the graph is to find the optimal diff.
The right means "delete", the downward means "addition", and the diagonal line means "the original content remains unchanged".

Finding a path needs to meet two characteristics:

  1. Shortest path length (diagonals do not count as length)
  2. Right first, then down (delete first, then add)

dynamic programming

In this way, to find the shortest path of the nth step, then you need to find the shortest path of the n-1 th step, and so on, you need to take the optimal solution from the second step and the first step.

Take the point (2, 2) as an example, the paths that can reach it are as follows:
(0,0)->(1,0)->(2,2)
(0,0)->(1,0)->(1,1)->(1,2)->(2,2)
(0,0)->(1,0)->(1,1)->(2,1)->(2,2)
(0,0)->(1,0)->(2,0)->(2,1)->(2,2)
(0,0)->(0,1)->(2,2)
(0,0)->(0,1)->(1,1)->(1,2)->(2,2)
(0,0)->(0,1)->(1,1)->(2,1)->(2,2)
(0,0)->(0,1)->(0,2)->(1,2)->(2,2)
Since we have to follow the principle of first right and then down, the optimal solution to reach (2,2) is
(0,0)->(1,0)->(2,2)

If we start from (0, 0) and keep only the optimal solution for each path, we can finally get the following paths:

0,0 --- 1,0 --- 3,1 --- 5,2 --- 7,3
 |       |       |
 |       |       |
0,1     2,2     5,4 --- 7,5
 |               |       |
 |               |       |
2,4 --- 4,5     5,5     7,6
 |       |       |
 |       |       |
3,6     4,6     5,6

myers diff concept

Two concepts k and d are introduced in myers diff to facilitate the calculation of the shortest path.

k: k=x-y, x and y are the positions in the horizontal and vertical directions respectively
d: the number of steps taken, slashes are not counted as steps

According to these two parameters, the following figure can be obtained:

When d is the same, the smaller k is, the closer to the end point.
Therefore, the optimal solution can be filtered out.

Tags: Algorithm gif diff

Posted by hcdarkmage on Mon, 16 May 2022 12:55:33 +0300