>> Chapter 11: Myers Diff

For my first chapter journal entry, the task is to implement the Myers diff algorithm, which computes the changes shown in git diff and which I may later integrate into git status. While Myers is the default, git's diff algorithm can be changed, and different algorithms can produce different results, which is fascinating. I will not discuss the algorithm itself here, so do check out other sources such as the author's blog.

Apologies if the journal is unstructured or sparse; I am still trying to find the balance between a personal journal and a thoroughly reviewed article.

Source code for this chapter can be found here.

>>> Sketch

Generally, the objective is to take two lists of comparable items and return an iterator of diffs to convert one to the other. My basic sketch of this algorithm is:

pub enum Diff<'a, T> {
    Same(&'a T),
    Insert(&'a T),
    Delete(&'a T),
}

pub struct Diffs<'a, T> {
    source: &'a [T],
    target: &'a [T],
    path: Vec<(usize, usize)>
}

impl<'a, T> Iterator for Diffs<'a, T> {
    type Item = Diff<'a, T>;

    fn next(&mut self) -> Option<Self::Item> {
        unimplemented!()
    }
}

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> Diffs<'a, T> {
    unimplemented!();
}

Ideally, it is used like this:

let source = "ABCABBA".chars().collect::<Vec<_>>();
let target = "CBABAC".chars().collect::<Vec<_>>();

for diff in myers(source.as_slice(), target.as_slice()) {
    match diff {
        Diff::Same(text) => println!(" {}", text),
        Diff::Insert(text) => println!("+{}", text),
        Diff::Delete(text) => println!("-{}", text),
    }
}

It should output:

-A
-B
 C
+B
 A
 B
-B
 A
+C

The algorithm itself is straightforward to implement: the main function finds the best point at each round and the iterator backtracks from it. The first part of the Ruby source code translates easily, but the backtracking iterator is the challenging part. Looking back, the fundamental lesson of this chapter is to keep it simple, since refactoring and experimenting take time in a typed language. I did want to explore other Rust traits such as AsRef, but they turned out to be unnecessary, which was a waste of effort.

>>> Iterator or List

Since the book returns the diffs as an iterator, it follows that the function should do so as well. However, since the backtracking logic yields diffs backwards, the iterator has to be reversed, which implies:

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> impl Iterator<Item = Diff<'a, T>> {
    Diffs {
        // Fields elided
    }
    .rev()
}

Using Iterator::rev requires implementing DoubleEndedIterator. The easiest way out with the diff iterator is to collect it into a list and then convert that back into an iterator:

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> impl Iterator<Item = Diff<'a, T>> {
    Diffs {
        // Fields elided
    }
    .collect::<Vec<_>>()
    .into_iter()
    .rev()
}
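For contrast, here is what satisfying that bound directly looks like on a toy counter type; Counter and its fields are my own illustration, not part of the diff code. Iterator::rev only compiles once DoubleEndedIterator::next_back exists:

```rust
// A half-open counter over [front, back), purely for illustration.
struct Counter {
    front: u32,
    back: u32,
}

impl Iterator for Counter {
    type Item = u32;

    // Advance from the front.
    fn next(&mut self) -> Option<u32> {
        if self.front < self.back {
            let item = self.front;
            self.front += 1;
            Some(item)
        } else {
            None
        }
    }
}

impl DoubleEndedIterator for Counter {
    // Advance from the back, which is what enables Iterator::rev.
    fn next_back(&mut self) -> Option<u32> {
        if self.front < self.back {
            self.back -= 1;
            Some(self.back)
        } else {
            None
        }
    }
}
```

With both implementations in place, Counter { front: 0, back: 3 }.rev() yields 2, 1, 0. Doing the same for Diffs would mean teaching the backtracker to also walk forwards, which is exactly the complexity the collect approach sidesteps.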

Returning an iterator here seems inefficient. Why not return the ordered diffs instead? Why not get rid of the iterator altogether to make the code more compact?

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> Vec<Diff<'a, T>> {
    // Rest of the code

    let mut diffs = Vec::with_capacity(traces.len());

    // Iterator code

    diffs.reverse();

    diffs
}

Sadly, I only realized this after implementing the algorithm completely, and refactored afterwards. Perhaps I was too eager and ambitious in implementing an iterator, making it unnecessarily harder for myself. Another lesson in keeping it simple, though it was rather fun making the iterator work.

>>> Negative Indices

In Ruby, accessing array elements with negative indices starts from the end instead of the beginning. Since the book uses this feature, we need to translate it to Rust. We could implement a wrapper type, ExtSlice, to support negative indices, which is easily done with std::ops::Index and std::ops::IndexMut:

use std::ops::{Index, IndexMut};

struct ExtSlice<'a, T>(&'a mut [T]);

impl<'a, T> Index<isize> for ExtSlice<'a, T> {
    type Output = T;

    fn index(&self, mut index: isize) -> &Self::Output {
        if index.is_negative() {
            index += self.0.len() as isize;
        }

        &self.0[index as usize]
    }
}

impl<'a, T> IndexMut<isize> for ExtSlice<'a, T> {
    fn index_mut(&mut self, mut index: isize) -> &mut Self::Output {
        if index.is_negative() {
            index += self.0.len() as isize;
        }

        &mut self.0[index as usize]
    }
}

let xs = ExtSlice(&mut [1, 2, 3]);

assert_eq!(xs[0], xs[-3]);
assert_eq!(xs[1], xs[-2]);
assert_eq!(xs[2], xs[-1]);

This works, but now we cannot use indices beyond isize::MAX. To see the problem, scale the types down: if slices were indexed by u8, a signed i8 index tops out at 127, so how would we access elements 128 and above? Since the algorithm is simple, I opted to compute the index manually to avoid the issue altogether:

#[inline]
fn normalize_index(index: isize, size: usize) -> usize {
    if index.is_negative() {
        size - (index.abs() as usize)
    } else {
        index as usize
    }
}

'outer: for d in 0..(max as isize) {
    for k in (-d..=d).step_by(2) {
        let prev_index = normalize_index(k - 1, capacity);
        let next_index = normalize_index(k + 1, capacity);

        if k == -d || (k != d && (values[prev_index].0 < values[next_index].0)) { }

        // Rest of the code
    }
}

How might one implement negative indices without this precision loss? Perhaps a custom std::ops::RangeBounds implementation could work, but that is beyond the current scope. It is still interesting to know that the indexing operator can be overloaded to enable this feature.
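As a quick sanity check (my own test, not from the book), normalize_index behaves like Ruby's negative indexing on a small slice:

```rust
#[inline]
fn normalize_index(index: isize, size: usize) -> usize {
    if index.is_negative() {
        size - (index.abs() as usize)
    } else {
        index as usize
    }
}

#[test]
fn normalize_index_mimics_ruby() {
    let xs = [10, 20, 30];

    // Negative indices count from the end, as in Ruby.
    assert_eq!(xs[normalize_index(-1, xs.len())], 30); // xs[-1] in Ruby
    assert_eq!(xs[normalize_index(-3, xs.len())], 10); // xs[-3] in Ruby

    // Non-negative indices behave as usual.
    assert_eq!(xs[normalize_index(2, xs.len())], 30);
}
```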

>>> Backtracking

Initially, the way backtracking was presented made me consider an alternative representation. Since only the best values in each round are stored, the information needed to retrace the best path is lost. The book's idea is to clone the values of each round into a list so that backtracking is possible.

let mut values: Vec<usize> = Vec::with_capacity(capacity);
values.resize(capacity, 0);

let mut traces: Vec<Vec<usize>> = Vec::with_capacity(max);

'outer: for d in 0..(max as isize) {
    traces.push(values.clone());

    // Rest of the code
}

Instead of cloning the values on each round, what if we store the path on each value and return that in the end?

let mut values: Vec<(usize, Vec<(usize, usize)>)> = Vec::with_capacity(capacity);
values.resize(capacity, (0, Vec::new()));

'outer: for d in 0..(max as isize) {
    for k in (-d..=d).step_by(2) {
        // Rest of the code
        let this_path = &mut values[index].1;

        this_path.push((this_x, this_y));

        if this_x >= m && this_y >= n { }
    }
}

I completed the initial implementation with this layout before realizing it is not as efficient as I thought. Given m as the length of the source, n as the length of the target, and d as the number of rounds, my approach allocates 2 * (m + n) + 1 small vectors that each grow to up to d elements, while the original stores d vectors of size 2 * (m + n) + 1. While the totals are comparable, the original has fewer allocations and a more compact memory layout. I keep relearning the old lesson of planning ahead before writing.
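To make the tradeoff concrete, here is my complete, self-contained sketch of the Vec-returning version using the book's trace-cloning layout. The offset arithmetic replaces the negative-index helper, and all names besides myers and Diff are my own; treat it as a reconstruction under those assumptions, not the book's exact code:

```rust
#[derive(Debug, PartialEq)]
pub enum Diff<'a, T> {
    Same(&'a T),
    Insert(&'a T),
    Delete(&'a T),
}

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> Vec<Diff<'a, T>> {
    let (m, n) = (source.len(), target.len());
    let max = m + n;

    // Return quickly when both lists are empty.
    if max == 0 {
        return Vec::new();
    }

    // values[(k + offset) as usize] holds the furthest x reached on diagonal k.
    let offset = max as isize;
    let mut values = vec![0usize; 2 * max + 1];
    let mut traces: Vec<Vec<usize>> = Vec::new();

    'outer: for d in 0..=(max as isize) {
        // Snapshot the previous round so we can backtrack later.
        traces.push(values.clone());

        let mut k = -d;
        while k <= d {
            // Step down (insert) or right (delete) from the better neighbor.
            let mut x = if k == -d
                || (k != d && values[(k - 1 + offset) as usize] < values[(k + 1 + offset) as usize])
            {
                values[(k + 1 + offset) as usize]
            } else {
                values[(k - 1 + offset) as usize] + 1
            };
            let mut y = (x as isize - k) as usize;

            // Follow the free diagonal while elements match.
            while x < m && y < n && source[x] == target[y] {
                x += 1;
                y += 1;
            }

            values[(k + offset) as usize] = x;

            if x >= m && y >= n {
                break 'outer;
            }

            k += 2;
        }
    }

    // Backtrack from (m, n), emitting diffs in reverse order.
    let mut diffs = Vec::new();
    let (mut x, mut y) = (m as isize, n as isize);

    for (d, trace) in traces.iter().enumerate().rev() {
        let d = d as isize;
        let k = x - y;
        let prev_k = if k == -d
            || (k != d && trace[(k - 1 + offset) as usize] < trace[(k + 1 + offset) as usize])
        {
            k + 1
        } else {
            k - 1
        };
        let prev_x = trace[(prev_k + offset) as usize] as isize;
        let prev_y = prev_x - prev_k;

        // Walk back along the diagonal: matching elements.
        while x > prev_x && y > prev_y {
            x -= 1;
            y -= 1;
            diffs.push(Diff::Same(&target[y as usize]));
        }

        if d > 0 {
            if x == prev_x {
                diffs.push(Diff::Insert(&target[(y - 1) as usize])); // vertical move
            } else {
                diffs.push(Diff::Delete(&source[(x - 1) as usize])); // horizontal move
            }
        }

        x = prev_x;
        y = prev_y;
    }

    diffs.reverse();
    diffs
}
```

Tracing this by hand on the chapter's ABCABBA to CBABAC example reproduces the edit script shown earlier.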

>>> Testing

Aside from the base case, the book provides no other tests. For property-based testing, I can generate a random diff, derive the source and target texts from it, and then check that applying myers to those texts reproduces the diff. Since we are using quickcheck, we want to create a generator type for our Diff type. However, since it contains a reference field, we need to create an owned version:

#[derive(Debug, PartialEq)]
pub enum Diff<'a, T> {
    Same(&'a T),
    Insert(&'a T),
    Delete(&'a T),
}

#[cfg(test)]
mod tests {
    use quickcheck::{Arbitrary, Gen};

    use super::*;

    #[derive(Clone, Debug, PartialEq)]
    enum TestDiff<T> {
        Same(T),
        Insert(T),
        Delete(T),
    }
}

The next step is to create a generator for the diff collection. Ideally, the type would be Vec<TestDiff<T>>; however, since we do not own Vec, coherence rules require wrapping it in a struct and implementing the trait on that. To generate the collection, a list of values is generated, deduplicated via BTreeSet instead of a Vec, and then each value is randomly mapped to a diff variant:

use std::collections::BTreeSet;
use rand::Rng;

#[derive(Clone, Debug)]
struct TestDiffs<T>(Vec<TestDiff<T>>);

impl<T: Arbitrary + Ord> Arbitrary for TestDiffs<T> {
    fn arbitrary<G: Gen>(g: &mut G) -> Self {
        let values: BTreeSet<T> = Arbitrary::arbitrary(g);

        let diffs = values
            .into_iter()
            .map(|val| match g.gen::<u8>().rem_euclid(3) {
                0 => TestDiff::Same(val),
                1 => TestDiff::Insert(val),
                2 => TestDiff::Delete(val),
                _ => unreachable!(),
            })
            .collect::<Vec<_>>();

        TestDiffs(diffs)
    }
}

We can now define the property test using the generator. But first, how do we extract the source and target texts from the diffs? Since insertions come from the target and deletions from the source, we can iterate through the diffs and collect the items appropriately:

#[quickcheck]
fn diffs_are_correct(test_diffs: TestDiffs<String>) -> bool {
    let mut diffs = test_diffs.0;

    let mut source = Vec::with_capacity(diffs.len());
    let mut target = Vec::with_capacity(diffs.len());

    for diff in diffs.clone() {
        match diff {
            TestDiff::Delete(item) => {
                source.push(item);
            }
            TestDiff::Insert(item) => {
                target.push(item);
            }
            TestDiff::Same(item) => {
                source.push(item.clone());
                target.push(item);
            }
        };
    }

    unimplemented!()
}

Finally, we apply the myers function on the source and target and expect the generated diffs to be the same:

#[quickcheck]
fn diffs_are_correct(test_diffs: TestDiffs<String>) -> bool {
    // Rest of the code

    let actual_diffs = myers(source.as_slice(), target.as_slice());

    for (actual_diff, expected_diff) in actual_diffs.into_iter().zip(diffs) {
        match (actual_diff, expected_diff) {
            (Diff::Same(actual), TestDiff::Same(ref expected))
                | (Diff::Insert(actual), TestDiff::Insert(ref expected))
                | (Diff::Delete(actual), TestDiff::Delete(ref expected)) => {
                    if actual != expected {
                        return false;
                    }
                }
            _ => return false,
        };
    }

    true
}

Running this test exposes two edge cases and one generator issue. A common edge case is handling empty list arguments. Adding the test case directly and fixing it is simple:

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> Vec<Diff<'a, T>> {
    // Return quickly when both lists are empty
    if source.is_empty() && target.is_empty() {
        return Vec::new();
    }

    // Rest of the code
}

#[test]
fn empty_values_should_work() {
    assert_eq!(vec![Diff::Insert(&1)], myers(&[], &[1]));
    assert_eq!(vec![Diff::Delete(&1)], myers(&[1], &[]));
    assert_eq!(Vec::<Diff<u8>>::new(), myers::<u8>(&[], &[]));
}

The trickier issue: if the source and target begin with the same values, the search ignores that diagonal/free move and proceeds normally, generating a diff that does not match. The fix is simply to run the diagonal while loop on the first round and shift the new point by that amount:

pub fn myers<'a, T: PartialEq>(source: &'a [T], target: &'a [T]) -> Vec<Diff<'a, T>> {
    // Rest of the code

    'outer: for d in 0..(max as isize) {
        // Consume the shared diagonal prefix during the first search round
        let delta = if d == 0 {
            let mut i = 0;

            while i < m && i < n && source[i] == target[i] {
                i += 1;
            }

            i
        } else {
            0
        };

        for k in (-d..=d).step_by(2) {
            this_x += delta;
            this_y += delta;

            while this_x < m && this_y < n && source[this_x] == target[this_y] { }

            // Rest of the code
        }
    }

    // Consume the shared diagonal prefix right after the backtracking phase
    let (mut x, mut y) = point;

    while x > 0 && y > 0 {
        x -= 1;
        y -= 1;

        diffs.push(Diff::Same(&target[y]));
    }

    diffs.reverse();

    diffs
}

#[test]
fn same_initial_should_work() {
    assert_eq!(
        vec![Diff::Same(&1), Diff::Delete(&2), Diff::Insert(&3)],
        myers(&[1, 2], &[1, 3])
    );
}

Whether these two issues are fixed later in the book or stem from my implementation, the property test thankfully caught them early on. However, I must confess the actual and expected diffs are not always equal. Specifically, they have the same elements but sometimes not in the same order. For example:

// EXPECTED
[Same(20), Insert(78), Delete(99)]

// ACTUAL
[Same(20), Delete(99), Insert(78)]

Sadly, our generation method is basic and incompatible with how myers orders its diffs. After struggling for a while, I settled for sorting both diff lists and testing that they are equal:

// Added extra traits for sorting: PartialEq, PartialOrd, Eq, Ord
#[derive(Clone, Debug, PartialEq, PartialOrd, Eq, Ord)]
enum Diff<'a, T> { }

#[derive(Clone, Debug, PartialEq, PartialOrd, Eq, Ord)]
enum TestDiff<T> { }

#[quickcheck]
fn diffs_are_correct(test_diffs: TestDiffs<u8>) -> bool {
    // Rest of the code

    let mut actual_diffs = myers(source.as_slice(), target.as_slice());

    actual_diffs.sort();
    diffs.sort();

    for (actual_diff, expected_diff) in actual_diffs.into_iter().zip(diffs) {}
    // Rest of the code
}

While the generation is imperfect, sorting the diffs before comparing them makes the test consistent. My lesson here is that perfect tests are nice but must be weighed against time and effort. Nonetheless, the algorithm works, and I can now declare this chapter complete.

>> Histogram

In my early years with git, I thought that what made it special among version control systems was that it stores diffs instead of files, making it more efficient; that is incorrect. In reality, both the old and the new file are stored, and the diff between them is computed on demand for git diff. My flawed mental model came from folding diffs to compute the final file: an empty file plus hello and finally world yields helloworld. The issue is that this computation is not unique: hell followed by oworld produces the same output. With multiple ways to produce the same text, what then is the best way to compute a diff?
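The non-uniqueness is easy to demonstrate with a toy fold; apply and its chunk-list patch format are my own invention, not how git stores anything:

```rust
// Apply a "patch" modeled as a list of chunks appended in order.
fn apply(base: &str, chunks: &[&str]) -> String {
    let mut text = String::from(base);

    for chunk in chunks {
        text.push_str(chunk);
    }

    text
}

#[test]
fn different_histories_same_file() {
    // Two different edit histories converge on the same final text.
    assert_eq!(apply("", &["hello", "world"]), "helloworld");
    assert_eq!(apply("", &["hell", "oworld"]), "helloworld");
}
```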

Rather, how does one define quality, or better? Git has four diff algorithms: myers, minimal, patience, and histogram, with myers as the default. How do they differ? I know myers is greedy and prefers deletions over insertions, but I have no knowledge or experience of the others. Reading this paper and other sources suggests that histogram, a modified version of patience, is more human-readable, especially with code. Consider this sample code for a right triangle:

struct RightTriangle;

impl RightTriangle {
    fn hypothenuse(a: f32, b: f32) -> f32 {
        let sum = a.powi(2) + b.powi(2);

        sum.sqrt()
    }

    fn area(a: f32, b: f32) -> f32 {
        (a * b) / 2.0
    }
}

I will make three modifications: move RightTriangle::area before RightTriangle::hypothenuse, add a new function RightTriangle::perimeter, and refactor RightTriangle::hypothenuse. Here is the new snippet:

struct RightTriangle;

impl RightTriangle {
    fn area(a: f32, b: f32) -> f32 {
        (a * b) / 2.0
    }

    fn hypothenuse(a: f32, b: f32) -> f32 {
        (a.powi(2) + b.powi(2)).sqrt()
    }

    fn perimeter(a: f32, b: f32) -> f32 {
        a + b + Self::hypothenuse(a, b)
    }
}

Running git diff --diff-algorithm=myers first:

-    fn hypothenuse(a: f32, b: f32) -> f32 {
-        let sum = a.powi(2) + b.powi(2);
+    fn area(a: f32, b: f32) -> f32 {
+        (a * b) / 2.0
+    }

-        sum.sqrt()
+    fn hypothenuse(a: f32, b: f32) -> f32 {
+        (a.powi(2) + b.powi(2)).sqrt()
     }

-    fn area(a: f32, b: f32) -> f32 {
-        (a * b) / 2.0
+    fn perimeter(a: f32, b: f32) -> f32 {
+        a + b + Self::hypothenuse(a, b)

At a quick glance, it is hard to see what is going on here. With git diff --diff-algorithm=histogram:

-    fn hypothenuse(a: f32, b: f32) -> f32 {
-        let sum = a.powi(2) + b.powi(2);
-
-        sum.sqrt()
-    }
-
     fn area(a: f32, b: f32) -> f32 {
         (a * b) / 2.0
     }
+
+    fn hypothenuse(a: f32, b: f32) -> f32 {
+        (a.powi(2) + b.powi(2)).sqrt()
+    }
+
+    fn perimeter(a: f32, b: f32) -> f32 {
+        a + b + Self::hypothenuse(a, b)
+    }

This looks easier to understand. While this is a contrived refactoring example, it does show the possible advantage of the algorithm. If histogram is more human-readable, why is it not the default or more popular? I believe git does not need to chase subjectively better diffs, instead opting for the simplest default and offering configurability. Rather than arguing for histogram everywhere, I note that changing the diff algorithm is a minor gain, and more importantly it cannot be configured on GitHub or other major git hosts, so a team cannot share the results. Still, having potentially better diffs locally while staging or merging is worth a try, like so:

git config --global diff.algorithm histogram
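The same algorithm can also be tried per invocation or per repository before committing to a global default:

```shell
# Pick the algorithm for a single diff
git diff --diff-algorithm=histogram

# Shorthand flag for the same thing
git diff --histogram

# Enable it for the current repository only
git config diff.algorithm histogram
```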

I may get longer diffs in the worst case and perhaps mostly not notice the difference, but at least I now know the diff view can be configured.