Evaluation of Algorithms for Spreadsheet Differences Identification

Planted-model evaluation of algorithms for identifying differences between spreadsheets Anna Harutyunyan, Glencora Borradaile, Christopher Chambers, Christopher Scaffidi School of Electrical Engineering and Computer Science Oregon State University

Spreadsheets as a hub for work • Collecting, organizing, analyzing, and visualizing data • Frequently shared among people in the organization • Who then edit the spreadsheets • And then share the new versions • To other people who then reuse and edit them…  Proliferation of spreadsheets • People choose among which spreadsheets to reuse • Auditors may need to determine who made changes to which cells (that contain errors) Background Algorithm Evaluation Conclusions

Should I reuse Spreadsheet A or B? Spreadsheet X Edits by Bob Edits by Alice Spreadsheet A Spreadsheet B Background Algorithm Evaluation Conclusions

Existing features for understanding spreadsheet differences • TellTable, as well as Excel change tracking • Show differences between X and direct descendant A • We need to compare A vs B • DiffEngineX, Synkronizer, Suntrap, SheetDiff • Direct comparison of any A vs any B • Somewhat inaccurate at recovering intervening edits(errors on 2-12% at cell level, even higher on row/column, for 8 real spreadsheet pairs from the EUSES corpus) Background Algorithm Evaluation Conclusions

Example of an error (Synkronizer) Note and apologies: This figure is referenced but missing in the printed proceedings. (It’s my fault: accidentally deleted it during final round of edits.) Actual edits: insert B’s second column (“c”, “g”, …), insert B’s second row (“d”, “d”, “d”), change B’s A3 from “d” to “e” Background Algorithm Evaluation Conclusions

Outline of this talk Background Algorithm Evaluation Conclusions Background Algorithm Evaluation Conclusions

New algorithm concept • Find a “target alignment” of cells that are nearly identical • i.e., Find what A and B have in common • All remaining differences are attributable to edits • Specifically, row/column insertions in A or Bor cell-level edits within the target alignment cells Background  Algorithm Evaluation Conclusions

Target alignment concept An alignment with only 1 cell-level edit out of 14 cells Background  Algorithm Evaluation Conclusions

Starting point for a specific algorithm: LCS in 1D f c a d b a e f d c a d b a e Background  Algorithm Evaluation Conclusions

Let’s think in terms of aligning rows(put off thinking about columns for a moment) Background  Algorithm Evaluation Conclusions

Insight: Match up rows based on the length of their LCS (1D) A good alignment df dc ba fd ab aa ee ∑ equals 12 1 1 2 2 2 2 2 dcf ddd egc baa fad afb aga ege Background  Algorithm Evaluation Conclusions

Insight: Match up rows based on the length of their LCS (1D) A better alignment (maximal, actually) df dc ba fd ab aa ee ∑ equals 13 2 1 2 2 2 2 2 dcf ddd egc baa fad afb aga ege Background  Algorithm Evaluation Conclusions

Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB Background  Algorithm Evaluation Conclusions

Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Argmax ∑LCS1D(rows retained in A, rows retained in B), where the ∑ is over rows. (Use dynamic programming.) Background  Algorithm Evaluation Conclusions

Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Do the same with A and B to choose columns • Argmax ∑LCS1D(cols retained in A, cols retained in B), where the ∑ is over columns Background  Algorithm Evaluation Conclusions

Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Do the same with A and B to choose columns • For each row or column not chosen for target alignment • If it’s in B (i.e., not A), then represent as an insert • Else (it’s in A, not B), represent as a delete Background  Algorithm Evaluation Conclusions

Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Do the same with A and B to choose columns • For each row or column not chosen for target alignment • For each aligned row or column • If it has virtually no differences between A and B, then represent any remaining differences as cell-level edits • Else, represent the entire row/column as a delete+insert Background  Algorithm Evaluation Conclusions

Three investigations we conducted to evaluate RowColAlign • Tested on 10 manually-created spreadsheet pairs previously used to test an older algorithm (SheetDiff) • Won’t discuss today (due to time) – see paper • Bottom line: RowColAlign made no errors • Tested on >500 automatically-generated cases • Discussed below • Bottom line: RowColAlign made no errors • Formally analyzed expected behavior of RowColAlign • Summarized below • Bottom line: RowColAlign will rarely if ever make errors in practice; runtime is O(spreadsheet area2) Background  Algorithm Evaluation Conclusions

Evaluation based on planted model • Planted model = generative model • Automatically generates test cases • For which we know the correct answer • Very interesting technique to try because this way of thinking about evaluation might be useful for evaluating other algorithms that this community creates Background  Algorithm Evaluation Conclusions

Planted model / generating test cases • Create a blank spreadsheet O of size n x n • Randomly fill O with letters from alphabet of size s • Copy O twice to create A and B • For each row and each column in A and in B With probability p, delete that row or column • For each cell in B With probability q, replace with new random letter Background  Algorithm Evaluation Conclusions

Parameter values based on 8 real spreadsheet pairs from prior work For each parameter setting, we generated 25 test cases. Background  Algorithm Evaluation Conclusions

Result: RowColAlign made no errors For comparison: The existing SheetDiff algorithm made errors at a rate of up to 28% as p and q increased. Background  Algorithm Evaluation Conclusions

Pushing the algorithm further: Huge spreadsheets with many edits Background  Algorithm Evaluation Conclusions

Results: Still no errors Background  Algorithm Evaluation Conclusions

In brief: Why? • Incorrect alignment would be caused by a chance when rows happen to be similar. • Which is less and less likely when… • The alphabet is large • Because the probability that two cells have the same value by chance is ~ 1/s • The spreadsheet is large • Because the probability that n cells have matching values by chance is ~ (1/s)n Background  Algorithm Evaluation Conclusions

Conclusions • The subsequence of rows and columns that two spreadsheets have in common can be computed using a dynamic programming algorithm • The error rate of such an algorithm can be evaluated using a planted model • Our specific dynamic programming algorithm • Is unlikely to make errors when recovering edits Except on spreadsheets that are small or have small alphabets Background  Algorithm Evaluation Conclusions

Future research opportunities • Develop tools based on this algorithm • To help people understand and manage versions • To choose among multiple versions • Develop enhanced algorithms • For simultaneous diff of more than 2 spreadsheets • For clustering collections of spreadsheets based on similarity Background  Algorithm Evaluation Conclusions

Thank you For this opportunity to present For funding from Google and NSF For your questions and ideas Background  Algorithm Evaluation Conclusions

Evaluation of Algorithms for Spreadsheet Differences Identification