1 / 28

Evaluation of Algorithms for Spreadsheet Differences Identification

A study on algorithms to identify differences between Excel spreadsheets, assisting in data analysis, sharing, and reuse within organizations.

daria-riley
Download Presentation

Evaluation of Algorithms for Spreadsheet Differences Identification

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Planted-model evaluation of algorithms for identifying differences between spreadsheets Anna Harutyunyan, Glencora Borradaile, Christopher Chambers, Christopher Scaffidi School of Electrical Engineering and Computer Science Oregon State University

  2. Spreadsheets as a hub for work • Collecting, organizing, analyzing, and visualizing data • Frequently shared among people in the organization • Who then edit the spreadsheets • And then share the new versions • To other people who then reuse and edit them…  Proliferation of spreadsheets • People choose among which spreadsheets to reuse • Auditors may need to determine who made changes to which cells (that contain errors) Background Algorithm Evaluation Conclusions

  3. Should I reuse Spreadsheet A or B? Spreadsheet X Edits by Bob Edits by Alice Spreadsheet A Spreadsheet B Background Algorithm Evaluation Conclusions

  4. Existing features for understanding spreadsheet differences • TellTable, as well as Excel change tracking • Show differences between X and direct descendant A • We need to compare A vs B • DiffEngineX, Synkronizer, Suntrap, SheetDiff • Direct comparison of any A vs any B • Somewhat inaccurate at recovering intervening edits(errors on 2-12% at cell level, even higher on row/column, for 8 real spreadsheet pairs from the EUSES corpus) Background Algorithm Evaluation Conclusions

  5. Example of an error (Synkronizer) Note and apologies: This figure is referenced but missing in the printed proceedings. (It’s my fault: accidentally deleted it during final round of edits.) Actual edits: insert B’s second column (“c”, “g”, …), insert B’s second row (“d”, “d”, “d”), change B’s A3 from “d” to “e” Background Algorithm Evaluation Conclusions

  6. Outline of this talk Background Algorithm Evaluation Conclusions Background Algorithm Evaluation Conclusions

  7. New algorithm concept • Find a “target alignment” of cells that are nearly identical • i.e., Find what A and B have in common • All remaining differences are attributable to edits • Specifically, row/column insertions in A or Bor cell-level edits within the target alignment cells Background  Algorithm Evaluation Conclusions

  8. Target alignment concept An alignment with only 1 cell-level edit out of 14 cells Background  Algorithm Evaluation Conclusions

  9. Starting point for a specific algorithm: LCS in 1D f c a d b a e f d c a d b a e Background  Algorithm Evaluation Conclusions

  10. Let’s think in terms of aligning rows(put off thinking about columns for a moment) Background  Algorithm Evaluation Conclusions

  11. Insight: Match up rows based on the length of their LCS (1D) A good alignment df dc ba fd ab aa ee ∑ equals 12 1 1 2 2 2 2 2 dcf ddd egc baa fad afb aga ege Background  Algorithm Evaluation Conclusions

  12. Insight: Match up rows based on the length of their LCS (1D) A better alignment (maximal, actually) df dc ba fd ab aa ee ∑ equals 13 2 1 2 2 2 2 2 dcf ddd egc baa fad afb aga ege Background  Algorithm Evaluation Conclusions

  13. Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB Background  Algorithm Evaluation Conclusions

  14. Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Argmax ∑LCS1D(rows retained in A, rows retained in B), where the ∑ is over rows. (Use dynamic programming.) Background  Algorithm Evaluation Conclusions

  15. Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Do the same with A and B to choose columns • Argmax ∑LCS1D(cols retained in A, cols retained in B), where the ∑ is over columns Background  Algorithm Evaluation Conclusions

  16. Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Do the same with A and B to choose columns • For each row or column not chosen for target alignment • If it’s in B (i.e., not A), then represent as an insert • Else (it’s in A, not B), represent as a delete Background  Algorithm Evaluation Conclusions

  17. Summary of algorithm Given spreadsheets A and B, compute target alignment, then generate a list of edits AB • Use dynamic programming to choose which rows to include in the target alignment • Do the same with A and B to choose columns • For each row or column not chosen for target alignment • For each aligned row or column • If it has virtually no differences between A and B, then represent any remaining differences as cell-level edits • Else, represent the entire row/column as a delete+insert Background  Algorithm Evaluation Conclusions

  18. Three investigations we conducted to evaluate RowColAlign • Tested on 10 manually-created spreadsheet pairs previously used to test an older algorithm (SheetDiff) • Won’t discuss today (due to time) – see paper • Bottom line: RowColAlign made no errors • Tested on >500 automatically-generated cases • Discussed below • Bottom line: RowColAlign made no errors • Formally analyzed expected behavior of RowColAlign • Summarized below • Bottom line: RowColAlign will rarely if ever make errors in practice; runtime is O(spreadsheet area2) Background  Algorithm Evaluation Conclusions

  19. Evaluation based on planted model • Planted model = generative model • Automatically generates test cases • For which we know the correct answer • Very interesting technique to try because this way of thinking about evaluation might be useful for evaluating other algorithms that this community creates Background  Algorithm Evaluation Conclusions

  20. Planted model / generating test cases • Create a blank spreadsheet O of size n x n • Randomly fill O with letters from alphabet of size s • Copy O twice to create A and B • For each row and each column in A and in B With probability p, delete that row or column • For each cell in B With probability q, replace with new random letter Background  Algorithm Evaluation Conclusions

  21. Parameter values based on 8 real spreadsheet pairs from prior work For each parameter setting, we generated 25 test cases. Background  Algorithm Evaluation Conclusions

  22. Result: RowColAlign made no errors For comparison: The existing SheetDiff algorithm made errors at a rate of up to 28% as p and q increased. Background  Algorithm Evaluation Conclusions

  23. Pushing the algorithm further: Huge spreadsheets with many edits Background  Algorithm Evaluation Conclusions

  24. Results: Still no errors Background  Algorithm Evaluation Conclusions

  25. In brief: Why? • Incorrect alignment would be caused by a chance when rows happen to be similar. • Which is less and less likely when… • The alphabet is large • Because the probability that two cells have the same value by chance is ~ 1/s • The spreadsheet is large • Because the probability that n cells have matching values by chance is ~ (1/s)n Background  Algorithm Evaluation Conclusions

  26. Conclusions • The subsequence of rows and columns that two spreadsheets have in common can be computed using a dynamic programming algorithm • The error rate of such an algorithm can be evaluated using a planted model • Our specific dynamic programming algorithm • Is unlikely to make errors when recovering edits Except on spreadsheets that are small or have small alphabets Background  Algorithm Evaluation Conclusions

  27. Future research opportunities • Develop tools based on this algorithm • To help people understand and manage versions • To choose among multiple versions • Develop enhanced algorithms • For simultaneous diff of more than 2 spreadsheets • For clustering collections of spreadsheets based on similarity Background  Algorithm Evaluation Conclusions

  28. Thank you For this opportunity to present For funding from Google and NSF For your questions and ideas Background  Algorithm Evaluation Conclusions

More Related