Software Bertillonage Finding the Provenance of Software Entities. Mike Godfrey Software Architecture Group (SWAG) University of Waterloo. Work with …. Julius Davies ** Daniel German Abram Hindle. Wei Wang Ian Davis Cory Kapser * ** Lijie Zou Qiang Tu. ** Did most of the hard work
Software Bertillonage: Finding the Provenance of Software Entities
Mike Godfrey, Software Architecture Group (SWAG), University of Waterloo
Work with …
• Julius Davies **
• Daniel German
• Abram Hindle
• Wei Wang
• Ian Davis
• Cory Kapser ***
• Lijie Zou
• Qiang Tu
** Did most of the hard work this time
*** Did most of the hard work last time
“Provenance”
A set of documentary evidence pertaining to the origin, history, or ownership of an artifact. [From “provenir”, French for “to come from”]
• Originally used for art / antiques, but now used in science and IT:
  • Data provenance / audit trails
  • Component provenance for security
• … but what about source code artifacts?
Consider this code …

    const char *err = ap_check_cmd_context(cmd, GLOBAL_ONLY);
    if (err != NULL) {
        return err;
    }
    ap_threads_per_child = atoi(arg);
    if (ap_threads_per_child > thread_limit) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: ThreadsPerChild of %d exceeds ThreadLimit "
                     "value of %d", ap_threads_per_child, thread_limit);
        ….
        ap_threads_per_child = thread_limit;
    } else if (ap_threads_per_child < 1) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: Require ThreadsPerChild > 0, setting to 1");
        ap_threads_per_child = 1;
    }
    return NULL;

… and this code

    const char *err = ap_check_cmd_context(cmd, GLOBAL_ONLY);
    if (err != NULL) {
        return err;
    }
    ap_threads_per_child = atoi(arg);
    if (ap_threads_per_child > thread_limit) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: ThreadsPerChild of %d exceeds ThreadLimit "
                     "value of %d threads,", ap_threads_per_child, thread_limit);
        ….
        ap_threads_per_child = thread_limit;
    } else if (ap_threads_per_child < 1) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: Require ThreadsPerChild > 0, setting to 1");
        ap_threads_per_child = 1;
    }
    return NULL;
… or these two functions

    gnumeric_oct2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
    {
        return val_to_base (ei, argv[0], argv[1], 8, 2, 0,
                            GNM_const(7777777777.0),
                            V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
    }

    gnumeric_hex2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
    {
        return val_to_base (ei, argv[0], argv[1], 16, 2, 0,
                            GNM_const(9999999999.0),
                            V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
    }

Or this …

    static PyObject *
    py_new_RangeRef_object (GnmRangeRef const *range_ref)
    {
        py_RangeRef_object *self;
        self = PyObject_NEW (py_RangeRef_object, &py_RangeRef_object_type);
        if (self == NULL) {
            return NULL;
        }
        self->range_ref = *range_ref;
        return (PyObject *) self;
    }

… and this

    static PyObject *
    py_new_Range_object (GnmRange const *range)
    {
        py_Range_object *self;
        self = PyObject_NEW (py_Range_object, &py_Range_object_type);
        if (self == NULL) {
            return NULL;
        }
        self->range = *range;
        return (PyObject *) self;
    }
“Code clone” detection methods
• Strings
• Tokens
• ASTs
• PDGs
Time, complexity, and programming-language dependence increase from strings to PDGs.
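To make the difference between the first two levels concrete, here is a minimal sketch, not one of the clone detectors the talk refers to. It compares the two gnumeric function bodies shown earlier, first as raw strings and then as token streams in which identifiers and numbers are replaced by placeholders. The tokenizer regex, the `ID`/`NUM` placeholders, and the helper names are all assumptions made for illustration; the similarity measure is `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately crude C tokenizer:
# an identifier, a number, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")

def tokens(code, normalize=False):
    """Split code into tokens; optionally abstract identifiers and numbers."""
    out = []
    for tok in TOKEN.findall(code):
        if normalize and re.match(r"[A-Za-z_]", tok):
            tok = "ID"
        elif normalize and tok[0].isdigit():
            tok = "NUM"
        out.append(tok)
    return out

def string_similarity(a, b):
    """Character-level similarity of the raw text."""
    return SequenceMatcher(None, a, b).ratio()

def token_similarity(a, b):
    """Similarity of the normalized token sequences."""
    return SequenceMatcher(None, tokens(a, True), tokens(b, True)).ratio()

oct2bin = ("return val_to_base (ei, argv[0], argv[1], 8, 2, 0, "
           "GNM_const(7777777777.0), V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);")
hex2bin = ("return val_to_base (ei, argv[0], argv[1], 16, 2, 0, "
           "GNM_const(9999999999.0), V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);")

print(string_similarity(oct2bin, hex2bin))
print(token_similarity(oct2bin, hex2bin))
```

After normalization the two bodies have identical token sequences, whereas the raw strings differ in their literals. AST- and PDG-based methods need a parser for the language, which is where the extra cost and language dependence come from.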
Provenance: Related ideas
• Software clone detection
• Why?
  • Just “understand” where / why duplication has occurred
  • Possible refactoring to reduce inconsistent maintenance, binary footprint size, to improve design, …
  • Tracking software licensing compatibilities, esp. included libraries and cross-product entity “adoption”
• Many techniques for this
Provenance: Related ideas
• “Origin analysis” + software genealogy
• Why?
  • Program comprehension
  • Name / location change of a software entity within a system can break longitudinal studies
• Use entity and relationship analysis to look for likely suspects [TSE-05]
[Figure: entities z, g, x, y in V_old and z, f, x, y in V_new]
Provenance: Related ideas
• Software dev. “recommender systems”, mining software repositories
• Why?
  • Given info about similar situations, what might be helpful / informative in this situation?
• Many techniques (AI, LSI, LDA, data mining, plus ad hoc specializations + combinations)
• … and so on …
Who are you? Alphonse Bertillon (1853-1914)
Bertillonage metrics
• Height
• Stretch: Length of body from left shoulder to right middle finger when arm is raised
• Bust: Length of torso from head to seat, taken when seated
• Length of head: Crown to forehead
• Width of head: Temple to temple
• Length of right ear
• Length of left foot
• Length of left middle finger
• Length of left cubit: Elbow to tip of middle finger
• Width of cheeks
Forensic Bertillonage
• Some problems …
  • Equipment was cumbersome, expensive, required training
  • Measurement error, consistency
  • The metrics were not independent!
  • Adoption (and later abandonment)
• … but overall it was a big success!
  • Quick and dirty, and a huge leap forward
  • Some training and tools required, but could be performed with the technology of the late 1800s
  • If done accurately, could quickly narrow down a very large pool of mugshots to only a handful
Software Bertillonage
• We want quick & dirty ways of investigating the provenance of a function (file, library, binary, etc.)
• Who are you, really?
  • Entity and relationship analysis
• Where did you come from?
  • Evolutionary history
• Does your mother know you’re here?
  • Licensing
Bertillonage desiderata
• A good Bertillonage metric should:
  • be computationally inexpensive
  • be applicable to the desired level of granularity and programming language
  • catch most of the bad guys (recall)
  • significantly reduce the search space (precision)
• Why not “fingerprinting” or “DNA analysis”?
  • Often there just is not enough info (or too much noise) to make a conclusive identification
  • So we hope to reduce the candidate set so that manual examination is feasible
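The recall and precision criteria can be made concrete with a small sketch. The function names, the corpus size, and the candidate set below are invented for illustration and do not come from the talk; the point is that a Bertillonage-style filter may have low precision and still be useful if it keeps the true match and shrinks the pool enough for manual inspection.

```python
def recall(candidates, relevant):
    """Fraction of the true matches that survive the filter."""
    return len(candidates & relevant) / len(relevant) if relevant else 1.0

def precision(candidates, relevant):
    """Fraction of the surviving candidates that are true matches."""
    return len(candidates & relevant) / len(candidates) if candidates else 0.0

def search_space_reduction(candidates, corpus_size):
    """Share of the corpus that no longer needs manual examination."""
    return 1 - len(candidates) / corpus_size

# Hypothetical example: a corpus of 1000 archives, one true origin ("a"),
# and a cheap metric that narrows the pool to four candidates.
corpus_size = 1000
relevant = {"a"}
candidates = {"a", "b", "c", "d"}

print(recall(candidates, relevant))                     # 1.0
print(precision(candidates, relevant))                  # 0.25
print(search_space_reduction(candidates, corpus_size))  # about 0.996
```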
A real-world problem
• Software packages often bundle in third-party libraries to avoid “DLL-hell” [Di Penta-10]
  • In the Java world, jars may include library source code or just byte code
  • Included libs may include other libs too!
• Payment Card Industry Data Security Std (PCI-DSS), Req #6:
  • “All critical systems must have the most recently released, appropriate software patches to protect against exploitation and compromise of cardholder data.”
What if a financial software package doesn’t explicitly list the version IDs of its included libraries?
Identifying included libraries
• The version ID may be embedded in the name of the component! e.g., commons-codec-1.1.jar
  • … but often the version info is simply not there!
• Use the fully qualified name (FQN) of each class plus a code search engine [Di Penta-10]
  • Won’t work if we don’t have the library source code
• Compare against all known compiled binaries
  • But compilers and build-time compilation options may differ
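The first option, reading the version ID out of the file name, can be sketched in a few lines. The naming convention assumed here (`<artifact>-<version>.jar`, with the version starting with a digit) and the helper name `parse_jar_name` are illustrative assumptions, not something the talk specifies; a name without a version yields `None`, which is exactly the case that motivates the rest of the approach.

```python
import re

# Assumed convention: <artifact>-<version>.jar, version begins with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")

def parse_jar_name(filename):
    """Return (artifact, version); version is None when the name carries none."""
    m = JAR_NAME.match(filename)
    if m is None:
        artifact = filename[:-4] if filename.endswith(".jar") else filename
        return artifact, None
    return m.group("artifact"), m.group("version")

print(parse_jar_name("commons-codec-1.1.jar"))  # ('commons-codec', '1.1')
print(parse_jar_name("httpclient.jar"))         # ('httpclient', None)
```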
Anchored class signatures
• Idea: Compile / acquire all known lib versions but extract only the signatures, then compare against the target binary
  • Shouldn’t vary by compiler / build settings
• For a class C with methods M1, …, Mn, we define its anchored class signature as: θ(C) = ⟨σ(C), ⟨σ(M1), ..., σ(Mn)⟩⟩
• For an archive A composed of classes C1, …, Ck, we define its anchored class signature as: θ(A) = {θ(C1), ..., θ(Ck)}
    // This is **decompiled** source!!
    package a.b;
    public class C extends java.lang.Object implements g.h.I {
        public C() {
            // default constructor is inserted by javac
        }
        synchronized static int a (java.lang.String s) throws a.b.E {
            // decompiled byte code omitted
        }
    }

σ(C) = public class a.b.C extends Object implements I
σ(M1) = public C()
σ(M2) = default synchronized static int a(String) throws E
θ(C) = ⟨σ(C), ⟨σ(M1), σ(M2)⟩⟩
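The construction of θ(C) and θ(A) can be sketched as plain data structures. This is an illustration under stated assumptions, not the authors' extractor: the signature strings are taken as already extracted (here copied from the example above), the function names are hypothetical, and method signatures are sorted so that the result does not depend on declaration order, which is a choice made for this sketch.

```python
from typing import FrozenSet, Iterable, Tuple

# theta(C) is modelled as (class signature, tuple of method signatures).
ClassSig = Tuple[str, Tuple[str, ...]]

def anchored_class_signature(class_sig: str, method_sigs: Iterable[str]) -> ClassSig:
    """Build theta(C) = <sigma(C), <sigma(M1), ..., sigma(Mn)>>.

    Assumption: methods are sorted to get a canonical order.
    """
    return (class_sig, tuple(sorted(method_sigs)))

def anchored_archive_signature(classes: Iterable[ClassSig]) -> FrozenSet[ClassSig]:
    """Build theta(A) = {theta(C1), ..., theta(Ck)} as a set."""
    return frozenset(classes)

theta_c = anchored_class_signature(
    "public class a.b.C extends Object implements I",
    ["public C()", "default synchronized static int a(String) throws E"],
)
theta_a = anchored_archive_signature([theta_c])

print(theta_c)
print(len(theta_a))
```

Because θ(C) is a hashable tuple, archives become ordinary sets and can be compared with set operations.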
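Archive similarity
• We define the similarity index of two archives A and B as the Jaccard coefficient of their anchored signatures: sim(A, B) = |θ(A) ∩ θ(B)| / |θ(A) ∪ θ(B)|
• We define the inclusion index of A in B as the fraction of A’s class signatures that also appear in B: inclusion(A, B) = |θ(A) ∩ θ(B)| / |θ(A)|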
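Both indices reduce to set arithmetic once each archive is represented as a set of class signatures. The sketch below assumes that the inclusion index is the share of the first archive's class signatures that also occur in the second; the function names, the handling of empty archives, and the toy signature sets are illustrative assumptions.

```python
def similarity_index(a, b):
    """Jaccard coefficient of two sets of class signatures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def inclusion_index(a, b):
    """Share of a's class signatures that also occur in b (assumed definition)."""
    return len(a & b) / len(a) if a else 0.0

# Toy archives: strings stand in for anchored class signatures.
target = {"c1", "c2", "c3"}
candidate = {"c2", "c3", "c4", "c5"}

print(similarity_index(target, candidate))  # 2 shared / 5 total = 0.4
print(inclusion_index(target, candidate))   # 2 shared / 3 in target
```

The similarity index is symmetric, while the inclusion index is not: a small archive fully contained in a large one has inclusion 1.0 but a low similarity index.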
Implementation
• Created byte code (bcel5) and source code signature extractors
• Used SHA1 hash for class signatures to improve performance
  • We don’t care about near misses at the method or class level!
• Built corpus from Maven2 jar repository
  • Maven is unversioned + volatile!
  • 150 GB of jars, zips, tarballs, etc.
  • 130,000 binary jars (75,000 unique)
  • 26M .class files, 4M .java source files (incl. duplicates)
  • Archives contain archives: 75,000 classes are nested 4 levels deep!
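Two of these points lend themselves to a short sketch: hashing a class signature with SHA1 so that signatures can be compared as fixed-length strings, and descending into archives nested inside archives. The code below uses only the Python standard library (`hashlib`, `zipfile`, `io`); the function names and the list of archive suffixes are assumptions, tarballs are not handled, and the actual signature extraction from byte code is out of scope here.

```python
import hashlib
import io
import zipfile

# Assumed set of zip-based archive suffixes; tarballs would need tarfile.
ARCHIVE_SUFFIXES = (".jar", ".zip", ".war", ".ear")

def sha1_of_signature(signature_text):
    """Hash a class signature so equality checks are cheap string compares."""
    return hashlib.sha1(signature_text.encode("utf-8")).hexdigest()

def walk_classes(data, depth=0):
    """Yield (depth, entry name) for every .class file, recursing into nested archives."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if name.endswith(".class"):
                yield depth, name
            elif name.lower().endswith(ARCHIVE_SUFFIXES):
                yield from walk_classes(zf.read(name), depth + 1)

def make_zip(entries):
    """Build an in-memory zip from {name: bytes}; used only for the demo below."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in entries.items():
            zf.writestr(name, payload)
    return buf.getvalue()

inner = make_zip({"x/Y.class": b""})
outer = make_zip({"a/B.class": b"", "lib/inner.jar": inner})
print(sorted(walk_classes(outer)))
print(sha1_of_signature("public C()"))
```

Hashing whole class signatures means only exact matches count, which fits the remark that near misses at the method or class level are not of interest.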
Investigation Target system: An industrial e-commerce app containing 84 jars RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive? RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
Investigation
RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?
• For 51 / 84 binary jars (60.7%), we found a single candidate from the corpus with a similarity index of 1.0.
  • 48 exact, 3 correct product (target version not in Maven)
• For 20 / 84, we found multiple matches with simIndex = 1.0
  • 19 exact, 1 correct product
• For 12 / 84, we found matches with 0 < simIndex < 1.0
  • 1 exact, 9 correct product, 2 incorrect product
• For 1 / 84, we found no match (product not in Maven as a binary)
More data here: http://juliusdavies.ca/uvic/jarchive/
Investigation
RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
• For 22 / 84 binary jars (26.2%), we found a single candidate from the corpus with a similarity index of 1.0.
  • 13 exact, 2 correct product
• For 7 / 84, we found multiple matches with simIndex = 1.0
  • 6 exact, 1 correct product
• For 46 / 84, we found matches with 0 < simIndex < 1.0
  • 25 exact, 20 correct product, 1 incorrect product
• For 16 / 84, we found no match (product not in Maven as source)
More data here: http://juliusdavies.ca/uvic/jarchive/
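As a quick arithmetic check on the RQ1 breakdown (binary archives), the buckets can be tallied directly. The counts are copied from the RQ1 figures above; the dictionary layout and variable names are just one way to organize them.

```python
# RQ1 counts per bucket: (exact, correct product, incorrect product, no match)
rq1 = {
    "single candidate, simIndex = 1.0":    (48, 3, 0, 0),
    "multiple candidates, simIndex = 1.0": (19, 1, 0, 0),
    "best match with 0 < simIndex < 1.0":  (1, 9, 2, 0),
    "no match":                            (0, 0, 0, 1),
}

total = sum(sum(row) for row in rq1.values())
found = sum(exact + correct for exact, correct, _, _ in rq1.values())

print(total)                           # 84
print(found)                           # 81
print(round(100 * found / total, 1))   # 96.4
```

So an exact match or the correct product was found for 81 of the 84 jars, about 96.4%.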
Investigation RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive? • Found exact match or correct product 81 / 84 times (96.4%) RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive? • Found exact match or correct product 57 / 84 times (67.9%)
Further uses
• Used the version info extracted from the e-commerce app to perform audits for licensing and security
  • One jar changed open source licenses (GNU Affero, LGPL)
  • One jar version was found to have known security bugs
• When did Google Android developers copy-paste httpclient.jar classes into android.jar?
  • And how much work would it be to include a newer version?
  • We narrowed it down to two likely candidates, one of which turned out to be correct.
Summary: Anchored signature matching
• If the master repository is rich enough, anchored signature matching can be highly accurate for both source and binary
  • Though one might have to examine multiple candidates with perfect scores (median: 4, max: 30)
  • Also works well if the product is present but the target version is missing
• The approach is fast and simple, once the repository has been built
Summary: Software Provenance / Bertillonage
• Who are you?
  • Determining the provenance of software entities is a growing and important problem
• Software Bertillonage:
  • Quick and dirty techniques applied widely, then expensive techniques applied narrowly
• Identifying version IDs of included Java libraries is an example of the software provenance problem
  • And our solution of anchored signature matching is an example of software Bertillonage
References
• “Software Bertillonage: Finding the provenance of an entity”, by Julius Davies, Daniel M. German, Michael W. Godfrey, and Abram J. Hindle. Under review.
• “Copy-Paste as a Principled Engineering Tool”, by Michael W. Godfrey and Cory J. Kapser. Chapter 28 in the book Making Software: What Really Works and Why We Believe It, by Greg Wilson and Andy Oram (eds.), O'Reilly and Associates, October 2010.
• [TSE-05] “Using Origin Analysis to Detect Merging and Splitting of Source Code Entities”, by Michael W. Godfrey and Lijie Zou, IEEE Transactions on Software Engineering, 31(2), Feb 2005.
Chapter 28 is awesome!! All author proceeds to Amnesty International!!
Non-CS References • Fingerprints: The Origins of Crime Detection and the Murder Case that Launched Forensic Science, Colin Beavan, Hyperion Publishing, 2001. • http://en.wikipedia.org/wiki/Alphonse_Bertillon