Copy or Not Dawei (David) Shi
Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo
Introduction • A web-based document comparator • Calculate accurate similarity between 2 documents
Algorithm • Preprocessing • Vector space • Similarity calculation
Preprocessing • Stemming • Porter Stemming Algorithm • E.g. • cat – cats • meet – meeting • agree – agreed • correct – correctness
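The idea behind the stemming step can be sketched with a tiny suffix stripper. This is not the Porter Stemming Algorithm, which applies several ordered rule phases with measure conditions; it is a much-simplified stand-in that only shows how the word pairs above collapse to a shared form. The function name `toy_stem` and the suffix rules are assumptions made for illustration.

```python
# Hypothetical, heavily simplified stand-in for a real Porter stemmer.
# Each rule is (suffix, replacement); the first matching rule wins.
SUFFIX_RULES = [
    ("ness", ""),   # correctness -> correct
    ("ing", ""),    # meeting -> meet
    ("eed", "ee"),  # agreed -> agree
    ("ed", ""),
    ("s", ""),      # cats -> cat
]


def toy_stem(word):
    """Lowercase a word and strip one common English suffix, if any."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 characters to remain so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

With these rules, both members of each pair on the slide map to the same string, which is what lets the later vector step treat them as one term. A real Porter stemmer would produce different stems for some words (for example, it reduces "agree" and "agreed" to "agre").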
Vector Space • Build dictionary 1 • word -> frequency • Sort the keys of dictionary 1 • Build dictionary 2 • key -> (index, count) • Build binary vectors • index -> occurrence
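The vector-space steps above might look like the following sketch. The slide does not say whether the dictionaries are built per document or over both documents; this sketch assumes one shared vocabulary over both, so that the two vectors have the same length. The function names `build_vocabulary` and `binary_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokens_a, tokens_b):
    """Return {word: (index, count)} over both documents.

    Assumption: a single shared vocabulary is used for the two documents.
    """
    # Dictionary 1: word -> frequency across both token lists.
    frequency = Counter(tokens_a) + Counter(tokens_b)
    # Dictionary 2: sorted key -> (index, count).
    return {
        word: (index, frequency[word])
        for index, word in enumerate(sorted(frequency))
    }


def binary_vector(tokens, vocabulary):
    """Return a list with 1 at the index of every word that occurs in tokens."""
    vector = [0] * len(vocabulary)
    for word in set(tokens):
        index, _count = vocabulary[word]
        vector[index] = 1
    return vector
```

For example, the token lists `["cat", "sat", "mat"]` and `["cat", "ran"]` share the sorted vocabulary `cat, mat, ran, sat`, so each document becomes a four-element vector of zeros and ones.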
Similarity Calculation • Vectors v1 and v2 • Similarity = (v1 · v2) / (norm(v1) * norm(v2)) • This is the cosine of the angle between v1 and v2
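As a minimal sketch, the similarity formula can be written in plain Python as follows. The function name `cosine_similarity` and the choice to return 0.0 when either vector is all zeros are assumptions; the slides do not say how an empty document is handled.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: an all-zero vector is treated as sharing nothing.
        return 0.0
    return dot / (norm1 * norm2)
```

For binary vectors the result lies between 0 (no shared terms) and 1 (identical term sets).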
Performance • Algorithms coded in Python • Dynamic typing • Not good at numerical operations • Solution: numpy
Numpy • A Python extension module • Written mostly in C • Defines numerical array and matrix types and basic operations on them
Numpy vs Python • Python code • a = range(10000000) • b = range(10000000) • c = [] • for i in range(len(a)): • c.append(a[i] + b[i]) • Takes up to 10 seconds on a several GHz processor
Numpy vs Python • Numpy code • import numpy as np • a = np.arange(10000000) • b = np.arange(10000000) • c = a + b • Almost instant
Numpy Usage • Vector dot product • Vector normalization • Vector zero filling
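The three uses listed above could be combined roughly as in this sketch. It relies only on `np.zeros`, `np.dot`, and `np.linalg.norm`; the helper names `occurrence_vector` and `cosine_similarity_np` are hypothetical, and the slides do not show how the project actually calls Numpy.

```python
import numpy as np


def occurrence_vector(indices, size):
    """Zero-fill a vector of the given size, then mark the given indices."""
    vector = np.zeros(size)
    vector[np.asarray(list(indices), dtype=int)] = 1.0
    return vector


def cosine_similarity_np(v1, v2):
    """Dot product divided by the product of the two vector norms."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denominator = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denominator == 0:
        # Assumption: an all-zero vector yields similarity 0.
        return 0.0
    return float(np.dot(v1, v2) / denominator)
```

Because the dot product and the norms run in compiled code, this avoids the per-element Python loop shown in the comparison slides.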
Framework • Django • The web framework for perfectionists with deadlines
Libraries • Python • Numpy • Porter Stemming • jQuery
Hosting • Alwaysdata • Django 1.3 • Python 2.6
Future Work • Support file uploading and comparison • Add HTML5 features
Demo • http://imds.alwaysdata.net