Smart Software with F# Joel Pobar Language Geek http://callvirt.net/blog
Agenda • What is it? • F# Intro • Algorithms: • Search • Fuzzy Matching • Classification (SVM) • Recommendations • Q&A
All This in 45 mins? • This is an awareness session! • Lots of content, very broad, very fast • You’ll get all demos, pointers, and slide deck to take offline and digest • Two takeaways: • F# is a great language for data • Smart algorithms aren’t hard – use them, explore more!
F# is ... a functional, object-oriented, imperative and explorative programming language for .NET • What is Functional Programming? • http://callvirt.net/jaoo.zip
What is Functional Programming? • Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data” • -> Emphasizes functions • -> Emphasizes shapes of data, rather than impl. • -> Modeled on lambda calculus • -> Reduced emphasis on imperative • -> Safely raises level of abstraction
Motivation for Functional • Simplicity in life is good: cheaper, easier, faster, better. • We typically achieve simplicity in software in two ways: • By raising the level of abstraction (and OO was one design to raise abstraction) • By increasing modularity • Increasing signal to noise is another good strategy: • Communicate more in less time with more clarity • Better composition and modularity == reuse
Functional Programming: Safer, while still being useful • [Diagram: languages placed on two axes, Useful vs. Not Useful and Unsafe vs. Safe. Labels: C#, C++, …; V.Next#; F#; Haskell]
What is F# for? • F# is a General Purpose language • Can be used for a broad range of programming tasks • Superset of imperative and dynamic features • Great for learning FP concepts • Some particularly important domains • Financial modeling and analysis • Data mining • Scientific data analysis • Domain-specific modeling • Academic
Let • Type inference: the static typing of C# with the succinctness of a scripting language • ‘Let’ binds values to identifiers
    let helloWorld = "Hello, World"
    printfn "%A" helloWorld   // print_any in early F# releases
    let myNum = 12
    let myAddFunction x y =
        let sum = x + y
        sum
Tuples • Simple, and most useful data structure
    let site1 = ("msdn.com", 10)
    let site2 = ("abc.net.au", 12)
    let site3 = ("news.com.au", 22)
    let allSites = (site1, site2, site3)
    let fst (a, b) = a
    let snd (a, b) = b
Lists, Arrays, Seq and Options • Lists & Arrays are first-class citizens • Options provide a some-or-nothing capability
    let list1 = ["Joel"; "Luke"]
    let array = [|2; 3; 5|]
    let myseq = seq [0; 1; 2]
    let option1 = Some("Joel")
    let option2 = None
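Records • Simple concrete type definition
    type Person = { Name: string; DateOfBirth: System.DateTime }
    // DateOfBirth is a System.DateTime, so it needs a DateTime value rather than the string "13/04/81"
    let n = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }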
Immutability (by default) • Data is immutable by default • Values may not be changed
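A minimal sketch of what immutable-by-default looks like in practice. The names (`total`, `counter`, `Site`) are illustrative and not from the original deck.

```fsharp
// Values bound with 'let' cannot be reassigned.
let total = 10
// total <- 11        // would not compile: 'total' is not mutable

// Instead of changing a value, build a new one from the old one.
let newTotal = total + 1

// Mutation is opt-in through the 'mutable' keyword.
let mutable counter = 0
counter <- counter + 1

// Records are immutable too; 'with' copies a record and changes one field.
type Site = { Url: string; Hits: int }
let site = { Url = "msdn.com"; Hits = 10 }
let busierSite = { site with Hits = site.Hits + 5 }

printfn "%d %d %d %d" newTotal counter site.Hits busierSite.Hits
```

The original `site` record is untouched after the copy, which is what makes values safe to share between functions.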
Discriminated Unions • Great for representing the structure of data
    type Make = string
    type Model = string
    type Transport =
        | Car of Make * Model
        | Bicycle
    let me = Car ("Holden", "Barina")
    let you = Bicycle
Both of these identifiers are of type “Transport”
Functions • Functions: like delegates + unified and simple • Deep type inference
    (fun x -> x + 1)
    let myFunc x = x + 1
    // val myFunc : int -> int
    let rec factorial n = if n > 1 then n * factorial (n - 1) else 1
    let data = [5; 3; 4; 4; 5]
    // Current F# takes a custom comparer through List.sortWith
    List.sortWith (fun x y -> x - y) data
Pattern Matching • Very important part of F# • Helps deal with the ‘teasing apart’ of data • Works best with Discriminated Unions & Records
    open System
    let (fst, _) = ("first", "second")
    Console.WriteLine(fst)
    // Transport is the discriminated union from the earlier slide
    let switchOnType (a: obj) =
        match a with
        | :? Int32 -> printfn "int!"
        | :? Transport -> printfn "Transport"
        | _ -> printfn "Everything Else!"
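To illustrate the point that pattern matching works best with discriminated unions, here is a small sketch that matches on the `Transport` type from the earlier slide. The `describe` function is a hypothetical helper, not part of the talk's demos.

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// Each case is matched by shape, and the Car payload is bound to names.
let describe transport =
    match transport with
    | Car ("Holden", model) -> sprintf "A Holden %s" model
    | Car (make, model) -> sprintf "A %s %s" make model
    | Bicycle -> "A bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

If a case is left out of the `match`, the compiler warns about the incomplete match, which is a large part of why this style is safe to refactor.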
Search • Given a search term and a large document corpus, rank and return a list of the most relevant results…
Search • Words • Stemming? Tokenize? • E.g ‘Python/Ruby’ • Markup • Title, Author, Date • Headings (h1,h2 etc) • Paragraphs • Links • A sign of strength? Let’s explore something simple…
Search • Simplify: • For easy machine/language manipulation • … and most importantly, easy computation • Vectors: nature’s own quality data structure • Convenient machine representation (lists/arrays) • Lots of existing vector math algorithms
Example input: After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …
[Figure: the input reduced to a term vector. Term labels: after, incubation, loving, moonlight, firefox, linux, binaries. Counts: 2 1 1 6 4 6 2]
Term Count • Document1 (Linux post): the 9, incubation 1, crazy 1, moonlight 6, firefox 4, linux 6, penguin 2 • Document2 (Animal post): crazy 2, the 2, dog 1, penguin 5 • Vector space:
            the  incubation  crazy  moonlight  firefox  linux  dog  penguin
    Linux     9           1      1          6        4      6    0        2
    Animal    2           0      2          0        0      0    1        5
Term Count Issues • Query: ‘the dog penguin’ • Linux: 9+0+2 = 11 • Animal: 2+1+5 = 8 • ‘the’ is overweight • Enter TF-IDF: Term Frequency Inverse Document Frequency • A weight to evaluate how important a word is to a document in a corpus • i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
            the  incubation  crazy  moonlight  firefox  linux  dog  penguin
    Linux     9           1      1          6        4      6    0        2
    Animal    2           0      2          0        0      0    1        5
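The raw term-count scoring above can be sketched in a few lines of F#. The counts come from the slide; the function names (`termCount`, `score`) and the whitespace tokenisation are assumptions made for this sketch.

```fsharp
// Term counts for the two example documents, taken from the slide.
let linuxPost =
    Map.ofList [ ("the", 9); ("incubation", 1); ("crazy", 1); ("moonlight", 6);
                 ("firefox", 4); ("linux", 6); ("penguin", 2) ]

let animalPost =
    Map.ofList [ ("the", 2); ("crazy", 2); ("dog", 1); ("penguin", 5) ]

// Count of a term in a document, or 0 when the term is absent.
let termCount (doc: Map<string, int>) term =
    match Map.tryFind term doc with
    | Some count -> count
    | None -> 0

// Naive score: sum the raw counts of every query term.
let score doc (query: string) =
    query.Split(' ')
    |> Array.sumBy (termCount doc)

printfn "Linux:  %d" (score linuxPost "the dog penguin")   // 9 + 0 + 2 = 11
printfn "Animal: %d" (score animalPost "the dog penguin")  // 2 + 1 + 5 = 8
```

The Linux post wins a query about dogs and penguins purely because it says ‘the’ nine times, which is the problem TF-IDF addresses.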
TF-IDF • Normalise the term count: • tf = termCount / docWordCount • Measure importance of term • idf = log ( |D| / termDocumentCount) • where |D| is the total documents in the corpus • tfidf = tf * idf • A high weight is reached by high term frequency, and a low document frequency
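A minimal sketch of the three formulas above. It assumes documents are already tokenised into lower-case words; the `Document` type, the tiny corpus and the choice to return 0 for unseen terms are all assumptions of this sketch, not part of the talk.

```fsharp
// A document is a list of already-tokenised, lower-cased words (an assumption).
type Document = string list

// tf = termCount / docWordCount
let tf (term: string) (doc: Document) =
    let termCount = doc |> List.filter (fun w -> w = term) |> List.length
    float termCount / float (List.length doc)

// idf = log (|D| / termDocumentCount)
let idf (term: string) (corpus: Document list) =
    let termDocumentCount =
        corpus |> List.filter (List.contains term) |> List.length
    // A term seen in no document is given weight 0 here (a choice made for this sketch).
    if termDocumentCount = 0 then 0.0
    else log (float (List.length corpus) / float termDocumentCount)

let tfidf term doc corpus = tf term doc * idf term corpus

// Made-up corpus of three short documents.
let corpus : Document list =
    [ [ "the"; "linux"; "penguin"; "the"; "firefox" ]
      [ "the"; "dog"; "penguin" ]
      [ "the"; "crazy"; "dog" ] ]

let linuxDoc = List.head corpus
printfn "the:   %f" (tfidf "the" linuxDoc corpus)    // in every document, so idf = log 1 = 0
printfn "linux: %f" (tfidf "linux" linuxDoc corpus)  // in one of three documents, so idf > 0
```

Even though ‘the’ is the most frequent word in the first document, its weight drops to zero because it appears everywhere.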
Fuzzy Matching • String similarity algorithms: • SoundEx; Metaphone • Jaro Winkler Distance; Cosine similarity; Sellers; Euclidean distance; … • We’ll look at the Levenshtein Distance algorithm • Defined as: the minimum number of edit operations that transform string1 into string2
Fuzzy Matching • Edit costs: • In-place copy - cost 0 • Delete a character in string1 - cost 1 • Insert a character in string2 - cost 1 • Substitute a character for another - cost 1 • Transform ‘kitten’ into ‘sitting’ • kitten -> sitten (cost 1 - replace k with s) • sitten -> sittin (cost 1 - replace e with i) • sittin -> sitting (cost 1 - add g) • Levenshtein distance: 3
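The edit costs above map onto the standard dynamic-programming formulation of Levenshtein distance. The sketch below is one common way to write it, keeping only two rows of the table in arrays; it is not the code from the talk's demo, and it does not include the bounded O(kl) optimisation.

```fsharp
// Levenshtein distance with cost 1 for delete, insert and substitute, 0 for copy.
let levenshtein (s1: string) (s2: string) =
    // previous.[j] = distance between the first i-1 chars of s1 and the first j chars of s2.
    let mutable previous : int[] = Array.init (s2.Length + 1) id
    let mutable current : int[] = Array.zeroCreate (s2.Length + 1)
    for i in 1 .. s1.Length do
        current.[0] <- i
        for j in 1 .. s2.Length do
            let substituteCost = if s1.[i - 1] = s2.[j - 1] then 0 else 1
            let delete = previous.[j] + 1
            let insert = current.[j - 1] + 1
            let substitute = previous.[j - 1] + substituteCost
            current.[j] <- min delete (min insert substitute)
        // Swap the rows so the row just computed becomes 'previous'.
        let tmp = previous
        previous <- current
        current <- tmp
    previous.[s2.Length]

printfn "%d" (levenshtein "kitten" "sitting")   // 3, matching the worked example
```

Working on two reusable integer arrays avoids allocating temporary strings during the comparison.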
Fuzzy Matching • Estimated string similarity computation costs: • Hard on the GC (lots of temporary strings created and thrown away); use arrays if possible. • Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shortest string, and ‘k’ is the maximum distance. • Parallelisable: split the set of words to compare across n cores. • Can do approximately 10,000 compares per second on a standard single core laptop.
Did You Mean? demo
Classification • Support Vector Machines (SVM) • Supervised learning for binary classification • Training Inputs: ‘in’ and ‘out’ vectors. • SVM will then find a separating ‘hyperplane’ in an n-dimensional space • Training costs, but classification is cheap • Can retrain on the fly in some cases
SVM Issues • Classification on 2 dimensions is easy, but most input is multi-dimensional • Some ‘tricks’ are needed to transform the input data
SVM Classifier demo
F# and Algorithms: Netflix Demo • Netflix Prize - $1 million USD • Must beat Netflix prediction algorithm by 10% • 480k users • 100 million ratings • 18,000 movies • Great example of deriving value out of large datasets • Earns Netflix loads and loads of $$$!
Nearest Neighbour Algorithm • Find all my neighbours’ movies • Find the best movies my neighbours agree on
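A rough sketch of the two steps above on a toy data set. Everything here is hypothetical: the users, the ratings, the helper names, and the choice of Euclidean distance over shared movies as the measure of closeness. It is not the Netflix demo code.

```fsharp
// Ratings per user: movie title -> rating. All data here is made up.
type Ratings = Map<string, float>

let users : Map<string, Ratings> =
    Map.ofList
        [ ("me",    Map.ofList [ ("Alien", 5.0); ("Heat", 4.0) ]);
          ("alice", Map.ofList [ ("Alien", 5.0); ("Heat", 4.0); ("Brazil", 5.0) ]);
          ("bob",   Map.ofList [ ("Alien", 1.0); ("Heat", 2.0); ("Grease", 5.0) ]) ]

// Euclidean distance over the movies both users have rated.
// Limitation of this sketch: users with no movies in common get distance 0.
let distance (a: Ratings) (b: Ratings) =
    a
    |> Map.toList
    |> List.choose (fun (movie, ra) ->
        Map.tryFind movie b |> Option.map (fun rb -> (ra - rb) ** 2.0))
    |> List.sum
    |> sqrt

// Step 1: the k users closest to 'name'.
let neighbours k name =
    let mine = users.[name]
    users
    |> Map.toList
    |> List.filter (fun (other, _) -> other <> name)
    |> List.sortBy (fun (_, ratings) -> distance mine ratings)
    |> List.truncate k

// Step 2: movies the neighbours rated that 'name' has not, best average first.
let recommend k name =
    let mine = users.[name]
    neighbours k name
    |> List.collect (fun (_, ratings) -> Map.toList ratings)
    |> List.filter (fun (movie, _) -> not (Map.containsKey movie mine))
    |> List.groupBy fst
    |> List.map (fun (movie, rated) -> movie, List.averageBy snd rated)
    |> List.sortByDescending snd

printfn "%A" (recommend 1 "me")
```

With `k = 1` the closest user is `alice`, whose ratings match exactly on the shared movies, so the only suggestion is the one film she has rated that `me` has not.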
A Short Stop-over at Vector Math • [Diagram: points A (x1,y1), B (x2,y2) and C (x0,y0)] • If we want to calculate the distance between A and B, we call on Euclidean Distance • We can represent the points in the same way using Vectors: Magnitude and Direction • This vector representation lets us work in ‘n’ dimensions while still performing Euclidean distance and angle calculations
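A short sketch of the distance and angle calculations described above, written so the same functions work in any number of dimensions. Vectors are represented as `float[]` and are assumed to have equal length; the function names are chosen for this sketch.

```fsharp
// Straight-line distance between two points given as vectors.
let euclideanDistance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) ** 2.0) a b
    |> Array.sum
    |> sqrt

let dot (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> x * y) a b |> Array.sum

// Length of a vector.
let magnitude (a: float[]) = sqrt (dot a a)

// Cosine of the angle between two vectors: 1.0 means they point the same way.
let cosineSimilarity a b = dot a b / (magnitude a * magnitude b)

// Two dimensions: A = (1, 2), B = (4, 6).
let a = [| 1.0; 2.0 |]
let b = [| 4.0; 6.0 |]
printfn "%f" (euclideanDistance a b)   // sqrt (3^2 + 4^2) = 5

// The same functions in four dimensions.
let u = [| 1.0; 0.0; 2.0; 0.0 |]
let v = [| 2.0; 0.0; 4.0; 0.0 |]
printfn "%f" (cosineSimilarity u v)    // v is u scaled by 2, so the angle is 0
```

Nothing in the functions depends on the vectors having two components, which is what makes the representation useful for rating data with thousands of dimensions.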
Q & A • Any questions? • http://callvirt.net/ • joelpobar@gmail.com • THANKS!