Smart Software with F# Joel Pobar Language Geek http://callvirt.net/blog

Presentation Transcript


  1. Smart Software with F# Joel Pobar Language Geek http://callvirt.net/blog

  2. Agenda • What is it? • F# Intro • Algorithms: • Search • Fuzzy Matching • Classification (SVM) • Recommendations • Q&A

  3. All This in 45 mins? • This is an awareness session! • Lots of content, very broad, very fast • You’ll get all demos, pointers, and slide deck to take offline and digest • Two takeaways: • F# is a great language for data • Smart algorithms aren’t hard – use them, explore more!

  4. F# is ... a functional, object-oriented, imperative and explorative programming language for .NET • What is Functional Programming? • http://callvirt.net/jaoo.zip

  5. What is Functional Programming? • Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data” • -> Emphasizes functions • -> Emphasizes shapes of data, rather than impl. • -> Modeled on lambda calculus • -> Reduced emphasis on imperative • -> Safely raises level of abstraction

  6. Motivation for Functional • Simplicity in life is good: cheaper, easier, faster, better. • We typically achieve simplicity in software in two ways: • By raising the level of abstraction (and OO was one design to raise abstraction) • By increasing modularity • Increasing signal to noise is another good strategy: • Communicate more in less time with more clarity • Better composition and modularity == reuse

  7. Functional Programming: Safer, while still being useful • [Diagram: languages placed on two axes, Useful / Not Useful and Unsafe / Safe; labels: C#, C++, …; V.Next#; F#; Haskell]

  8. What is F# for? • F# is a General Purpose language • Can be used for a broad range of programming tasks • Superset of imperative and dynamic features • Great for learning FP concepts • Some particularly important domains • Financial modeling and analysis • Data mining • Scientific data analysis • Domain-specific modeling • Academic

  9. Let • Type inference: the static typing of C# with the succinctness of a scripting language • ‘let’ binds values to identifiers

          let helloWorld = "Hello, World"
          printfn "%s" helloWorld
          let myNum = 12
          let myAddFunction x y =
              let sum = x + y
              sum

  10. Tuples • Simple, and the most useful data structure

          let site1 = ("msdn.com", 10)
          let site2 = ("abc.net.au", 12)
          let site3 = ("news.com.au", 22)
          let allSites = (site1, site2, site3)
          let fst (a, b) = a
          let snd (a, b) = b

  11. Lists, Arrays, Seq and Options • Lists & Arrays are first-class citizens • Options provide a some-or-nothing capability

          let list1 = ["Joel"; "Luke"]
          let array = [|2; 3; 5|]
          let myseq = seq [0; 1; 2]
          let option1 = Some("Joel")
          let option2 = None

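  12. Records • Simple concrete type definition

          type Person = { Name: string; DateOfBirth: System.DateTime }
          // DateOfBirth is a DateTime, so it is built from year, month, day rather than a string
          let n = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }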

  13. Immutability (by default) • Data is immutable by default • Values may not be changed
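A minimal sketch of what "immutable by default" means in practice. The names `x`, `y` and `counter` are illustrative only and do not come from the talk.

```fsharp
// A plain 'let' binding cannot be reassigned.
let x = 10
// x <- 11            // would not compile: x is not mutable

// "Changing" an immutable value means binding a new name to a new value.
let y = x + 1

// Mutation is still available, but it has to be requested explicitly.
let mutable counter = 0
counter <- counter + 1
```

Opting in with `mutable` keeps the places where state changes easy to spot.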

  14. Discriminated Unions • Great for representing the structure of data

          type Make = string
          type Model = string
          type Transport =
              | Car of Make * Model
              | Bicycle
          let me = Car ("Holden", "Barina")
          let you = Bicycle

      Both of these identifiers are of type ‘Transport’

  15. Functions • Functions: like delegates + unified and simple • Deep type inference

          (fun x -> x + 1)
          let myFunc x = x + 1
          // val myFunc : int -> int
          let rec factorial n = if n > 1 then n * factorial (n - 1) else 1
          let data = [5; 3; 4; 4; 5]
          // List.sortWith takes a comparison function; List.sort takes none
          List.sortWith (fun x y -> x - y) data

  16. Pattern Matching • Very important part of F# • Helps deal with the ‘teasing apart’ of data • Works best with Discriminated Unions & Records

          let (fst, _) = ("first", "second")
          System.Console.WriteLine(fst)
          let switchOnType (a: obj) =
              match a with
              | :? int -> printfn "int!"
              | :? Transport -> printfn "Transport"
              | _ -> printfn "Everything Else!"
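As a small illustration of why pattern matching works well with discriminated unions, the sketch below reuses the `Transport` type from slide 14. The `describe` function is a hypothetical helper, not part of the talk.

```fsharp
type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle

// Each case is matched by shape, and the compiler warns if a case is missed.
let describe (t: Transport) =
    match t with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

let mine = describe (Car ("Holden", "Barina"))
let yours = describe Bicycle
```

Here `mine` is "Car: Holden Barina" and `yours` is "Bicycle".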

  17. Lists, Types, Interactive demo

  18. Search • Given a search term and a large document corpus, rank and return a list of the most relevant results…

  19. Blog Crawler

  20. Search • Words • Stemming? Tokenize? • E.g ‘Python/Ruby’ • Markup • Title, Author, Date • Headings (h1,h2 etc) • Paragraphs • Links • A sign of strength? Let’s explore something simple…

  21. Search • Simplify: • For easy machine/language manipulation • … and most importantly, easy computation • Vectors: nature’s own quality data structure • Convenient machine representation (lists/arrays) • Lots of existing vector math algorithms

          Example document: After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …

          Terms:   after  incubation  loving  moonlight  firefox  linux  binaries
          Counts:      2           1       1          6        4      6         2

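  22. Term Count • Document1: Linux post • Document2: Animal post • Vector space: both documents laid out over the combined vocabulary, with 0 for absent terms

          Document1 (Linux post):   the=9  incubation=1  crazy=1  moonlight=6  firefox=4  linux=6  penguin=2
          Document2 (Animal post):  crazy=2  the=2  dog=1  penguin=5

          Vector space:
                     the  incubation  crazy  moonlight  firefox  linux  dog  penguin
          Document1    9           1      1          6        4      6    0        2
          Document2    2           0      2          0        0      0    1        5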
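A minimal sketch of how term-count vectors like these could be built, assuming F# 4.0 or later. The tokenizer is deliberately naive (no stemming or markup handling), and the sample sentences and the names `tokenize`, `termCounts` and `toVector` are illustrative rather than taken from the demo.

```fsharp
open System

// Split text into lowercase word tokens.
let tokenize (text: string) =
    text.ToLowerInvariant().Split([| ' '; ','; '.'; '!'; '?'; ';'; ':' |],
                                  StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList

// Count how often each term occurs in a document.
let termCounts (text: string) : Map<string, int> =
    tokenize text |> List.countBy id |> Map.ofList

// Project a document's counts onto a shared vocabulary, using 0 for absent terms.
let toVector (vocabulary: string list) (counts: Map<string, int>) =
    vocabulary |> List.map (fun term -> defaultArg (Map.tryFind term counts) 0)

let vocabulary = [ "the"; "linux"; "dog"; "penguin" ]
let v1 = toVector vocabulary (termCounts "the penguin likes linux and the linux penguin")
let v2 = toVector vocabulary (termCounts "the dog chased the penguin")
// v1 = [2; 2; 0; 2] and v2 = [2; 0; 1; 1]
```

Fixing the vocabulary order is what makes two documents comparable position by position.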

  23. Term Count Issues • Query: ‘the dog penguin’ • Linux: 9+0+2 = 11 • Animal: 2+1+5 = 8 • ‘the’ is overweight • Enter TF-IDF: Term Frequency Inverse Document Frequency • A weight to evaluate how important a word is to a document in a corpus • i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query

                     the  incubation  crazy  moonlight  firefox  linux  dog  penguin
          Linux        9           1      1          6        4      6    0        2
          Animal       2           0      2          0        0      0    1        5
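The raw term-count scoring can be reproduced with the slide's own numbers. This is only a sketch, assuming F# 4.0 or later; the `score` function is a hypothetical name.

```fsharp
let vocabulary = [ "the"; "incubation"; "crazy"; "moonlight"; "firefox"; "linux"; "dog"; "penguin" ]
let linuxPost  = [ 9; 1; 1; 6; 4; 6; 0; 2 ]
let animalPost = [ 2; 0; 2; 0; 0; 0; 1; 5 ]

// Sum the document's counts for every vocabulary term that appears in the query.
let score (query: string list) (document: int list) =
    List.zip vocabulary document
    |> List.filter (fun (term, _) -> List.contains term query)
    |> List.sumBy snd

let query = [ "the"; "dog"; "penguin" ]
let linuxScore = score query linuxPost    // 9 + 0 + 2 = 11
let animalScore = score query animalPost  // 2 + 1 + 5 = 8
```

The Linux post wins even though it never mentions a dog, purely because of how often it says "the".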

  24. TF-IDF • Normalise the term count: • tf = termCount / docWordCount • Measure importance of term • idf = log ( |D| / termDocumentCount) • where |D| is the total documents in the corpus • tfidf = tf * idf • A high weight is reached by high term frequency, and a low document frequency
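A minimal sketch of these three formulas, assuming F# 4.0 or later. Documents are modelled as non-empty lists of already-tokenized words, `log` is the natural logarithm (the slide does not fix a base), and giving weight 0 to a term that occurs in no document is an assumption made here to avoid dividing by zero. The tiny corpus is made up.

```fsharp
// tf = termCount / docWordCount
let tf (term: string) (doc: string list) =
    let termCount = doc |> List.filter ((=) term) |> List.length
    float termCount / float (List.length doc)

// idf = log (|D| / termDocumentCount)
let idf (term: string) (corpus: string list list) =
    let termDocumentCount = corpus |> List.filter (List.contains term) |> List.length
    if termDocumentCount = 0 then 0.0   // assumption: unseen terms carry no weight
    else log (float (List.length corpus) / float termDocumentCount)

let tfidf term doc corpus = tf term doc * idf term corpus

let corpus =
    [ [ "the"; "linux"; "penguin" ]
      [ "the"; "dog"; "penguin" ]
      [ "the"; "moonlight"; "release" ] ]

let theWeight = tfidf "the" corpus.[0] corpus      // in every document: log (3/3) = 0
let linuxWeight = tfidf "linux" corpus.[0] corpus  // (1/3) * log (3/1)
```

A term that appears in every document ends up with weight 0, which is exactly the correction the term-count approach was missing.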

  25. Search Engine in under 10 mins demo

  26. Fuzzy Matching • String similarity algorithms: • SoundEx; Metaphone • Jaro Winkler Distance; Cosine similarity; Sellers; Euclidean distance; … • We’ll look at the Levenshtein Distance algorithm • Defined as: the minimum number of edit operations that transform string1 into string2

  27. Fuzzy Matching • Edit costs: • In-place copy – cost 0 • Delete a character in string1 – cost 1 • Insert a character in string2 – cost 1 • Substitute a character for another – cost 1 • Transform ‘kitten’ into ‘sitting’ • kitten -> sitten (cost 1 – replace k with s) • sitten -> sittin (cost 1 – replace e with i) • sittin -> sitting (cost 1 – add g) • Levenshtein distance: 3
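The edit costs above can be turned into a short dynamic-programming sketch. This is the standard two-row formulation with unit costs, not necessarily the code from the demo, and the function name `levenshtein` is chosen here for illustration.

```fsharp
// Levenshtein distance with unit costs, keeping only two rows of the table.
let levenshtein (s1: string) (s2: string) =
    // previous.[j] holds the distance between the first i-1 chars of s1
    // and the first j chars of s2; row 0 is simply 0, 1, 2, ...
    let mutable previous = Array.init (s2.Length + 1) id
    for i in 1 .. s1.Length do
        let current = Array.zeroCreate<int> (s2.Length + 1)
        current.[0] <- i
        for j in 1 .. s2.Length do
            let deletion = previous.[j] + 1
            let insertion = current.[j - 1] + 1
            // An in-place copy costs 0, a substitution costs 1.
            let substitution = previous.[j - 1] + (if s1.[i - 1] = s2.[j - 1] then 0 else 1)
            current.[j] <- min deletion (min insertion substitution)
        previous <- current
    previous.[s2.Length]

let distance = levenshtein "kitten" "sitting"   // 3, matching the slide
```

Working on integer arrays instead of building intermediate strings also keeps allocation low.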

  28. Fuzzy Matching • Estimated string similarity computation costs: • Hard on the GC (lots of temporary strings created and thrown away); use arrays if possible. • Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shorter string, and ‘k’ is the maximum distance. • Parallelisable – split the set of words to compare across n cores. • Can do approximately 10,000 compares per second on a standard single core laptop.

  29. Did You Mean? demo

  30. Classification • Support Vector Machines (SVM) • Supervised learning for binary classification • Training inputs: ‘in’ and ‘out’ vectors. • SVM will then find a separating ‘hyperplane’ in an n-dimensional space • Training is costly, but classification is cheap • Can retrain on the fly in some cases
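To illustrate why classification is cheap once a hyperplane exists, here is a sketch of the decision step only. It does not train anything: `weights` and `bias` are hypothetical values standing in for what a real SVM would learn, and the ‘in’/‘out’ labels follow the slide's wording.

```fsharp
// Dot product of two equal-length vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Hypothetical hyperplane parameters; a real SVM would learn these from training data.
let weights = [| 1.0; -1.0 |]
let bias = 0.0

// Classify by which side of the hyperplane (w . x + b = 0) the input falls on.
let classify (input: float[]) =
    if dot weights input + bias >= 0.0 then "in" else "out"

let first = classify [| 3.0; 1.0 |]    // 3 - 1 = 2, non-negative side
let second = classify [| 1.0; 4.0 |]   // 1 - 4 = -3, negative side
```

Each classification is a single dot product, regardless of how long training took.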

  31. SVM Classification

  32. SVM Issues • Classification on 2 dimensions is easy, but most input is multi-dimensional • Some ‘tricks’ are needed to transform the input data

  33. SVM Classifier demo

  34. F# and Algorithms: Netflix Demo • Netflix Prize - $1 million USD • Must beat Netflix prediction algorithm by 10% • 480k users • 100 million ratings • 18,000 movies • Great example of deriving value out of large datasets • Earns Netflix loads and loads of $$$!

  35. Nearest Neighbour: Find neighbours who like what I like

  36. Netflix Data Format: Netflix Demo

  37. Nearest Neighbour Algorithm • Find all my neighbours’ movies • Find the best movies my neighbours agree on
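The two steps above can be sketched on a toy dataset, assuming F# 4.0 or later. Everything here is an assumption made for illustration: the ratings are made up, the real Netflix data format is not modelled, and similarity is taken to be Euclidean distance over the movies two users have both rated.

```fsharp
type Ratings = Map<string, float>   // movie title -> rating

// Euclidean distance over the movies both users rated; infinity if they share none.
let distance (a: Ratings) (b: Ratings) =
    let shared = a |> Map.toList |> List.filter (fun (movie, _) -> Map.containsKey movie b)
    if List.isEmpty shared then infinity
    else shared |> List.sumBy (fun (movie, r) -> (r - b.[movie]) ** 2.0) |> sqrt

// Step 1: the k users whose ratings are closest to mine.
let nearestNeighbours k (me: string) (users: Map<string, Ratings>) =
    users
    |> Map.toList
    |> List.filter (fun (name, _) -> name <> me)
    |> List.sortBy (fun (_, ratings) -> distance users.[me] ratings)
    |> List.truncate k

// Step 2: movies my neighbours rated that I have not, best average rating first.
let recommend k me (users: Map<string, Ratings>) =
    nearestNeighbours k me users
    |> List.collect (fun (_, ratings) -> Map.toList ratings)
    |> List.filter (fun (movie, _) -> not (Map.containsKey movie users.[me]))
    |> List.groupBy fst
    |> List.map (fun (movie, rs) -> movie, List.averageBy snd rs)
    |> List.sortByDescending snd
    |> List.map fst

let users =
    Map.ofList
        [ "me",  Map.ofList [ "Alien", 5.0; "Heat", 4.0 ]
          "ann", Map.ofList [ "Alien", 5.0; "Heat", 4.0; "Blade Runner", 5.0 ]
          "bob", Map.ofList [ "Alien", 1.0; "Heat", 2.0; "Cats", 5.0 ] ]

let picks = recommend 1 "me" users   // ann is the closest neighbour
```

With one neighbour, `picks` is `["Blade Runner"]`. At Netflix scale this brute-force comparison against every user would need indexing or sampling to be practical.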

  38. Netflix Recommendations demo

  39. A Short Stop-over at Vector Math • [Diagram: points A (x1,y1), B (x2,y2) and C (x0,y0)] • If we want to calculate the distance between A and B, we call on Euclidean Distance • We can represent the points in the same way using Vectors: Magnitude and Direction • Having this Vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean Distance/Angle calculations.
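A small sketch of these calculations on arrays of any length, which is what lets the same code work in ‘n’ dimensions. The function names are illustrative, and inputs are assumed to be equal-length, non-zero vectors.

```fsharp
// Dot product of two equal-length vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Length of a vector.
let magnitude (v: float[]) = sqrt (dot v v)

// Euclidean distance: the magnitude of the difference vector.
let euclideanDistance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> x - y) a b |> magnitude

// Cosine of the angle between two vectors: 1 means same direction, 0 perpendicular.
let cosine (a: float[]) (b: float[]) =
    dot a b / (magnitude a * magnitude b)

let d = euclideanDistance [| 0.0; 0.0 |] [| 3.0; 4.0 |]    // 5.0
let c = cosine [| 1.0; 0.0 |] [| 0.0; 1.0 |]               // 0.0
let d3 = euclideanDistance [| 1.0; 2.0; 3.0 |] [| 1.0; 2.0; 3.0 |]   // 0.0 in three dimensions
```

Term-count and TF-IDF vectors can be compared the same way, one dimension per vocabulary term.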

  40. Q & A • Any questions? • http://callvirt.net/ • joelpobar@gmail.com • THANKS!
