150 likes | 222 Views
Comp 480/580: Probabilistic Algorithms and Data Structures. About . Instructor : Anshumali Shrivastava https://www.cs.rice.edu/~as143/ Email: anshumali@rice.edu Office Hours: Tuesday 2:15pm – 3:15pm TAs: Kostas ( Constantinos Chamzas ): chamzask@gmail.com
E N D
About • Instructor: Anshumali Shrivastava • https://www.cs.rice.edu/~as143/ • Email: anshumali@rice.edu • Office Hours: Tuesday 2:15pm – 3:15pm • TAs: • Kostas (ConstantinosChamzas): chamzask@gmail.com • MohammadaminEdrisi: mae6@gmail.com • Office Hours will be announced soon. • Location: RYN 201 • Meeting Hours: Tuesdays and Thursdays 1:00pm - 2:15pm
Important Links to Remember • Course Web Page: • https://www.cs.rice.edu/~as143/COMP480_580_Spring19/index.html • Materials and slides • Canvas for Discussions. • For announcements and discussions. • Please sign up on Canvas for Comp 480. • Those who are signed up for Comp 580, please sign up for 480 on Canvas.
Prerequisites • COMP 182 or Equivalent Required. • COMP 382 is useful but not required. • Ability to manipulate arrays, linked lists, etc. will be needed for assignments. Familiarity with Basic File I/O. • Basic Probability: Comfortable with Random Variables, Expectation and Variance.
Grading • Term Project or Final Exam: 30% (Choose by 17thJan via Email to me and TAs) • Assignments: 40% (Biweekly starting 17th Jan due in 2-week time). Your lowest scored wont be counted (1 Slack). GENERALLY NO EXCEPTIONS TO DEADLINE. • Mid-Term Exam: 20%. (19th March) • Lecture Scribing:10% (Scribe 1 Lecture Starting 15th Jan)
If You Choose Project over Finals • Group of maximum 2 students. • Propose 1-page abstract due 24th Jan. (Iterate with me or TAs until approved) • OR Choose from a set of projects and write 1-page abstract by 24th Jan. (Announced Next Week) • Deadline for Finalizing the Project: 7th Feb. • Checkpoint 1: 2-page Progress Report Due: 28th Feb. (Literature Review and Proposed Idea) • Mid-Term Checkpoint: 2-page Progress: 21st March. (Preliminary Results) • Final Checkpoint: Report and Presentation. (TBA)
Assignments and Exams • Individual only. • Basic Blackboard discussions are allowed only for the coding part. • Read about Plagiarism • https://www.plagiarism.org/article/what-is-plagiarism • https://gpsdocs.rice.edu/orientation/Plagiarism_Hewitt_document.pdf • Rice Honor Code • http://honor.rice.edu/
Scribing • What is a scribe? • Come to class and take notes. • Short: Create lecture notes for the class. • Longer Description: 2-3 page clear and concise description of what was taught in a class so that if one of your fellow student missed the class, he can read the scribe and get everything. • Latex template available at: https://www.cs.rice.edu/~as143/COMP480_580_Spring19/scribe/scribe.zip • First three lectures are scribed by me. See an example. • By 15th January, choose one of the lectures to scribe. • Link to Sign Up: • Scribe are due within a week.
Why Probabilistic (Randomized) Algorithms? • Case Study 1: Design a Product Search Engine for Amazon.com • Take a database D of statistically significant query strings observed in the past. (say around 50 million). Given a new user typed query q, find the closest string (in some distance) to q and return items purchased with it. • Latency: 50 million distance computation per query. A cheap distance function it takes more than 400s or 6min. If you used edit distance, it will be hours. • Latency Limit is roughly < 20ms. (20000x improvement needed) • Ideas?
Why Probabilistic (Randomized) Algorithms? • Case Study 2: Design Chrome Malicious URL Detector • Lets say you work for Google, in the Chrome team, and you want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. • Given a database of about known 1 million malicious URLs, the size of any dictionary for matching will be around 50MB (size of 1 million urls with 25 average string length). 50MB is too heavy for a browser so this cannot be locally done!! • Anything locally should be something < 2MB memory. (25x improvement needed) Wait you cannot even compress the strings (Huffman Coding) to that size. • Ideas?
Different Scale is a Different World!! • Current algorithm is off by 1.5x factor. • Reactive Approach: Start with standard approach and use tricks and tools to squeeze everything out and make it better. • Similar ideas with smart implementation may do the trick. • Current algorithm is off by 10x+ factor. • Go to drawing board. You have the transcend the current level of thinking. Anything which is similar to current method is unlikely to work! We cannot solve our problems with the same thinking we used when we created them. • Albert Einstein
No Free Lunch (NFL) Theorem • I want 10x+ improvements over existing ones. • What am I trading off? Uncertainty is the refuge of hope. - Henri Frederic Amiel • We will trade-off Certainty. (It is a resource!!)
Why Probabilistic (Randomized) Algorithms? • Case Study 1: Design a Product Search Engine for Amazon.com • Take a database D of statistically significant query strings observed in the past. (say around 50 million). Given a new user typed query q, find the closest string (in some distance) to q and return items purchased with it. • Latency: 50 million distance computation per query. A cheap distance function it takes more than 400s or 6min. If you used edit distance, it will be hours. • Latency Limit is roughly < 20ms. (20000x improvement needed) • What if there is an algorithm that runs in 2ms and gives you a good enough and fails one out of million queries?
Why Probabilistic (Randomized) Algorithms? • Case Study 2: Design Chrome Malicious URL Detector • Lets say you work for Google, in the Chrome team, and you want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. • Given a database of about known 1 million malicious URLs, the size of any dictionary for matching will be around 50MB (size of 1 million urls with 25 average string length). 50MB is too heavy for a browser so this cannot be locally done!! • Anything locally should be something < 2MB memory but wait you cannot even compress the strings (Huffman Coding) to that size. • What if there is an algorithm that only uses 2MB memory and notifies perfectly if the url is malicious, but once out a million matches it can also notify a correct url as malicious? (Is that good enough?)