1 / 16

Data Structures & Algorithms and The Internet:

Explore the impact of data structures and algorithms on internet technologies, including comparison sorts, password cracking, histograms, graphs, server architecture, denial of service mitigation, partitioning, captchkas, databases, and scalability. Learn how these concepts affect data processing, security, redundancy, and user experience.

arobertson
Download Presentation

Data Structures & Algorithms and The Internet:

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Structures & Algorithms and The Internet: A different way of thinking

  2. Comparison Sort vs Bucket Sort • What is bigger? The list of numbers or the alphabet of possible numbers? • O(n log n) vs. O(n+m)

  3. Password cracking • Is it easier to try to crack one account by trying a million passwords, • Or is it easier to try to crack a million accounts by trying the same password?

  4. Histograms and large vectors • How do search engines represent documents for searching/indexing? • As a histogram of words, e.g. a vector of say 100000 entries for each document, containing word counts • Term frequency – inverse document frequency: enhances rare words, diminishes common words • Math done on vectors to determine similarity / relevance

  5. Good Enough is OK • Not the fastest algorithm? Put more computers on it. Cache results. • Not the best results? Not a big deal if user can try again. • The data may be wrong/ambiguous anyway. • Britney Spears / Paris Hilton

  6. Graphs are everywhere • Web is a graph. Page rank algorithm says important/relevant pages are linked to by important/relevant pages. • Reverse lookup problem with web links, basically building inverse edges to a directed graph • Spam / porn links to each other… look for bad cliques? • Social networking. Graph of friends. If you respond well to an ad, maybe your friends will too.

  7. Liars, spam, hackers, etc. • Algorithms are easier when you can trust your data • Is this email spam? • Does this website have useful content? • Can you make it where if you do N work to detect spammers, they have to do N^2 work to avoid you?

  8. Sample Server Architecture cache1 front1 back1 back4 cache2 Load balancer front2 back2 back5 cache3 front3 back3 back6 cache4

  9. Denial of Service • A little amount of work by a client causes a lot of work for the server, slowing down or blocking out other users • Either malicious or accidental • Example: fast queries to a search engine that cause cache misses • How to catch?

  10. Log IP addresses • Block addresses causing too many cache misses within 10 sec, 1 min, 1 hr • They may be hitting different front ends • HashMap of IP addresses to counters? • But how many IP addresses?

  11. Data structures are too large for one computer • Store pieces of data structures across many computers • How do you know what is where? • One solution is a master computer serving as an index, but this is a single point of failure

  12. Partitioning • How to divide 4 billion IP addresses among several machines? • We need to know deterministically, given an IP, what machine to look it up on? • Use a hash function that knocks it down to the four machines, i.e. hash has good entropy on AAA.BBB.CCC.DDD • Sometimes can also use a small set of sorted data to divide into ranges that should evenly distribute the larger data

  13. Captchkas • Computation easy for person but hard for computer, vision-based problems • How to stop computers from getting fake email addresses? Make it hard to automatically solve a puzzle required for registration • How do spammers bypass? Pay 3rd-world folks to solve puzzles all day, or offer free porn for solving puzzles

  14. Databases • Persistent storage for your data • Let’s you do complex queries using a query language • Can add a layer of abstraction over sets, lists, hashes, etc. • Boolean/set logic very important

  15. Redundancy and robustness • Is your data fine if a computer goes down • Is your server fine if a computer goes down • Can you undo/revert changes? • If something changes somewhere, how does everyone know about it? • Average runtime vs guaranteed runtime

  16. Scaling to more users/data • Is growth as simple as adding another machine? (e.g. linear time / space) • If you get dugg, can you handle a burst of users? • Can you fail softly?

More Related