
People Search Engines

People Search Engines. Ingmar Weber Email: ... find it via http://scifi.epfl.ch Joint work with Adish Singla. Demo of people search engines. Google address search (US only) Intelius background check (US only) Who’s who (25,000 big guys only) Business web (linkedin, xing, certain websites)


Presentation Transcript


  1. People Search Engines
     Ingmar Weber
     Email: ... find it via http://scifi.epfl.ch
     Joint work with Adish Singla

  2. Demo of people search engines
     • Google address search (US only)
     • Intelius background check (US only)
     • Who’s who (25,000 big guys only)
     • Business web (linkedin, xing, certain websites)
     • Social networks++ (myspace, friendster, wikipedia, xanga, …)
     • User contributed (tags, wikipedia, user added)
     • Database community (VLDB, SIGMOD, CIDR, …)
     • More: spock, pipl, isihighlycited, csbib, scirus, …

  3. A “Scientist Finder” (SciFi)
     http://scifi.epfl.ch
     Why I work on this:
     • I simply enjoy building (working) systems
     • Good to stay in touch with Web 2.0 and all that
     • Interesting to do (simple) web mining
     • A good way to improve “visibility”
     • Part of a bigger plan …

  4. Outline
     • Starting point
       • Why to search for a person
       • Why/how is people search different
     • Efficiency
       • How long are users willing to wait
       • How to make things (reasonably) efficient
     • Result Quality
       • What makes our life easy
       • What makes our life hard
     • Concrete Examples
       • Case 1: Finding a good picture
       • Case 2: Finding the (approximate) birth year
     • Software Design
       • Things we got right
       • Things we’d get right next time

  5. Why to search for a person?
     • Vanity
       • “Digital mirror”
     • Who is he/she?
       • Search for unknown author
     • Contact information
       • (email) address
     • Devoted fan
       • “britney spears”
     • It’s fun
       • Does the machine “know” your friends?

  6. Why is people search different?
     • What goes in (different sources)
       • social networks, phone books, personal homepages
     • What comes out (fact extraction)
       • address, age, photo, homepage url
     • What goes down (unforgiving users)
       • facts are either wrong or right, wrong picture is “major disaster”, offended by age or # publications

  7. How long will users wait?
     • It depends on the setting
     • Information retrieval without feedback
       • About 2 seconds are still tolerable
     • Information retrieval with feedback
       • Up to about 15 seconds is ok
     “A study on tolerable waiting time: how long are web users willing to wait?”, Fiona Fui-Hoon Nah, Behaviour & Information Technology, 2004, vol. 23, no. 3
     Show partial results as quickly as possible. Have a progress bar. At least have an hourglass.

  8. How to make things efficient
     • Do many things at once
       • One data source = one process
       • Heavier than threads, but easier
     • Know your bottlenecks
       • Network
     • Write small test programs
       • Grep a large file for a name
       • Download some web pages
     Use multiple processes when it makes sense. Don’t optimize where it’s unnecessary. Know realistic goals.
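The "one data source = one process" idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the SciFi implementation: the two inline source scripts, the tab-separated output format, and the name `query_all` are all assumptions made for the example, and the 15-second timeout is borrowed from the waiting-time slide.

```python
import subprocess
import sys

# Hypothetical stand-ins for real data-source scripts. Each one is a tiny
# stand-alone program that prints "field<TAB>value" lines for a given name.
SOURCES = {
    "homepage": "import sys; print('homepage\\thttp://example.org/~' + sys.argv[1])",
    "publications": "import sys; print('publications\\t42')",
}


def query_all(name, timeout=15.0):
    """Start one OS process per data source, then collect their output."""
    # Launch everything first so the sources run concurrently.
    procs = {
        source: subprocess.Popen(
            [sys.executable, "-c", code, name],
            stdout=subprocess.PIPE,
            text=True,
        )
        for source, code in SOURCES.items()
    }
    results = {}
    for source, proc in procs.items():
        try:
            out, _ = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()
            continue  # a slow source simply contributes nothing
        for line in out.splitlines():
            field, _, value = line.partition("\t")
            results[field] = value
    return results
```

Because each source is an ordinary program, it can also be run and timed on its own from the command line, which fits the advice to write small test programs before optimizing.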

  9. What makes our life easy
     • Restriction to scientific domains
       • *.edu, *.ac.uk, uni-*.de, epfl.ch, … (Michael Jordan)
       • !/intranet/, !/events/, !/courses/, …
     • Predictable career path
       • 2 publs = PhD student, 80 publs = professor
       • First publication between 23 and 28
     • Prominent homepage
       • High page-rank for scientific institutions
       • Easy to find via Google
     • Spam-free queries
       • No spam on academic pages
       • No commercially interesting queries
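The domain restriction on this slide could be expressed as a small URL filter. The sketch below uses only the patterns named on the slide, plus two assumed additions (`*.epfl.ch` and `*.uni-*.de`) so that subdomains also match. The function name `is_candidate_page` is hypothetical, and how SciFi actually applies these rules is not stated in the slides.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Host patterns from the slide; the two subdomain patterns are assumptions.
ALLOWED_HOSTS = ["*.edu", "*.ac.uk", "uni-*.de", "*.uni-*.de", "epfl.ch", "*.epfl.ch"]
# Path fragments the slide marks as excluded ("!/intranet/" and so on).
BLOCKED_PATH_PARTS = ["/intranet/", "/events/", "/courses/"]


def is_candidate_page(url):
    """Return True if the URL is on an academic host and not in a blocked path."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not any(fnmatch(host, pattern) for pattern in ALLOWED_HOSTS):
        return False
    path = parts.path.lower()
    return not any(fragment in path for fragment in BLOCKED_PATH_PARTS)
```

With a filter like this, a query for "Michael Jordan" keeps a page under `cs.berkeley.edu` and drops one under `nba.com`, which is presumably the point of the example name on the slide.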

  10. What makes our life hard
      • Lazy people
        • Professors without a (proper) homepage
        • Photo.jpg, DSF12345.jpg, …
      • Multiple personality people
        • Yan Zhang: 5,000 publications in 10 years
        • Claire Kenyon = Claire Kenyon-Mathieu = Claire Mathieu
        • Schröder, Schroeder, Schroder
      • Mobile people
        • Institution 1 -> Institution 2 -> Institution 3 -> …
      • Untrustworthy input sources
        • Google images for “Thomas Henzinger”
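For the spelling variants on this slide (Schröder, Schroeder, Schroder), one possible approach is to map every name to a lossy comparison key. The sketch below is an assumption, not the method used in SciFi: it strips accents and collapses the German digraphs ae/oe/ue, which also merges some genuinely different names, so the key is only suitable for finding candidate matches. It does nothing for name changes such as Kenyon versus Mathieu, and nothing for distinct people who share a name.

```python
import unicodedata


def name_key(name):
    """Map spelling variants of a name to one lossy comparison key."""
    lowered = name.lower().replace("ß", "ss")
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Collapse transliterated umlauts so "oe" and "o" compare equal.
    for digraph, vowel in (("ae", "a"), ("oe", "o"), ("ue", "u")):
        stripped = stripped.replace(digraph, vowel)
    return " ".join(stripped.split())
```

All three spellings from the slide end up with the same key, while "Claire Kenyon" and "Claire Mathieu" remain distinct and would need extra evidence (co-authors, affiliation) to be merged.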

  11. Case 1: Finding a good image
      • Showing a picture makes it more concrete
        • “Hey, that’s her/him/me!”
      • Most people have a picture on their homepage
        • … but how to find it?
        • “logo.jpg”, “book.jpg”, “photo.jpg”, “me.jpg”
      • Use various heuristics
        • Image size/proportions, image name, face detection (!)
      • Use additional sources
        • Google image search (“faces only”)
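The size, proportion, and file-name heuristics mentioned on this slide might be combined into a simple score, as in the sketch below. Every threshold, weight, and word list here is invented for illustration, the names `portrait_score` and `best_portrait` are hypothetical, and face detection is left out because it needs an external library.

```python
import posixpath

# Hypothetical word lists; the slide only gives a few example file names.
GOOD_STEMS = {"photo", "me", "portrait"}
BAD_STEMS = {"logo", "book", "banner", "icon"}


def portrait_score(url, width, height, surname=""):
    """Score an image as a portrait candidate; higher is better."""
    stem = posixpath.splitext(posixpath.basename(url))[0].lower()
    score = 0.0
    if min(width, height) >= 80:  # large enough to show a face (assumed cutoff)
        score += 1.0
    if width > 0 and 1.0 <= height / width <= 1.6:  # square to portrait shape
        score += 1.0
    if stem in GOOD_STEMS or (surname and surname.lower() in stem):
        score += 1.0
    if stem in BAD_STEMS:
        score -= 2.0
    return score


def best_portrait(candidates, surname=""):
    """Pick the best (url, width, height) tuple, or None if there are none."""
    return max(
        candidates,
        key=lambda c: portrait_score(c[0], c[1], c[2], surname),
        default=None,
    )
```

A wide `logo.jpg` scores below a portrait-shaped image whose file name contains the person's surname, which is the kind of ranking the slide's heuristics aim for.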

  12. Case 2: Getting the birth year
      • Easy: Wikipedia with structured entry
      • Fairly easy: birth date on homepage
      • Otherwise:
        • “thomas henzinger was born” (data gem)
        • First publication+25 (dirty hack, approximation)
      Not used yet:
      • Try to parse CV (Abitur+19, Bachelor+23, …)
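The two fallbacks on this slide can be sketched as a phrase match followed by the publication-based guess. The sketch reads "First publication+25" as "first publication at about age 25", so the estimate is the first publication year minus 25. The regular expression, the returned method labels, and the name `estimate_birth_year` are assumptions for this example.

```python
import re

# "<name> was born ... <year>" within one sentence; the pattern is an assumption.
BORN_RE = re.compile(
    r"\bwas born\b[^.]{0,60}?\b(1[89]\d{2}|20\d{2})\b", re.IGNORECASE
)


def estimate_birth_year(snippets, first_publication_year=None):
    """Return (year, method); year is None when nothing can be inferred."""
    for text in snippets:
        match = BORN_RE.search(text)
        if match:
            return int(match.group(1)), "born-phrase"
    if first_publication_year is not None:
        # Dirty hack from the slide: assume the first paper appears at age 25.
        return first_publication_year - 25, "first-publication"
    return None, "unknown"
```

Returning the method alongside the year makes it easy to attach a lower certainty to the publication-based guess than to an explicit "was born" statement.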

  13. Things we got right
      • Modularity
        • One data source = one process
        • Each process can be run stand-alone
        • Each data field is optional
        • Display module 100% separate from computation
      • Using (plain text) files
        • Great for debugging/inspection/editing
        • Automatic database with caching
      • Uncertainty is certain
        • Each value has a “certainty”
        • Take into account when merging results
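Merging per-source values by their "certainty" could look like the sketch below. The slides do not say how certainties are represented or combined, so the 0-to-1 scale, the "highest certainty wins" rule, the sample data, and the name `merge_results` are all assumptions. Every field is optional: a source just omits what it does not know.

```python
def merge_results(per_source, min_certainty=0.0):
    """Merge {source: {field: (value, certainty)}} into one record.

    For each field the value with the highest certainty wins. Values below
    min_certainty are ignored, so a field may be missing from the result.
    """
    merged = {}
    for source, fields in per_source.items():
        for field, (value, certainty) in fields.items():
            if certainty < min_certainty:
                continue
            best = merged.get(field)
            if best is None or certainty > best[1]:
                merged[field] = (value, certainty, source)
    return merged


# Hypothetical output of two data-source processes for one person.
sample = {
    "wikipedia": {"birth_year": (1970, 0.9)},
    "publications": {"birth_year": (1972, 0.4), "publication_count": (120, 0.8)},
}
```

Calling `merge_results(sample, min_certainty=0.85)` keeps only the well-supported birth year. Keeping the winning source next to each value also lets a display module show uncertain fields differently.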

  14. Things we’d get right next time
      • From the beginning, allow additional clues:
        • “John Smith” @ Columbia University
        • “John Smith” in computer science
      • Think more about GUI design
        • Display “alternative homepage”
        • Hide empty fields
        • Display “uncertain” fields differently
      • Dynamic author disambiguation (maybe)
        • Oh, there are various “John Smith”s

  15. Bonus Slide
      • Some “scientists” people have searched for
        • Napoleon
        • Goethe
        • Mozart
        • Huckleberry Finn
      • Some “scientists” nobody searched for
        • Gyro Gearloose
        • The Brain

  16. Thank you! Any questions, comments, insults?

  17. Things to Do
      • Let user filter by uncertainty
        • only display “sure” results
        • display all alternatives
      • Filter by additional information
        • “John Smith” @ Columbia University
        • “John Smith” in computer science
      • Show links to external sources
        • Wikipedia entry
        • DBLP entry
      • Parse CV (where it’s available)
        • education
        • past affiliations
