Proxy Servers
What Is a Proxy Server? • Intermediary server between clients and the actual server • Proxy processes request • Proxy processes response • Intranet proxy may restrict all outbound/inbound requests to the intranet server
What Does a Proxy Server Do? • Between client and server • Receives the client request • Decides if request will go on to the server • May have cache & may respond from cache • Acts as the client with respect to the server • Uses one of its own IP addresses to get page from server
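The request flow on this slide can be sketched in a few lines. This is a minimal illustration, not a real proxy: the class name `SketchProxy` and the callables `fetch_from_server` and `is_allowed` are hypothetical stand-ins for the outbound request and the filter decision, and no networking is done.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class SketchProxy:
    # Hypothetical stand-in for the real outbound request made with the proxy's own IP.
    fetch_from_server: Callable[[str], str]
    # Hypothetical stand-in for the filter rules (lists, labels, keywords).
    is_allowed: Callable[[str], bool]
    # Very simple cache keyed by URL; freshness is ignored in this sketch.
    cache: Dict[str, str] = field(default_factory=dict)

    def handle(self, url: str) -> Optional[str]:
        # 1. Decide whether the request may go on to the server.
        if not self.is_allowed(url):
            return None
        # 2. Respond from the cache when a copy is held.
        if url in self.cache:
            return self.cache[url]
        # 3. Otherwise act as the client towards the server and keep a copy.
        body = self.fetch_from_server(url)
        self.cache[url] = body
        return body
```

A second request for the same URL is answered from `cache` without calling `fetch_from_server` again, which is the latency and traffic saving a caching proxy is used for.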
Usual Uses for Proxies • Firewalls • Employee web use control (email etc.) • Web content filtering (kids) • Black lists (sites not allowed) • White lists (sites allowed) • Keyword filtering of page content
User Perspective • Proxy is invisible to the client • IP address of proxy is the one used or the browser is configured to go there • Speed up retrieval if using caching • Can implement profiles or personalization
Main Proxy Functions • Caching • Firewall • Filtering • Logging
Web Cache Proxy • Our concern is not with browser cache! • Store frequently used pages at proxy rather than request the server to find or create again • Why? • Reduce latency: faster to get from proxy & so makes the server seem more responsive • Reduce traffic: reduces traffic to actual server
Proxy Caches • Proxy cache serves hundreds/thousands of users • Corporations and intranets often use them • Most popular requests are generated only once • Good news: Proxy cache hit rates often hit 50% • Bad news: Stale content (stock quotes)
How Does a Web Cache Work? • Set of rules in either or both • Proxy admin • HTTP header
Don’t Cache Rules • HTTP header • Cache-control: max-age=xxx, must-revalidate • Expires: date… • Last-modified: date… • Pragma: no-cache (doesn’t always work!) • Object is authenticated or secure • Fails proxy filter rules • URL • Meta data • MIME type • Contents
Getting From Cache • Use cache copy if it is fresh • Within date constraint • Used recently and modified date is not recent
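A freshness check along these lines might look like the sketch below. The function name `is_fresh` and the 10% rule for the "modified date is not recent" case are assumptions chosen for illustration; a real proxy combines header rules with whatever the proxy admin configures.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_fresh(headers: dict, stored_at: datetime, now: datetime) -> bool:
    """Return True if the cached copy may be served without asking the server."""
    cache_control = headers.get("Cache-Control", "").lower()
    pragma = headers.get("Pragma", "").lower()
    # Explicit "do not serve from cache" signals win.
    if "no-cache" in cache_control or "no-store" in cache_control or "no-cache" in pragma:
        return False
    # Date constraint 1: max-age counts seconds from when the copy was stored.
    match = re.search(r"max-age=(\d+)", cache_control)
    if match:
        return now - stored_at < timedelta(seconds=int(match.group(1)))
    try:
        # Date constraint 2: an absolute expiry date.
        if "Expires" in headers:
            return now < parsedate_to_datetime(headers["Expires"])
        # Heuristic (assumed here): a page that has not changed for a long time
        # is treated as fresh for 10% of its age at the time it was stored.
        if "Last-Modified" in headers:
            modified = parsedate_to_datetime(headers["Last-Modified"])
            return now - stored_at < 0.1 * (stored_at - modified)
    except (TypeError, ValueError):
        # Unparseable or timezone-less dates: treat the copy as stale.
        return False
    return False
```

The timestamps passed in are assumed to be timezone-aware, for example `datetime.now(timezone.utc)`.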
2. Firewalls • Proxies for security protection • More on this later
3. Filtering at the Proxy • URL lists (black and white lists) • Meta data • Content filters
Filtering • [Figure: Web doc, label base, URL lists, keywords, URLs, ratings]
The Problem: the Web • 1 billion documents (April 2000) • Average query is 2 words (e.g., Sara name) • Continual growth • Balance global indexing and access against unintentional access to inappropriate material
Filtering Application Types Proxies • Black lists • White lists • Keyword profiles • Labels
Black and White Lists • Black list : URLs proxy will not access • White list: URLs proxy will allow access
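As a rough sketch of how a proxy might apply the two lists, the function below matches on host name only. The name `list_decision`, the host-suffix matching, and the rule that the black list is checked first are all assumptions made for illustration.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def list_decision(url: str,
                  black_list: Iterable[str],
                  white_list: Optional[Iterable[str]] = None) -> bool:
    """Return True if the proxy should fetch the URL, False if it should refuse."""
    host = (urlsplit(url).hostname or "").lower()

    def listed(hosts: Iterable[str]) -> bool:
        # A host matches an entry if it equals it or is a subdomain of it.
        return any(host == h or host.endswith("." + h) for h in hosts)

    # Black list: URLs the proxy will not access.
    if listed(black_list):
        return False
    # White list (when one is configured): only listed URLs are allowed.
    if white_list is not None:
        return listed(white_list)
    return True
```

With only a black list, everything not listed is allowed; with a white list, everything not listed is refused, which is the stricter of the two policies.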
How Is Filtering/selection Done? • Build a profile of preferences • Match input against the profile using rules
Black and White Lists • Black list of URLs • No access allowed • White list of URLs • Access permitted
Lists in Action • 1 billion documents! • Who builds the lists • Who updates them • Frequency of updates
Labels • Metadata tags • Rule driven: PICS rules for example • Labels are part of document or separate • Separate = label bureau
Labels • Metadata (goes with page) • Label Bureau (stored separately from page)
Meta Data as part of HTML doc <HTML> <HEAD> <META HTTP-EQUIV="keywords" CONTENT="federal"> <META HTTP-EQUIV="keywords" CONTENT="tax"> </HEAD> …… </HTML> • Browser and/or proxy interpret the metadata
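One way a proxy could read such metadata is with Python's standard `html.parser` module. The sketch below is illustrative only: the names `MetaKeywordParser` and `page_keywords` are hypothetical, and it accepts the keyword label in either the `HTTP-EQUIV` attribute used on this slide or the more common `NAME` attribute.

```python
from html.parser import HTMLParser
from typing import List


class MetaKeywordParser(HTMLParser):
    """Collects keyword metadata from <META> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.keywords: List[str] = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        label = (attr.get("name") or attr.get("http-equiv") or "").lower()
        content = attr.get("content") or ""
        if label == "keywords":
            # A single tag may carry several comma-separated keywords.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def page_keywords(html_text: str) -> List[str]:
    parser = MetaKeywordParser()
    parser.feed(html_text)
    parser.close()
    return parser.keywords
```

The returned keywords can then be matched against a filtering profile before the page is shown.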
Metadata Apart From Doc • Label bureaus • Request for a doc is also a request for labels from one or more label bureaus • Who makes the labels • Text analysis • Community of users • Creator of document
Labels: Collaborative Filtering • [Figure: Search Engine, Web Site, Author Labels, Rating Service, Label Bureau A, Label Bureau B]
PICS and PICS Rules • Tools for communities to use profiles and control/direct access • Structure designed by W3 consortium • Content designed by communities of users
PICS Rating Data (PICS-1.1 "http://www.abc.org/r1.5" by "John Doe" labels on "1998.11.05" until "2000.11.01" for "http://www.xyz.com/new.html" ratings (violence 2 blood 1 language 4))
Using a URL List Filtering (PicsRule-1.1 ( Policy (RejectByURL ("http://www.xyz.com:*/*")) Policy (AcceptIf "otherwise") ) )
Using the PICS Data (PicsRule-1.1 (serviceinfo ( http://www.lablist.org/ratings/v1.html shortname "PTA" bureauURL http://www.lablist.org/ratings UseEmbedded "N" ) Policy (RejectIf "((PTA.violence >3) or (PTA.language >2))") Policy (AcceptIf "otherwise") ) )
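To make the effect of this rule concrete, the sketch below hand-translates its two policies into Python. It is not a PICSRules parser; the function name `pta_policy` is hypothetical, and treating a missing rating as 0 is an assumption of this sketch.

```python
from typing import Dict


def pta_policy(ratings: Dict[str, int]) -> str:
    """Apply the two policies of the rule above, in order, to one page's ratings."""
    # Policy 1: RejectIf "((PTA.violence >3) or (PTA.language >2))"
    # Assumption: a category the label does not rate counts as 0.
    if ratings.get("violence", 0) > 3 or ratings.get("language", 0) > 2:
        return "reject"
    # Policy 2: AcceptIf "otherwise"
    return "accept"


# Ratings taken from the earlier label example: violence 2, blood 1, language 4.
decision = pta_policy({"violence": 2, "blood": 1, "language": 4})
```

For those ratings the violence score passes but the language score of 4 exceeds the limit of 2, so the first policy fires and the page is rejected.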
Example: Medical PICS labels • Su: UMLS vocab word, 0-9999999 • Aud: audience (1 = patient, 3 = para, 5 = GP, etc.) • Ty: information type (5 = scientist, 3 = patient, 4 = prod) • C: country (1 = Can, 4 = Afghan, etc.) • Etc. • Ratings(su 0019186 aud 3:5 Ty 3 C 1)
User Profiles for Labels • Rules for interpreting ratings • Based on • User preferences • User access privileges • Who keeps these • Who updates these • How fine is the granularity
Labels and Digital Signatures • Labels can also be used to carry digital signature and authority information
Example ("byKey" (("N" "aba21241241=") ("E" "abcdefghijklmnop="))) ("on" "1996.12.02T22:20-0000") ("SigCrypto" "aba1241241==")) ("Signature" "http://www.w3.org/TR/1998/REC-DSig-label/DSS-1_0" ("ByName" "plipp@iaik.tu-graz.ac.at") ("on" "1996.12.02T22:20-0000") ("SigCrypto" (("R" "aba124124156") ("S" "casdfkl3r489")))))
Text analysis of Page content • Proxy examines text of page before showing it • Generally keyword based • Profile of ‘black’ and/or ‘white’ keywords
Profiles for Text analysis • Keywords (+ weights sometimes) • ‘Reflect’ interest of user or user group • May be used to eliminate pages • ‘All but’ • May be used to select pages • ‘Only those’
Keyword matching algorithms • Extract keywords • Eliminate ‘noisy’ words with stop list (1/3) • Stem (computer compute computation) • Match to profile • Evaluate ‘value’ of match • Check against a threshold for match • Show or throw!
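The steps on this slide can be sketched as a small pipeline. Everything here is illustrative: the stemmer is a crude suffix stripper standing in for a real stemming algorithm, and the names `stem`, `extract_keywords`, and `show_page`, the profile weights, and the threshold are assumptions. The profile is treated as a 'white' profile, so a good match means the page is shown.

```python
import re
from typing import Dict, List

# The 27-word stop list from the slides.
STOP_WORDS = set(
    "the for of on and is to with in by a as be this will are from that "
    "or at been an was were have has it".split()
)

# Crude suffix list, an assumption of this sketch and not a real stemmer.
SUFFIXES = ("ation", "ing", "er", "es", "ed", "e", "s")


def stem(word: str) -> str:
    # Strip the first matching suffix, keeping at least three letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text: str) -> List[str]:
    # Extract words, drop 'noisy' stop words, then stem what is left.
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def show_page(text: str, profile: Dict[str, float], threshold: float) -> bool:
    # Match to profile: add up the weights of profile terms found in the page.
    found = set(extract_keywords(text))
    score = sum(weight for term, weight in profile.items() if stem(term) in found)
    # Check against the threshold: show or throw.
    return score >= threshold
```

With this stemmer, "computer", "compute", and "computation" all reduce to the same stem, so a profile term matches any of the three forms.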
Stop List (35%) • the for of on and is to with in by a as be this will are from that or at been an was were have has it (27 words)
Matching Profile to Page • Similarity? • How many profile terms occur in doc? • How often? • How many docs does term occur in? • How important is the term to the profile?
Cosine Similarity Measurement • Profile terms weighted PW (0,1) importance • Document terms weighted TW (0,1) • frequency in doc • frequency in whole set • Overall closeness of doc to profile:
$$\text{closeness} = \frac{\sum_{\text{all profile terms}} TW \cdot PW}{\sqrt{\sum_{\text{all profile terms}} TW^{2} \cdot \sum_{\text{all profile terms}} PW^{2}}}$$
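A minimal sketch of this measure, summing over the profile terms as the slide describes. The function name `cosine_similarity` and the dictionary representation of the weights are assumptions; a document term missing from the dictionary is given weight 0.

```python
import math
from typing import Dict


def cosine_similarity(profile_weights: Dict[str, float],
                      doc_weights: Dict[str, float]) -> float:
    """Closeness of a document to a profile, summed over all profile terms."""
    terms = list(profile_weights)
    # Numerator: sum of TW * PW over the profile terms.
    dot = sum(doc_weights.get(t, 0.0) * profile_weights[t] for t in terms)
    # Denominator: square root of (sum of TW^2) * (sum of PW^2).
    tw_squared = sum(doc_weights.get(t, 0.0) ** 2 for t in terms)
    pw_squared = sum(profile_weights[t] ** 2 for t in terms)
    denominator = math.sqrt(tw_squared * pw_squared)
    # No overlap at all (or an empty profile) means no similarity.
    return dot / denominator if denominator else 0.0
```

A document whose term weights are proportional to the profile weights scores 1, and a document sharing no terms with the profile scores 0; the result is then compared with a threshold to decide whether the page matches.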
What works well? Nothing
What’s the problem? • Site Labels • Who does them? • Are they authentic? • Has the source changed? • A billion docs? • Black and White lists • Ditto • Text analysis of page contents • Poor results