NUWeb System sw@gais.cs.ccu.edu
WWW Architecture • Web Server (e.g., Apache, IIS) • Browser (e.g., IE, Firefox) • Addressing and Information Channel (DNS, URL, SearchEngine) • Abstract Model: • Provider (server), Consumer (client), Channel • Client-Server architecture, Centralized Service
Problems of the WWW due to the fundamental design • Naming/Addressing problem: • Physical naming/addressing • Static binding through DNS • URLs may not be a good design (hard to remember) • DNS can be slow • Information flow organization was not designed in from the start • Hotspot bottleneck problem, bandwidth waste problem • Cache and proxy technologies were added separately afterwards • Linkrot problem • Dead links, wrong links, faked links • Affects approximately up to 15% of links • Need a static IP, need to apply for a URL, need knowledge of building and managing websites • Creating and maintaining a website is costly • Webpage creation is not easy • Divides the computer world into two hierarchies • Server: website owners, service providers • Client: ordinary users
Weaving the Web (quoted from Wikipedia) • In Berners-Lee's book, Weaving the Web, several recurring themes are apparent: • It is just as important to be able to edit the Web as browse it. Wikis are a step in this direction, although Berners-Lee considers them merely a shadow of the WYSIWYG functionality of his first browser. • Computers can be used for background tasks that enable humans to work better in groups. • Every aspect of the Internet should function as a Web, rather than a hierarchy. Notable current exceptions are the Domain Name System and the domain naming rules managed by ICANN. • Computer scientists have a moral responsibility as well as a technical responsibility.
What Is NUWeb? • Marriage of WWW with P2P • Technologically: • NUWeb = WebServer + Browser + WNS + SearchEngine + Proxy/Cache + WebBuilder + Blog + CommunityEngine + KIM + P2P – URL – DNS – Cost • Logically: • A new Web system that lets any net user build his/her own web in an extremely easy-to-use way. • A platform for web-building, information sharing, information management, community, and service management • A platform for Webilization • A project to pursue Wemocracy
NUWeb Functions • A platform for Public Sharing and Publishing • Personal website/blog • Public community • Search Engine, • A platform for Private Sharing and Community • Personal community builder • Sharing management • A platform for personal information / knowledge management, content engine,
NUWeb Software Architecture • The NUWeb system is composed of three subsystems • NUWeb.CC CyberCenter • WNS (Web Name Service) • Search engine, cache • Community services (Photo, Blog, Video…) • NUWeb CP (Community Portal) • Community services (Blog, Photo, Video…) • Search engine service • Proxy and cache • NUWeb PP (Personal Portal) • NUWeb browser, KIM • NUWeb server • NUWeb personal portal/blog builder
How it works • Personal Web server on the Windows platform • Auto indexing, thumbnails • Auto page generation and run-time rendering • Auto caching • Bundled with PHP/Perl platform • Registration with WNS during setup • Site name, user account, SiteKey, … • UPnP to handle firewall/NAT traversal • Packet-forwarding proxy to handle the cases where UPnP does not work correctly.
How it works (2) • Each time a client gets online, it sends its current IP and name/key info to the WNS center. • A connection request to a personal site first sends the name of the site to the WNS to get the current IP of the target site (dynamic binding). • If the requested site is not online, the center redirects the request to the cache server. • If the site is connected through a proxy, the connection goes through the relay proxy.
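The lookup flow on this slide can be sketched as a small in-memory name service. This is only an illustration under stated assumptions: the class `WNS`, the record fields, the key check, and the host names `cache.example` and `relay.example` are hypothetical and do not come from the NUWeb implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SiteRecord:
    ip: str
    site_key: str
    online: bool = True
    via_proxy: bool = False  # True when the site is only reachable through a relay


class WNS:
    """Toy in-memory name service (hypothetical, not the real WNS)."""

    def __init__(self, cache_server: str, relay_proxy: str) -> None:
        self.sites: Dict[str, SiteRecord] = {}
        self.cache_server = cache_server
        self.relay_proxy = relay_proxy

    def register(self, name: str, ip: str, site_key: str,
                 via_proxy: bool = False) -> None:
        """Called each time a site comes online, with its current IP."""
        known = self.sites.get(name)
        if known is not None and known.site_key != site_key:
            # Assumption: the SiteKey proves ownership of the name.
            raise PermissionError("site key mismatch")
        self.sites[name] = SiteRecord(ip, site_key, True, via_proxy)

    def go_offline(self, name: str) -> None:
        if name in self.sites:
            self.sites[name].online = False

    def resolve(self, name: str) -> Optional[str]:
        """Dynamic binding: map a site name to where the request should go."""
        rec = self.sites.get(name)
        if rec is None:
            return None                # unknown site
        if not rec.online:
            return self.cache_server   # offline: redirect to the cache server
        if rec.via_proxy:
            return self.relay_proxy    # behind NAT/firewall: use the relay proxy
        return rec.ip                  # direct connection


wns = WNS(cache_server="cache.example", relay_proxy="relay.example")
wns.register("alice", "10.0.0.5", "k1")
print(wns.resolve("alice"))
```

Because the binding is refreshed on every registration, a site whose IP changes only has to register again; clients keep using the same site name.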
Naming and Dynamic Addressing • A page is a textual web document. It contains UltraLinks or tags, and displaying such a page might trigger the display of other objects such as included images. • An object is either a rich-text document such as PDF, MS-DOC, MS-PPT, etc., a multimedia file, or any single file that can be accessed in the web space. • A resource is either a page or an object. • GRN (Global Resource Naming) • SiteUniqName#objectname[#class#type#location] • A fixed IP is not necessary • ABN (AddressByName), ABI (AddressById), ABC (AddressByContent) • USI (UniversalSiteId)
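A minimal sketch of parsing the GRN syntax shown above. The slide gives only the pattern, so this sketch assumes that the three optional fields appear together or not at all and that `#` never occurs inside a field; the names `GRN`, `parse_grn`, and `format_grn` are hypothetical.

```python
from typing import NamedTuple, Optional


class GRN(NamedTuple):
    site: str                      # SiteUniqName
    obj: str                       # objectname
    cls: Optional[str] = None      # optional class
    typ: Optional[str] = None      # optional type
    location: Optional[str] = None # optional location


def parse_grn(text: str) -> GRN:
    """Parse SiteUniqName#objectname[#class#type#location]."""
    parts = text.split("#")
    # Assumption: either just the two mandatory fields, or all five.
    if len(parts) not in (2, 5) or not parts[0] or not parts[1]:
        raise ValueError(f"not a valid GRN: {text!r}")
    return GRN(*parts)


def format_grn(grn: GRN) -> str:
    """Inverse of parse_grn."""
    fields = [grn.site, grn.obj]
    if grn.cls is not None:
        fields += [grn.cls, grn.typ or "", grn.location or ""]
    return "#".join(fields)


print(parse_grn("alice#photos/cat.jpg"))
print(parse_grn("alice#photos/cat.jpg#image#jpeg#tw"))
```

Since the name carries no IP address, resolving the site part is left to the name service, which is why a fixed IP is not required.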
NUWeb CyberCenter • GRI: Global Resource Index • A distributed index structure for objects/pages on the NUWeb space • Use hash data structure • Search engine, Community Service, Portal for NUWeb • Proxy & Caching • Auto backup and versioning • Info filtering, content switching • Packet forwarding, center relay • Relay casting, media streaming • Hierarchical search • Collaborative cache (super cache)
Site Initialization • When a new site is installed: • Register the following info • SiteUniqName, used by the center to interact with the site • Title of the site (at most T bytes) • Abstract of the site (at most P bytes) • Tags (inappropriate tags, such as those infringing others' rights, will be removed by the center) • Country/city/county: real-world geography info • Profile of personal info • Residents: SUN.resident will identify a user • Decide which directories are open to the public • Decide which directories are open to private connections • Decide whether to enable caching of the public directory
Site Initialization • The server will build an index for the pages/objects covered by the site. The indexes for public and private areas are kept separate so that privacy is preserved. • The index covers names and signatures, plus the content of pages; support for indexing object content, such as MS-DOC and PDF files, will be optional. • After the site is set up, the user will be asked to provide a list of friends to whom the system will send invitation letters.
NUWeb Services • NUSite, NUBlog • NUSearch, NUSM • NUCommunity, NUBBS, • NUBot, NUWatch, NUPush • NUCache, NUProxy • NUPedia, knowledge authoring/manager • NUMail, P2P secure mail system • NUJournal
Searching • The search in the NUWeb center includes: • Search pages/objects by name (WNS) • Page content search • * Attribute search, for example, search for pages authored by Hamming • The indexer in each NUSite sends the raw index to the center, and the center builds an index. The raw index is a record containing indexable text for each page or object. A text extractor is used to extract text from rich-text documents such as MS-DOC/PPT documents. Uploading the raw index requires approval from the user first. • Before rendering the search result to the user, the searcher needs to check whether the result page/object exists at that moment. • It uses the SSN to check the SiteDB and see whether that site is available. It also uses the GRN to check where the resource is available in the cache.
Caching • Every site page will be automatically cached unless explicitly disabled • In the first phase, caching will be done in the center and the NUWeb CP cache spaces. Objects will be cached when accessed • The client will cache the object in its cache spool and send an index entry to the center to notify it that the object is in its cache. • In the second phase, caching will also be done collaboratively in the P2P space, assuming that some of the personal sites are willing to participate. • Cached objects will be indexed by GRN and MD5 • Note that modifying an object triggers an update to the global cache space to remove the original cache entry indexed by GRN • Each cached object records a timestamp of the content (the time the content was created).
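A cache keyed by both GRN and MD5, with removal of the stale entry when an object changes, might look roughly like the following. This is a sketch under assumptions: the class `ObjectCache`, its method names, and the choice to share one stored copy among GRNs with identical content are illustrative and not taken from NUWeb.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


class ObjectCache:
    """Toy cache: content is stored once per MD5, and GRNs point at an MD5."""

    def __init__(self) -> None:
        self.by_md5: Dict[str, Tuple[bytes, float]] = {}  # md5 -> (content, created)
        self.by_grn: Dict[str, str] = {}                  # grn -> md5

    def put(self, grn: str, content: bytes,
            created: Optional[float] = None) -> str:
        digest = hashlib.md5(content).hexdigest()
        old = self.by_grn.get(grn)
        if old is not None and old != digest:
            # The object was modified: drop the entry previously indexed by this GRN.
            self.invalidate(grn)
        stamp = created if created is not None else time.time()
        self.by_md5.setdefault(digest, (content, stamp))
        self.by_grn[grn] = digest
        return digest

    def invalidate(self, grn: str) -> None:
        digest = self.by_grn.pop(grn, None)
        # Keep the content if another GRN still refers to the same bytes.
        if digest is not None and digest not in self.by_grn.values():
            self.by_md5.pop(digest, None)

    def get_by_grn(self, grn: str) -> Optional[bytes]:
        digest = self.by_grn.get(grn)
        return self.by_md5[digest][0] if digest is not None else None

    def get_by_md5(self, digest: str) -> Optional[bytes]:
        entry = self.by_md5.get(digest)
        return entry[0] if entry is not None else None


cache = ObjectCache()
first = cache.put("alice#index.html", b"version 1")
cache.put("alice#index.html", b"version 2")
print(cache.get_by_md5(first))  # the old version is no longer cached
```

Indexing by MD5 as well as GRN lets identical content published under different names share one cached copy, while the GRN index is what gets invalidated on modification.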
GRI & Collaborative Proxy • GRI: • Object indexed by MD5-signature & GRN • Home page indexed by GRN • Instance indexed by MD5 • Syntax: • GRN: SUN#OBN • Distributed/Collaborative GRI • Multi-tier Collaborative Proxy
Indices (1) • In the NUWeb center, there are several indices: • SiteDB: indexed by SSN • Last live time, access count, data size • When alive, each site periodically sends alive info to the center (every K minutes) • NameDB: indexed using gaisindex • Each name is associated with an SSN by which we can check whether such a page/object exists. • Each name has a record, which holds an SSN value and a GRN cache flag • In the search results of the NameDB, if a record does not have an online instance (either the original site or a cached copy), it will have a flag indicating “not available”
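The liveness check and the "not available" flag described above could be sketched as follows. The slide does not say how many missed reports mark a site as dead, so the `missed_allowed` parameter is an assumption, as are the class and field names; the SSN is treated here as an opaque integer site identifier.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteStatus:
    last_live: float = 0.0  # time of the last alive report, in seconds
    access_cnt: int = 0
    data_size: int = 0


class SiteDB:
    """Toy SiteDB keyed by SSN."""

    def __init__(self, interval_minutes: float, missed_allowed: int = 2) -> None:
        # Assumption: a site counts as offline after `missed_allowed`
        # reporting intervals without an alive message.
        self.timeout = interval_minutes * 60 * missed_allowed
        self.sites: Dict[int, SiteStatus] = {}

    def heartbeat(self, ssn: int, now: float, data_size: int = 0) -> None:
        status = self.sites.setdefault(ssn, SiteStatus())
        status.last_live = now
        status.data_size = data_size

    def is_alive(self, ssn: int, now: float) -> bool:
        status = self.sites.get(ssn)
        return status is not None and now - status.last_live <= self.timeout


@dataclass
class NameRecord:
    name: str
    ssn: int
    cached: bool  # the GRN cache flag


def availability(rec: NameRecord, site_db: SiteDB, now: float) -> str:
    """Label a NameDB hit before it is shown in search results."""
    if site_db.is_alive(rec.ssn, now):
        return "online"
    if rec.cached:
        return "cached"
    return "not available"


db = SiteDB(interval_minutes=5)
db.heartbeat(ssn=1, now=0.0)
print(availability(NameRecord("alice#index.html", 1, cached=False), db, now=30.0))
```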
Indices (2) • MD5 index: objects/pages indexed by MD5 signature. Each site produces an MD5 signature for each object, and the (GRN, MD5) info is sent to the center to be indexed. An MD5 lookup returns the source SSN/IP or the IP of the cache site(s) • Page/document content index • Indexed through the GAIS search engine
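As a rough illustration of the MD5 index, the sketch below maps a signature to the places holding that content. The class `MD5Index`, the `kind` labels, and the rule of listing the source before cache copies are assumptions made for the example; only the use of MD5 signatures and (GRN, MD5) pairs comes from the slide.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple


def signature(content: bytes) -> str:
    """MD5 signature of an object's bytes, as a hex string."""
    return hashlib.md5(content).hexdigest()


class MD5Index:
    """Toy center-side index: MD5 -> holders, plus MD5 -> GRNs."""

    def __init__(self) -> None:
        self._holders: DefaultDict[str, Set[Tuple[str, str]]] = defaultdict(set)
        self._grns: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, grn: str, md5: str, holder: str, kind: str = "source") -> None:
        """Record that `holder` (an IP or host) has the object, as source or cache."""
        self._holders[md5].add((kind, holder))
        self._grns[md5].add(grn)

    def lookup(self, md5: str) -> List[Tuple[str, str]]:
        """Return holders of the content, source sites first, then caches."""
        found = self._holders.get(md5, set())
        return sorted(found, key=lambda h: (h[0] != "source", h[1]))

    def names(self, md5: str) -> List[str]:
        return sorted(self._grns.get(md5, set()))


index = MD5Index()
sig = signature(b"hello")
index.add("alice#hello.txt", sig, "10.0.0.5")
index.add("alice#hello.txt", sig, "cache1.example", kind="cache")
print(index.lookup(sig))
```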
NUWeb Portal Service • Search engine for the NUWeb cyberspace • Websites, pages, pictures, videos, documents, articles, etc., … • Browsing and Viewing • What’s hot, what’s new, what’s cool, • Automatically generated through page rendering tool based on a CountDB and list manager.
NUWeb DB • NUWeb cache is implemented through the NUWeb DB system. • NUWeb DB stores Web objects and their relationships and provides a search function. • Web DB: • ODB (Object DB) • NDB (Name DB) • IDB (Index DB) • TDB (Term DB) • UDB (User DB) • SDB (Site DB) • Page Engine • Access Log DB (PV DB) • Access Control • Query Interface (including SQL) *
Web DB implementation • ODB and NDB are the kernel storage DBs • The key technique used in ODB and NDB is the Hash DB, which needs to minimize disk seeks and make maximal use of memory. • PV DB (Access Log DB) is implemented on top of ODB and NDB. • Term DB is implemented on top of ODB too. Term DB records information such as term frequency, term score, …
Web DB implementation (2) • Site DB records site info such as access frequency, size, dynamics, etc. • IDB is a real-time index engine for all the objects stored in Web DB. • Access Control: • Authorization: permission-list based • Authentication: through an authentication center in the WNS server. • SQL is not supported yet; it is on the to-do list.
NUDB • Net User’s DataBase • Easy to use • No database background is needed • No need to program • Define the spec and start using it • Spec can be adjusted flexibly • Scalable • Combines the advantages of table-processing software such as Excel and database systems • Portable, computable, mergeable
NUDB implementation • Physical DB Kernel • Hash DB • Inverted Index • Pattern Matching • Schema Layer and Query Processing • User Interface Layer • Data Presentation Management • DUA (Database User Agent, similar to an MUA)
NUBlog • AJAX Based Blog System • Personal Blog Home Base • Can have multiple copies in the web • Creation, Management, Posting • Import, Export: • XMLRPC • Robot, simulating Browser behaviour
NUWatch • Personal Web Agent • Event Watch, News Watch • Service Watch, • Site Watch, • Commerce Watch,
NUWatch Implementation • Personal Profile Manager • Matching Platform • On the fly matching • Batch mode matching through searching • Data Source Agents • Per user agent • Centralized agent (can reduce overhead) • Notification Agent • Relay casting to speed up • Gateway to message system
NUCommunity • Personal and Regional Community Engine • Forum, Vote, • Calendar, File Sharing, • Address Book, DB, .. • Interaction mechanism (auto notification, ..) • A community is conceptually given a NUWeb site • A community is treated like a user in the NUWeb space’s authentication and authorization
Access Control • Supports both password-based and membership-based protection. • Each directory is associated with a protection data structure • Authentication is done in the WNS server • Uses the permission list technique for membership-based protection • Protection is per directory; no inheritance is assumed.
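A per-directory protection structure without inheritance might be sketched like this. Several details are assumptions rather than NUWeb behaviour: directories with no protection record are denied by default, a `public` flag marks open directories, passwords are stored as salted PBKDF2 hashes, and the user name passed in is taken to be already authenticated by the WNS server.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

ITERATIONS = 100_000  # illustrative PBKDF2 work factor


@dataclass
class Protection:
    """Protection data attached to exactly one directory."""
    public: bool = False
    members: Set[str] = field(default_factory=set)  # the permission list
    salt: bytes = b""
    pw_hash: Optional[bytes] = None

    def set_password(self, password: str) -> None:
        self.salt = os.urandom(16)
        self.pw_hash = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), self.salt, ITERATIONS)

    def check_password(self, password: str) -> bool:
        if self.pw_hash is None:
            return False
        candidate = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), self.salt, ITERATIONS)
        return hmac.compare_digest(candidate, self.pw_hash)


class AccessControl:
    def __init__(self) -> None:
        self.dirs: Dict[str, Protection] = {}

    def protect(self, directory: str) -> Protection:
        return self.dirs.setdefault(directory, Protection())

    def allowed(self, directory: str, user: Optional[str] = None,
                password: Optional[str] = None) -> bool:
        # Exact-match lookup only: a subdirectory never inherits from its parent.
        prot = self.dirs.get(directory)
        if prot is None:
            return False  # assumption: no record means closed
        if prot.public:
            return True
        if user is not None and user in prot.members:
            return True
        return password is not None and prot.check_password(password)


ac = AccessControl()
ac.protect("/photos").members.add("bob")
print(ac.allowed("/photos", user="bob"), ac.allowed("/photos/2007", user="bob"))
```

The second call shows the no-inheritance rule: membership on `/photos` grants nothing on `/photos/2007` unless that directory has its own protection record.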
NUJournal • Why is publication done through paper?! • Traditionally, work HAD TO BE published on paper • A journal is both a channel and a barrier • Most papers enter a dead state once published • A new model of publication • Separate the concepts of publication and evaluation • Publication is an autonomous act: a work can be published through one's own website and reviewed or commented on by readers or reviewers. • A journal is a marketplace that glues together and guides access to publications, and a place to comment on and evaluate them • A publication can be a long-lived object • Other authors can join a published work over time if they make substantial contributions to it. • A publication is evaluated by its contribution and impact.