The lifespan, accessibility and archiving of dynamic documents
K. Wegrzyn-Wolska, katarzyna.wegrzyn@esigetel.fr
ESIGETEL, Ecole Supérieure d’Ingénieurs en Informatique et Génie des Télécommunications
OSWIR 2005 - Workshop on Open Source Web Information Retrieval, in association with the 2005 IEEE/WIC/ACM International Conferences on Web Intelligence
Presentation
• Introduction
• The Lifespan and Age of Dynamic Documents
• Dynamic Document Categories
• News published on the Web
• Weblog sites
• Search Engines
• Archiving
• Statistical Evaluation
• Conclusions
Katarzyna Wegrzyn-Wolska : Colloque EBSI-ENSSIB 2004
Introduction
• Some questions:
  • Real documents?
  • Temporary presentation of data?
  • Created automatically?
  • Created as a response to the users' questions?
  • HTML page with some dynamic parts (layers, scripts, etc.)?
  • Created on-line by the Web server?
Dynamic document categories
• Created on individual demand from the user:
  • results from Search Engines,
  • responses to data forms,
  • etc.
• Created automatically by a specialised application:
  • various News sites,
  • forums,
  • etc.
• The two categories have different behaviour and characteristics
The Lifespan and Age of Dynamic Documents
• Questions:
  • How to evaluate the lifespan of dynamic documents?
    • documents disappear from the computer's memory immediately after consultation
  • How to determine the age of dynamic documents?
    • the HTTP headers Last-Modified and Expires, or the value set in the HTML file with a META tag
• Answer:
  • the lifespan is the period during which the response to the same request does not change
  • this period is the lifespan visible to the user
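The header-based age estimate mentioned on this slide can be sketched as follows. This is a minimal illustration, not the author's tool: the function name `document_age_and_ttl`, the plain dictionary of headers and the sample header values are all assumptions, and the META-tag case is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def document_age_and_ttl(headers, now=None):
    """Return (age, time_to_expiry) as timedeltas; None where a header is absent.

    `headers` is assumed to be a plain dict keyed by the exact header names.
    """
    now = now or datetime.now(timezone.utc)
    age = ttl = None
    if "Last-Modified" in headers:
        # Age: time elapsed since the server says the document last changed.
        age = now - parsedate_to_datetime(headers["Last-Modified"])
    if "Expires" in headers:
        # Remaining validity: time until the server-declared expiry date.
        ttl = parsedate_to_datetime(headers["Expires"]) - now
    return age, ttl


# Hypothetical header values, for illustration only.
example_headers = {
    "Last-Modified": "Mon, 19 Sep 2005 10:00:00 GMT",
    "Expires": "Mon, 19 Sep 2005 10:20:00 GMT",
}
fixed_now = datetime(2005, 9, 19, 10, 5, tzinfo=timezone.utc)
age, ttl = document_age_and_ttl(example_headers, now=fixed_now)
print(age, ttl)
```

Many dynamic pages omit these headers or set them to the time of the request, which is why the slide falls back on the user-visible definition of lifespan.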
The Lifespan and Age of Dynamic Documents
• News sites
  • Varied information:
    • general sites,
    • world news,
    • local news,
    • etc.
  • Characteristics:
    • automatically created,
    • updated instantaneously,
    • archived
The Lifespan and Age of Dynamic Documents

News site         URL                              Updating         Archiving
----------------  -------------------------------  ---------------  ------------------------------------------
French Google     http://news.google.fr            ~20 min          30 days
Google            http://news.google.com           ~20 min          30 days
Voilà News        http://actu.voila.fr/            1 day            1 week
Voila             http://actu.voila.fr/Depeche/    (~30 min)        1 week
CNN               http://www.cnn.com/              -                -
Yahoo! News       http://fr.news.yahoo.com/        instantaneously  1 week
TF1 news          http://news.tf1.fr/news/         instantaneously  -
News now          http://www.newsnow.co.uk/        5 min            -
Les Infos         http://www.lesinfos.com/         -                since 2000
CategoryNet       http://www.categorynet.com       every day        never ending
CompanynewsGroup  http://www.companynewsgroup.com  40 per day       2003 and 2004 archived; 1999-2003 in the future
The Lifespan and Age of Dynamic Documents
• Weblogs:
  • updated daily
  • Form:
    • modified Web page,
    • varied information
  • General:
    • dynamic pages
    • regularly updated
The Lifespan and Age of Dynamic Documents
• Search Engines:
  • Response:
    • dynamic pages
    • created on-line
  • Lifespan, accessibility:
    • period during which the Search Engine's answer does not change,
    • updating frequency of the index databases
  • Example:
    • Google: 4 weeks
Archiving
• Question:
  • how to store (archive) dynamic documents?
• Solutions:
  • simple: printed version
  • saved by the user
  • put into special caching and archiving systems
  • applications that try to save an up-to-date image of the Web
    • example: Wayback Machine
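As a rough illustration of the caching and archiving idea listed above, the sketch below keeps a timestamped copy of a page only when its content has changed since the last capture. It is an in-memory toy under stated assumptions: the class name `SnapshotArchive`, its `store` method and the sample page contents are hypothetical, and it does not describe how the Wayback Machine works internally.

```python
import hashlib
from datetime import datetime, timezone


class SnapshotArchive:
    """In-memory archive that stores a new copy only when the content changes."""

    def __init__(self):
        # url -> list of (capture time, content digest, content)
        self.snapshots = {}

    def store(self, url, content, when=None):
        """Archive `content` for `url`; return True if a new snapshot was kept."""
        when = when or datetime.now(timezone.utc)
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        history = self.snapshots.setdefault(url, [])
        if history and history[-1][1] == digest:
            return False  # unchanged since the last capture, nothing to add
        history.append((when, digest, content))
        return True


# Hypothetical page contents, for illustration only.
archive = SnapshotArchive()
url = "http://news.google.com"
first = archive.store(url, "<html>headlines v1</html>")
repeat = archive.store(url, "<html>headlines v1</html>")
changed = archive.store(url, "<html>headlines v2</html>")
print(first, repeat, changed, len(archive.snapshots[url]))
```

Once stored this way, each snapshot behaves like a static document, which is the point made in the conclusion of the talk.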
Archiving (Google News in the Wayback Machine)
Archiving (ActuVoila in the Wayback Machine)
Statistical evaluation
• Index-database updating frequency:
  • frequency analysis of the visits made by the indexing robots used by Search Engines
• Statistical tests (news and Weblog sites)
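The frequency analysis of robot visits can be sketched as below, assuming the visit times of one indexing robot have already been extracted (for example from a server access log; that step is not shown). The function name `visit_intervals` and the timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean


def visit_intervals(visits):
    """Return the gaps, in minutes, between consecutive robot visits."""
    ordered = sorted(visits)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical visit times of a single indexing robot.
visits = [
    datetime(2005, 9, 19, 8, 0),
    datetime(2005, 9, 19, 8, 30),
    datetime(2005, 9, 19, 9, 30),
    datetime(2005, 9, 19, 11, 0),
]
intervals = visit_intervals(visits)
mean_interval = mean(intervals)
print(intervals, mean_interval)
```

The mean gap gives a first estimate of how often the robot refreshes its view of the site, and therefore of how stale the Search Engine's index can be.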
Statistical evaluation
• 4 categories of sites:
  • Sportstrategies
    • sports news service, modified very regularly
    • very regular News site, with a constant update time (every hour)
  • BBC
    • broadcast on-line
    • the lifespan is very irregular because information is updated instantaneously, as soon as it is available
  • TF1
    • updated frequently during the day
    • no modifications during the night
  • Weblog (Slashdot.org)
    • changes very quickly; new articles are published very often
    • current discussions continue incessantly
    • the lifespan of these dynamically changing pages is extremely short
Statistical evaluation: news of the 2004 Olympic Games in Athens published by the site Sportstrategies
Statistical evaluation: BBC News
Statistical evaluation: TF1 News, 24 hours a day
Statistical evaluation: TF1 News, working hours
Statistical evaluation: Weblog Slashdot
Statistical evaluation: lifespan comparison

Tested service             Mean lifespan  Min     Max
-------------------------  -------------  ------  -------
Slashdot.org               77 sec         10 sec  22 min
BBC News                   8.5 min        1 min   66 min
TF1 news (24/24)           19.5 min       1 min   502 min
TF1 news (working hours)   6.3 min        1 min   49 min
Sportstrategies            56 min         9 min   61 min
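Figures of this kind can be derived from repeated polling of a page. The sketch below is one possible way to do it, not the measurement procedure used for the table: it assumes a list of `(time in seconds, content digest)` observations, treats a change of digest as a new version, and ignores the first and last versions because their full duration was not observed. The function name `lifespans` and the sample data are hypothetical.

```python
from statistics import mean


def lifespans(observations):
    """Return the durations between successive content changes.

    `observations` is a time-ordered list of (time_in_seconds, digest) pairs.
    Versions whose start or end was not observed are left out.
    """
    change_times = []
    previous = None
    for index, (time, digest) in enumerate(observations):
        if index > 0 and digest != previous:
            change_times.append(time)  # first poll that saw a new version
        previous = digest
    return [end - start for start, end in zip(change_times, change_times[1:])]


# Hypothetical polls taken every 60 seconds.
polls = [(0, "a"), (60, "a"), (120, "b"), (180, "b"),
         (240, "b"), (300, "c"), (360, "d")]
spans = lifespans(polls)
summary = {"mean": mean(spans), "min": min(spans), "max": max(spans)}
print(spans, summary)
```

The polling interval bounds the precision: a lifespan shorter than the interval between two polls cannot be measured, which matters for fast-changing sites such as Slashdot.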
Conclusion
• Dynamic documents:
  • do not exist in reality,
  • disappear from the computer's memory directly after consultation,
  • have a very short real lifespan
• But...
  • they can remain accessible for a long time and can be stored by special archiving systems
• Management of archived dynamic documents:
  • their lifespan is identical to that of static documents, because they are stored in the same way as static ones