160 likes | 304 Views
Automated subject classification of engineering Web pages in hierarchical browsing: A user study. Koraljka Golub KnowLib, Dept. of Electrical and Information Technology, Lund University http://www.it.lth.se/knowlib/. Motivation.
E N D
Automated subject classification of engineering Web pages in hierarchical browsing: A user study Koraljka Golub KnowLib, Dept. of Electrical and Information Technology, Lund University http://www.it.lth.se/knowlib/
Motivation • To re-examine the value of traditional controlled vocabularies in the world of Web pages • For browsing • For automated classification Introduction
Purpose of the study • Purpose • Investigate performance of an automated classification algorithm on a collection of engineering web pages, in the context of hierarchical browsing • Two major research questions: • Are users able to navigate the Ei classification structure? • How well are the web pages classified, as judged by the users? Introduction
Controlled vocabulary • Ei classification scheme • 6 main classes • decimally subdivided • up to 5 hierarchical levels • pre-existing intellectual mappings between terms in the Ei thesaurus to terms in the Ei classification scheme Table 1. The number of different types of terms for 92 sub-classes from class 9 4 Civil Engineering … 44 Water and Waterworks Engineering 441 Dams and Reservoirs … 445 Water Treatment 445.1 Water Treatment Techniques 445.1.1 Potable Water Treatment Techniques Background
The classification algorithm • the algorithm classifies documents into classes of the Ei classification scheme • it searches for terms from the Ei controlled vocabulary, in text of documents to be classified • based on thesaurus terms and captions, a term list is created: Weight: Term (single word, Boolean term or phrase) = Class • the algorithm searches for strings from a given term list in the document to be classified and if the Term is found, the Class(es) assigned to that string in the term list are assigned to the document • one class can be designated by many terms, and each time a term is found, the corresponding Weight is added to the score for the class • the scores for each class are summed up and classes with scores above a certain cut-off value (heuristically defined) are selected as the final ones for the document being classified Background
Document collection • automatically created, i.e., Web pages were collected using the Combine focused crawler (Ardö 2007) • 18,895 Web pages • once crawled, the Web pages were classified using the described algorithm Methodology
User study design • it comprised the following steps, with the last three ones repeated for each of four search tasks • Invitation to participate. • Participation consent form. • Written instructions. • Pre-study questionnaire. • Search task description. • Search session, in which every move was logged. • Post-task questionnaire on participants’ certainty of their decisions and general satisfaction. Methodology
Tasks 1) “Your task is to find all the web pages in the system dealing with the topic of particle accelerators.” • fifth hierarchical level (932.1.1) • the class caption the same as the topic name 2) “Your task is to find all the web pages in the system dealing with the topic of magnetic instruments.” • at the fourth hierarchical level (942.3) • the class caption the same as the topic name 3) “Your task is to find all the web pages in the system dealing with the topic of differentiation and integration.” • the fourth hierarchical level (921.2) • the class caption entirely different from the topic name 4) “Your task is to find all the web pages in the system dealing with the topic of professional organizations in the field of engineering.” • at the fifth hierarchical level (901.1.1) • the class caption (Societies and Institutions) entirely different from the topic name Methodology
Data collection • browsing • logging the browsing steps and determining if correct classes were found • compared against an “ideal” browsing path • “ideal” class for each search topic • classification • through correctness assessments: “correct”, “partly correct”, “incorrect” and “impossible for me to say” • pre-study questionnaire on participants’ background and their previous searching and browsing experience • a post-task questionnaire on participants’ certainty of their decisions and general satisfaction • clarifying data collected for six participants by observation based on the “think-aloud” protocol • general comments – optional (31 by 16 participants collected) Methodology
Participants • 40 participants, students and researchers in the field of engineering • very good English skills and a lot of online searching experience • they used search engines once or twice a week • usage of professional information services • library catalogues and Lund University’s service providing free access to commercial databases once or twice a year • browsing they used once or twice a month Methodology
Browsing • on average for all the four tasks, the majority (29 out of 40 participants) found the right class • appx. two other classes per task were considered correct by at least two participants • some were indeed on the topic, esp. in Task 4, where 4 other were chosen: • 912.2 Management • 901.1 Engineering Professional Aspects (broader) • 901.3 Engineering Research • 912.1 Industrial Engineering • for two of the tasks the ideal browsing path takes four steps and for other two five steps • the participants who found the right class took 15 steps; all participants took 16 steps • they reported that it was quite easy finding it and were quite certain they found the right class Results
Browsing Table 2. Ideal browsing steps for Task 3. • in all tasks majority selected correct second and third hierarchical level classes • more wrong classes were chosen at the fourth hierarchal level, which could be attributed to: • participants’ unfamiliarity with the subject at the required specific level; • class captions being different from topic names • the classification scheme could have been inappropriate Results
Automated classification • top ranked web pages in each of the four classes were on average deemed partly correct • as with browsing, evaluations between the four tasks differed • in compliance with previous results based on a pre-classified collection of research abstracts (Golub et al. 2007) • major problem: large differences among participants in their judgements – a number of web pages were evaluated as correct, partly correct and incorrect by different participants • a major reason – subjectivity in deciding which topic a document is dealing with • different task interpretations • little or very much text Results
In conclusion… • the majority of the participants deemed the whole experience to be between neutral and pleasing • further research • determining other reasons behind browsing failures • deal with problems of disparate evaluations of one and the same web page • methodology to encourage participants make more homogeneous decisions • harvest web pages that are more uniform in terms of quality or genre (see, for example, Custard and Sumner 2005; Nicholson 2003) • design more narrowly specified search tasks for a certain purpose Concluding remarks
References Ardö, A. (2007) “Focused crawler: Combine system homepage”, available at: http://combine.it.lth.se/ (accessed 3 September 2007). Chen, H., and Dumais, S.T. (2000), “Bringing order to the Web: automatically categorizing search results”, Proceedings of the ACM International Conference on Human Factors in Computing Systems, Den Haag, pp. 145-152. Custard, M. and Sumner, T. (2005), “Using machine learning to support quality judgments”, D-Lib Magazine, October Vol. 11 No. 10, available at: http://www.dlib.org/dlib/october05/custard/10custard.html (accessed 3 September 2007). Golub, K., Hamon, T., and Ardö, A. (2007), “Automated classification of textual documents based on a controlled vocabulary in engineering”, submitted to Knowledge Organization. Koch, T., and Zettergren, A.-S. (1999) “Provide browsing in subject gateways using classification schemes”, EU Project DESIRE II, available at http://www.mpdl.mpg.de/staff/tkoch/publ/class.html. Large, A., Tedd, L., and Hartley, R. (1999), Information seeking in the online age, K. G. Saur, London etc. Nicholson, S. (2003), “Bibliomining for automated collection development in a digital library setting: using data mining to discover Web-based scholarly research works”, Journal of the American Society for Information Science and Technology Vol. 54 No. 12, pp. 1081-1090. Schwartz, C. (2001), Sorting out the Web: approaches to subject access, Ablex, Westport, CT. Svenonius, E. (2000), The intellectual foundations of information organization, MIT Press, Cambridge, MA.