

  1. Representation of hypertext documents based on terms, links and text compressibility. Julian Szymański – Department of Computer Systems Architecture, Gdańsk University of Technology, Poland. Włodzisław Duch – Department of Informatics, Nicolaus Copernicus University, Toruń, Poland; School of Computer Engineering, Nanyang Technological University, Singapore.

  2. Outline • Text representations • Words • References • Compression • Evaluation of text representations • Wikipedia data • SVM & PCA • Experimental results and conclusions • Future directions

  3. Text representation • The amount of information on the Internet grows rapidly, so machine support is needed for: • Categorization (supervised or unsupervised) • Searching / retrieval • Humans understand text; machines do not. To process text, a machine needs it in a computable form. • The results of text processing strongly depend on the method used for text representation. • Processing natural language – several approaches to the problem: • Logic (ontologies), • Statistical processing of large text corpora, • Geometry, mainly used in machine learning. • Machine learning for NLP uses text features. • The aim of the experiments presented here is to find a hypertext representation suitable for automatic categorization and information retrieval for the Wiki project – improvement of the existing Wikipedia category system.

  4. Text representation with features • A convenient form for machine processing of text is a vector of features: • k – document number, • n – feature number, • c – feature value. • A text set is then represented as a matrix of N features, where feature n is related to text k by the weight c. • Where do the features come from?
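A minimal sketch (not from the slides) of building such a document-feature matrix; the toy documents and the choice of raw word counts as the weights c are assumptions made only for illustration.

    from collections import Counter

    # Toy corpus: each entry is one "document" k.
    documents = [
        "algebra group ring field",
        "volcano lava eruption field",
    ]

    # Feature set: the unique words occurring in the corpus.
    vocabulary = sorted({word for doc in documents for word in doc.split()})

    # Row k of the matrix holds the weight c of feature n in document k
    # (raw term counts in this sketch).
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts.get(word, 0) for word in vocabulary])

    print(vocabulary)
    for row in matrix:
        print(row)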

  5. Words • The most intuitive approach is to take words as features; the words should describe the subject of the text well. • The n-th word has value c in the context of the k-th document, calculated as c = tf · idf, where: • tf – term frequency: how many times word n appears in document k. • idf – inverse document frequency: how seldom word n appears in the whole text set; the proportion of the number of all documents to the number of documents containing the given word. • Problem: high-dimensional, sparse vectors. • BOW – Bag of Words, which loses syntax. • Preprocessing: stopwords, stemming. Features -> terms. • Some other possibilities: n-grams, profiles of letter frequencies.
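A short sketch of the tf·idf weighting described above (not the authors' code); it uses a toy corpus and assumes the common logarithmic form of idf, whereas the slide only describes a proportion.

    import math
    from collections import Counter

    documents = [
        "algebra group ring field".split(),
        "volcano lava eruption field".split(),
    ]

    N = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in documents for word in set(doc))

    def tfidf(doc):
        """Weight c = tf * idf for every word of one document."""
        tf = Counter(doc)
        # idf taken as the log of the proportion of all documents to the
        # documents containing the word (a common variant; an assumption here).
        return {word: tf[word] * math.log(N / df[word]) for word in tf}

    for k, doc in enumerate(documents):
        print(k, tfidf(doc))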

  6. References • Scientific articles contain bibliographies; web documents contain hyperlinks. They can be used as a representation space where a document is represented by the other documents it references. • Typically a binary vector containing: 0 – lack of a reference to the given document, 1 – existence of the reference. • Some possible extensions: • Not all articles are equal. Ranking algorithms such as PageRank or HITS allow measuring the importance of documents and provide, instead of a binary value, a weight that describes the importance of one article pointing to another. • We can use references of higher order, which capture references not only from neighbours but also from further away. • Similarly to the word representation the vectors are sparse, but of much lower dimensionality. • Poor at capturing semantics.
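A minimal sketch, with an invented link structure, of the binary reference representation: each document becomes a 0/1 vector over all documents, marking which ones it references.

    # Hypothetical link structure: document id -> ids of documents it references.
    links = {
        0: [1, 2],
        1: [2],
        2: [],
    }

    doc_ids = sorted(links)

    def link_vector(doc_id):
        """Binary vector: 1 if doc_id references the document, 0 otherwise."""
        targets = set(links[doc_id])
        return [1 if other in targets else 0 for other in doc_ids]

    for d in doc_ids:
        print(d, link_vector(d))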

  7. Compression • Usually we need to express differences and similarities between the texts in a repository. They can be calculated using e.g. the cosine distance, which is suitable for high-dimensional, sparse vectors, giving a square matrix of text similarities. • Another possibility is to build the representation space on algorithmic information estimated using standard file compression techniques. • Key idea: if two documents are similar, their concatenation compresses to a file only slightly larger than a single compressed file. Two similar files will be compressed better than two different ones. • The complexity-based similarity measure is the fraction by which the sum of the separately compressed files exceeds the size of the jointly compressed file: s(A, B) = (|Ap| + |Bp| − |(AB)p|) / |(AB)p|, where A and B denote text files, and the suffix p denotes the compression operation.
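A rough sketch of this compression-based measure using zlib as the compressor (an assumption; the slides do not say which compression technique was used).

    import zlib

    def csize(text):
        """Size in bytes of the compressed text."""
        return len(zlib.compress(text.encode("utf-8")))

    def compression_similarity(a, b):
        """Fraction by which the separately compressed sizes exceed
        the size of the jointly compressed concatenation."""
        joint = csize(a + b)
        return (csize(a) + csize(b) - joint) / joint

    doc1 = "volcano eruption lava flow " * 20
    doc2 = "volcano eruption ash cloud " * 20
    doc3 = "abstract algebra group theory " * 20

    print(compression_similarity(doc1, doc2))  # similar topics -> larger value
    print(compression_similarity(doc1, doc3))  # different topics -> smaller value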

  8. The data • The three ways of generating a numerical representation of texts have been compared on a set of articles selected from Wikipedia. • Articles belonging to subcategories of the super-category Science: • Chemistry → Chemical compounds, • Biology → Trees, • Mathematics → Algebra, • Computer science → MS (Microsoft) operating systems, • Geology → Volcanology.

  9. Rough view of the class distribution • PCA projections of the data onto the two principal components with the highest variance. • [Figure: projection of the dataset onto the two highest principal components for the text representations based on terms, links and compression.] • [Figure: number of components needed to cover 90% of the variance, and the cumulative sum of the principal components' variance, for the successive text representations.]
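A sketch, assuming scikit-learn and a random placeholder matrix X, of projecting the data onto the two leading principal components and counting how many components cover 90% of the variance, as in the plots described above.

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder feature matrix: rows are documents, columns are features
    # (terms, links or compression-based similarities).
    rng = np.random.default_rng(0)
    X = rng.random((100, 50))

    # 2-D projection for visualisation.
    projection = PCA(n_components=2).fit_transform(X)

    # Number of components needed to cover 90% of the variance.
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
    print(projection.shape, n_components_90)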

  10. SVM classification • Classification may be used as a method of validating the text representations: the better the results a classifier obtains, the better the representation is. The information extracted by different text representations may be estimated by comparing classifier errors in the various feature spaces. • Multiclass classification with SVM, performed with the one-versus-other-classes approach, has been used with two-fold cross-validation repeated 50 times for accurate averaging of the results. • The raw representation based on complexity gives the best results. • Reducing the dimensionality by removing the features related to only one article improves the results. • Introducing a cosine kernel improves the results considerably.
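A sketch, assuming scikit-learn with placeholder data and labels, of a multiclass SVM with a cosine kernel evaluated by two-fold cross-validation repeated 50 times; it illustrates the setup, not the authors' code.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.random((200, 300))          # placeholder document-feature matrix
    y = rng.integers(0, 5, size=200)    # placeholder labels for the 5 categories

    # SVM with a cosine kernel, one classifier per class (one-versus-other).
    clf = OneVsRestClassifier(SVC(kernel=cosine_similarity))

    # Two-fold cross-validation repeated 50 times, as on the slide.
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=50, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores.mean())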

  11. SVM and PCA reduction • Selecting the components that cover 90% of the variance has been used for dimensionality reduction. • It worsens the classification results for terms and links (too strong a reduction?). • PCA does not influence the complexity representation. • As in the previous results, introducing the cosine kernel improves classification; for terms it is then even slightly better.
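One possible way, again assuming scikit-learn and placeholder data, to chain the 90%-variance PCA reduction with an SVM classifier in a single pipeline.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.random((200, 300))          # placeholder document-feature matrix
    y = rng.integers(0, 5, size=200)    # placeholder labels

    # PCA keeps the smallest number of components covering 90% of the variance,
    # then a linear SVM is trained in the reduced space.
    model = make_pipeline(PCA(n_components=0.90), SVC(kernel="linear"))
    model.fit(X, y)
    print(model.named_steps["pca"].n_components_)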

  12. Summary • The complexity measure allows a much more compact representation, as seen from the cumulative contribution of the principal components, and achieved the best accuracy in the PCA-reduced space with only 36 dimensions. • After using the cosine kernel, the term-based representation is slightly more accurate. • An explicit representation of the kernel spaces and the use of a linear SVM classifier allow finding important reference documents for a given category, as well as identifying collocations and phrases that are important for characterizing each category. • Distance-type kernels improve the results and reduce the dimensionality of the term and link representations. There is also an improvement for the complexity-based representation, where the distance-based similarity is a second-order transformation.

  13. Future directions • Different methods of representation extract different information from texts; they show different aspects of the documents. • In the future we plan to combine representations and use one joint representation. • We plan to introduce more background knowledge and capture some semantics. • WordNet can be used as a semantic space onto which the words from an article are mapped. • WordNet is built as a network of interconnected synsets – elementary atoms that carry meaning. • The mapping requires the use of disambiguation techniques. • It allows using activations of the WordNet semantic network and then calculating distances between them, which should give better semantic similarity measures. • A large-scale classifier for the whole of Wikipedia.
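A small sketch, assuming NLTK with its WordNet corpus installed, of mapping words onto WordNet synsets and computing a path-based similarity; disambiguation is reduced here to naively taking the first synset.

    # Requires: import nltk; nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def first_synset(word):
        """Naive 'disambiguation': take the first (most frequent) synset."""
        synsets = wn.synsets(word)
        return synsets[0] if synsets else None

    def semantic_similarity(word_a, word_b):
        """Path similarity between the chosen synsets (None if unrelated)."""
        a, b = first_synset(word_a), first_synset(word_b)
        if a is None or b is None:
            return None
        return a.path_similarity(b)

    print(semantic_similarity("volcano", "mountain"))
    print(semantic_similarity("volcano", "algebra"))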

  14. Thank you for your attention
