Project 1: Text Classification by Neural Networks (Ver 1.1)
Outline
• Classification using ANN
• Learn and classify text documents
• Estimate several statistics on the dataset
Network Structure
[Figure: a feed-forward network diagram mapping the input to output units for Class 1, Class 2, Class 3, …]
CLASSIC3 Dataset
• Three categories, 3,891 documents in total:
  • CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information
  • CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology
  • MED: 1,033 biomedical abstracts from MEDLINE
Text Representation in Vector Space
[Figure: the document collection passes through stemming, stop-word elimination, and feature selection, then receives a vector-space-model (VSM) / bag-of-words representation. The result is a term-document matrix: rows are terms (e.g., baseball, specs, graphics, hockey, unix, space), columns are documents d1, d2, d3, …, dn, and each cell holds the frequency of the term in that document. This matrix is the dataset format used in the project.]
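To make the bag-of-words representation concrete, here is a minimal sketch (not part of the original project materials) that builds a small term-document matrix from a toy, already-preprocessed corpus; the documents and terms are made up for illustration.

```python
from collections import Counter

# Toy corpus standing in for the preprocessed (stemmed, stopword-free)
# documents; the terms echo those shown in the figure.
docs = [
    "graphics unix graphics space",
    "baseball hockey baseball",
    "space unix space specs",
]

# Vocabulary: every distinct term in the collection.
vocab = sorted({term for doc in docs for term in doc.split()})

# Term-document matrix: rows are terms, columns are documents,
# each cell the term's frequency in that document (bag-of-words).
counts = [Counter(doc.split()) for doc in docs]
matrix = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, matrix):
    print(f"{term:10s}", row)
```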
Dimensionality Reduction
[Figure: documents in vector space yield term (or feature) vectors; a scoring measure is applied to each individual feature; features are sorted by score; the terms with higher values are chosen and passed to the ML algorithm.]

Term Weighting (see the sketch below)
• TF or TF × IDF
• TF: term frequency
• IDF: inverse document frequency, IDF_i = log(N / n_i)
  • N: number of documents
  • n_i: number of documents that contain the i-th word
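The TF × IDF weighting can be computed directly from the term-document matrix. A minimal sketch, assuming rows are terms and columns are documents, using IDF_i = log(N / n_i) as defined above (the function name `tfidf` is our own):

```python
import math

def tfidf(matrix):
    """Weight a term-document matrix (rows = terms, columns = documents)
    by TF x IDF, where IDF_i = log(N / n_i)."""
    n_docs = len(matrix[0])                   # N: total number of documents
    weighted = []
    for row in matrix:
        n_i = sum(1 for tf in row if tf > 0)  # documents containing term i
        idf = math.log(n_docs / n_i) if n_i else 0.0
        weighted.append([tf * idf for tf in row])
    return weighted

# Example: weight the toy matrix from the previous sketch.
# weighted = tfidf(matrix)
```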
Construction of Document Vectors
• Controlled vocabulary:
  • Stopwords are removed
  • Stemming is applied
  • Words whose document frequency is less than 5 are removed → term size: 3,850
• A document is represented as a 3,850-dimensional vector whose elements are word frequencies.
• Words are sorted by their information-gain values and the top 100 terms are selected (a sketch follows) → a 3,830 (examples) × 100 (terms) matrix.
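Information-gain scoring can be sketched as follows, again against a term-document matrix with rows as terms; a document is treated as containing a term when its frequency is nonzero. This is an illustrative implementation, not the lab's original code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term_row, labels):
    """IG of one term: `term_row` is that term's row in the term-document
    matrix; a document 'contains' the term if its frequency is nonzero."""
    n = len(labels)
    with_t = [y for tf, y in zip(term_row, labels) if tf > 0]
    without_t = [y for tf, y in zip(term_row, labels) if tf == 0]
    cond = (len(with_t) / n) * entropy(with_t) \
         + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - cond

def top_terms(matrix, labels, k=100):
    """Indices of the k terms with the highest information gain."""
    scores = [information_gain(row, labels) for row in matrix]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
```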
Data Setting for the Experiments
• Training and test sets are given:
  • Training: 2,683 examples
  • Test: 1,147 examples
• N-fold cross-validation (optional; sketched below):
  • The dataset is divided into N subsets.
  • The holdout method is repeated N times: each time, one of the N subsets is used as the test set and the other N-1 subsets are put together to form the training set.
  • The average performance across all N trials is computed.
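A minimal sketch of the optional N-fold procedure; the `evaluate` call is a hypothetical placeholder for training the network on the training indices and measuring accuracy on the test indices.

```python
def n_fold_splits(n_examples, n_folds):
    """Yield (train_indices, test_indices) pairs: each of the N interleaved
    subsets serves once as the test set; the rest form the training set."""
    folds = [list(range(i, n_examples, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Average performance across all N trials:
# accs = [evaluate(train, test) for train, test in n_fold_splits(3830, 5)]
# print(sum(accs) / len(accs))
```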
Number of Epochs
Number of Hidden Units
• Vary the number of hidden units
• Minimum 10 runs for each setting (see the sketch below)
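One way to organize these repeated runs, sketched in Python; `train_and_test` is a hypothetical stand-in for whatever ANN implementation you choose (Weka, an NN library, or the MATLAB toolbox) and is assumed to return test accuracy for a given hidden-layer size and random seed.

```python
import random
import statistics

def run_experiment(train_and_test, hidden_unit_grid, n_runs=10):
    """For each hidden-layer size, repeat training with different random
    seeds and report the mean and standard deviation of test accuracy."""
    results = {}
    for n_hidden in hidden_unit_grid:
        accs = [train_and_test(n_hidden, seed=random.randrange(10**6))
                for _ in range(n_runs)]
        results[n_hidden] = (statistics.mean(accs), statistics.stdev(accs))
    return results

# e.g. run_experiment(train_and_test, [5, 10, 20, 50])
```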
Other Methods/Parameters (examples sketched below)
• Normalization method for input vectors
• Class decision policy
• Learning rates
• …
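For instance, input vectors might be L2-normalized and the output class chosen by a winner-takes-all rule. The following sketch shows one possible choice for each; the slides deliberately leave these decisions open.

```python
import math

def l2_normalize(vec):
    """Scale an input vector to unit length (one of several options)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def decide_class(outputs):
    """Winner-takes-all policy: the class whose output unit is largest."""
    return max(range(len(outputs)), key=outputs.__getitem__)
```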
ANN Sources
• Source code
  • Free software: Weka
  • NN libraries (C, C++, Java, …)
  • MATLAB toolbox
• Web sites
  • http://www.cs.waikato.ac.nz/~ml/weka/
  • http://www.faqs.org/faqs/ai-faq/neural-nets/part5/
Submission
• Due date: October 12 (Thursday)
• Submit both a hardcopy and an email copy
• Include:
  • The software used and the running environment
  • Experimental results with various parameter settings
  • Analysis and explanation of the results in your own way
• FYI, it is not important to achieve the best performance.