EFFORTS TO AUTOMATE LABELING OF LECTURES WITH COMPUTING ONTOLOGY TERMS
Felicia Decker and Lois Delcambre
Portland State University
PREVIOUS WORK
• Course
  • Intro to Databases
• We found 6 courses – on the web – with all lectures
• Lecture notes
  • ppt/pdf/html
• Hand-labeled each lecture topic with Computing Ontology (CO) terms
  • used this to validate the CO
  • leaf CO terms correspond to lecture topics
CURRENT WORK
• Will the words that appear in these lecture notes help us choose CO terms?
• Are there “signature” words for each topic?
• Tools
  • Lucene
  • Converter tools (ppt/pdf/html -> text)
  • Microsoft Excel
LUCENE
• Index lecture notes
  • text from one lecture = one document
  • documents/lectures from one course = one collection (with an index)
• Provides us with
  • Term frequency (tf)
  • Inverse document frequency (idf)
  • Tf-idf
• Currently using single words, just now introducing stemming
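To make the three quantities concrete, here is a minimal Python sketch that treats each lecture as one document and one course as one collection. It is not the Lucene setup from the slides: it assumes the textbook weighting tf * log(N / df), while Lucene's own similarity classes use a different weighting. The function names and the toy course are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into single words (no stemming, no phrases)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_table(lectures):
    """lectures: {lecture name: plain text} for ONE course (one collection).

    Returns {lecture name: {word: (tf, idf, tf_idf)}} using the textbook
    formula idf = log(N / df). This is an assumption, not Lucene's scoring.
    """
    counts = {name: Counter(tokenize(text)) for name, text in lectures.items()}
    n_docs = len(counts)

    # Document frequency: in how many lectures does each word appear?
    df = Counter()
    for word_counts in counts.values():
        df.update(word_counts.keys())

    table = {}
    for name, word_counts in counts.items():
        table[name] = {}
        for word, tf in word_counts.items():
            idf = math.log(n_docs / df[word])
            table[name][word] = (tf, idf, tf * idf)
    return table


# Hypothetical toy course with three very short "lectures".
course = {
    "lecture1": "normal form and functional dependency and decomposition",
    "lecture2": "query plan and join order and cost",
    "lecture3": "transaction and lock and log",
}
table = tf_idf_table(course)
```

In this toy course the word "and" occurs in every lecture, so its idf and tf-idf are zero, while a word that occurs in only one lecture, such as "dependency", gets a positive score.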
CONVERTER TOOLS
• Lecture notes come in different formats
• PPT -> text
  • Apache POI
• PDF -> text
  • TextMiningTool 1.1.42
  • Xpdf-3.02
• HTML -> text
  • Copy/paste
  • Internet Explorer – save webpage as text
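The HTML -> text step was done by hand on the slides. As a rough scripted stand-in, the sketch below uses only Python's standard-library html.parser; it is not one of the tools listed above, and the class name, function name, and sample page are invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical lecture page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Normalization</h1><p>Third normal form</p></body></html>"
)
text = html_to_text(sample)
```

The output is a flat string of words, which is all that single-word indexing needs; slide layout and heading structure are discarded.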
EXCEL
• After using Lucene to get tf, idf and tf-idf data for each term in the given index…
• Select a CO term: e.g., Normalization
• Using CO-labeled lecture notes (previous work), choose the lectures labeled with Normalization
• Compile tf/idf/tf-idf data into one spreadsheet
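The compile step above can be sketched in Python as follows. This is an illustrative stand-in for the Excel workflow, not the authors' procedure: the lecture identifiers, the numbers, and the helper names compile_rows and to_csv are all invented placeholders.

```python
import csv
import io


def compile_rows(labels, stats, co_term):
    """Gather (lecture, word, tf, idf, tf_idf) rows for one CO term.

    labels: {lecture: set of CO terms assigned by hand}
    stats:  {lecture: {word: (tf, idf, tf_idf)}}
    Rows are sorted by tf-idf, highest first.
    """
    rows = []
    for lecture in sorted(labels):
        if co_term in labels[lecture]:
            for word, (tf, idf, tfidf) in stats[lecture].items():
                rows.append((lecture, word, tf, idf, tfidf))
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["lecture", "word", "tf", "idf", "tf_idf"])
    writer.writerows(rows)
    return buf.getvalue()


# Invented placeholder data: two courses, hand-assigned labels, made-up scores.
labels = {
    "courseA/lec5": {"Normalization"},
    "courseA/lec6": {"Query Optimization"},
    "courseB/lec4": {"Normalization"},
}
stats = {
    "courseA/lec5": {"bcnf": (4, 1.1, 4.4), "sailors": (9, 0.7, 6.3)},
    "courseA/lec6": {"join": (5, 1.1, 5.5)},
    "courseB/lec4": {"bcnf": (3, 1.1, 3.3)},
}
rows = compile_rows(labels, stats, "Normalization")
csv_text = to_csv(rows)
```

Because each course has its own index, idf values from different courses are not directly comparable; the sketch simply places them side by side, as a single spreadsheet would.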
HAND-LABEL WORDS FROM LECTURES AS “IMPORTANT”
• Signature words were human-selected from Database Management Systems by Ramakrishnan and Gehrke, 3rd Ed.
• Use Find All/Replace All function in Excel to highlight all signature words that identify Normalization
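The highlighting step amounts to flagging, in a list ranked by tf-idf, which words belong to the hand-picked signature set. A minimal Python sketch of that idea follows; the signature words and the ranked list are invented examples, not the set selected from the textbook.

```python
def mark_signature(ranked_words, signature_words):
    """Pair each ranked word with a flag: is it a signature word?"""
    return [(word, word in signature_words) for word in ranked_words]


def hits_in_top_k(ranked_words, signature_words, k):
    """Count how many of the top-k ranked words are signature words."""
    return sum(1 for word in ranked_words[:k] if word in signature_words)


# Invented example: words ordered by descending tf-idf for one lecture.
ranked = ["sailors", "bcnf", "reserves", "dependency", "boats", "decomposition"]
# Invented stand-in for the human-selected Normalization vocabulary.
signature = {"bcnf", "dependency", "decomposition"}

marks = mark_signature(ranked, signature)
top3_hits = hits_in_top_k(ranked, signature, 3)
```

Counting hits in the top k gives a quick numeric check on whether high tf-idf words coincide with signature words, which is the question the highlighted spreadsheet answers by eye.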
INITIAL EFFORT: RESULTS
• Conclusions
  • Tf-idf is not a strong indicator
    • Cannot solely rely on tf-idf
  • ‘Running example’
    • While good for teaching
    • We don’t care about this data
  • Stemming is important
  • Use of phrases may help
NEXT STEPS
• Intersection of terms across all classes
  • May solve ‘running example’ problem
  • Compute average rank
  • Compute average tf-idf (?)
• Union all documents with the same CO label (union text from all the lectures on normalization, union text from all lectures on query optimization, etc.)
  • Look at tf-idf
• Consider various classification algorithms (looking to see if there are some implemented for Lucene)
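The first next step can be sketched as follows: keep only the words that appear under the same CO label in every course, then average each word's rank across courses. This is one possible reading of the bullet, not an implemented result; the function names and the scores are invented for illustration.

```python
def rank_words(tfidf):
    """{word: tf_idf} -> {word: rank}, where rank 1 is the highest tf-idf.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    ordered = sorted(tfidf, key=lambda w: (-tfidf[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def average_rank_of_shared_words(per_course):
    """per_course: {course: {word: tf_idf}} for lectures with one CO label.

    Returns [(word, average rank)] for words present in EVERY course,
    best (lowest) average rank first.
    """
    ranks = {course: rank_words(scores) for course, scores in per_course.items()}
    shared = set.intersection(*(set(r) for r in ranks.values()))
    averages = {
        word: sum(r[word] for r in ranks.values()) / len(ranks)
        for word in shared
    }
    return sorted(averages.items(), key=lambda kv: (kv[1], kv[0]))


# Invented scores for Normalization lectures in two hypothetical courses.
per_course = {
    "courseA": {"sailors": 6.3, "bcnf": 4.4, "dependency": 3.0},
    "courseB": {"employee": 5.0, "bcnf": 3.3, "dependency": 2.0},
}
shared_ranking = average_rank_of_shared_words(per_course)
```

In this toy data "sailors" and "employee" each top the ranking within a single course but drop out of the intersection, which is how the intersection might address the running-example problem.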