INF 141: Information Retrieval Discussion Session Week 2– Winter 2011 TA: Sara Javanmardi
Contact Information Sara Javanmardi • Email: sjavanma {at} uci.edu
Policies
• Discussion
  • Attendance is not mandatory but is highly recommended (we will discuss good practices for doing the projects). It is mandatory for the quiz sessions (Q1: 1/26, Q2: 2/9, Q3: 2/23, Q4: 3/9).
• Questions
  • Post your questions to the course forum.
  • Follow the course blog and comment on the posts.
• Assignments
  • Late assignments lose 1% per hour.
  • Bring questions about the assignment to the discussion session.
  • Questions sent in the last 24 hours before an assignment’s deadline might not receive answers from the teaching staff.
Policies
• Grading
  • If you have questions, please talk to the TA first, then to the instructor.
• Re-grade
  • Double-check your work before you bring it.
  • Submit the request within 1 week, accompanied by a clear explanation of what needs to be reconsidered and why.
Course Material • In addition to the book • Search Engines: Information Retrieval in Practice
Assignment 2
• Goals:
  • General and Extra Credit Questions
    • Try to read different sources and cite them.
  • Programming Questions
    • Write the code in Java.
    • Work on the efficiency of the code.
    • It should take less than 0.5 sec to extract the palindromes or lipograms.
Finding Longest Palindrome & Lipogram
• Palindrome
• Lipogram
• The sample input and output show a palindrome, a lipogram, and a rhopalic, respectively.
• Submit your output for the test input.
Finding The Longest Rhopalic
• A rhopalic is a sequence of words in which each word is one character longer than the previous one.
• Example: “I do not know where family doctors acquired illegibly perplexing handwriting; nevertheless, extraordinary pharmaceutical intellectuality, counterbalancing indecipherability, transcendentalises intercommunications incomprehensibleness”
• It can start with a word of any length.
• Words are separated by at least one space, other whitespace character, or punctuation mark.
• Sample Java Code
Extracting Rhopalics
• Starting from index 0, how many tokens after index 0 constitute a rhopalic?
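The question above can be sketched in Java as follows. This is a minimal sketch, not the course's sample code: the class and method names (`RhopalicFinder`, `tokenize`, `rhopalicLengthFrom`, `longestRhopalic`) are hypothetical, words are assumed to be runs of letters (so an apostrophe splits a word), and a single word is assumed to count as a rhopalic of length 1.

```java
import java.util.ArrayList;
import java.util.List;

class RhopalicFinder {
    /** Splits text into words; any run of non-letter characters is a separator. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {          // a leading separator yields one empty token
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Number of tokens, counted from index start, that form a rhopalic. */
    static int rhopalicLengthFrom(List<String> tokens, int start) {
        int count = 1;
        while (start + count < tokens.size()
                && tokens.get(start + count).length()
                   == tokens.get(start + count - 1).length() + 1) {
            count++;
        }
        return count;
    }

    /** Returns the longest rhopalic as a sublist of tokens (first one wins on ties). */
    static List<String> longestRhopalic(List<String> tokens) {
        int bestStart = 0;
        int bestLen = 0;
        int i = 0;
        while (i < tokens.size()) {
            int len = rhopalicLengthFrom(tokens, i);
            if (len > bestLen) {
                bestStart = i;
                bestLen = len;
            }
            // Any run starting inside this one ends at the same token and is
            // shorter, so the scan can jump straight past it.
            i += len;
        }
        return tokens.subList(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("I do not know where family doctors acquired");
        System.out.println(longestRhopalic(tokens));
    }
}
```

Because the scan jumps past each run, every token is examined a constant number of times, so the search is linear in the number of tokens.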
Longest Palindrome
• Example:
  • Input1: uka abb aacd ab ba
  • Output1: start=2, end=9
             start=13, end=17
• Which one is the longest? (Length = end - start + 1)
• [Diagram labels: Input String; Start & end index; Is Palindrome]
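Is palindrome
• Keep two pointers, i at the start index and j at the end index of the string.
• While i <= j: if input[i] == input[j], then i++ and j--; otherwise the string is not a palindrome.
• [Figure: the string abbbbba with pointers i and j]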
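A minimal Java sketch of the two-pointer check. The class name `TwoPointerPalindrome` is hypothetical. Skipping whitespace is an assumption, inferred from the Longest Palindrome example, where start=2, end=9 ("a abb aa") only reads the same in both directions if spaces are ignored.

```java
class TwoPointerPalindrome {
    /**
     * Checks whether input[start..end] (inclusive) is a palindrome.
     * Assumption: whitespace characters are ignored during the comparison.
     */
    static boolean isPalindrome(String input, int start, int end) {
        int i = start;
        int j = end;
        while (i < j) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;                       // skip whitespace on the left
            } else if (Character.isWhitespace(input.charAt(j))) {
                j--;                       // skip whitespace on the right
            } else if (input.charAt(i) != input.charAt(j)) {
                return false;              // mismatch: not a palindrome
            } else {
                i++;                       // match: move both pointers inward
                j--;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String input = "uka abb aacd ab ba";
        System.out.println(isPalindrome(input, 2, 9));
        System.out.println(isPalindrome(input, 13, 17));
    }
}
```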
Is palindrome
• Alternatively, loop over the string and, at each index m, assume that m is the middle of a palindrome: move one pointer forward and the other backward for as long as the characters match.
• [Figure: the string abbbbba with pointers i and j around the middle index m]
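The middle-outward idea can be sketched like this. The class name `CenterExpansion` is hypothetical, and the sketch compares raw characters (no whitespace skipping). It tries both an odd-length center (at m) and an even-length center (between m and m+1), which the slide does not spell out but which is needed to catch palindromes such as "abba".

```java
class CenterExpansion {
    /** Returns {start, end} (inclusive) of the longest palindromic substring of s. */
    static int[] longestPalindrome(String s) {
        int bestStart = 0;
        int bestEnd = -1;                  // empty range until something is found
        for (int m = 0; m < s.length(); m++) {
            // right == m: odd-length center; right == m + 1: even-length center
            for (int right = m; right <= m + 1; right++) {
                int i = m;
                int j = right;
                while (i >= 0 && j < s.length() && s.charAt(i) == s.charAt(j)) {
                    i--;                   // one pointer moves backward
                    j++;                   // the other moves forward
                }
                // The loop overshoots by one on each side: palindrome is s[i+1..j-1].
                if (j - i - 1 > bestEnd - bestStart + 1) {
                    bestStart = i + 1;
                    bestEnd = j - 1;
                }
            }
        }
        return new int[] {bestStart, bestEnd};
    }

    public static void main(String[] args) {
        int[] r = longestPalindrome("cbaaba");
        System.out.println("start=" + r[0] + ", end=" + r[1]);
    }
}
```

This runs in quadratic time in the worst case, which is what the suffix-tree approach is meant to improve on.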
Suffix Tree
• A naive solution has to run the palindrome check on every possible substring.
• A more efficient way is to use a suffix tree.
Lowest common ancestors
• A lot more can be gained from the suffix tree if we preprocess it so that we can answer lowest common ancestor (LCA) queries on it.
Why?
• The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.
• [Figure: suffix tree with edges labeled a, b, $, # and numbered leaves]
Let s = cbaaba$; then sr = abaabc#.
• [Figure: generalized suffix tree for s and sr, with numbered leaves]
Finding maximal palindromes
• A palindrome: caabaac, cbaabc
• We want to find all maximal palindromes in a string s.
• Let s = cbaaba, and let sr be the reverse of s.
• The maximal palindrome with center between i-1 and i is determined by the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr.
Maximal palindromes algorithm
• Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#.
• For every i, find the LCA of suffix i of s and suffix m-i+1 of sr.
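To make the LCP idea concrete, here is a sketch that keeps the structure of the algorithm but swaps the suffix-tree LCA query for a direct character-by-character LCP. That makes it quadratic in the worst case; answering each LCP with a constant-time LCA query on the generalized suffix tree is what would bring it down to linear time. The class and method names are hypothetical. Indices are 0-based and the terminators $ and # are left out, so the matching suffix of sr starts at m - i here (with m the length of s), which differs from the slide's position numbering.

```java
class MaximalPalindromes {
    /** Length of the longest common prefix of a[i..] and b[j..]. */
    static int lcp(String a, int i, String b, int j) {
        int k = 0;
        while (i + k < a.length() && j + k < b.length()
                && a.charAt(i + k) == b.charAt(j + k)) {
            k++;
        }
        return k;
    }

    /**
     * result[i] = length of the maximal even-length palindrome whose center
     * lies between positions i-1 and i of s (0-based); result[0] is unused.
     */
    static int[] evenPalindromeLengths(String s) {
        int m = s.length();
        String sr = new StringBuilder(s).reverse().toString();
        int[] result = new int[m];
        for (int i = 1; i < m; i++) {
            // s[i..] reads rightward from the center; sr[m-i..] is s[0..i-1]
            // reversed, i.e. it reads leftward from the center.
            result[i] = 2 * lcp(s, i, sr, m - i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(evenPalindromeLengths("cbaaba")));
    }
}
```

For s = cbaaba, the only non-zero entry is at i = 3, where the common prefix "ab" has length 2 and gives the palindrome baab.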
Read/Write From Files
• FileReader and FileWriter always use the system’s default character encoding. If this default is not appropriate (for example, when reading an XML file that specifies its own encoding), then reading and writing will be incorrect.
• See more: Reading and writing text files
• Scanner vs. BufferedReader
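One way to avoid the default-encoding problem is to wrap the file streams in InputStreamReader and OutputStreamWriter, which accept an encoding name. The sketch below uses only standard java.io classes. The class name `EncodingAwareIO` and its two methods are hypothetical and are not the course's IO.java.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class EncodingAwareIO {
    /** Writes text to filename using the named encoding, e.g. "UTF-8". */
    static void write(String text, String filename, String encoding) throws IOException {
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(filename), encoding))) {
            out.write(text);
        }
    }

    /** Reads the whole file into a String, decoding it with the named encoding. */
    static String read(String filename, String encoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), encoding))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {   // -1 signals end of file
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        write("caf\u00e9\n", "example.txt", "UTF-8");
        System.out.print(read("example.txt", "UTF-8"));
    }
}
```

Reading in 4096-character chunks rather than line by line keeps the original line breaks intact.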
Read/Write From Files
• You can call these methods from IO.java.
• To write:
    StringBuffer sb = new StringBuffer();
    IO.writeToFile(sb.toString(), outputFilename, "UTF-8");
• To read:
    String content = IO.getFileContent(inputFilename, "UTF-8");
    StringTokenizer stLine = new StringTokenizer(content, "\n");