Search Interfaces and String Manipulation. Lecture 5.
Outline • Search Interaction • String class and important methods
Text Search Interaction • Search interfaces vary a great deal • The search cues they accept • Keywords, controlled vocabularies • (Scholarly materials) • author names, institutional affiliations, etc.
Text Search Interaction • The way they interpret search cues • Cues may be matched against multiple fields versus a single field • The operators they accept • Boolean, exact match, expanded match, exact order, all terms present, broad versus narrow, etc. • The way results are presented
Text Search Interaction • Four-phase framework • Formulation • Action • Review of results • Refinement
Search Formulation • Which source to search? • Many resources, with different scope, content, depth of coverage, and time range
Search Formulation • Different sources offer different types of fields to search • Title, author, publisher, date, controlled vocabulary • Language, media type, local vocabularies (Medical/Health), shelving location, etc. [Consider the IUCAT example]
Search Formulation • For full-text resources there are considerations related to structured versus unstructured queries • In structured queries users can utilize simple operators, as in “computer manual” +Java, where the quotation marks and the plus sign are operators
Search Formulation • In structured searches users can specify word searches versus phrase searches • A phrase search, e.g., “air quality”, is often more precise than a word search on the separate words air and quality
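The difference between a word search and a phrase search can be sketched with plain String methods. This is only an illustration: the class name, method names, and the two sample texts are made up for this example, and the matching is substring-based rather than token-based, so it is cruder than what a real search engine does.

```java
public class PhraseVersusWordSearch {
    // Word search: every query word must occur somewhere in the text.
    // Simplification: substring matching, so "air" would also match "hair".
    public static boolean matchesAllWords(String text, String query) {
        String lower = text.toLowerCase();
        for (String word : query.toLowerCase().split("\\s+")) {
            if (!lower.contains(word)) {
                return false;
            }
        }
        return true;
    }

    // Phrase search: the words must appear together, in the given order.
    public static boolean matchesPhrase(String text, String phrase) {
        return text.toLowerCase().contains(phrase.toLowerCase());
    }

    public static void main(String[] args) {
        // Hypothetical sample texts, invented for this sketch.
        String first = "Air quality in the city improved last year";
        String second = "The quality of the recording was poor because of air leaks";

        System.out.println(matchesAllWords(first, "air quality"));
        System.out.println(matchesPhrase(first, "air quality"));
        System.out.println(matchesAllWords(second, "air quality"));
        System.out.println(matchesPhrase(second, "air quality"));
    }
}
```

The second text contains both words but not the phrase, so only the word search accepts it; that is the sense in which the phrase search is more precise.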
Search Formulation • In unstructured searches the words are used by the search engine in the “best” or the “most logical” way • However, the transformation heuristics are kept hidden from the user • For example, the words “bees and not honey” would simply be transformed into “bees honey” -- producing a query that is unintended by the user
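One way such a hidden transformation could come about is naive stop-word removal. The sketch below is an assumption made for illustration, not a description of any particular engine: the stop list and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NaiveQueryRewriter {
    // Hypothetical stop list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("and", "or", "not", "the", "a", "of"));

    // Lowercase the query, split it on whitespace, and drop stop words.
    public static String rewrite(String query) {
        List<String> kept = new ArrayList<>();
        for (String word : query.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // The user's negation is silently lost.
        System.out.println(rewrite("bees and not honey")); // prints "bees honey"
    }
}
```

Because "and" and "not" are treated as noise rather than as operators, the rewritten query asks for documents about both bees and honey, the opposite of what the user meant.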
Search Formulation • The problem with structured searches is that there are no standards for conducting such searches and users often face inconsistencies across interfaces • For example, usage of * is non-standard
Search Formulation Problem Summary • Sources • Fields • Structured versus Non-structured
Search Action • Search / Go / “I am feeling lucky” -- different variations of conducting searches • Query language, Query by example (names of fields, values of fields and find-like-this) and visual querying approaches
Search Action • Staged or dynamic querying is another approach
Search Preview • Display query adjustment feedback: • Misspelled terms • Stop words • Stemming • Capitalization • Word order • Word expansion
Search Preview • Display of intermediate results – proportion of query terms matching documents • Context of matches in documents
Search Refinement • History features • Terms searched • Operators and constraints used • Number of documents retrieved per search • Resubmission of search cues or searches
Search Refinement • Relevance feedback • The user should be able to mark which retrieved documents are relevant, and the system should use that information to refine subsequent searches
Search Result Presentation • Integrated view • Text information from multiple sources • Text and data in other formats • Alternative views • Conceptual/abstract vs. URLs • Visualizations
Integrated Result Presentation • Usually requires taking advantage of certain metadata or an API • Amazon A9 provides a nice overview • Must be careful in the layout of information presentation • Some pre-processing is needed (e.g., identifying people in Ask Jeeves) • Allow customization of the layout
Providing abstract/broad views of data • One common technique is clustering • Requires identification of key terms, their distribution, and the ways they “clump” or are separated from each other • Techniques are generally referred to as unsupervised learning • No training is required
Automatic Term Extraction and Clustering • Terms are counted • Frequency in documents • Based on Zipf distribution principle: Rank of terms (descending order) times frequency is roughly a constant • Document frequency • Inverse document frequency principle: Terms appearing in relatively few documents in the collection are generally good terms • Take a look at the VCGS demo (under the LAIR research page)
An Example: The tiny-text corpus • Doc1:Human machine interface for PARC computer applications • Doc2:A survey of user opinion of computer system response time • Doc3:The EPS user interface management system • Doc4:System and human system engineering testing of EPS • Doc5:Relation of user perceived response time to error measurement • Doc6:The generation of random, binary, ordered trees • Doc7:The intersection graph of paths in trees • Doc8:Graph minors IV: Widths of trees and well-quasi-ordering • Doc9:Graph minors: A survey
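Document frequency can be counted directly on the tiny-text corpus. The sketch below makes several assumptions that are not in the slides: tokens are lowercased runs of letters, and inverse document frequency is computed as log(N / df), which is one common variant among several. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TinyTextDocumentFrequency {
    public static final String[] DOCS = {
        "Human machine interface for PARC computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random, binary, ordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV: Widths of trees and well-quasi-ordering",
        "Graph minors: A survey"
    };

    // Count, for each term, the number of documents that contain it.
    public static Map<String, Integer> documentFrequency(String[] docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>(); // count each term once per document
            for (String term : doc.toLowerCase().split("[^a-z]+")) {
                if (!term.isEmpty() && seen.add(term)) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Assumed IDF variant: natural log of (collection size / document frequency).
    public static double idf(int totalDocs, int docFreq) {
        return Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        Map<String, Integer> df = documentFrequency(DOCS);
        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            System.out.printf("%-12s df=%d idf=%.3f%n",
                entry.getKey(), entry.getValue(), idf(DOCS.length, entry.getValue()));
        }
    }
}
```

Terms such as "parc", which occurs in a single document, receive a higher IDF than "of", which occurs in most of them, in line with the principle that terms appearing in relatively few documents are good terms.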
Why String Class? • string = a sequence of characters, such as a word • String (class) = a complex object type used to represent such sequences in Java • Text processing is used in many utility programs, as well as in information retrieval and computational linguistics • The String class allows storage of a sequence of characters as an individual unit called a string (generally a word or a token)
Why do we really need the String Class? • Java provides a String class as a powerful way to manipulate characters • The String class offers numerous built-in ways of manipulating strings (check the API documentation!) • The alternatives, namely a char array or a byte array, provide only limited means of handling strings
Different Ways of Instantiating a String • In the declaration area: String s1 = "This Java thing bites!"; • Declaration, then initialization in a method body: String s1; public void init() { s1 = new String("Java"); ...
String Arrays Declaration • Declaration and initialization can be combined: • String b[] = new String[100], a[] = new String[27]; • Two string arrays are being declared; one can contain 100 elements and the other 27
String Arrays • Arrays can be passed to methods: • Example:
String sort_list[] = {"my", "life", "well"};
bubblesort(sort_list);

private void bubblesort(String sort_holder[]) {
    // sort method
}
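The slide leaves the sort method as a stub. One possible body is sketched below, assuming a plain ascending bubble sort that orders strings with compareTo; the class name and the early-exit flag are choices made for this example, not taken from the slides.

```java
import java.util.Arrays;

public class BubbleSortStrings {
    // Sorts the array in place. Arrays are passed by reference,
    // so the caller sees the reordered elements.
    public static void bubbleSort(String[] items) {
        for (int pass = 0; pass < items.length - 1; pass++) {
            boolean swapped = false;
            for (int i = 0; i < items.length - 1 - pass; i++) {
                if (items[i].compareTo(items[i + 1]) > 0) {
                    String temp = items[i];
                    items[i] = items[i + 1];
                    items[i + 1] = temp;
                    swapped = true;
                }
            }
            if (!swapped) {
                break; // no swaps in this pass: already sorted
            }
        }
    }

    public static void main(String[] args) {
        String[] sortList = {"my", "life", "well"};
        bubbleSort(sortList);
        System.out.println(Arrays.toString(sortList)); // prints [life, my, well]
    }
}
```

Note that compareTo orders by Unicode value, so uppercase letters sort before lowercase ones; compareToIgnoreCase would be the alternative for case-insensitive ordering.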
Arrays • A simple Java program that uses a char array:
public class characs {
    public static void main( String[] args) {
        char char_list[] = {'a', 'b', 'c', 'd', 'e'};
        for (int i=0; i < char_list.length; i++)
            System.out.println(char_list[i]);
    } // end of main method
} // end of characs class
Arrays • A simple Java program that uses a String array:
public class children {
    public static void main( String[] args) {
        String[] names = {"Tom", "James", "Francis", "Debbie"};
        for (int i=0; i < names.length; i++)
            System.out.println(names[i]);
    } // end method main
} // end class children
String • In Java textual manipulation is usually handled using the class String and other related classes (e.g., StringBuffer and StringTokenizer) • String objects are immutable: they cannot be modified once created • Let’s look at some examples …
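Immutability can be seen in a few lines. In this minimal sketch (the class and method names are made up for illustration), methods such as toUpperCase and concat return new String objects and leave the original untouched.

```java
public class StringImmutability {
    // Returns a new String; the argument itself is never modified.
    public static String shout(String text) {
        return text.toUpperCase().concat("!");
    }

    public static void main(String[] args) {
        String greeting = "hello";
        String loud = shout(greeting);
        System.out.println(greeting); // prints hello (unchanged)
        System.out.println(loud);     // prints HELLO!
    }
}
```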
String Constructors
public class StringConstructors {
    public static void main( String args[] ) {
        char charArray[] = { 'b', 'i', 'r', 't', 'h', ' ', 'd', 'a', 'y' };
        String s = new String( "hello" );

        // use String constructors
        String s1 = new String();                    // empty string
        String s2 = new String( s );                 // copy of s
        String s3 = new String( charArray );         // "birth day"
        String s4 = new String( charArray, 6, 3 );   // 3 characters starting at index 6: "day"

        System.out.printf( "s1 = %s\ns2 = %s\ns3 = %s\ns4 = %s\n",
            s1, s2, s3, s4 ); // display strings
    } // end main
} // end class StringConstructors
Comparing String Objects
public class StringCompare {
    public static void main( String args[] ) {
        String s1 = new String( "hello" ); // s1 is a copy of "hello"
        String s2 = "goodbye";
        String s3 = "Happy Birthday";
        String s4 = "happy birthday";

        System.out.printf( "s1 = %s\ns2 = %s\ns3 = %s\ns4 = %s\n\n", s1, s2, s3, s4 );

        // test for equality
        if ( s1.equals( "hello" ) ) // true
            System.out.println( "s1 equals \"hello\"" );
        else
            System.out.println( "s1 does not equal \"hello\"" );

        // test for equality with ==
        if ( s1 == "hello" ) // false; they are not the same object
            System.out.println( "s1 is the same object as \"hello\"" );
        else
            System.out.println( "s1 is not the same object as \"hello\"" );

        // test for equality (ignore case)
        if ( s3.equalsIgnoreCase( s4 ) ) // true
            System.out.printf( "%s equals %s with case ignored\n", s3, s4 );
        else
            System.out.println( "s3 does not equal s4" );

        // test compareTo
        System.out.printf( "\ns1.compareTo( s2 ) is %d", s1.compareTo( s2 ) );
        System.out.printf( "\ns2.compareTo( s1 ) is %d", s2.compareTo( s1 ) );
        System.out.printf( "\ns1.compareTo( s1 ) is %d", s1.compareTo( s1 ) );
        System.out.printf( "\ns3.compareTo( s4 ) is %d", s3.compareTo( s4 ) );
        System.out.printf( "\ns4.compareTo( s3 ) is %d\n\n", s4.compareTo( s3 ) );

        // test regionMatches (case sensitive)
        if ( s3.regionMatches( 0, s4, 0, 5 ) )
            System.out.println( "First 5 characters of s3 and s4 match" );
        else
            System.out.println( "First 5 characters of s3 and s4 do not match" );

        // test regionMatches (ignore case)
        if ( s3.regionMatches( true, 0, s4, 0, 5 ) )
            System.out.println( "First 5 characters of s3 and s4 match" );
        else
            System.out.println( "First 5 characters of s3 and s4 do not match" );
    } // end main
} // end class StringCompare
Locating Substring • Sometimes it is useful to determine if a character or a word appears in a sentence or phrase • Java offers the methods indexOf and lastIndexOf to determine this; both return -1 when nothing is found • An example follows …
Locating a character or word in a string
public class StringIndexMethods {
    public static void main( String args[] ) {
        String letters = "abcdefghijklmabcdefghijklm";

        // test indexOf to locate a character in a string
        System.out.printf( "'c' is located at index %d\n", letters.indexOf( 'c' ) );
        System.out.printf( "'a' is located at index %d\n", letters.indexOf( 'a', 1 ) );
        System.out.printf( "'$' is located at index %d\n\n", letters.indexOf( '$' ) );

        // test lastIndexOf to find a character in a string
        System.out.printf( "Last 'c' is located at index %d\n", letters.lastIndexOf( 'c' ) );
        System.out.printf( "Last 'a' is located at index %d\n", letters.lastIndexOf( 'a', 25 ) );
        System.out.printf( "Last '$' is located at index %d\n\n", letters.lastIndexOf( '$' ) );

        // test indexOf to locate a substring in a string
        System.out.printf( "\"def\" is located at index %d\n", letters.indexOf( "def" ) );
        System.out.printf( "\"def\" is located at index %d\n", letters.indexOf( "def", 7 ) );
        System.out.printf( "\"hello\" is located at index %d\n\n", letters.indexOf( "hello" ) );

        // test lastIndexOf to find a substring in a string
        System.out.printf( "Last \"def\" is located at index %d\n", letters.lastIndexOf( "def" ) );
        System.out.printf( "Last \"def\" is located at index %d\n", letters.lastIndexOf( "def", 25 ) );
        System.out.printf( "Last \"hello\" is located at index %d\n", letters.lastIndexOf( "hello" ) );
    } // end main
} // end class StringIndexMethods
StringBuffer • As String objects are not mutable we need a different Java object to help us gain more control over “strings” • Let’s take a look at an example
StringBuffer
public class StringBufferChars {
    public static void main( String args[] ) {
        StringBuffer buffer = new StringBuffer( "hello there" );

        System.out.printf( "buffer = %s\n", buffer.toString() );
        System.out.printf( "Character at 0: %s\nCharacter at 4: %s\n\n",
            buffer.charAt( 0 ), buffer.charAt( 4 ) );

        // copy the buffer's characters into a char array
        char charArray[] = new char[ buffer.length() ];
        buffer.getChars( 0, buffer.length(), charArray, 0 );
        System.out.print( "The characters are: " );

        for ( char character : charArray )
            System.out.print( character );

        // modify the buffer in place
        buffer.setCharAt( 0, 'H' );
        buffer.setCharAt( 6, 'T' );
        System.out.printf( "\n\nbuf = %s", buffer.toString() );

        buffer.reverse();
        System.out.printf( "\n\nbuf = %s\n", buffer.toString() );
    } // end main
} // end class StringBufferChars
StringBuffer • Allows us to modify string values • Append • Insert • Delete • Provides other useful methods such as reverse and setCharAt
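The append, insert, and delete operations can be sketched in the same style. The class and method names below are hypothetical; the StringBuffer calls themselves are standard.

```java
public class StringBufferEdits {
    // Applies a fixed series of in-place edits and returns the result.
    public static String edit(String start) {
        StringBuffer buffer = new StringBuffer(start);
        buffer.append(" there");    // add text at the end
        buffer.insert(0, "Oh, ");   // add text at index 0
        buffer.delete(0, 4);        // remove indices 0 through 3, i.e. "Oh, "
        buffer.setCharAt(0, 'H');   // replace the first character
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(edit("hello")); // prints Hello there
    }
}
```

Every call changes the same buffer object; no intermediate String objects are created until toString is called.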
StringTokenizer • Often we need a way to quickly separate words or “tokens” as found in text • Tokens are usually separated by spaces, tabs, or newline characters • Additional functions are desired, such as a way to specify which characters separate tokens and to determine how many tokens there are
An Example
import java.util.Scanner;
import java.util.StringTokenizer;

public class TokenTest {
    public static void main( String args[] ) {
        // get user input
        Scanner scanner = new Scanner( System.in );
        System.out.println( "Enter a sentence to be reversed" );
        String sentence = scanner.nextLine();

        // display tokens in reverse order
        System.out.println( "\nReversed sentence:" );
        StringTokenizer tokens = new StringTokenizer( sentence );
        int count = tokens.countTokens();
        String reverse[] = new String[ count ];
        int index = count - 1;

        // fill the array from the back so the tokens end up reversed
        while ( tokens.hasMoreTokens() )
            reverse[ index-- ] = tokens.nextToken();

        for ( int k = 0; k < count; k++ )
            System.out.printf( "%s ", reverse[ k ] );

        System.out.println();
    } // end main
} // end class TokenTest
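StringTokenizer also accepts a second constructor argument listing the delimiter characters. The sketch below uses that form with countTokens; the class name, method name, and sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterTokens {
    // Splits text on any of the given delimiter characters and trims each token.
    public static List<String> split(String text, String delimiters) {
        StringTokenizer tokens = new StringTokenizer(text, delimiters);
        List<String> result = new ArrayList<>(tokens.countTokens());
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Commas and semicolons both act as separators here.
        List<String> fields = split("title, author; publisher,date", ",;");
        System.out.println(fields.size() + " tokens: " + fields);
    }
}
```

Each character in the delimiter string is treated as a separate one-character separator, so ",;" means "comma or semicolon", not the two-character sequence.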