Search Interfaces and String Manipulation

jessie
Presentation Transcript


  1. Search Interfaces and String Manipulation Lecture 5

  2. Outline • Search Interaction • String class and important methods

  3. Basic Search Engine Interfaces

  4. Basic Search Engine Interfaces

  5. Basic Search Interfaces

  6. Interfaces to Scholarly Materials

  7. Interfaces to Scholarly Materials

  8. Text Search Interaction • Search interfaces vary a great deal • The search cues they accept • Keywords, controlled vocabularies • (Scholarly materials) • author names, institutional affiliations, etc.

  9. Text Search Interaction • The way they interpret search cues • Cues may be matched against multiple fields versus a single field • The operators they accept • Boolean, exact match, expanded match, exact order, all terms present, broad versus narrow, etc. • The way results are presented

  10. Shneiderman, Croft, and Byrd

  11. Text Search Interaction • Four-phase framework • Formulation • Action • Review of results • Refinement

  12. Search Formulation • Which source to search? • Many resources, with different scope, content, depth of coverage, and time range

  13. Search Formulation • Different sources offer different types of fields to search • Title, author, publisher, date, controlled vocabulary • Language, media type, local vocabularies (Medical/Health), shelving location, etc. [Consider the IUCAT example]

  14. Search Formulation • For full text resources there are considerations related to structured versus unstructured queries • In structured queries users can utilize simple operators such as “computer manual” +Java – the quote and plus signs are operators

  15. Search Formulation • In structured searches users can specify word searches versus phrase searches • Phrase searches, e.g., “air quality”, are often more precise than word searches on the individual words air and quality
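A minimal sketch of why a phrase search is narrower than a word search. It assumes plain case-insensitive substring matching, which is a simplification (a real engine would match whole tokens against an index); the class and method names are hypothetical.

```java
import java.util.Locale;

public class PhraseVsWordSearch {

    // Word search: every query word must occur somewhere in the text.
    // Note: contains() matches substrings, so "air" would also match "hair".
    static boolean matchesAllWords(String text, String... words) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String word : words) {
            if (!lower.contains(word.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // Phrase search: the words must occur adjacent to each other and in order.
    static boolean matchesPhrase(String text, String phrase) {
        return text.toLowerCase(Locale.ROOT).contains(phrase.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String doc1 = "Air quality in the city improved";
        String doc2 = "The quality of the air filter matters";

        // Both documents contain the two words ...
        System.out.println(matchesAllWords(doc1, "air", "quality")); // true
        System.out.println(matchesAllWords(doc2, "air", "quality")); // true

        // ... but only the first contains the phrase.
        System.out.println(matchesPhrase(doc1, "air quality"));      // true
        System.out.println(matchesPhrase(doc2, "air quality"));      // false
    }
}
```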

  16. Search Formulation • In unstructured searches the words are used by the search engine in the “best” or the “most logical” way • However, the transformation heuristics are kept hidden from the user • For example, the words “bees and not honey” would simply be transformed into “bees honey” -- producing a query that is unintended by the user
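One way the “bees and not honey” transformation could come about is stop-word removal. The sketch below assumes that “and” and “not” are on the engine's stop list; the actual heuristics of any given engine are hidden, so both the stop list and the class name are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {

    // Hypothetical stop list; real engines use their own (unpublished) lists.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "not", "the", "a", "of"));

    // Lower-cases the query, splits it on whitespace, and drops stop words.
    static String simplify(String query) {
        List<String> kept = new ArrayList<>();
        for (String word : query.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // The user's intended negation is silently lost.
        System.out.println(simplify("bees and not honey")); // bees honey
    }
}
```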

  17. Search Formulation • The problem with structured searches is that there are no standards for conducting such searches and users often face inconsistencies across interfaces • For example, usage of * is non-standard

  18. Search Formulation Problem Summary • Sources • Fields • Structured versus Non-structured

  19. Search Action • Search / Go / “I am feeling lucky” -- different variations of conducting searches • Query language, Query by example (names of fields, values of fields and find-like-this) and visual querying approaches

  20. Search Action • Staged or dynamic querying is another approach

  21. Search Preview • Display query adjustment feedback: • Misspelled terms • Stop words • Stemming • Capitalization • Word order • Word expansion

  22. Search Preview • Display of intermediate results – proportion of query terms matching documents • Context of matches in documents

  23. Search Refinement • History features • Terms searched • Operators and constraints used • Number of documents retrieved per search • Resubmission of search cues or searches

  24. Search Refinement • Relevance feedback • Users should be able to mark which documents they found relevant, and the system should use that information to refine searches

  25. Search Result Presentation • Integrated view • Text information from multiple sources • Text and data in other formats • Alternative views • Conceptual/abstract vs. URLs • Visualizations

  26. Integrated Result Presentation • Usually requires taking advantage of certain metadata or an API • AMAZON A9 provides a nice overview • Must be careful in the layout of information presentation • Some pre-processing is needed (e.g., identifying people in ASK Jeeves) • Allow customization of the layout

  27. Providing abstract/broad views of data • One common technique is clustering • Requires identification of key terms, their distribution, and the ways they “clump” or are separated from each other • Techniques are generally referred to as unsupervised learning • No training is required

  28. Automatic Term Extraction and Clustering • Terms are counted • Frequency in documents • Based on Zipf distribution principle: Rank of terms (descending order) times frequency is roughly a constant • Document frequency • Inverse document frequency principle: Terms appearing in relatively few documents in the collection are generally good terms • Take a look at the VCGS demo (under the LAIR research page)

  29. An Example: The tiny-text corpus • Doc1:Human machine interface for PARC computer applications • Doc2:A survey of user opinion of computer system response time • Doc3:The EPS user interface management system • Doc4:System and human system engineering testing of EPS • Doc5:Relation of user perceived response time to error measurement • Doc6:The generation of random, binary, ordered trees • Doc7:The intersection graph of paths in trees • Doc8:Graph minors IV: Widths of trees and well-quasi-ordering • Doc9:Graph minors: A survey
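The document-frequency idea can be tried on the tiny-text corpus. This sketch counts, for each term, the number of documents it appears in, and computes an inverse document frequency as ln(N / df). That formula is one common variant and is an assumption here, as are the tokenization rule (lower-case, split on non-letters) and all class and method names.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TinyCorpusStats {

    static final String[] DOCS = {
        "Human machine interface for PARC computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random, binary, ordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV: Widths of trees and well-quasi-ordering",
        "Graph minors: A survey"
    };

    // Counts in how many documents each term occurs (each document counts once per term).
    static Map<String, Integer> documentFrequency(String[] docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String term : doc.toLowerCase().split("[^a-z]+")) {
                if (!term.isEmpty() && seen.add(term)) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Assumed IDF variant: natural log of (collection size / document frequency).
    static double idf(int df, int collectionSize) {
        return Math.log((double) collectionSize / df);
    }

    public static void main(String[] args) {
        Map<String, Integer> df = documentFrequency(DOCS);
        for (String term : new String[] {"system", "user", "human", "parc", "of"}) {
            int count = df.get(term);
            System.out.printf("%-8s df=%d idf=%.3f%n", term, count, idf(count, DOCS.length));
        }
    }
}
```

Terms such as "parc" that occur in a single document get the highest IDF, while a function word like "of" that occurs in many documents scores low, which is the sense in which rarer terms are "good" terms.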

  30. Broad Views of Data: Basic Spring-Graph Presentation

  31. Broad Views of Data: Improved Flat Clustering

  32. Broad views of data: Scatter-Gather

  33. Broad View of “Use” or “Users”

  34. Why the String Class? • string = a word • String (class) = an object type used to represent words in Java • Text processing is used in many utility programs, as well as in information retrieval and computational linguistics • The String class stores a sequence of characters as a single unit called a string (generally a word or a token)

  35. Why do we really need the String Class? • Java provides a String class as a powerful way to manipulate characters • The String class offers numerous built-in ways of manipulating strings (check the API documentation!) • The other ways, namely a char array or a byte array, provide only limited means of handling strings

  36. Different Ways of Instantiating a String • In the declaration area:

          String s1 = "This Java thing bites!";

      • Declaration and body initialization:

          String s1;

          public void init() {
              s1 = new String("Java");
              ...

  37. String Arrays Declaration • Declaration and initialization can be combined: • String b[] = new String[100], a[] = new String[27]; • Two String arrays are declared: one can hold 100 elements and the other 27

  38. String Arrays • Arrays can be passed to methods: • Example:

          String sort_list[] = {"my", "life", "well"};
          bubblesort(sort_list);

          private void bubblesort(String sort_holder[]) {
              // sort method
          }
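The bubblesort method on the slide is left as a stub. One possible body, sketched here as an assumption rather than the lecture's own code, orders the elements with String.compareTo; the class name is hypothetical. Because arrays are passed by reference, the caller's array is sorted in place.

```java
import java.util.Arrays;

public class StringBubbleSort {

    // Sorts the array in place into ascending (Unicode) order.
    static void bubbleSort(String[] items) {
        for (int pass = 0; pass < items.length - 1; pass++) {
            // After each pass the largest remaining element has "bubbled" to the end.
            for (int i = 0; i < items.length - 1 - pass; i++) {
                if (items[i].compareTo(items[i + 1]) > 0) {
                    String tmp = items[i];
                    items[i] = items[i + 1];
                    items[i + 1] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        String[] sortList = {"my", "life", "well"};
        bubbleSort(sortList);
        System.out.println(Arrays.toString(sortList)); // [life, my, well]
    }
}
```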

  39. Arrays • A simple Java program that uses a char array:

          public class characs {
              public static void main( String[] args ) {
                  char char_list[] = {'a', 'b', 'c', 'd', 'e'};
                  for (int i = 0; i < char_list.length; i++)
                      System.out.println(char_list[i]);
              } // end of main method
          } // end of characs class

  40. Arrays • A simple Java program that uses a String array:

          public class children {
              public static void main( String[] args ) {
                  String[] names = {"Tom", "James", "Francis", "Debbie"};
                  for (int i = 0; i < names.length; i++)
                      System.out.println(names[i]);
              } // end method main
          } // end class children

  41. String • In Java textual manipulation is usually handled using the class String and other related classes (e.g., StringBuffer and StringTokenizer) • String objects are immutable: they cannot be modified once created • Let’s look at some examples …
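A small sketch of what immutability means in practice: methods such as toUpperCase and concat return a new String and leave the original untouched. The class and method names are hypothetical.

```java
public class ImmutabilityDemo {

    // Each call below returns a NEW String; the argument s is never modified.
    static String decorate(String s) {
        return s.toUpperCase().concat("!");
    }

    public static void main(String[] args) {
        String original = "java";
        String changed = decorate(original);

        System.out.println(original); // java
        System.out.println(changed);  // JAVA!
    }
}
```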

  42. String Constructors

          public class StringConstructors {
              public static void main( String args[] ) {
                  char charArray[] = { 'b', 'i', 'r', 't', 'h', ' ', 'd', 'a', 'y' };
                  String s = new String( "hello" );

                  // use String constructors
                  String s1 = new String();
                  String s2 = new String( s );
                  String s3 = new String( charArray );
                  String s4 = new String( charArray, 6, 3 );

                  System.out.printf( "s1 = %s\ns2 = %s\ns3 = %s\ns4 = %s\n",
                      s1, s2, s3, s4 ); // display strings
              } // end main
          } // end class StringConstructors

  43. Comparing String Objects

          public class StringCompare {
              public static void main( String args[] ) {
                  String s1 = new String( "hello" ); // s1 is a copy of "hello"
                  String s2 = "goodbye";
                  String s3 = "Happy Birthday";
                  String s4 = "happy birthday";

                  System.out.printf( "s1 = %s\ns2 = %s\ns3 = %s\ns4 = %s\n\n", s1, s2, s3, s4 );

                  // test for equality
                  if ( s1.equals( "hello" ) ) // true
                      System.out.println( "s1 equals \"hello\"" );
                  else
                      System.out.println( "s1 does not equal \"hello\"" );

                  // test for equality with ==
                  if ( s1 == "hello" ) // false; they are not the same object
                      System.out.println( "s1 is the same object as \"hello\"" );
                  else
                      System.out.println( "s1 is not the same object as \"hello\"" );

                  // test for equality (ignore case)
                  if ( s3.equalsIgnoreCase( s4 ) ) // true
                      System.out.printf( "%s equals %s with case ignored\n", s3, s4 );
                  else
                      System.out.println( "s3 does not equal s4" );

                  // test compareTo
                  System.out.printf( "\ns1.compareTo( s2 ) is %d", s1.compareTo( s2 ) );
                  System.out.printf( "\ns2.compareTo( s1 ) is %d", s2.compareTo( s1 ) );
                  System.out.printf( "\ns1.compareTo( s1 ) is %d", s1.compareTo( s1 ) );
                  System.out.printf( "\ns3.compareTo( s4 ) is %d", s3.compareTo( s4 ) );
                  System.out.printf( "\ns4.compareTo( s3 ) is %d\n\n", s4.compareTo( s3 ) );

                  // test regionMatches (case sensitive)
                  if ( s3.regionMatches( 0, s4, 0, 5 ) )
                      System.out.println( "First 5 characters of s3 and s4 match" );
                  else
                      System.out.println( "First 5 characters of s3 and s4 do not match" );

                  // test regionMatches (ignore case)
                  if ( s3.regionMatches( true, 0, s4, 0, 5 ) )
                      System.out.println( "First 5 characters of s3 and s4 match" );
                  else
                      System.out.println( "First 5 characters of s3 and s4 do not match" );
              } // end main
          } // end class StringCompare
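A condensed sketch of the three comparison styles used in this example: == compares object identity, equals compares contents, and compareTo returns the difference between the first pair of characters that differ (or the difference in length if one string is a prefix of the other). The class and helper names are hypothetical.

```java
public class CompareDemo {

    // Identity comparison: true only if both references point to the same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // Reduces compareTo to -1, 0, or 1 so that only the ordering is reported.
    static int order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        String copy = new String("hello"); // a distinct object with the same contents

        System.out.println(sameObject(copy, "hello"));          // false
        System.out.println(copy.equals("hello"));               // true
        System.out.println(sameObject(copy.intern(), "hello")); // true: intern() returns the pooled literal

        System.out.println("hello".compareTo("goodbye"));                 // 1   ('h' - 'g')
        System.out.println("Happy Birthday".compareTo("happy birthday")); // -32 ('H' - 'h')
        System.out.println(order("goodbye", "hello"));                    // -1
    }
}
```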

  44. Locating Substring • Sometimes it is useful to determine if a character or a word appears in a sentence or phrase • Java offers the method indexOf and lastIndexOf to determine this • An example follows …

  45. Locating a character or word in a string

          public class StringIndexMethods {
              public static void main( String args[] ) {
                  String letters = "abcdefghijklmabcdefghijklm";

                  // test indexOf to locate a character in a string
                  System.out.printf( "'c' is located at index %d\n", letters.indexOf( 'c' ) );
                  System.out.printf( "'a' is located at index %d\n", letters.indexOf( 'a', 1 ) );
                  System.out.printf( "'$' is located at index %d\n\n", letters.indexOf( '$' ) );

                  // test lastIndexOf to find a character in a string
                  System.out.printf( "Last 'c' is located at index %d\n", letters.lastIndexOf( 'c' ) );
                  System.out.printf( "Last 'a' is located at index %d\n", letters.lastIndexOf( 'a', 25 ) );
                  System.out.printf( "Last '$' is located at index %d\n\n", letters.lastIndexOf( '$' ) );

                  // test indexOf to locate a substring in a string
                  System.out.printf( "\"def\" is located at index %d\n", letters.indexOf( "def" ) );
                  System.out.printf( "\"def\" is located at index %d\n", letters.indexOf( "def", 7 ) );
                  System.out.printf( "\"hello\" is located at index %d\n\n", letters.indexOf( "hello" ) );

                  // test lastIndexOf to find a substring in a string
                  System.out.printf( "Last \"def\" is located at index %d\n", letters.lastIndexOf( "def" ) );
                  System.out.printf( "Last \"def\" is located at index %d\n", letters.lastIndexOf( "def", 25 ) );
                  System.out.printf( "Last \"hello\" is located at index %d\n", letters.lastIndexOf( "hello" ) );
              } // end main
          } // end class StringIndexMethods
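The two-argument form indexOf(str, fromIndex) can be called in a loop to collect every occurrence of a word, which is a common building block in text search. A minimal sketch follows; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class FindAll {

    // Returns the start index of every occurrence of target in text (overlaps included).
    static List<Integer> findAll(String text, String target) {
        List<Integer> positions = new ArrayList<>();
        if (target.isEmpty()) {
            return positions; // an empty target would match at every index; treated as no match here
        }
        int from = text.indexOf(target);
        while (from >= 0) {                        // indexOf returns -1 when nothing more is found
            positions.add(from);
            from = text.indexOf(target, from + 1); // resume just past the previous hit
        }
        return positions;
    }

    public static void main(String[] args) {
        String letters = "abcdefghijklmabcdefghijklm";
        System.out.println(findAll(letters, "def"));   // [3, 16]
        System.out.println(findAll(letters, "hello")); // []
    }
}
```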

  46. StringBuffer • As String objects are not mutable we need a different Java object to help us gain more control over “strings” • Let’s take a look at an example

  47. StringBuffer

          public class StringBufferChars {
              public static void main( String args[] ) {
                  StringBuffer buffer = new StringBuffer( "hello there" );

                  System.out.printf( "buffer = %s\n", buffer.toString() );
                  System.out.printf( "Character at 0: %s\nCharacter at 4: %s\n\n",
                      buffer.charAt( 0 ), buffer.charAt( 4 ) );

                  char charArray[] = new char[ buffer.length() ];
                  buffer.getChars( 0, buffer.length(), charArray, 0 );
                  System.out.print( "The characters are: " );

                  for ( char character : charArray )
                      System.out.print( character );

                  buffer.setCharAt( 0, 'H' );
                  buffer.setCharAt( 6, 'T' );
                  System.out.printf( "\n\nbuf = %s", buffer.toString() );

                  buffer.reverse();
                  System.out.printf( "\n\nbuf = %s\n", buffer.toString() );
              } // end main
          } // end class StringBufferChars

  48. StringBuffer • Allows us to modify string values • Append • Insert • Delete • Provides other useful methods such as reverse and setCharAt
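A short sketch of the append, insert, and delete operations listed above, all applied to the same StringBuffer object. The class and method names are hypothetical.

```java
public class StringBufferEdits {

    // Applies append, insert, and delete to one buffer and returns the final text.
    static String edit() {
        StringBuffer buffer = new StringBuffer("hello");
        buffer.append(" there");   // "hello there"
        buffer.insert(6, "out ");  // "hello out there" (inserted before index 6)
        buffer.delete(0, 6);       // "out there" (removes indices 0 through 5)
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(edit()); // out there
    }
}
```

Unlike the String methods, these calls change the buffer itself rather than returning a new object each time.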

  49. StringTokenizer • Often we need a way to quickly separate words or “tokens” as found in text • Tokens are usually separated by spaces, tabs, or new line characters • Additional functions are desirable, such as a way to specify which characters separate tokens and to find out how many tokens there are
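StringTokenizer covers both of these needs: a second constructor argument names the delimiter characters, a third asks for the delimiters to be returned as tokens too, and countTokens reports how many tokens remain. A minimal sketch, with a hypothetical class and helper method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {

    // Splits text on any character in delimiters; optionally keeps the delimiters as tokens.
    static List<String> tokenize(String text, String delimiters, boolean keepDelimiters) {
        StringTokenizer tokens = new StringTokenizer(text, delimiters, keepDelimiters);
        List<String> result = new ArrayList<>(tokens.countTokens()); // number of tokens remaining
        while (tokens.hasMoreTokens()) {
            result.add(tokens.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Comma, semicolon, and space all act as separators.
        System.out.println(tokenize("air,quality;index report", ",; ", false));
        // prints [air, quality, index, report]

        // With keepDelimiters = true, each separator is returned as its own token.
        System.out.println(tokenize("a,b", ",", true));
        // prints [a, ,, b]
    }
}
```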

  50. An Example

          import java.util.Scanner;
          import java.util.StringTokenizer;

          public class TokenTest {
              public static void main( String args[] ) {
                  // get user input
                  Scanner scanner = new Scanner( System.in );
                  System.out.println( "Enter a sentence to be reversed" );
                  String sentence = scanner.nextLine();

                  // split the sentence into tokens and store them in reverse order
                  System.out.println( "\nReversed sentence:" );
                  StringTokenizer tokens = new StringTokenizer( sentence );
                  int count = tokens.countTokens();
                  String reverse[] = new String[ count ];
                  int index = count - 1;

                  while ( tokens.hasMoreTokens() )
                      reverse[ index-- ] = tokens.nextToken();

                  // display tokens in reverse order
                  for ( int k = 0; k < count; k++ )
                      System.out.printf( "%s ", reverse[ k ] );

                  System.out.println();
              } // end main
          } // end class TokenTest
