Techniques for Information Searching and Retrieval of Multimedia Digital Library Presented by: Vincent Cheung Supervised by: Prof. Michael Lyu Prof. K. W. Ng 18 December, 1999
Abstract • Digital libraries are becoming more and more popular because of their strength in searching and retrieving information. • There is a trend toward storing more multimedia information instead of pure text. • Because the nature of multimedia information is very different from that of pure text, new challenges arise in information searching and retrieval techniques.
Presentation Outline • General Information Retrieval Methods • Multimedia & Their Retrieval Techniques • Retrieval Techniques in Other Information Searching Applications • An Indexing Tool Implemented • Conclusion and Q&A Session
Overview - Information Searching and Retrieval Procedures • Index the existing information • Store the information with good organization • Get the user queries • Search the information • Evaluate the importance of all query results • Present the results to the users • Process the feedback of the users
Flowchart of Retrieval Processes • User Queries • Extract the keywords of the user query for further searching • Formulate the keywords with logical operations (e.g. AND, OR, etc.) • Search by comparing the keywords of documents and search requests • Matching Items / Unmatched Items • Perform logical combination of terms to obtain answers that satisfy the logical restrictions • Evaluate the rankings of the retrieved answers and construct the output • Display to Users • Data stores: Dictionaries, Indexed Database
Indexing Aim: to give an abstract of the document and label it with a few keywords • Manual indexing • Using the whole passage • “Content words” counting • Natural language processing
Query Modification Aim: to modify the query so that it yields the largest number of relevant results. Linguistic problems: • Some words carry out only syntactic functions • Different words can supply the same or related meanings • Words can be used in different senses, depending on context • Different structures can represent the same idea
Solving Linguistic Problems • Use of dictionaries: • Negative dictionary • Thesaurus (or synonym dictionary) • Phrase dictionary • Use of fuzzy logic for matching synonyms: • Construct a set of fuzzy relations, represented by fuzzy graphs that are obtained from statistics on the occurrence and co-occurrence of keywords.
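The dictionary-based steps above can be sketched as follows. This is a minimal illustration, assuming a negative dictionary is a set of function words to drop and a thesaurus maps a keyword to its synonyms; the word lists and the name `modify_query` are hypothetical and not taken from the presentation.

```python
# Hypothetical dictionaries; a real system would load much larger ones.
NEGATIVE_DICTIONARY = {"the", "of", "a", "an", "and", "for", "in", "to"}
THESAURUS = {
    "picture": {"image", "photo"},
    "film": {"video", "movie"},
}

def modify_query(query):
    """Drop function words, then expand each keyword with its synonyms."""
    keywords = [w for w in query.lower().split() if w not in NEGATIVE_DICTIONARY]
    expanded = []
    for word in keywords:
        # Each keyword becomes an OR-group: the word itself plus its synonyms.
        expanded.append({word} | THESAURUS.get(word, set()))
    return expanded

groups = modify_query("the picture of a film")
```

Each returned set can be read as an OR-group, and the groups together as an AND over those groups.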
Searching and Storage Aim: good organization in storage gives good performance in searching. • Two main principles of file organization: direct and inverted systems • Direct system: files are stored in order by document number, and items are retrieved by a sequential scan of the complete files. • Advantage of the direct system: it allows several searches to be performed at the same time.
Searching and Storage (cont’) • Inverted system: the files are arranged in order by a set of keywords or index terms. Each item is normally listed as many times as it has assigned keywords. • Advantage of the inverted system: only the sections of the files that correspond to the index terms used in queries need to be read • Other methods: variations of these two principles
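The inverted organization described above could be sketched like this. It is a minimal in-memory illustration, assuming each document is already labelled with keywords; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the set of document numbers that carry it."""
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index

def search_and(index, terms):
    """Documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, terms):
    """Documents that contain at least one query term."""
    return set().union(*(index.get(t, set()) for t in terms))

documents = {1: ["video", "index"], 2: ["video", "audio"], 3: ["text", "index"]}
index = build_inverted_index(documents)
both = search_and(index, ["video", "index"])
either = search_or(index, ["audio", "text"])
```

Only the posting sets for the query terms are touched, which is the advantage named on the slide; a direct system would scan all three documents instead.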
Evaluation of Search Results • Aim: to rank the list of answers from the search by using a ranking function • Different ranking functions can be used to calculate the weight of the returned answers • One simple and popular function: counting the occurrences of query keywords • Not very fair: longer passages have a higher chance of containing more keywords
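The bias toward long passages can be seen in a small sketch. Dividing the count by document length is one common remedy and is an assumption here, not a method named in the presentation; the function names and sample texts are hypothetical.

```python
def count_score(document_words, query_keywords):
    """Simple ranking: total occurrences of the query keywords."""
    return sum(document_words.count(k) for k in query_keywords)

def normalized_score(document_words, query_keywords):
    """Assumed remedy: divide the count by the document length."""
    if not document_words:
        return 0.0
    return count_score(document_words, query_keywords) / len(document_words)

query = ["video", "retrieval"]
short_doc = "video retrieval".split()
long_doc = ("video retrieval " + "filler " * 8 + "video").split()

raw_short = count_score(short_doc, query)
raw_long = count_score(long_doc, query)
norm_short = normalized_score(short_doc, query)
norm_long = normalized_score(long_doc, query)
```

With raw counts the long document ranks first even though most of it is filler; after normalization the short, focused document ranks first.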
Feedback Aim: to let users redefine the query statements for more responsive results • Ask users to give feedback on the query results, because queries may be unclear, user interest may change, etc. • Query statements may be modified, and the system should perform further searching. Relevant items should then correlate more highly with the modified query than with the original one.
Flowchart of Feedback • User input • Read the maximum number of documents to be examined by the user in successive iterations, then do the searching • Does the user want to terminate the search, or has the maximum permitted number of iterations been reached? • Yes: proceed with evaluation of the successive iterations, print the results, and exit • No: modify the query using relevance judgements for the first n_i documents of the previous iteration • Search the document collection with the newly constructed modified query and produce user output
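The "modify the query using relevance judgements" step could look like the sketch below. It uses a Rocchio-style weight update, which is an assumption: the presentation does not say which update rule is used. The name `apply_feedback` and the parameters `beta` and `gamma` are hypothetical.

```python
from collections import Counter

def apply_feedback(query_weights, relevant_docs, non_relevant_docs,
                   beta=0.5, gamma=0.25):
    """Raise weights of terms seen in relevant documents, lower the others."""
    updated = Counter(query_weights)
    for doc in relevant_docs:
        for term in doc:
            updated[term] += beta / len(relevant_docs)
    for doc in non_relevant_docs:
        for term in doc:
            updated[term] -= gamma / len(non_relevant_docs)
    # Terms whose weight drops to zero or below are removed from the query.
    return {term: weight for term, weight in updated.items() if weight > 0}

query = {"jaguar": 1.0}
judged_relevant = [["jaguar", "cat"]]
judged_non_relevant = [["jaguar", "car"]]
new_query = apply_feedback(query, judged_relevant, judged_non_relevant)
```

In the flowchart this function would be called once per iteration, followed by a new search with `new_query`, until the user stops or the iteration limit is reached.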
Concept Based Query • An object-oriented method for indexing • Conceptual indexes (classes) are used, and those classes form a decision tree hierarchy. • Users make queries in the same way as before • Instead of answering documents, a list of concepts is returned first. • Users then narrow their search by indicating the desired classes or concepts
Characteristics of Multimedia • Large file sizes • May be dynamic in nature (e.g. audio or video) instead of static (e.g. text, image) • No simple methods for indexing or describing the contents of the files • Various kinds of file formats (e.g. JPEG, GIF, TIFF for images; MOV, MPEG for video)
Existing Multimedia Digital Library - Informedia • Convert multimedia to text: Speech Recognition and Optical Character Recognition, so that indexing and searching can be done by traditional methods • Face Recognition: a non-text-based technique for matching the faces of persons in videos • Presenting Results: poster frame, filmstrip, and skimming, which give users a faster review of the query answers when choosing the desired video
Internet Search Engines • The Internet is similar to a Digital Library: • a huge database • heterogeneous information • dynamic • decentralized • Common Internet search engines use a centralized index database • Disadvantages: • heavy server workload • inefficient use of bandwidth • poor quality of results
Distributed Search Engine • Local proxy servers can be enhanced to perform web searching, so that a network of search engines can be established • Response time is faster and network traffic can be reduced • Better results should be given
Video-on-Demand Systems • VoD systems deliver videos to clients upon request • A VoD system is similar to a Digital Library: • both deliver videos upon user requests, and the videos are large in size • Efficient retrieval is needed, and it can be achieved only if there is an efficient storage method.
How Data Is Stored in VoD • The primary design goal is to maximize the ratio of the number of concurrent streams to system cost while guaranteeing glitch-free operation • An array of magnetic hard disks and a large RAM buffer are used. • RAM has faster I/O rates than hard disks, so popular videos are put in RAM • A popular video should not be stored with other popular videos, which gives a better balance of workload. • RAID is used, and I/O is done by the whole array of disks at the same time.
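The rule that popular videos should not share a disk can be illustrated with a greedy placement. This is only a sketch under an assumed policy (place the most requested title first, always on the least loaded disk); the presentation does not specify the placement algorithm, and the titles and demand figures are made up.

```python
def place_videos(popularity, num_disks):
    """Assign videos to disks, most popular first, onto the least loaded disk."""
    disks = [[] for _ in range(num_disks)]
    load = [0] * num_disks
    by_demand = sorted(popularity.items(), key=lambda item: item[1], reverse=True)
    for title, demand in by_demand:
        target = load.index(min(load))  # least loaded disk so far
        disks[target].append(title)
        load[target] += demand
    return disks, load

# Hypothetical expected requests per hour for four titles.
popularity = {"A": 50, "B": 40, "C": 10, "D": 5}
disks, load = place_videos(popularity, num_disks=2)
```

The two most popular titles end up on different disks, so neither disk carries most of the request load.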
Image Databases • Documents are not indexed by verbal description, because it may not describe the contents well. • Other means are used instead, e.g. histogram representation, shape chains, etc. • Similar to a Digital Library: • Both store multimedia information.
Motion Databases • Implemented by Deng (1997). Closer to a digital library. • Index the video by three primary features: • color (color histogram) • texture (Gabor texture features) • motion (motion histogram) • Good for sports or movie data
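The color feature above can be sketched as a coarse RGB histogram compared by histogram intersection. The bin count and the intersection measure are assumptions chosen for illustration; they are not taken from Deng's system, and the pixel values are made up.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram over coarse RGB bins (pixels are 0-255 triples)."""
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1
    total = len(pixels)
    return [count / total for count in hist] if total else hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

red_frame = [(250, 10, 10)] * 4
reddish_frame = [(240, 20, 5)] * 3 + [(10, 10, 250)]
blue_frame = [(5, 5, 240)] * 4

h_red = color_histogram(red_frame)
sim_reddish = histogram_intersection(h_red, color_histogram(reddish_frame))
sim_blue = histogram_intersection(h_red, color_histogram(blue_frame))
```

A query frame would be ranked against stored frames by this similarity; texture and motion features would need their own descriptors.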
Chinese Search Engines • Methods similar to those used for English can be applied • Chinese is very different from English because it is less structured (e.g. 吃了小明的狗), so a sentence cannot always be parsed according to the grammar • It is difficult to extract the ideas in documents and identify the keywords for indexing • Subject-verb-object (SVO) analysis can be used to identify the syntactic components
An Indexing Tool: Chinese Subtitles Extraction in Video • Chinese has many dialects, but Chinese characters are common to all of them • Many video programs have Chinese subtitles nowadays • Extracting text from digital video programs can help with indexing, searching and retrieval
Features of Subtitles • Characters are in foreground • They are monochrome • They are rigid, from frame to frame • They are upright • They have size restrictions • They contrast with the background • They appear in clusters at a limited distance aligned to a horizontal line
Implementation • Two main challenges: • to segment the character areas • to recognize the characters • Four phases: • extract the subtitle block from the background • extract each character from subtitle block • recognize the Chinese Characters • process the whole video
Sample Frame • ATV video news in MPEG format about Airport Authority • First, extract one frame from the video
Edge Filtering • Apply edge filtering to the frame using a Sobel filter.
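A Sobel filter on a grayscale frame can be sketched in plain Python as below. The frame is assumed to be a list of rows of intensity values; using the sum of absolute gradients as the edge strength and leaving border pixels at zero are simplifications chosen here, not details from the presentation.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(gray):
    """Approximate edge strength |gx| + |gy| for each interior pixel."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            edges[y][x] = abs(gx) + abs(gy)
    return edges

# Toy frame: dark left half, bright right half, so there is one vertical edge.
frame = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(frame)
flat = sobel_magnitude([[128] * 4 for _ in range(4)])
```

The response is large along the vertical boundary and zero on a uniform frame, which is why dense character strokes stand out after filtering.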
Subtitle Block Extraction • A high density of edges indicates that there is a subtitle block
Character Extraction • Filter out the background area and keep the subtitle block • Using the same method, segment the characters
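One way to read "high edge density" and "the same method" is as projection profiles: find rows with many strong edges to locate the subtitle block, then find columns with many strong edges inside that block to separate characters. The sketch below assumes this reading; the thresholds, helper names, and the toy edge map are hypothetical.

```python
def dense_runs(densities, threshold):
    """Inclusive (start, end) index runs where the density meets the threshold."""
    runs, start = [], None
    for i, density in enumerate(densities):
        if density >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(densities) - 1))
    return runs

def row_density(edges, edge_threshold):
    """Fraction of strong-edge pixels in each row."""
    return [sum(v >= edge_threshold for v in row) / len(row) for row in edges]

def column_density(edges, edge_threshold):
    """Fraction of strong-edge pixels in each column."""
    return row_density([list(col) for col in zip(*edges)], edge_threshold)

# Toy edge map: two "characters" (columns 1-2 and 4-5) on rows 3-4.
E = 900
edge_map = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, E, E, 0, E, E, 0],
    [0, E, E, 0, E, E, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
subtitle_rows = dense_runs(row_density(edge_map, 500), 0.5)
top, bottom = subtitle_rows[0]
block = edge_map[top:bottom + 1]
characters = dense_runs(column_density(block, 500), 0.5)
```

On a real frame the thresholds would have to be tuned, and the size and alignment restrictions listed under "Features of Subtitles" could be used to reject false blocks.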
Results of Recognition • A Chinese Character Image Library is built for recognition • It contains 5401 frequently used Chinese characters • Simple subtraction is used for recognition • Characters segmented • Characters recognized
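"Simple subtraction" is read here as subtracting a segmented glyph from each library image pixel by pixel and picking the library entry with the smallest total difference. That reading is an assumption, as are the tiny 3x3 binary templates, which stand in for real character images of equal size.

```python
def subtraction_distance(glyph, template):
    """Sum of absolute pixel differences between two equal-sized images."""
    return sum(
        abs(a - b)
        for glyph_row, template_row in zip(glyph, template)
        for a, b in zip(glyph_row, template_row)
    )

def recognize(glyph, library):
    """Return the library label whose image differs least from the glyph."""
    return min(library, key=lambda label: subtraction_distance(glyph, library[label]))

# Hypothetical two-entry library; the real one holds 5401 characters.
library = {
    "一": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
    "十": [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
}
noisy_glyph = [[0, 1, 0], [1, 1, 1], [0, 1, 1]]  # "十" with one noisy pixel
label = recognize(noisy_glyph, library)
```

Pixel-wise subtraction is sensitive to shifts, scale, and font differences, which is consistent with the low recognition rate reported in the evaluation.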
Evaluation • The success rate of segmenting the characters is quite high (~90% in general) • The success rate of character recognition is low (~15% in general) • Better algorithms for character recognition will be tried • The tool can be used for indexing video clips for a digital library
Conclusion • Information Retrieval relates to many different fields: linguistics, image processing, data organization, hardware utilization, etc. • Information Retrieval involves many procedures: indexing, searching, organizing data, etc. • Choose one specific area to work on in the coming semester.