1. Toward Multimedia: A String Pattern-based Passage Ranking Model for Video Question Answering
2. 2 Introduction With the rapid expansion of video data, there is an increasing demand for retrieving and browsing videos
Current video retrieval techniques merely support retrieving related “documents”
Providing multimedia Q/A implies two components:
Video content extraction
Objects, sounds, speech, images, motions, etc
Text-based Q/A
Pinpoint exact answers rather than returning documents
3. 3 Related Works (Video) Extracting video contents is a very difficult but important task
Objects, sounds, speech, images, motions, etc
Among them, text in videos, especially closed captions, is the most powerful feature
OCR (optical character recognition) is generally more reliable than SR (speech recognition)
The well-known Informedia project (Wactlar, 2000) and TREC VID tracks (Over et al., 2005)
But both serve simple retrieval only, e.g.:
Find shots of [a ship or boat]
4. 4 Related Works (TextQ/A) TREC-QA pioneered the competition on extracting answers from huge document corpora
Most top-performing Q/A systems required combining many domain- and language-dependent resources:
Parsers (Charniak, 2002)
Named Entity Taggers (Florian et al., 2003)
Elaborate ontology (Yang et al., 2003)
WordNet (almost every Q/A study)
They are far from easy to port to different languages or domains
5. 5 Related Works (VideoQ/A) Lin et al. (2001) presented an early work
Simple OCR techniques combined with simple term weighting schemes
Unsophisticated OCR work
The thesaurus was hand-created
Yang et al. (2003) proposed the earliest video Q/A system
Made use of many linguistic resources
NER, Parser, WordNet, WWW, …
Applied news articles to correct speech errors
Keyword frequency-based answer selection
Cao et al. (2004) designed a domain-dependent Q/A system
For online education
Pattern-based (manually constructed) answer selection
Wu et al. (2004) showed the first cross-language video Q/A system
Applied a density-based method for answer selection
Converted each language into English (English queries only)
Zhang and Nunamaker (2004) developed a videoQ/A technique based on retrieving short clips
The short clips were segmented manually
Applied a simple TFIDF-like weighting
6. 6 In this paper We propose a passage ranking algorithm for extending textQ/A to videoQ/A
Users interact with our system through natural language questions
The returned passages answer the question
Lin et al. (2003) showed that users prefer passages over exact short answers since passages provide context
Our method is:
Portable across languages
Effective
7. 7 Outline Introduction
Related works
Our videoQ/A Method
Video Processing
Passage Ranking Algorithm
Experiments
Settings
Results
Conclusion
8. 8 System Architecture
9. 9 Video Processing
10. 10 Video Processing Text localization:
Purpose:
Localize the text areas in frames
Related works:
Top-down (Cai et al., 2002)
Bottom-up (Fan et al., 2001)
11. 11 Video Processing Extraction & Tracking:
Purpose:
Extract text color and integrate across multiple frames
Related works:
Text extraction (Ryu et al., 2005)
Multi-frame integration
12. 12 Video Processing OCR
Purpose:
Recognizing the characters in text components
Related works:
Simple OCR (Wu et al., 2004; Hong et al., 1995)
13. 13 System Architecture
14. 14 Chinese word segmentation There is no explicit boundary between words in many Asian languages (Chinese, Japanese, Korean, etc.)
We can adopt two approaches to extract words from such text
A well-trained Chinese word segmenter (SIGHAN bakeoff; see Levow, 2006)
N-grams (widely used for NTCIR cross-language retrieval; see Kishida et al., 2007)
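The N-gram alternative needs no training at all. A minimal sketch of overlapping character n-gram tokenization (the function name is illustrative, not from the paper):

```python
def char_ngrams(text, n=2):
    """Tokenize a string into overlapping character n-grams.

    A language-independent alternative to trained word segmentation:
    "ABCD" with n=2 yields ["AB", "BC", "CD"].
    """
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Bigrams (n=2) are the usual choice for Chinese in CLIR settings, since most Chinese words are two characters long.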
15. 15 An example
16. 16 System Architecture
17. 17 What is a sentence?
18. 18 Document Retrieval and Passage Segmentation Passage segmentation:
Sliding window with size=3 and one previous sentence overlapping
Initial retrieval model
Okapi-BM25 (Robertson et al., 2000; Savoy, 2005)
Top-1000 relevant passages for further re-ranking
One can replace BM-25 with better retrieval models
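The sliding-window setting on the slide (size 3, one previous sentence of overlap) can be sketched as follows; the helper name is hypothetical:

```python
def segment_passages(sentences, size=3, overlap=1):
    """Slide a fixed-size window over the sentence list.

    With size=3 and overlap=1 (the setting on the slide), consecutive
    passages share one sentence: [s0,s1,s2], [s2,s3,s4], ...
    """
    step = size - overlap
    passages = []
    for start in range(0, max(len(sentences) - overlap, 1), step):
        window = sentences[start:start + size]
        if window:
            passages.append(window)
    return passages
```

The overlap ensures that an answer straddling a window boundary still appears intact in at least one passage.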
19. 19 System Architecture
20. 20 Ranking Algorithm Related works
Introduction
Limitations
The importance of N-gram and word density
Our method
Suffix Tree
Algorithms for finding the best match sequence
Preprocessing
Re-tokenization and Weighting
21. 21 Ranking Algorithm (Related works) The ranking model receives the segmented passages and ranks the top-N passages in response to the question
Tellex et al. (2003) compared seven portable passage retrieval algorithms
Density-based methods performed best
Cui et al. (2005) further improved the density-based method by a 17% relative gain in MRR (mean reciprocal rank)
But it requires preparing training data, WordNet, and parsers first
22. 22 Ranking Algorithm (Related works) Parsing is a very complex task, particularly for Chinese
Word segmentation
Part-of-Speech tagging
Constituent parsing / Dependency parsing
Full parsing is also very slow
One sentence costs 0.8-1.3 seconds (Charniak's parser)
Besides, developing labeled corpora is laborious
Porting a trained passage ranker to another language is also very expensive
What about the OCR errors ?
23. 23 Ranking Algorithm (N-gram) Traditional ranking algorithms are biased toward high-frequency words rather than N-grams
An N-gram is useful and much less ambiguous than its individual unigrams
For example, it is often the case that
the trigram “Optical Character Recognition” is far less ambiguous than “Optical”, “Character”, and “Recognition” taken separately
24. 24 Ranking Algorithm (Density) Dense “distinct” word distribution is useful
If the passage contains abundant “identical” question words, potential answer words may occur nearby
Basic assumption of the density-based algorithms
Note that the “distinct” word distribution differs from the classic word distribution
Classical density-based methods simply count the matched word distributions
The first term of SiteQ's method is “keyword frequency”
In comparison, we focus on finding the “only one” best-fit match for each question term
25. 25 Ranking Algorithm (Frequency) Frequency is not always useful
A passage usually contains Chinese stopwords and punctuation:
” ;!????,
In our case, many unrecognizable or false-alarm words also appear
? ? ? ? ? ? ? ?
26. 26 Ranking Algorithm Our ranking algorithm takes both “views” into account
In other words, it finds the best match sequence for the passage, yielding “long” N-gram matches and a “dense” N-gram distribution
In addition, each match word is restricted to appear at most once in the sentence
27. 27 Ranking Algorithm Unfortunately, finding the best match is an NP-complete problem => O(2^n)
Match or Mismatch
Thus we propose an algorithm that approximately finds the best-fit match sequence to be scored
Induction:
Probabilistic view to score the importance of a passage
Propose an algorithm to find the match sequence
An example to estimate the score
Time complexity analysis
Compared with “density”, “frequency” methods
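The slides do not reproduce the approximation in full, so as an illustration only, here is a hypothetical greedy scan that embodies the two stated constraints: prefer long contiguous N-gram matches, and let each question term match at most once:

```python
def greedy_match(question_terms, passage_terms):
    """Hypothetical greedy approximation of the best match sequence.

    Exhaustively deciding match/mismatch for every term is O(2^n);
    instead, at each passage position take the longest contiguous
    question n-gram that starts there, marking question terms as used
    so each matches at most once.
    """
    used = set()      # question-term indices already matched
    matches = []      # (passage_position, match_length) pairs
    i = 0
    while i < len(passage_terms):
        best_len, best_start = 0, -1
        for qs in range(len(question_terms)):
            if qs in used:
                continue
            length = 0
            while (qs + length < len(question_terms)
                   and i + length < len(passage_terms)
                   and (qs + length) not in used
                   and question_terms[qs + length] == passage_terms[i + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, qs
        if best_len > 0:
            used.update(range(best_start, best_start + best_len))
            matches.append((i, best_len))
            i += best_len
        else:
            i += 1
    return matches
```

This runs in polynomial time and captures the intent, but it is a sketch, not the paper's exact algorithm.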
28. 28 System Architecture
29. 29 Question Analysis First, we remove all Chinese stopwords from the given question using a maximum-N-gram matching algorithm
Check the N-gram, (N-1)-gram, ..., 2-gram, 1-gram in the sentence, in decreasing order
The stoplist is built by
Estimating N-gram (N = 1, 2, 3) frequencies
Sorting
Selection by a native Chinese expert
897 entries = 571 (English stopwords) + 326 (semi-manual)
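The maximum-N-gram matching step above can be sketched as a greedy longest-match scan over the unsegmented question string (function name and toy stoplist are illustrative):

```python
def remove_stopwords(text, stoplist, max_n=3):
    """Greedy maximum-N-gram stopword removal (sketch).

    At each position, try the longest stoplist entry first (up to
    max_n characters); if none matches, keep the character and move on.
    """
    kept = []
    i = 0
    while i < len(text):
        for n in range(max_n, 0, -1):   # N, N-1, ..., 1
            if text[i:i + n] in stoplist:
                i += n                  # drop the matched stopword
                break
        else:
            kept.append(text[i])
            i += 1
    return "".join(kept)
```

Trying longer entries first matters: a multi-character stopword should be removed whole rather than partially matched by one of its characters.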
30. 30 Question Suffix Tree
31. 31 Passage Suffix Tree
32. 32 String Matching By inserting the question string into the Passage Suffix Tree, we can find the common subsequences of the question string
33. 33 String Matching Hence we observe the following common subsequences
Similarly, we can insert the passage string into the Question Suffix Tree to get:
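As a rough stand-in for the suffix trees on these slides, a naive suffix trie makes the matching step concrete: index every suffix of the passage, then walk the trie from each question position to enumerate shared substrings (O(n^2) space, fine for passage-length strings; real suffix trees are linear):

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix of `text` as a path of
    nested dicts keyed by character."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def common_substrings(question, trie, min_len=2):
    """Walk the trie from every question position to collect the
    substrings the question shares with the indexed passage."""
    found = []
    for start in range(len(question)):
        node, length = trie, 0
        while start + length < len(question) and question[start + length] in node:
            node = node[question[start + length]]
            length += 1
        if length >= min_len:
            found.append(question[start:start + length])
    return found
```

The symmetric direction on the slide (inserting the passage into the Question Suffix Tree) works the same way with the arguments swapped.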
34. 34 Scoring Function The passage score is ranked by
A tunable weighting parameter adjusts the importance of the density score
QW_Density(Q, P) estimates the Q word density in P
QW_Weight(Q, P) measures the sum of weight of the matched question words in P
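The slide does not reproduce the formula itself, so the sketch below only illustrates the shape of the combination: a SiteQ-style distance-based density plus the summed weight of matched question words, mixed by a parameter (all names and the linear interpolation are assumptions, not the paper's exact definition):

```python
def qw_density(match_positions):
    """Distance-based density sketch (hypothetical): matched question
    words that sit close together in the passage score higher."""
    if len(match_positions) < 2:
        return float(len(match_positions))
    gaps = [b - a for a, b in zip(match_positions, match_positions[1:])]
    return sum(1.0 / g for g in gaps)

def passage_score(match_positions, matched_weights, mix=0.5):
    """Linear mix of QW_Density and QW_Weight; `mix` stands in for the
    slide's weighting parameter."""
    return mix * qw_density(match_positions) + (1 - mix) * sum(matched_weights)
```

The two terms pull in different directions: density rewards tight clusters of matches, while weight rewards matching rare, discriminative terms at all.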
35. 35 QW_Density Quantifies the weighted word-density distribution
Modifies SiteQ's second term
Per our hypothesis, we favor long string patterns
36. 36 Discriminative Power The discriminative power should also be taken into consideration
37. 37 QW_Density By re-tokenizing and re-weighting, the QW_Density can be computed as follows
38. 38 QW_Weight This term estimates
How much content information the passage has given the question
39. 39 Combining Density and Weight We further take the first two or last two sentences into account
Answers may occur before/after the sentences that contain useful term matches
40. 40 Outline Introduction
Related works
Our videoQ/A method
Experiments
Settings
Results
Conclusion
41. 41 Settings The test question set (about 250 questions) was mainly collected from Web logs
We use MRR, precision, and pattern-recall scores to evaluate the proposed Q/A method
Pattern recall: the number of answer patterns found in the top-5 ranks
To compare with the state of the art, we adopted six effective, multilingual-portable ranking algorithms
TFIDF, BM-25, Language Model, INQUERY, Cosine, SiteQ
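The MRR metric used for comparison can be computed in a few lines (the function and the correctness predicate are illustrative):

```python
def mean_reciprocal_rank(ranked_answers, is_correct):
    """MRR over a question set: for each question, take 1/rank of the
    first correct passage (0 if none appears) and average."""
    total = 0.0
    for question, passages in ranked_answers.items():
        for rank, passage in enumerate(passages, start=1):
            if is_correct(question, passage):
                total += 1.0 / rank
                break
    return total / len(ranked_answers)
```

A correct answer at rank 1 contributes 1.0, at rank 2 only 0.5, so MRR rewards systems that place the answer early rather than merely somewhere in the top-N.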
42. 42 For askers
43. 43 For askers
44. 44 Statistics of the collected Discovery videos
45. 45 Comparison
46. 46 Results (character-level)
47. 47 Results (word-level)
48. 48 Large-scale experiments
49. 49 Auto-Translate into English
50. 50 Re-ranking the six retrieval models
51. 51 Conclusion This paper proposes a new passage ranking algorithm for Chinese video Q/A
250 collected questions were evaluated on 75.6 hours of video
It outperforms BM-25, the Language Model, INQUERY, etc.
Applying word segmentation to video Q/A is not a good idea:
an average drop of 10% for most retrieval models
Can we parse the OCR transcripts as regular articles?
Word-segmentation (0.94), POS tagging (0.91-0.92), Parsing (0.846)
52. 52 Future Directions Speech is another important clue
Now we are investigating some well-known toolkits
CMU's Sphinx, Cambridge's HTK
Effectively parsing the transcript (especially for Asian languages)
How to handle misrecognized and false-alarm words
Domain adaptation (from news articles to the video)
53. 53 Thanks Prof. Yue-Shi Lee and Prof. Chia-Hui Chang for their many comments and full support
Database Lab (National Central Univ.) and Data Mining Lab (Ming-Chuan Univ.) for usability testing and candid comments
54. 54 References Yang, H., Chaisorn, L., Zhao, Y., Neo, S. Y., & Chua, T. S. (2003). VideoQA: question answering on news video. In Proceedings of the 11th ACM international conference on multimedia (ACM MM) (pp. 632-641).
Zhang, D., & Nunamaker, J. (2004). A natural language approach to content-based video indexing and retrieval for interactive E-learning. Journal of IEEE Transactions on Multimedia, 6, 450-458.
Wu, Y. C., Lee, Y. S., & Chang, C. H. (2004). CLVQ: cross-language video question/answering system. In Proceedings of 6th IEEE international symposium on multimedia software engineering (MSE) (pp. 294-301).
Lyu, M. R., Song, J., & Cai, M. (2005). A comprehensive method for multilingual video text detection, localization, and extraction. Journal of IEEE transactions on circuits and systems for video technology, 15, 243-255.
Lienhart, R., & Wernicke, A. (2002). Localizing and segmenting text in images and videos. Journal of IEEE transactions on circuits and systems for video technology, 12, 256-268.
Lin, C. J., Liu, C. C., & Chen, H. H. (2001). A simple method for Chinese videoOCR and its application to question answering. Computational linguistics and Chinese language processing, 6, 11-30.
Cao, J., & Nunamaker, J. F. (2004). Question answering on lecture videos: a multifaceted approach. In Proceedings of the joint conference on digital libraries (JCDL) (pp. 214-215).
Cao, J., Roussinov, D., Robles, J., & Nunamaker J. F. (2005). Automated question answering from videos: NLP vs. pattern matching. Hawaii international conference on system science (HICSS) (pp. 43(b)-43(b)).
Kishida, K., Chen, K. H., Lee, S., Kuriyama, K., Kando, N., Chen, H. H., & Myaeng, S. H. (2007). Overview of CLIR task at the sixth NTCIR workshop. In Proceedings of the 6th NTCIR Workshop.
55. 55 SPVQA system
56. 56 SPVQA system
57. 57 SPVQA system
58. 58 SPVQA system
59. 59 Online-Demonstration
60. 60 Discussions Our method outperforms TF-based and density-based methods
Our method is suitable for video OCR transcripts
Even when OCR errors appear within keywords
Question: ???????????
Answer: ????????????????????
61. 61
62. 62 Error Analysis OCR error in key question words
“Where is the headquarter of FBI?”
But most instances of “FBI” were incorrectly recognized as
98I or 28I
Synonyms and anaphora
Our method focuses on surface terms
It fails to resolve “It”, “He”, “She”, etc.
63. 63 Error Analysis Lack of language-dependent analysis
Chinese word tokenization
Chinese stopword removal
???????????
?????? obtains more weight
In this case
We should focus on “??(?)” and “???”
Machine translation errors
Out-of-vocabulary
Hotshepsut => conspicuous Zhai
(???)
64. 64 In natural language, this is quite uncommon
65. 65 Repeat pattern does not hurt Q/A Most repeated words are:
Meaningless words:
?, ?
Punctuations:
, . ?
OCR false-alarm words:
? ?
Experiments also demonstrate that employing a simple Longest Common Subsequence performs the same as the proposed method (or enumerating all of the state sequences)
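For reference, the simple Longest Common Subsequence baseline mentioned above is the classic dynamic program, here with a rolling-row to keep memory linear:

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) longest-common-subsequence length,
    the simple matcher the slide reports as performing on par with the
    proposed state-sequence enumeration."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(b)]
```

Unlike substring matching, LCS tolerates gaps, which is why it copes with occasional OCR false alarms interleaved among the matched characters.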
66. 66 Video Processing (Experiments) We use a small subset of Discovery videos
30 short clips
NTSC 352x240 MPEG-1
1684 frames (sample 2 frames per second)
2166 text areas
67. 67 Experimental Result (Text detection)
68. 68 Experimental Result (OCR)
69. 69 VideoOCR Efficiency Analysis