Categorizing Multimedia Documents Using Associated Text

Thesis Proposal by Carl Sable

Indoor vs. Outdoor
• Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. They are clockwise from the top: Russian President Boris Yeltsin, U.S. President Bill Clinton, French President Jacques Chirac, Canadian Prime Minister Jean Chretien, Italian Prime Minister Romano Prodi, EU President Willem Kok, EC President Jacques Santer, British Prime Minister Tony Blair, Japanese Prime Minister Ryutaro Hashimoto and German Chancellor Helmut Kohl.
• Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh. All 89 passengers and crew survived the accident, mostly with minor injuries. Most of the passengers were expatriate Bangladeshis returning home from London.

Event Categories • Politics • Struggle • Disaster • Crime • Other

Disaster Images • Workers Responding • Affected People • Wreckage • Other

Overview of Talk • Contributions. • Corpora. • Two previous systems. • Harder categories. • Interactions between categories. • Video. • Schedule. • Conclusions.

Contributions • Use of text categorization to categorize multimedia documents. • Introduction of novel techniques. • Use of NLP to handle tough categories. • Exploration of interactions between various sets of categories.

Manual Categorization Tool

Reuters • Common corpus for comparing methods. • Over 10,000 articles, 90 topic categories. • Binary categorization. • Sample articles and their categories: article 5: grain, wheat, corn, barley, oat, sorghum; article 9: earn; article 448: gold, acq, platinum. • http://www.research.att.com/~lewis/reuters21578.html

Lots of Previous Literature • “Bag of words” with weights. • Term frequency (TF). • Inverse document frequency (IDF). • Variety of methods: Rocchio, k-nearest neighbors (KNN), naïve Bayes (NB), support vector machines (SVMs). • Systems trained with labeled examples.
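The bag-of-words representation with TF and IDF weights can be sketched as follows. This is a minimal illustration, not the system described in the talk: the toy captions, the base-2 logarithm, the cosine similarity, and the positive-only centroid (a simplification of Rocchio, which normally also subtracts negative examples) are all assumptions.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF per word: log2(N / document frequency). The log base is an assumption."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log2(n / c) for w, c in df.items()}

def tfidf(doc, idf):
    """Bag-of-words vector: term frequency times IDF; unknown words are dropped."""
    tf = Counter(doc)
    return {w: tf[w] * idf[w] for w in tf if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_centroids(docs, labels, idf):
    """One prototype per category: the sum of its documents' TF*IDF vectors."""
    centroids = {}
    for doc, label in zip(docs, labels):
        proto = centroids.setdefault(label, Counter())
        for w, x in tfidf(doc, idf).items():
            proto[w] += x
    return centroids

def classify(doc, centroids, idf):
    """Pick the category whose prototype is most similar to the document."""
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Hypothetical labeled captions, invented for illustration only.
TRAIN = [
    ("leaders meet in the library", "Indoor"),
    ("president speaks at a press conference", "Indoor"),
    ("villagers look at the wreckage of the jet", "Outdoor"),
    ("rescue workers search near the river", "Outdoor"),
]
docs = [text.split() for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
idf = idf_table(docs)
centroids = rocchio_centroids(docs, labels, idf)
prediction = classify("workers look at the jet".split(), centroids, idf)
print(prediction)
```

The other methods named on the slide (KNN, NB, SVMs) would consume the same TF*IDF vectors; only the decision rule changes.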

Density Estimation • Start with advanced Rocchio system. • For each test document, compute similarity to every category. • Find all documents from training set with similar category similarities. • Use categories of close training documents to predict categories of test documents.

Example
Category order in each score vector: Struggle, Politics, Disaster, Crime, Other.
Category score vector for test document: 100, 40, 30, 90, 10
Category score vectors for training documents (actual category), with distance to the test vector:
• 85, 35, 25, 95, 20 (Crime): 20.0
• 100, 75, 20, 30, 5 (Struggle): 92.5
• 40, 30, 80, 25, 40 (Disaster): 106.4
• 80, 45, 20, 75, 10 (Struggle): 27.4
• 60, 95, 20, 30, 5 (Politics): 91.4
• 90, 25, 50, 110, 25 (Crime): 36.7
Predictions: • Rocchio: Struggle. • DE: Crime (probability .679).
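The example can be replayed in a few lines. This sketch assumes the score vectors are ordered Struggle, Politics, Disaster, Crime, Other (the order under which each training vector peaks at its own label) and that distances are plain Euclidean, which reproduces five of the six distances shown. The weighting scheme is a guess: the slides do not say how the probability is derived, so inverse squared distance is used here and the result is not expected to match the reported .679.

```python
import math

CATEGORIES = ["Struggle", "Politics", "Disaster", "Crime", "Other"]

# Training score vectors and labels, taken from the example slide.
TRAIN = [
    ((85, 35, 25, 95, 20), "Crime"),
    ((100, 75, 20, 30, 5), "Struggle"),
    ((40, 30, 80, 25, 40), "Disaster"),
    ((80, 45, 20, 75, 10), "Struggle"),
    ((60, 95, 20, 30, 5), "Politics"),
    ((90, 25, 50, 110, 25), "Crime"),
]
TEST = (100, 40, 30, 90, 10)

def rocchio_prediction(scores):
    """Plain Rocchio: pick the category with the highest similarity score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CATEGORIES[best]

def de_probabilities(test, train, power=2):
    """Assumed weighting: each training document votes for its own category
    with weight 1 / distance**power; votes are normalized to sum to 1."""
    votes = {}
    for vec, label in train:
        d = math.dist(test, vec)  # Euclidean distance (Python 3.8+)
        votes[label] = votes.get(label, 0.0) + 1.0 / (d ** power + 1e-9)
    total = sum(votes.values())
    return {label: v / total for label, v in votes.items()}

probs = de_probabilities(TEST, TRAIN)
de_prediction = max(probs, key=probs.get)
print("Rocchio:", rocchio_prediction(TEST))
print("DE:", de_prediction)
```

The point of the example survives the simplification: the raw scores favor Struggle, but the training documents whose score vectors lie closest to the test vector are mostly Crime.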

Bin System (AT&T) • Group words with similar “features” together into a common “bin”. • Based on training data, empirically estimate a term weight for the words in each bin. • Acts as smoothing: works well even when there is not enough data for individual words. • Doesn’t assume simple relationships between features.

Sample Words • Indoor Indicators: “conference”, “bed”. • Outdoor Indicators: “airplane”, “earthquake”. • Ambiguous: “Gore”, “ceremony”.

Determine Bins for “airplane” • Per category bins based on IDF and category counts. • IDF(“airplane”) = 5.4. • Examine first half of training data: • Appears in 0 indoor documents. • Appears in 2 outdoor documents.

Lambdas for “airplane” • Determined at the bin level. • Examine second half of training data:
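A rough sketch of the two-pass idea: bins are assigned from the first half of the training data and one weight (lambda) per bin is estimated from the second half. Everything specific here is an assumption made for illustration: the toy documents, the bin key (rounded base-2 IDF plus indoor and outdoor document counts), and the add-one smoothed log ratio used as lambda. The slides do not give the actual binning rules or the lambda formula.

```python
import math
from collections import Counter, defaultdict

def category_counts(docs):
    """For each word, count how many documents of each category contain it."""
    counts = defaultdict(Counter)
    for words, label in docs:
        for w in set(words):
            counts[w][label] += 1
    return counts

def assign_bins(first_half):
    """Assumed bin key: (rounded IDF, indoor doc count, outdoor doc count)."""
    n = len(first_half)
    bins = {}
    for w, c in category_counts(first_half).items():
        idf = math.log2(n / sum(c.values()))
        bins[w] = (round(idf), c["indoor"], c["outdoor"])
    return bins

def estimate_lambdas(second_half, bins):
    """Assumed lambda: add-one smoothed log ratio of indoor to outdoor hits,
    pooled over every word that shares the bin."""
    hits = defaultdict(Counter)
    for words, label in second_half:
        for w in set(words):
            if w in bins:  # words unseen in the first half are ignored here
                hits[bins[w]][label] += 1
    return {
        b: math.log2((hits[b]["indoor"] + 1) / (hits[b]["outdoor"] + 1))
        for b in set(bins.values())
    }

# Hypothetical training data, invented for illustration.
FIRST_HALF = [
    ("conference table room".split(), "indoor"),
    ("bed room hospital".split(), "indoor"),
    ("airplane runway crash".split(), "outdoor"),
    ("airplane wreckage field".split(), "outdoor"),
]
SECOND_HALF = [
    ("conference hall speech".split(), "indoor"),
    ("bed ward nurse".split(), "indoor"),
    ("runway crash fire".split(), "outdoor"),
    ("wreckage hillside rescue".split(), "outdoor"),
]

bins = assign_bins(FIRST_HALF)
lambdas = estimate_lambdas(SECOND_HALF, bins)
print(bins["airplane"], lambdas[bins["field"]], lambdas[bins["hospital"]])
```

In this toy data “field” and “hospital” never occur in the second half, yet each still receives a weight of the expected sign from the other words in its bin, which is the smoothing effect the bin approach is meant to provide.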

Sample Words with Scores • Indoor Indicators: “conference” (+5.91), “bed” (+4.58). • Outdoor Indicators: “airplane” (-3.78), “earthquake” (-4.86). • Ambiguous: “Gore” (+0.74), “ceremony” (-0.32).
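To show how such word scores could drive a decision, here is a small sketch that adds up the scores of the words in a caption. The scores are the six from the slide; summing them, treating unknown words as zero, and thresholding at zero (positive means Indoor) are assumptions, as are the two captions.

```python
# Word scores from the slide: positive leans Indoor, negative leans Outdoor.
SCORES = {
    "conference": 5.91,
    "bed": 4.58,
    "airplane": -3.78,
    "earthquake": -4.86,
    "gore": 0.74,
    "ceremony": -0.32,
}

def classify(caption):
    """Sum the scores of known words; unknown words contribute nothing."""
    words = [w.strip('.,;:"').lower() for w in caption.split()]
    total = sum(SCORES.get(w, 0.0) for w in words)
    return ("Indoor" if total > 0 else "Outdoor"), total

# Hypothetical captions, invented for illustration.
label_a, total_a = classify("Gore speaks at a press conference.")
label_b, total_b = classify("Ceremony held near the airplane wreckage.")
print(label_a, round(total_a, 2))
print(label_b, round(total_b, 2))
```

An ambiguous word such as “Gore” barely moves the total, so the decision is carried by the strong indicators.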

Results • Both systems did OK on Reuters. • DE performed best for Indoor vs. Outdoor. • Bin system performed best for Events.

Standard Evaluation Metrics (1) • Per category measures: simple accuracy or error measures are misleading for binary categorization. • Precision and recall, from each category's contingency table: p = a / (a + b), r = a / (a + c), where a counts documents correctly assigned to the category, b counts documents incorrectly assigned to it, and c counts documents that belong to it but were not assigned. • F-measure, average precision, and break-even point (BEP) combine precision and recall. • Macro-averaging vs. micro-averaging: macro treats all categories equally, micro treats all documents equally. • Macro is usually lower since small categories are hard.
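The per-category measures and the two averaging schemes can be written out directly from the contingency counts a, b, and c. The sketch below uses the balanced F-measure (F1); the two contingency tables are made-up numbers chosen to contrast a large category with a small one, not results from the talk. Average precision and BEP are not covered.

```python
def precision_recall_f1(a, b, c):
    """a = correctly assigned, b = incorrectly assigned, c = missed."""
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_f1(tables):
    """Average the per-category F1 values: every category counts equally."""
    return sum(precision_recall_f1(*t)[2] for t in tables.values()) / len(tables)

def micro_f1(tables):
    """Pool the counts first: every document decision counts equally."""
    a = sum(t[0] for t in tables.values())
    b = sum(t[1] for t in tables.values())
    c = sum(t[2] for t in tables.values())
    return precision_recall_f1(a, b, c)[2]

# Hypothetical (a, b, c) counts for one large and one small category.
TABLES = {"earn": (90, 10, 10), "gold": (1, 3, 4)}

macro = macro_f1(TABLES)
micro = micro_f1(TABLES)
print(round(macro, 3), round(micro, 3))
```

With these numbers the small category drags the macro average well below the micro average, which is the effect the slide describes.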

Results for Reuters

Standard Evaluation Metrics (2) • Mutually exclusive categories: each test document has only one correct label and is assigned only one label. • Performance measured by overall accuracy: the fraction of test documents assigned their correct label.

Results for Indoor vs. Outdoor • Columbia system using density estimation shows best performance. • Even beats SVMs. • System using bins very respectable.

Results for Event Categories • System using bins shows best performance. • Columbia system respectable.

Improving Bin Method • Experiment with more advanced binning rules. • Fall back to single word term weights for frequent words.

Why are Disaster Images Hard? • Small corpus (124 training images, 124 test images). • Most words not important. • Important words associated with test images have likely never occurred in training set.

Approach • Extract important information (e.g. subjects and verbs). • Compare test words to training words. • Previously seen words indicate strong evidence. • Consider large, unsupervised corpus of subject/verb pairs. • Add evidence to categories for any verb or subject ever paired with new subject or verb.
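One way to read the approach above is as a two-level evidence scheme: a subject or verb already seen in training votes strongly for its categories, and an unseen word votes weakly through the words it was paired with in the unsupervised corpus. The sketch below is only an illustration of that reading. The training pairs, the unsupervised pairs, and the weights 1.0 and 0.25 are all invented, and subjects and verbs share one lookup table for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical labeled subject/verb pairs per category.
TRAIN = {
    "Workers Responding": [("firefighters", "spray"), ("rescuers", "search")],
    "Affected People": [("villagers", "mourn"), ("survivors", "wait")],
    "Wreckage": [("plane", "lies"), ("debris", "covers")],
}

# Hypothetical subject/verb pairs from a large unlabeled corpus.
UNSUPERVISED = [
    ("medics", "search"),
    ("medics", "treat"),
    ("rescuers", "carry"),
    ("residents", "mourn"),
]

def build_index(train):
    """Map each training subject or verb to the categories it occurred with."""
    seen = defaultdict(Counter)
    for category, pairs in train.items():
        for subject, verb in pairs:
            seen[subject][category] += 1
            seen[verb][category] += 1
    return seen

def evidence(subject, verb, seen, pairs, strong=1.0, weak=0.25):
    """Collect per-category evidence for one test subject/verb pair."""
    scores = Counter()
    for word, is_subject in ((subject, True), (verb, False)):
        if word in seen:
            # Previously seen word: strong, direct evidence.
            for category, n in seen[word].items():
                scores[category] += strong * n
            continue
        # New word: look up the verbs (or subjects) it was ever paired with.
        if is_subject:
            partners = {v for s, v in pairs if s == word}
        else:
            partners = {s for s, v in pairs if v == word}
        for partner in partners:
            for category, n in seen.get(partner, {}).items():
                scores[category] += weak * n
    return scores

seen = build_index(TRAIN)
new_pair = evidence("medics", "carry", seen, UNSUPERVISED)
known_subject = evidence("villagers", "gather", seen, UNSUPERVISED)
print(new_pair.most_common(1), known_subject.most_common(1))
```

Neither “medics” nor “carry” occurs in the toy training data, yet both lead to Workers Responding through their partners in the unlabeled pairs.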

Subjects and Verbs

First Try • Very simple subject/verb extraction. • Rather small unsupervised corpus. • Very simple similarity metric.

Results of Simple Extraction

Doing It Right • Better subject/verb extraction with parser. • Much larger unsupervised corpus. • Better similarity metric. • Maybe stemming.

Previous Interactions • Combining Pcut and DE for Reuters. • Combining text system and image feature system for Indoor vs. Outdoor. • Using number of people information to improve Indoor vs. Outdoor probabilities.

Those Results • Pcut + DE for Reuters. • Text + Image for In/Out. • Adding # of People for In/Out.

Future Interactions • Improve accuracy. • Determine new information. • Likely use density estimation or bins.

For Example… • Indoor + Disaster → Outdoor. • Indoor + Politics → Meeting / Press Conference.

Closed Captioned Video • Apply image system to video. • New categories. • New modality, new challenges.
669 726 "HEADLINE NEWS" -- I'M KIMBERLEY KENNEDY, IN FOR DAVID GOODNOW.
750 878 THE FIRE HAS BEEN PUT OUT, BUT SO HAVE THE HOPES OF THOUSANDS OF PEOPLE COUNTING ON A VACATION TO MEXICO THIS WEEK.
930 1087 THE CARNIVAL CRUISE LINER "ECSTACY" WAS ONLY TWO MILES INTO ITS TRIP FROM MIAMI TO COZUMEL WHEN A FIRE BROKE OUT IN A LAUNDRY ROOM.
1134 1291 THE COAST GUARD CAME TO THE RESCUE AND DOUSED THE FLAMES AS HUNDREDS OF PASSENGERS DONNING LIFE JACKETS WATCHED ON.
1331 1388 THE FIRE CHARRED THREE LOWER DECKS BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.
...
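Before any categorizer can use the captions, they have to be split into timing and text. In the sample, each caption appears to begin with two integers followed by the spoken text; the sketch below assumes these are start and end positions (frame numbers or time offsets, the slides do not say which) and that each caption sits on its own line. The `Caption` type and `parse_captions` function are hypothetical names.

```python
import re
from typing import List, NamedTuple

class Caption(NamedTuple):
    start: int  # assumed start position (unit not stated in the source)
    end: int    # assumed end position
    text: str

# Two leading integers, then the caption text.
LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(.+?)\s*$")

def parse_captions(lines) -> List[Caption]:
    """Parse caption lines; lines that do not match the pattern are skipped."""
    captions = []
    for line in lines:
        match = LINE.match(line)
        if match:
            start, end, text = match.groups()
            captions.append(Caption(int(start), int(end), text))
    return captions

SAMPLE = [
    '669 726 "HEADLINE NEWS" -- I\'M KIMBERLEY KENNEDY, IN FOR DAVID GOODNOW.',
    "1331 1388 THE FIRE CHARRED THREE LOWER DECKS BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.",
    "...",
]

captions = parse_captions(SAMPLE)
story_text = " ".join(c.text.lower() for c in captions)
print(len(captions), captions[0].start, captions[0].end)
```

The lowercased `story_text` could then be fed to the same text categorizers used for image captions.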

Schedule

Conclusions • Categorization of multimedia data. • Applying novel techniques. • Using NLP for hard categories. • Exploring interactions between systems and categories.
