1st Conference on Email and Anti-Spam, CEAS 2004
Learning to Extract Signature and Reply Lines from Email
Vitor R. Carvalho & William W. Cohen
Carnegie Mellon University
Idea
• Reply lines
• Sig lines
Motivation
• Preprocessing for:
  * email information extraction (names, dates, times, etc.)
  * content-based email classifiers (“Speech Act”, topic, etc.)
• Anonymization of email corpora
• Automatic personal address management
• Email text-to-speech systems
Related work:
• Sproat, Chen & Hu; “Emu: An e-mail preprocessor for text-to-speech”, “geometrical and linguistic analysis for e-mail signature”
Our work:
• 3 tasks:
  * Sig detection (does the message have a signature?)
  * Sig line extraction (in which lines?)
  * Reply line extraction
• Compare state-of-the-art learning algorithms
• Supervised learning
Data
• Total: 33013 lines (3321 sig lines, 5587 reply-to lines)
Sig Detection Task
• Features are extracted from the last K lines of the email message
• Example: if a URL pattern is detected in each of the last 3 lines, then the message representation contains the features url1, url2 and url3
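A minimal Python sketch of this last-K-lines representation. It assumes a simple, hypothetical URL regex and assumes that the index in url1, url2, ... counts lines from the end of the message; the slides specify neither. The name `last_k_features` is illustrative and is not taken from Minorthird.

```python
import re

# Hypothetical URL pattern; the pattern used in the original work is not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def last_k_features(message, k=3):
    """Return positional features computed from the last k lines of a message.

    Assumption: position 1 is the very last line, position 2 the line
    before it, and so on.
    """
    lines = message.splitlines()
    tail = lines[-k:] if k > 0 else []
    features = set()
    for pos, line in enumerate(reversed(tail), start=1):
        if URL_RE.search(line):
            features.add(f"url{pos}")
    return features


if __name__ == "__main__":
    msg = "Hi there\nhttp://a.example.com\nwww.b.example.org\nhttps://c.example.net"
    print(sorted(last_k_features(msg, k=3)))
```

Other line-level patterns (email addresses, phone numbers, and so on) could be added in the same way, each producing one feature per position in the tail.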
Sig Detection Results
• 5-fold cross-validation on 1203 labeled messages (617 positive, 586 negative)
• Sproat et al. (1999): “SIG fields are rarely longer than ten lines”
• Typical mistakes: signatures consisting of an ASCII drawing only, only the nickname of the sender, or only a few quoted sentences
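For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of how a 5-fold cross-validation split over 1203 messages can be generated. It uses only the Python standard library; the function name `k_fold_splits` and the fixed shuffling seed are assumptions for illustration, and the slides do not say how the folds were actually drawn.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: items are shuffled once with a fixed seed and dealt
    round-robin into k folds; each fold serves as the test set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_splits(1203, k=5):
        print(len(train), len(test))
```

A classifier would be trained on each `train` index set and scored on the matching `test` set, with the reported figure being the average over the five folds.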
Signature Extraction Task
• Email message represented as a sequence of lines
• Each line is a set of features (sequential classification)
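The sequence-of-lines representation can be sketched as follows: each line gets its own feature set, extended with the features of its neighboring lines. The specific line features (`blank`, `starts_with_gt`, `email`), the `prev`/`next` naming scheme, and the window size are hypothetical choices for illustration, not the feature set used in the original work.

```python
import re

# Hypothetical email-address pattern, for illustration only.
EMAIL_RE = re.compile(r"\S+@\S+")


def line_features(line):
    """Return a small, illustrative set of features for one line."""
    feats = set()
    if not line.strip():
        feats.add("blank")
    if line.lstrip().startswith(">"):
        feats.add("starts_with_gt")  # common marker of quoted reply text
    if EMAIL_RE.search(line):
        feats.add("email")
    return feats


def sequence_features(message, window=1):
    """Represent a message as one feature set per line.

    Each line's set also includes its neighbors' features, prefixed with
    their relative position (prev1_..., next1_...).
    """
    base = [line_features(line) for line in message.splitlines()]
    rows = []
    for i, own in enumerate(base):
        feats = set(own)
        for off in range(1, window + 1):
            if i - off >= 0:
                feats |= {f"prev{off}_{f}" for f in base[i - off]}
            if i + off < len(base):
                feats |= {f"next{off}_{f}" for f in base[i + off]}
        rows.append(feats)
    return rows


if __name__ == "__main__":
    msg = "> quoted text\nThanks\n--\njoe@example.com"
    for row in sequence_features(msg):
        print(sorted(row))
```

Each resulting feature set would then be passed, in order, to a per-line classifier or a sequential learner that labels the line as signature, reply, or other.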
Last Lines
• Effective method to extract signature and reply lines in email messages
• Sequence-of-lines representation (plus features from neighboring lines)
• Comparison of state-of-the-art learning algorithms
• Implementation available in the Minorthird package (Cohen, 2004)