
Creating and Using a Correlated Corpora to Glean Communicative Commonalities



  1. Creating and Using a Correlated Corpora to Glean Communicative Commonalities

  2. Outline • Motivation • Corpora collection • General Corpora Characteristics • Word count • Readability • Future directions LREC

  3. Motivation • How do computer-mediated communication genres differ from traditional genres? • Genres: email, interview, blog, essay, chat, discussion • How consistent are communicative features across genres for a single individual? • If such commonalities exist, how can they be utilized for document classification?

  4. Email sample (2E1S3) I do not feel that gender discrimination is a problem in the United States at the moment. My supervisor at my current job is a woman, and everyone respects her the same as the owner of the company, who is a man. I think this issue was more prevalent earlier last century. In these modern times, it really is not an issue in my opinion.

  5. Blog sample (2B1S2) While gender discrimination is something that should always be avoided ideally, there are some problems I have with the issue in general. As the discussion starter states, discrimination because of sex is defined as adverse action against another person, that would not have occurred had the person been of another sex.

  6. Chat sample (2C1S1) • Are there a lot of issues like this in the news, because to me generder discrimination is a thing of the past • Aren't men found to be naturally more apt in certain fields, and women in others? • Did any of you experienece any personal discrimination at your jobs, or witness it or anything? • I definitely agree with that • Unless one person decides another person is not right for a job solely based on gender, I don't believe it is discrimination

  7. Aim: Collect correlated corpora of text samples • Including both computer-mediated and non-computer-mediated genres • Including both individual and interactive, spoken and written • Across 6 genres: email, essay, interview (phone), blog, chat, discussion • From the same individuals • On 6 distinct topics

  8. Corpora Collection • September 2006 through November 2007 • Participants • All college students, aged 18-29 • 12 students in pilot study • 21 participants completed both Phase 1 (email, essay, interview) and Phase 2 (blog, chat, discussion) • 10 men, 11 women • 18 Caucasian, 3 African-American • All had English as the primary language spoken at home

  9. Topics • Piloted via individual interviews with a separate group • Selected for • production of expression • comfort of participants with the topic • Topics: • Catholic Church • Gay Marriage • Iraq War • Legalization of Marijuana • Privacy as a U.S. Citizen • Gender Discrimination • Each introduced via a “starter” question

  10. Other Design Issues

  11. All samples stored as .txt files • Interviews and Discussions transcribed • by trained psychology students • punctuation inserted • non-fluencies preserved • Discussion and Chat split into individual files • Multiple blog entries combined into a single file
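
The file preparation on this slide (splitting multi-party Discussion and Chat transcripts into individual files, and merging multiple blog entries into one) could be sketched roughly as follows. This is only an illustration, not the authors' tooling: the "SPEAKER: utterance" line format, the single-token speaker labels, and the function names `split_by_speaker` and `combine_entries` are all assumptions made for the example.

```python
from collections import defaultdict


def split_by_speaker(transcript: str) -> dict[str, str]:
    """Group the turns of a multi-party transcript by speaker.

    Assumed (hypothetical) format: each turn starts with "SPEAKER: text",
    where SPEAKER is a single token. Any other non-empty line is treated
    as a continuation of the current speaker's turn.
    """
    turns: dict[str, list[str]] = defaultdict(list)
    current = None
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        label, sep, rest = line.partition(":")
        label = label.strip()
        if sep and label and " " not in label:
            # A new labelled turn begins here.
            current = label
            line = rest.strip()
        if current is None:
            continue  # ignore text before the first labelled turn
        if line:
            turns[current].append(line)
    return {speaker: "\n".join(lines) for speaker, lines in turns.items()}


def combine_entries(entries: list[str]) -> str:
    """Merge several blog entries into one text, separated by blank lines."""
    return "\n\n".join(e.strip() for e in entries if e.strip())
```

Each value returned by `split_by_speaker` would then be written to its own .txt file, one per participant; the file naming scheme is left out because the slides do not describe it.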

  12. Resulting Corpora • Blog entries were combined into single files. • The 21 fully parallel corpora were used in this paper. • Limitations: size, homogeneity of subjects, non-spontaneity of discourse

  13. General Corpora Characteristics • Word Count • by topic • by genre • by gender of communicant • Readability: Flesch reading ease & Flesch-Kincaid grade level • by topic • by genre • by gender of author
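
The two readability measures named on this slide have standard published formulas: Flesch reading ease = 206.835 - 1.015 (words/sentences) - 84.6 (syllables/words), and Flesch-Kincaid grade level = 0.39 (words/sentences) + 11.8 (syllables/words) - 15.59. A minimal sketch of computing them, together with the word count, is given below. The tokenization and the vowel-group syllable counter are rough assumptions made for illustration; the slides do not say which tool the authors used, and other tools will give somewhat different scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (word count, sentence count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(text: str) -> float:
    """Higher scores indicate easier text."""
    words, sentences, syllables = text_stats(text)
    if words == 0 or sentences == 0:
        raise ValueError("text has no words or sentences")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate U.S. school grade level needed to read the text."""
    words, sentences, syllables = text_stats(text)
    if words == 0 or sentences == 0:
        raise ValueError("text has no words or sentences")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Both scores depend only on average sentence length and average syllables per word, so genres with short sentences and short words score as easier to read.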

  14. Word Count • No main effect for gender • No main effect for topic • Significant topic × gender interaction for Church and Discrimination

  15. Word Count (cont'd) • Significant main effect for genre • Discussion had highest word counts • Direct communication produced higher word counts

  16. Readability • No significant main effect for gender • Significant main effect for genre • Discussion and interview had highest reading ease • Main effect for topic

  17. Readability (cont'd) • Reading ease of conversational genres high • Reading ease of non-conversational genres low

  18. Future Possibilities • additional features for gender ID, authorship • sentence complexity • cohesion of text • feature change across time within a topic • classification by topic order • classification by genre • conversational dynamics in chat vs. discussion

  19. Thank you. Questions? www.cs.loyola.edu/~res
