Keyword Ranking: Sentiment Analysis with BaseX

Seminar XML und Datenbanken, Universität Konstanz. oliver.egli@uni-konstanz.de


  1. Keyword Ranking: Sentiment Analysis with BaseX Seminar XML und Datenbanken Universität Konstanz oliver.egli@uni-konstanz.de

  2. Agenda • Idea • Sentiment Analysis • Algorithm • BaseX Implementation • Quality of Results • Performance • Future Work • Questions

  3. Idea • Future use: machine-based evaluation of news. • News will be collected via RSS for BaseX. • Rate text nodes by whether their content is positive or not. • A general approach was sought so that the function is usable in other projects.

  4. Sentiment Analysis • In the literature, two fundamentally different approaches to building a sentence classification model can be found: • - unsupervised learning approaches (input = a lexicon with a list of positive / negative terms) • - supervised learning approaches (input = preclassified sentences as training data)

  5. Sentiment Analysis (2) • Decision to use the unsupervised approach, because: • Domain independent • No need for training data • Relatively easy-to-understand algorithm • The approach is called Polarity Analysis

  6. Algorithm – basic idea • Lists of signal words • Each word is classified as positive or negative. • Polarity = number of positive signal words – number of negative signal words. • List of negation words • Signal words within a distance of 3 words of a negation word get inverted
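
The counting rule on slide 6 can be sketched in plain Java. This is only an illustration, not the FNSent code from the talk: the three tiny word lists are made up, the tokenizer is a simple split on non-letters, and it is an assumption that a negation word must appear within the three tokens before a signal word to invert it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class PolaritySketch {
    // Hypothetical mini word lists; the lists used in the talk are not reproduced here.
    static final Set<String> POSITIVE = new HashSet<>(Arrays.asList("good", "great", "happy"));
    static final Set<String> NEGATIVE = new HashSet<>(Arrays.asList("bad", "poor", "sad"));
    static final Set<String> NEGATION = new HashSet<>(Arrays.asList("not", "no", "never"));
    // Distance within which a negation word inverts a signal word (slide 6: 3).
    static final int WINDOW = 3;

    /** Returns (positive signal words) minus (negative signal words), with negation handling. */
    public static int pol(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        int score = 0;
        for (int i = 0; i < tokens.length; i++) {
            int value = POSITIVE.contains(tokens[i]) ? 1
                      : NEGATIVE.contains(tokens[i]) ? -1 : 0;
            if (value == 0) continue; // not a signal word
            // Assumption: only negation words in the WINDOW tokens before the signal word count.
            for (int j = Math.max(0, i - WINDOW); j < i; j++) {
                if (NEGATION.contains(tokens[j])) {
                    value = -value;
                    break;
                }
            }
            score += value;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(pol("The food was good but the service was bad"));
        System.out.println(pol("This is not a good idea"));
    }
}
```

In the second sentence "not" sits two tokens before "good", so the positive hit is counted as negative; a negation further than three tokens away would leave the signal word unchanged.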

  7. Algorithm – materials • dbvis (D. Oelke) provided information and materials: • Literature / papers • Two different word lists • Some Java helper classes (BasicFormFinder) • And answers to some questions…

  8. BaseX Implementation • New function called sent:pol() • FNSent • PosNegWordList • NegationWordList

  9. Quality of Results • <article name="Assault" pol="-302"/> • <article name="Irish Civil War" pol="-161"/> • <article name="Garry Kasparov" pol="111"/> • <article name="Microsoft" pol="104"/> • What would you expect for "Down syndrome"?

  10. But… • Down syndrome: sent:pol() = +151 • http://en.wikipedia.org/wiki/Down_syndrome • Cuisine of the United States: sent:pol() = +81 • http://en.wikipedia.org/wiki/Cuisine_of_the_United_States

  11. Normalization • How to interpret a pol value of +1873? • It means: the text has 1873 more positive words than negative words. • Very positive for a text with 2,000 words! • What about a text with 500,000 words? • Additional function: sent:normedpol() • Calculates the proportion of positive / negative words and maps it to the interval [-1; 1], where -1 is negative and 1 is positive.
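
The slides do not give the formula behind sent:normedpol(). A minimal sketch, assuming it is (positive - negative) / (positive + negative), which maps into [-1, 1] and agrees with the values reported later in the deck (for example, pol = 104 with norm of about 0.437 corresponds to 238 signal words in total). The method name, the zero-hit fallback, and the counts 171 and 67 are illustrative, not taken from the talk.

```java
public class NormedPolSketch {
    /**
     * Maps signal-word counts to [-1, 1]: -1 if all hits are negative, 1 if all are positive.
     * Assumed formula: (positive - negative) / (positive + negative).
     */
    public static double normedPol(int positive, int negative) {
        int total = positive + negative;
        if (total == 0) {
            return 0.0; // assumption: a text without signal words is treated as neutral
        }
        return (double) (positive - negative) / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts chosen so that pol = 171 - 67 = 104.
        System.out.println(normedPol(171, 67));
    }
}
```

Dividing by the number of signal words rather than by the total word count keeps the score independent of text length, which is the comparability problem the slide raises.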

  12. Normalization (2) • Compare with bible.xml: • sent:pol() = +5022 • sent:normedpol() = 0.1082207105240602

  13. More examples • <art name="Microsoft Windows" pol="34" norm="0.3541666666666665"/> • <art name="Microsoft" pol="104" norm="0.4369747899159664"/> • <art name="Apple Inc." pol="135" norm="0.5018587360594795"/> • <art name="Bill Clinton" pol="154" norm="0.23619631901840488"/> • <art name="Angela Merkel" pol="101" norm="0.5179487179487179"/> • <art name="Barack Obama" pol="220" norm="0.32738095238095233"/> • <art name="Kim Jong-il" pol="43" norm="0.11590296495956887"/>

  14. Performance • Test file: enwiki-20100130-pages-articles.xml (Wikipedia dump) • DB size: 28.5 GB • Test query:
      (for $p in //*:page
       let $t := $p/*:title
       let $s := sent:pol($p//*:text)
       where $s != 0
       return <article name="{ $t }" sentiment="{ $s }"/>)[position() = 1 to 10000]
      • First run: time needed: 51856 ms

  15. Optimization • Replaced BasicFormFinder with the BaseX stemmer that was presented last week. • Result: second run: time needed: 17374 ms

  16. Optimization (2) • Changed PosNegWordLists to TokenSets. • Changed NegationWordList from an array to a TokenSet. • Changed all string types to byte[]-level tokens. • Result: the third test run is about another 8 seconds faster. • Time needed: 9661 ms
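
Why moving a word list from an array to a hash-based set helps can be shown with a small sketch. It uses java.util.HashSet over String as a stand-in; the TokenSet class named on the slide works on byte[] tokens inside BaseX and is not used here, and the negation words are made-up sample data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LookupSketch {
    /** Array lookup: compares the token against every entry, so cost grows with list size. */
    public static boolean inArray(String[] words, String token) {
        for (String w : words) {
            if (w.equals(token)) return true;
        }
        return false;
    }

    /** Hash-set lookup: one hash computation plus a bucket probe, independent of list size on average. */
    public static boolean inSet(Set<String> words, String token) {
        return words.contains(token);
    }

    public static void main(String[] args) {
        String[] negations = {"not", "no", "never", "neither", "nor"}; // sample data
        Set<String> negationSet = new HashSet<>(Arrays.asList(negations));
        for (String token : new String[] {"never", "always"}) {
            System.out.println(token + ": array=" + inArray(negations, token)
                    + " set=" + inSet(negationSet, token));
        }
    }
}
```

Both lookups return the same answers; the difference is only in cost, which matters when every token of a multi-gigabyte dump is checked against each list.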

  17. Results • A function that can help with simple classification of text. • Using normedpol() makes the results comparable. • It is not an exact measurement of the sentiment in a text. • It "just" counts words. • But it shows a tendency.

  18. Conclusions / Results • No solution for identifying sarcastic or ironic text. • No training data needed, and thus usable on any topic. • Reusable function for other BaseX users. • It is up to the user to decide how much sense the function makes on different kinds of text.

  19. Conclusions / Results • Experience with applying the algorithm to news text is that it works well. (dbvis) • Should fit well with further work, where the goal will probably be to gain new information from huge amounts of news data in BaseX using these functions.

  20. Future Work • More testing • Finalize the code • Integrate the function with RSS import (Master Project) • Visualization (Master Thesis)

  21. Discussion
