Keyword Ranking: Sentiment Analysis with BaseX

Seminar XML und Datenbanken, Universität Konstanz, oliver.egli@uni-konstanz.de

Agenda
• Idea
• Sentiment Analysis
• Algorithm
• BaseX Implementation
• Quality of Results
• Performance
• Future Work
• Questions

Idea
• Future use for machine-based evaluation of news.
• News will be collected via RSS for BaseX.
• Rate text nodes by whether their content is positive or not.
• A general approach was sought, to make the function usable in other projects.

Sentiment Analysis
• In the literature, two fundamentally different approaches for building a sentence classification model can be found:
  - unsupervised learning approaches (input = a lexicon with a list of positive / negative terms)
  - supervised learning approaches (input = preclassified sentences as training data)

Sentiment Analysis (2)
• Decision to use the unsupervised approach, because:
• Domain independent
• No need for training data
• The algorithm is relatively easy to understand
• The approach is called Polarity Analysis

Algorithm – basic idea
• Lists of signal words
• Each word is classified as positive or negative.
• Polarity = number of positive signal words minus number of negative signal words.
• List of negation words
• Signal words within a distance of 3 of a negation word get inverted
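The counting scheme on this slide can be sketched in a few lines of Python. This is only an illustration, not the BaseX implementation: the word lists are tiny hypothetical stand-ins for the dbvis lists, the tokenizer is a plain regular expression, and "distance of 3" is assumed to mean a negation word at most three tokens before the signal word.

```python
import re

# Hypothetical mini word lists; the real lists were provided by dbvis.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "terrible", "sad"}
NEGATIONS = {"not", "no", "never"}
WINDOW = 3  # assumed meaning of "distance of 3": tokens before the signal word


def polarity(text: str) -> int:
    """Return (#positive signal words) - (#negative signal words),
    inverting a signal word that follows a negation within WINDOW tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    negation_positions = [i for i, tok in enumerate(tokens) if tok in NEGATIONS]
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            value = 1
        elif tok in NEGATIVE:
            value = -1
        else:
            continue  # not a signal word
        # Invert the contribution if a negation word precedes it closely enough.
        if any(0 < i - n <= WINDOW for n in negation_positions):
            value = -value
        score += value
    return score
```

With these toy lists, `polarity("This is not good")` counts "good" as negative because "not" directly precedes it.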

Algorithm – materials
• dbvis (d.oelke) provided information and materials:
• Literature / papers
• Two different word lists
• Some Java helper classes (BasicFormFinder)
• And answers to some questions…

BaseX Implementation
• New function called sent:pol()
• FNSent
• PosNegWordList
• NegationWordList

Quality of Results
• <article name="Assault" pol="-302"/>
• <article name="Irish Civil War" pol="-161"/>
• <article name="Garry Kasparov" pol="111"/>
• <article name="Microsoft" pol="104"/>
• What would you expect for "Down syndrome"?

But…
• Down syndrome: sent:pol() = +151
• http://en.wikipedia.org/wiki/Down_syndrome
• Cuisine of the United States: sent:pol() = +81
• http://en.wikipedia.org/wiki/Cuisine_of_the_United_States

Normalization
• How to interpret a pol value of +1873?
• It means: we have 1873 more positive words than negative words in a text.
• Very positive for a text with 2,000 words!
• What about a text with 500,000 words?
• Additional function: sent:normedpol()
• Calculates the proportion of positive to negative words and maps it to the interval [-1; 1], where -1 is negative and 1 is positive.
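The slides do not give the formula behind sent:normedpol(). One plausible reading, sketched below as an assumption, divides the polarity by the total number of signal words: (pos - neg) / (pos + neg). The function name and the zero-signal-word convention are hypothetical.

```python
def normed_polarity(positive: int, negative: int) -> float:
    """Map signal word counts to [-1, 1]: -1 = only negative, 1 = only positive.

    Assumed formula (not confirmed by the slides): (pos - neg) / (pos + neg).
    """
    total = positive + negative
    if total == 0:
        return 0.0  # assumed convention for texts without any signal words
    return (positive - negative) / total
```

Under this reading, a pol of +1873 is a strong signal if the text contains about 2,000 signal words and a weak one if it contains several hundred thousand, which is the distinction the slide asks for.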

Normalization (2)
• Compare with bible.xml:
• sent:pol() = +5022
• sent:normedpol() = 0.1082207105240602

More examples
• <art name="Microsoft Windows" pol="34" norm="0.3541666666666665"/>
• <art name="Microsoft" pol="104" norm="0.4369747899159664"/>
• <art name="Apple Inc." pol="135" norm="0.5018587360594795"/>
• <art name="Bill Clinton" pol="154" norm="0.23619631901840488"/>
• <art name="Angela Merkel" pol="101" norm="0.5179487179487179"/>
• <art name="Barack Obama" pol="220" norm="0.32738095238095233"/>
• <art name="Kim Jong-il" pol="43" norm="0.11590296495956887"/>

Performance
• Test file: enwiki-20100130-pages-articles.xml (Wikipedia dump)
• DB size: 28.5 GB
• Test query:
    (for $p in //*:page
     let $t := $p/*:title
     let $s := sent:pol($p//*:text)
     where $s != 0
     return <article name="{ $t }" sentiment="{ $s }"/>)[position() = 1 to 10000]
• First run: time needed: 51856 ms

Optimization
• Replaced BasicFormFinder with the BaseX stemmer that was presented last week.
• Result: second run: time needed: 17374 ms

Optimization (2)
• Changed PosNegWordLists to TokenSets.
• Changed NegationWordList from Array to TokenSet.
• Changed all string types to byte[] level (tokens).
• Result: third test run is about another 8 seconds faster.
• Time needed: 9661 ms
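The gain from switching to set-based lookups on byte tokens can be illustrated outside BaseX. The sketch below is not BaseX code: a Python `frozenset` of `bytes` stands in for a TokenSet, a plain list stands in for the array, and the word list is synthetic. The helper `compare_lookup_times` is hypothetical and only reports timings on the machine it runs on.

```python
import timeit

# Synthetic word list encoded as byte tokens (stand-in for the real word lists).
WORDS = [f"word{i}".encode("utf-8") for i in range(5000)]
WORD_LIST = list(WORDS)      # linear scan on every lookup
WORD_SET = frozenset(WORDS)  # hash lookup, independent of list length on average


def in_list(token: bytes) -> bool:
    return token in WORD_LIST


def in_set(token: bytes) -> bool:
    return token in WORD_SET


def compare_lookup_times(token: bytes = b"word4999", repeats: int = 1000):
    """Return (list_seconds, set_seconds) for `repeats` membership tests."""
    list_time = timeit.timeit(lambda: in_list(token), number=repeats)
    set_time = timeit.timeit(lambda: in_set(token), number=repeats)
    return list_time, set_time
```

Both lookups return the same answer; only the cost per lookup differs, which matters when every token of a 28.5 GB database is checked against the word lists.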

Results
• A function that can help with a simple classification of text.
• Use of normedpol() makes the results comparable.
• It is not an exact measurement of the sentiment in a text.
• It "just" counts words.
• But it shows a tendency.

Conclusions / Results
• No solution for identifying sarcastic or ironic text.
• No training data needed, and thus usable on any topic.
• Reusable function for other BaseX users.
• It is up to the user to decide how much sense the function makes on different kinds of text.

Conclusions / Results (2)
• Experience with applying the algorithm to news text is that it works well. (dbvis)
• Should fit well with further work, where the goal will probably be to gain new information from huge amounts of news data in BaseX using these functions.

Future Work
• More testing
• Finalize the code
• Integrate the function with RSS import (Master Project)
• Visualization (Master Thesis)

Discussion
