Keyword Ranking: Sentiment Analysis with BaseX • Seminar "XML und Datenbanken" (XML and Databases) • Universität Konstanz • oliver.egli@uni-konstanz.de
Agenda • Idea • Sentiment Analysis • Algorithm • BaseX Implementation • Quality of Results • Performance • Future Work • Questions
Idea • Intended future use: machine-based evaluation of news. • News will be collected via RSS for BaseX. • Rate text nodes by whether their content is positive or not. • A general approach was sought, to make the function usable in other projects.
Sentiment Analysis • In the literature, two fundamentally different approaches for building a sentence classification model can be found: • - unsupervised learning approaches (input = a lexicon with a list of positive / negative terms) • - supervised learning approaches (input = preclassified sentences as training data)
Sentiment Analysis (2) • Decision to use the unsupervised approach, because: • Domain independent • No need for training data • Relatively easy to understand algorithm • The approach is called: Polarity Analysis
Algorithm – basic idea • Lists of signal words • Each word is classified as positive or negative. • Polarity = number of positive signal words minus number of negative signal words. • List of negation words • Signal words within a distance of 3 of a negation word get inverted
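The counting scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BaseX implementation. The word lists are tiny hypothetical stand-ins for the dbvis lists. The "distance of 3" is read here as "a negation word among the three tokens before the signal word", which the slide does not specify. Stemming to basic forms is omitted.

```python
import re

# Hypothetical mini word lists; the real lists were provided by dbvis.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "poor", "sad"}
NEGATIONS = {"not", "no", "never"}
NEGATION_DISTANCE = 3  # from the slide: distance of 3


def polarity(text: str) -> int:
    """Return (#positive signal words) - (#negative signal words),
    inverting a signal word if a negation precedes it closely."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1
        elif token in NEGATIVE:
            value = -1
        else:
            continue  # not a signal word
        # Assumption: only the 3 tokens *before* the signal word are checked.
        window = tokens[max(0, i - NEGATION_DISTANCE):i]
        if any(w in NEGATIONS for w in window):
            value = -value
        score += value
    return score
```

For example, `polarity("this is not good")` counts "good" as negative because "not" lies within the three preceding tokens.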
Algorithm – materials • dbvis (d.oelke) provided information and materials: • Literature / papers • Two different word lists • Some Java helper classes (BasicFormFinder) • And answers to some questions…
BaseX Implementation • New function called sent:pol() • FNSent • PosNegWordList • NegationWordList
Quality of Results • <article name="Assault" pol="-302"/> • <article name="Irish Civil War" pol="-161"/> • <article name="Garry Kasparov" pol="111"/> • <article name="Microsoft" pol="104"/> • What would you expect for «Down syndrome»?
But… • Down syndrome: sent:pol() = +151 • http://en.wikipedia.org/wiki/Down_syndrome • Cuisine of the United States: sent:pol() = +81 • http://en.wikipedia.org/wiki/Cuisine_of_the_United_States
Normalization • How to interpret a pol value of +1873? • It means: the text contains 1873 more positive words than negative words. • Very positive for a text with 2,000 words! • What about a text with 500,000 words? • Additional function: sent:normedpol() • Calculates the proportion of pos / neg words and maps it to the interval [-1; 1], where -1 is negative and 1 is positive.
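The slide does not state the exact formula behind sent:normedpol(). A minimal sketch, assuming it is (pos − neg) / (pos + neg): this always falls in [-1, 1] and is consistent with the pol/norm pairs shown elsewhere in the deck. For example, Microsoft's pol="104" and norm ≈ 0.437 would correspond to 171 positive and 67 negative signal words under this assumption. The function name and the zero-signal-word convention below are illustrative choices, not taken from the source.

```python
def normed_polarity(pos: int, neg: int) -> float:
    """Assumed normalization: (pos - neg) / (pos + neg), in [-1, 1]."""
    total = pos + neg
    if total == 0:
        # Illustrative convention: a text without signal words is neutral.
        return 0.0
    return (pos - neg) / total


# Counts derived from pol=104 under the assumed formula (171 - 67 = 104).
microsoft = normed_polarity(171, 67)
print(round(microsoft, 4))
```

With this normalization, a pol of +1873 is strong for a short text with few signal words and weak for a very long text with many, which is the comparison the slide asks for.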
Normalization (2) • Compare: bible.xml • sent:pol() = +5022 • sent:normedpol() = 0.1082207105240602
More examples • <art name="Microsoft Windows" pol="34" norm="0.3541666666666665"/> • <art name="Microsoft" pol="104" norm="0.4369747899159664"/> • <art name="Apple Inc." pol="135" norm="0.5018587360594795"/> • <art name="Bill Clinton" pol="154" norm="0.23619631901840488"/> • <art name="Angela Merkel" pol="101" norm="0.5179487179487179"/> • <art name="Barack Obama" pol="220" norm="0.32738095238095233"/> • <art name="Kim Jong-il" pol="43" norm="0.11590296495956887"/>
Performance • Test file: enwiki-20100130-pages-articles.xml (Wikipedia dump) • DB size: 28.5 GB • Test query: • (for $p in //*:page • let $t := $p/*:title • let $s := sent:pol($p//*:text) • where $s != 0 • return <article name="{ $t }" sentiment="{ $s }"/>)[position() = 1 to 10000] • First run: time needed: 51856 ms
Optimization • Replaced BasicFormFinder with the BaseX stemmer that was presented last week. • Result: second run: time needed: 17374 ms
Optimization (2) • Changed PosNegWordLists to TokenSets. • Changed NegationWordList from Array to TokenSet. • Changed all string types to byte[] level (tokens). • Result: the third test run is roughly another 8 seconds faster. • Time needed: 9661 ms
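One plausible reason for this speedup is that a hash-based set answers "is this token a negation word?" in roughly constant time, while an array has to be scanned entry by entry for every token. The Python sketch below is only an analogy: it uses `frozenset` and `bytes` as stand-ins for the TokenSet and byte[] tokens named on the slide, and the word list is hypothetical.

```python
# Hypothetical negation list, stored as raw bytes instead of strings.
NEGATIONS_ARRAY = [b"not", b"no", b"never"]
NEGATIONS_SET = frozenset(NEGATIONS_ARRAY)


def is_negation_array(token: bytes) -> bool:
    """Linear scan: cost grows with the length of the word list."""
    for candidate in NEGATIONS_ARRAY:
        if candidate == token:
            return True
    return False


def is_negation_set(token: bytes) -> bool:
    """Hash lookup: cost is roughly independent of the list length."""
    return token in NEGATIONS_SET
```

Both functions return the same answers. The difference only shows up in running time once the word lists are large and the lookup runs for every token of a large corpus.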
Results • Function that can help with an easy classification of text. • Use of normedpol() makes the results comparable. • It is not an exact measurement of the sentiment in a text. • It "just" counts words. • But it shows a tendency.
Conclusions / Results • No solution for identifying sarcastic/ironic text. • No training data needed, and thus usable on any topic. • Reusable function for other BaseX users. • It is up to the user to decide how much sense the function makes on different kinds of text.
Conclusions / Results (2) • Experience with applying the algorithm to news text is that it works well. (dbvis) • Should fit well with further work, where the goal will probably be to gain new information from huge amounts of news data in BaseX, using these functions.
Future Work • More testing • Finalize the code • Integrate function with RSS import (Master Project) • Visualization (Master Thesis)