eTRACES at GESIS. Brigitte Mathiak, Farag Ahmed and Andreas Oscar Kempf. brigitte.mathiak@gesis.org. Leipzig, 07-05-2012.
eTRACES for Social Sciences • Text Re-Use • Context of the quotation • Knowledge transfer
Who cites whom? Transfer of ideas Text Re-Use Who influences whom? Why? Analysis • Tracking ideas through time for a number of applications: • Better ranking • Better filtering (based on ideas, not words) • Objective criteria on idea generation • To help literature analysis • Motivation of the author • Strengthening own arguments • Information for the reader • Separation • Critique • …
Why eTRACES* is interesting • Text re-use instead of bibliometrics to find inter-document relationships • We are the first to apply this to social science texts • Analysis of citation intention • Results become immediately available to the end user *from the GESIS point of view
AP 5.1 Social Scientific Annotation • The Habermas-Luhmann debate • Habermas, Jürgen/Luhmann, Niklas (1971) Theorie der Gesellschaft oder Sozialtechnologie. Was leistet die Systemforschung? Frankfurt/Main: Suhrkamp. • We chose about 30 documents in that context • The texts are annotated with CiTO (Citation Typing Ontology) • Two dimensions: intention and type • The method is based on qualitative social science research • Especially reconstructive and sequential analysis
Theoretical background for the Methodology • Narrative theory („Erzähltheorie“) by Fritz Schütze (1976, 1977) • Development of central categories for the formal analysis of stories („Erzählungen“) • Distinction between three different modes: story, description, argumentation/evaluation • Expansion for this project: • Distinction between direct and indirect citation and paraphrasing/summarizing of authors in scientific texts
Methodology • We start with reconstructive and sequential text analysis • When looking at citations, the functional reason for the citation is most important; it can be deduced from the overall context • Each text is divided into segments, and the segments are differentiated according to mode • That way, describing, argumentative and evaluating passages can be identified and distinguished from the summarizing passages
CiTO • Excerpt from CiTO (Citation Typing Ontology)
Direct citation with text re-use • Text re-use w/o source: Friedrich Schiller, Wilhelm Tell I,3 / Tell • Paraphrase with source
Cites as authority • Includes quotation from • Refutes
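A minimal sketch of how the three relation types named above could be recorded in machine-readable form. It assumes the corresponding CiTO properties (`citesAsAuthority`, `includesQuotationFrom`, `refutes`) under the CiTO namespace; the document IRIs and the helper `cito_triple` are hypothetical and not part of the project's tooling.

```python
# CiTO namespace; the three properties mirror the relation types on the slide.
CITO = "http://purl.org/spar/cito/"
RELATIONS = {"citesAsAuthority", "includesQuotationFrom", "refutes"}


def cito_triple(citing_iri, relation, cited_iri):
    """Return one N-Triples line stating that `citing_iri` relates to `cited_iri`."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation in this sketch: {relation}")
    return f"<{citing_iri}> <{CITO}{relation}> <{cited_iri}> ."


# Hypothetical document identifiers, for illustration only.
line = cito_triple("http://example.org/doc/a", "refutes", "http://example.org/doc/b")
print(line)
```

Storing annotations as triples like this would keep them usable both for training algorithms and for bibliometric queries, but the actual storage format is not stated in the slides.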
Goals of the annotation • The annotation will be (and already is) used to • Train algorithms to find similar patterns automatically • Conduct original social scientific research • Support bibliometric research at our institute • We plan to annotate a second, distinctly different data set before the end of the project for comparison
AP 2 Data Cleansing • The DGS corpus has 5,594 documents with 523,834 unique terms • It includes the proceedings of the German Society for the Social Sciences spanning 100 years • The texts are mostly German and available as PDF • Some are derived from OCR; newer ones have been converted directly
AP 2 Data Cleansing [diagram; labels: (Re-)OCR • Text Extraction and Clean-up • Cleaned data • Unified Database • CitationRecommender: StandardSearch, FilteredSearch, VisualizedSearch, SentimentAnalysis]
AP 2 Data Cleansing • PDF conversion proves to be difficult. Some of the OCR was based on bad scans; there are missing line breaks and irregular spaces • By using a dictionary method, we are able to correct most of those mistakes automatically • Examples: scan with bad quality; letters spread too far apart, leading to spaces inside words
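The slides do not describe the dictionary method in detail. The following is a minimal sketch of one way such a repair could work, assuming a plain word list is available: adjacent fragments are joined when the joined form is a known word and the first fragment is not. The function name `repair_spaces`, the `max_parts` limit and the tiny word list are assumptions for illustration; punctuation handling is left out.

```python
def repair_spaces(text, dictionary, max_parts=8):
    """Join fragments split by spurious spaces when the joined form is a known word.

    `dictionary` is a set of lowercase words. Tokens that are already known
    words are left alone; otherwise the longest run of up to `max_parts`
    fragments that forms a known word is merged.
    """
    tokens = text.split()
    repaired = []
    i = 0
    while i < len(tokens):
        merged_len = 0
        if tokens[i].lower() not in dictionary:
            longest = min(max_parts, len(tokens) - i)
            for n in range(longest, 1, -1):  # try the longest run first
                if "".join(tokens[i:i + n]).lower() in dictionary:
                    merged_len = n
                    break
        if merged_len:
            repaired.append("".join(tokens[i:i + merged_len]))
            i += merged_len
        else:
            repaired.append(tokens[i])
            i += 1
    return " ".join(repaired)


# Tiny illustrative word list; a real run would use a full German dictionary.
known = {"die", "moderne", "gesellschaft"}
print(repair_spaces("die Gesell schaft", known))
print(repair_spaces("M o d e r n e", known))
```

Checking the first fragment against the dictionary before merging is a deliberate choice in this sketch: it avoids gluing together two words that are each valid on their own.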
Data Cleansing Statistics* • 155 untreated OCR documents were automatically identified and re-OCRed • *Function words were excluded from the corpus statistics
Example: Topic Trends over time • Topic 1: frankfurt, luhmann, moderne, theorie, begriff, form, modernen, ordnung, macht, subjekt, soziologische, unterscheidung, differenz, sinn • Topic 2: internet, beziehungen, evaluation, daten, forschung, methoden, qualitative, online, verfahren, informationen, sozialforschung, netzwerke, gruppe, netzwerk • Topic 3: deutschen, menschen, geschichte, deutsche, deutschland, jahrhunderts, jahre, gesellschaft, jahren, jahrhundert, welt, kultur, revolution, krieg
First results with text re-use • At first we found mainly references and duplicate documents • The algorithm is very robust against wrong space recognition • Example: • In fact, Weber’s sociology of religion turned into an ambivalent intellectualist and moralistic affirmation of asceticism, individualism, professionalism, and institutional rationalization. • The affirmative project of modernity is largely engaged in a reversion of Nietzsche’s critique, turning it into an ambivalent intellectualist and moralistic affirmation of asceticism, individualism, professionalism, and institutional rationalization. • We can see here that the context and intention are important
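To make the example concrete, here is a small sketch that finds the longest word sequence shared by the two sentences. It uses Python's standard `difflib` module and is only a stand-in for illustration; it is not the re-use detection algorithm used in the project, and it does not reproduce that algorithm's robustness to wrong spaces. The function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def longest_shared_passage(text_a, text_b):
    """Return the longest run of consecutive words that occurs in both texts."""
    a, b = words(text_a), words(text_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


sentence_a = ("In fact, Weber’s sociology of religion turned into an ambivalent "
              "intellectualist and moralistic affirmation of asceticism, individualism, "
              "professionalism, and institutional rationalization.")
sentence_b = ("The affirmative project of modernity is largely engaged in a reversion "
              "of Nietzsche’s critique, turning it into an ambivalent intellectualist "
              "and moralistic affirmation of asceticism, individualism, "
              "professionalism, and institutional rationalization.")

shared = longest_shared_passage(sentence_a, sentence_b)
print(len(shared), " ".join(shared))
```

The shared passage is identical in both sentences, yet it is attributed to Weber in one and to a reversal of Nietzsche in the other, which is why the surrounding context matters.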
Components – Current Status • StandardSearch is implemented ([Histo] Suche). • FilteredSearch: near duplicates can be filtered, based on the “Tracer” tool from ASV Leipzig; more filters are to come • VisualizedSearch: supports users in exploring and navigating through the displayed results; still in the concept phase • SentimentSearch: improves the retrieved results by recommending specific articles to the user; also still in the concept phase
CitationRecommender • Integrates all tools, available as mock-up
Sentiment Analysis • The task of determining whether the opinion expressed in a piece of text is positive, negative, or neutral • Why sentiment analysis is important: • It supports decision making (hearing others' opinions about a certain thing) • For our goal, it supports citation search, e.g., ranking based on the quality of a work rather than on citation frequency
Sentiment Analysis of Citation: Challenges • Citation context extraction: • Citation context boundaries can vary greatly. Therefore a fixed window size might not effectively include all citation terms • Citations in close proximity can interact with each other, which leads to ownership ambiguity for the surrounding words • The citing author's motivation is not easy to identify automatically, e.g., is it persuasion, notifying the reader about something, or a positive, negative or neutral meaning
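The fixed-window approach mentioned above can be sketched as follows. This is an illustrative baseline, not the project's extraction method: the regular expression only covers simple "(Author Year, ...)" markers, and the function name and the `window` parameter are assumptions.

```python
import re

# Matches simple author-year markers such as "(Marcia 1980, S.161)".
CITATION = re.compile(r"\((?P<author>[^\W\d_]+)\s+(?P<year>\d{4})[^)]*\)")


def citation_contexts(text, window=5):
    """Return, for each citation marker, the `window` words before and after it."""
    contexts = []
    for match in CITATION.finditer(text):
        words_before = text[:match.start()].split()
        words_after = text[match.end():].split()
        contexts.append({
            "author": match.group("author"),
            "year": int(match.group("year")),
            "before": words_before[-window:] if window > 0 else [],
            "after": words_after[:window],
        })
    return contexts


sample = ("Das Prozeßergebnis läßt sich in den Termini von Marcia "
          "(Marcia 1980, S.161) zwar gut beschreiben, die Prozeßqualität bleibt im Verborgenen.")
for context in citation_contexts(sample, window=3):
    print(context)
```

With a small window the evaluative second half of the sentence is cut off, which illustrates why a fixed window size may miss relevant citation terms.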
Sentiment Analysis of Citation: Challenges • Not much work has been done in this regard, but it is very relevant, especially for the humanities • It is a very tough problem; therefore even small advances are valuable, e.g. semi-automatic or partial processes
Sentiment Analysis of Citation: Main Components • In order to perform a sentiment analysis of citation, we need: • Citing author information (name, address, organization etc.) • Citing article information (paper id, title, place of publication etc.) • Citation context (which words the citing author used to describe the cited article) • Cited article information • Author information • Paper information
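The components listed above could be held in a simple record structure. The sketch below is one possible layout using Python dataclasses; the class and field names are assumptions derived from the bullet points, not the project's actual schema, and the citing paper's id and title are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    name: str
    address: str = ""
    organization: str = ""


@dataclass
class Paper:
    paper_id: str
    title: str
    place_of_publication: str = ""
    authors: List[Author] = field(default_factory=list)


@dataclass
class Citation:
    citing: Paper   # article that contains the citation
    cited: Paper    # article being referred to
    context: str    # words the citing author used around the citation


# The cited work comes from the example used in this deck.
cited = Paper("marcia1980", "Identity in adolescence", "New York",
              [Author("James E. Marcia")])
citing = Paper("D209", "(hypothetical citing article)")
citation = Citation(citing, cited, "läßt sich in den Termini von Marcia zwar gut beschreiben")
print(citation.cited.authors[0].name)
```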
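Recommendation/Sentiment Search Overview [diagram; labels: query • Retrieval Model • Initial retrieved top n-documents • Documents Database • Citation context extraction • Author’s info. extraction • Sentiment analysis • Re-Ranking • Recommended top n-documents based on citation context analysis • Evaluation • User feedback]
Example citation context: „Das Prozeßergebnis auf seiten des Individuums läßt sich in den Termini von Marcia (Marcia 1980, S.161) zwar gut beschreiben, die Prozeßqualität und insbesondere die 'Transaktionen' ... zwischen Außen- und Innenwelt bleiben im Verborgenen.“ [English: The outcome of the process on the part of the individual can be described well in Marcia's terms (Marcia 1980, p. 161), but the quality of the process and especially the 'transactions' ... between the outer and inner world remain hidden.]
Cited reference: Marcia, James E. (1980). Identity in adolescence. In Joseph Adelson (Ed.), Handbook of adolescent psychology (pp. 159-187). New York: Wiley.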
Documents Re-ranking Cycle [diagram; labels: query • Initial retrieved documents (D1, D2, D3, ..., Dn) • Process next document • Paper info. extraction • Paper id • Documents database • Cited on (Dm, D209, D1020) • Citation context • Citation context analysis to re-weight initial retrieved documents • Documents Re-ranking (D3, D2, D1, ..., Dn)]
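A minimal sketch of the re-weighting idea in this cycle, under stated assumptions: each initially retrieved document has a retrieval score, each citation context that mentions it has already been assigned a polarity between -1 and 1, and the two are combined linearly. The function name, the weight `alpha` and all scores are hypothetical; the slides do not specify the actual formula.

```python
def rerank(initial_results, context_polarities, alpha=0.5):
    """Re-rank documents by retrieval score plus weighted mean citation polarity.

    initial_results: list of (doc_id, retrieval_score) pairs.
    context_polarities: dict mapping doc_id to a list of polarity values in [-1, 1],
        one per citation context in which the document is cited.
    alpha: assumed weight of the citation-context evidence.
    """
    rescored = []
    for doc_id, score in initial_results:
        polarities = context_polarities.get(doc_id, [])
        mean_polarity = sum(polarities) / len(polarities) if polarities else 0.0
        rescored.append((doc_id, score + alpha * mean_polarity))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: D1 is retrieved first but cited critically, D3 favorably.
initial = [("D1", 0.9), ("D2", 0.8), ("D3", 0.7)]
polarities = {"D1": [-1.0, -0.5], "D3": [1.0, 0.5]}
print([doc_id for doc_id, _ in rerank(initial, polarities)])
```

Documents without any citation context keep their retrieval score unchanged, so the re-ranking only moves documents for which citation evidence exists.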
More Future Work • Based on the citation temperature used in eAqua • We will color frequently cited works (e.g. Luhmann, Marx, Weber) based on the citation context • Sections that are cited approvingly will be distinguishable from controversial sections and from negatively judged sections
Conclusion • Text re-use instead of bibliometrics to find inter-document relationships • Build an interactive tool to effectively support citation context extraction • In eTRACES, citation ranking will be based on the quality of a work rather than on citation frequency • CitationRecommender supports social scientists in performing their information search tasks effectively