Tags in the cloud: Crowdsourcing semantic annotation with CATMA
Jan Christoph Meister, University of Hamburg
www.catma.de
CATMA - an integrated textual markup and analysis tool CLARIN's Turn Towards The Literary Text
Text vs. sentence, or: What's so different about processing texts?
• structural complexity: min TEXT > 2 (SENTENCE)
• structural activity: TEXT processing actualizes paradigmatic cross-reference across sentences
• structural dynamic: TEXT processing represents & simulates cognitive and empirical processes
TEXT yields more INTERPRETATIONS than SENTENCE
+ CONTINGENCY: The more complex & dynamic structure, when activated during processing, results in a higher degree of contingency in the functional "outcome"
The what and why of markup: procedural, descriptive & discursive function
(ordered from discursive function to performative function)
• discursive markup: enables human readers to interpret a text and to explore its hermeneutic potential in collaboration. "What might this text mean to us?"
• declarative markup: informs a human reader how to process a text as a communicative device. "How is this text put together and how does it function in its communicative universe?"
• procedural markup: instructs a (natural or artificial) text processor how to handle a text as a structured character string. "What is the correct operation to perform on this input?"
Hermeneutic "must haves" of discursive markup
• facilitate collaboration & non-deterministic annotation
• allow for multiple markup
• allow for overlap
• allow for concurrent tagging
• conceptualize markup as dynamic & recursive
• allow for extensibility
• allow for multiple (and even contradictory) markup
• seamlessly integrate markup and analysis & support the hermeneutic loop
Markup types & data models
• stand-off, discursive (data model: network)
  <1,5, word class = "Preposition"> <1,5, segment = "SentenceStart"> <1,8, POS = "noun phrase"> <1,5, word class = "Adverb"> <1,38, speech act = "declaration"> <1,11, POS = "verb phrase">
  There is no such thing as "no-mark up".
• stand-off, descriptive (data model: relational)
  <1,5, word class = "Adverb"> <1,5, segment = "SentenceStart"> <1,5, POS = "verb phrase element">
  There is no such thing as "no-mark up".
• nested inline, deterministic (data model: sequential)
  <SentenceStart><Adverb>There</Adverb></SentenceStart> is no such thing as "no-mark up".
• inline, deterministic (data model: linear)
  <SentenceStart>There</SentenceStart> is no such thing as "no-mark up".
• implicit (data model: opaque)
  There is no such thing as "no-mark up". (Coombs, Renear, DeRose 1987)
Implementation in CATMA
www.catma.de
The CATMA/CLÉA approach to markup
• text range based model
  • a tag references a text range with a start and an end offset
• external standoff markup
  • markup is stored in external files or databases to facilitate tagging and exchange of markup by multiple users
  • markup is stored in a standoff manner to allow overlapping
  • markup tolerates non-deterministic tagging & supports analytical operations that exploit semantic ambiguity
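The text range based model can be illustrated with a minimal sketch. This is not CATMA's implementation: the class name `TagReference`, the helper functions, and the choice of 0-based, half-open offsets are assumptions made only for this example. It shows how several tags, including contradictory ones, can reference the same or overlapping ranges without touching the text itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagReference:
    """Hypothetical stand-off annotation: a tag name plus a character range."""
    tag: str
    start: int  # 0-based offset of the first character (assumption)
    end: int    # offset just past the last character (half-open range)

def overlaps(a: TagReference, b: TagReference) -> bool:
    """True if the two ranges share at least one character."""
    return a.start < b.end and b.start < a.end

def tags_at(refs, offset):
    """All tag names whose range covers the given character offset."""
    return sorted(r.tag for r in refs if r.start <= offset < r.end)

text = 'There is no such thing as "no-mark up".'

# The markup lives outside the text, so overlapping and even
# contradictory tags ("Adverb" vs. "Preposition") can coexist.
refs = [
    TagReference("SentenceStart", 0, 5),
    TagReference("Adverb", 0, 5),
    TagReference("Preposition", 0, 5),
    TagReference("noun phrase", 0, 8),
]

print(text[refs[0].start:refs[0].end])
print(tags_at(refs, 6))
```

Because the source text is never modified, adding a further user's markup only means appending more `TagReference` objects.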
Example of overlapping markup in CATMA (NB: in CATMA, tag sets can be imported/exported; tags can be created and manipulated ad hoc during markup)
TEI feature structure tag declaration & overlapping markup
<fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" n="1_7985fdf0-77a5-4060-9a3d-2d977e0ab954" type="catma_tag">
  <f xml:id="CATMA_aa9b3727-187e-4fb8-9990-e7880912a409" name="catma_tagname">
    <string>Keynote_speaker&amp;affiliation</string>
  </f>
  <f xml:id="CATMA_564825ba-28b2-4dab-b136-b87c8a3d9e28" name="catma_displaycolor">
    <numeric value="-13421569"/>
  </f>
</fs>
<ptr target="Abstracts.doc#range( /.21736, /.21888)" type="inclusion"/>
<seg ana="#CATMA_0a252cc2-96d2-4ed4-8fb8-52380550ec0b #CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5 #CATMA_8513fe2d-2e35-4d0a-a3a2-07528bcfa012">
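A feature structure like the one on this slide can be read with a standard XML parser. The sketch below is illustrative only: the function name `read_tag_declaration` is hypothetical, the ampersand in the tag name is written as `&amp;` so that the fragment is well-formed XML, and interpreting the numeric display color as a signed 32-bit ARGB integer is an assumption, not something the slide states.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tag declaration from the slide; the ampersand is
# escaped as &amp; so the fragment parses as well-formed XML.
FS_XML = """
<fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" type="catma_tag">
  <f name="catma_tagname"><string>Keynote_speaker&amp;affiliation</string></f>
  <f name="catma_displaycolor"><numeric value="-13421569"/></f>
</fs>
"""

def read_tag_declaration(xml_text):
    """Extract tag name and display color from a catma_tag feature structure."""
    fs = ET.fromstring(xml_text.strip())
    name = fs.find("f[@name='catma_tagname']/string").text
    raw_color = int(fs.find("f[@name='catma_displaycolor']/numeric").get("value"))
    # Assumption: the value is a signed 32-bit ARGB integer; keep the low 24 bits (RGB).
    rgb = raw_color & 0xFFFFFF
    return {"name": name, "color": f"#{rgb:06x}"}

declaration = read_tag_declaration(FS_XML)
print(declaration)
```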
Question 1: How can we model a collaborative markup practice?
Answer 1: CATMA's "n meta-data sets to 1 object-data instance" model
[Diagram: meta-data (procedural, declarative, hermeneutic; user markup 1..n; tagsets) attached to one object-data instance (TEXT)]
Question 2: But how, on top of that, can we also model the recursive routines that characterize the humanistic workflow?
Example for recursion: a simple query across the object data/meta-data divide
Step 1: object data query ...
Step 2: refinement by adding an additional meta-data constraint ...
... which is why (reg="\b\S*\Qez\E(?=\W)") where (tag="Keynote_speaker&affiliation") generates this:
[Screenshot of the query results]
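As a rough illustration of the two-step query above (a regex hit list refined by a tag constraint), the following sketch reimplements the idea in plain Python. It does not use CATMA's query engine: the sample text, the tagged range, and the function names are hypothetical, and treating "where" as "the hit lies entirely inside a tagged range" is an assumption about the semantics. Python's `re` module has no `\Q...\E` quoting, so the literal is escaped with `re.escape` instead.

```python
import re

def regex_hits(text, pattern):
    """Step 1: object-data query. Return (start, end, match) for every regex hit."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

def within_tag(hits, tag_ranges):
    """Step 2: keep only hits lying entirely inside at least one tagged range."""
    return [h for h in hits
            if any(start <= h[0] and h[1] <= end for start, end in tag_ranges)]

# Same pattern as on the slide: words ending in "ez" followed by a non-word character.
PATTERN = r"\b\S*" + re.escape("ez") + r"(?=\W)"

# Hypothetical sample text and tag range (half-open character offsets).
text = "Keynote speaker: Ana Sanchez, Hamburg. Mr Gomez asked a question."
keynote_start = text.index("Ana")
keynote_end = text.index("Hamburg") + len("Hamburg")
keynote_ranges = [(keynote_start, keynote_end)]

step1 = regex_hits(text, PATTERN)           # all words ending in "ez"
step2 = within_tag(step1, keynote_ranges)   # only those inside the tagged range

print([h[2] for h in step1])
print([h[2] for h in step2])
```

The output of step 2 is again a list of text ranges, so it could itself be tagged and queried further, which is the recursive routine the slide points to.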
Answer 2: CATMA's dynamic data model, e.g. (n meta-data sets to 1 object instance) > n+1
[Diagram: the meta-data layer (procedural, declarative, hermeneutic; markup 1..n; tagsets) and the TEXT as object data, shown twice to indicate recursion]
Question 3: How can we implement this practice in a system?
Answer 3: Call the big sister – CLÉA!
[Diagram: CLÉA database model]
CATMA/CLÉA: User and resource administration
Manage corpora & source documents, markup collections and tag libraries
Annotate texts or corpora using pre-defined or ready-made tags
Build and execute queries on source text & tags, or any combination thereof
Visualize results
What's in it for CLARIN?
• Import any text or corpus into CATMA/CLÉA
• Run standard analytical procedures automatically or interactively on upload (indexing, POS tagging etc.)
• Annotate and analyse texts or corpora collaboratively
• Share and export markup from the CATMA/CLÉA database in multiple formats
CLÉA = Collaborative Literature Éxploration and Annotation
Mille grazie to my CATMA/CLÉA development team
• Evelyn Gius
• Malte Meister
• Marco Petris
• Lena Schüch
and to our funders
• University of Hamburg (2009)
• Google DH Awards (2010-2013)
• BMBF (2013-2016)
Tag definition
• each Tag has a type
• each Tag has a color
• each Tag can have additional user-defined properties
Tag instance
• each Tag instance is of a type
• a Tag instance can have individual values for the user-defined properties
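The relation between tag definitions and tag instances can be sketched as two small classes. All names here (`TagDefinition`, `TagInstance`, `property_defaults`, the "confidence" property) are hypothetical and chosen for illustration; in particular, giving user-defined properties a default value on the definition that an instance may override is an assumption, not a documented CATMA behaviour.

```python
from dataclasses import dataclass, field

@dataclass
class TagDefinition:
    """A tag type with a display color and optional user-defined properties."""
    type_name: str
    color: int
    # Assumption: each user-defined property carries a default value.
    property_defaults: dict = field(default_factory=dict)

@dataclass
class TagInstance:
    """One use of a tag; may set individual values for user-defined properties."""
    definition: TagDefinition
    property_values: dict = field(default_factory=dict)

    def value(self, name):
        """Instance value if set, otherwise the default from the definition."""
        if name in self.property_values:
            return self.property_values[name]
        return self.definition.property_defaults.get(name)

keynote = TagDefinition(
    type_name="Keynote_speaker&affiliation",
    color=-13421569,
    property_defaults={"confidence": "unreviewed"},  # hypothetical property
)
first = TagInstance(keynote)
second = TagInstance(keynote, {"confidence": "checked"})

print(first.value("confidence"), second.value("confidence"))
```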
Tag referencing
• The content of a range is referenced by a pointer to an external entity.
• The URI is based on RFC 5147 (URI fragment identifiers for the text/plain media type), which is used for pointing into plain text.
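A pointer of this kind can be sketched with the `char=` scheme that RFC 5147 defines for plain text, where positions count the gaps between characters starting at 0. The helper names and the file name `Abstracts.txt` are hypothetical, only the simple two-position form `char=start,end` is handled, and the exact pointer syntax CATMA serializes may differ from this sketch.

```python
import re

# Only the two-position form "char=<start>,<end>" is supported in this sketch;
# RFC 5147 also allows single positions, open ranges, line= and integrity checks.
_CHAR_RANGE = re.compile(r"^char=(\d+),(\d+)$")

def make_pointer(document_uri, start, end):
    """Build a URI whose fragment selects the characters between two positions."""
    return f"{document_uri}#char={start},{end}"

def resolve_pointer(pointer, text):
    """Return the slice of text that the pointer's fragment refers to."""
    _, _, fragment = pointer.partition("#")
    match = _CHAR_RANGE.match(fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return text[start:end]

text = 'There is no such thing as "no-mark up".'
pointer = make_pointer("Abstracts.txt", 0, 5)  # hypothetical document name

print(pointer)
print(resolve_pointer(pointer, text))
```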
Potential problems and possible solutions
• range references based on character offsets are vulnerable to modifications of the content
  • possible solution: automated adjustments with checksums and context information, and tracking of versioning and revision history in the source document header
• the encoding of the tags is machine-readable but not interoperable out of the box
  • possible solution: defining the feature structure encoding of tags in terms of the Open Annotation framework
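The first proposed solution, adjusting offsets with checksums and context information, could look roughly like the sketch below. This is one possible reading of the slide, not the authors' implementation: the anchor layout, the use of SHA-256, the eight-character context window, and the tie-breaking rule (prefer matching context, then the position closest to the old offset) are all assumptions.

```python
import hashlib

def _digest(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def make_anchor(text, start, end, context=8):
    """Store offsets plus a checksum of the range and the text just before it."""
    return {
        "start": start,
        "end": end,
        "checksum": _digest(text[start:end]),
        "before": text[max(0, start - context):start],
    }

def relocate(anchor, new_text):
    """Find the anchored range in a modified text, or return None if it is gone."""
    length = anchor["end"] - anchor["start"]
    candidates = [
        i for i in range(len(new_text) - length + 1)
        if _digest(new_text[i:i + length]) == anchor["checksum"]
    ]
    if not candidates:
        return None

    def score(i):
        # Prefer candidates whose preceding context still matches,
        # then the one closest to the original offset.
        context_ok = new_text[:i].endswith(anchor["before"])
        return (not context_ok, abs(i - anchor["start"]))

    best = min(candidates, key=score)
    return best, best + length

old_text = 'There is no such thing as "no-mark up".'
start = old_text.index("thing")
anchor = make_anchor(old_text, start, start + len("thing"))

new_text = "Indeed: " + old_text  # an edit that shifts every later offset by 8
print(relocate(anchor, new_text))
```

Hashing every candidate window is slow on long documents; a real system would more likely combine this with the revision history mentioned on the slide to limit the search.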