
Text Classification


aelan

Presentation Transcript


  1. Text classification: The Scirus classifier

  2. Overview
     • Complex program:
       • Pornography filter
       • Extraction of names
       • Classification based on text
       • Classification by URL/title
       • Fixed classification based on a URL list
       • Extraction of title/author/abstract etc. from articles
       • Output of refinement terms
     • Of interest here: only the classification based on textual content

  3. Text classification
     • Lexicon-based:
       • Phrases or words
       • Each receives a weight for every category
       • Strong indicators
     • Classification by computing a score:
       • For each occurrence, a counter is incremented for every category
       • Normalization by document length
       • Threshold

  4. Configuration

  5. Configuration file
     //Number of words to process for subject identification
     NWDS=2000000
     MINWORDS=100
     THRESHOLD=1
     SUBJ=gen all 0 0
     SUBJ=chem all 1 0
     SUBJ=comp all 2 0
     SUBJ=eng all 3 0
     SUBJ=env all 4 0
     SUBJ=geo all 5 0
     SUBJ=astro all 6 0
     SUBJ=life all 7 0
     SUBJ=math all 8 0
     SUBJ=mat all 9 0
     SUBJ=med all 10 0
     ….
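The configuration excerpt above can be read with a few lines of Python. This is only a minimal sketch, not the classifier's own parser: it assumes that every setting is a `KEY=VALUE` line, that lines starting with `//` are comments, and that `SUBJ` may occur many times. The slides do not explain what the fields after the subject name mean, so the hypothetical `parse_config` function keeps them as raw tokens.

```python
def parse_config(text):
    """Parse KEY=VALUE lines; '//' lines are treated as comments (assumption)."""
    settings = {}   # scalar settings such as NWDS, MINWORDS, THRESHOLD
    subjects = []   # one raw token list per SUBJ line; field meanings are undocumented
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//") or "=" not in line:
            continue  # skip blanks, comments and the elided rest of the file
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "SUBJ":
            subjects.append(value.split())
        else:
            settings[key] = int(value) if value.isdigit() else value
    return settings, subjects


sample = """//Number of words to process for subject identification
NWDS=2000000
MINWORDS=100
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=chem all 1 0
"""
settings, subjects = parse_config(sample)
print(settings)
print(subjects)
```

Keeping the `SUBJ` fields as strings avoids guessing at semantics the slides do not give.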

  6. Invocation
     CIS Subject Identifier and Content Extractor Version 5.0
     USAGE: classifier [-h[elp]] [-os|l[A]] [-it|f|h] [-s[ilent]] [-c CONFIG_FILE] [-nout] [-uat] [-URL<filename>] [-smd<number>] [-ps] [-t FILES_TO_IDENTIFY]
       -h: print help
       -c CONFIG_FILE: Name of the configuration file. Default is ././config.txt
       -os|l[A]: Output format
         -os: Short: only print well identified subjects(default)
         -ol: Long: print all subjects
         -ot: Topics only are output; one line
              Format: filename:WORDCOUNT#GENERALSCIENCESCORE#TOPICSWITHSCORE
         -oA: Store and print all phrases for a topic
         -oT: Print all phrases found in the dictionary
              (Used for dictionary testing only)
       -T[t][i][o]: Tasks to carry out and to output (default: all are set)
         t: Topic identification
         i: Information from content extractor
         o: Offensive content filter
       -it|h|f: Input format
         -it: Plain text
         -ih: HTML-file
         -if: HTML-file preceded by header
       -nINTEGER :Minimum number of words in a document
       -MINTEGER :Maximum number of words to be processed in a document
                  tokenizer stops after INTEGER words
                  Documents with less words will get tag 'not_enough_data'
       -mINTEGER :Minimum score for accepted documents
       -rINTEGER : maximum relative count for phrase form/thousand
                  In thousand phrases one phrase form will only be counted INTEGER times.
       -NINTEGER :Maximum number of phrases to output in results for topics
       -t FILES_TO_IDENTIFY List of files for which subject should be identified. Default: stdin.
       -D[r] D1|D2[:F1|F2[:FB1|FB2]]: process all files in directory and recurse
         Dr: descend recursively into subdirectories
         D1: name of directory to list or recurse
         F1... : filename patterns (may contain *)
         FB1: Patterns for forbidden directories (not recursed)
       -s: print only some important messages, not all.
       -nout: Turn off URL/Title classifier.
       -uat: Use all titles for classification (not just those enclosed in <head>).
       -URL<filename>: Filename of the URL list (format: <file><tab><url><newline>).
       -smd<number>: Maximum number of words for small documents (default see config file).
       -ps: Print title and url scores
       -xml: Print XML output

  7. Process
     • Read the text up to the specified number of words
     • Match against the lexicon
     • Compute the score
     • Output the result depending on the threshold
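The first two steps of this process (reading a limited number of words and matching them against the lexicon) might look roughly like the sketch below. The slides do not say how the real tokenizer or matcher works; whitespace tokenization, lower-casing and greedy longest-match lookup are assumptions made here for illustration, and `find_terms` is a hypothetical name.

```python
def find_terms(text, lexicon, max_words):
    """Return (matched terms, number of words read) for the first max_words words.

    Assumptions: whitespace tokenization, case-insensitive matching, and
    greedy longest-match-first lookup of multi-word lexicon entries.
    """
    words = text.lower().split()[:max_words]  # stop reading after max_words words
    longest = max((len(term.split()) for term in lexicon), default=0)
    hits = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lexicon:
                hits.append(candidate)
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no lexicon entry starts at this word
    return hits, len(words)


lexicon = {"a d converter", "converter", "delta activity"}
text = "The a d converter shows delta activity and more words"
print(find_terms(text, lexicon, max_words=100))
print(find_terms(text, lexicon, max_words=6))
```

With the word limit set to 6, the second phrase is cut off and therefore no longer matched, which illustrates the effect of the maximum word count on the later score.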

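  8. Scoring formula
     • Let:
       • $d$ be a document,
       • $c$ a category,
       • $t$ a term,
       • $l(t)$ the length of $t$,
       • $wn(t)$ the number of words in $t$,
       • $q(t,c)$ the weight of $t$ for $c$,
       • $s(t,c)$ the indication that $t$ is a strong indicator for $c$,
       • $T(c)$ the classification threshold for $c$, and
       • $W = \min(\text{words in the document},\ \text{maximum number of words processed})$.
     • $\mathrm{Score}(d,c) = \dfrac{1}{W} \sum_{t \in d} \left( \dfrac{l(t)}{2} + (wn(t) - 1) \cdot 2 \right) q(t,c)$
     • $\text{Si-score}(d,c) = \sum_{t \in d} s(t,c)$
     • $d$ is classified as $c$ if and only if $\text{Si-score}(d,c) > 1$ and $\mathrm{Score}(d,c) > T(c)$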
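A small worked example may make the scoring formula easier to follow. The sketch below is not the Scirus implementation: it assumes that l(t) is the character length of the term including spaces (the slide only says "length"), that s(t,c) is 1 for a strong indicator and 0 otherwise, and it uses invented weights. The `Hit` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    term: str      # one occurrence of a lexicon term t in document d
    weight: float  # q(t, c), illustrative value
    strong: int    # s(t, c): 1 if t is a strong indicator for c, else 0 (assumption)


def score(hits, doc_words, max_words):
    """Score(d,c) = sum((l(t)/2 + (wn(t)-1)*2) * q(t,c)) / W."""
    w = min(doc_words, max_words)
    if w == 0:
        return 0.0
    total = sum(
        (len(h.term) / 2 + (len(h.term.split()) - 1) * 2) * h.weight  # l(t) = characters (assumption)
        for h in hits
    )
    return total / w


def si_score(hits):
    """Si-score(d,c) = sum of s(t,c) over all occurrences."""
    return sum(h.strong for h in hits)


def is_classified(hits, doc_words, max_words, threshold):
    """d is assigned to c iff Si-score > 1 and Score > T(c)."""
    return si_score(hits) > 1 and score(hits, doc_words, max_words) > threshold


hits = [Hit("a d converter", 1.0, 1), Hit("delta activity", 8.0, 1)]
print(score(hits, doc_words=50, max_words=2_000_000))
print(is_classified(hits, doc_words=50, max_words=2_000_000, threshold=1.0))
```

For the two hits above, the per-term contributions are (13/2 + 2·2)·1 = 10.5 and (14/2 + 1·2)·8 = 72, so the score for a 50-word document is 82.5/50 = 1.65. Longer and multi-word terms contribute more, and the division by W keeps long documents from accumulating high scores simply through length.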

  9. Classification lexicon
     • Format: TERM.INFO1/INFO2/...
     • INFO: TOPICS#FREQUENCY#QUALITY#LENGTH#TYPE#ALONE#OUTPUT
     • TOPICS: MAIN:SUB
     • FREQUENCY: 1 (not used)
     • QUALITY: 0...9
     • LENGTH (number of words)
     • TYPE: 0..3
       • 0: genuine topic-subtopic indicator
       • 1: only to distinguish between subtopics, not indicating topic itself
       • 2: as 0, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
       • 3: as 1, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
     • ALONE: 0/1 : strong indicator
     • OUTPUT: Ø, $, PHRASE

  10. Classification lexicon
     • Example
       • a vinculo matrimonii.18:0#1#0#3#0#0#$
       • a-37 aircraft.14:0#1#1#3#0#1#a 37 aircraft
       • a-address register.2:0#1#1#3#0#1#a address register
       • a-bomb survivors.7:0#1#8#3#0#1#a bomb survivors
       • a-c substitutions.15:0#1#8#3#0#1#a c substitutions/7:0#1#8#3#0#1#a c substitutions
       • a-calcium-calmodulin kinase.11:0#1#8#4#0#1#a calcium-calmodulin kinase
       • a-chromanoxyl radical.7:0#1#8#3#0#1#a chromanoxyl radical
       • a-crystallin gene.15:0#1#8#3#0#1#a crystallin gene/7:0#1#8#3#0#1#a crystallin gene
       • a-d conversion.3:0#1#1#3#0#1#a d conversion
       • a-d converter.13:0#1#1#3#0#1#a d converter/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter
       • a-deficient mice.11:0#1#7#3#0#1#a deficient mice/15:0#1#8#3#0#1#a deficient mice
       • a-delta activity.11:0#1#8#3#0#1#a delta activity
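The entry format from the two lexicon slides can be taken apart as in the sketch below. It is based only on the examples shown and is not the classifier's own reader: it assumes that MAIN and SUB are always integers, that the term ends at the first "." followed by a MAIN:SUB pair, and that OUTPUT never contains a "/". The `Info` class and `parse_entry` function are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Info:
    main: int       # TOPICS: main topic id
    sub: int        # TOPICS: subtopic id
    frequency: int  # FREQUENCY (not used according to the slides)
    quality: int    # QUALITY, 0..9
    length: int     # LENGTH, number of words
    type: int       # TYPE, 0..3
    alone: bool     # ALONE: True if the term is a strong indicator
    output: str     # OUTPUT: empty, "$", or a phrase


# Term, then ".", then the first INFO block starting with MAIN:SUB# (assumption).
ENTRY = re.compile(r"(.+?)\.(\d+:\d+#.*)")


def parse_entry(line):
    """Split 'TERM.INFO1/INFO2/...' into the term and a list of Info records."""
    match = ENTRY.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a lexicon entry: {line!r}")
    term, rest = match.groups()
    infos = []
    for chunk in rest.split("/"):
        topics, freq, quality, length, type_, alone, output = chunk.split("#", 6)
        main, sub = topics.split(":")
        infos.append(Info(int(main), int(sub), int(freq), int(quality),
                          int(length), int(type_), alone == "1", output))
    return term, infos


term, infos = parse_entry(
    "a-d converter.13:0#1#1#3#0#1#a d converter"
    "/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter"
)
print(term, [info.main for info in infos])
```

In this example the single term "a-d converter" carries three INFO blocks, one per main topic (13, 3 and 9), which is how one phrase can count toward several categories.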
