Tokeniser Francisco Miguel Pérez Romero University of Sevilla
Roadmap Introduction Class Diagram Libraries Conclusions
Web Wrapping Extractor Information retrieval Ontologiser Verifier FormFiller Navigator Query
Tokeniser • Tokenisation Rules • Configuration File • Web Page • Parser
Tokeniser Usage • Web Page Classification • Information Extraction Learners • Information Extraction
Example Config File Token List Tokeniser XML File Token List Web Page
Concepts • Configuration File • Token • Tokenisation types
Roadmap Introduction Class Diagram Libraries Conclusions
Example • 3 Token Classes: • Word • Space • Digit
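The three token classes above can be sketched with java.util.regex. This is only an illustration, not the Tokeniser's actual implementation: the class name `TokeniserSketch`, the method `tokenise`, and the regular expressions chosen for Word, Space and Digit are assumptions, since the slides do not show the configuration file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and rules are assumptions, not taken from the slides.
final class TokeniserSketch {
    // One alternative per token class; the named group tells us which class matched.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<Word>\\p{L}+)|(?<Space>\\s+)|(?<Digit>\\d+)");
    private static final String[] CLASSES = {"Word", "Space", "Digit"};

    // Returns a list of {tokenClass, text} pairs in document order.
    // Characters that belong to no class are skipped by find().
    static List<String[]> tokenise(String text) {
        List<String[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            for (String c : CLASSES) {
                if (m.group(c) != null) {
                    tokens.add(new String[] {c, m.group(c)});
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String[] t : tokenise("Room 42")) {
            System.out.println(t[0] + ": '" + t[1] + "'");
        }
    }
}
```

For the input `Room 42` the sketch yields one Word token, one Space token and one Digit token, in that order.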
Roadmap Introduction Class Diagram Libraries Conclusions
Comparison Features 1 • Comparison Features: • Javadoc documentation? • Unicode UTF-8 support? • Unicode UTF-16 support? • Named groups? • Indexable groups > 9? • Negative groups? • Nested groups? • Lazy quantifiers?
Comparison Features 2 • Comparison Features: • Fuzzy matching? • POSIX support? • Ignore-case support? • New-line option support? • Uses a state machine? • Accent support?
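Some of the compared features can be illustrated with java.util.regex, one of the libraries in the comparison. The sketch below shows a named group, a lazy quantifier and the ignore-case option together. The class name `RegexFeatureSketch`, its methods and the tag pattern are assumptions made for illustration. Other libraries in the comparison may expose these features differently or not at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class, methods and pattern are illustrative only.
final class RegexFeatureSketch {
    // (?<name>...) is a named group, .*? is a lazy quantifier,
    // and CASE_INSENSITIVE is the ignore-case option.
    private static final Pattern OPEN_TAG = Pattern.compile(
        "<(?<name>[a-z][a-z0-9]*)\\b.*?>", Pattern.CASE_INSENSITIVE);

    // Text of the first opening tag, or null if there is none.
    static String firstTag(String html) {
        Matcher m = OPEN_TAG.matcher(html);
        return m.find() ? m.group() : null;
    }

    // Name of the first opening tag, read through the named group.
    static String firstTagName(String html) {
        Matcher m = OPEN_TAG.matcher(html);
        return m.find() ? m.group("name") : null;
    }

    public static void main(String[] args) {
        String html = "<B class='x'>bold</B>";
        System.out.println(firstTag(html));
        System.out.println(firstTagName(html));
    }
}
```

Because the quantifier is lazy, the match stops at the first `>` instead of running on to the end of the closing tag.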
Libraries • Table 1
Libraries • Table 2
Libraries • Table 3
Benchmark 1 • List of regular expressions • List of strings • Every expression matched against every string • Time in ms
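A benchmark of this shape could look roughly like the sketch below, written against java.util.regex only. The slides do not show the benchmark source, so several details are assumptions: the class name `RegexBenchmarkSketch`, the sample expressions and strings, compiling the patterns once before timing, and the use of whole-string `matches()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: the original benchmark code is not shown in the slides.
final class RegexBenchmarkSketch {
    // Matches every pattern against every input, repeated 'iterations' times.
    // Returns the number of successful matches so the work cannot be optimised away.
    static long run(List<Pattern> patterns, List<String> inputs, int iterations) {
        long matches = 0;
        for (int i = 0; i < iterations; i++) {
            for (Pattern p : patterns) {
                for (String s : inputs) {
                    if (p.matcher(s).matches()) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    // Assumption: patterns are compiled once, outside the timed loop.
    static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String r : regexes) {
            patterns.add(Pattern.compile(r));
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = compileAll(Arrays.asList("\\d+", "[a-z]+"));
        List<String> inputs = Arrays.asList("123", "abc", "a1");
        long start = System.nanoTime();
        long matches = run(patterns, inputs, 10000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(matches + " matches in " + elapsedMs + " ms");
    }
}
```

A more careful measurement would add warm-up runs before timing, because JIT compilation can dominate short runs on the JVM.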
Benchmark 1: 10000 Iterations • org.apache: -> 7078 ms • com.stevesoft : -> 19782 ms • kmy.regex : -> 781 ms • java.util : -> 1266 ms • jregex.Pattern : -> 1000 ms • org.apache.oro : -> 2156 ms • dk.brics.automaton : -> 265 ms • com.karneim.util.collection : -> 407 ms
Benchmark 1: 20000 Iterations • org.apache: -> 11796 ms • com.stevesoft : -> 26641 ms • kmy.regex : -> 906 ms • java.util : -> 1891 ms • jregex.Pattern : -> 1422 ms • org.apache.oro : -> 3375 ms • dk.brics.automaton : -> 312 ms • com.karneim.util.collection : -> 610 ms
Benchmark 1: 50000 Iterations • org.apache: -> 28656 ms • com.stevesoft : -> 63297 ms • kmy.regex : -> 1781 ms • java.util : -> 4281 ms • jregex.Pattern : -> 3219 ms • org.apache.oro : -> 7641 ms • dk.brics.automaton : -> 531 ms • com.karneim.util.collection : -> 1312 ms
Benchmark 2 • Source Code • Matching tags
Benchmark 2: Amazon • org.apache : -> 218 ms • com.stevesoft : -> 63 ms • kmy.regex : -> 94 ms • java.util : -> 0 ms • jregex.Pattern : -> 93 ms • org.apache.oro : -> 32 ms • dk.brics.automaton : -> 0 ms • com.karneim.util.collection : -> 47 ms
Benchmark 2: Marca • org.apache : -> 62 ms • com.stevesoft : -> 47 ms • kmy.regex : -> 93 ms • java.util : -> 0 ms • jregex.Pattern : -> 94 ms • org.apache.oro : -> 16 ms • dk.brics.automaton : -> 0 ms • com.karneim.util.collection : -> 62 ms
Benchmark 2: Ebay • org.apache : -> 31 ms • com.stevesoft : -> 125 ms • kmy.regex : -> 266 ms • java.util : -> 0 ms • jregex.Pattern : -> 156 ms • org.apache.oro : -> 47 ms • dk.brics.automaton : -> 0 ms • com.karneim.util.collection : -> 172 ms
To sum up… • dk.brics.automaton is the fastest • dk.brics.automaton and com.karneim.util.collection fail with URLs • kmy.regex or java.util
Roadmap Introduction Class Diagram Libraries Conclusions
Conclusions • Tokenisation test • Searching information • A real project • Experience