Over9K Alex Meng Chunshi Jin Elliott Conant Jonathan Fung
Agenda • What is Over9K? • Architecture • Crawler • Postprocessor • Extractor • Web Service • Summary
What is Over9K about? • Original Goal: a system that predicts a stock’s future volatility from news and other information gathered from the Internet. • Current Goal: a system that crawls news sites for articles, identifies which companies each article affects, and extracts events from the articles. All of this information is stored in a database that is accessed through our web service.
Crawler • Web crawler: Nutch • Domains we crawl: • www.cnbc.com • www.reuters.com • www.marketwatch.com • … (6 total) • Nutch’s Successes • Nutch’s Failures
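Restricting a crawl to a fixed set of domains can be sketched as a simple allowlist check. This is a minimal illustration, not the project's Nutch configuration: the function name `is_allowed` is hypothetical, and only the three domains named on the slide are included because the others are not listed.

```python
from urllib.parse import urlparse

# Only the three domains named on the slide; the rest of the six are not listed there.
ALLOWED_DOMAINS = {"cnbc.com", "reuters.com", "marketwatch.com"}


def is_allowed(url: str) -> bool:
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


if __name__ == "__main__":
    for candidate in ("http://www.cnbc.com/id/1", "http://example.com/story"):
        print(candidate, is_allowed(candidate))
```

Matching on the parsed hostname rather than on the raw URL string avoids accepting look-alike hosts that merely contain an allowed name.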
Postprocessor • Components: • NBClassifier • Classifies articles using naive Bayes • DateParser • Parses dates using regular expressions • PageGetter • Retrieves training data from RSS feeds
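The two techniques named above, naive Bayes classification and regex-based date parsing, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's NBClassifier or DateParser: the names `NaiveBayes` and `parse_date`, the word-count features, the add-one smoothing, and the "Month D, YYYY" date layout are all assumed for the example.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text, label):
        self.class_docs[label] += 1
        for word in self.tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in self.tokenize(text):
                score += math.log((counts[word] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
# Assumed layout: "March 5, 2010" or "Sept. 3, 2009".
DATE_RE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),\s*(\d{4})\b")


def parse_date(text):
    """Return (year, month, day) for the first 'Month D, YYYY' date, or None."""
    for m in DATE_RE.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month:
            return int(m.group(3)), month, int(m.group(2))
    return None
```

Working in log space keeps the product of many small word probabilities from underflowing, and the add-one smoothing keeps an unseen word from zeroing out a class.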
IE • Tried several systems for information extraction (IE) • GATE • OpenCalais • CRF++
Comparison of IE tools • OpenCalais: • Web service. Easy to use. • Not extensible. No machine learning process. • Has usage quotas. • GATE: • ANNIE (A Nearly-New Information Extraction system): • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE • JAPE: GATE’s rule engine. • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Behavior is almost deterministic. • High precision for defined patterns, low recall on sentences that match no defined pattern.
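The precision/recall trade-off of hand-written rules can be illustrated with a single pattern. The sketch below uses a plain Python regular expression, not JAPE, and every name in it (`NAME`, `ACQUISITION`, `extract_acquisitions`) is hypothetical. Company names are approximated as runs of capitalized words, which is an assumption made only for the example.

```python
import re

# One hand-written rule: "<Company> acquires|acquired|to acquire <Company>".
# A company name is approximated as one or more capitalized words (assumption).
NAME = r"[A-Z][A-Za-z&.]*(?:\s[A-Z][A-Za-z&.]*)*"
ACQUISITION = re.compile(
    rf"(?P<buyer>{NAME})\s+(?:acquires|acquired|to acquire)\s+(?P<target>{NAME})"
)


def extract_acquisitions(sentence):
    """Return (buyer, target) pairs for every match of the acquisition rule."""
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUISITION.finditer(sentence)]


if __name__ == "__main__":
    # Matches the defined pattern.
    print(extract_acquisitions("Oracle acquired Sun Microsystems last year."))
    # Same event, different wording: the rule finds nothing.
    print(extract_acquisitions("Sun Microsystems was bought by Oracle."))
```

The second sentence reports the same event but matches no rule, which is the low-recall behavior described above. Covering it means writing another pattern by hand.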
Comparison of IE tools (cont.) • CRF++ • Needs tools to preprocess content: • HTML to text • POS tags/NE (Stanford NLP library) • Extract other features when necessary • Convert files to the train/test format CRF++ requires • Template file defines the dependencies between features and labels. • Needs a large training set. • Labeling the training set is laborious. • Fairly good precision/recall. “Intelligence” may emerge.
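The conversion step can be sketched as follows. CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The column layout used here (token, POS tag, label), the label names, and the function name `to_crfpp` are assumptions for illustration. The template string shows the general shape of a feature template rather than the one the project used.

```python
def to_crfpp(sentences):
    """Serialize sentences as CRF++ data.

    Each sentence is a list of (token, pos, label) tuples. Output has one
    token per line, tab-separated columns with the label last, and a blank
    line after every sentence.
    """
    lines = []
    for sentence in sentences:
        for token, pos, label in sentence:
            lines.append(f"{token}\t{pos}\t{label}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"


# Feature template: %x[row,col] selects a column value relative to the
# current token (row 0 is the current token, col 0 the token text, col 1 the POS tag).
TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram
B
"""

if __name__ == "__main__":
    example = [[("Oracle", "NNP", "B-ORG"), ("acquired", "VBD", "O")]]
    print(to_crfpp(example), end="")
```

The `U` lines generate features from the observed columns, and the single `B` line adds a dependency between adjacent output labels.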
Web Service • Technologies used: • YUI Toolkit • PHP • Apache • CSS • JavaScript • Layout description
Lessons and Thoughts • A realistic goal is critical. • The right tools are important. • Communication is key. • Future Improvements • Controlled crawling • Improve feature extraction quality: POS tagger, NE, etc. • Develop a model to predict volatility
Q&A Thanks!