IS698 – Web Mining Min Song, Ph.D. Course Web Page http://web.njit.edu/~song/courses/web_mining/is698_webmining_syllabus.html and Moodle
Course structure • The course has two parts: • Lectures: introduction to the main topics • One project (done either individually or in a group) • 1 research project • Lecture slides will be made available on the course web page and on Moodle.
Grading • Class Participation: 10% • Assignments: 20% • Midterm: 25% • Projects: 45%
Prerequisites • Knowledge/Experience of • Java programming
Teaching materials • Required Text • Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, Springer, ISBN 3-540-37881-2. • References: • Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann, ISBN 1-55860-489-8. • Principles of Data Mining, by David Hand, Heikki Mannila, Padhraic Smyth, The MIT Press, ISBN 0-262-08290-X. • Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley, ISBN 0-321-32136-7. • Machine Learning, by Tom M. Mitchell, McGraw-Hill, ISBN 0-07-042807-7.
Topics • Introduction • Data pre-processing • Association rules and sequential patterns • Classification (supervised learning) • Clustering (unsupervised learning) • Post-processing of data mining results • Question Answering • Full-Text mining • Partially (semi-) supervised learning • Opinion mining and summarization • Link analysis
Feedback and suggestions • Your feedback and suggestions are most welcome! • I need them to adapt the course to your needs. • Let me know if you find any errors in the textbook. • Share your questions and concerns with the class; others very likely have the same ones. • No pain, no gain • The more you put in, the more you get out • Your grades are proportional to your efforts.
Rules and Policies • Statute of limitations: No grading questions or complaints, no matter how justified, will be considered more than one week after the item in question has been returned. • Cheating: Cheating will not be tolerated. All work you submit must be entirely your own. Any suspicious similarities between students' work will be recorded and brought to the attention of the Dean. The MINIMUM penalty for any student found cheating will be a 0 for the item in question and a one-letter drop in the final course grade. The MAXIMUM penalty will be expulsion from the University. • Late assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
Web mining: Examples • Link analysis • How does Google work? • How to find communities on the Web? • Structured data extraction • Web information integration
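To make the "How does Google work?" question concrete, here is a minimal sketch of PageRank, one well-known link-analysis algorithm associated with Google. It is an illustration only and not taken from the course materials: the class name PageRankSketch, the four-page toy graph, the damping factor 0.85, and the choice to spread the score of pages without outlinks evenly over all pages are assumptions made for this example.

```java
import java.util.Arrays;

public class PageRankSketch {

    /**
     * Power-iteration PageRank on a small link graph.
     * links[i] lists the pages that page i links to (hypothetical toy input).
     * Returns one score per page; the scores sum to 1.
     */
    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0; // score held by pages with no outlinks

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    danglingMass += rank[i];
                } else {
                    // a page splits its score evenly among the pages it links to
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }

            for (int i = 0; i < n; i++) {
                // random jump term plus damped link-following term
                next[i] = (1.0 - damping) / n
                        + damping * (next[i] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy graph (assumed): 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        double[] r = pageRank(links, 0.85, 50);
        for (int i = 0; i < r.length; i++) {
            System.out.printf("page %d: %.4f%n", i, r[i]);
        }
    }
}
```

In this toy graph, page 2 receives links from three pages and ends up with the highest score, while page 3 has no incoming links and ends up with the lowest.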
Example: Web data extraction • [Figure: a web page annotated with Data region 1, Data region 2, and individual data records]
Resources • ACM SIGKDD • Data mining related conferences • Data mining: KDD, ICDM, SDM, … • Databases: SIGMOD, VLDB, ICDE, … • AI: AAAI, IJCAI, ICML, ACL, … • Web: WWW, … • Information retrieval: SIGIR, CIKM, … • KDnuggets: http://www.kdnuggets.com/ • News and resources. You can sign up! • Our text and reference books
What is web mining? • The process of discovering knowledge from web page content, hyperlink structure, and usage data • Builds on existing data and text mining techniques, but adds many new tasks and algorithms • Three types, based on sources of data (often combined in practice): • Web structure mining • Web content mining • Web usage mining
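The three data sources can be illustrated with a small sketch that pulls each kind of raw input out of sample data: hyperlinks (structure), page text (content), and a requested path from a server log line (usage). Everything here is an assumption for illustration: the class name WebMiningSources, the sample HTML, and the sample log line in Common Log Format. The regular expressions are deliberately naive, and a real system would likely use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebMiningSources {

    // Naive patterns for illustration only; they do not handle all valid HTML.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern REQUEST =
            Pattern.compile("\"(?:GET|POST|HEAD) (\\S+)");

    /** Web structure mining input: the hyperlinks found in a page. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Web content mining input: the page text with tags stripped. */
    public static String extractText(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    /** Web usage mining input: the requested path in a log line, or null if none is found. */
    public static String requestedPath(String logLine) {
        Matcher m = REQUEST.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>Web <a href=\"http://www.kdnuggets.com/\">mining</a> news</p>";
        String log = "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(extractLinks(html));
        System.out.println(extractText(html));
        System.out.println(requestedPath(log));
    }
}
```

Each method only prepares input; the mining itself (link analysis, classification, clustering, pattern discovery) would run on top of these extracted pieces.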
Importance of web data mining • The web is unique! • Amount of information is huge and still growing, on almost any topic, and changes continuously • No single editorial control: significant variations in quality, much duplication, and data formats vary widely • Significant information is linked (within and between web sites) • Web reflects a virtual society: interactions among people, organizations, and automated systems, no longer limited by geography • The Web presents challenges and opportunities for mining
How to make the best use of data? • Knowledge discovered from web data can be used for competitive advantage. • Online retailers (e.g., amazon.com) are largely driven by data mining. • Web search engines are based on information retrieval (text mining), data mining, and NLP. • Web surfers/searchers need tools to find, recommend, organize, and extract useful information from the Web
Semester Research Project • Individual, or groups of two (group members will grade each other) • Plus formal and informal feedback from instructor • Should be the beginning of what could be a publishable project. • On some aspect of web mining • Topic will be given by instructor or proposed by student and approved by instructor • Students present • Ideas early in the semester for feedback • Completed project at the end of the semester • Write a scientific paper at the end. • Publish as a technical report if not more (some past projects have been published at AMIS and others are under review)
Project: Biomedical Fulltext Mining • Input data for Web Mining (particularly web content mining) consists of document surrogates, short web pages, email messages, etc. • Fulltext data (books and online articles) has become publicly available. • Currently, fulltext mining is not well studied. • Study fulltext mining in the context of biomedical research problems.
Required Software • Java (jdk1.6.0 or above) • Tomcat 6 • Apache-ant-1.7.1 • Eclipse 3.4 • BioFulltextMiner.zip (http://base.njit.edu/vline/BioFullTextMiner.zip)