CIS392 Text Processing, Retrieval, and Mining Overview of Semester Projects Semester Projects
Forming project groups
• You can work independently.
• A group can have up to 3 people.
• Post the names of your group members in the webboard “semester project” conference by midnight on March 24, and indicate the type of project (programming project or case analysis) your group has chosen.
• If you don’t have a group and would like to be assigned to one, e-mail Dr. Wu before March 23.
Weekly progress report
• After every group has posted its members, I will open more conferences and new topics on the webboard for groups to report their weekly progress.
• Weekly progress reports are due each Monday before class (3/31, 4/07, 4/14, 4/21) until project presentation day.
Programming Projects
• Choose one of the following:
  • A text retrieval system
  • A question-answering system
  • An information extraction system
Text retrieval system
• Your tasks:
  • Develop the system.
  • Provide system documentation: use flow charts to design system components.
• System functions:
  • Automatic indexing (stop word removal, stemming, high- and low-frequency word removal)
  • Full-text Boolean searching
• Documents: LISA, http://web.njit.edu/~wu/lisa/text.zip
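
The system functions listed above can be sketched roughly as follows. This is only a minimal illustration, not a required design: the stop list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer such as Porter's), the frequency thresholds `min_df` and `max_df_ratio`, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs, min_df=1, max_df_ratio=1.0):
    """Map each term to the set of document ids that contain it.

    Terms whose document frequency falls below min_df or above
    max_df_ratio * len(docs) are dropped (low/high frequency removal).
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    max_df = max_df_ratio * len(docs)
    return {t: ids for t, ids in postings.items() if min_df <= len(ids) <= max_df}


def boolean_search(index, all_ids, must=(), any_of=(), must_not=()):
    """AND over `must`, OR over `any_of`, NOT over `must_not`."""
    result = set(all_ids)
    for term in must:
        result &= index.get(stem(term.lower()), set())
    if any_of:
        union = set()
        for term in any_of:
            union |= index.get(stem(term.lower()), set())
        result &= union
    for term in must_not:
        result -= index.get(stem(term.lower()), set())
    return result


# Made-up sample documents, not taken from the LISA collection.
docs = {
    1: "Indexing libraries for information retrieval",
    2: "Library catalogues and the retrieval of books",
    3: "Stemming algorithms in indexing",
}
index = build_index(docs)
print(boolean_search(index, docs, must=["indexing"], must_not=["stemming"]))
```

Query terms are stemmed the same way as document terms so that "indexing" in a query matches the indexed stem. Keeping postings as Python sets makes AND, OR, and NOT direct set operations, which is convenient for a small collection.
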
Question-answering system
• Your tasks:
  • Collect your own documents: at least 300 docs in one specific domain, e.g., user opinions on certain products.
  • Analyze the domain and define keywords (you need to know enough about the domain to know what the keywords are).
  • Design a question-answering system.
  • Provide system documentation: use flow charts to design system components, and list the keywords you selected.
Question-answering system (cont)
• System functions:
  • Indexes selected keywords.
  • Finds answers to domain-specific questions using co-occurrence of terms.
  • Provides answers at the sentence level, i.e., answers should be sentences, not complete documents (the latter should be available through a link).
  • Provides alternative info when no answer is found.
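
One simple way to read "co-occurrence of terms" is to score each sentence by how many of the question's keywords occur together in it. The sketch below takes that reading; it is only one possible interpretation. The keyword set, the sample reviews, and the choice to return the keyword list as the "alternative info" are all assumptions made for illustration.

```python
import re

# Hypothetical domain keywords for a product-review collection.
KEYWORDS = {"battery", "screen", "price", "camera"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords_in(text):
    """Return the domain keywords that appear in the text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w in KEYWORDS}


def answer(question, docs):
    """Return (answers, suggestions).

    answers: (doc_id, sentence) pairs, ordered by how many question
    keywords co-occur in the sentence; doc_id can serve as the link
    back to the full document.
    suggestions: the known keywords, offered when nothing matches.
    """
    q_terms = keywords_in(question)
    scored = []
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            overlap = len(q_terms & keywords_in(sentence))
            if overlap:
                scored.append((overlap, doc_id, sentence))
    if not scored:
        return [], sorted(KEYWORDS)
    scored.sort(key=lambda item: -item[0])  # stable: ties keep document order
    return [(doc_id, s) for _, doc_id, s in scored], []


# Made-up sample documents.
docs = {
    "review1": "The battery lasts two days. The screen is dim outdoors.",
    "review2": "Great camera for the price. Battery life and screen quality are both poor.",
}
answers, suggestions = answer("How are the battery and screen?", docs)
print(answers[0])
```
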
Information extraction system
• Your tasks:
  • Collect your own documents: at least 300 docs in one specific domain, e.g., user opinions on certain products.
  • Analyze the domain and define entities, events, and templates (you need to know enough about the domain to know what the keywords are).
Information extraction system (cont)
• Prepare a list of questions that your system can answer (at least 10, each needing answers from the unstructured part of the documents), e.g., What problems are mentioned in the reviews?
• (Please use what you learned from the epinions.com exercise that we did in class.)
Information extraction system (cont)
• Develop an information extraction system that can process the docs you collected and answer the questions you pre-specified.
• Provide system documentation:
  1. flow charts used to design system components,
  2. a list of the keywords, events, templates, and questions/analysis you defined, and
  3. a list of the system outputs.
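
As a rough illustration of how a template ties extraction to a pre-specified question, the sketch below fills a hypothetical `ProblemReport` template from review text with a regular expression and then answers "What problems are mentioned in the reviews?" from the filled templates. The template, its slots, the component and complaint word lists, and the sample reviews are all assumptions; a real project would define its own from its domain analysis.

```python
import re
from dataclasses import dataclass


@dataclass
class ProblemReport:
    """Hypothetical template: one complaint about one product component."""
    doc_id: str
    component: str
    complaint: str


# Hypothetical entity and event vocabularies for a phone-review domain.
COMPONENTS = ("battery", "screen", "camera")
COMPLAINTS = ("died", "cracked", "overheats", "blurry")

# A component followed, within the same sentence, by a complaint word.
PATTERN = re.compile(
    r"\b(?P<component>" + "|".join(COMPONENTS) + r")\b"
    r"[^.!?]*?"
    r"\b(?P<complaint>" + "|".join(COMPLAINTS) + r")\b",
    re.IGNORECASE,
)


def extract(docs):
    """Fill one ProblemReport per pattern match in each document."""
    reports = []
    for doc_id, text in docs.items():
        for match in PATTERN.finditer(text):
            reports.append(
                ProblemReport(
                    doc_id,
                    match.group("component").lower(),
                    match.group("complaint").lower(),
                )
            )
    return reports


def problems_mentioned(reports):
    """Answer the pre-specified question from the filled templates."""
    return sorted({(r.component, r.complaint) for r in reports})


# Made-up sample documents.
docs = {
    "r1": "The battery died after a week. I love the camera.",
    "r2": "My screen cracked on day one, and the camera is blurry.",
}
reports = extract(docs)
print(problems_mentioned(reports))
```

The pattern deliberately refuses to cross sentence-ending punctuation, so "I love the camera." in the first review does not produce a report.
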
Case Analysis Project
• Suppose you are IT consultants who provide solutions to business problems.
• Suppose you now have a client who wants to analyze certain aspects of the business, e.g., what customers think about the client’s and competitors’ product(s).
• Provide a report detailing what you find (including the software tools you used and the results gathered using each tool).
Case Analysis Project (cont)
• Your tasks:
  • Come up with a fictional client and define the business domain, products, and problems.
  • Conduct a search to find sources of documents.
Case Analysis Project (cont)
• Collect your own documents: at least 300 docs in one specific domain, e.g., user opinions on certain products. Use only web spiders and crawlers (Spiders R Us http://ai.bpa.arizona.edu/spidersrus/index.html, Teleport Pro, etc.) to collect docs.
• Analyze the domain and define keywords (you need to know enough about the domain to know what the keywords are).
Case Analysis Project (cont)
• Use text analysis and mining tools such as TextAnalyst (by www.megaputer.com), Nenet (http://koti.mbnet.fi/~phodju/nenet/Nenet/General.html), and others to analyze the documents.
• Be creative! Search the web and find as many tools as you need.
Case Analysis Project (cont)
• Prepare a report consisting of a one-page executive summary and the full report, which should:
  • Describe the business environment (who are the players, and what products are available in the market?).
  • Define the client’s problem, e.g., what customers think about the client’s and competitors’ product(s).
  • Indicate how you, as a consultant, would proceed to find answers to this problem.
Case Analysis Project (cont)
• Your research on sources of relevant, useful documents (at least 2 sources). Describe each source (who owns it? how reliable is it?) and how many documents you found.
• Your analysis of the documents, including notable items and keywords.
Case Analysis Project (cont)
• Your research on text tools (for each tool: its developer, usage, price, and where to obtain it).
• Your use of text tools to analyze the documents.
• Your analysis of the business problem, using the results from the text analysis tools.