
Supervisor: Mr. Phan Trường Lâm

Capstone Project Documents Management. Supervisor: Mr. Phan Trường Lâm. Students: Vũ Nhật Linh, Lê Quang Hoàn, Nguyễn Duy Quyền, Hoàng Nam, Nguyễn Thế Anh. Team information. Agenda. Introduction. Project plan. System Requirement Specifications. System Analysis and Design. Testing.


Presentation Transcript


  1. Capstone Project Documents Management
     • Supervisor: Mr. Phan Trường Lâm
     • Students: Vũ Nhật Linh, Lê Quang Hoàn, Nguyễn Duy Quyền, Hoàng Nam, Nguyễn Thế Anh

  2. Team information

  3. Agenda
     • Introduction
     • Project plan
     • System Requirement Specifications
     • System Analysis and Design
     • Testing
     • Deployment and User Guide
     • Summary
     • Demo and Q&A

  4. Introduction
     • Initial Idea
     • Literature Review of Existing System
     • Proposal & Product

  5. Initial Idea

  6. Initial Idea
     We decided to develop a new system that integrates:
     • Collecting documents
     • Organizing these documents
     • Extracting keywords
     • Ranking
     • Searching

  7. Literature Review of Existing System
     Methods that these websites use to build their systems:
     • Big database
     • Search
     • Ranking and highlighting of returned results
     • Comparing documents to detect plagiarism

  8. Literature Review
     Achievements of the existing systems:
     • Attractive
     • Easy to use
     • Speed & Reliability
     • Quality Results
     • Ensuring Security
     • Awareness
     Limitations of the existing systems:
     • Costs
     • Privacy

  9. Proposal
     • Collect and manage Capstone projects
     • Support looking up Capstone projects
     • Avoid repeating and copying ideas
     • Rank results
     • Refer to other materials
     • Friendly interface like Google
     • Public for everyone, inside and outside the University
     • Cheaper to build
     • Free to use

  10. Product
     • Web application
     • Mobile application (in future)

  11. Project Plan
     • Development environment
     • Process
     • Project organization
     • Project schedule
     • Risk management

  12. Development Environment
     Hardware:
     • 2 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
     • 1 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
     Software

  13. Process
     • Follow the Waterfall model

  14. Project organization

  15. Project organization
     Controlling and monitoring:
     • Meeting
     • Assign task
     • Tracking task
     • Issue resolve
     • Review task
     • Report

  16. Project organization
     Communication control:
     • Online activity: email, chat, phone
     • Offline activity: project kick-off, team building

  17. Project Schedule
     Overall plan

  18. Risk Management

  19. System Requirement Specifications
     • User Requirements
     • System Requirements
     • Non-functional Requirements

  20. User Requirements
     Lecturers and Students:
     • Search project documents.
     • Download documents.
     Librarians:
     • Edit profile.
     • Search documents.
     • Add/Edit/Delete document.
     • Add/Edit/Delete category.
     Administrator:
     • Edit profile.
     • Add/Edit/Delete account.

  21. User Requirements
     Other requirements:
     • Search results will be ranked.
     • Each document has the following information: Name, Author, Supervisor, Category, Description.

  22. User Requirements
     Input files:
     • Keyword file
     • Abstract file
     • Full document file
     • Other materials

  23. System Requirements
     • The system communicates with client computers over HTTP and uses standard protocols for service-based interactions.
     • Configuration
       • Server: Windows Server 2008 operating system, .NET Framework 3.5, SQL Server 2008, IIS 7
       • Client: web browser

  24. Non-functional Requirements
     • Usability
     • Availability
     • Reliability
     • Security
     • Performance
     • Maintainability

  25. System Analysis and Design
     • Architectural design
     • Detail design
     • Database design
     • Coding convention
     • Extract Keyword algorithm
     • Ranking

  26. Architectural design
     • MVC architecture design pattern
     • Overall architecture

  27. Detail design
     CProDMS Component Diagram

  28. Database design
     Entity diagram

  29. Coding convention
     Follow:
     • Microsoft .NET Library Standards
     • FxCop rules and Code Analysis for Managed Code Warnings

  30. Extract Keyword Algorithm
     • Study Algorithm
     • Introduction
     • Evaluation
     Source: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information (Yutaka Matsuo and Mitsuru Ishizuka, Dec. 10, 2003)

  31. Algorithm – What is the keyword?
     • Meaning
     • Frequency
     • Position

  32. Algorithm – Step by step
     Preprocessing:
     • Discard stop words
     • Stem
     • Extract frequency
     Processing:
     • Select frequent terms
     • Expected probability
     • Calculate χ'² value
     Output

  33. Algorithm – Studying
     Example:
     Original text: Information is the most powerful weapon in the modern society. Every day we are overflowed with a huge amount of data in form of electronic newspaper articles, emails, web pages and search results. Often, information we receive is incomplete, such that further search activities are required to enable correct interpretation and usage of this information.
     Step 1, discarded stop words: Information powerful weapon modern society day overflowed huge amount data electronic newspaper articles emails web pages search results Often information receive incomplete such further search activities required enable correct interpretation usage information
     Step 2, stemmed words (using the Porter Stemming Algorithm): Informat power weapon modern societi day overflow huge amoun data electronic newspaper articl email web page search result Often informat receive incomplet such further search activ requir enable correct interpret usag informat
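The stop-word step in the example above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `STOP_WORDS` set is a deliberately tiny, hypothetical list, the function name `discard_stop_words` is an assumption, and stemming (Step 2) is left out because a Porter stemmer is too long to reproduce here.

```python
import re

# Hypothetical, deliberately tiny stop list chosen for this example only;
# a real system would load a much fuller list.
STOP_WORDS = {
    "is", "the", "most", "in", "we", "are", "with", "a",
    "of", "and", "to", "this", "that", "every",
}


def discard_stop_words(text):
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


sentence = "Information is the most powerful weapon in the modern society."
print(discard_stop_words(sentence))
```

For the first sentence of the example text this keeps "information", "powerful", "weapon", "modern" and "society", matching the start of the Step 1 word list.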

  34. Algorithm – Studying
     Select frequent terms
     • According to the study, the number of keywords is about 10% of the number of terms in the document, and no more than 30 terms.
     • Take the top ten frequent terms (denoted as G) and their probabilities of occurrence, normalized so that the sum is 1.
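Selecting the frequent-term set G can be sketched as below. The function name `select_frequent_terms` and the `top_k` parameter are assumptions made for illustration; the input is taken to be the list of already preprocessed (stop-word-free, stemmed) terms.

```python
from collections import Counter


def select_frequent_terms(terms, top_k=10):
    """Return {term: probability} for the top_k most frequent terms.

    Probabilities are normalized over the selected terms only,
    so they sum to 1 as described on the slide.
    """
    top = Counter(terms).most_common(top_k)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}


terms = ["informat", "search", "informat", "web", "search", "informat"]
G = select_frequent_terms(terms, top_k=2)
print(G)
```

With `top_k=2` the sketch keeps "informat" (3 occurrences) and "search" (2 occurrences), giving probabilities 0.6 and 0.4.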

  35. Algorithm – Studying
     Co-occurrence and Importance
     Two terms in a sentence are considered to co-occur once.
     Example: The imitation game could then be played with the machine in question and the mimicking digital computer and the interrogator would be unable to distinguish them.
     "imitation" and "digital computer" have one co-occurrence.

  36. Algorithm – Studying
     Co-occurrence and Importance

  37. Algorithm – Studying
     Co-occurrence and Importance
     The degree of bias of co-occurrence can be used as an indicator of term importance.
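The counting rule from the co-occurrence slides (two terms in the same sentence co-occur once) can be sketched like this. The function name `cooccurrence_counts` and the representation of a sentence as a list of terms are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count term pairs per sentence.

    `sentences` is a list of term lists. Each unordered pair of distinct
    terms is counted once per sentence, however often the terms repeat.
    Keys are (a, b) tuples with a < b in alphabetical order.
    """
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts


sentences = [
    ["imitation", "game", "machine", "digital computer"],
    ["imitation", "game"],
]
counts = cooccurrence_counts(sentences)
print(counts[("digital computer", "imitation")])
print(counts[("game", "imitation")])
```

Here "imitation" and "digital computer" share one sentence and so have one co-occurrence, while "imitation" and "game" share two.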

  38. Algorithm – Studying
     The statistical value of χ² is defined as
     $$\chi^2(w) = \sum_{g \in G} \frac{\left(\mathrm{freq}(w, g) - n_w p_g\right)^2}{n_w p_g}$$
     where
     • $p_g$: unconditional probability of a frequent term $g \in G$ (the expected probability)
     • $n_w$: the total number of co-occurrences of term $w$ and frequent terms $G$
     • $\mathrm{freq}(w, g)$: frequency of co-occurrence of term $w$ and term $g$

  39. Algorithm – Studying
     We consider the length of each sentence and revise our definitions:
     • $p_g$: (the sum of the total number of terms in sentences where $g$ appears) divided by (the total number of terms in the document)
     • $n_w$: the total number of terms in the sentences where $w$ appears, including $w$

  40. Algorithm – Studying

  41. Algorithm – Studying
     The following function is used to measure the robustness of bias values. It subtracts the maximal term from the χ² value:
     $$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{\left(\mathrm{freq}(w, g) - n_w p_g\right)^2}{n_w p_g}$$
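The χ² value and its robust variant χ'² can be computed together, as in the sketch below. The function name `chi_square_scores` and the dictionary-based inputs are assumptions for illustration; the sketch also assumes every $n_w p_g$ is positive, so no division by zero occurs.

```python
def chi_square_scores(freq_w, n_w, p):
    """Return (chi2, chi2_robust) for one term w.

    freq_w: {g: freq(w, g)}, co-occurrence counts of w with each frequent term g
    n_w:    total co-occurrence count (or sentence-length total) for w
    p:      {g: p_g}, expected probability of each frequent term g in G
    Assumes n_w * p_g > 0 for every g.
    """
    per_term = [
        (freq_w.get(g, 0) - n_w * p_g) ** 2 / (n_w * p_g)
        for g, p_g in p.items()
    ]
    chi2 = sum(per_term)
    # Robust variant: subtract the single largest contribution.
    chi2_robust = chi2 - max(per_term)
    return chi2, chi2_robust


p = {"a": 0.5, "b": 0.5}
freq_w = {"a": 8, "b": 2}
print(chi_square_scores(freq_w, n_w=10, p=p))
```

In this toy case the expected count is 5 for each frequent term, each contributes 9/5 = 1.8, so χ² is 3.6 and χ'² is 1.8. A term whose co-occurrence matches the expected distribution exactly scores 0 on both.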

  42. Algorithm – Studying

  43. Algorithm – Studying
     To improve the extracted keywords, we cluster terms. Two major approaches (Hofmann & Puzicha 1998) are:
     • Similarity-based clustering: if terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster.
     • Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster.
     E.g.: Monday is a day in week. Tuesday is a day in week. Wednesday is a day in week.

  44. Algorithm – Studying
     • Similarity-based clustering centers upon the red circles
     • Pairwise clustering focuses on the green circles

  45. Algorithm – Studying
     Similarity-based clustering
     Cluster a pair of terms whose Jensen-Shannon divergence is [condition and formulas not preserved in the transcript]

  46. Algorithm – Studying
     Pairwise clustering
     Cluster a pair of terms whose mutual information is [condition and formula not preserved in the transcript]
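The two measures named on the clustering slides can be sketched with their textbook definitions. This is an assumption: the formulas and thresholds used on the slides are not preserved, so the functions below (`js_divergence`, `pointwise_mutual_information`, both hypothetical names) show the standard Jensen-Shannon divergence and pointwise mutual information with natural logarithms, not necessarily the exact variants the presentation used.

```python
import math


def _kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over dict distributions."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence between two distributions.

    p and q map outcomes (here: co-occurring terms) to probabilities.
    The result lies between 0 (identical) and log 2 (disjoint support).
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def pointwise_mutual_information(p_xy, p_x, p_y):
    """log(P(x, y) / (P(x) P(y))); positive when x and y co-occur
    more often than independence would predict."""
    return math.log(p_xy / (p_x * p_y))


same = {"day": 0.5, "week": 0.5}
print(js_divergence(same, same))
print(js_divergence({"day": 1.0}, {"game": 1.0}))
print(pointwise_mutual_information(0.25, 0.5, 0.5))
```

Terms with near-identical co-occurrence distributions, such as the weekday names in the earlier example, would have a divergence close to 0 under this definition.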

  47. Algorithm – Evaluation
     • Precision: ratio of correct keywords to the number of extracted keywords
     • Coverage: ratio of indispensable keywords in the list to all the indispensable terms
     • Frequency index: average frequency of the keywords in the list
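The three evaluation measures can be sketched as below. The function names and the idea of passing human judgments as sets (`correct`, `indispensable`) are assumptions for illustration; the keyword lists and counts are made-up example data, not results from the project.

```python
def precision(extracted, correct):
    """Share of extracted keywords that were judged correct."""
    return sum(1 for k in extracted if k in correct) / len(extracted)


def coverage(extracted, indispensable):
    """Share of indispensable terms that appear in the extracted list."""
    return len(set(extracted) & set(indispensable)) / len(indispensable)


def frequency_index(extracted, term_counts):
    """Average document frequency of the extracted keywords."""
    return sum(term_counts[k] for k in extracted) / len(extracted)


# Made-up example data.
extracted = ["search", "rank", "web", "day"]
correct = {"search", "rank", "web"}
indispensable = {"search", "rank", "keyword", "document"}
term_counts = {"search": 4, "rank": 2, "web": 1, "day": 1}

print(precision(extracted, correct))
print(coverage(extracted, indispensable))
print(frequency_index(extracted, term_counts))
```

For this example, three of four extracted keywords are correct (precision 0.75), two of four indispensable terms are covered (coverage 0.5), and the average frequency is 2.0.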

  48. Ranking – Why?
     Ranking Result

  49. Ranking

  50. Ranking
     Use the formula for the rank of a term in a collection of documents (Automatic Keyword Extraction for Database Search; First examiner: Prof. Dr. techn. Dipl.-Ing. Wolfgang Nejdl; Second examiner: Prof. Dr. Heribert Vollmer; Supervisor: MSc. Dipl.-Inf. Elena Demidova):
     $$R(t) = F_d(t) \cdot \log\left(1 + \frac{N}{N(t)}\right) \quad (1)$$
     Ranking formula:
     $$\mathrm{Rank} = \frac{d \cdot R_d(t)}{R(t)} \quad (2)$$
     $$\Rightarrow \mathrm{Rank} = \frac{d \cdot R_d(t)}{F_d(t) \cdot \log\left(1 + \frac{N}{N(t)}\right)} \quad (3)$$
     where
     • $R(t)$: rank of term $t$ in all the collection
     • $F_d(t)$: frequency of term $t$ in the given document
     • $N$: total number of documents in the collection
     • $N(t)$: total number of documents that contain term $t$
     • $R_d(t)$: rank of term $t$ in the document, as extracted by the Extract Service
     • $d$: reliability coefficient
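Formulas (1) to (3) translate directly into code, as in the sketch below. The function and parameter names are assumptions for illustration, and the natural logarithm is assumed because the slide does not state the base; the input numbers are made up.

```python
import math


def collection_rank(fd_t, n_docs, n_docs_with_t):
    """Formula (1): R(t) = Fd(t) * log(1 + N / N(t)).

    Natural logarithm is an assumption; the slide does not give the base.
    """
    return fd_t * math.log(1 + n_docs / n_docs_with_t)


def rank(d, rd_t, fd_t, n_docs, n_docs_with_t):
    """Formulas (2) and (3): Rank = d * Rd(t) / R(t)."""
    return d * rd_t / collection_rank(fd_t, n_docs, n_docs_with_t)


# Made-up inputs: term occurs 3 times in the document, 10 of 100 documents
# contain it, the extractor scored it 6.0, reliability coefficient 0.5.
print(collection_rank(3, 100, 10))
print(rank(0.5, 6.0, 3, 100, 10))
```

With these inputs R(t) is 3·log(11) and the rank reduces to 1/log(11).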
