
Categorization of DPS Dissertations Using Machine Learning Techniques

This project aims to categorize Pace DPS dissertations using machine learning. The extended work applies the techniques to full dissertations and uses a survey to understand the needs of full-time (FT) working professionals in education. The Institutional Review Board (IRB) ensures that ethical research practices are followed, and the Office of Planning, Assessment and Institutional Research (OPAIR) oversees data collection and analysis for planning needs. Obtaining IRB approval requires completing certification, an application, and a proposal review. The CITI Program provides training in the ethical conduct of research.

jfarish


Presentation Transcript


  1. TEAM 2 EMERGING INFORMATION TECHNOLOGIES I TEAM MEMBERS: Lisa Ellrodt, Tonya Fields, Ion Freeman, Ashley Haigler and Suzanna Schmeelk (Pace University, USA) COURSE: DCS-861A PROFESSORS: Dr. Charles Tappert and Dr. Tilak Agerwala DATE: February 23, 2019

  2. Categorization of DPS Dissertations Original Goal • To employ machine learning techniques to categorize Pace DPS dissertations based on their abstracts Extended work • To employ new techniques on the full dissertations • To contribute insights to the education community at large regarding the needs of the full-time (FT) working professional population • Design a survey • Distribute it via a survey tool (in our case, Qualtrics) • Analyze the collected data

  3. IRB Institutional Review Board • Ensures human subjects are protected during a research project • Rights • Welfare • Privacy • It is a committee that ensures research is • Ethical • Responsible • Free of undue risk to participants • In compliance with federal guidelines

  4. OPAIR Office of Planning, Assessment and Institutional Research “The mission of the Office of Planning, Assessment and Institutional Research is to facilitate assessment, planning and decision-making to support a culture of continuous improvement.” - Pace University https://www.pace.edu/administration/strategic-initiatives/opair

  5. OPAIR • Ensures effectiveness of the collection and analysis of data by departments to meet planning needs • Coordinates institutional assessments, surveys and research • Satisfies reporting requirements for accreditation • Ensures that the data gathered reflects the needs of the department

  6. Survey • Sent to DPS 2020 at the end of Fall 2018 • Being sent to the full DPS cohort in Spring 2019 • Intended to help the field understand the dissertation and education needs of full-time working professionals • Research interests • Time constraints • Job constraints • Educational motivation factors

  7. IRB Process (where, when, how, why) Institutional Review Board (IRB) • Under FDA regulations, an IRB is a group that has been formally designated to review and monitor biomedical research involving human subjects • The IRB has the authority to approve, require modifications in (to secure approval), or disapprove research • This group review serves an important role in the protection of the rights and welfare of human research subjects • The purpose of IRB review is to assure, both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of humans participating as subjects in the research • Pace’s IRB Website Link: https://www.pace.edu/office-of-research/research-protections-IRB-IACUC

  8. IRB Process - It’s tough to be legit • IRBs use a group process to review research protocols and related materials (e.g., informed consent documents and investigator brochures) to ensure protection of the rights and welfare of human subjects of research. • You may not begin your research until the IRB has given your research protocol full unconditional approval. • Review of Exempt or Expedited protocols takes about two to three weeks. • The review process for protocols submitted for Full Review can take up to a month or longer to complete. • If your project is considered research under IRB rules, you must submit an application to the IRB office and receive approval before research can begin.

  9. Steps to get your IRB Approval • Determine if your project requires IRB approval • Complete the Mandatory Online Certification for Researchers • Mandatory online certification is required for all researchers, investigators, and faculty advisors (if applicable for student-conducted research) who submit a proposal to the IRB • The NIH certification must be renewed every 3 years • OHRP now requires a fee to take its certification • Complete the IRB Research Project Application • Prepare the Informed Consent Document(s) • If the research study has human subjects under the age of 18 as participants, additional informed consent forms are required • The document describes, briefly and simply, what the research is about; give detail about your research • Submit Proposal Form • Review the previous steps and ensure that each step has been completed; the proposal will then go through the approval process • Steps of the approval process: https://36d5l8225ig13rrnnc3w4af9-wpengine.netdna-ssl.com/wp-content/uploads/sites/38/2014/04/Adapted_IRB_Process.pdf • Make adjustments as necessitated by IRB Review until approved • This may take time because the board meets only on a monthly basis, so allow for it in your schedule • Report Changes and Annually Renew Authorization (if needed) • Submit a Close-Out Form

  10. Certification Process CITI Program: Collaborative Institutional Training Initiative • Training in the ethical conduct of research with human participants • Ensure the rights, welfare and safety of participants are protected • Informed consent requirements • Reporting requirements • Maintenance and retention of records (keep complete files during and for 5 years after research ends) • All faculty, students, and staff proposing to use human participants in research are required to complete the IRB human participants training • Approvals for including human participants in proposed research projects will not be granted until this training has been completed and verified by IRB staff • The required human participant training can be taken online • Initial CITI training can take 8 or more hours to complete, but you can save your progress and finish at a later date to avoid fatigue • CITI completion reports are valid for three years and then must be renewed • (www.citiprogram.org)

  11. WORK TO DATE – Publishing Teamwork Research • Fall 2017: paper was presented at Pace Research Day • Spring 2018: after Pace Research Day we improved the thrust of the paper; it was accepted into: • IEEE-FIE 2018 (48th Annual IEEE-Frontiers in Education) San Jose, CA • Summer 2018: our paper incorporated the work with IBM and Weka • Fall 2018: it was accepted into IEEE-ICMLA 2018 (17th IEEE International Conference on Machine Learning and Applications) • Several improvements were completed

  12. PAPER TOPIC: BACKGROUND • Fall 2017: characterize the DPS dissertations produced in the program • Worked to hand-classify the papers; found that standardization was challenging • Resolved by using machine learning to cluster the dissertation abstracts • Original approach: TF-IDF (Term Frequency - Inverse Document Frequency) • Spring 2018: we tried 4-5 different algorithms in Weka and IBM BlueMix to see the differences • Fall 2018: we cleaned the data, as described on the following slides
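The TF-IDF weighting named on this slide can be illustrated with a minimal sketch. It assumes one common textbook variant (term frequency as count divided by document length, inverse document frequency as ln(N / df)); the slides do not say which variant or library the team used, and the three tokenized "abstracts" are made-up toy data.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Libraries often add smoothing and normalization on top of this.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy abstracts, already tokenized and cleaned.
abstracts = [
    ["agile", "project", "team"],
    ["cloud", "security", "project"],
    ["xml", "cloud", "schema"],
]
weights = tf_idf(abstracts)
```

In this toy corpus "agile" appears in only one document while "project" appears in two, so "agile" receives the higher weight in the first abstract. That is the property that makes TF-IDF useful for separating documents by topic.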

  13. COMPLETED FALL 2018 • Focused on data preparation, which is usually the most tedious and important step in data analytics • Used various online sources to convert PDF files to text • Manually processed 110 of 112 files to remove stray symbols that were introduced during conversion • Used the NLTK library (Python) for data preparation and cleansing • Tokenization – separate each word in the text • Case folding – reduce all letters to lowercase • Lemmatization – reduces words to their common base form while keeping the context (caring → care) • Stemming – similar to lemmatization, but a crude heuristic process that chops off the ends of words (caring → car) • Eliminate stop words – such as the, a, of, and • Eliminate domain-specific stop words that are not useful for the study – in this case “study”, “dissertation”, etc.
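The cleaning stages listed on this slide can be sketched in a few lines. This is a standard-library stand-in written for illustration, not the team's NLTK code: the stop-word lists are abbreviated assumptions, and `crude_stem` is a hypothetical suffix chopper rather than a real stemming algorithm. In NLTK the corresponding pieces are `word_tokenize`, `PorterStemmer`, `WordNetLemmatizer`, and the `stopwords` corpus, which require separately downloaded data. Lemmatization is omitted here because it needs a dictionary.

```python
import re

# Assumed, abbreviated stop lists for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "this"}
DOMAIN_STOP_WORDS = {"study", "dissertation"}

def crude_stem(token):
    """Chop a few common suffixes; a crude heuristic, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Case-fold, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())  # case folding + tokenization
    excluded = STOP_WORDS | DOMAIN_STOP_WORDS
    kept = [t for t in tokens if t not in excluded]
    return [crude_stem(t) for t in kept]

tokens = preprocess("This dissertation is a study of Caring agents")
```

Here "caring" is chopped to "car", the same over-aggressive result the slide uses to contrast stemming with lemmatization.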

  14. What we did last semester • Cleansed the data and used it for data mining: abstracts only (438K) and full dissertations (25 MB) • Applied TF-IDF processing • Clustering algorithm (K-means) resulted in six clusters • Cluster #5: Problem|Project|Development|Member|Agile|Evaluation|Communication|Solution|solve • Word count program in Spark for insights into the data • XML (128|2,577) CLOUD (76|2,438) • N-Grams 2 | N-Grams 3 (abstracts only)
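The K-means step can be sketched as plain Lloyd's algorithm. This is a minimal illustration under stated assumptions: the four 2-D vectors are made-up stand-ins for TF-IDF rows, `k=2` is chosen for the toy data (the slide reports six clusters on the real corpus), and the function names are hypothetical. The slides do not show the team's actual implementation.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical toy vectors: two documents weighted toward the first term,
# two weighted toward the second.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = k_means(vectors, k=2)
```

The first two vectors end up in one cluster and the last two in the other. Which cluster receives which numeric label depends on the random initialization, so only the grouping is meaningful.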

  15. NEAR FUTURE PLANS DETAILS (Which Could Be a Dissertation) • First, refine the methodology on the dissertation database • Apply the method to other databases to show that it generalizes • Categorize a variety of research papers • For example, apply it to 5-10 years of articles from several journals

  16. Spring 2019 Plans • Get running on AWS and/or Colab • Evaluate the scope of updating the code • Redo the study on full dissertations • Distribute the Fall 2017 survey to the full DPS cohort; analyze results • Match the categorization of each dissertation to its author and survey the author • Requires a new IRB protocol (or an update to the old study) since we now need to survey the author • Since there are over 100 dissertations, each group member will be responsible for reaching out to approximately 20-30 former students • We hope this will be the culmination of our 3 years of research next year

  17. Potential Issues • The size of the full-text dissertation data can strain resources • Debugging is time-consuming because huge amounts of dissertation data are being parsed • Number of respondents to the survey • Some email addresses we have are old • 20-30 people to survey per person on the team • Possible research venues • Journal article(s) • Machine Learning • Computing Education • Conference article

  18. Good news! We’ve started already! • Team: • Lisa Ellrodt • Tonya Fields • Ion Freeman • Ashley Haigler • Suzanna Schmeelk
