Using data mining methods to identify college freshmen who need special assistance in their academic performance. Advisor: Prof. 呂學毅  Student: 陳彥璋
Background (description to be added)
Background (cont.) (description to be added)
Background (cont.) • Lower grades
Background (cont.) • Low grades will affect the student's physical and mental health. • Low grades will affect grades in several future semesters.
Background (cont.) For low academic achievement: • Guidance in college ※ Taking Taiwan's National Yunlin University of Science and Technology as an example: starting in academic year 94, the university built the 「強化學生輔導新體制工作計畫」 (a work plan to strengthen the new student-guidance framework), establishing an early-warning policy on students' academic achievement.
Motivation (cont.) The general practice: test scores come out at final-exam time, and only then is the list of students who need special assistance produced. This study: starting from the moment the freshmen enter, the list of students who need special assistance is predicted before test scores come out at final-exam time.
Objective • The aim of this study is to construct a model with data mining tools to predict college freshmen with low academic achievement. • Finding students who need special assistance in their academic performance, and helping them improve their academic performance through guidance as early as possible.
The problems of college freshmen with low academic achievement.
Coding: data attributes of the primary data
Feature selection Some attributes are noisy or redundant. This noise makes it more difficult to discover meaningful patterns in the data. Dash (1997), Sequential Backward Selection: using Shannon's entropy as the identification rule to find the attributes with the most explanatory capability. Sequential Backward Selection: T = original variable set For k = 1 to M − 1 { /* iteratively remove variables one at a time */ For every variable v in T { /* determine which variable to remove */ T_v = T − v Calculate E_{T_v} on D using Eqn. 1 } Let v_k be the variable that minimizes E_{T_v} T = T − v_k /* remove v_k as the least important variable */ Output v_k }
Feature selection (cont.) Shannon's entropy (Eqn. 1): E = − Σ_{i=1..N} Σ_{j=1..N} [ S_ij · log(S_ij) + (1 − S_ij) · log(1 − S_ij) ], with S_ij = e^{−α·D_ij}, where N is the number of data records, S_ij is the degree of similarity between any two records in the data set (a value between 0 and 1), D_ij is the distance between records i and j, and α is an experimental parameter.
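A minimal Python sketch of the entropy-based Sequential Backward Selection described above. The Euclidean distance, the default `alpha` value, and the function names are assumptions for illustration, not taken from the slides:

```python
import math

def similarity(x, y, alpha=0.5):
    """S_ij = exp(-alpha * D_ij); here D_ij is assumed to be Euclidean distance."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-alpha * d)

def entropy(data, features, alpha=0.5):
    """Pairwise-similarity entropy E of the data projected onto `features`."""
    e, n = 0.0, len(data)
    for i in range(n):
        for j in range(i + 1, n):
            xi = [data[i][f] for f in features]
            xj = [data[j][f] for f in features]
            s = similarity(xi, xj, alpha)
            if 0.0 < s < 1.0:  # terms vanish at s = 0 or s = 1
                e -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return e

def sequential_backward_selection(data, features):
    """Repeatedly drop the variable whose removal minimizes the entropy,
    yielding an importance ranking (least important first)."""
    remaining = list(features)
    order = []
    while len(remaining) > 1:
        least = min(remaining,
                    key=lambda v: entropy(data, [f for f in remaining if f != v]))
        remaining.remove(least)
        order.append(least)
    return order + remaining
```

The ranking can then be cut at any point to keep only the most explanatory attributes.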
Data mining: K-fold cross-validation • K-fold cross-validation is mainly used in settings where the goal is prediction, to estimate how accurately a predictive model will perform in practice. • One round of cross-validation involves partitioning a sample of data into complementary subsets: • performing the analysis on one subset (called the training set), and • validating the analysis on the other subset (called the testing set). • To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
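A minimal sketch of the partitioning step in K-fold cross-validation. The function name is an assumption, and for simplicity the indices are kept in order; in practice the data would usually be shuffled first:

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each of the k folds serves exactly once as the testing set while the
    remaining k-1 folds form the training set.
    """
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

One round trains on `train`, evaluates on `test`, and the k accuracies are averaged.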
Data mining: C4.5 decision trees C4.5 (1993) is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification. [Diagram: internal nodes test attributes, branches carry attribute values, and leaf nodes assign classes.]
Data mining: C4.5 decision trees (cont.) • At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. • Its criterion is the normalized information gain (gain ratio) that results from choosing an attribute to split the data: Info(D) = − Σ_i p_i · log2(p_i), Gain(A) = Info(D) − Info_A(D), SplitInfo(A) = − Σ_j (|D_j| / |D|) · log2(|D_j| / |D|), GainRatio(A) = Gain(A) / SplitInfo(A). • The attribute with the highest normalized information gain is chosen to make the decision (class). • The C4.5 algorithm then recurses on the smaller sublists. • Here Info(D) is the information content before the attribute test and Info_A(D) the information content after it; p_i is the proportion of data belonging to class i, a value between 0 and 1; SplitInfo(A) is the information content of the split at the node; and the second sum runs over the attribute's partitions.
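A small sketch of the gain-ratio criterion for a categorical attribute, following the formulas above. The function names are assumptions for illustration:

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Normalized information gain of splitting `labels` by attribute `values`."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    info_after = sum(len(s) / n * info(s) for s in subsets.values())
    gain = info(labels) - info_after
    split_info = -sum((len(s) / n) * math.log2(len(s) / n)
                      for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0
```

An attribute that perfectly separates the classes scores 1.0; one that is independent of the class scores 0.0.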
Data mining: Naïve Bayes classifier A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions: P(y | x) = P(x | y) · P(y) / P(x), where P(x) is the probability of the attribute set x, P(y) is the probability of class y, and P(x | y) is the probability of the attribute set x given class y. The posterior odds ratio P(y1 | x) / P(y2 | x) compares two classes; its decision threshold is an experimental parameter.
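A minimal sketch of a Naïve Bayes classifier over categorical attributes. The function names and the simple add-one (Laplace) smoothing are assumptions for illustration, not taken from the slides:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(y) and, per attribute, P(x_i | y) from categorical data."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (feature_index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return priors, cond

def predict_nb(row, priors, cond):
    """Pick the class y maximizing P(y) * prod_i P(x_i | y),
    using add-one smoothing so unseen values get nonzero probability."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for y, count in priors.items():
        p = count / n
        for i, v in enumerate(row):
            counts = cond[(i, y)]
            p *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if p > best_p:
            best, best_p = y, p
    return best
```

The independence assumption lets each attribute contribute a separate factor, which keeps training a single counting pass over the data.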
Data mining: MLP artificial neural network An ANN is a model whose members are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity. [Diagram: a neuron sums its weighted inputs in an adder and passes the result through an activation function to produce the output y.]
Data mining: MLP artificial neural network (cont.) A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. [Diagram: input layer → hidden layer → output layer.]
Data mining: MLP artificial neural network (cont.) Activation function (sigmoid): f(x) = 1 / (1 + e^(−x)). Learning through backpropagation: Δw_ij = −η · ∂E/∂w_ij, where net_j^(n) is the input to the j-th unit in layer n; E is the error function, whose purpose is to reduce the gap between the network's output and the target output; w_ij is the connection weight between the i-th unit in layer n−1 and the j-th unit in layer n; and η is the learning rate, which controls the step size of each steepest-descent minimization of the error function.
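A minimal sketch of one steepest-descent weight update for a single sigmoid unit, showing the Δw_ij = −η·∂E/∂w_ij rule in its simplest form. The squared-error function, learning rate, and function names are illustrative assumptions; a full MLP applies the same rule layer by layer:

```python
import math

def sigmoid(x):
    """Logistic activation f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, inputs, target, eta=0.5):
    """One gradient step for a single sigmoid unit with E = (1/2)(target - out)^2.

    delta is dE/dnet = (out - target) * f'(net), with f'(net) = out * (1 - out)
    for the sigmoid; each weight moves by -eta * delta * input.
    """
    net = sum(w * x for w, x in zip(weights, inputs))
    out = sigmoid(net)
    delta = (out - target) * out * (1.0 - out)
    return [w - eta * delta * x for w, x in zip(weights, inputs)]
```

Repeating the step drives the unit's output toward the target, which is exactly the error-reduction goal described above.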
Model evaluation: confusion matrix Accuracy = (TP + TN) / (TP + TN + FP + FN) Sensitivity (true positive rate) = TP / (TP + FN) Specificity = TN / (TN + FP) False positive rate = FP / (FP + TN)
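The four metrics above can be computed directly from the 2×2 confusion matrix counts; a small sketch (the function name is an assumption):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity (TPR), specificity, and FPR from the 2x2 matrix.

    tp/fn: actual positives predicted positive/negative;
    fp/tn: actual negatives predicted positive/negative.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }
```

For this study, sensitivity matters most: it is the share of truly at-risk freshmen the model actually flags for guidance.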
Expected result • This study is expected to construct a forecasting model from college freshmen's data. • The forecasting model will be constructed with data mining methods, selected from three classifiers (C4.5 decision trees, Naïve Bayes classifier, MLP artificial neural network). • The forecasting model can identify college freshmen who need special assistance in their academic performance. • Colleges can use the model to help students improve their academic performance through guidance as early as possible.