Machine Learning Applications in Biological Classification of River Water Quality Saso Dzeroski, Jasna Grobovic and William J. Walley 98419-548 조 동 연
Contents • Introduction • Learning Rules for Biological Classification of British Rivers • The Data • The Experiment • Analysis of Data about Slovenian Rivers • The Influence of Physical and Chemical Parameters on Selected Organisms • Biological Classification • Discussion
Introduction • Indicator Organisms (Bioindicators) • Given a biological sample, information on the presence and density of all indicator organisms in the sample is usually combined to derive a biological index that reflects the quality of the water at the site where the sample was taken. • Saprobic Index • The main problem: subjectivity • The subjectivity introduced at intermediate levels can and should be minimized.
Learning Rules for Biological Classification of British Rivers • Data • 292 samples of 80 benthic macroinvertebrate families • Abundance of animals • 0: no members of the particular family • 1: 1-2 • 2: 3-9 • 3: 10-49 • 4: 50-99 • 5: 100-999 • 6: more than 1000 • Sparse matrix • Five classes
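The abundance coding listed on this slide can be sketched as a small lookup. This is only an illustration: the function name `abundance_level` is hypothetical, and treating a count of exactly 1000 as level 6 is an assumption, since the slide says "more than 1000" while level 5 ends at 999.

```python
from bisect import bisect_right

# Lower bounds of abundance levels 1..6, read off the slide's coding table.
# Assumption: a count of exactly 1000 is placed in level 6.
LEVEL_LOWER_BOUNDS = [1, 3, 10, 50, 100, 1000]


def abundance_level(count: int) -> int:
    """Map a raw count of animals in one family to an abundance level 0..6."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # The number of lower bounds that are <= count is the level itself.
    return bisect_right(LEVEL_LOWER_BOUNDS, count)


if __name__ == "__main__":
    for n in (0, 2, 9, 49, 99, 999, 1500):
        print(n, abundance_level(n))
```

Because most families are absent from any given sample, most coded values are 0, which is why the slide describes the data as a sparse matrix.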
Experiment 1 • Modified CN2 algorithm • Measure the relative information score • Use the m-estimate instead of the Laplace estimate • The rules were required to be highly significant (99%). • 15 different values of m were tried (0, 0.01, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024). • Criterion • Information Score • Accuracy • Smaller value of the parameter m
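To make the difference between the two estimates concrete, here is a minimal sketch of the standard Laplace estimate and m-estimate of a rule's class probability. The function names and the example counts are hypothetical and are not taken from the experiments; only the list of m values comes from the slide.

```python
def laplace_estimate(n_class: int, n_covered: int, n_classes: int) -> float:
    """Laplace estimate: assumes a uniform prior over the classes."""
    return (n_class + 1) / (n_covered + n_classes)


def m_estimate(n_class: int, n_covered: int, prior: float, m: float) -> float:
    """m-estimate: shrinks the observed frequency towards the class prior.

    m controls how strongly the prior is weighted; m = 0 gives the raw
    relative frequency among the covered examples.
    """
    if n_covered + m == 0:
        raise ValueError("need at least one covered example or m > 0")
    return (n_class + m * prior) / (n_covered + m)


# The 15 values of m listed on the slide.
M_VALUES = [0, 0.01, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

if __name__ == "__main__":
    # Hypothetical rule: covers 25 examples, 20 of them in the predicted
    # class, whose prior probability is assumed to be 0.2.
    for m in M_VALUES:
        print(m, round(m_estimate(20, 25, 0.2, m), 3))
```

With m equal to the number of classes and a uniform prior, the m-estimate reduces to the Laplace estimate, so the m-estimate can be read as a generalization that lets the prior and its weight vary.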
Result 1 • 12 rules, m = 32 • 83% accuracy on the training set, 75% information content • On average, each rule covered 25 examples and contained 5 conditions. • The expert's conclusions confirmed the rules.
Experiment 2 • The main criticism was that the rules use only a small number of taxa, whereas the expert takes into account the whole community. • Six additional attributes • MoreThan0, MoreThan1, …, MoreThan5 • reflect the number of families • Result 2 • 13 rules, m = 64 • accuracy 84%, information content 80%
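One way to read the six additional attributes is sketched below. It assumes that MoreThanK counts the families in a sample whose abundance level is greater than K; the slide only says the attributes "reflect the number of families", so this interpretation and the function name `more_than_features` are assumptions.

```python
from typing import Dict, Sequence


def more_than_features(levels: Sequence[int]) -> Dict[str, int]:
    """Compute MoreThan0..MoreThan5 for one sample.

    `levels` holds the abundance level (0..6) of every family in the sample.
    Assumption: MoreThanK is the number of families with level > K.
    """
    return {
        f"MoreThan{k}": sum(1 for level in levels if level > k)
        for k in range(6)
    }


if __name__ == "__main__":
    # Hypothetical sample with six families.
    print(more_than_features([0, 1, 3, 6, 0, 2]))
```

Attributes of this kind summarize the whole community in a sample, which addresses the criticism that the first rule set relied on only a few taxa.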
Experiment 3 • 195 training examples, 97 test examples • Obvious performance improvement from the original to the extended problem.
Analysis of Data about Slovenian Rivers • Data • 4 years (1990 - 1993) • Biological samples are taken twice a year (summer, winter). • Physical and chemical analyses are performed several times a year for each sampling site. • 698 water examples • training (70% - 489 cases), test (30% - 209 cases)
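The 70% / 30% division above can be reproduced with a simple shuffled split. This is a sketch only: the slides do not say how the examples were assigned to each set, so the random shuffle, the seed, and the function name `split_train_test` are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    examples: Sequence[T], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the examples and cut them into training and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    train, test = split_train_test(range(698))
    print(len(train), len(test))
```

For 698 examples, rounding 70% gives 489 training cases and leaves 209 test cases, matching the counts on the slide.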
The Influence of Physical and Chemical Parameters on Selected Organisms • From an ecological and water quality point of view, this is an important research topic. • Binary Classification: Present / Absent • Attributes • Plants: Hardness, NO2, NO3, NH4, PO4, SiO2, Fe, Detergents, COD, BOD • Animals: Temperature, pH, O2, Saturation, COD, BOD
Result • Accuracy: 66% - 85% • Information score: 23% - 50% • 10 - 20 rules for each taxon • The average rule length was less than 5 conditions. • Average rule coverage was 15 to 45 examples.
Example taxa • Nitzschia palea • Elmis sp.
Biological Classification • 13 physical and chemical parameters • 27 bioindicators • 7 classes • The majority class comprises 339 of the 698 examples, thus the default accuracy is 48.6%.
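The default accuracy quoted above is the accuracy of always predicting the majority class. A minimal sketch follows; the function name and the class labels are hypothetical, and the sizes of the six minority classes are invented so that only the majority count (339 of 698) matches the slide.

```python
from collections import Counter
from typing import Hashable, Sequence


def default_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


if __name__ == "__main__":
    # Hypothetical label distribution over 7 classes; only the majority
    # count (339) and the total (698) come from the slide.
    sizes = {"c1": 339, "c2": 60, "c3": 60, "c4": 60, "c5": 60, "c6": 60, "c7": 59}
    labels = [name for name, n in sizes.items() for _ in range(n)]
    print(f"{default_accuracy(labels):.1%}")
```

Any induced rule set has to beat this baseline to be useful as a classifier.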
Discussion • We have described several applications of rule induction in the domain of biological water quality classification. • The produced rules are transparent and can be easily understood by experts. • The induced rules contained valuable knowledge about the domain studied. • Machine learning techniques can be useful tools for classification and data analysis in the domain of river water quality and other ecological domains.