The next generation of identification tools: interactive programs incorporating multivariate models
Pavel B. Klimov and Barry M. OConnor
University of Michigan, Museum of Zoology, 1109 Geddes Ave., Ann Arbor, MI
Context: The vast majority of interactive identification programs use a sequential approach to assign an unknown specimen to a known group. This algorithm works when the distinguishing characters do not have overlapping values. If the boundaries between taxa overlap, simultaneous (=probabilistic, matching) methods of identification are more likely to lead to the correct assignment, but these methods usually require time-consuming measurements or experiments. We discuss how the sequential approach can be enhanced by incorporating multivariate statistics into it.

1. INTRODUCTION
Computer-assisted interactive identification allows quick assignment of an unknown specimen to a known taxon, with minimal cost in obtaining data and learning about the unknown. The number of characters used in the identification is substantially reduced compared to traditional taxonomic keys. For example, any of 128 taxa can be identified using only eight binary characters, or even fewer numeric or multistate characters. There are two major approaches to identification: sequential (=elimination, diagnostic) and simultaneous (=probabilistic, matching). In the sequential approach, only one character is used at each step of identification until the unknown specimen is assigned to a particular group. In the simultaneous approach, some or all characters are entered simultaneously, and the probability of group membership of the unknown specimen is calculated. The advantage of the sequential algorithm, particularly its multi-entry variant (=freedom to choose any character), is obvious when a taxon set is large and the taxa have distinct boundaries.
At each step, taxa matching the unknown are retained, and the diagnostic characters for this subset are ordered according to their separating power. This algorithm has been implemented in a variety of widely used interactive identification programs such as DELTA and Lucid. In contrast, simultaneous methods usually require data obtained by time-consuming measurements or experiments and offer less freedom in choosing characters, but they are more likely to lead to the correct assignment if the boundaries between some or all taxa overlap. When a data set is large and contains taxa that cannot be completely separated using qualitative, univariate, or bivariate characters, the two methods of identification need to be combined, with each approach handling the data appropriate to it.

2. MULTIVARIATE MODELS
Multivariate statistics summarizes variation in many variables across many specimens in a concise model that contains essential and comprehensive information about the groups and has predictive power. We consider two multivariate techniques commonly used to analyze intergroup differences: canonical variates analysis (CVA) and binomial logistic regression (LR). Both analyses handle metric and non-metric independent variables.

A canonical variates function is a latent variable created as a linear combination of independent variables,

    CV = b1*x1 + b2*x2 + ... + bn*xn + c    (1)

where the b's are coefficients, the x's are independent variables, and c is a constant. If there are g groups, g-1 CVs are calculated. For assignment purposes, the estimated posterior probability of group membership is calculated or, when multivariate normality of the independent variables is assumed, the value of CV can be used equivalently.

Logistic regression models can be expressed as the following equation,

    P(0) = exp(b1*x1 + b2*x2 + ... + bn*xn + c) / (1 + exp(b1*x1 + b2*x2 + ... + bn*xn + c))    (2)

where P(0) is the probability of an unknown specimen being taxon 0, and the other notation is the same as for CVA above. If P(0) exceeds 0.5, the unknown is assigned to taxon 0; otherwise it is assigned to taxon 1. A great advantage of LR over CVA is that it is a direct estimator of posterior probabilities: it calculates the class posterior probabilities without ever estimating the classes' individual density functions, a step that requires additional data (group means, prior probabilities, and the value of the mean square within groups).

3. INCORPORATING THE MODELS IN THE SEQUENTIAL ALGORITHM
Both (1) and (2) can be used in any sequential identification program as a single character, “Model classifies the unknown specimen to”, with the character states “group 1, group 2, …, group g”. The user, however, should simply be asked to enter the measurements or observations x1, x2, …, xn; the Bayesian probabilities of membership in each group are then calculated, and the greatest of these probabilities is used to classify the specimen.

Implementing the new data type will require some adjustment to the internal logic of an identification program. In the general case, some characters in the identification matrix can separate a subset of taxa without resorting to multivariate models. These characters, whether binary, multistate, or variable, should be given more weight than the complex character generated by a multivariate model. The complex character should be coded only for the subset of taxa included in the model and coded as "missing" for all other taxa. Because a multivariate model may contain characters that are used elsewhere in the identification matrix, these matching characters should be cross-referenced.

RESULTS
• When a data matrix contains both discrete and overlapping groups, the optimal way to identify specimens is to combine sequential and probabilistic strategies, applying each to the appropriate data.
• Canonical variates and logistic regression models can be used in the context of the sequential approach to calculate posterior probabilities and to classify the unknown specimen.

http://insects.ummz.lsa.umich.edu/beemites/Morphometrics.html

Research supported by NSF DEB-0118766 (PEET) and the USDA (CSREES #2002-35302-12654).
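
ILLUSTRATIVE SKETCH
The combined strategy can be illustrated with a minimal Python sketch. It is not taken from DELTA, Lucid, or the authors' own program. It assumes a two-taxon logistic regression model of the form of equation (2), treats the model output as a single complex character, and codes that character as "missing" for taxa outside the model. The taxon names, the character name char_A, the coefficients b and c, and the measurements are all hypothetical and chosen only for illustration.

```python
import math

MISSING = None  # state coded for taxa that the model does not cover


def logistic_probability(x, b, c):
    """Equation (2): P(0) = exp(z) / (1 + exp(z)), z = b1*x1 + ... + bn*xn + c."""
    z = sum(bi * xi for bi, xi in zip(b, x)) + c
    # Two algebraically equal forms, chosen so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model_state(x, b, c, labels=("taxon 0", "taxon 1")):
    """Turn the model output into a character state, using the 0.5 threshold."""
    return labels[0] if logistic_probability(x, b, c) > 0.5 else labels[1]


def eliminate(candidates, matrix, character, observed):
    """One sequential step: keep taxa whose coded state matches or is missing."""
    return [
        taxon for taxon in candidates
        if matrix[taxon].get(character, MISSING) in (observed, MISSING)
    ]


# Hypothetical identification matrix: "char_A" is an ordinary binary character;
# "model" is the complex character, coded only for the two taxa in the model.
matrix = {
    "taxon 0": {"char_A": "present", "model": "taxon 0"},
    "taxon 1": {"char_A": "present", "model": "taxon 1"},
    "taxon 2": {"char_A": "absent", "model": MISSING},
}

# Hypothetical fitted coefficients and constant (not from any real data set).
b, c = [2.0, -1.0], -0.5

candidates = list(matrix)
# Step 1: the discrete character is used first, since it carries more weight.
candidates = eliminate(candidates, matrix, "char_A", "present")
# Step 2: the user enters measurements x1, x2; the model supplies the state.
state = model_state([1.2, 0.8], b, c)
remaining = eliminate(candidates, matrix, "model", state)
print(state, remaining)
```

With these invented values, the discrete character removes taxon 2, the model gives P(0) of about 0.75, and the specimen is assigned to taxon 0. A canonical variates model could be plugged in the same way by replacing model_state with a function that returns the group with the highest posterior probability.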