
NLCD2001 – C5 and Cubist Training



Presentation Transcript


  1. NLCD2001 – C5 and Cubist Training Mike Coan (coan@usgs.gov) Limin Yang, Chengquan Huang, Bruce Wylie, Collin Homer Land Cover Strategies Team EROS Data Center, USGS June 2003

  2. Overview • Classification tree – C5/See5 • General description of the algorithm • C5 how-to • An example • Regression tree – Cubist • General description of the algorithm • Cubist how-to • An example

  3. C5/See5 – What is it? • “…a system that extracts informative patterns from data.” • C5: UNIX version / See5: Windows version • Predicts categorical variables (i.e., land cover) • www.rulequest.com - vendor and tutorial

  4. C5 for Land Cover Classification – Why this method? Compared to other methods such as the maximum likelihood classifier, neural networks, etc., the classification tree method: 1) is non-parametric and therefore independent of the distribution of class signatures, 2) can handle both continuous and nominal data, 3) generates interpretable classification rules, and 4) is fast to train and is often as accurate as, or even slightly more accurate than, many other classifiers.

  5. What does a decision tree model look like?
D-tree output syntax:
elev <= 1622:
:...asp <= 2: 81 (62/1)
:   asp > 2:
:   :...asp <= 9: 41 (12/1)
:       asp > 9: 81 (15)
elev > 1622:
:...slp > 10: 41 (34)
    slp <= 10:
    :...pidx > 64: 41 (15)
        pidx <= 64:
        :...slp <= 1: 81 (37/13)
            slp > 1:
            :...elev > 1885: 41 (42/3)
                elev <= 1885:
                :...asp <= 12:
                    :...slp <= 9: 41 (75/24)
                    :   slp > 9: 81 (2)
Comparable pseudo-code syntax:
If elev <= 1622
    if asp <= 2 then landcover = 81
    otherwise if asp <= 9 then landcover = 41
    otherwise landcover = 81
Otherwise (elev > 1622)
    if slp > 10 then landcover = 41
    otherwise if pidx > 64 then landcover = 41
    otherwise if slp <= 1 then landcover = 81
    (and so on…)
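For reference, the pseudo-code above maps directly onto nested conditionals in any language. The Python sketch below simply transcribes the branches printed on this slide; the function name classify_pixel is made up, and the real tree continues past what the slide shows.

def classify_pixel(elev, asp, slp, pidx):
    # Transcription of the example d-tree above; 41 and 81 are the
    # land cover codes printed by C5.
    if elev <= 1622:
        if asp <= 2:
            return 81
        return 41 if asp <= 9 else 81
    # elev > 1622
    if slp > 10:
        return 41
    if pidx > 64:
        return 41
    if slp <= 1:
        return 81
    if elev > 1885:
        return 41
    if asp <= 12:
        return 41 if slp <= 9 else 81
    return None  # the asp > 12 branch is not shown on the slide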

  6. General description of the algorithm [figure: plot with x and y axes]

  7. Major Steps in Developing a Spatial Classification (map) using C5 • Collect training points • Develop a classification tree model (aka decision tree, or d-tree) • Apply the model spatially to create a map

  8. Extract Coordinates from the Training Point File • In an ERDAS IMAGINE image viewer, either create new training data, or load existing point data as an ESRI ARC coverage or shapefile • From the viewer, select Vector>Attributes… to open the attribute table • Without selecting any rows, highlight the two columns containing x and y coordinates • With the cursor at the title row of the highlighted columns, right click the mouse to activate a list of options, select export • Specify the output file (*.dat), making sure it goes to the desired directory • Follow the same steps to output the land cover label column (if it exists) to a text file
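If the training points already exist as a shapefile, the coordinate and label export can also be scripted rather than done through the viewer. A minimal sketch, assuming the fiona library is available and that the label attribute is named NLCD2000 (both are assumptions; adjust the names and output formatting to what Pixel to ASCII expects):

import fiona

# Sketch: write x,y coordinates (and an optional label column) to text
# files. File and field names here are examples only.
with fiona.open("trainingpts.shp") as src, \
        open("coords.dat", "w") as xy_out, \
        open("labels.txt", "w") as lab_out:
    for feat in src:
        coords = feat["geometry"]["coordinates"]
        xy_out.write(f"{coords[0]},{coords[1]}\n")
        props = dict(feat["properties"])
        if props.get("NLCD2000") is not None:
            lab_out.write(f"{props['NLCD2000']}\n")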

  9. Viewer>Vector Attributes…>(select POINT_X and POINT_Y column headers)

  10. Right click on column headers, select Column Options > Export… > (Specify path of output)

  11. Extract the Spectral Values of the Training Points: Utilities > Pixel to ASCII…

  12. Pixel To Table: Fill in the details 1) Specify each input, “Add”, see additions to “Files to export” 2) Type of Criteria, choose “Point File” 3) Specify x,y locations (*.dat) to extract 4) Direct Output File (*.asc) to desired directory 5) “OK”

  13. Create the *.data file • Copy the Pixel to ASCII (*.asc) results – the header information will be important! • Edit the copy, and delete the first few lines defining the input file and bands • Load this file and the text file containing land cover labels in Excel, copy/paste the land cover label as the last column in the file containing the spectral values • Save the new file in comma-separated format (CSV). Rename the *.csv file to *.data (the required extension for C5).
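The Excel step can also be scripted. A minimal sketch, assuming the header lines have already been deleted, that the remaining spectral values are comma separated, and using made-up file names (spectral.csv, labels.txt, train.data):

import csv

# Sketch: append the land cover label as the last column and write the
# C5 *.data file. Assumes one label per line, in the same order as the
# spectral rows.
with open("spectral.csv") as vals, open("labels.txt") as labs, \
        open("train.data", "w", newline="") as out:
    writer = csv.writer(out)
    for value_line, label_line in zip(vals, labs):
        row = [v.strip() for v in value_line.strip().split(",")]
        writer.writerow(row + [label_line.strip()])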

  14. Creating a *.names file by hand
class. | to be predicted
x: ignore.
y: ignore.
gndvi1: continuous.
gndvi2: continuous.
gndvi3: continuous.
moist1: continuous.
moist2: continuous.
moist3: continuous.
loff1: continuous.
loff2: continuous.
loff3: continuous.
loff4: continuous.
loff5: continuous.
loff7: continuous.
lon1: continuous.
lon2: continuous.
lon3: continuous.
lon4: continuous.
lon5: continuous.
lon7: continuous.
spr1: continuous.
spr2: continuous.
spr3: continuous.
spr4: continuous.
spr5: continuous.
spr7: continuous.
aspect: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17.
class: 11,23,41,42,91,92.
This page is included for UNIX use and historical interest – don’t expect to use a hand-built *.names file with the CART Tool (it crashes…). The first line defines the variable name to be classified, which also appears on the last line with the values to be assigned. The order of input variables listed in this *.names file must correspond to the order of the data in the *.data file. Every line ends with “.”, and comments can be inserted after the “.” with “|”. Syntax is “variable: datatype.” Use brief but descriptive variable names. Use “ignore.” to exclude certain input layers as desired, such as the coordinate x and y values. Data can be discrete (integer values, which do not have ranking) or continuous (integer or floating point, with ranking). For example, the topographical derivative layer “aspect”, with values 0-17, is a discrete layer; list these aspect values individually.
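If you do build a *.names file by hand (UNIX use), generating it with a short script avoids typos in long variable lists. A sketch that reproduces the example above; the variable names and class codes are those shown on this slide, and the output file name task.names is made up:

# Sketch: write the *.names file shown above programmatically.
continuous = (["gndvi%d" % i for i in (1, 2, 3)]
              + ["moist%d" % i for i in (1, 2, 3)]
              + ["loff%d" % b for b in (1, 2, 3, 4, 5, 7)]
              + ["lon%d" % b for b in (1, 2, 3, 4, 5, 7)]
              + ["spr%d" % b for b in (1, 2, 3, 4, 5, 7)])

with open("task.names", "w") as f:
    f.write("class. | to be predicted\n")
    f.write("x: ignore.\n")
    f.write("y: ignore.\n")
    for name in continuous:
        f.write(name + ": continuous.\n")
    f.write("aspect: " + ",".join(str(v) for v in range(18)) + ".\n")
    f.write("class: 11,23,41,42,91,92.\n")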

  15. ERDAS Module – Classification And Regression Tree (CART) A new ERDAS Imagine module (Windows only) that implements the See5 and Cubist tools. CART Utilities > CART Sampling Tool… can create both the *.data and *.names files, but will still require some knowledgeable editing…

  16. CART Module, ERDAS 8.6, Windows XP

  17. CART Sampling Tool for See5 A raster image of training points is being used! Select all your input variables, fill out the sampling list Background values in the training raster are set to 0 (not 255) Maximize the number of training points to use – but don’t try 100%! It crashes…

  18. CART Sampling Tool - *.names
| Generated with cubistinput by EarthSat
| Training samples : 504
| Validation samples: 0
| Minimum samples : 0
| Sample method : Random
| Output format : See5
dep. |d:/nlcd2000/training/c5/z16/train/trainingpts.img(:Band_1)
Xcoord: ignore.
Ycoord: ignore.
band01: continuous. |d:/nlcd2000/training/c5/z16/tc/tcoff.img(:Layer_1)
band02: continuous. |d:/nlcd2000/training/c5/z16/tc/tcoff.img(:Layer_2)
band03: continuous. |d:/nlcd2000/training/c5/z16/tc/tcoff.img(:Layer_3)
band04: continuous. |d:/nlcd2000/training/c5/z16/tc/tcon.img(:Layer_1)
band05: continuous. |d:/nlcd2000/training/c5/z16/tc/tcon.img(:Layer_2)
band06: continuous. |d:/nlcd2000/training/c5/z16/tc/tcon.img(:Layer_3)
band07: continuous. |d:/nlcd2000/training/c5/z16/tc/tcspr.img(:Layer_1)
band08: continuous. |d:/nlcd2000/training/c5/z16/tc/tcspr.img(:Layer_2)
band09: continuous. |d:/nlcd2000/training/c5/z16/tc/tcspr.img(:Layer_3)
band10: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17. |d:/nlcd2000/training/c5/z16/topo/aspect.img(:Layer_1)
band11: continuous. |d:/nlcd2000/training/c5/z16/topo/dem.img(:Layer_1)
band12: continuous. |d:/nlcd2000/training/c5/z16/topo/posindex.img(:Layer_1)
band13: continuous. |d:/nlcd2000/training/c5/z16/topo/slope.img(:Layer_1)
band14: continuous. |d:/nlcd2000/training/c5/z16/b9/b9off.img(:Layer_1)
band15: continuous. |d:/nlcd2000/training/c5/z16/b9/b9on.img(:Layer_1)
band16: continuous. |d:/nlcd2000/training/c5/z16/b9/b9spr.img(:Layer_1)
dep: 11,31,41,42,43,52,53,54,71,81,82,91,92. |d:/nlcd2000/training/c5/z16/train/trainingpts.img(:Band_1)
dep – the dependent variable, aka “the thing to be estimated”
List of input variables (count ‘em: 18)

  19. CART Sampling Tool - *.data
-1431270,1861950,87,80,84,110,68,63,103,74,71,1,1783,50,2,154,215,199,42
-1369020,1862160,120,60,52,145,57,32,108,65,63,1,1690,16,2,185,244,199,52
-1368780,1862070,70,75,95,96,72,73,77,74,94,13,1747,54,17,175,234,181,71
-1368990,1862010,81,69,72,105,70,55,82,78,80,13,1698,21,4,181,243,192,71
-1368660,1861950,89,67,81,104,67,74,84,72,87,13,1772,50,14,181,234,184,71
-1369470,1861710,76,77,102,112,66,73,84,73,91,5,1748,28,21,174,235,187,42
-1370070,1859610,68,83,101,86,80,84,71,81,97,6,1830,26,10,171,232,191,42
-1369950,1859580,62,80,94,84,78,77,64,81,95,1,1817,40,4,179,235,193,52
-1369920,1859400,58,82,97,84,76,75,66,81,93,15,1836,40,9,169,229,188,52
-1369230,1858290,66,77,88,89,71,71,74,78,85,1,2026,82,3,175,228,188,52
Columns: Xcoord, Ycoord, band01, …, band16, dep
Order of variables matches the *.names file

  20. Running See5 – Locate Data Start->Programs->See5 See5->1st Icon: Locate Data->(point to *.data) (See5 finds associated *.names file in same directory, too)

  21. Running See5 – Construct Classifier See5->2nd Icon: Construct Classifier->(select options) Typically, try a series of runs: 1) cross-validation, to evaluate the training data and get a preliminary accuracy estimate, 2) boosting, and 3) neither. Save each of the resulting *.out files – rename them to something unique so they don’t get overwritten.

  22. Run C5 (UNIX server) • Command syntax: c5 -f filestem • Output model is saved in a file called filestem.tree • The model can also be viewed as text on the screen, or redirected to a text file: c5 -f filestem > filestem.out

  23. Model evaluation More than one method to assess training accuracy- 1) Create an optional test file (filestem.test), containing training points withheld from developing the decision tree, to be used exclusively for evaluation 2) Run C5/See5 with the Cross-validation option – more realistic than training accuracy, and uses all training data sequentially (none withheld).

  24. Cross validation (-X option) • Divides the training samples into n equal-sized subsets • Develops a tree model using (n-1) subsets of training points and evaluates the model using the remaining subset • Repeats Step 2 n times, each time using a different subset for evaluation • Averages the results of these n tests • Command syntax: c5 -f filestem -X n
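The -X option performs all of this inside C5; the sketch below only spells out the logic with placeholder train_tree() and accuracy() functions, in case the procedure itself is unfamiliar.

# Sketch of n-fold cross-validation. train_tree() and accuracy() are
# hypothetical placeholders; C5's -X option does the equivalent internally.
def cross_validate(samples, n, train_tree, accuracy):
    folds = [samples[i::n] for i in range(n)]     # n roughly equal subsets
    scores = []
    for i in range(n):
        held_out = folds[i]
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tree(training)              # build on n-1 subsets
        scores.append(accuracy(model, held_out))  # test on the remaining subset
    return sum(scores) / n                        # average of the n tests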

  25. Pruning the d-tree model (-m or -c options) • A d-tree model can be overfitted. • Two options control overfitting: -m: specifies the minimum number of pixels a node must contain before it can be split further; larger m values increase the severity of pruning (default value is 2). -c: lowering this value causes more severe pruning of a tree model (default value is 25; allowable range is 0-100).

  26. Boosting (-b option) Why? Often improves accuracy by 5% or more! • Develops an initial tree model using all training points and classifies them • With higher weights assigned to the misclassified points, resamples the same number of points from the original training data set, builds another tree model, and uses the new model to classify the original points • Repeats Step 2 several times (default is 10) • All of the developed d-tree models are used to classify new sample points, with the final prediction being a weighted vote of the predictions of those models • Command syntax: c5 -f filestem -b
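C5's boosting is a variant of AdaBoost and its exact weighting is internal to the program, but the resample-with-weights loop looks roughly like the sketch below; train_tree(), predict(), and the simple weight-doubling update are illustrative placeholders, not C5's actual rule.

import random

# Conceptual sketch of boosting by weighted resampling.
def boost(samples, labels, train_tree, predict, trials=10):
    n = len(samples)
    weights = [1.0 / n] * n
    models = []
    for _ in range(trials):
        # resample n points, favoring those with higher weights
        idx = random.choices(range(n), weights=weights, k=n)
        models.append(train_tree([samples[i] for i in idx],
                                 [labels[i] for i in idx]))
        # raise the weight of points the newest model misclassifies
        for i in range(n):
            if predict(models[-1], samples[i]) != labels[i]:
                weights[i] *= 2.0
        total = sum(weights)
        weights = [w / total for w in weights]
    return models  # final label = weighted vote across these models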

  27. Develop a Spatial Classification (map) A mask file is handy for speeding up the processing and addressing all the pixels of an irregularly shaped mapping area. An error (or confidence) layer can also help identify areas to inspect for evaluation – and may serve as a guide to where new or additional training data is needed.
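To make the "apply the model spatially" step concrete, the sketch below runs a rule function (such as the one transcribed after slide 5) over a stack of input layers, restricted to a mask. numpy arrays stand in for the IMAGINE rasters; reading and writing the actual *.img files is omitted.

import numpy as np

# Sketch: per-pixel application of a d-tree rule, limited to masked pixels.
# The layer names follow the slide 5 example; rule() is any function that
# returns a land cover code (or None) for one pixel's values.
def apply_tree(elev, asp, slp, pidx, mask, rule):
    out = np.zeros(elev.shape, dtype=np.uint8)   # 0 = outside mask / unclassified
    rows, cols = np.nonzero(mask)
    for r, c in zip(rows, cols):
        label = rule(elev[r, c], asp[r, c], slp[r, c], pidx[r, c])
        out[r, c] = 0 if label is None else label
    return out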

  28. Review: Steps to Develop a Classification Tree Model • Required files • Optional file • Cross-validation • Pruning • Boosting

  29. Review: Files for Running C5 • filestem.names – attribute table, required • filestem.data – training data, required • (filestem.test – withheld training data test file, optional)

  30. Review: Examples of Required *.names and *.data Files
The order of the variables in the *.data file must match the order of the variables in the *.names file!
taskname.names:
lc. | to be classified
elev: continuous.
slp: continuous.
asp: discrete 17.
pidx: continuous.
lform: 0,1,2,3,4,5,6.
lc: 0,41,52,81.
taskname.data:
1577,15,7,66,4,81
1499,19,5,50,4,81
1485,20,1,0,1,81
1507,10,10,50,4,81
1534,10,1,50,4,81
1548,1,1,50,4,81
1562,0,1,50,1,81
1542,13,17,33,1,81

  31. Finally! Run an Example
Sample Data: Subset of Southern Utah (zone 16)
• Reflectance (bands 1,2,3,4,5,7 of leaf-on, leaf-off, spring dates): D:\Workspace\c5training\z16\refl
• Tasselled Cap (wetness, greenness, brightness of same dates): D:\Workspace\c5training\z16\tc
• Topographic Derivatives (aspect, DEM, position index, slope): D:\Workspace\c5training\z16\topo
• Thermal (band 9 for each date): D:\Workspace\c5training\z16\b9
• Date Mosaics (how each date mosaic was assembled): D:\Workspace\c5training\z16\date
• Training data in ARC/Info export coverage format: D:\Workspace\c5training\z16\trainingpts.e00
Note: columns “point_x” and “point_y” are Albers x and y coordinates, and “nlcd2000” (last column) is the land cover label.

  32. Another Example – With Your Own Training Data
Sample Data: Subset of Northern Minnesota (zone 41)
• Reflectance (bands 1,2,3,4,5,7 of leaf-on, leaf-off, spring dates): D:\Workspace\c5training\z41\refl
• Tasselled Cap (wetness, greenness, brightness of same dates): D:\Workspace\c5training\z41\tc
• Ratio Indices (Green NDVI, Moisture): D:\Workspace\c5training\z41\ratio NOTE: These ratio indices are nonstandard! They are given here only as examples of value-added, user-specific input layers for determining woody wetlands.
• Thermal (band 9 for each date): D:\Workspace\c5training\z41\b9
Training data must be created by screen interpretation. Go to the Minnesota DNR website, www.ra.dnr.state.mn.us/forestview, and choose “ForestView +”. Specify our subset area, township T62R17W. With the online stand information, and a new point coverage in your ERDAS Imagine Viewer, create your own training data.

  33. Regression tree – Cubist For estimating continuous variables like percent canopy cover, height, etc.

  34. Methods for estimating a continuous variable • Physically-based models (e.g. Li and Strahler (1992)) • Too complex to be inverted • Spectral mixture models: • End-members – green vegetation, non-photosynthetic vegetation, soil etc., not directly interpretable as the target variables; • Assumptions on spectral mixing may not be valid; • Empirical models -- results directly interpretable as the target variables • Linear regression • cannot approximate non-linear relationships • Regression tree • can approximate complex nonlinear relationships • Neural net

  35. Regression tree cartoon

  36. Steps to developing a spatial map using Cubist • Collect training points • Develop a regression tree model • Apply the model spatially to create a map • Masking

  37. Develop a regression tree model • Required files • Optional files • Cross-validation • Pruning • Committee model • Composite model

  38. Files for running Cubist • filestem.names – attribute table, required • filestem.data – training data, required • filestem.test – test file, optional • filestem.cases – data to be classified, optional

  39. Example Files – Just Like C5 Except! The *.names target variable is continuous, not discrete…
filestem.names:
treecover. | target
elev: continuous.
slp: continuous.
asp: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17.
pidx: continuous.
lform: 0,1,2,3,4,5,6.
treecover: continuous.
filestem.data:
1577,1,1,66,4,15
1499,1,1,50,4,89
1485,1,1,0,1,20
1507,1,1,50,4,45
1534,0,1,50,4,60
1548,1,1,50,4,5
1562,0,1,50,1,100
1542,1,1,33,1,70

  40. Cubist Training Data We use Cubist to estimate CONTINUOUS data (% impervious surface, % tree canopy). Training data is generated by classifying high-resolution imagery (DOQQs, IKONOS) – use C5/See5 to classify! Typically, impervious surfaces have a value of 1, shadows (unknown) have a value of 2, and all others go to 0. The CART Utilities->Percent Calculation… tool aggregates the recoded high-res image of 1 m (DOQQ) or 4 m (IKONOS) pixels into blocks of estimated percent coverage on 30 m Landsat-compatible pixels.
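Conceptually, the Percent Calculation step is a block average of the recoded high-resolution classification. A minimal numpy sketch, assuming a 1 m input whose dimensions divide evenly into 30 x 30 pixel blocks and treating the shadow code (2) as unknown; the real tool's handling of shadows and edges may differ.

import numpy as np

# Sketch: aggregate a recoded 1 m grid (1 = impervious, 2 = shadow,
# 0 = other) to 30 m percent impervious.
def percent_impervious(highres, block=30):
    rows, cols = highres.shape
    blocks = highres.reshape(rows // block, block, cols // block, block)
    impervious = (blocks == 1).sum(axis=(1, 3))  # impervious pixels per block
    known = (blocks != 2).sum(axis=(1, 3))       # ignore shadow pixels
    pct = 100.0 * impervious / np.maximum(known, 1)
    return pct.astype(np.uint8)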

  41. Run CART Sampling Tool (Cubist XP) Make the 30m dependent variable (continuous, values of 0 through 100) with the full-extent area from a model (like union255.gmd), using a value of 255 where it was padded. Pick many thousands of training points, and a non-zero number of points for Validation (it will crash if set to zero). Use Stratified Sampling, with a Minimum of 50 Samples per bin (0-100).

  42. Run Cubist (XP) Start->Programs->Cubist Cubist->1st Icon-> Locate Data->(point to *.data)

  43. Running Cubist – Build Model Cubist->2nd Icon: Build Model->(select options) NOTE: Cross Validation helps with preliminary assessment of the training data, but it DOES NOT generate a *.model file. Run again, without Cross Validation, to make the required *.model

  44. CART – Cubist Run… A mask could be used to eliminate water bodies, or to restrict the classification to road buffered areas combined with prior urban classes. The estimated impervious percentage for the entire area!

  45. Run Cubist (UNIX Server) Command syntax: cubist -f filestem • Model is saved in a file called filestem.model • The model can also be viewed as text on the screen, or redirected to a text file: cubist -f filestem -e 0 > filestem.out “-e 0” prevents extrapolation (beyond 100%)

  46. Regression Tree Model Evaluation – Just Like C5 Two methods of assessing training accuracy • Can provide a test file of reserved training data – filestem.test • Cross-validation – more realistic than training accuracy, uses all training data

  47. Pruning Regression Tree Models An r-tree model can be overfitted. Two options control overfitting: -m: specifies the minimum number of pixels in a node which can no longer be split (default value is 1% of training points). -r: specifies the maximum number of output rules.

  48. Committee model Similar to the boosting function of C5.

  49. Composite model • In addition to a regular regression tree model, Cubist can also make a prediction using the K-nearest neighbor (KNN) method • The final prediction is a combination of both • Initiated with option “-i”; alternatively, the program will determine whether a composite model is needed if option “-a” is used.
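Conceptually, a composite prediction blends the rule-based estimate with the average of the K nearest training cases. Cubist's actual nearest-neighbor weighting is internal to the program; the sketch below, with a fixed 50/50 blend and placeholder rule_predict() and distance() functions, only illustrates the idea.

# Illustrative sketch of a composite (rules + KNN) prediction. The 50/50
# blend, rule_predict() and distance() are placeholders, not Cubist's
# actual scheme.
def composite_predict(case, training, rule_predict, distance, k=5):
    rule_value = rule_predict(case)
    nearest = sorted(training, key=lambda t: distance(case, t["inputs"]))[:k]
    knn_value = sum(t["target"] for t in nearest) / k
    return 0.5 * rule_value + 0.5 * knn_value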

  50. Run a Cubist example Percent Imperviousness of Columbus, GA: Two training sites of 1-meter-resolution color DOQQ, already classified by unsupervised classification into tree, grass, shadow, bare, water, and impervious surface (6 classes). Combine the training sites, and recode so that impervious has a value of 1, shadow has a value of 2, and all others go to 0. Resample the recoded file from the higher resolution to 30-meter estimated percent impervious. Increase the area to match the extent of the L7 imagery, padding with values of 255: see “union255.gmd”. Use this file, along with two dates of L7 imagery (leaf-on and leaf-off), to estimate percent imperviousness for the entire area of Landsat data.
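The recode step of this example is easy to script. A numpy sketch, where the class codes for impervious and shadow are placeholders to be replaced with the values your unsupervised classification actually produced; the 30 m aggregation that follows is sketched after slide 40.

import numpy as np

IMPERVIOUS_CODE = 6   # hypothetical class code -- substitute your own
SHADOW_CODE = 3       # hypothetical class code -- substitute your own

# Sketch: impervious -> 1, shadow -> 2, everything else -> 0.
def recode(classified):
    out = np.zeros_like(classified, dtype=np.uint8)
    out[classified == IMPERVIOUS_CODE] = 1
    out[classified == SHADOW_CODE] = 2
    return out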
