Datcracker: Open data-mining platform connecting Rseslib and WEKA. Marcin Wojnarski, Warsaw University, Poland
Outline • Datcracker is … • Motivation • What is available in version 0.5 • HOWTO … • Architecture • Future releases
Datcracker is… …an open-source, extensible data-mining platform which provides a common architecture for data processing algorithms of various types. The algorithms can be combined to build data processing schemes of high complexity.
Main characteristics • Extensibility of the algorithm pool through a well-defined API • Extensibility of the types of data that algorithms operate on • Stream-based data processing, for efficient handling of large volumes of data and for freedom in designing complex experiments • Language: Java • Licence: GPL • Download: www.datcracker.org
Motivation • To enable independent research groups to exchange and combine their algorithms • To simplify the implementation of new algorithms
Available in version 0.5 • Rseslib algorithms: • classifiers (~20 algorithms) • Weka algorithms: • ARFF reader • classifiers (~60) • filters (47) • Datcracker algorithms: • Train&Test evaluation scheme • Data types: • vectors of numeric and/or symbolic features
HOWTO: Read ARFF file
Cell arff = new ArffReaderCell();
arff.set("filename", "data/iris.arff");
arff.set("labelIndex", "last");   // the last attribute is the decision label
arff.open();
System.out.println(arff.next());
System.out.println(arff.next());
arff.close();
Output:
[data:[5.1 3.5 1.4 0.2] label:[Iris-setosa]]
[data:[4.9 3.0 1.4 0.2] label:[Iris-setosa]]
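HOWTO: Train classifier (Rseslib)
Cell learner = new RseslibClassifier("C45");
learner.set("pruning", "true");
learner.setSource(arff);          // training data
learner.build();                  // train the classifier
// arff_test is assumed to be another source cell with test data (its declaration is not shown)
learner.setSource(arff_test);
learner.open();
System.out.println(learner.next());
learner.close();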
HOWTO: Train classifier (Weka)
Cell learner = new WekaClassifier("J48");
learner.set("minNumObj", "2");
learner.setSource(arff);
learner.build();
HOWTO: Apply Weka filter
Cell filter = new WekaFilter("attribute.Remove");
filter.set("attributeIndices", "3-6");
filter.setSource(arff);
filter.open();
System.out.println(filter.next());
System.out.println(filter.next());
filter.close();
HOWTO: Set parameters
arff.set("filename", "data/iris.arff");
arff.set("labelIndex", "last");
...
OR
Parameters par = new Parameters();
par.set("filename", "data/iris.arff");
par.set("labelIndex", "last");
...
arff.setParameters(par);
par = arff.getParameters();
HOWTO: Train & Test
Cell learner = new RseslibClassifier("C45");
learner.set("pruning", "true");
TrainAndTest tt = new TrainAndTest(learner);
tt.set("trainPercent", "70");
tt.set("repetitions", "10");
tt.setSource(source);
tt.build();
System.out.println(tt.report());
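The slide does not say how TrainAndTest works internally. One plausible reading of "trainPercent" and "repetitions" is a repeated random split, sketched below with a trivial majority-label classifier. All names here (TrainTestSketch, majorityLabel, evaluate, the seed argument) are hypothetical and not part of Datcracker; the sketch also assumes a non-empty test part.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of a repeated train-and-test evaluation; not Datcracker's TrainAndTest.
class TrainTestSketch {

    /** Trivial stand-in classifier: always predicts the most frequent training label. */
    static String majorityLabel(List<String> train) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String label : train) {
            int c = counts.merge(label, 1, Integer::sum);
            if (best == null || c > counts.get(best)) best = label;
        }
        return best;
    }

    /** Mean test accuracy over `repetitions` random splits, with `trainPercent` % used for training. */
    static double evaluate(List<String> labels, int trainPercent, int repetitions, long seed) {
        Random rnd = new Random(seed);
        int trainSize = labels.size() * trainPercent / 100;
        double sum = 0.0;
        for (int r = 0; r < repetitions; r++) {
            List<String> shuffled = new ArrayList<>(labels);
            Collections.shuffle(shuffled, rnd);
            List<String> train = shuffled.subList(0, trainSize);
            List<String> test = shuffled.subList(trainSize, shuffled.size());
            String prediction = majorityLabel(train);
            int correct = 0;
            for (String actual : test) {
                if (actual.equals(prediction)) correct++;
            }
            sum += (double) correct / test.size();
        }
        return sum / repetitions;
    }

    public static void main(String[] args) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 70; i++) labels.add("yes");
        for (int i = 0; i < 30; i++) labels.add("no");
        System.out.println("mean accuracy: " + evaluate(labels, 70, 10, 42L));
    }
}
```

Averaging over several random splits reduces the dependence of the reported accuracy on one particular partition of the data.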
Data Processing Chain [Diagram: cells labelled ARFF, New ARFF, Filter1, Filter2, Classifier and Another Classifier; the filters are configured with set("attributeIndices","0-3") and set("attributeIndices","5"); cells are linked with Cell.setSource(sourceCell)]
Outline • Cell • interfaces • state • how to override • Data • MetaData
Cell • Main class of the Datcracker architecture • Base class for all data-processing algorithms • classifiers • clusterers • filters • data loaders • data generators • … • Cells can be connected in a Data Processing Chain • Data transfer between cells takes the form of a stream of samples • A receiving cell may immediately consume incoming samples, so large volumes of data are processed efficiently
Cell’s interface Cell can be: • a data source • a data receiver • buildable • parameterized
Cell as a data source Cell’s interface for data transfer: • open() : MetaSample opens a communication session • next() : Sample retrieves the next sample of data • close() closes the communication session
Cell as a data receiver Cell’s interface for receiving data: • setSource(Cell) sets the source cell
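The source and receiver roles can be illustrated with a small pull-based chain. This is a minimal sketch, not the Datcracker API: the Source interface, ListSource and RangeFilter are hypothetical stand-ins, samples are plain double[] arrays, and the end of the stream is signalled by null (an assumption; the slides do not say how the real next() signals it).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not the Datcracker API: a pull-based chain of cells.
class ChainSketch {

    /** Minimal stand-in for a cell that can act as a data source. */
    interface Source {
        void open();          // start a communication session
        double[] next();      // next sample, or null when the stream is exhausted
        void close();         // end the session
    }

    /** Source cell that streams samples from an in-memory list. */
    static class ListSource implements Source {
        private final List<double[]> samples;
        private Iterator<double[]> it;

        ListSource(List<double[]> samples) { this.samples = samples; }

        public void open() { it = samples.iterator(); }
        public double[] next() { return it.hasNext() ? it.next() : null; }
        public void close() { it = null; }
    }

    /** Receiver cell that keeps only the features in the index range [from, to]. */
    static class RangeFilter implements Source {
        private final int from, to;
        private Source source;

        RangeFilter(int from, int to) { this.from = from; this.to = to; }

        void setSource(Source source) { this.source = source; }

        public void open() { source.open(); }

        public double[] next() {
            double[] in = source.next();   // pull one sample from upstream
            return in == null ? null : Arrays.copyOfRange(in, from, to + 1);
        }

        public void close() { source.close(); }
    }

    public static void main(String[] args) {
        Source data = new ListSource(List.of(
                new double[] {5.1, 3.5, 1.4, 0.2},
                new double[] {4.9, 3.0, 1.4, 0.2}));
        RangeFilter filter = new RangeFilter(0, 1);
        filter.setSource(data);

        filter.open();
        for (double[] s = filter.next(); s != null; s = filter.next()) {
            System.out.println(Arrays.toString(s));
        }
        filter.close();
    }
}
```

Because the filter pulls one sample at a time from its source, no cell in the chain has to hold the whole data set in memory.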
Buildable cells • Some cells may be buildable: they have to be built before use • Building a cell is implemented by subclasses and may mean different things: • training a decision system • running an evaluation scheme (T&T, CV, …) • buffering input data • … • Cell’s interface for building: build() builds the cell erase() erases the cell; it can be built again afterwards
Fixed cells • Cells that are not buildable are called fixed. They are usable just after construction or setting parameters: • file reader • WEKA filter • …
Parameterized cells • Cell’s interface for parameterization: • set(String name, String value) sets a parameter • setParameters(Parameters) sets all parameters at once • getParameters() : Parameters returns all parameters that are set
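A minimal sketch of how such a parameter interface could behave is given below. Params and ParamCell are hypothetical stand-ins, not the Datcracker classes; in particular, it is an assumption that setParameters replaces the previous values and that getParameters returns a copy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of string-valued parameters; not the Datcracker Parameters class.
class ParamSketch {

    /** Ordered name -> value map with a copy constructor. */
    static class Params {
        private final Map<String, String> values = new LinkedHashMap<>();

        Params() {}
        Params(Params other) { values.putAll(other.values); }

        void set(String name, String value) { values.put(name, value); }
        String get(String name) { return values.get(name); }
        @Override public String toString() { return values.toString(); }
    }

    /** Stand-in for a parameterized cell. */
    static class ParamCell {
        private Params params = new Params();

        void set(String name, String value) { params.set(name, value); }
        void setParameters(Params p) { params = new Params(p); }   // assumed: replaces all values
        Params getParameters() { return new Params(params); }       // copy, so callers cannot edit cell state
    }

    public static void main(String[] args) {
        ParamCell arff = new ParamCell();
        arff.set("filename", "data/iris.arff");
        arff.set("labelIndex", "last");

        Params par = new Params();
        par.set("filename", "data/iris.arff");
        par.set("labelIndex", "last");
        ParamCell other = new ParamCell();
        other.setParameters(par);

        System.out.println(arff.getParameters());
        System.out.println(other.getParameters());
    }
}
```

Keeping all values as strings lets cells that wrap different libraries share one parameter interface, at the cost of parsing inside each cell.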
State of the cell • EMPTY: cell has no content, cannot be used • CLOSED: content has been built, cell ready to use • OPEN: cell is being used now (generating samples of data) • Transitions: build() EMPTY → CLOSED, open() CLOSED → OPEN, next() OPEN → OPEN, close() OPEN → CLOSED, erase() CLOSED → EMPTY
State of the cell: motivation • To check against access violations when the cell is accessed. Examples: • two cells try to retrieve data from a given cell at the same time • someone tries to use an empty cell • someone tries to reconnect cells during their activity • To simplify implementation of subclasses (new algorithms): they may safely assume that access is correct (build() before open(), open() before next(), …) • To detect bugs early (important in a heterogeneous system!)
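The state rules can be written down as a small transition table. This is a hypothetical sketch rather than Datcracker source: the transition function and its string-based call names are illustrative, and allowing erase() only from CLOSED is an assumption read off the state diagram.

```java
// Hypothetical sketch of the EMPTY / CLOSED / OPEN life cycle; not Datcracker source.
class CellStateSketch {

    enum State { EMPTY, CLOSED, OPEN }

    /** Returns the state after the call, or throws if the call is illegal in the given state. */
    static State transition(State current, String call) {
        switch (call) {
            case "build": if (current == State.EMPTY)  return State.CLOSED; break;
            case "open":  if (current == State.CLOSED) return State.OPEN;   break;
            case "next":  if (current == State.OPEN)   return State.OPEN;   break;
            case "close": if (current == State.OPEN)   return State.CLOSED; break;
            case "erase": if (current == State.CLOSED) return State.EMPTY;  break;
            default: throw new IllegalArgumentException("unknown call: " + call);
        }
        throw new IllegalStateException(call + "() is not allowed in state " + current);
    }

    public static void main(String[] args) {
        State s = State.EMPTY;
        for (String call : new String[] {"build", "open", "next", "close", "erase"}) {
            s = transition(s, call);
            System.out.println(call + "() -> " + s);
        }
        try {
            transition(State.EMPTY, "open");   // someone tries to use an empty cell
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```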
How to override Cell • Methods to override: • onBuild() • onErase() • onOpen() • onNext() • onClose() • Public methods build(), … cannot be overridden. They perform state checking and then call the corresponding on…() method • Like event handlers in event-driven programming • You do not have to override all of them! (e.g. a cell for reading data will not be buildable) • You can provide an additional interface in your subclass
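The hook pattern described here resembles the classic template-method design. The sketch below shows the idea with hypothetical classes (BaseCell, CounterCell); it omits build()/erase() for brevity, tracks only an open flag, and uses null as an assumed end-of-stream marker. It is not the actual Cell implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the on...() hook pattern; class names and method bodies are assumptions.
class HookSketch {

    /** Stand-in base class: public methods are final, check state, then delegate to hooks. */
    static abstract class BaseCell {
        private boolean open = false;

        public final void open() {
            if (open) throw new IllegalStateException("cell is already open");
            onOpen();
            open = true;
        }

        public final Object next() {
            if (!open) throw new IllegalStateException("open() must be called before next()");
            return onNext();
        }

        public final void close() {
            if (!open) throw new IllegalStateException("cell is not open");
            onClose();
            open = false;
        }

        // Hooks with empty defaults, so subclasses override only what they need.
        protected void onOpen() {}
        protected Object onNext() { return null; }
        protected void onClose() {}
    }

    /** Example subclass: generates the integers 0, 1, 2, ... below a limit. */
    static class CounterCell extends BaseCell {
        private final int limit;
        private int i;

        CounterCell(int limit) { this.limit = limit; }

        @Override protected void onOpen() { i = 0; }
        @Override protected Object onNext() { return i < limit ? Integer.valueOf(i++) : null; }
    }

    public static void main(String[] args) {
        BaseCell cell = new CounterCell(3);
        cell.open();
        List<Object> out = new ArrayList<>();
        for (Object s = cell.next(); s != null; s = cell.next()) out.add(s);
        cell.close();
        System.out.println(out);
    }
}
```

The subclass contains no state checks at all: it can assume that onNext() is only ever reached after onOpen(), which is the simplification the slide refers to.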
Data representation • Data set split into samples • Sample: • data : Data (input data) • label : Data (associated decision label) • Separation of data and label: • useful for complex types of data/labels, e.g. in image processing (like segmentation) • useful for meta-learning algorithms, which operate on labels alone • labelled / unlabelled / partially labelled samples handled in the same way • Data: abstract base class, downcast by cells to the type they expect • Currently available subclasses: • NumericFeature, SymbolicFeature, DataVector • In the future: time series, images, special types of labels, ...
Immutability • Data objects are immutable: they cannot be modified after creation (like String class) • They can be freely shared among cells without risk of accidental modification • safety • simplicity • efficiency: • no need to copy data between cells • no need for synchronization in multi-threaded execution
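A minimal sketch of an immutable data object in Java follows. Vec is a hypothetical class, not Datcracker's DataVector; it shows the two usual ingredients, a defensive copy at construction and "modification" by creating a new object.

```java
import java.util.Arrays;

// Hypothetical sketch of an immutable data object; not the Datcracker DataVector class.
class ImmutableSketch {

    static final class Vec {
        private final double[] values;

        Vec(double... values) { this.values = values.clone(); }   // defensive copy on the way in

        int size() { return values.length; }
        double get(int i) { return values[i]; }

        /** "Modification" returns a new object; the receiver is left untouched. */
        Vec with(int i, double v) {
            double[] copy = values.clone();
            copy[i] = v;
            return new Vec(copy);
        }

        @Override public String toString() { return Arrays.toString(values); }
    }

    public static void main(String[] args) {
        double[] raw = {5.1, 3.5, 1.4, 0.2};
        Vec sample = new Vec(raw);
        raw[0] = -1.0;                       // later changes to the caller's array are not visible
        Vec changed = sample.with(0, 0.51);  // a downstream cell derives a new object

        System.out.println(sample);
        System.out.println(changed);
    }
}
```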
Metadata • Many algorithms have to know the “type” of input data in advance, before data processing starts; this information is the metadata • Data and metadata are kept separate, with MetaData as the base class for metadata • Describes common properties of all Data objects generated in a given session • number and types of features in a DataVector • dictionary of possible values of a SymbolicFeature • … • Each Data subclass has an associated MetaData subclass • Immutable!
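To illustrate why knowing the metadata in advance helps, the sketch below sizes a label-frequency table from a dictionary of symbolic values before any sample is read. SymbolicMeta and countLabels are hypothetical names, not Datcracker classes, and the sketch assumes every label occurs in the dictionary.

```java
import java.util.List;

// Hypothetical sketch: metadata describing a symbolic feature is read before any sample.
class MetaSketch {

    /** Immutable description of a symbolic feature: its dictionary of possible values. */
    static final class SymbolicMeta {
        private final List<String> dictionary;

        SymbolicMeta(List<String> dictionary) { this.dictionary = List.copyOf(dictionary); }

        int size() { return dictionary.size(); }
        int indexOf(String value) { return dictionary.indexOf(value); }
        String valueAt(int i) { return dictionary.get(i); }
    }

    /** Counts label frequencies; the array is sized from metadata, not from the data. */
    static int[] countLabels(SymbolicMeta meta, Iterable<String> labels) {
        int[] counts = new int[meta.size()];
        for (String label : labels) {
            counts[meta.indexOf(label)]++;   // assumes the label is in the dictionary
        }
        return counts;
    }

    public static void main(String[] args) {
        SymbolicMeta meta = new SymbolicMeta(
                List.of("Iris-setosa", "Iris-versicolor", "Iris-virginica"));
        int[] counts = countLabels(meta, List.of("Iris-setosa", "Iris-virginica", "Iris-setosa"));
        for (int i = 0; i < counts.length; i++) {
            System.out.println(meta.valueAt(i) + ": " + counts[i]);
        }
    }
}
```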
Future releases • Architecture • Multi-input and multi-output cells • Composite cells (e.g. meta-learning) • Serialization and copying • Progress info and suspension of cell building • Algorithms • cross-validation • data buffering • … • Data types • time series • …
Home www.datcracker.org