Data Analysis in Java Wolfgang Hoschek CERN IT/PDP
Outline • Why is Java interesting to Data Analysis? • Existing and non-existing performance problems • Colt - Libraries for scientific and technical computing • C/C++ Java interoperability issues
Why Java for Data Analysis? • LHC timescale • Prepare for change and evolution • Data Analysis is an end user task • Data Analysis needs • Primary competence and interest of the user is physics, not understanding a tricky language • Tools and language should be easy to learn and use • Java can meet these needs
Java Features (1) • Simple • C++ without its “broken” features • Omits many rarely used, poorly understood, confusing C++ features that bring more grief than benefit • Easy to learn without esoteric training • Safe • Garbage collection --> no memory trashing, no memory leaks • No loopholes in type safety • Architecture Neutral • Portable C++: possible in theory, in practice hardly ever • Non-portable Java: possible in theory, in practice hardly ever • Distributed • Transparently access remote objects over network • Transparently move code and/or objects over network
Java Features (2) • Dynamic • Complete type introspection API • Late linking (upon first use at runtime) • Change implementation, add methods, add fields without rebuilding clients depending on it • Upgrading a piece of a large system does not involve rebuilding everything that depends on it • Interoperability with C/C++ via Java Native Interface (JNI) • Java can call C/C++; C/C++ can call Java • Many free standard APIs • GUI, graphics, networking, I/O, databases, ORBs, XML, security, … • http://www.cetus-links.org/oo_java_libraries.html • http://www.cetus-links.org/oo_java.html • Fundamental Java concepts applied to physics software • Whitepaper http://java.sun.com/people/jag/OriginalJavaWhitepaper.pdf
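The "complete type introspection API" bullet can be illustrated with the standard `java.lang.reflect` package. This is a minimal sketch in current Java syntax: the `Track` class and the `IntrospectionSketch` helper are hypothetical names invented for this example and do not come from the talk.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical physics class, invented for this illustration only. */
class Track {
    private final double px, py, pz;

    Track(double px, double py, double pz) {
        this.px = px;
        this.py = py;
        this.pz = pz;
    }

    public double getPx() { return px; }
    public double getPy() { return py; }
    public double getPz() { return pz; }
    public double momentum() { return Math.sqrt(px * px + py * py + pz * pz); }
}

final class IntrospectionSketch {
    /** Returns the sorted names of public, no-argument methods of {@code type} that return double. */
    static List<String> doubleGetters(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            boolean isPublic = Modifier.isPublic(m.getModifiers());
            if (isPublic && m.getParameterCount() == 0 && m.getReturnType() == double.class) {
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    /** Looks up a no-argument method by name at runtime and invokes it on {@code target}. */
    static double call(Object target, String methodName) {
        try {
            Method m = target.getClass().getDeclaredMethod(methodName);
            return (Double) m.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Track t = new Track(3, 4, 0);
        // Print every double-valued property without naming any of them in the source.
        for (String name : doubleGetters(Track.class)) {
            System.out.println(name + " = " + call(t, name));
        }
    }
}
```

A generic tool (for example an n-tuple browser) could use this kind of lookup to discover plottable quantities of user classes it was never compiled against.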
Performance • Strong improvements over the last 2 years • Interpreted mode is history • All Virtual Machines now generate optimized machine code • GUI • Still rather sluggish, improving with JDK1.3 • Most performance problems have nothing to do with the language, but with inadequate design • Compute-intensive: roughly on par with C • http://www.research.ibm.com/ninja/ • http://www.cern.ch/CERN/Divisions/EP/HL/Papers/ACMJava2000.ps • Networking: roughly on par with C sockets • http://www.alphaworks.ibm.com/aw.nsf/techmain/sockperf • http://www.cs.ucsb.edu/conferences/java98/papers/dots.ps • I/O - Sun now working on a new high performance I/O library • synchronous & asynchronous, buffered, raw binary I/O, memory mapped, pluggable filesystem implementations, ... • http://java.sun.com/aboutJava/communityprocess/jsr/jsr_051_ioapis.html
Tools and Toolkits • Java Analysis Studio (JAS) • End user GUI à la Paw, Root, OpenScientist • www-sldnt.slac.stanford.edu/jas • Colt • Libraries for scientific and technical computing • nicewww.cern.ch/~hoschek/colt/index.htm • Numerical Libraries • math.nist.gov/javanumerics • Objectivity/Java Binding to C++ • WIRED Event Display • http://wired.cern.ch/ • Atlas Graphics Group • http://www.cern.ch/Atlas/GROUPS/GRAPHICS/
Colt Distribution - Features (1) • Open Source à la CLHEP • Several free libraries • For user convenience • ...documented, packaged and bundled • ...under one single uniform umbrella • Colt library • General-purpose data structures optimized for numerical data, e.g. • Dense and sparse matrices (multi-dimensional arrays) • Linear Algebra • variable-sized arrays • hashtables • buffer management
Multi-dim arrays & Views • Several free libraries • For user convenience documented, packaged and bundled under one single uniform umbrella • Colt library • Fundamental general-purpose data structures optimized for numerical data, e.g. • Dense and sparse matrices (multi-dimensional arrays), Linear Algebra, resizable arrays, associative containers, buffer management • Jet library • Mathematical and statistical tools for data analysis, • Histogramming functionality, • Random Number Generators and Distributions for simulations • more
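The idea of a "view" on a multi-dimensional array can be sketched in plain Java: a view shares the storage of its parent matrix and only changes how (row, column) pairs map to array positions. The class below is a hypothetical illustration written for this transcript. It is not Colt's implementation, and its method names are not claimed to match Colt's API.

```java
/** Hypothetical minimal dense matrix with copy-free views (illustration only). */
final class MatrixViewSketch {
    private final double[] elements;   // storage shared between a matrix and all its views
    private final int rows, columns;
    private final int offset, rowStride, columnStride;

    MatrixViewSketch(int rows, int columns) {
        this(new double[rows * columns], rows, columns, 0, columns, 1);
    }

    private MatrixViewSketch(double[] elements, int rows, int columns,
                             int offset, int rowStride, int columnStride) {
        this.elements = elements;
        this.rows = rows;
        this.columns = columns;
        this.offset = offset;
        this.rowStride = rowStride;
        this.columnStride = columnStride;
    }

    int rows() { return rows; }
    int columns() { return columns; }

    double get(int row, int column) { return elements[index(row, column)]; }
    void set(int row, int column, double value) { elements[index(row, column)] = value; }

    /** Maps a (row, column) pair of this view to a position in the shared array. */
    private int index(int row, int column) {
        if (row < 0 || row >= rows || column < 0 || column >= columns) {
            throw new IndexOutOfBoundsException("(" + row + "," + column + ")");
        }
        return offset + row * rowStride + column * columnStride;
    }

    /** Transposed view: swaps the two strides, copies nothing. */
    MatrixViewSketch viewTranspose() {
        return new MatrixViewSketch(elements, columns, rows, offset, columnStride, rowStride);
    }

    /** Sub-matrix view of the given height and width whose top-left cell is (row, column). */
    MatrixViewSketch viewPart(int row, int column, int height, int width) {
        if (row < 0 || column < 0 || height < 0 || width < 0
                || row + height > rows || column + width > columns) {
            throw new IndexOutOfBoundsException("part exceeds matrix bounds");
        }
        int newOffset = offset + row * rowStride + column * columnStride;
        return new MatrixViewSketch(elements, height, width, newOffset, rowStride, columnStride);
    }
}
```

Because views share storage, writing through a view changes the parent matrix, and views can be chained (for instance a transposed sub-matrix) at constant cost.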
Colt Distribution - Features (2) • Jet library • Mathematical and statistical tools for data analysis, • Histogramming functionality (filling ~ 2 x 10^6 numbers/sec) • Random Number Generators and Distributions for simulations • A complete port of CLHEP’s random number library • ~ 5*10^6 uniform numbers/sec • more • JAL library • a partial port of the C++ Standard Template Library • developed by Silicon Graphics • Concurrent library • Mutex, Semaphore, Thread pools, Tasks, Task dispatchers, ... • Contributions from • Sun, SGI, Visual Numerics, Univ. New York
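The histogram filling and random number bullets can be made concrete with a small fixed-width 1-d histogram. This is a sketch under stated assumptions: `HistogramSketch` is a hypothetical class written for this transcript, and it uses the JDK's `java.util.Random` as a stand-in generator rather than Jet's histogramming classes or the CLHEP port mentioned on the slide.

```java
import java.util.Random;

/** Hypothetical fixed-width 1-d histogram (illustration only, not the Jet API). */
final class HistogramSketch {
    private final double min, max;
    private final long[] bins;
    private long underflow, overflow;

    HistogramSketch(int binCount, double min, double max) {
        if (binCount <= 0 || !(min < max)) {
            throw new IllegalArgumentException("need binCount > 0 and min < max");
        }
        this.min = min;
        this.max = max;
        this.bins = new long[binCount];
    }

    /** Adds one entry; values outside [min, max) go to the under/overflow counters. */
    void fill(double x) {
        if (Double.isNaN(x)) return;             // NaN is ignored in this sketch
        if (x < min) { underflow++; return; }
        if (x >= max) { overflow++; return; }
        int i = (int) ((x - min) / (max - min) * bins.length);
        if (i == bins.length) i--;               // guard against floating-point rounding at the upper edge
        bins[i]++;
    }

    long binCount(int i) { return bins[i]; }
    long underflow() { return underflow; }
    long overflow() { return overflow; }

    /** Total number of entries, including underflow and overflow. */
    long entries() {
        long sum = underflow + overflow;
        for (long b : bins) sum += b;
        return sum;
    }

    /** Fills {@code h} with {@code n} standard-normal numbers from a seeded JDK generator. */
    static void fillGaussian(HistogramSketch h, int n, long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < n; i++) {
            h.fill(random.nextGaussian());
        }
    }
}
```

Typical use would be `HistogramSketch h = new HistogramSketch(50, -5.0, 5.0); HistogramSketch.fillGaussian(h, 1000000, 42L);`. Timing such a loop is one way to reproduce a fill-rate figure like the one quoted above, although the result depends on the JVM and hardware.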
Download Packaging • HTML API documentation • Introduction, installation details, FAQs, news, feedback • Extensive doc for each package, class, and method. Examples • Built by the javadoc tool (part of JDK) • Source code • and everything else needed for a complete rebuild of the download file • One single cross-platform shared Java library (colt.jar)
Dense 2-d Matrix Benchmarks • x-axis: size of each dimension • Get: 100-170 MB/s • Set: 60 MB/s • assign (memcopy): 160-1000 MB/s • Matmult: 40-100 Mflops • CLHEP: C/C++ • Ninja, Jama, Colt: pure Java • Pentium III @ 600 MHz, 32 KB L1, 512 KB L2, 512 MB, IBM JDK 1.1.8, RedHat 6.0, lxplus.cern.ch
Sparse 2-d Matrix Benchmarks • dense … DenseDoubleMatrix2D • sparse … SparseDoubleMatrix2D • Normalized to dense algorithm -> reflects user perceived speedup • Get: 24 MB/s • Set: 18 MB/s • assign (memcopy): 0.1-24 GB/s • Matmult: 10-3200 Mflops • Pentium III @ 600 MHz, 32 KB L1, 512 KB L2, 512 MB, IBM JDK 1.1.8, RedHat 6.0, lxplus.cern.ch
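Why a sparse matrix can show very large speedups when normalized to the dense algorithm can be seen from a simple sketch: only non-zero cells are stored, so an operation such as matrix-vector multiplication visits those cells alone. The class below is a hypothetical illustration based on `java.util.HashMap`. It is not the implementation of `SparseDoubleMatrix2D`, and boxing keys and values in a `HashMap` makes it far slower than a tuned library.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical hash-based sparse matrix (illustration only). */
final class SparseMatrixSketch {
    private final int rows, columns;
    private final Map<Long, Double> nonZeros = new HashMap<>();  // cell key -> value

    SparseMatrixSketch(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
    }

    /** Encodes a (row, column) pair as one long key. */
    private long key(int row, int column) {
        if (row < 0 || row >= rows || column < 0 || column >= columns) {
            throw new IndexOutOfBoundsException("(" + row + "," + column + ")");
        }
        return (long) row * columns + column;
    }

    double get(int row, int column) {
        Double value = nonZeros.get(key(row, column));
        return value == null ? 0.0 : value;
    }

    /** Stores a value; writing zero removes the cell so storage stays sparse. */
    void set(int row, int column, double value) {
        long k = key(row, column);
        if (value == 0.0) nonZeros.remove(k);
        else nonZeros.put(k, value);
    }

    /** Number of cells actually stored. */
    int cardinality() { return nonZeros.size(); }

    /** Computes y = A * x, touching only the stored (non-zero) cells. */
    double[] multiply(double[] x) {
        if (x.length != columns) throw new IllegalArgumentException("x.length != columns");
        double[] y = new double[rows];
        for (Map.Entry<Long, Double> cell : nonZeros.entrySet()) {
            long k = cell.getKey();
            int row = (int) (k / columns);
            int column = (int) (k % columns);
            y[row] += cell.getValue() * x[column];
        }
        return y;
    }
}
```

For a 1000 x 1000 matrix with a handful of non-zero cells, `multiply` does work proportional to the number of stored cells rather than to the 10^6 cells a dense loop would visit.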
Open Issues • AIDA Interfaces mapped to Java • Scalable & Seamless Java C/C++ Interoperability • How can 100s of classes be accessed both in C++ and Java? • How to maintain consistency in the presence of an evolving code base? • How to make transient Java classes persistent? • How to minimize the performance impact of the Java binding to a C++ database? • If these problems can be solved in a way transparent to users... • it opens the door to smooth evolution • we can select a suitable language on a case by case basis! • See talk at next RD45 workshop “Espresso Java binding”
Conclusions • More work is needed to solve the persistence and interoperability problems • Java has many advantages • ...and is ready for serious use • Active involvement by the community is likely to yield a high return on investment