
Data Analysis in Java

Wolfgang Hoschek, CERN IT/PDP



  1. Data Analysis in Java Wolfgang Hoschek CERN IT/PDP

  2. Outline • Why is Java interesting for Data Analysis? • Existing and non-existing performance problems • Colt - Libraries for scientific and technical computing • C/C++ <-> Java interoperability issues

  3. Why Java for Data Analysis? • LHC timescale • Prepare for change and evolution • Data Analysis is an end-user task • Data Analysis needs • Primary competence and interest of the user is physics, not understanding a tricky language • Tools and language should be easy to learn and use • Java can meet these needs

  4. Java Features (1) • Simple • C++ without its “broken” features • Omits many rarely used, poorly understood, confusing C++ features that bring more grief than benefit • Easy to learn without esoteric training • Safe • Garbage collection --> no memory trashing, no memory leaks • No loopholes in type safety • Architecture Neutral • Portable C++: in theory, but hardly ever in practice • Non-portable Java: in theory, but hardly ever in practice • Distributed • Transparently access remote objects over the network • Transparently move code and/or objects over the network

  5. Java Features (2) • Dynamic • Complete type introspection API • Late linking (upon first use at runtime) • Change implementation, add methods, add fields without rebuilding clients depending on it • Upgrading a piece of a large system does not involve rebuilding everything that depends on it • Interoperability with C/C++ via Java Native Interface (JNI) • Java can call C/C++; C/C++ can call Java • Many free standard APIs • GUI, graphics, networking, I/O, databases, ORBs, XML, security, … • http://www.cetus-links.org/oo_java_libraries.html • http://www.cetus-links.org/oo_java.html • Fundamental Java concepts applied to physics software • Whitepaper http://java.sun.com/people/jag/OriginalJavaWhitepaper.pdf
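The "complete type introspection API" bullet above can be illustrated with a minimal sketch using the standard `java.lang.reflect` package. The `Track` class and the `IntrospectionSketch` helper are hypothetical names invented for this illustration; they do not come from the talk or from any library it mentions.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical physics-flavoured class, used only as an introspection target.
class Track {
    private final double px;
    private final double py;

    Track(double px, double py) {
        this.px = px;
        this.py = py;
    }

    public double transverseMomentum() {
        return Math.sqrt(px * px + py * py);
    }
}

class IntrospectionSketch {
    // Names of the fields declared directly on the given class.
    static List<String> fieldNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            if (!field.isSynthetic()) {
                names.add(field.getName());
            }
        }
        return names;
    }

    // Resolves a public no-argument method by name at runtime and invokes it.
    static Object call(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot call " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Track track = new Track(3.0, 4.0);
        System.out.println(fieldNames(Track.class));
        System.out.println(call(track, "transverseMomentum"));
    }
}
```

Because the method is looked up by name at runtime, the caller needs no compile-time knowledge of `Track`, which is the property that generic tools such as object browsers or persistence layers rely on.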

  6. Performance • Strong improvements over the last 2 years • Interpreted mode is history • All Virtual Machines now generate optimized machine code • GUI • Still rather sluggish, improving with JDK1.3 • Most performance problems have nothing to do with the language, but with inadequate design • Compute-intensive code: roughly C speed • http://www.research.ibm.com/ninja/ • http://www.cern.ch/CERN/Divisions/EP/HL/Papers/ACMJava2000.ps • Networking: roughly the speed of C sockets • http://www.alphaworks.ibm.com/aw.nsf/techmain/sockperf • http://www.cs.ucsb.edu/conferences/java98/papers/dots.ps • I/O - Sun is now working on a new high-performance I/O library • synchronous & asynchronous, buffered, raw binary I/O, memory mapped, pluggable filesystem implementations, ... • http://java.sun.com/aboutJava/communityprocess/jsr/jsr_051_ioapis.html

  7. Tools and Toolkits • Java Analysis Studio (JAS) • End user GUI à la Paw, Root, OpenScientist • www-sldnt.slac.stanford.edu/jas • Colt • Libraries for scientific and technical computing • nicewww.cern.ch/~hoschek/colt/index.htm • Numerical Libraries • math.nist.gov/javanumerics • Objectivity/Java Binding to C++ • WIRED Event Display • http://wired.cern.ch/ • Atlas Graphics Group • http://www.cern.ch/Atlas/GROUPS/GRAPHICS/

  8. Colt Distribution - Features (1) • Open Source à la CLHEP • Several free libraries • For user convenience • ...documented, packaged and bundled • ...under one single uniform umbrella • Colt library • General-purpose data structures optimized for numerical data, e.g. • Dense and sparse matrices (multi-dimensional arrays) • Linear Algebra • variable-sized arrays • hashtables • buffer management

  9. Multi-dim arrays & Views • Several free libraries • For user convenience documented, packaged and bundled under one single uniform umbrella • Colt library • Fundamental general-purpose data structures optimized for numerical data, e.g. • Dense and sparse matrices (multi-dimensional arrays), Linear Algebra, resizable arrays, associative containers, buffer management • Jet library • Mathematical and statistical tools for data analysis, • Histogramming functionality, • Random Number Generators and Distributions for simulations • more
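The idea of a matrix "view" in the slide title can be sketched in plain Java: a view is a second matrix object that addresses the same backing array through an offset and a pair of strides, so no elements are copied. The class `StridedMatrix` and its method names are hypothetical and written for this illustration; this is not Colt source code and makes no claim about how Colt implements its views.

```java
// Minimal sketch of a dense 2-d matrix whose sub-matrix and transpose
// views share storage with the original. Hypothetical class, not Colt code.
class StridedMatrix {
    private final double[] elements; // shared backing storage
    private final int rows;
    private final int columns;
    private final int offset;        // index of cell (0,0) in elements
    private final int rowStride;     // distance between vertically adjacent cells
    private final int columnStride;  // distance between horizontally adjacent cells

    StridedMatrix(int rows, int columns) {
        this(new double[rows * columns], rows, columns, 0, columns, 1);
    }

    private StridedMatrix(double[] elements, int rows, int columns,
                          int offset, int rowStride, int columnStride) {
        this.elements = elements;
        this.rows = rows;
        this.columns = columns;
        this.offset = offset;
        this.rowStride = rowStride;
        this.columnStride = columnStride;
    }

    int rows() { return rows; }

    int columns() { return columns; }

    private int index(int row, int column) {
        if (row < 0 || row >= rows || column < 0 || column >= columns) {
            throw new IndexOutOfBoundsException("(" + row + "," + column + ")");
        }
        return offset + row * rowStride + column * columnStride;
    }

    double get(int row, int column) { return elements[index(row, column)]; }

    void set(int row, int column, double value) { elements[index(row, column)] = value; }

    // View of the height x width block starting at (row, column); nothing is copied.
    StridedMatrix part(int row, int column, int height, int width) {
        if (row < 0 || column < 0 || height < 0 || width < 0
                || row + height > rows || column + width > columns) {
            throw new IndexOutOfBoundsException("block out of range");
        }
        int start = offset + row * rowStride + column * columnStride;
        return new StridedMatrix(elements, height, width, start, rowStride, columnStride);
    }

    // Transposed view: swap the dimensions and the strides.
    StridedMatrix transpose() {
        return new StridedMatrix(elements, columns, rows, offset, columnStride, rowStride);
    }
}
```

Writing through a view changes the original matrix as well, since both address the same array; creating a view costs constant time regardless of matrix size.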

  10. Colt Distribution - Features (2) • Jet library • Mathematical and statistical tools for data analysis, • Histogramming functionality (filling ~ 2 x 10^6 numbers/sec) • Random Number Generators and Distributions for simulations • A complete port of CLHEP’s random number library • ~ 5 x 10^6 uniform numbers/sec • more • JAL library • a partial port of the C++ Standard Template Library • developed by Silicon Graphics • Concurrent library • Mutex, Semaphore, Thread pools, Tasks, Task dispatchers, ... • Contributions from • Sun, SGI, Visual Numerics, Univ. New York
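To make the histogramming and random-number bullets concrete, here is a minimal sketch of a fixed-width 1-d histogram filled from `java.util.Random`. The class `Histogram1D` is a hypothetical stand-in written for this illustration; it is not the Jet library's histogram API, and the timing it prints depends on the machine and JVM, so it should not be compared with the figures quoted on the slide.

```java
import java.util.Random;

// Hypothetical fixed-width 1-d histogram; not the Jet library API.
class Histogram1D {
    private final double min;
    private final double max;
    private final long[] bins;
    private long underflow;
    private long overflow;

    Histogram1D(int binCount, double min, double max) {
        if (binCount <= 0 || !(min < max)) {
            throw new IllegalArgumentException("need binCount > 0 and min < max");
        }
        this.min = min;
        this.max = max;
        this.bins = new long[binCount];
    }

    void fill(double x) {
        if (x < min) {
            underflow++;
        } else if (!(x < max)) {
            overflow++; // also catches NaN, which fails every comparison
        } else {
            int bin = (int) ((x - min) / (max - min) * bins.length);
            if (bin == bins.length) {
                bin--; // guard against floating-point rounding at the upper edge
            }
            bins[bin]++;
        }
    }

    long binCount(int bin) { return bins[bin]; }

    long underflow() { return underflow; }

    long overflow() { return overflow; }

    // Number of entries that landed inside [min, max).
    long entries() {
        long sum = 0;
        for (long count : bins) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        Histogram1D histogram = new Histogram1D(100, -5.0, 5.0);
        Random random = new Random(42);
        int n = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            histogram.fill(random.nextGaussian());
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println("in range: " + histogram.entries());
        System.out.println("fills/sec (including number generation): " + n / seconds);
    }
}
```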

  11. Download Packaging • HTML API documentation • Introduction, installation details, FAQs, news, feedback • Extensive doc for each package, class, and method. Examples • Built by the javadoc tool (part of JDK) • Source code • and everything else needed for a complete rebuild of the download file • One single cross-platform shared Java library (colt.jar)

  12. Dense 2-d Matrix Benchmarks • x-axis: size of each dimension • Get: 100-170 MB/s • Set: 60 MB/s • assign (memcopy): 160-1000 MB/s • Matmult: 40-100 Mflops • CLHEP: C/C++ • Ninja, Jama, Colt: pure Java • Pentium III @ 600 MHz, 32 KB L1, 512 KB L2, 512 MB, IBM JDK 1.1.8, RedHat 6.0, lxplus.cern.ch
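For readers unfamiliar with the Mflops unit used above: multiplying two n x n matrices takes n^3 multiplications and n^3 additions, so the rate is 2n^3 / (elapsed seconds x 10^6). The sketch below applies that formula to a naive `double[][]` multiplication. It is not the benchmark harness behind the slide's numbers; the class `MatmultMflops`, the matrix size, and the warm-up count are arbitrary choices for illustration.

```java
// Illustrative Mflops measurement for a naive dense matrix multiplication.
// Not the benchmark code used for the figures quoted in the talk.
class MatmultMflops {
    // Returns a * b for rectangular matrices with matching inner dimension.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int inner = b.length;
        int p = b[0].length;
        if (a[0].length != inner) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < inner; k++) {
                double aik = a[i][k];
                for (int j = 0; j < p; j++) {
                    c[i][j] += aik * b[k][j];
                }
            }
        }
        return c;
    }

    // Mflops for an n x n by n x n product that took the given time.
    static double mflops(int n, long nanos) {
        double flops = 2.0 * n * n * n; // n^3 multiplies plus n^3 adds
        double seconds = nanos / 1e9;
        return flops / seconds / 1e6;
    }

    public static void main(String[] args) {
        int n = 200;
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }
        }
        // Warm-up runs so the JIT compiler has a chance to optimize the loop.
        for (int run = 0; run < 5; run++) {
            multiply(a, b);
        }
        long start = System.nanoTime();
        double[][] c = multiply(a, b);
        long elapsed = System.nanoTime() - start;
        System.out.println("c[0][0] = " + c[0][0]);
        System.out.println("Mflops: " + mflops(n, elapsed));
    }
}
```

A single timed run like this is noisy; a careful measurement would repeat the timed section and report a median.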

  13. Sparse 2-d Matrix Benchmarks • dense … DenseDoubleMatrix2D • sparse … SparseDoubleMatrix2D • Normalized to dense algorithm -> reflects user-perceived speedup • Get: 24 MB/s • Set: 18 MB/s • assign (memcopy): 0.1-24 GB/s • Matmult: 10-3200 Mflops • Pentium III @ 600 MHz, 32 KB L1, 512 KB L2, 512 MB, IBM JDK 1.1.8, RedHat 6.0, lxplus.cern.ch
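As background for the dense versus sparse comparison, one common way to build a sparse matrix is to store only the non-zero cells in a hash map keyed by the flattened cell index. The sketch below shows that idea with `java.util.HashMap`. The class `SparseMatrixSketch` is hypothetical and says nothing definitive about how `SparseDoubleMatrix2D` is implemented; boxing `Long` and `Double` as done here would also be much slower than a primitive-specialized map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical hash-based sparse matrix: only non-zero cells use memory.
class SparseMatrixSketch {
    private final int rows;
    private final int columns;
    private final Map<Long, Double> nonZeros = new HashMap<>();

    SparseMatrixSketch(int rows, int columns) {
        if (rows < 0 || columns < 0) {
            throw new IllegalArgumentException("negative dimension");
        }
        this.rows = rows;
        this.columns = columns;
    }

    // Flattens (row, column) into a single key, row-major.
    private long key(int row, int column) {
        if (row < 0 || row >= rows || column < 0 || column >= columns) {
            throw new IndexOutOfBoundsException("(" + row + "," + column + ")");
        }
        return (long) row * columns + column;
    }

    double get(int row, int column) {
        Double value = nonZeros.get(key(row, column));
        return value == null ? 0.0 : value;
    }

    void set(int row, int column, double value) {
        long k = key(row, column);
        if (value == 0.0) {
            nonZeros.remove(k); // storing zeros would defeat the purpose
        } else {
            nonZeros.put(k, value);
        }
    }

    // Number of cells currently holding a non-zero value.
    int cardinality() {
        return nonZeros.size();
    }
}
```

With this layout, operations that only touch non-zero cells scale with `cardinality()` rather than with rows x columns, which is one plausible reason a sparse matrix can show large speedups over a dense one on mostly empty data.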

  14. Open Issues • AIDA Interfaces mapped to Java • Scalable & Seamless Java <-> C/C++ Interoperability • How can 100s of classes be accessed both in C++ and Java? • How to maintain consistency in the presence of an evolving code base? • How to make transient Java classes persistent? • How to minimize the performance impact of the Java binding to a C++ database? • If these problems can be solved in a way transparent to users... • it opens the door to smooth evolution • we can select a suitable language on a case-by-case basis! • See talk at next RD45 workshop “Espresso Java binding”

  15. Conclusions • More work is needed to solve the persistence and interoperability problems • Java has many advantages • ...and is ready for serious use • Active involvement by the community is likely to yield a high return on investment
