Technology and Infrastructure Support for Large Scale Information
Marcio Faerman
The Brazilian National Education and Research Network - RNP
marcio@rnp.br
www.rnp.br
Generating Large Data Collections
• Large data volumes can be generated much faster than they can be analyzed
• Instrument Observations
  • Particle Accelerators (CERN LHC)
  • Telescopes, Satellites
  • Sensor Networks
  • Virtual Observatories
• Large Model Simulations
  • High resolution, very complex
• Scientific Experiments
  • Medical imaging (fMRI): ~ 1 GByte per measurement (day)
  • Bio-informatics queries: 500 GByte per database
  • Satellite world imagery: ~ 5 TByte/year
  • Current particle physics: 1 PByte per year
  • LHC physics (2007): 10-30 PByte per year
  • LSST Astronomy (2012): 5 PByte per year
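To make the yearly volumes above easier to compare with network and storage speeds, they can be converted into an average sustained rate. The sketch below is illustrative arithmetic only. It assumes decimal units (1 PByte = 10^15 bytes) and a 365-day year, neither of which the slides specify, and the function name is hypothetical.

```python
# Illustrative arithmetic only. Decimal units (1 PByte = 10**15 bytes) and a
# 365-day year are assumptions; the slides do not specify either.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BYTES_PER_PBYTE = 10**15


def sustained_gbps(pbytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move a yearly volume with no idle time."""
    bytes_per_second = pbytes_per_year * BYTES_PER_PBYTE / SECONDS_PER_YEAR
    return bytes_per_second * 8 / 1e9


if __name__ == "__main__":
    # Volumes taken from the slide, in PByte per year.
    for label, volume in [
        ("Current particle physics", 1),
        ("LSST Astronomy", 5),
        ("LHC physics, lower bound", 10),
        ("LHC physics, upper bound", 30),
    ]:
        rate = sustained_gbps(volume)
        print(f"{label}: {volume} PByte/year is about {rate:.2f} Gbit/s sustained")
```

These are averages. Real acquisition is bursty, so peak rates would be higher than what this calculation suggests.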
Challenges in Managing Large Volume Data
• Scalability
  • What works for small datasets does not necessarily work for large collections
• Data Integrity
  • At terabyte scale, failures and data corruption are very likely to occur
  • Is data provenance reliable?
• Efficiency
  • Data should be accessed at a rate which keeps work feasible
  • More data – need for more speed
• Distributed Access
  • Data can be at a remote (and possibly unknown) location
• Infrastructure Management
  • Heterogeneous
  • Distributed
  • Prone to failures
  • Very complex
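One common way to detect the corruption mentioned under Data Integrity is to record a checksum when a file is stored and to recompute it later. The following is a minimal sketch of that idea, not a description of how any of the tools in this talk work. SHA-256, the chunk size, and the function names are assumptions chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory use constant even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_digest: str) -> bool:
    """Compare the current fingerprint with the one recorded earlier."""
    return fingerprint(path) == expected_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = Path(workdir) / "sample.dat"  # hypothetical file name
        sample.write_bytes(b"raw instrument data")
        recorded = fingerprint(sample)

        print("unchanged file intact:", is_intact(sample, recorded))

        # Simulate corruption by changing a single byte.
        sample.write_bytes(b"raw instrument dat4")
        print("corrupted file intact:", is_intact(sample, recorded))
```

Storing the digest separately from the data matters: if both live on the same failing device, the comparison cannot be trusted.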
Challenges – Getting to Know Your Data
• Extract knowledge from raw data files
  • Data product derivation
  • Visualization
  • Relationships
  • Patterns
  • New derived quantities
• Cross-institutional and cross-disciplinary collaborations
  • What-if experiments
  • Your data with our model?
• Dataset Access
  • Multiple formats
    • Each sensor or simulation has its own storage format
  • Federated collections
  • Discovery by content
Technological Response
• Integration of compute, communication, storage and instrument resources into a powerful infrastructure – Information Grids
• Very powerful infrastructure
• Economy of scale
• Serves a broad range of customers
  • Biologists, physicists, government, industry
• Infrastructure is heterogeneous, distributed, very complex
• Middleware and data-oriented tools act as facilitators to tackle data management complexities
Open Access and Preservation Functionalities
• Federated Digital Libraries
  • Integration of distributed repositories
  • Access control – you decide who can see it
  • Organize the data in collections
  • Describe your data – Metadata
• Data Grids
  • Access to efficient parallel I/O systems
  • Hierarchical Systems
    • Disk caches, tapes
  • Often Distributed
• Analysis, Data Mining
• Visualization
• Workflow-based systems
• Transaction-based data ingestion
• Data provenance, Data fingerprinting
• What-if virtual lab
• End User Oriented Portals
  • "I deal with the data in the way it makes sense to me"
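The digital library functions listed above (collections, metadata descriptions, access control) can be pictured with a toy in-memory catalog. This is only a sketch under stated assumptions: the class names, field names, users, and locations are all hypothetical, and real systems such as those named in this talk expose their own, different interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One dataset entry: where it lives, who may see it, how it is described."""

    location: str                      # hypothetical storage location string
    readers: set[str]                  # users allowed to see this record
    metadata: dict[str, str] = field(default_factory=dict)


class Collection:
    """A named group of records that can be searched by metadata."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records: list[Record] = []

    def add(self, record: Record) -> None:
        self._records.append(record)

    def find(self, user: str, **criteria: str) -> list[Record]:
        """Return records the user may read whose metadata matches all criteria."""
        return [
            record
            for record in self._records
            if user in record.readers
            and all(record.metadata.get(k) == v for k, v in criteria.items())
        ]


if __name__ == "__main__":
    imaging = Collection("imaging")
    imaging.add(Record("site-a:/scans/001", {"alice"}, {"instrument": "fMRI"}))
    imaging.add(Record("site-b:/sky/042", {"alice", "bob"}, {"instrument": "telescope"}))

    # bob only sees the record he has been granted access to.
    for record in imaging.find("bob"):
        print(record.location, record.metadata)
```

Filtering on access rights inside `find` means a user cannot even discover that a restricted record exists, which is one possible reading of "you decide who can see it".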
Middleware and Tools
• Data Management
  • Storage Resource Broker (SRB)
  • Globus Data Management
  • L-Store
  • IBP
  • Storage Resource Manager (SRM)
• Data Representation Libraries
  • HDF5
  • NetCDF
• Portals
  • OGCE
  • JSR 168
Today’s Reality
• Exceptional achievements by early adopters
• Integration between domain scientists (data users and producers) is still a challenge
• Need much more cross-disciplinary interaction
• Emphasis on scale and performance
• Failures are still taboo
• Frustration factor should be addressed in partnership with users
• Focus on failure recovery and quality of service is getting more attention
Grid Initiatives around the World
e-Infrastructure Workshop, NUDI/USP, São Paulo, 07.05.2007
UNAM OurGrid EELA SPRACE SINAPAD HEPGrid Ringrid CL Grid UCRAV
Networking in Latin America
CUDI-MX, REACCIUN-VE, RAAP-PE, RNP-BR, REUNA-CL
Brazilian National Research and Education Network - RNP
• In November 2005 the RNP networking infrastructure was entirely renovated. It consists of:
  • A multigigabit core connecting 10 capitals at 2.5 and 10 Gbps
  • Connections at 34 Mbps to 11 capitals
  • Connections at up to 16 Mbps to 6 capitals
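The link speeds above translate into very different waiting times for large datasets. The sketch below estimates the ideal transfer time of a 1 TByte dataset over each class of link. The 1 TByte size, the decimal units, and the assumption of a fully dedicated link with no protocol overhead are all illustrative choices, not figures from the talk, so real transfers would take longer.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: the whole link is used and overhead is ignored."""
    return size_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    dataset_bytes = 1e12  # 1 TByte in decimal units (an assumed example size)

    # Link speeds taken from the slide, in bits per second.
    links = [
        ("10 Gbps core", 10e9),
        ("2.5 Gbps core", 2.5e9),
        ("34 Mbps connection", 34e6),
        ("16 Mbps connection", 16e6),
    ]
    for label, speed in links:
        hours = transfer_seconds(dataset_bytes, speed) / 3600
        print(f"{label}: {hours:.1f} hours for 1 TByte")
```

Under these assumptions the same dataset moves in minutes on the core but takes days on the slowest connections, which is the gap the metropolitan networks described next are meant to narrow.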
Community Metropolitan Networks
• It is not enough to bring high speed connectivity to each city; it is necessary to bring it to the university campus and research lab as well
• The metropolitan network is the solution
• Infrastructure sharing to support:
  • Interconnection of the campuses of each partner institution
  • Access to the RNP national network backbone
• This sharing substantially reduces deployment costs
• Preferably, the infrastructure will be owned by the partners themselves (reducing operating costs)
• Pilot: the Metrobel project in the city of Belém do Pará in the Amazon region
Infrastructure for e-Science
Redecomep Project (2005-7)
• Following Metrobel, the Brazilian Ministry of Science and Technology is supporting the Community Networks for Education and Research (Redecomep) Project with R$ 39.7 M (~ US$ 19.0 M) through Finep (Dec 2004)
• Goals:
  • Extend the metropolitan optical network to 26 other cities with RNP points of presence
  • Promote integration within each metropolitan area
  • High speed access to the RNP point of presence
Next steps
• Integration between network, data repositories, compute, storage resources and applications
• Identify who needs better connectivity
• Developing Brazilian cyberinfrastructure
  • Funding for infrastructure resources is generally uncoordinated
  • Funding agencies and partners need a broad vision of application requirements and cyberinfrastructure integration
• RNP is coordinating with scientific communities and infrastructure providers
e-Science/Infrastructure initiative in Brazil
JRU-Brazil: 22 members in EELA-2
Developing Together
• Information infrastructure is being redefined in Brazil and Latin America
• Now is the time to have as much cross-disciplinary interaction as possible to define needs, partnerships and investments
• Please contact us
THANK YOU!