120 likes | 629 Views
VLDB. The real world is in the Tb range (British Telecom - 80Tb using Sun Oracle)Data consolidated from different sources to build Data Warehouses and using Data mining techniques to extract useful informationData is always READ onlyPhysics data has similar characteristics . VLDB. Current size li
E N D
1. ORACLE & VLDB Nilo Segura
IT/DB - CERN
2. VLDB The real world is in the Tb range (British Telecom - 80Tb using Sun+Oracle)
Data consolidated from different sources to build Data Warehouses and using Data mining techniques to extract useful information
Data is always READ only
Physics data has similar characteristics
3. VLDB Current size limits : Solaris 64bits + Oracle 64bits = 4Pb per database
This is more or less 19240 SunStoredge A1000 units (216Gb per unit today)
The current technology does not allow us to store on-line all this data in a manageable way Database in Oracle terms : instance (memory) + database files . The dictionary + database accounts
4Pb is a limit in the db software, the underlying Filesystem can store more than that. It is clear that we need the proper hardwareDatabase in Oracle terms : instance (memory) + database files . The dictionary + database accounts
4Pb is a limit in the db software, the underlying Filesystem can store more than that. It is clear that we need the proper hardware
4. VLDB A typical LHC experiment will get several Petabytes of raw data
But the end user ought not to have access to all this.
We need to process/group/summarize it following certain criterias
This also means more disk space (if we want to keep everything on-line) We should keep only the reconstructed data on-line
I think that it is a mistake to let the users moving around the raw data for practical reasons, they do not know enough of the system to produce good quality queries, they do not need to repeat the process already done at the reconstruction stage
ASM data size 11-15 Tb per year is possible to handle with today’s technology, things will improve certainly in 2-3 yearsWe should keep only the reconstructed data on-line
I think that it is a mistake to let the users moving around the raw data for practical reasons, they do not know enough of the system to produce good quality queries, they do not need to repeat the process already done at the reconstruction stage
ASM data size 11-15 Tb per year is possible to handle with today’s technology, things will improve certainly in 2-3 years
5. VLDB Not to mention backup…we need to keep our data safe (have you devised your backup strategy?)
RAID technology to help us to increase the performance and availability
RAID 0+1(best) or RAID 5(cheaper)
Today, we should have raw data on tapes and reconstructed data on-line It is clear that we need to offer a good quality service, with mirroring ans stripping or hardware-raid5, unless everybody agrees that the service will stop in case of disk failure
And we need also some clustering techniques and load balancing
Even with 1Tb disk units, this means N*1024 units..quite a lot but becoming feasible..It is clear that we need to offer a good quality service, with mirroring ans stripping or hardware-raid5, unless everybody agrees that the service will stop in case of disk failure
And we need also some clustering techniques and load balancing
Even with 1Tb disk units, this means N*1024 units..quite a lot but becoming feasible..
6. VLDB Now some software tricks to deal with all this amount of data
Partitioning
Parallel DML
Materialized Views
Parallel Server (Cluster configuration)
Bitmapped indexes? This tips are needed if we want to increase the performance of the systemThis tips are needed if we want to increase the performance of the system
7. VLDB Partitioning
A large amount of data can be divided into physically independent structures according to a certain criteria
However the user continues to see the same logical structure
Partition keys can be defined by range or using a hash function
The system can discard partitions based on the user’s queries, reducing the amount of data to be processed
8. VLDB Parallel DML
A single select/insert/delete/update can be executed by several processes (slaves) coordinated by a master process
Naturally leads to a better use of SMP machines
The degree of parallelism can be set by the user of automatically by the system (testing is a must)
Parallel insert/update/delete does need partitioning
You need to plan carefully your I/O system
9. VLDB Materialized views
Normal views are just a name for a SQL query with no real data associated (until runtime)
This can be very costly if we run it regularly
MV is just a view with all the data that satisfy the query already there (like in a normal table)
It can be refreshed (manually or automatically) to reflect the dynamic nature of the view
10. Parallel Server
Several nodes are attacking the same database that is on a disk system shared by all the nodes in the cluster
Users can be assigned to different nodes (load balancing)
Intra parallelism – queries are executed across the different nodes
At CERN there are 3 Sun Clusters (2 for DB, 1 for Web) and 1 Compaq
There is no such thing for Linux (yet) For the moment, the files are actually raw partitions, but makers are working on shared filesystems (OpenVMS had them already years ago)
There are heavy dependencies between HW/SW to create one of these clustersFor the moment, the files are actually raw partitions, but makers are working on shared filesystems (OpenVMS had them already years ago)
There are heavy dependencies between HW/SW to create one of these clusters
11. Others Another point is how to distribute data amongst the different Institutes
Network ? , Tapes + Post ? …
It would be nice to have a plug-in plug-out mechanism in the database
This is called transportable tablespaces
We may also use the database replication option but…. In Oracle this is called transportable tablespaces, in its current version it is limited to traffic between same hardware, will XML help us ?
With XML there will be certainly some extra processing cost, it is very portable but you have the extra overhead and processing..In Oracle this is called transportable tablespaces, in its current version it is limited to traffic between same hardware, will XML help us ?
With XML there will be certainly some extra processing cost, it is very portable but you have the extra overhead and processing..
12. Conclusion Do not despair by the size of problem
We are trying to solve tomorrow’s problem using today’s technology
So keep an eye open and be VERY flexible in your model to be able to adapt quickly and painlessly
Do never declare you model frozen, it is a mistake, try to improve it, adopt new technologies (if it is a benefit) Maybe for AMS tomorrow is today.
I know cases of systems based on Oracle that were created many years ago and that had a lot of extra coding to workaround certain limitations, (no longer true) and now they are a real mess, extremely complex and heavy to maintain. Because they keep on adding new things by patching and patching… without any real thought of what the system must doMaybe for AMS tomorrow is today.
I know cases of systems based on Oracle that were created many years ago and that had a lot of extra coding to workaround certain limitations, (no longer true) and now they are a real mess, extremely complex and heavy to maintain. Because they keep on adding new things by patching and patching… without any real thought of what the system must do