Handling Big Data

Presentation Transcript


  1. Handling Big Data Howles Credits to Sources on Final Slide

  2. Handling Large Amounts of Data • Current approaches are to: • Parallelize – use multiple processors or threads. This can run on a single machine, including a machine with multiple processors • Distribute – use a network to partition work across many computers

  3. Parallelized Operations • This is relatively easy if the task itself can easily be split into units. It still presents some problems, including: • How is the work assigned? • What happens if we have more work units than threads or processors? • How do we know when all work units have completed? • How do we aggregate results in the end? • How do we handle work that can’t be cleanly divided?
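
The questions on slide 3 can be made concrete with a small sketch. This is one possible arrangement, not the presenter's code: the class `ParallelSum`, its `sum` method, and the chunk size are assumptions made for illustration. A fixed thread pool queues work units when there are more units than threads, `Future.get()` reports when each unit has finished, and the partial sums are aggregated at the end. The last unit is simply allowed to be shorter when the data does not divide evenly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; the names are not from the slides.
class ParallelSum {
    // Sums the array by splitting it into work units of at most chunkSize elements.
    static long sum(int[] data, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            // Work assignment: one task per chunk. The pool queues tasks
            // when there are more chunks than threads.
            for (int start = 0; start < data.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, data.length); // last chunk may be short
                Callable<Long> unit = () -> {
                    long s = 0;
                    for (int i = from; i < to; i++) s += data[i];
                    return s;
                };
                parts.add(pool.submit(unit));
            }
            // Completion and aggregation: get() blocks until each unit is done.
            long total = 0;
            for (Future<Long> part : parts) total += part.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 64, 4)); // 1 + 2 + ... + 1000 = 500500
    }
}
```

Summation is used only because partial results combine trivially; tasks whose units depend on each other need the synchronization mechanisms discussed on the following slides.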

  4. Parallelized Operations • To solve these problems, we need communication mechanisms • We need a synchronization mechanism for communication (timing/notification of events) and to control sharing (mutual exclusion, or mutex)

  5. Why is it needed? • Data consistency • Orderly execution of instructions or activities • Timing – control race conditions

  6. Examples • Two people want to buy the same seat on a flight • Readers and writers • P1 needs a resource but it’s being held by P2 • Two threads updating a single counter • Bounded Buffer • Producer/Consumer • …….

  7. Synchronization Primitives • Review: • A special shared variable used to guarantee atomic operations • Hardware support • The processor may lock the memory bus while a read/write occurs • Semaphores, monitors, and condition variables are examples of language-level synchronization mechanisms

  8. Needed when: • Resources need to be shared • Timing needs to be coordinated • Access data • Send messages or data • Potential race conditions – timing • Difficult to predict • Results in inconsistent, corrupt or destroyed info • Tricky to find; difficult to recreate • Activities need to be synchronized

  9. Producer/Consumer • Producer: while count == MAX, NOP (busy-wait); put in buffer; count++ • Consumer: while count == 0, NOP (busy-wait); remove from buffer; count--

  10. Race Conditions • … can result in an incorrect solution • An issue with any shared resource (including devices) • Printer • Writers to a disk

  11. Critical Section • Also called the critical region • Segment of code (or device) for which a process must have exclusive use

  12. Examples of Critical Sections • Updating/reading a shared counter • Controlling access to a device or other resource • Two users want write access to a file

  13. Rules for solutions • Must enforce mutual exclusion (mutex) • Must not postpone a process if not warranted (a process must not be excluded from the CR when no other process is in the CR) • Bounded waiting (to enter the CR) • No execution time guarantees

  14. Atomic Operation • Operation is guaranteed to process without interruption • How do we enforce atomic operations?
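
One way to answer the question on slide 14 in Java is to use a library class whose operations are atomic. The sketch below revisits the "two threads updating a single counter" example from slide 6; the class and method names are assumptions made for illustration. `AtomicInteger.incrementAndGet()` performs the read, add, and write as one indivisible step, so no updates are lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; the names are not from the slides.
class AtomicCounterDemo {
    // Two threads each increment the shared counter the given number of times.
    static int countWithTwoThreads(int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger(0);
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counter.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithTwoThreads(100_000)); // 200000
    }
}
```

With a plain `int` and `counter++` in place of the atomic call, the two threads could interleave between the read and the write, and the final value could be smaller than expected.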

  15. Semaphores • Dijkstra, circa 1965 • Two standard operations: wait() and signal() • Older books may still use P() and V(), respectively (or Down() and Up()). You should be familiar with any notation

  16. Semaphores • A semaphore consists of an integer counter and a waiting list of blocked processes • Initialize the counter (the value depends on the application) • wait() decrements the counter and determines if the process must block • signal() increments the counter and determines if a blocked process can unblock
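
The description on slide 16 can be sketched as a small Java class. This is a simplified illustration under stated assumptions, not a production semaphore: the class `SimpleSemaphore` is hypothetical, the operations are named `acquire()` and `release()` because `Object.wait()` is final in Java and cannot be redefined, and the waiting list is the JVM's own wait set for the object rather than an explicit queue. Unlike the textbook version, this counter never goes negative; a thread blocks before decrementing instead of after.

```java
// Hypothetical example class; the names are not from the slides.
class SimpleSemaphore {
    private int permits;

    SimpleSemaphore(int initial) {
        permits = initial; // the initial value depends on the application
    }

    // Corresponds to wait() / P() on the slides.
    synchronized void acquire() throws InterruptedException {
        while (permits == 0) {
            wait(); // block; the JVM keeps this thread in the object's wait set
        }
        permits--;
    }

    // Corresponds to signal() / V() on the slides.
    synchronized void release() {
        permits++;
        notify(); // let one blocked thread, if any, re-check the counter
    }

    synchronized int available() {
        return permits;
    }

    // Uses the semaphore as a mutex (initial value 1) around a shared counter.
    static int guardedCount(int incrementsPerThread) {
        SimpleSemaphore mutex = new SimpleSemaphore(1);
        int[] shared = {0};
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                try {
                    mutex.acquire();
                } catch (InterruptedException e) {
                    return;
                }
                try {
                    shared[0]++; // critical section
                } finally {
                    mutex.release();
                }
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return shared[0];
    }

    public static void main(String[] args) {
        System.out.println(guardedCount(10_000)); // 20000
    }
}
```

In real code, `java.util.concurrent.Semaphore` from the standard library would normally be used instead of a hand-written class.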

  17. Semaphores • wait() and signal() are atomic operations • What is the other advantage of a semaphore over the previous solutions?

  18. Binary Semaphore • Initialized to one • Allows only one process access at a time

  19. Semaphores • wait() and signal() are usually system calls. Within the kernel, interrupts are disabled to make the counter operations atomic.

  20. Problems with Semaphores • Assume both semaphores are initialized to 1 (the bracketed numbers give the order in which the calls execute) • Process 0: wait(s) [1st]; wait(q) [3rd]; …; signal(s); signal(q) • Process 1: wait(q) [2nd]; wait(s) [4th]; …; signal(q); signal(s)
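
The interleaving on slide 20 deadlocks because the two processes request the semaphores in opposite orders. One common remedy, which the slides do not state and is offered here only as an illustration, is to have every process acquire the semaphores in the same fixed order. The sketch below uses `java.util.concurrent.Semaphore`; the class `OrderedAcquire` and its `run` method are hypothetical names.

```java
import java.util.concurrent.Semaphore;

// Hypothetical example class; the names are not from the slides.
class OrderedAcquire {
    // Both threads take s first and q second, so neither can hold q while waiting for s.
    static int run(int roundsPerThread) {
        Semaphore s = new Semaphore(1);
        Semaphore q = new Semaphore(1);
        int[] entries = {0};
        Runnable process = () -> {
            for (int i = 0; i < roundsPerThread; i++) {
                s.acquireUninterruptibly(); // wait(s)
                q.acquireUninterruptibly(); // wait(q)
                try {
                    entries[0]++; // critical section needing both resources
                } finally {
                    q.release(); // signal(q)
                    s.release(); // signal(s)
                }
            }
        };
        Thread p0 = new Thread(process);
        Thread p1 = new Thread(process);
        p0.start();
        p1.start();
        try {
            p0.join();
            p1.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return entries[0];
    }

    public static void main(String[] args) {
        System.out.println(run(1_000)); // 2000
    }
}
```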

  21. Other problems • Incorrect order • Forgetting to signal() • Incorrect initial value

  22. Monitors • Encapsulates the synchronization with the code • Only one process may be active in the monitor at a time • Waiting processes are blocked (no busy waiting)

  23. Monitors • Condition variables control access to the monitor • Two operations: wait() and signal() (easy to confuse with semaphores, so be careful!) • enter() and leave() or other named functions may be used

  24. Monitors • if (some condition) call wait() on the monitor • <<mutex>> • call signal() on the monitor
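
The wait/critical-section/signal pattern on slide 24 can be sketched with Java's built-in monitors, applied to the bounded buffer from slides 6 and 9. The class `BoundedBuffer` and the `transfer` helper are assumptions made for illustration. Note two Java-specific choices: the condition is re-tested in a `while` loop rather than an `if`, because a woken thread is not guaranteed that the condition still holds, and `notifyAll()` is used because producers and consumers share one wait set.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example class; the names are not from the slides.
class BoundedBuffer<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) {
        this.capacity = capacity;
    }

    synchronized void put(T item) throws InterruptedException {
        while (items.size() == capacity) wait(); // buffer full: block, no busy-waiting
        items.addLast(item);
        notifyAll(); // wake any consumer waiting for data
    }

    synchronized T take() throws InterruptedException {
        while (items.isEmpty()) wait(); // buffer empty: block, no busy-waiting
        T item = items.removeFirst();
        notifyAll(); // wake any producer waiting for space
        return item;
    }

    // One producer sends 1..n through the buffer; one consumer adds them up.
    static long transfer(int n, int capacity) {
        BoundedBuffer<Integer> buffer = new BoundedBuffer<>(capacity);
        long[] sum = {0};
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) buffer.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) sum[0] += buffer.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(transfer(1_000, 4)); // 500500
    }
}
```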

  25. States in the Monitor • Active (running) • Waiting (blocked, waiting on a condition)

  26. Examples

  27. Signals in the Monitor • When an ACTIVE process issues a signal(), it must allow a blocked process to become active • This would result in two ACTIVE processes, which cannot be allowed in a CR. • The problem: the first process must be active in order to issue the signal(), yet the signal() makes a waiting process become active.

  28. Signals • Two solutions: • Delay the signal • Delay the waiting process from becoming active

  29. Gladiator monitor (Cavers & Brown, 1978) • Delay the signaled process, signaling process continues • Create a new state (URGENT) to hold the process that has just been signaled. This signals the process but delays execution of the process just signaled. • When the signal-er leaves the monitor (or wait()s again), the process in URGENT is allowed to run.

  30. Mediator (Cavers & Brown adapted from Hoare, 1974) • Delay the signaling process • When the process signal()s, it is blocked so the signaled process becomes active right away. • Correct interaction may be more difficult to achieve with this monitor. Be warned, especially if you have loops in your CR.

  31. Tips for Using Monitors • Remember that excess signal() instructions don’t matter so don’t test for them or try to count them. • Don’t intermix with semaphores. • Be sure everything shared is declared inside the monitor • Carefully think about the process ordering (which monitor you wish to use)

  32. Deadlocks • Schedule, in time order: T3: Lock-X(B), Read(B), B=B-50, Write(B); then T4: Lock-S(A), Read(A), Lock-S(B); then T3: Lock-X(A) • Deadlock occurs whenever a transaction T1 holds a lock on an item A and is requesting a lock on an item B, and a transaction T2 holds a lock on item B and is requesting a lock on item A. • Are T3 and T4 deadlocked here?

  33. Deadlock • T1 is waiting for T2 to release its lock on X • T2 is waiting for T1 to release its lock on Y • Deadlock: graph cycle

  34. Two strategies • Pessimistic: deadlock will happen, and therefore we should use “preventive” measures: deadlock prevention • Optimistic: deadlock will rarely occur, and therefore we wait until it happens and then try to fix it. This requires a mechanism to “detect” a deadlock: deadlock detection.

  35. Deadlock Prevention • Locks: • Lock all items before transaction begins execution • Either all are locked in one step or none are locked • Disadvantages: • Hard to predict what data items need to be locked • Data-item utilization may be very low
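
The "all are locked or none are locked" rule on slide 35 can be approximated in Java with `ReentrantLock.tryLock()`. This is a sketch under stated assumptions rather than how a DBMS implements it: the class `AllOrNothingLocks` and its methods are hypothetical, and the "one step" is emulated by backing out of every lock already taken as soon as one request fails. A thread therefore never waits while holding a lock, which removes the hold-and-wait condition that deadlock requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example class; the names are not from the slides.
class AllOrNothingLocks {
    // Tries to lock every item without blocking. On any failure, releases
    // the locks already taken and returns false, so the caller holds nothing.
    static boolean lockAll(List<ReentrantLock> locks) {
        List<ReentrantLock> held = new ArrayList<>();
        for (ReentrantLock lock : locks) {
            if (lock.tryLock()) {
                held.add(lock);
            } else {
                for (ReentrantLock h : held) h.unlock();
                return false;
            }
        }
        return true;
    }

    static void unlockAll(List<ReentrantLock> locks) {
        for (ReentrantLock lock : locks) lock.unlock();
    }

    // Runs a "transaction" in a new thread: lock everything, then release.
    // Returns whether the transaction obtained all of its locks.
    static boolean attemptInNewThread(List<ReentrantLock> locks) {
        boolean[] ok = {false};
        Thread t = new Thread(() -> {
            ok[0] = lockAll(locks);
            if (ok[0]) unlockAll(locks); // the transaction body would go before this
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return ok[0];
    }
}
```

A caller that receives `false` would typically wait briefly and retry. The disadvantages listed on the slide still apply: the full set of items must be known before the transaction begins.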

  36. Detection • Circular Wait • Graph the resources. If a cycle, you are deadlocked • No (or reduced) throughput (because the deadlock may not involve all users)
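
The cycle test on slide 36 can be sketched as a depth-first search. The example below assumes a wait-for graph, one common form of the graph in which an edge from T1 to T2 means "T1 is waiting for a lock held by T2"; the class `WaitForGraph` and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical example class; the names are not from the slides.
class WaitForGraph {
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    // Records that `waiter` is blocked on a lock held by `holder`.
    void addWait(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // A deadlock exists exactly when the graph contains a cycle.
    boolean hasDeadlock() {
        Set<String> finished = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : waitsFor.keySet()) {
            if (cycleFrom(node, onPath, finished)) return true;
        }
        return false;
    }

    private boolean cycleFrom(String node, Set<String> onPath, Set<String> finished) {
        if (onPath.contains(node)) return true;    // reached a node already on this path
        if (finished.contains(node)) return false; // already explored, no cycle through it
        onPath.add(node);
        for (String next : waitsFor.getOrDefault(node, Collections.emptySet())) {
            if (cycleFrom(next, onPath, finished)) return true;
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }

    public static void main(String[] args) {
        WaitForGraph g = new WaitForGraph();
        g.addWait("T1", "T2"); // T1 waits for T2 to release X
        g.addWait("T2", "T1"); // T2 waits for T1 to release Y
        System.out.println(g.hasDeadlock()); // true
    }
}
```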

  37. Deadlock Recovery • Pick a victim and rollback • Select a transaction, rollback, and restart • What criteria would you use to determine a victim?

  38. Synchronization is Tricky • Forgetting to signal or release a semaphore • Blocking while holding a lock • Synchronizing on the wrong synchronization mechanism • Deadlock • Must use locks consistently, and minimize amount of shared resources

  39. Java • The synchronized keyword • wait(), notify(), and notifyAll() • Code examples

  40. Java Threads • P1 is in the monitor (synchronized block of code) • P2 wants to enter the monitor • P2 must wait until P1 exits • While P2 is waiting, think of it as “waiting at the gate” • When P1 finishes, monitor allows one process waiting at the gate to become active. • Leaving the gate is not initiated by P2 – it is a side effect of P1 leaving the monitor
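
The "waiting at the gate" behavior on slide 40 can be seen in a minimal sketch using `synchronized` methods. The class `SynchronizedCounter` and its `run` helper are assumptions made for illustration. While one thread is inside `increment()`, any other thread calling a `synchronized` method on the same object waits at the gate until the first thread leaves.

```java
// Hypothetical example class; the names are not from the slides.
class SynchronizedCounter {
    private int value = 0;

    // Only one thread at a time may be active in a synchronized method of this object.
    synchronized void increment() {
        value++;
    }

    synchronized int get() {
        return value;
    }

    // Starts the given number of threads, each incrementing the same counter.
    static int run(int threads, int incrementsPerThread) {
        SynchronizedCounter counter = new SynchronizedCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) counter.increment();
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 50_000)); // 200000
    }
}
```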

  41. Big Data

  42. What does “Big Data” mean? • Most everyone thinks “volume” • Laney [3] expanded to include velocity and variety

  43. Defining “Big Data” • It’s more than just big – meaning a lot of data • Can be viewed as 3 issues • Volume • Size • Velocity • How quickly it arrives vs consumed or response time • Variety • Diverse sources, formats, quality, structures

  44. Specific Problems with Big Data • I/O Bottlenecks • The cost of failure • Resource limitations

  45. I/O Bottlenecks • Moore’s Law: Gordon Moore, co-founder of Intel • Observed that the number of transistors on a chip, often taken as a proxy for processor capability, roughly doubles every 2 years (often quoted as 18 months) • Regardless … • The issue is that I/O, network, and memory speeds have not kept up with processor speeds • This creates a huge bottleneck

  46. Other Issues • What are the restart operations if a thread/processor fails? • If dealing with “Big Data”, parallelized solutions may not be sufficient because of the high cost of failure • Distributed systems involve network communication that brings an entirely different and complex set of problems

  47. Cost of Failure • The failure of many jobs is a problem • Can’t just restart because data has been modified • Need to roll-back and restart • May require human intervention • Resource costly (time, lost processor cycles, delayed results) • This is especially problematic if a process has been running a very long time

  48. Using a DBMS for Big Data • Due to the volume of data: • May overwhelm a traditional DBMS • The data may lack the structure needed to integrate easily into a DBMS • The time or cost to clean/prepare the data for use in a traditional DBMS may be prohibitive • Time may be critical. Need to look at today’s online transactions to know how to run the business tomorrow

  49. Memory & Network Resources • Might be too much data to use existing storage or software mechanisms • Too much data for memory • Files too large to realistically distribute over a network • Because of the volume, need new approaches

  50. Would this work? • Reduce the data • Dimensionality reduction • Sampling
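
As one illustration of the sampling option on slide 50, the sketch below uses reservoir sampling. This technique is not named on the slides and is chosen here because it draws a uniform sample of k items in a single pass without holding the full data set in memory; the class `ReservoirSample` and its `sample` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical example class; the names are not from the slides.
class ReservoirSample {
    // Returns up to k items chosen uniformly at random from the stream,
    // reading each item once and keeping only k items in memory.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item); // the first k items fill the reservoir
            } else {
                // Pick a slot in [0, seen); the item is kept only if the slot is below k,
                // which happens with probability k / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) reservoir.set((int) slot, item);
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) data.add(i);
        List<Integer> picked = sample(data.iterator(), 10, new Random());
        System.out.println(picked.size()); // 10
    }
}
```

Whether a sample is acceptable depends on the analysis: rare events may be missed, so this answers the slide's question only for workloads that tolerate approximation.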
