Big Data Overview

Big Data Overview. 강명철 (mc.kang08@gmail.com)

Contents
• What is big data?
• Understanding Hadoop
• Installing Hadoop

What Is Big Data?

Defining big data: a maturing Internet, explosive data growth, social networks, the spread of mobile devices

Defining big data: data collection, data processing, deriving new meaning, system improvement

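Defining big data
"Data at a scale that a single server cannot process"
"Data at a scale that existing software cannot process": scale out
"3V (Volume, Velocity, Variety) + Variability"
• Web search engine data (e.g. 1 trillion pages × 4 KB = 4 PB)
• Search query logs and click logs (e.g. Google Flu Trends, Google Translate)
• Device-generated data (smart packet data)
• Social media data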
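The web-index figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes), and the class and method names are made up for illustration.

```java
// Back-of-the-envelope sizing for the web search example.
// Assumption: decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes).
final class WebIndexSize {
    static final long BYTES_PER_KB = 1_000L;
    static final double BYTES_PER_PB = 1e15;

    /** Total bytes for a page count and an average page size in KB. */
    static long totalBytes(long pages, long kbPerPage) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(pages, Math.multiplyExact(kbPerPage, BYTES_PER_KB));
    }

    /** Converts a byte count to petabytes. */
    static double toPetabytes(long bytes) {
        return bytes / BYTES_PER_PB;
    }

    public static void main(String[] args) {
        long bytes = totalBytes(1_000_000_000_000L, 4); // 1 trillion pages x 4 KB
        System.out.println(toPetabytes(bytes) + " PB");
    }
}
```

A data set of this size is far beyond what one server's disks can hold, which is why the definitions above point to scale out.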

Big data system components: data collection, data storage and processing, access to processing results, visualization, workflow

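Big data system components and tools
• Data collection: Flume, Chukwa, Kafka
• Data storage and processing: HDFS, MapReduce
• Access to processing results: RDBMS, NoSQL, search engine
• Visualization: Infographics, R, Pentaho
• Workflow: Cascading, Oozie, Azkaban, Ambrose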

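Big data success stories
• Netflix: movie recommendation service (updated once a day instead of once a week)
• eBay: query log mining (more than 1 billion user search records)
• Twitter: large-scale machine learning (analysis of more than 300 million tweets per day, follow patterns, sentiment)
• Credit card companies: building fraud detection models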

Major big data organizations
• Apache Software Foundation
• Cloudera
• Hortonworks
• MapR
• Gruter
• NexR

Understanding Hadoop

What Is Hadoop?

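What is Hadoop?
• Hadoop reference model
  • 2003: The Google File System
  • 2004: MapReduce: Simplified Data Processing on Large Clusters
• Doug Cutting, Nutch/Lucene
• 2006: split out of Nutch as its own Apache project (promoted to an Apache top-level project in 2008)
• Open source
• Moves code to where the data is
• Scale out
• Simple data model
• Optimized for offline batch processing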

MapReduce Paradigm

Scalability

Applications

UseCases

EcoSystems

Hadoop components
• Distributed file system (HDFS)
  • Single namespace for the entire cluster
  • Replicates data 3x for fault tolerance
• MapReduce framework
  • Executes user jobs specified as "map" and "reduce" functions
  • Manages work distribution and fault tolerance

Hadoop components (diagram): a JobTracker and a NameNode, plus three worker nodes that each run a TaskTracker and a DataNode

Hadoop Distributed File System (diagram): HDFS server (name node); HDFS client application; local file system (block size: 2K); data nodes (block size: 128M, replicated)
• Uses the underlying operating system's file system as is
• Fault tolerance
• Write once, read many
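To make the block and replication figures concrete, the sketch below computes how many blocks a file occupies and how many block replicas the cluster stores. It assumes the 128 MB block size from the diagram and the 3x replication mentioned earlier. The class and method names are illustrative and are not part of the Hadoop API.

```java
// Block arithmetic for an HDFS-style file system (illustrative helper, not Hadoop code).
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks needed for a file: ceil(fileBytes / blockBytes). */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Total number of block replicas stored across the data nodes. */
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    /** Raw bytes consumed: a partially filled last block stores only the data it holds. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 1024 * MB;   // a 1 GB file
        long block = 128 * MB;   // block size from the diagram
        System.out.println("blocks=" + blockCount(file, block)
                + " replicas=" + replicaCount(file, block, 3)
                + " rawMB=" + rawBytes(file, 3) / MB);
    }
}
```

Large blocks keep the block map held by the name node small, while replication lets the cluster lose a data node without losing data.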

Hadoop Distributed File System (diagram): HDFS server (master node) holding the blockmap; HDFS client application; heartbeat; local file system (block size: 2K); name nodes (block size: 128M, replicated)

MapReduce

MapReduce
• Ships code to the servers that hold the data
• Key/value data sets
• Shared-nothing architecture
• Suited to offline batch processing

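MapReduce
• Well suited:
  • Simple tasks with a very high degree of parallelism
  • Log analysis
  • Machine learning (clustering, classification, etc.)
• Poorly suited:
  • Real-time data stream processing
  • Jobs that require many repeated iterations
  • Cases where a MapReduce implementation would transfer too much data over the network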

Installing Hadoop

Installing Hadoop (lab environment: CentOS 6.3)
• Check the server environment
• Check the encoding
  • Check: echo $LANG
  • Result: must be ko_KR.utf8
• Change the encoding
  • Edit: vi /etc/sysconfig/i18n
  • Content: LANG="ko_KR.utf8"
• Apply the encoding
  • source /etc/sysconfig/i18n

Installing Hadoop: Java (JDK 1.6 or later recommended) and Hadoop 1.0.4
• yum install java-1.6.0-openjdk
• Download Hadoop
  • wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz
  • tar xvfz hadoop-1.0.4.tar.gz
• vi ~/.bashrc
  • export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk.x86_64
  • export PATH=$PATH:$JAVA_HOME/bin
  • export HADOOP_HOME=/home/mckang/hadoop-1.0.4
• source ~/.bashrc

Installing Hadoop: verifying standalone mode
• No changes to the default configuration: core-site.xml, mapred-site.xml, hdfs-site.xml
• Run the WordCount example
  • cd $HADOOP_HOME
  • ./bin/hadoop jar hadoop-examples*.jar wordcount README.txt ~/wc-output

Installing Hadoop: SSH setup
• Generate a key pair
  • ssh-keygen -t rsa
• Distribute the public key
  • scp ~/.ssh/id_rsa.pub user@datanode:/home/user
• Register the authorized key (on each data node)
  • cat ~/id_rsa.pub >> ~/.ssh/authorized_keys (check for 655 permissions)
• Verify
  • ssh datanode

Installing Hadoop: verifying pseudo-distributed mode
• NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker
• Change the default configuration: core-site.xml
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/mckang/hadoop-data</value>
      </property>
    </configuration>

Installing Hadoop: verifying pseudo-distributed mode
• NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker
• Change the default configuration: hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

Installing Hadoop: verifying pseudo-distributed mode
• NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker
• Change the default configuration: mapred-site.xml
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
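The three site files share one simple shape: a list of name/value properties. As an illustration of that format only (this is not how Hadoop's own Configuration class is implemented), the sketch below reads such a file into a map using the JDK's built-in XML parser. The class name SiteXmlReader is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Reads Hadoop-style <property><name>..</name><value>..</value></property> entries.
final class SiteXmlReader {
    /** Parses the XML text and returns the properties in document order. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element p = (Element) properties.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or I/O failure: surface it as an unchecked exception.
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String mapredSite = "<configuration><property>"
                + "<name>mapred.job.tracker</name><value>localhost:9001</value>"
                + "</property></configuration>";
        System.out.println(parse(mapredSite));
    }
}
```

Reading the files this way is a quick check that an edited site file is well-formed before the daemons are started.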

Installing Hadoop: verifying pseudo-distributed mode
• Check the Secondary NameNode
  • cat masters
• Check the DataNode
  • cat slaves
• Format HDFS
  • hadoop namenode -format
• Start
  • start-all.sh
  • jps

Installing Hadoop: verifying pseudo-distributed mode
• JobTracker: http://localhost:50030
• NameNode: http://localhost:50070

Installing Hadoop: verifying pseudo-distributed mode
• Upload a file
  • hadoop fs -put README.txt /README.txt
• Run the WordCount example
  • hadoop jar hadoop-examples-*.jar wordcount /README.txt /wc_output
• Check the result
  • hadoop fs -cat /wc_output/p*

HDFS Command

HDFS: hadoop fs -{command}
• ls: list a directory
• lsr: list a directory, including subdirectories
• du: show file usage
• dus: show the total file usage
• cat: show file contents
• text: show a compressed file as text
• mkdir, cp, mv, rm, rmr
• put / copyFromLocal, get / copyToLocal, getmerge

HDFS: development environment setup
• Download the development tool
  • www.springsource.org/downloads/sts-ggts
• Create a Maven project
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.0.4</version>
    </dependency>

HDFS: HDFS API test
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    public class SingleFileWriteRead {
        public static void main(String[] args) {
            // Check the input parameters
            if (args.length != 2) {
                System.err.println("Usage: SingleFileWriteRead <filename> <contents>");
                System.exit(2);
            }
            try {
                // Create the file system handle
                Configuration conf = new Configuration();
                FileSystem hdfs = FileSystem.get(conf);
                // Check the path and delete it if it already exists
                Path path = new Path(args[0]);
                if (hdfs.exists(path)) {
                    hdfs.delete(path, true);
                }
                // Write the file
                FSDataOutputStream outStream = hdfs.create(path);
                outStream.writeUTF(args[1]);
                outStream.close();
                // Read the file back and print it
                FSDataInputStream inputStream = hdfs.open(path);
                String inputString = inputStream.readUTF();
                inputStream.close();
                System.out.println("## inputString:" + inputString);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

MapReduce

Map Operation (diagram): the data collection is split (split 1, split 2, …, split n) to supply multiple processors, and each split is fed to a Map task. MAP: input data → <key, value> pair

Reduce Operation (diagram): the data collection is split (split 1, split 2, …, split n) to supply multiple processors; each split is fed to a Map task, and Map output flows to the Reduce tasks. MAP: input data → <key, value> pair. REDUCE: <key, value> pair → <result>
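The two operations can be imitated in plain Java without a cluster. The following is a minimal in-memory sketch of word counting: map turns each split into <word, 1> pairs, a shuffle step groups the pairs by key, and reduce sums each group. It deliberately avoids the Hadoop API, and all names are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> shuffle -> reduce (not Hadoop code).
final class MiniMapReduce {
    /** MAP: one input split -> list of <word, 1> pairs. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** SHUFFLE: group every value under its key (TreeMap keeps keys sorted). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** REDUCE: <key, [values]> -> result (here, the sum of the values). */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three phases over all splits and returns word -> count. */
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("cat bat", "dog cat", "cat")));
    }
}
```

On a real cluster the map calls run on different machines and the shuffle moves data over the network, but the contract of each phase is the same.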

Diagram: large-scale data splits pass through Map tasks (parse-hash) that emit <key, 1> pairs; reducers (say, Count) write the output partitions P-0000, P-0001, P-0002 with count1, count2, count3
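The "hash" half of parse-hash decides which reducer receives each key. A minimal sketch of that routing follows, using the same modulo-of-hash idea as Hadoop's default HashPartitioner. The class name and the partName helper (which only imitates the P-0000 labels in the diagram) are assumptions for illustration.

```java
import java.util.Locale;

// Routes keys to reducers by hash code (illustrative, not the Hadoop class itself).
final class HashPartitionSketch {
    /** Maps a key to a reducer index in [0, numReducers). */
    static int partitionFor(String key, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        // Mask the sign bit so negative hash codes still give a non-negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Label in the style of the diagram: P-0000, P-0001, ... */
    static String partName(int partition) {
        return String.format(Locale.ROOT, "P-%04d", partition);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"cat", "bat", "dog"}) {
            System.out.println(word + " -> " + partName(partitionFor(word, 3)));
        }
    }
}
```

Because the index depends only on the key, every occurrence of a word lands on the same reducer, which is what lets each reducer produce a complete count for its keys.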

MapReduce example in my operating systems class (diagram): counting the words Cat, Bat, Dog, and other words in a TByte-size input; each split flows through map → combine → reduce, producing the output files part0, part1, part2
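The combine stage in this example pre-aggregates counts on the map side so that fewer pairs cross the network. The sketch below shows the effect on a toy input, under the assumption that the combiner applies the same sum as the reducer. It is plain Java with illustrative names, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Map-side combining for word count (illustrative sketch).
final class CombinerSketch {
    /** MAP + COMBINE for one split: one <word, localCount> pair per distinct word. */
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : split.split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    /** REDUCE: merge the per-split partial counts into global totals. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    /** Number of pairs that would be shuffled after combining. */
    static int shuffledPairs(List<Map<String, Integer>> partials) {
        int n = 0;
        for (Map<String, Integer> partial : partials) {
            n += partial.size();
        }
        return n;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : Arrays.asList("Cat Bat Cat", "Dog Cat Cat")) {
            partials.add(mapAndCombine(split));
        }
        System.out.println("pairs shuffled: " + shuffledPairs(partials));
        System.out.println("totals: " + reduce(partials));
    }
}
```

Without the combiner, the two splits would emit six <word, 1> pairs; with it, only one pair per distinct word per split is shuffled, and the final totals are unchanged.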

MapReduce Programming Model

MapReduce programming model
• Determine whether the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? is it a large data set?).
• Design and implement the solution as Mapper and Reducer classes.
• Compile the source code against Hadoop core.
• Package the code as an executable jar.
• Configure the application (job): the number of mappers and reducers (tasks) and the input and output streams.
• Load the data (or use previously available data).
• Launch the job and monitor it.
• Check the result.

MapReduce characteristics
• Very large scale data: petabytes, exabytes
• Write-once, read-many data: allows for parallelism without mutexes
• Map and Reduce are the main operations: simple code
• There are other supporting operations such as combine and partition (out of the scope of this talk).
• All the map tasks should be completed before the reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• The number of map tasks and reduce tasks is configurable.
• Operations are provisioned near the data.
• Commodity hardware and storage.
• The runtime takes care of splitting and moving data for operations.
• Special distributed file system. Example: Hadoop Distributed File System and Hadoop Runtime.
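The rule that every map must finish before reduce starts behaves like a barrier. As a rough single-machine analogy (threads stand in for map tasks, and all names are hypothetical), the sketch below runs one map task per split on a thread pool and only sums the partial results after all of them have completed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Parallel map tasks followed by a reduce step that waits for all of them.
final class MapBarrierSketch {
    /** Counts words per split in parallel, then sums once every map task is done. */
    static long countWords(List<String> splits, int numMapTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(numMapTasks);
        try {
            List<Callable<Long>> mapTasks = new ArrayList<>();
            for (String split : splits) {
                mapTasks.add(() -> {
                    String trimmed = split.trim();
                    return trimmed.isEmpty() ? 0L : (long) trimmed.split("\\s+").length;
                });
            }
            // invokeAll blocks until every map task has completed: the barrier.
            List<Future<Long>> partials = pool.invokeAll(mapTasks);
            long total = 0; // REDUCE: runs only after the barrier
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for map tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a map task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("a b c", "d e", ""), 2));
    }
}
```

The thread pool size plays the role of the configurable number of map tasks. The result does not depend on it, only the degree of parallelism does.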
