Explore the fundamentals of machine learning in analyzing big data. Learn linear and logistic regression, clustering, and recommendations using R and Hadoop. Practical examples and application frameworks included.
2015.4.30 University of Seoul, School of Electrical and Computer Engineering, Artificial Intelligence Lab, Kim Yu-sang (김유상) 6. Understanding Big Data Analysis with Machine Learning
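Introduction to machine learning • What is machine learning? • Example applications • Spam email detection • Self-driving vehicles • Speech recognition • Face recognition • Detection of anomalous online activity, etc. • Related applications/frameworks • R • Python • Apache Mahout • Weka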
Supervised machine-learning algorithms • Linear regression • Logistic regression
Linear regression • Regression can be formulated as follows • The slope of the regression line • The y-intercept of the regression line • Applications of linear regression • Sales forecasting • Product price optimization • Predicting the next online purchase based on various data and events
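The formula itself did not survive on this slide. As a minimal sketch, assuming the usual single-predictor form y = intercept + slope * x fitted by ordinary least squares, the slope and intercept can be computed as below. This is a Python illustration rather than the R code from the slides; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

On this made-up data the fit recovers a slope of 2 and an intercept of 1.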
Linear regression with R • train_data
Linear regression with R and Hadoop • Calculating the Xtx value with MapReduce job1. • Calculating the Xty value with MapReduce job2. • Deriving the coefficient values with Solve (Xtx, Xty).
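The three steps above can be sketched in plain Python, assuming each "MapReduce job" computes a partial sum per data chunk and then adds the partials together. This is not the R/Hadoop code from the slides; `map_chunk`, `reduce_partials`, and `solve` are hypothetical names, and the data are made up.

```python
def map_chunk(rows):
    """Mapper: partial X^T X and X^T y for one chunk of (features, target) rows."""
    p = len(rows[0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def reduce_partials(partials):
    """Reducer: element-wise sum of the per-chunk partial results."""
    p = len(partials[0][1])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for part_xtx, part_xty in partials:
        for i in range(p):
            xty[i] += part_xty[i]
            for j in range(p):
                xtx[i][j] += part_xtx[i][j]
    return xtx, xty

def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

# Hypothetical data on y = 1 + 2x; the leading 1.0 in each row models the intercept
chunks = [
    [([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)],
    [([1.0, 3.0], 7.0), ([1.0, 4.0], 9.0)],
]
xtx, xty = reduce_partials([map_chunk(c) for c in chunks])
beta = solve(xtx, xty)  # [intercept, slope]
```

The point of the split is that X^T X and X^T y are sums over rows, so each chunk can be processed independently and only small p-by-p results need to be combined before solving.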
Logistic regression • To predict the log odds ratios, use the following formula: • The probability formula is as follows • Applications of logistic regression • Predicting the likelihood of an online purchase • Diagnosing whether a patient has diabetes
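Neither formula was preserved on this slide. A minimal sketch, assuming the standard definitions (the log odds as a linear combination of the features, and the probability as the logistic function of the log odds), might look like this in Python. The function names and coefficient values are hypothetical.

```python
import math

def log_odds(coefs, intercept, features):
    """Log odds (logit): intercept + sum of coefficient * feature."""
    return intercept + sum(b * x for b, x in zip(coefs, features))

def probability(logit):
    """Logistic function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients and features chosen so the log odds are exactly 0
logit = log_odds([0.5, -0.25], 0.0, [2.0, 4.0])
p = probability(logit)
```

Log odds of 0 correspond to a probability of 0.5, that is, even odds.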
Logistic regression with R and Hadoop • Defining the lr.map Mapper function • Defining the lr.reducer Reducer function • Defining the logistic.regression MapReduce function
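The slide does not show the bodies of these functions. One common way to fit logistic regression in a map/reduce style is gradient ascent on the log-likelihood: the mapper computes the gradient for its chunk, and the reducer sums the gradients. The Python sketch below assumes that approach; `lr_map`, `lr_reduce`, and `logistic_regression` are hypothetical analogues of the names on the slide, and the data and step size are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_map(chunk, beta):
    """Mapper: log-likelihood gradient contributed by one chunk of (x, y) rows."""
    grad = [0.0] * len(beta)
    for x, y in chunk:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for i, xi in enumerate(x):
            grad[i] += (y - p) * xi
    return grad

def lr_reduce(grads):
    """Reducer: element-wise sum of the per-chunk gradients."""
    return [sum(parts) for parts in zip(*grads)]

def logistic_regression(chunks, n_features, iterations=2000, alpha=0.05):
    """Driver: repeat map + reduce, moving beta along the summed gradient."""
    beta = [0.0] * n_features
    for _ in range(iterations):
        grad = lr_reduce([lr_map(c, beta) for c in chunks])
        beta = [b + alpha * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical data: rows are ([1.0, value], label); the 1.0 models the intercept
chunks = [
    [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 2.0], 1)],
    [([1.0, 3.0], 0), ([1.0, 4.0], 1), ([1.0, 5.0], 1)],
]
beta = logistic_regression(chunks, n_features=2)
p_low = sigmoid(beta[0])                   # predicted probability at value 0
p_high = sigmoid(beta[0] + 5.0 * beta[1])  # predicted probability at value 5
```

Because the gradient is a sum over rows, each iteration needs only one small vector per chunk to be passed from the mappers to the reducer.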
Logistic regression with R and Hadoop • foodstamp : Food-Stamp Program
Unsupervised machine learning algorithms • Clustering • Artificial neural networks • Vector quantization
Clustering • Clustering is the task of grouping a set of objects so that objects with similar characteristics fall into the same category. • Clustering techniques available in R • K-means • K-medoids • Hierarchical • Density-based • Applications of clustering • Market segmentation • Social network analysis • Organizing computer networks • Astronomical data analysis
Performing clustering with R and Hadoop • Defining the dist.fun distance function • Defining the k-means.map k-means Mapper function • Defining the k-means.reduce k-means Reducer function • Defining the k-means.mr k-means MapReduce function • Defining input data points to be provided to the clustering algorithms
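The function bodies are not shown on the slide. As a rough Python sketch of how these pieces usually fit together, assuming Euclidean distance, a mapper that keys each point by its nearest center, and a reducer that averages the points under each key: the names below are hypothetical analogues of the slide's functions, and the points are made up.

```python
import math

def dist_fun(point, center):
    """Euclidean distance between a point and a center."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

def kmeans_map(points, centers):
    """Mapper: emit (index of nearest center, point) pairs."""
    pairs = []
    for pt in points:
        nearest = min(range(len(centers)), key=lambda k: dist_fun(pt, centers[k]))
        pairs.append((nearest, pt))
    return pairs

def kmeans_reduce(pairs, centers):
    """Reducer: each new center is the mean of the points assigned to it."""
    groups = {}
    for key, pt in pairs:
        groups.setdefault(key, []).append(pt)
    new_centers = list(centers)  # a center with no points keeps its position
    for key, pts in groups.items():
        new_centers[key] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return new_centers

def kmeans_mr(points, centers, iterations=10):
    """Driver: repeat the map and reduce steps a fixed number of times."""
    for _ in range(iterations):
        centers = kmeans_reduce(kmeans_map(points, centers), centers)
    return centers

# Hypothetical input: two well-separated groups of 2-D points
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = kmeans_mr(points, [(0.0, 0.0), (10.0, 10.0)])
```

With these starting centers the two groups separate on the first pass, so the centers settle at the mean of each group.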
Performing clustering with R and Hadoop • An error occurred while running kmeans.mr • Checked the Hadoop logs • The apply function could not find dim(X) • Presumed cause: colSums returns a vector, whereas apply requires a matrix
Performing clustering with R and Hadoop • Results (from the book)
Recommendation algorithms • User-based recommendations • Recommends items based on the preferences of similar users • Item-based recommendations • Recommends items similar to the ones the user already prefers
Steps to generate recommendations in R • Computing the co-occurrence matrix. • Establishing the user-scoring matrix. • Generating recommendations.
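The three steps can be illustrated with a small item-based sketch in Python, assuming that co-occurrence means "the number of users who rated both items" and that a candidate item's score is the co-occurrence-weighted sum of the user's existing ratings. This is not the R code from the slides; the function names and the rating triples are hypothetical.

```python
from collections import defaultdict
from itertools import product

def cooccurrence(ratings):
    """Step 1: for every item pair, count the users who rated both items."""
    by_user = defaultdict(set)
    for user, item, _ in ratings:
        by_user[user].add(item)
    counts = defaultdict(int)
    for items in by_user.values():
        for a, b in product(items, repeat=2):
            counts[(a, b)] += 1
    return counts

def user_scores(ratings, user):
    """Step 2: the scores one user has given, keyed by item."""
    return {item: score for u, item, score in ratings if u == user}

def recommend(ratings, user, top_n=2):
    """Step 3: score unseen items by co-occurrence-weighted user ratings."""
    counts = cooccurrence(ratings)
    scores = user_scores(ratings, user)
    all_items = {item for _, item, _ in ratings}
    totals = {}
    for candidate in all_items - set(scores):
        totals[candidate] = sum(counts[(candidate, seen)] * s
                                for seen, s in scores.items())
    # Highest score first; ties broken by item name for a stable order
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical (user, item, score) triples
ratings = [
    (1, "A", 5.0), (1, "B", 3.0),
    (2, "A", 4.0), (2, "B", 2.0), (2, "C", 5.0),
    (3, "B", 4.0), (3, "C", 3.0), (3, "D", 1.0),
]
recs = recommend(ratings, user=1)
```

For user 1, item C co-occurs with A once and with B twice, giving 1 * 5.0 + 2 * 3.0 = 11.0, while D co-occurs only with B, giving 3.0, so C is ranked first.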
Steps to generate recommendations in R • small.csv
Generating recommendations with R and Hadoop • Establishing the co-occurrence matrix of items. • Establishing the user scoring matrix for items. • Generating recommendations.
Performing clustering with R and Hadoop • An error occurred while running cal.mr • Checked the Hadoop logs