Project 1: Who is Popular, and Who is Not
Angel Trifonov, Anh Pham, Xiao Qin
Tasks
• Tasks b and c in both Pig and Java
• Task h in Java
Task b in Java
Write a job(s) that reports, for each country, how many of its citizens have a Facebook page.
• Single map-reduce job
• Input: MyPage dataset
• Mapper: examine each file line by line
  • Each line is converted to a string
  • The string is split using the "," delimiter
  • The nationality is extracted and emitted as the key, with an IntWritable as the value
• Reducer: take all pairs and sum the values for each key
• Output: number of users per nationality
• Single reducer
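The map and reduce steps for Task b can be sketched in plain Java without any Hadoop classes. This is only an illustration of the logic on the slide, not the project's actual job: the class name `TaskBSketch`, the column position `NATIONALITY_INDEX`, and the sample record layout are assumptions, since the slides do not give the MyPage schema.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Task b map and reduce logic (no Hadoop dependency).
class TaskBSketch {

    // Hypothetical position of the nationality column in a MyPage line.
    static final int NATIONALITY_INDEX = 2;

    // "Map" step: split one line on "," and return the nationality as the key.
    static String mapLine(String line) {
        String[] fields = line.split(",");
        return fields[NATIONALITY_INDEX].trim();
    }

    // "Reduce" step: add 1 for every occurrence of each nationality key.
    static Map<String, Integer> countByNationality(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapLine(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> lines = List.of(
            "1,Alice,US,10,reading",
            "2,Bob,VN,11,chess",
            "3,Carol,US,12,hiking");
        System.out.println(countByNationality(lines));
    }
}
```

In a real Hadoop job the same two steps would live in a Mapper and a Reducer class, with the framework doing the grouping by key between them.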
Task b in Pig
• Group the MyPage dataset by country code:
  countrygrp = group mypage by cc;
• Report the number of people who have a Facebook page for each country:
  taskb = foreach countrygrp generate group, COUNT(mypage.id);
  dump taskb;
Running time comparison (job time):
• Plain MapReduce: 1 min 36 sec
• Pig: 24 sec
Task c in Java
Find the top 10 interesting Facebook pages, namely, those that got the most accesses based on your AccessLog dataset compared to all other pages.
• Hadoop settings: multiple mappers and one reducer (setNumReduceTasks(1))
• Input: AccessLog
• 1st round:
  • Mapper(s): parse the input data and get the WhatPage. Set WhatPage as the key and the constant 1 as the value.
  • Reducer: for each key, sum up the values. Set WhatPage as the key and the total count as the value.
• 2nd round:
  • Swap the key and value (InverseMapper.class)
  • Output: [Count], [WhatPage] (in descending order)
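The two rounds can be sketched in plain Java as a count followed by a swap-and-sort. This is a simplified stand-in for the Hadoop jobs, under stated assumptions: the class name `TaskCSketch`, the column position `WHAT_PAGE_INDEX`, and the sample layout are hypothetical. Note that Hadoop sorts keys in ascending order by default, so the real second job would need a descending comparator to produce the order shown on the slide; here an explicit reversed comparator plays that role.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the two Task c rounds (no Hadoop dependency).
class TaskCSketch {

    // Hypothetical position of the WhatPage column in an AccessLog line.
    static final int WHAT_PAGE_INDEX = 2;

    // Round 1: emit (WhatPage, 1) for each line and sum the values per key.
    static Map<String, Integer> countAccesses(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String page = line.split(",")[WHAT_PAGE_INDEX].trim();
            counts.merge(page, 1, Integer::sum);
        }
        return counts;
    }

    // Round 2: swap to (count, WhatPage), sort by count descending, keep the top n.
    static List<Map.Entry<Integer, String>> topPages(Map<String, Integer> counts, int n) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            swapped.add(Map.entry(e.getValue(), e.getKey()));
        }
        swapped.sort(Map.Entry.<Integer, String>comparingByKey(Comparator.reverseOrder()));
        return swapped.subList(0, Math.min(n, swapped.size()));
    }
}
```

Calling `topPages(countAccesses(lines), 10)` gives the ten most accessed pages; pages with equal counts come out in no particular order in this sketch.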
Task c in Pig
• Group the AccessLog dataset by accessed Facebook ID:
  access_fid_grp = group alog by fid;
• Get the access count for each accessed Facebook ID:
  grpcnt = foreach access_fid_grp generate group, COUNT(alog.aid) as alogcnt;
• Order the counts descending:
  grporder = order grpcnt by alogcnt desc;
• List the top 10:
  taskc = limit grporder 10;
  dump taskc;
Running time comparison (job time):
• Plain MapReduce: 2 min 1 sec
• Pig: 1 min 52 sec
Task h: Define Potential Stalkers
A potential stalker is a person who visits another person's Facebook page too often although the two are not friends.
Mapper
• Output key: the 2nd field (person ID), as IntWritable
  • Record layout: 1st field, PersonID, 3rd field…
• Output value: "<dataset tag>, <ID>", as Text
  • Friends: key personID, value "f, friendID"
  • AccessLog: key personID, value "a, visitedID"
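The tagging step of this mapper can be sketched as a small plain-Java function. It is a rough illustration only: the class name `TaskHMapperSketch` is hypothetical, and reading the friend or visited ID from the 3rd field is an assumption, since the slide only states that the person ID is the 2nd field.

```java
import java.util.Map;

// Plain-Java sketch of the Task h mapper's tagging step (no Hadoop dependency).
class TaskHMapperSketch {

    // datasetTag is "f" for a Friends record and "a" for an AccessLog record.
    static Map.Entry<Integer, String> tag(String datasetTag, String line) {
        String[] fields = line.split(",");
        int personId = Integer.parseInt(fields[1].trim()); // 2nd field: person ID (the key)
        String otherId = fields[2].trim();                 // assumed 3rd field: friend or visited ID
        return Map.entry(personId, datasetTag + "," + otherId);
    }
}
```

Because both datasets emit the same key type, all friend and access records of one person arrive at the same reducer call, where the tag tells them apart.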
Reducer
• Key: <personID>
• Value list: <(f,friendID) (a,visitedID) (f,friendID) (a,visitedID) …>
• Sort the list by the second field of each element.
• Entries whose visitedID or friendID has the same value are then placed next to each other.
• If all entries for an ID are visitedID entries (no friendID entry) and the ID appears too many times (based on a predefined threshold), the person is a potential stalker.
• Output: personID, visitedID
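The reducer logic above can be sketched in plain Java as a sort followed by a scan over runs of equal IDs. This is one possible reading of the slide rather than the project's actual code: the class name `TaskHReducerSketch`, the "tag,ID" string encoding, and the choice that "too many times" means strictly more than `threshold` visits are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of the Task h reducer logic (no Hadoop dependency).
class TaskHReducerSketch {

    // values holds strings such as "f,7" (friend) or "a,9" (page access) for one person.
    static List<String> findStalked(int personId, List<String> values, int threshold) {
        List<String[]> entries = new ArrayList<>();
        for (String v : values) {
            String[] parts = v.split(",");
            entries.add(new String[] {parts[0].trim(), parts[1].trim()});
        }
        // Sort by the ID (second field) so that equal IDs become adjacent.
        entries.sort(Comparator.comparing((String[] e) -> e[1]));

        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < entries.size()) {
            String id = entries.get(i)[1];
            int visits = 0;
            boolean isFriend = false;
            // Walk over the run of entries that share this ID.
            while (i < entries.size() && entries.get(i)[1].equals(id)) {
                if (entries.get(i)[0].equals("f")) {
                    isFriend = true;
                } else {
                    visits++;
                }
                i++;
            }
            // Assumed rule: flag only non-friends visited more than threshold times.
            if (!isFriend && visits > threshold) {
                output.add(personId + "," + id);
            }
        }
        return output;
    }
}
```

Sorting first keeps the scan to a single pass with constant extra state per ID, which mirrors the "placed next to each other" step on the slide.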
Thank you! Questions?