Projects (2012-2013)
Ida Mele
Rules
• Students have to work in teams (max 2 people).
• The project has to be delivered by the deadline that will be published on my web site.
• Usually the project deadline is the same day as the written exam. Students who pass the exam during the first session can deliver their projects by the second session.
• The project score ranges from 0 to 10.
• The professor decides the final mark, also considering the score of the written exam.
• A project can be assigned to max 2 groups.
Project Request
• Students have to send me an email with the subject "WebIR - project request" and the following information:
  • Name and last name of each student in the group.
  • Title of the project.
  • Short description of what the students intend to do (up to 250 words).
  Important: all the members of the group should be cc-ed in the email.
• If everything is OK, you will receive a confirmation email.
• There is no deadline for the project request.
Project Delivery
• The presentation of the project takes 15 minutes. It should contain the description of the problem, the design decisions, the most important issues related to the implementation, and the results achieved. Students use slides for their presentations and, if they want, they can prepare a demo as well.
• Students have to deliver the source code and the slides. More instructions about the project delivery will be published on my web site.
Project list
• Analyze the link structure of the web graph of Sapienza University.
• Analyze the link structure of Twitter social network.
• Find communities in Facebook.
• Find communities in IMDB.
• Find communities in DBLP.
• Hadoop implementation of PageRank.
• Hadoop implementation of HITS.
• Realize a reverse web graph with Hadoop.
• Realize an inverted index with Hadoop.
• Personalized ranking of news.
• Enrich News using Tweets.
• Enrich News using Wikipedia.
Projects 1) Analyze the link structure of the web graph of Sapienza University.
• Crawl the portion of the Web related to the domain uniroma1.it and create the corresponding web graph. Analyze its link structure and identify the authoritative web sites.
• Tip: the students can use node features such as degree, in-degree, out-degree, PageRank, etc., and plot the distributions of these measures. The students can enrich their analysis by studying the edge reciprocity and the graph assortativity.
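The node features mentioned in the tip can be computed directly from an edge list. The following is a minimal sketch, assuming the crawl has already been stored as (source, target) pairs; the function names and the toy graph are hypothetical and only serve as an illustration.

```python
from collections import Counter

def degree_features(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg

def degree_distribution(degrees, nodes):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(degrees.get(node, 0) for node in nodes)

def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse edge (v, u) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

# Hypothetical toy graph: node names are placeholders, not real crawl data.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
nodes = {node for edge in edges for node in edge}
in_deg, out_deg = degree_features(edges)
print(degree_distribution(in_deg, nodes))
print(degree_distribution(out_deg, nodes))
print(reciprocity(edges))
```

The distributions returned by `degree_distribution` can then be passed to any plotting tool. Note that this sketch counts a self-loop as a reciprocated edge; students may prefer to filter self-loops out first.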
Projects 2) Analyze the link structure of Twitter social network.
• Use the Twitter API to create the who-follows-whom network. Analyze the distributions of followers and followings, and identify the most popular users. Study the edge reciprocity and determine whether the network is assortative.
• Tip: the students can use PageRank and/or other node features to identify the most popular users.
• Tip: the network is assortative when nodes tend to be connected to similar nodes; for example, nodes with high degree have edges to nodes with high degree.
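One common way to quantify the notion of assortativity given in the tip is the Pearson correlation between the degrees found at the two ends of each edge: positive values suggest an assortative network, negative values a disassortative one. The sketch below is only one possible variant. It assumes the network is treated as undirected and that "similar" means "similar total degree"; the function name is hypothetical.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge.

    Assumption: the graph is treated as undirected, so every edge is
    counted once in each direction.
    """
    if not edges:
        return 0.0
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys contain the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else 0.0

# Hypothetical star network: one hub connected to three low-degree nodes.
star = [("hub", "u1"), ("hub", "u2"), ("hub", "u3")]
print(degree_assortativity(star))
```

In the star example the high-degree hub is linked only to low-degree nodes, so the coefficient is negative. For a directed who-follows-whom network, students may instead correlate the out-degree of the source with the in-degree of the target.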
Projects 3) Find communities in Facebook.
• Use the Facebook API to download the data of your friends and of friends of friends. Create the corresponding friendship graph and find communities of users. Check whether the communities correspond to groups of users who live in the same city, work for the same organization, or attend the same school, university, etc.
• Tip: the students can identify clusters of users by using a graph-partitioning tool.
Projects 4 and 5) Find communities in a network of collaborations.
Project n.4: use IMDB: http://www.imdb.com/interfaces
Project n.5: use DBLP: http://dblp.uni-trier.de/xml/
• Create a graph where nodes are people and a link between two people represents the fact that they have worked together. Use this graph to find communities of people. Check whether the people in a community come from the same country, are famous (for project n.4), or belong to the same university (for project n.5).
• Tip: the information about the number of collaborations is important; students can use weighted edges to represent it.
• Tip: the students can use a graph-partitioning tool to find clusters of users.
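The weighted edges suggested in the tip can be obtained by counting how many works each pair of people shares. The sketch below assumes the IMDB or DBLP data has already been parsed into a dictionary that maps each work (movie or paper) to the list of people involved. That input format, the function name, and the sample names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def collaboration_graph(works):
    """Return a Counter mapping each pair of people to their number of joint works.

    Each pair is stored as a sorted tuple, so (a, b) and (b, a) share one edge.
    """
    weights = Counter()
    for people in works.values():
        # set() avoids counting a person twice within the same work.
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical toy data, not taken from IMDB or DBLP.
works = {
    "work-1": ["Ann", "Bob", "Eve"],
    "work-2": ["Bob", "Ann"],
}
for (a, b), weight in sorted(collaboration_graph(works).items()):
    print(a, b, weight)
```

The resulting weighted edge list can be exported in the input format expected by the chosen graph-partitioning tool.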
Projects 6 and 7) Hadoop implementation of a ranking algorithm.
Project n.6: implementation of PageRank.
Project n.7: implementation of HITS.
• Given a web graph, where nodes represent web pages and the edge between two nodes u and v represents the link from the source page u to the target page v, implement a ranking algorithm that computes the scores of the nodes. Plot and analyze the distribution of the obtained scores.
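Before moving to Hadoop, it can help to have a small single-machine reference to compare results against. The following sketch shows the PageRank power iteration in plain Python. It is not a Hadoop implementation. The damping factor 0.85, the fixed number of iterations, and the uniform redistribution of the score of pages without out-links are assumptions of this sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration on a directed edge list."""
    edges = set(edges)  # ignore duplicate links
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by pages without out-links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical toy web graph.
scores = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In a MapReduce setting, one iteration of the inner loop roughly corresponds to one job: the map phase emits the shares sent along each link and the reduce phase sums the shares received by each page.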
Projects 8) Realize a reverse web graph with Hadoop.
• Given a web graph, the algorithm creates the graph with reversed edges. For example, if the input graph has the edge (u,v), the output graph will have the edge (v,u). Represent the input and output graphs (or portions of them) using a graph tool.
• Tip: for each link, the mapper creates <target, source> pairs. The reducer concatenates the sources and emits <target, list of sources> pairs.
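The map and reduce steps described in the tip can be simulated on a single machine to check the logic before writing the Hadoop job. In the sketch below, the grouping dictionary plays the role of the shuffle phase that Hadoop performs between map and reduce. All function names are hypothetical.

```python
from collections import defaultdict

def map_link(source, target):
    """Map step: for each link, emit a <target, source> pair."""
    yield target, source

def reduce_sources(target, sources):
    """Reduce step: emit the <target, list of sources> pair."""
    return target, sorted(sources)

def reverse_graph(edges):
    """Simulate map, shuffle, and reduce on an in-memory edge list."""
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle phase
    for source, target in edges:
        for key, value in map_link(source, target):
            grouped[key].append(value)
    return dict(reduce_sources(target, sources) for target, sources in grouped.items())

# Hypothetical toy graph with the edges (u,v), (w,v), and (v,u).
print(reverse_graph([("u", "v"), ("w", "v"), ("v", "u")]))
```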
Projects 9) Realize an inverted index with Hadoop.
• Given a large collection of documents, the algorithm creates the inverted index, where the dictionary contains the indexed terms and, for each term, the list of postings is stored.
• Tip (for the dictionary): the students can decide to use stemming or to remove stop-words.
• Tip (for the postings): the students can build an inverted index where each posting has the ID of the document containing the term and other information, such as the frequency of the term in the document and the positions of the occurrences of the term in the document.
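The structure of the dictionary and of the postings can be prototyped in memory before distributing the work with Hadoop. The following is a minimal sketch, assuming documents are given as a dictionary from document ID to text. The tokenizer, the tiny stop-word list, and the function names are hypothetical choices, and stemming is left out.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical, deliberately tiny list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Return {term: {doc_id: [positions]}} for all non-stop-word terms.

    The term frequency in a document is the length of its position list.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            if term in STOP_WORDS:
                continue
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

# Hypothetical toy collection.
documents = {1: "the web graph", 2: "graph of the web graph"}
index = build_index(documents)
for term in sorted(index):
    print(term, index[term])
```

Positions are counted before stop-word removal, so they refer to the original token offsets in each document.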
Projects 10) Personalized ranking of news.
• Create a system that re-ranks news articles according to the user's interests. Users can specify their interests by selecting them from a list of keywords (e.g., gossip, sport, politics, …). The system uses an algorithm that ranks the news articles according to the user's preferences.
• Tip: the students can use different sources for collecting the news articles.
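As a starting point, the re-ranking step can be as simple as counting how often the selected keywords occur in each article. The sketch below is only a baseline under that assumption, not a recommended ranking algorithm. The article format (dictionaries with "id" and "text" fields) and the function name are hypothetical.

```python
def rerank(articles, interests):
    """Sort articles by the number of occurrences of the user's keywords.

    Articles with the same score keep their original relative order.
    """
    keywords = [keyword.lower() for keyword in interests]

    def score(article):
        words = article["text"].lower().split()
        return sum(words.count(keyword) for keyword in keywords)

    return sorted(articles, key=score, reverse=True)

# Hypothetical toy articles.
articles = [
    {"id": 1, "text": "Politics update from the parliament"},
    {"id": 2, "text": "Sport news and more sport results"},
    {"id": 3, "text": "Gossip about a famous singer"},
]
ranked = rerank(articles, ["sport", "gossip"])
print([article["id"] for article in ranked])
```

Students can replace the scoring function with a more refined measure, for example one that weights terms by how rare they are in the collection.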
Projects 11) Enrich News using Tweets.
• Enrich a news site with the information published by the users of Twitter. Given a news article, the system can gather all the user tweets about it and show the news article along with the tweets.
• Tip: students can use news about concerts of famous singers, or about strikes, riots…
• Tip: students can decide to show a timeline of tweets at the top of the page, or to rank the tweets and show the top-n tweets on the left of the page.
Projects 12) Enrich News using Wikipedia.
• Enrich the facts reported in news pages with information extracted from Wikipedia. Given a news article, identify the names of the people mentioned in the article and, for each of them, report the Wikipedia information about their life.
• Tip: the students can use the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) for the entity-extraction task. It makes it easy to find the names of famous people.
• Tip: the students can use the whole Wikipedia page or paragraphs extracted from it.
Other important information
• Graph datasets: students who want to work on graphs but cannot crawl a portion of the Web can find some large graphs here: http://law.di.unimi.it/datasets.php.
• News datasets: students who want to work on news articles but cannot collect the pages from the Web can send me an email.
• Some famous graph tools:
  • Gephi (https://gephi.org/),
  • METIS (http://glaros.dtc.umn.edu/gkhome/views/metis) for graph-partitioning.
• For questions, send me an email. I will reply ASAP.