Notes on Final Project of MIR Course
Part I: Crawling Phase
Modern Information Retrieval Course, Semantic Web Research Laboratory
Crawling Phase
• Crawling the Dmoz directory
• It has a taxonomic (tree-like) structure
• Each subdirectory is crawled by one group
Crawling Phase
• This tree-like structure has two important components:
  • Internal nodes (also known as “topics”)
  • Leaves (also known as “pages”)
[Figure: tree diagram with internal nodes marked as topics and leaves marked as pages]
Crawling Phase
• Each topic has:
  • a list of children (subtopics)
  • a unique path to the root node (supertopics)
  • a description
  • a list of related pages
• Each page has:
  • a topic
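The topic and page attributes listed above can be sketched as a small in-memory data structure. This is only an illustration: the class names `Topic` and `Page`, the field names, and the helper methods are assumptions made here, not part of the project specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# repr/eq are disabled because parent and child references form cycles.
@dataclass(eq=False, repr=False)
class Page:
    url: str
    title: str
    topic: "Topic"  # each page belongs to one topic


@dataclass(eq=False, repr=False)
class Topic:
    name: str
    description: str = ""                      # may be a zero-length string
    parent: Optional["Topic"] = None           # None only for the root
    children: List["Topic"] = field(default_factory=list)  # subtopics
    pages: List[Page] = field(default_factory=list)        # related pages

    def add_child(self, name: str, description: str = "") -> "Topic":
        child = Topic(name=name, description=description, parent=self)
        self.children.append(child)
        return child

    def add_page(self, url: str, title: str) -> Page:
        page = Page(url=url, title=title, topic=self)
        self.pages.append(page)
        return page

    def path_to_root(self) -> List[str]:
        """Names of the supertopics from the root down to this topic."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))

    def full_name(self) -> str:
        return "/".join(self.path_to_root())
```

Storing a single `parent` reference per topic is what makes the path to the root unique; the full topic name (such as Top/Science/Agriculture) then follows from walking that chain.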
Crawling Phase
• Each topic has some characteristics
[Figure labels: the current topic (node); description of the current topic; list of supertopics; list of subtopics; list of related pages (leaves)]
Crawling Phase
• Deliverables for the first phase:
  • TopicNames.txt
    • Each line contains a topic number and the full name of that topic, separated by a tab character (e.g. 46 Top/Science/Agriculture).
  • TopicDescs.txt
    • Each line contains a topic number and the description of that topic, separated by a tab character. For some topics, the description is a zero-length string.
  • TopicHierarchy.txt
    • Each line contains a pair of topic numbers, separated by a tab character. The first of the two topics is the parent of the second. Each topic has exactly one parent, except for the root (topic 0), which has no parent.
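A minimal sketch of how the three topic files could be read back and checked against the rules above. The function names are hypothetical, and the handling of blank lines and line endings is an assumption. Note that only the line ending is stripped, so a zero-length description after the tab is preserved.

```python
def parse_numbered_lines(lines):
    """Parse 'number<TAB>text' lines (TopicNames.txt, TopicDescs.txt)."""
    result = {}
    for line in lines:
        line = line.rstrip("\r\n")  # keep a trailing tab: the text may be empty
        if not line:
            continue
        number, text = line.split("\t", 1)
        result[int(number)] = text
    return result


def parse_hierarchy(lines):
    """Parse 'parent<TAB>child' lines into a child -> parent mapping."""
    parent_of = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        parent_str, child_str = line.split("\t")
        parent, child = int(parent_str), int(child_str)
        if child == 0:
            raise ValueError(f"line {line_no}: the root (topic 0) has no parent")
        if child in parent_of:
            raise ValueError(f"line {line_no}: topic {child} has two parents")
        parent_of[child] = parent
    return parent_of


def topics_without_parent(topic_numbers, parent_of):
    """Topics other than the root that never appear as a child."""
    return [t for t in topic_numbers if t != 0 and t not in parent_of]
```

Running `topics_without_parent` over the topic numbers found in TopicNames.txt is one way to confirm that every topic except the root has exactly one parent before submitting.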
Crawling Phase
• Deliverables for the first phase:
  • DocUrls.txt
    • Each line contains a document number and its URL, separated by a tab character.
  • DocTitles.txt
    • Each line contains a document number and its title, separated by a tab character.
  • DocTopics.txt
    • Each line contains a document number and a topic number, separated by a tab character. This indicates that the document belongs to the given topic.
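The three document files share one document number per page, so they can be produced together from a single list of crawled records. The sketch below assumes each record is a (document number, URL, title, topic number) tuple. It also assumes UTF-8 output and that whitespace inside a title may be collapsed so that a stray tab or newline cannot break the one-line-per-document format. None of these choices are stated in the assignment.

```python
import os


def one_line(text):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


def doc_file_lines(docs):
    """Build the lines of the three document files from crawled records.

    docs: iterable of (doc_number, url, title, topic_number) tuples
    (a hypothetical record layout chosen for this sketch).
    """
    files = {"DocUrls.txt": [], "DocTitles.txt": [], "DocTopics.txt": []}
    for doc_number, url, title, topic_number in docs:
        files["DocUrls.txt"].append(f"{doc_number}\t{url}")
        files["DocTitles.txt"].append(f"{doc_number}\t{one_line(title)}")
        files["DocTopics.txt"].append(f"{doc_number}\t{topic_number}")
    return files


def write_doc_files(out_dir, docs):
    """Write DocUrls.txt, DocTitles.txt and DocTopics.txt into out_dir."""
    for filename, lines in doc_file_lines(docs).items():
        path = os.path.join(out_dir, filename)
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in lines)
```

Building all three files in one pass keeps the document numbers consistent across them.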
Crawling Phase
• Deliverables for the first phase:
  • Documents.zip
    • The contents of the documents, stored separately
• Samples of each output file have been added to the Assignments page (for the “Science” subdirectory)
Crawling Phase
• Naming convention:
  • Names in each subdirectory start with a special character:
Crawling Phase
• Then, for each subtree, generate numeric names for the children in breadth-first search (BFS) order.
• E.g. in the Science subdirectory:
[Figure: sample tree for the Science subdirectory; sample topics are numbered 1 to 5 and sample pages are labelled L1 to L8]
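BFS numbering of a subtree can be sketched as follows. The `children` mapping, the `prefix` parameter (standing in for the subdirectory's special character, which is not given in this text) and the choice to give the subtree root the first number are all assumptions for illustration. Pages could be numbered in the same traversal order.

```python
from collections import deque


def bfs_names(children, root, prefix="", start=1):
    """Assign names to topics in breadth-first order.

    children: dict mapping a topic to the ordered list of its subtopics.
    prefix:   hypothetical placeholder for the subdirectory's special character.
    """
    names = {}
    counter = start
    queue = deque([root])
    while queue:
        topic = queue.popleft()          # visit topics level by level
        names[topic] = f"{prefix}{counter}"
        counter += 1
        queue.extend(children.get(topic, []))
    return names
```

With BFS order, all topics at one depth receive consecutive numbers before any deeper topic is numbered, so sibling topics stay adjacent in the output files.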
Crawling Phase
• Assignment of subdirectories to groups: