Learn about cloud computing, Nutch web search software, and Google App Engine development platform. Explore features, benefits, and how to get started with these technologies.
HOMEPAGE & SEARCH ENGINE 2008.12.08
Contents • 2. About Cloud computing • 3. Application Introduction - Nutch - Google App Engine • 4. Presentation
2. What is Cloud computing? • Cloud computing is Internet-based ("cloud") development and use of computer technology ("computing"). • Cloud computing is a general concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.
3-1. What is ‘Nutch’? • Open source web-search software based on Lucene • Originally a sub-project of the Apache Lucene project • Its goal is to make Lucene easier to use • Lucene Java: • Apache's very well-known open source search engine
3-1. What is Nutch? • Transparency. • Nutch is open source, so anyone can see how the ranking algorithms work. • Understanding. • Nutch has been built using ideas from academia and industry • for instance, core parts of Nutch are currently being re-implemented to use the Map Reduce distributed processing model • Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
3-1. What is Nutch? • Extensibility. • Nutch is very flexible • it can be customized and incorporated into your application. • For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism.
3-1. What is Nutch? • Nutch divides naturally into two pieces: • the crawler • the searcher • Crawl • Collects pages • Builds an index of the pages • The index serves as the bridge between Crawl and Search • Search • Finds and displays the information needed to answer the user's query
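A minimal Python sketch of how an index can bridge the crawl side and the search side. This is a toy in-memory inverted index, not Nutch's Lucene-based implementation; the page contents and the names `build_index` and `search` are hypothetical and used only for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Crawl side: map each lowercase word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Search side: return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical crawled pages (URL -> page text).
pages = {
    "http://example.com/a": "Nutch is an open source search engine",
    "http://example.com/b": "Lucene is a search library",
}
index = build_index(pages)
print(search(index, "search engine"))
```

The searcher never touches the pages themselves; it only reads the structure the crawl side produced, which is the sense in which the index is the bridge.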
3-1. What is Nutch? • More detail about crawler • the Nutch crawler system produces three key data structures: • The WebDB containing the web graph of pages and links. • A set of segments containing the raw data retrieved from the Web by the fetchers. • The merged index created by indexing and de-duplicating parsed data from the segments.
3-1. What is Nutch? • More detail about searcher • The searcher needs the index and segments produced by the crawler. • Nutch looks for these in the index and segments subdirectories of the directory defined in the searcher.dir property. • The default value for searcher.dir is the current directory (.), which is the directory from which you started Tomcat.
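The lookup rule above can be sketched in Python as follows. The helper `searcher_paths` is hypothetical and is not part of Nutch; it only illustrates why the directory from which Tomcat is started matters when searcher.dir keeps its default value.

```python
import os


def searcher_paths(searcher_dir="."):
    """Return the index and segments paths under the given searcher.dir."""
    base = os.path.abspath(searcher_dir)
    return os.path.join(base, "index"), os.path.join(base, "segments")


# With the default ".", both paths are resolved against the current directory.
index_path, segments_path = searcher_paths()
print(index_path)
print(segments_path)
```

If the process is started from a directory that has no index and segments subdirectories, both paths point at nothing and the search has no data to read.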
3-1. What is Nutch? • Generate a list of URLs from the crawl db. • Fetch the listed URLs into a segment. • Parse the contents fetched into the segment. • Update the crawl db with the parsed data from the segment. • Invert the links found in the segments. • Build an index over the segment documents and their anchor text. • Repeat this cycle.
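A toy Python sketch of the cycle above, under stated assumptions: the `WEB` dictionary stands in for real HTTP fetching, the crawl db is a plain dict of URL to status, and the indexing step is omitted. The function name `crawl` and its parameters `depth` and `top_n` are hypothetical and only echo the options of the Nutch crawl command; the sketch runs link inversion once after the fetch rounds.

```python
# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.com/": ("home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def crawl(seed_urls, depth, top_n):
    crawl_db = {url: "unfetched" for url in seed_urls}
    segments = []
    for _ in range(depth):
        # Generate: pick up to top_n unfetched URLs from the crawl db.
        fetch_list = [u for u, s in crawl_db.items() if s == "unfetched"][:top_n]
        if not fetch_list:
            break
        # Fetch and parse: store text and outgoing links in a new segment.
        segment = {u: WEB.get(u, ("", [])) for u in fetch_list}
        segments.append(segment)
        # Update: mark fetched URLs and add newly discovered links.
        for url, (_text, links) in segment.items():
            crawl_db[url] = "fetched"
            for link in links:
                crawl_db.setdefault(link, "unfetched")
    # Invert links: target URL -> list of pages that link to it.
    inlinks = {}
    for segment in segments:
        for url, (_text, links) in segment.items():
            for link in links:
                inlinks.setdefault(link, []).append(url)
    return crawl_db, segments, inlinks


crawl_db, segments, inlinks = crawl(["http://example.com/"], depth=3, top_n=10)
print(sorted(crawl_db))
print(inlinks["http://example.com/b"])
```

Each round produces one segment, so a larger depth lets the crawl reach pages further from the seed URLs.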
3-1. What is Nutch? • How to run Nutch • Start crawling from the directory where Nutch is installed >> bin/nutch crawl urls -dir crawl -depth 3 -topN 10 • Start Tomcat 5.5 • Note: Tomcat must be started from the Nutch directory >> /opt/apache-tomcat-5.5.27/bin/catalina.sh start • Open http://localhost:8080/en/
3-2. Development environment of Nutch • Nutch 0.9 from the Apache Nutch homepage • Java JDK 6 • Tomcat 5.5 or later • OS: Linux Server Edition (Cygwin for Windows developers)
3-3. What is ‘Google App Engine’? • Google's cloud computing project • Google's web application platform • Easy to build, easy to maintain, and easy to scale as your traffic and data storage needs grow • With App Engine there are no servers to maintain: just upload an application, and it's ready to serve your users.
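3-3. What is ‘Google App Engine’? • Features provided by Google App Engine • The standard features of Python • Because the platform is built on Python • A robust Datastore backed by BigTable/GFS technology • A Google-built database that plays the role of existing databases such as Oracle and MySQL • Scalable hosting space • Free ‘Google’ account • Local development and testing with the SDK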
3-3. What is ‘Google App Engine’? • Google’s Motto: • “Web Development that doesn’t hurt” • With Google App Engine, web service developers gain the option of developing without extra pain. • Google App Engine provides load balancing, automatic scaling, and dynamic web serving, so developers can focus on application development alone. • However, this choice comes with three constraints. • 1. All code must be written in Python. • Perl support is currently under development • 2. Usage quotas mean that you may have to pay. • Free quota • 500MB of persistent storage and enough CPU and bandwidth for about 5 million page views a month • 3. All data lives on Google's platform and is held by Google. As a result, an application tied to Google's platform cannot easily leave it. • The third constraint is the most serious drawback of Google App Engine
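To put the free quota of about 5 million page views a month in perspective, the average request rate can be worked out with a few lines of Python. The 30-day month and the even spread of traffic are assumptions made only for this estimate.

```python
PAGE_VIEWS_PER_MONTH = 5_000_000  # figure quoted on the slide
DAYS_PER_MONTH = 30               # assumption for the estimate
SECONDS_PER_DAY = 24 * 60 * 60

per_day = PAGE_VIEWS_PER_MONTH / DAYS_PER_MONTH
per_second = per_day / SECONDS_PER_DAY

print(round(per_day))        # average page views per day
print(round(per_second, 2))  # average page views per second
```

Under these assumptions the free quota corresponds to roughly 167,000 page views a day, or a little under two per second on average; real traffic is rarely this even.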
3-3. What is Google App Engine? • How to run Google App Engine • Move to the directory where Google App Engine is installed • Commands • dev_appserver.py bono/ : runs the application locally for testing • appcfg.py update bono/ : uploads the application to the web • An ID and password must be entered for each upload • Check the result • http://localhost:8080/ • http://flyingbono.appspot.com
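The slides do not show the contents of the bono/ application directory. As a stand-in, here is a generic WSGI hello-world in Python that uses only the standard library; it is a sketch of the kind of request handler such an application contains, not the actual bono code and not App Engine's own webapp framework. The name `application` and the response text are hypothetical.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Generic WSGI app: answer every request with a plain-text greeting."""
    body = b"Hello from a sketch app\n"
    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


# Call the app directly with a fake request instead of starting a server.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers


result = application(environ, start_response)
print(captured["status"], b"".join(result))
```

Calling the handler directly like this is only a quick local check; on App Engine the local test server and the upload step are handled by the two SDK commands listed above.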
3-4. The Development Environment • Google App Engine software development kit (SDK) • Python 2.5 • On Windows, Python must be installed separately (for example, ActivePython) • OS: Windows, Mac OS X, Linux
4. Presentation • Nutch • Google App Engine + Nutch • Another example of using Google App Engine