Time Series Center: a next-generation search engine using semantics, machine learning, and GPGPU. Pavlos Protopapas (Harvard CfA and SEAS), Rahul Dave, Gabriel Wachman, Matthias Lee, Roni Khardon
time series center: short overview, what it is about and what we do
database, web interface: data model/database design; web services; web interface (demo)
analysis: classification using kernels; results
search engine: morphological searches; GPGPU architecture; web interface (demo)
overview
idea: create the largest collection of time series in the world and make interesting science discoveries.
recipe: 5 tons of data, a dozen people with science questions, 2-3 people with skills, 2 tons of hardware.
focus: astronomy (light curves = time series). we have other data too, such as labor data, real estate data, heart monitor data, archeological data, brain activity, etc.
time series center
cyclist's heart rate, TCP packets, cadence design, stock prices, fish catch of the North-east Pacific
time series everywhere
Supporting Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures, VLDB 2006: 882-893. Eamonn Keogh, Li Wei, Xiaopeng Xi, Michail Vlachos, Sang-Hee Lee, Pavlos Protopapas
data
MACHO (microlensing survey): 66 million objects; 1000 flux observations per object in 2 bands (wavelengths)
TAOS (outer solar system survey): 100,000 objects; 100K flux observations per object; 4 telescopes
ESSENCE (supernovae survey): thousands of objects, hundreds of observations
Minor Planet Center light curves: a few hundred objects, a few hundred observations
Pan-STARRS: billions of objects (video mode data too)
OGLE (microlensing and extra-solar planet surveys)
MMT variability studies, some HAT-NET
SDSS 82
EROS
SuperMACHO (another microlensing survey): close to a million objects; 100 flux observations per object
DASCH: Digital Access to a Sky Century @ Harvard
astronomy
eclipsing binaries: determining masses and distances to objects
extra-solar planets: either discovery of extra-solar planets or statistical estimates of the abundance of planetary systems
cosmology: supernovae from Pan-STARRS will help determine cosmological constants
asteroids, Trans-Neptunian objects, etc.: using occultation signals for the detection of the smallest objects at the edge of the Solar System
AGN: automatic classification of AGN from the time series
variable stars: study of variable stars
microlensing: constraining the dark matter question
and many more
• outlier/anomaly detection: find the anomalous cases
• clustering: unsupervised clustering could help identify new subclasses
• classification: automatic classification, either supervised or semi-supervised
• motif detection: finding patterns, especially at low signal-to-noise ratio
• scalability: analyzing a large data set requires efficient algorithms that scale linearly in the number of time series
• feature space representation of the time series: Fourier transform, wavelets, piecewise linear, and symbolic methods
• distance metric: determines similarities in time series
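As a hedged illustration of the feature-space idea in the list above, a minimal sketch using one of the named representations (the Fourier transform); the function names and the choice of `k` are my own for illustration, not the center's code:

```python
import numpy as np

def dft_features(ts, k=8):
    # Keep only the first k Fourier coefficients as a compact
    # feature vector; low frequencies capture the overall shape.
    return np.fft.rfft(np.asarray(ts, dtype=float))[:k]

def feature_distance(a, b, k=8):
    # Euclidean distance in the truncated Fourier feature space.
    # Truncation can only discard energy, so (suitably normalized)
    # this distance lower-bounds the true Euclidean distance,
    # which is what makes feature-space indexing safe for search.
    return np.linalg.norm(dft_features(a, k) - dft_features(b, k))
```

A search then compares cheap k-dimensional features first and computes full distances only for the surviving candidates.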
• size: data sizes in astronomy, medicine and other fields are presently exploding. the time series center needs to be prepared for data rates starting in the 10's of gigabytes per night, scaling up to terabytes soon. parallel file systems (GPFS, Lustre, NFS) do not perform well
• interplay: the interplay between the algorithms used to study the time series and the appropriate database indexing of the time series itself must be optimized
• speed: real-time access and real-time processing
• distributed computing and dissemination of data: standards (VO), subscription queries, etc.
computational challenges
disk: ~100 TB of disk (GPFS, Lustre, NFS)
computing nodes: part of the Odyssey cluster at Harvard (~5362 cores)
db server: dual core with 16 GB of memory and 2 TB of disk; a few servers for development
exotic: GPGPU dedicated machine, Nvidia GTX 285 (2 GB, 240 cores, core speed 1400 MHz); GPU cluster (Nikola) with 16 machines, an Nvidia Tesla T10 GPU attached to each node
web server: 2 dual machines [who cares]
hardware
a source (usno_id=xxx, tsc_id=yyy) links astobjects from different surveys: A.23.1234 (Survey A, Field 23), A.123.5632 (Survey A, Field 132), B.71.12 (Survey B, Field 71)
data model
a source (usno_id=xxx, tsc_id=yyy) groups astobjects (SURVEY: MACHO, FIELD: 012, STAR_ID: 87; SURVEY: OGLE, FIELD: 12, STAR_ID: 0894343); each astobject has LCOTs per band (BAND: R, BAND: B), and each LCOT consists of snippets (START_TIME: 01/13/98, END_TIME: 05/1/98)
data model
provenance attached to a judgement: who (entity: Wachman, OGLE), using (astobject SURVEY: MACHO, FIELD: 012, STAR_ID: 87; snippet START_TIME: 01/13/98, END_TIME: 05/1/98; LCOT BAND: R), algorithm/method (SVM/kNN, CMD), what is it about (variable judgement: Eclipsing Binary, Cepheid, Quasar; variability)
data model
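The source → astobject → LCOT → snippet hierarchy sketched on these data-model slides can be written out as a hypothetical Python sketch; the class and attribute names are read off the slides (with `obs_field` standing in for FIELD to avoid clashing with `dataclasses.field`), not the actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Snippet:                # one contiguous observing run
    start_time: str
    end_time: str

@dataclass
class LCOT:                   # per-band light curve container
    band: str                 # e.g. "R" or "B"
    snippets: List[Snippet] = field(default_factory=list)

@dataclass
class AstObject:              # one survey's view of a star
    survey: str               # e.g. "MACHO", "OGLE"
    obs_field: str
    star_id: str
    lcots: List[LCOT] = field(default_factory=list)

@dataclass
class Source:                 # physical object, linked across surveys
    usno_id: str
    tsc_id: str
    astobjects: List[AstObject] = field(default_factory=list)
```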
Pulsating, Eruptive, Rotating, Intense Variable X-ray Sources, … RR Lyrae, Cepheid, DCEPS, CWB, BCEP, …
variable tree
why web services? grandma can program. ASCII and JSON. the web site is a web services client. you can program with wget, curl, perl, python, etc. next? VAO compatibility, i.e. XML. we have a python language API, provided inside the cluster and for outside usage
web services
★ everything is web services
★ every query gets a UID
★ with the UID: summary, page info, results; use it to build extra queries
GET: http://timemachine.iic.harvard.edu/search/lcdb/astobject/filter=survey__exact:ASAS3/
POST: /astobject/justquery/
searchtext: survey = ASAS3; variable = EBI
queryname: "Myquery"
prevquerycontext: "68b994ca-08a3-4d79-affd-8cc1636aa6c2"
privacy: false
web services
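The GET/POST patterns above can be mirrored with small helper functions; these helpers and their names are hypothetical illustrations built around the endpoints shown on the slide, not part of the published API:

```python
BASE = "http://timemachine.iic.harvard.edu"

def filter_url(entity, **filters):
    # Build a GET filter URL of the form
    # .../search/lcdb/astobject/filter=survey__exact:ASAS3/
    clauses = "/".join(f"filter={k}:{v}" for k, v in filters.items())
    return f"{BASE}/search/lcdb/{entity}/{clauses}/"

def query_payload(searchtext, queryname, prevquerycontext=None, privacy=False):
    # Build the POST body for /astobject/justquery/; every query
    # returns a UID that can be passed back as prevquerycontext
    # to build follow-up queries.
    payload = {"searchtext": searchtext, "queryname": queryname, "privacy": privacy}
    if prevquerycontext is not None:
        payload["prevquerycontext"] = prevquerycontext
    return payload

# The actual HTTP call would go through wget/curl or e.g. the
# Python requests library: requests.get(filter_url(...)).
```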
we have millions of light curves. triangular inequality, tree structures, Euclidean distance between z-normalized light curves Q and C
what is similar?
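A minimal sketch of these ingredients, assuming z-normalized Euclidean distance and a single reference pivot for the triangular-inequality bound (the real system uses tree structures, i.e. many pivots):

```python
import numpy as np

def znorm(ts):
    # z-normalize so similarity ignores offset and scale
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()

def dist(q, c):
    # Euclidean distance between z-normalized light curves Q and C
    return np.linalg.norm(znorm(q) - znorm(c))

def nearest(query, candidates, pivot):
    # Triangular inequality gives |d(q,p) - d(c,p)| <= d(q,c):
    # if that cheap lower bound already exceeds the best distance
    # found so far, d(q,c) never needs to be computed.
    dqp = dist(query, pivot)
    dcp = [dist(c, pivot) for c in candidates]  # precomputed offline in practice
    best_i, best = -1, np.inf
    for i, c in enumerate(candidates):
        if abs(dqp - dcp[i]) >= best:
            continue  # pruned without touching the light curve
        d = dist(query, c)
        if d < best:
            best_i, best = i, d
    return best_i, best
```

Because the bound is a true lower bound, pruning never changes the answer, only the number of full distance computations.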
DEVICE: GeForce GTX 285
30 multiprocessors, 8 cores per multiprocessor = 240 cores; each core has 2 ports = 3 instructions; 1.4 GHz; ~0.9 TFLOPS
512 threads/multiprocessor = 15360 threads, organized in 1 block per multiprocessor (3D array) within a grid of blocks
per multiprocessor: 8 KB shared memory; also texture memory, constant memory, and 2 GB DDR2 global memory
HOST (CPU)
GPGPU
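The slide's headline numbers follow from simple arithmetic (the 3-instructions-per-core-per-cycle figure is taken from the slide; actual peak-FLOPS formulas for this hardware vary by source):

```python
multiprocessors = 30
cores_per_mp = 8
clock_ghz = 1.4
threads_per_mp = 512

cores = multiprocessors * cores_per_mp        # 240 cores
threads = multiprocessors * threads_per_mp    # 15360 threads
# ~3 floating-point instructions per core per cycle gives the
# quoted ~0.9-1.0 TFLOPS peak
peak_tflops = cores * clock_ghz * 3 / 1000.0
```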
compile: C/C++ goes to gcc, CUDA goes to nvcc

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    ...
}

int main()
{
    ...
    cudaMalloc((void**)&d_A, size);   // allocate device memory
    // Kernel invocation: 1 block of N threads, on the device pointers
    VecAdd<<<1, N>>>(d_A, d_B, d_C);
}
CUDA
done: built the infrastructure to host data, search, and distribute; methodology (computational); new discoveries
doing: more data (EROS, SDSS 82); more hardware (mainly disk, ~50 TB); renormalized db, slice indexing
summary