
The Enterprise Technology Driven, Marketing Services Company


Presentation Transcript


  1. The Enterprise Technology Driven, Marketing Services Company • Aasif Bagdadi, Director Of Engineering • https://www.linkedin.com/in/aasifbagdadi

  2. Unique Data Asset • Automotive transactional data on 61% of US households • Customers • Households • Vehicles

  3. List Manager - Entity Relationship • Customer/Household: Name, Address, Distance, Email, Phone, Wireless, NCOA, Compliance, Federal DNC, EBR • Vehicle Profile : VIN, Make, Model, Year, Sale Date, Sale Amount, Last observed Mileage, Lease, Loan, Warranty, Extended Warranty, Pre Paid Maintenance, AMPD • Deal: Owner, Purchase Date, Purchase Amount, Sales Person, Lease, Loan, Warranty, Odometer • Service: Service Date, Mileage, Service Advisor, Warranty Pay, Internal Pay, Customer Pay, Parts, Labor, Services Performed, Services Declined, Discounts. • Campaigns: Customer, Date, Communication, Channel, Offers • Responders: Response Date, Transactions, Days to Respond • Forecast Communications: Date, communications • Ownership: Store, Store Group, OEM, Data boundary • Uploaded List: Conquest List or any List acquired from external sources

  4. Purchase List Manager Complex search Vehicle Customer / Household Campaigns Service Future Communications

  5. List Manager Adhoc Search • Find Customers that are within 50 miles • Find Customers that have bought { Make } in last { Y } Year • That have Serviced their vehicles between { M1 } & { M2 } months in the past • That had the following service performed {Opcode1} or {Opcode2} performed • That had the following service declined {ASR1} or {ASR2} • That had been mailed between {D1} and {D2} dates • And have not yet Responded.

  6. List Manager Advanced Search

  7. List Manager V 1.0 Using SQL • 2005 – 2006 time frame • SQL Server based • Dynamic SQL • Implementation: • Table-valued function for each entity • Batch processes • Requests are queued. • A job applies all the search criteria. • Pros: • Simple to use • Easy to build • Data is available to search in near real time • Cons: • Slow. Took hours just to get a count. • Did not provide results in real time • No caching

  8. List Manager v2.0 Using Cubes • 2010 time frame • Uses SSAS (SQL Server 2008 R2) • Applies the search criteria and gets the counts extremely fast • Lists can be batched • Implementation: • SSAS cubes (MOLAP) • Dynamic MDX • Use MDX to query the count • Use MDX to get the keys • Use dimensions / attributes to filter • Join with SQL on the keys to get the list details (name, address, etc.) • Pros: • Extremely fast • Sub-second response on counts • MDX queries are cached • Cons: • Complex MDX • Cube refresh / partition reprocessing • Dimension size constraint of 4 GB (SQL Server 2012 has options to overcome this limit) • Cube changes require the entire cube to be offline • Weak scale-out options

  9. List Manager 3.0 • 2014 (currently in development) • Uses Elastic Search in the cloud • API layer written in Node.js • Front end (C#, MVC, jQuery, JSON) • Change data capture (CDC; selected columns and tables, multiple databases) • Data pump (C#, multi-threaded, Windows service, compressed JSON, bulk API) • Pros: • Good scale-out • High availability • Optimized for search • Heavy caching (filters) • Read-only replicas • Document based • Allows better control of incremental data changes • Addresses the volume, velocity and variety of data (a.k.a. Big Data) • Cons: • Technology is still emerging

  10. Big Data using Elastic Search

  11. Define Big Data • Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of data is too great.

  12. Elasticsearch • Real time • Search & analytics engine • Distributed • Scales massively • High availability • RESTful API • JSON over HTTP • Schema free • Multi tenancy • Open source • Lucene based

  13. API • curl -XGET 'localhost:9200/?pretty' • Verb (GET, PUT, …) • Node • Port • Path • Response: { "name" : "Exploding Man", "tagline" : "You Know, for Search", "ok" : true, "status" : 200, "version" : { "number" : "0.90.7", "snapshot_build" : false } }

  14. Input Data • Pattern: PUT /index/type/id • Request: PUT /myapp/tweet/1 -d ' { "tweet": "I think #elasticsearch is AWESOME", "nick": "@clintongormley", "name": "Clinton Gormley", "date": "2013-06-03", "rt": 5, "loc": { "lat": 13.4, "lon": 52.5 } } ' • Response: { "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 1, "ok": true }

  15. Retrieve Data • Request: GET /myapp/tweet/1 • Response: { "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 1, "exists": true, "_source": { ...OUR TWEET... } }

  16. Update Data • Request: PUT /myapp/tweet/1 -d ' { "tweet": "I know #elasticsearch is AWESOME", "nick": "@clintongormley", "name": "Clinton Gormley", "date": "2013-06-03", "rt": 5, "loc": { "lat": 13.4, "lon": 52.5 } } ' • Response: { "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 2, "ok": true } • Note: an update is an atomic delete and put

  17. Delete Data • Request: DELETE /myapp/tweet/1 • Response: { "_index": "myapp", "_type": "tweet", "_id": "1", "_version": 3, "ok": true, "found": true }
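
The index, retrieve and delete calls above all follow the same verb + node + port + path shape. A minimal Python sketch of assembling such requests is shown here. The helper name build_request is hypothetical, the node and port default to the localhost:9200 values used in the slides, and the requests are only built, not sent.

```python
import json
import urllib.request


def build_request(verb, path, body=None, node="localhost", port=9200):
    """Assemble an HTTP request (verb + node + port + path) without sending it."""
    url = f"http://{node}:{port}{path}"
    data = None
    headers = {}
    if body is not None:
        # Documents travel as JSON over HTTP.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=verb)


# Shortened version of the tweet document from the slides.
tweet = {
    "tweet": "I think #elasticsearch is AWESOME",
    "nick": "@clintongormley",
    "rt": 5,
}

index_req = build_request("PUT", "/myapp/tweet/1", tweet)
get_req = build_request("GET", "/myapp/tweet/1")
delete_req = build_request("DELETE", "/myapp/tweet/1")

for req in (index_req, get_req, delete_req):
    print(req.get_method(), req.full_url)
```

With a node actually running, each request could be passed to urllib.request.urlopen to execute it.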

  18. RDBMS lingo • MySQL/Oracle/SQL Server => Databases => Tables => Columns/Rows • Elastic Search => Indices => Types => Documents with Properties • An Elastic Search cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).

  19. Glossary • Node: A node is a running instance of elasticsearch which belongs to a cluster. • Shard: A shard is a single Lucene instance. It is a low-level "worker" unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards. • Primary Shard: Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard. • Replica Shard: Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes: a) provide failover if the primary is lost b) increase read performance

  20. Glossary • Index: An index is like a database in a relational database. • Type: A type is like a table in a relational database. Each type has a list of fields that can be specified for documents of that type. • Document: A JSON document which is stored in elasticsearch. It is like a row in a table in a relational database. • Field: A document contains a list of fields, or key-value pairs. The value can be a simple (scalar) value (e.g. a string, integer, date), or a nested structure like an array or an object. A field is similar to a column in a table in a relational database. • Mapping: A mapping is like a schema definition in a relational database. The mapping defines how each field in the document is analyzed. • Routing: When you index a document, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default the routing value is derived from the ID of the document.
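
The routing rule in the glossary can be sketched in a few lines of Python: hash the routing value, then take the remainder by the number of primary shards. This is only an illustration. zlib.crc32 is a stand-in for whatever hash function Elasticsearch uses internally, and pick_shard is a hypothetical name.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document ID) to a primary shard number."""
    # Assumption: crc32 stands in for Elasticsearch's internal hash function.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]

# The same IDs routed across an index with 3 primaries and one with 4 primaries.
with_3 = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_4 = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}

print(with_3)
print(with_4)
```

Because the shard number depends on the modulus, changing the number of primary shards would generally send existing documents to different shards, which is one way to see why that number is fixed when an index is created.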

  21. Core Field Types • Strings: string • Datetimes: date • Whole numbers: byte, short, integer, long • Floats: float, double • Booleans: boolean • Objects: object • Also: multi_field, ip, geo_point, geo_shape

  22. Auto Detection of Field • "foo bar" → string • "2013-01-01" → date • 10 → byte, short, integer, long • 10.0 → float, double • true → boolean • { foo: "bar" } → object • ["foo","bar"] → no special mapping; any field can have multiple values
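
A rough Python sketch of this kind of type detection follows. It is a simplification under stated assumptions: detect_field_type is a hypothetical name, only the yyyy-MM-dd date format from the slide is recognized, and whole numbers and floats are reported as long and double, matching the types that appear in the tweet mapping elsewhere in this deck.

```python
from datetime import datetime


def detect_field_type(value):
    """Guess a field type from a parsed JSON value (simplified sketch)."""
    # bool must be checked before int: in Python, True is also an int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays need no special mapping: the field takes the type of its values.
        return detect_field_type(value[0]) if value else None
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    return None


samples = ["foo bar", "2013-01-01", 10, 10.0, True, {"foo": "bar"}, ["foo", "bar"]]
for sample in samples:
    print(repr(sample), "->", detect_field_type(sample))
```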

  23. Some more Glossary • Term: A term is an exact value that is indexed in elasticsearch. The terms foo, Foo, FOO are NOT equivalent. • Text: Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text will be analyzed into terms, which is what is actually stored in the index. Text fields need to be analyzed at index time in order to be searchable as full text, and keywords in full text queries must be analyzed at search time to produce (and search for) the same terms that were generated at index time. • Analysis: Analysis is the process of converting full text to terms. Depending on which analyzer is used, these phrases: FOO BAR, Foo-Bar, foo,bar will probably all result in the terms foo and bar. These terms are what is actually stored in the index.

  24. Tokenizer: Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation. • Facets: They enable you to calculate and summarize data about the current query on-the-fly. They can be used for all sorts of tasks such as dynamic counting of result values or even distribution histograms. Facets only perform their calculations one level deep, and they can't be easily combined. • Aggregations: Aggregations are similar to facets in many ways, and overcome the limitations of facets. Indeed, aggregations are meant to eventually replace facets altogether. Facets should be considered deprecated and will likely be removed in a future major release. One of the major limitations of facets is that you can't have facets of facets; that is, facets cannot be nested. The ability to nest aggregations therefore brings a great deal of power that was missing in facets. • The two broad families of aggregations are metrics aggregations and bucket aggregations. Metrics aggregations calculate some value (like an average) over a set of documents, and bucket aggregations group documents into buckets.
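
To make the bucket/metric distinction concrete, here is a small pure-Python imitation of a bucket aggregation with a metric nested inside it. It does not call Elasticsearch. The sample tweets and the helper names terms_buckets and avg_metric are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents.
tweets = [
    {"nick": "@mary", "rt": 5},
    {"nick": "@mary", "rt": 1},
    {"nick": "@john", "rt": 4},
]


def terms_buckets(docs, field):
    """Bucket aggregation: group documents by the value of a field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def avg_metric(docs, field):
    """Metric aggregation: compute one value over a set of documents."""
    return sum(doc[field] for doc in docs) / len(docs)


# Nesting: the metric is computed inside each bucket.
result = {
    key: {"doc_count": len(bucket), "avg_rt": avg_metric(bucket, "rt")}
    for key, bucket in terms_buckets(tweets, "nick").items()
}
print(result)
```

Computing the average per bucket is the kind of nesting that the slide says facets could not express.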

  25. Schemaless, Document Oriented • No need to configure a schema upfront • No need for slow ALTER TABLE-like operations • Define a mapping (schema) to customize the indexing process, for example: • to require fields to be of a certain type • to keep text fields from being analyzed

  26. Distributed & Highly Available • Multiple nodes running in a cluster • Acting as a single service • Nodes in cluster that store data or nodes that just help in speeding up search queries • Sharding • Indices are sharded (#shards are configurable) • Each Shard can have zero or more replicas • Replicas on different servers for failover • Master • Automatic Master detection + failover • Responsible for distribution/balancing of shards

  27. A Single Node Cluster with An Index • All 3 primary shards are allocated to Node 1 • No replica shards are allocated, because there is no second node to hold them • A single node means a single point of failure • Health of the cluster: Yellow

  28. Add Failover • Add a second node to the cluster by configuring it with the same cluster name. • 3 replica shards have been allocated. • Cluster health: Green. • Now 6 shards, so there is redundancy.

  29. Scale horizontally • A 3 node cluster • One shard each from Node 1 and Node 2 has moved to Node 3 • Better performance, as each node's hardware resources (CPU, RAM, I/O) are shared among fewer shards

  30. Scale some more • More Nodes can be added • More replicas can be added • This will allow faster searches • Allows better redundancy • However the number of primary shards is fixed at the moment an index is created. • Effectively, the maximum amount of data that can be stored in the index is defined by this number. • Is this a limitation …..

  31. Coping with Node Failure • Kill Master Node • Elect a New Master (Node 2) • Primary Shard 1 and 2 were lost • Cluster Health : Red • Node 2 & 3 have Replicas of these shards, which are now promoted as primaries • Cluster Health: Yellow
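
The Green / Yellow / Red states used in the cluster slides can be summarized as a rule over shard allocation: red if any primary is unassigned, yellow if all primaries are assigned but some replica is not, green otherwise. The Python sketch below models only that rule. The data layout and function names are hypothetical and are not an Elasticsearch API.

```python
def cluster_health(shards):
    """shards: list of dicts with 'primary' (bool) and 'node' (str, or None if unassigned)."""
    primaries_ok = all(s["node"] is not None for s in shards if s["primary"])
    replicas_ok = all(s["node"] is not None for s in shards if not s["primary"])
    if not primaries_ok:
        return "red"
    if not replicas_ok:
        return "yellow"
    return "green"


def make_shards(primary_nodes, replica_nodes):
    """Build a shard list from the node (or None) holding each primary and replica."""
    shards = [{"primary": True, "node": n} for n in primary_nodes]
    shards += [{"primary": False, "node": n} for n in replica_nodes]
    return shards


# One node: 3 primaries allocated, 3 replicas with nowhere to go.
single_node = make_shards(["node1"] * 3, [None] * 3)
# Second node added: replicas are allocated to it.
two_nodes = make_shards(["node1"] * 3, ["node2"] * 3)
# A node holding two primaries dies before any replica has been promoted.
primary_lost = make_shards([None, None, "node2"], ["node2", "node3", "node3"])

for name, shards in [("single", single_node), ("two", two_nodes), ("lost", primary_lost)]:
    print(name, cluster_health(shards))
```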

  32. Beauty Of Elastic Search • In Elasticsearch, all data in every field is indexed by default. That is, every field has a dedicated inverted index for fast retrieval. And, unlike most other databases, it can use all of those inverted indices in the same query, to return results at breathtaking speed

  33. Document • refers to the top-level or root object which is serialized into JSON and stored in Elasticsearch under a unique ID. • field or property, can be a string, a number, a boolean, another object, an array of values, or some other specialized type such as a string representing a date or an object representing a geolocation

  34. Document metadata • _index: Where the document lives • _type: The class of object that the document represents • _id: The unique identifier for the document • Elasticsearch will auto generate id if not specified

  35. Creating a Document

  36. Pagination • size = number of results to return • from = number of results to skip • GET /_search?size=5&from=0 • GET /_search?size=5&from=5 • GET /_search?size=5&from=10
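
The three requests above are pages 1, 2 and 3 at five results per page. A small Python sketch of that arithmetic follows. The helper names are hypothetical, and page numbers are assumed to start at 1.

```python
def page_params(page, size=5):
    """Translate a 1-based page number into size/from parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Skip all results that belong to earlier pages.
    return {"size": size, "from": (page - 1) * size}


def search_path(page, size=5):
    """Build the search path with pagination parameters in the query string."""
    params = page_params(page, size)
    return f"/_search?size={params['size']}&from={params['from']}"


for page in (1, 2, 3):
    print("GET", search_path(page))
```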

  37. Search (basic) • GET /_search?q=mary → user named "Mary" → tweets by "Mary" → tweet mentioning "@mary" • _all field • String value from all fields

  38. GET /_search?q=2013 → 12 results • GET /_search?q=2013-06-03 → 12 results!! • GET /_search?q=date:2013-06-03 → 1 result

  39. Mapping (field definitions) • GET /myapp/tweet/_mapping • { "tweet" : { "properties" : { "tweet" : { "type" : "string" }, "name" : { "type" : "string" }, "nick" : { "type" : "string" }, "date" : { "type" : "date" }, "rt" : { "type" : "long" }, "loc" : { "type": "object", "properties" : { "lat" : { "type" : "double" }, "lon" : { "type" : "double" } } } }}} • date field: type date, indexed as the single value 2013-06-03 • _all field: type string, indexed as the terms 2013, 06, 03

  40. Exact Value Vs Full Text • Exact values: 10, 4.5, 2013-01-01, true, Foo, foo • Full text: The quick brown fox jumped over the lazy dog

  41. Inverted Index • Steps: → separate words / terms → sort unique terms → list docs containing terms → normalize terms • Doc 1 terms: The,brown,dog,fox,jumped,lazy,over,quick,the • Doc 2 terms: Quick,brown,dogs,foxes,in,lazy,leap,over,summer
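
The steps on this slide can be sketched as a tiny inverted index in Python. The two document texts are reconstructed from the term lists above and are an assumption. Normalization here is lowercasing only, with no stemming, so "fox" and "foxes" stay separate terms.

```python
import re
from collections import defaultdict

# Document texts inferred from the slide's term lists (an assumption).
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each normalized term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Separate the text into words, then normalize each word to lowercase.
        for term in re.findall(r"\w+", text):
            index[term.lower()].add(doc_id)
    # Sort the unique terms and list the docs containing each one.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


index = build_inverted_index(docs)
for term, ids in index.items():
    print(f"{term:8} {ids}")
```

Looking up a term is then a dictionary access, for example index["quick"] gives the documents that contain it.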

  42. Analysis • The index analysis module acts as a configurable registry of Analyzers that are used both to break indexed (analyzed) fields into terms when a document is indexed and to process query strings. It maps to the Lucene Analyzer. • Analyzer: tokenizer + token filters

  43. Standard Analyzer • Input: "The Quick Brown Fox jumped over the Lazy Dog!" • Standard tokenizer: The,Quick,Brown,Fox,jumped,over,the,Lazy,Dog • Lowercase filter: the,quick,brown,fox,jumped,over,the,lazy,dog • Stopwords filter: quick,brown,fox,jumped,over,lazy,dog
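
An analyzer is a tokenizer followed by token filters, and that chain can be imitated in plain Python. This is a sketch, not Lucene's implementation: the regular-expression tokenizer is a crude approximation of the standard tokenizer, and STOPWORDS is a short illustrative list that is assumed to contain "the".

```python
import re

# Short illustrative stopword list (an assumption, not Lucene's actual list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokenize(text):
    """Tokenizer: split on anything that is not a letter, digit or underscore."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Analyzer = tokenizer + token filters, applied in order."""
    return remove_stopwords(lowercase(tokenize(text)))


print(analyze("The Quick Brown Fox jumped over the Lazy Dog!"))
print(analyze("2013-06-03"))
```

The second call shows why a date string searched as full text breaks into the separate terms 2013, 06 and 03.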

  44. English Analyzer • Standard tokenizer • Lowercase filter • English stemmer: the,quick,brown,fox,jump,over,the,lazi,dog • English stopwords: quick,brown,fox,jump,over,lazi,dog

  45. Filters Vs Queries • Queries: • Full text search • Relevance scoring • Heavier • Not cacheable • Filters: • Exact matching • Binary yes/no • Fast • Cacheable

  46. Queries As a general rule, queries should be used instead of filters: • for full text search • where the result depends on a relevance score

  47. Some Query Types • Match • Specifies a field to search • _all is also a field • Example: { "match" : { "message" : "this is a test" } } • Match (boolean) • The default match query is of type boolean: the text provided is analyzed, and the analysis process constructs a boolean query from it. The operator flag can be set to or or and to control the boolean clauses (defaults to or). The minimum number of should clauses to match can be set using the minimum_should_match parameter. • Example: { "match" : { "message" : { "query" : "this is a test", "operator" : "and" } } } • Match (phrase) • The match_phrase query analyzes the text and creates a phrase query out of the analyzed text. • Example: { "match_phrase" : { "message" : "this is a test" } }
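
How a boolean match query could be built from analyzed text is sketched below in Python. This illustrates the idea only and is not Elasticsearch's internal code. The analyze function is a simplified stand-in for a real analyzer, and match_to_bool is a hypothetical name.

```python
import json
import re


def analyze(text):
    """Simplified stand-in analyzer: split into words and lowercase them."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def match_to_bool(field, text, operator="or"):
    """Turn analyzed text into a bool query with one term clause per term."""
    clauses = [{"term": {field: term}} for term in analyze(text)]
    # "and" requires every term (must); "or" makes each term optional (should).
    occur = "must" if operator == "and" else "should"
    return {"bool": {occur: clauses}}


or_query = match_to_bool("message", "this is a test")
and_query = match_to_bool("message", "this is a test", operator="and")

print(json.dumps(or_query))
print(json.dumps(and_query))
```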

  48. Multi Match • Multi match query • Multiple fields to search • Fields can be identified using wildcards • Fields can be boosted • Example: { "multi_match" : { "query" : "Will Smith", "fields" : [ "title", "*_name" ] } }

  49. Bool Query • A query that matches documents matching boolean combinations of other queries. { "bool" : { "must" : { "term" : { "user" : "kimchy" } }, "must_not" : { "range" : {"age" : { "from" : 10, "to" : 20 }}}, "should" : [ {"term" : { "tag" : "wow" } }, {"term" : { "tag" : "elasticsearch" } } ], "minimum_should_match" : 1, "boost" : 1.0 } }
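
The matching rules of the bool query above can be modeled in plain Python to show how must, must_not and should combine. This is a toy evaluator under several assumptions: it ignores scoring and boost, supports only term and range clauses, treats range bounds as inclusive, and treats a missing minimum_should_match as 0. The sample documents are made up.

```python
def clause_matches(doc, clause):
    """Evaluate a single term or range clause against a document."""
    if "term" in clause:
        field, wanted = next(iter(clause["term"].items()))
        actual = doc.get(field)
        # A field may hold one value or several.
        values = actual if isinstance(actual, list) else [actual]
        return wanted in values
    if "range" in clause:
        field, bounds = next(iter(clause["range"].items()))
        actual = doc.get(field)
        # Assumption: both bounds are inclusive.
        return actual is not None and bounds["from"] <= actual <= bounds["to"]
    raise ValueError(f"unsupported clause: {clause}")


def as_list(clauses):
    """Accept a missing clause, a single clause, or a list of clauses."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def bool_matches(doc, query):
    body = query["bool"]
    if not all(clause_matches(doc, c) for c in as_list(body.get("must"))):
        return False
    if any(clause_matches(doc, c) for c in as_list(body.get("must_not"))):
        return False
    hits = sum(clause_matches(doc, c) for c in as_list(body.get("should")))
    return hits >= body.get("minimum_should_match", 0)


query = {
    "bool": {
        "must": {"term": {"user": "kimchy"}},
        "must_not": {"range": {"age": {"from": 10, "to": 20}}},
        "should": [{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
        "minimum_should_match": 1,
        "boost": 1.0,
    }
}

matching = {"user": "kimchy", "age": 30, "tag": ["wow", "search"]}
excluded_by_age = {"user": "kimchy", "age": 15, "tag": "wow"}
no_should_hit = {"user": "kimchy", "age": 30, "tag": "other"}

for doc in (matching, excluded_by_age, no_should_hit):
    print(doc, bool_matches(doc, query))
```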
