Wikipedia: Amazon Web Services (abbreviated AWS) is a collection of remote computing services (also called web services) that together make up a cloud computing platform, offered over the Internet by Amazon.com. The most central and well-known of these services are Amazon EC2 and Amazon S3. The service is advertised as providing a large computing capacity (potentially many servers) much faster and cheaper than building a physical server farm.[2] AWS is located in 8 geographical "regions": US East (Northern Virginia), where the majority of AWS servers are based,[3] US West (Northern California), US West (Oregon), South America (São Paulo), Europe (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), and Asia Pacific (Sydney).

List of products

Compute: Amazon Elastic Compute Cloud (EC2) provides scalable virtual private servers using Xen. Amazon Elastic MapReduce (EMR) allows developers to easily and cheaply process large amounts of data, using hosted Hadoop running on EC2 and Amazon S3.

Networking: Amazon Route 53 provides a highly available and scalable Domain Name System (DNS) web service. Amazon Virtual Private Cloud (VPC) creates a logically isolated set of Amazon EC2 instances which can be connected to an existing network using a VPN. AWS Direct Connect provides dedicated network connections into AWS data centers, providing faster and cheaper data throughput.

Content delivery: Amazon CloudFront, a content delivery network (CDN) for distributing objects to "edge locations" near the requester.

Storage: Amazon Simple Storage Service (S3) provides web-service-based storage. Amazon Glacier provides low-cost, long-term storage (compared to S3) with high redundancy and availability but infrequent access times; it is ideal for archiving. AWS Storage Gateway is an iSCSI block storage virtual appliance with cloud-based backup. Amazon Elastic Block Store (EBS) provides persistent block-level storage volumes for EC2. AWS Import/Export accelerates moving large amounts of data into and out of AWS using portable storage devices for transport.

Database: Amazon DynamoDB provides a scalable, low-latency NoSQL online database service backed by SSDs. Amazon ElastiCache provides in-memory caching for web applications (Amazon's implementation of Memcached and Redis). Amazon Relational Database Service (RDS) provides a scalable database server with MySQL, Informix,[20] Oracle, SQL Server, and PostgreSQL support.[21] Amazon Redshift provides petabyte-scale data warehousing with column-based storage and multi-node compute. Amazon SimpleDB allows developers to run queries on structured data; it operates in concert with EC2 and S3 to provide "the core functionality of a database". AWS Data Pipeline provides a reliable service for data transfer between different AWS compute and storage services. Amazon Kinesis streams data in real time, with the ability to process thousands of data streams on a per-second basis.

Deployment: Amazon CloudFormation provides a file-based interface for provisioning other AWS resources. AWS Elastic Beanstalk provides quick deployment and management of applications in the cloud. AWS OpsWorks provides configuration of EC2 services using Chef.

Management: Amazon Identity and Access Management (IAM), an implicit service, is the authentication infrastructure used to authenticate access to the other services. Amazon CloudWatch provides monitoring of AWS cloud resources and applications, starting with EC2.
AWS Management Console: a web-based point-and-click interface to manage and monitor EC2, EBS, S3, SQS, Amazon Elastic MapReduce, and Amazon CloudFront.

Application services: Amazon CloudSearch provides basic full-text search and indexing of textual content. Amazon DevPay, currently in limited beta, is a billing and account management system for applications that developers have built on top of Amazon Web Services. Amazon Elastic Transcoder (ETS) provides video transcoding of S3-hosted videos, marketed primarily as a way to convert source files into mobile-ready versions. Amazon Flexible Payments Service (FPS) provides an interface for micropayments. Amazon Simple Email Service (SES) provides bulk and transactional email sending. Amazon Simple Queue Service (SQS) provides a hosted message queue for web applications. Amazon Simple Notification Service (SNS) provides hosted multi-protocol "push" messaging for applications. Amazon Simple Workflow (SWF) is a workflow service for building scalable, resilient applications.
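To make the hosted message queue concrete, here is a minimal sketch (not from the source) of a producer and consumer talking to SQS through the boto3 Python SDK; the queue name and region are hypothetical placeholders, and real use would need AWS credentials configured.

```python
# Minimal SQS sketch using the boto3 SDK (queue name and region are placeholders).
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Create (or look up) a queue and send one message into it.
queue_url = sqs.create_queue(QueueName="demo-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="hello from a producer")

# A consumer polls the queue, processes each message, then deletes it.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=5)
for msg in resp.get("Messages", []):
    print("received:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```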
Wikipedia: Amazon S3 (Simple Storage Service) is an online file storage web service offered by Amazon Web Services. It provides storage through web services interfaces (REST, SOAP, and BitTorrent). Amazon S3 is reported to store more than 2 trillion objects as of April 2013. Details of S3's design are not made public by Amazon, though it clearly manages data with an object storage architecture. According to Amazon, S3's design aims to provide scalability, high availability, and low latency at commodity costs. S3 stores arbitrary objects (computer files) up to 5 terabytes in size, each accompanied by up to 2 kilobytes of metadata. Objects are organized into buckets (each owned by an Amazon Web Services account) and identified within each bucket by a unique, user-assigned key. Buckets and objects can be created, listed, and retrieved using either a REST-style HTTP interface or a SOAP interface. Requests are authorized using an access control list associated with each bucket and object. Bucket names and keys are chosen so that objects are addressable using HTTP URLs:
http://s3.amazonaws.com/bucket/key
http://bucket.s3.amazonaws.com/key
http://bucket/key (where bucket is a DNS CNAME record pointing to bucket.s3.amazonaws.com)
Because objects are accessible by unmodified HTTP clients, S3 can be used to replace significant existing (static) web hosting infrastructure.[16] One can construct a URL that can be handed off to a third party for access for a limited period, such as the next 30 minutes or the next 24 hours. Every item in a bucket can also be served as a BitTorrent feed: the S3 store can act as a seed host for a torrent, and any BitTorrent client can retrieve the file. This drastically reduces the bandwidth cost of downloading popular objects. Amazon S3 provides options to host static websites, with index document and error document support.[19] The photo hosting service SmugMug has used S3 since April 2006. They experienced a number of initial outages and slowdowns, but after one year they described it as being "considerably more reliable than our own internal storage" and claimed to have saved almost $1 million in storage costs.[21] There are various Filesystem in Userspace (FUSE)-based file systems for Unix-like operating systems (Linux, etc.) that can be used to mount an S3 bucket as a file system; because the semantics of S3 are not those of a POSIX file system, the mounted file system may not behave entirely as expected.[22] Apache Hadoop file systems can be hosted on S3, as S3 meets Hadoop's file system requirements. As a result, Hadoop can be used to run MapReduce algorithms on EC2 servers, reading data from and writing results back to S3. Dropbox,[23] Bitcasa,[24] StoreGrid, Jumpshare, SyncBlaze,[25] Tahoe-LAFS-on-S3,[26] Zmanda, Ubuntu One,[27] and Fiabee are some of the many online backup and synchronization services that use S3 as their storage and transfer facility. Minecraft hosts game updates and player skins on S3.[28] Tumblr, Formspring, Pinterest, and Posterous images are hosted on S3. Alfresco, the open-source enterprise content management provider, hosts data for its Alfresco in the Cloud service on S3. LogicalDOC, the open-source document management system, provides a disaster recovery tool based on S3. S3 was used in the past by some enterprises as a long-term archiving solution, until Amazon Glacier was released.
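As a minimal illustration of the URL addressing described above, the sketch below fetches a publicly readable object with a plain HTTP client and no S3 SDK. The bucket and key names are hypothetical; a private object would require a signed request or a pre-signed, time-limited URL.

```python
# Fetch a (hypothetical) publicly readable S3 object using nothing but HTTP,
# illustrating the path-style and virtual-hosted-style URLs described above.
import urllib.request

bucket, key = "example-bucket", "docs/readme.txt"   # placeholder names

path_style   = f"http://s3.amazonaws.com/{bucket}/{key}"
hosted_style = f"http://{bucket}.s3.amazonaws.com/{key}"

# Either URL form addresses the same object; a private object would return 403
# unless the request is signed or a pre-signed (time-limited) URL is used.
with urllib.request.urlopen(hosted_style) as resp:
    data = resp.read()
    print(len(data), "bytes retrieved from", hosted_style)
```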
Examples of competing S3-compliant storage implementations include:
- Google Cloud Storage
- Cloud.com's CloudStack[31]
- Cloudian, Inc.,[32] an S3-compatible object storage software package
- Connectria Cloud Storage,[33][34] which in 2011 became the first US cloud storage service provider based on the Scality RING organic storage technology[35][36]
- Eucalyptus
- Nimbula (acquired by Oracle)
- Riak CS,[37] which implements a subset of the S3 API, including REST and ACLs on objects and buckets
- Ceph with the RADOS gateway[38]
- Caringo, Inc., which implements the S3 API and extends Bucket Policies to include Domain Policies
From: Arjun Roy
I did some tests to compare C/C++ vs. C# on some basic operations, to see if the compilers handle the cases differently. I executed Dr. Wettstein's code in the libPree library on Linux Ubuntu 12.04 and my C# code on Windows 7, both with 4 GB RAM.

Operation       Bits         C/C++ (Linux)   C# (Windows)
OR              1 million    1.1 sec         0.1013 millisec
OR              10 million   23.33 sec       1.016 millisec
OR              100 million  266.38 sec      10.4494 millisec
AND             1 million    0.99 sec        0.0989 millisec
AND             10 million   23.29 sec       1.0235 millisec
AND             100 million  279.92 sec      10.6166 millisec
Number of 1's   1 million    0.49 sec        0.9647 millisec
Number of 1's   10 million   5.11 sec        6.6821 millisec
Number of 1's   100 million  55.73 sec       57.8274 millisec

3-run replicate of a 100-million-bit PTree OR op done with 32-bit code (gcc 4.3.4) with -O2 optimization:
PTree OR test: bitcount/iteration = 100000000/1  Count time: 60000 ticks [0.060000 secs.]  Elapsed: 0  real 0m0.076s  user 0m0.048s  sys 0m0.028s
PTree OR test: bitcount/iteration = 100000000/1  Count time: 60000 ticks [0.060000 secs.]  Elapsed: 0  real 0m0.076s  user 0m0.048s  sys 0m0.024s
PTree OR test: bitcount/iteration = 100000000/1  Count time: 70000 ticks [0.070000 secs.]  Elapsed: 0  real 0m0.076s  user 0m0.040s  sys 0m0.032s

So a 100 million bit OR op on my workstation, which is somewhat long in the tooth now, consistently yields about a 76 ms operation time. Doing the math suggests an effective application memory bandwidth of 178.5 megabytes/second. Put in another frame of reference, this suggests that my workstation could easily saturate a 1 gigabit/second LAN connection with a streaming PTree computation. I also ran a 3-tuple of a count test, just to make sure that it was consistent:
PTree count test: bitcount/iteration/population = 100000000/1/500000  Count time: 110000 ticks [0.110000 secs.]  Elapsed: 0  real 0m0.122s  user 0m0.112s  sys 0m0.008s
PTree count test: bitcount/iteration/population = 100000000/1/500000  Count time: 120000 ticks [0.120000 secs.]  Elapsed: 0  real 0m0.123s  user 0m0.120s  sys 0m0.000s
PTree count test: bitcount/iteration/population = 100000000/1/500000  Count time: 110000 ticks [0.110000 secs.]  Elapsed: 0  real 0m0.121s  user 0m0.112s  sys 0m0.008s

The above test reflects the amount of time required to root-count a PTree with 100 million entries and 500,000 randomly populated one bits. It suggests a count time of about 120 milliseconds, for an effective application memory bandwidth of 109.2 megabytes/second, just under what it would take to saturate a LAN connection. This test brings up an additional question I would have with respect to the validity of the testing environment. It uses the default PTree counting implementation, which scans word by word, byte by byte, with a 256-entry bitcount lookup table. This means we impose an overhead of 12,500,000 memory references for the table lookups and an additional 9,375,000 unaligned memory references for the three byte-pointer lookups per 32-bit word. PTree operations are by definition cache-pessimal, as we have discussed previously, and lookup tables are notorious for their cache-pressure characteristics. It is always difficult to predict how the hardware prefetch logic handles this, but at a minimum the lookup table is going to bounce across two L1 cache lines. Given this, it is somewhat unusual in their test results that the '1 counting' was five times faster than the basic AND/OR ops. There are multiple '1 counting' implementations in the source code, but they would have had to explicitly select one of those. Included in those counting implementations are reduction counting and a variant which unrolls the backing array onto the MMX vector registers.
Those strategies provided incremental performance improvements, but nothing on the order of a 5-fold increase in counting throughput. With respect to memory-optimizing PTree operations, I experimented with non-temporal loads and prefetch hinting and was never able to demonstrate significant performance increases. As we have discussed many times, PTree technology, and data mining in general, is ultimately constrained by memory latency. So that would be my spin on the results. The C# test results are approximately consistent with expected memory bandwidth speeds. I took a few minutes and ran the 100 million count test on one of the nodes in our 'chicken cruncher' and I am seeing around 20 milliseconds, which is about twice as fast as their C# implementation; that is reasonable given the hardware/OS/compiler-interpreter differences.
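The default counting strategy described above (byte-wise scanning with a 256-entry bitcount lookup table) can be illustrated with a short sketch. This is not the libPree code itself, only a minimal Python model of the technique, and it repeats the bandwidth arithmetic from the discussion (100 million bits = 12.5 MB, and 12.5 MB / 0.070 s is about 178.5 MB/s).

```python
# Minimal model (not the libPree implementation) of byte-wise population counting
# with a 256-entry lookup table, as described in the discussion above.

# Build the 256-entry table: LUT[b] = number of 1 bits in byte value b.
LUT = [bin(b).count("1") for b in range(256)]

def root_count(bitmap: bytes) -> int:
    """Count the 1 bits in a packed bit vector, one table lookup per byte."""
    return sum(LUT[b] for b in bitmap)

def bitwise_or(x: bytes, y: bytes) -> bytes:
    """OR two packed bit vectors of equal length, byte by byte."""
    return bytes(a | b for a, b in zip(x, y))

if __name__ == "__main__":
    import os
    n_bits = 1_000_000                      # small demo size
    v1 = os.urandom(n_bits // 8)            # random packed bit vectors
    v2 = os.urandom(n_bits // 8)
    print("ones in v1|v2:", root_count(bitwise_or(v1, v2)))

    # Bandwidth arithmetic from the discussion: a 100-million-bit PTree is
    # 100e6 / 8 = 12.5 MB; streaming it in ~0.070 s gives 12.5 / 0.070 = ~178.5 MB/s.
    print("effective bandwidth:", round(12.5 / 0.070, 1), "MB/s")
```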
Some Interesting Articles
We should know as much as we can about what "the client" is doing in order to best help them do it. From Popular Mechanics, "Client Data Mining: How It Works": PRISM, XKeyscore, and plenty more classified info in the client's vast surveillance program has come to light. How much data is there? How does the government sort through it? What are they learning about you? Here's a guide. Most people were introduced to the arcane world of data mining by a security breach revealing that the government gathers billions of pieces of data (phone calls, emails, photos, and videos) from Google, Facebook, Microsoft, and other communications giants, then combs through the information for leads on national security threats. Here's a guide to big-data mining, NSA-style.

The Information Landscape. How much data do we produce? An IBM study estimates 2.5 quintillion bytes (2.5 x 10^18) of data every day. (If these data bytes were pennies laid out flat, they would blanket the earth five times.) That total includes stored information (photos, videos, social-media posts, word-processing files, phone-call records, financial records, and results from science experiments) and data that normally exists for mere moments, such as phone-call content and Skype chats.

Veins of Useful Information. Digital info can be analyzed to establish connections between people, and these links can generate investigative leads. But in order to examine data, it has to be collected, from everyone. As the data-mining saying goes: to find a needle in a haystack, you first need to build a haystack.

Data Has to Be Tagged Before It's Bagged. Data mining relies on metadata tags that enable algorithms to identify connections. Metadata is data about data, e.g., the names and sizes of files; the label placed on data is called a tag. Tagging data enables analysts to classify and organize the info so it can be searched and processed. Tagging also enables analysts to parse the info without examining the contents. This is an important legal point, because the communications of U.S. citizens and lawful permanent resident aliens cannot be examined without a warrant; metadata can.

Finding Patterns in the Noise. The data-analysis firm IDC estimates that only 3 percent of the info in the digital universe is tagged when it's created, so the client has a sophisticated software program that puts billions of metadata markers on the info it collects. These tags are the backbone of any system that makes links among different kinds of data, such as video, documents, and phone records. For example, data mining could call attention to a suspect on a watch list who downloads terrorist propaganda, visits bomb-making websites, and buys a pressure cooker. (This pattern matches the behavior of the Tsarnaev brothers, who are accused of planting bombs at the Boston Marathon.) This tactic assumes terrorists have well-defined data profiles, something many security experts doubt.

Open Source and Top Secret. Accumulo was designed precisely for tagging billions of pieces of unorganized, disparate data. The custom tool is based on Google programming and is open source. Sqrrl commercialized it and hopes the healthcare and finance industries will use it to manage their own big-data sets.

The Miners: Who Does What. The client is authorized to snoop on foreign communications and also collects a vast amount of data, trillions of pieces of communication generated by people across the globe.
It does not chase the crooks, terrorists, and spies it identifies; it sifts info on behalf of other government players such as the Pentagon, CIA, and FBI. Here are the basic steps: 1. A judge on the secret Foreign Intelligence Surveillance (FISA) Court gets an application from an agency to authorize a search of data collected by the client. 2. Once it is authorized, the request goes to the FBI's Electronic Communications Surveillance Unit (ECSU), a legal safeguard; the ECSU reviews it to ensure no U.S. citizens are targeted. 3. The ECSU passes appropriate requests to the FBI Data Intercept Technology Unit, which obtains the info from Internet company servers and then passes it to the client to be examined. (Many companies have denied that they open their servers. As of press time, it's not clear who is correct.) 4. The client then passes relevant information to the government agency that requested it.

What Is the Client Up To? Phone-Metadata Mining Dragged Into the Light: The controversy began when it was revealed that the U.S. government was collecting the phone-metadata records of every Verizon customer, including millions of Americans. At the request of the FBI, FISA Court judge Roger Vinson issued an order compelling the company to hand over its phone records. The content of the calls was not collected, but national security officials call the program "an early warning system" for detecting terror plots.
PRISM Goes Public: Every collection platform or source of raw intelligence is given a name, called a Signals Intelligence Activity Designator (SIGAD), and a code name. SIGAD US-984XN is better known by its code name: PRISM. PRISM involves the collection of digital photos, stored data, file transfers, emails, chats, videos, and video conferencing from nine Internet companies. U.S. officials say this tactic helped snare Khalid Ouazzani, a naturalized U.S. citizen who the FBI claimed was plotting to blow up the New York Stock Exchange.

Mining Data as It's Created: Our client also operates real-time surveillance tools. Analysts can receive "real-time notification of an email event such as a login or sent message" and "real-time notification of a chat login". Whether real-time info can stop unprecedented attacks is subject to debate: alerting a credit-card holder to sketchy purchases in real time is easy; building a reliable model of an impending attack in real time is infinitely harder.

XKeyscore is software that can search hundreds of databases for leads. It reportedly enables low-level analysts to access communications without oversight, circumventing the checks and balances of the FISA court. The client vehemently denies this, and the documents don't indicate any misuse. It seems to be a powerful tool that allows analysts to find hidden links inside troves of info: "My target speaks German but is in Pakistan—how can I find him?" "My target uses Google Maps to scope target locations—can I use this info to determine his email address?" This program enables analysts to submit one query to search 700 servers around the world at once, combing disparate sources to find the answers to these questions.

How Far Can the Data Stretch? Oops—False Positives: Bomb-sniffing dogs sometimes bark at explosives that are not there. This kind of mistake is called a false positive. In data mining, the equivalent is a computer program sniffing around a data set and coming up with the wrong conclusion. This is when having a massive data set may be a liability: when a program examines trillions of connections between potential targets, even a very small false-positive rate yields tens of thousands of dead-end leads that agents must chase down, not to mention the unneeded incursions into innocent people's lives.

Analytics to See the Future: Ever wonder where those Netflix recommendations or suggested reading lists on Amazon come from? Your previous interests directed an algorithm to pitch those products to you. Big companies believe more of this kind of targeted marketing will boost sales and reduce costs. For example, this year Walmart bought a predictive-analytics startup called Inkiru, which makes software that crunches data to help retailers develop marketing campaigns that target shoppers when they are most likely to buy certain products.

Pattern Recognition or Prophecy? In 2011 British researchers created a game that simulated a van-bomb plot, and 60% of "terrorist" players were spotted by a program called DScent, based on their "purchases" and "visits" to the target site. The ability of a computer to automatically match security-camera footage with records of purchases may seem like a dream to law-enforcement agents trying to save lives, but it's the kind of ubiquitous tracking that alarms civil libertarians.
Client's Red Team: Secret Operations with the Government's Top Hackers, by Glenn Derene. When it comes to the U.S. government's computer security, we in the tech press have a habit of reporting only the bad news, for instance last year's hacks into Oak Ridge and Los Alamos National Labs, or a break-in to an e-mail server used by Defense Secretary Robert Gates; the list goes on and on. Frankly, that's because the good news is usually a bunch of nonevents: "Hackers deterred by diligent software patching at Army Corps of Engineers." Not too exciting. So, in the world of IT security, it must seem that the villains outnumber the heroes, but there are some good-guy celebrities in the world of cyber security. In my years of reporting on the subject, I've often heard the red team referred to with a sense of breathless awe by security pros. These guys are purported to be just about the stealthiest, most skilled firewall-crackers in the game. Recently, I called up the secretive government agency and asked if it could offer up a top red teamer for an interview, and, surprisingly, the answer came back, "Yes."

What are red teams? They're sort of like the special forces units of the security industry: highly skilled teams that clients pay to break into the clients' own networks. These guys find the security flaws so they can be patched before someone with more nefarious plans sneaks in. The client has made plenty of news in the past few years for warrantless wiretapping and massive data-mining enterprises of questionable legality, but one of the agency's primary functions is the protection of the military's secure computer networks, and that's where the red team comes in.

In exchange for the interview, I agreed not to publish my source's name. When I asked what I should call him, the best option I was offered was "an official within the client's Vulnerability Analysis and Operations Group," so I'm just going to call him OWCVAOG for short. And I'll try not to reveal any identifying details about the man whom I interviewed, except to say that his disciplined, military demeanor shares little in common with the popular conception of the flippant geek-for-hire familiar to movie fans (Dr. McKittrick in WarGames) and code geeks (the "n00b script-kiddie h4x0r" of leetspeak).
So what exactly does the red team actually do? They provide "adversarial network services to the rest of the DOD," says OWCVAOG. That means that "customers" from the many branches of the Pentagon invite OWCVAOG and his crew to act like our country's shadowy enemies (from the living-in-his-mother's-basement code tinkerer to a "well-funded hacker who has time and money to invest in the effort"), attempting to slip in unannounced and gain unauthorized access. These guys must conduct their work without doing damage to or otherwise compromising the security of the networks they are tasked to analyze; that means no denial-of-service attacks, malicious Trojans, or viruses. "The first rule," says OWCVAOG, "is 'do no harm.'" So the majority of their work consists of probing their customers' networks, gaining user-level access and demonstrating just how compromised the network can be. Sometimes, the red team will leave an innocuous file on a secure part of a customer's network as a calling card, as if to say, "This is your friendly red team. We danced past the comical precautionary measures you call security hours ago. This file isn't doing anything, but if we were anywhere near as evil as the hackers we're simulating, it might just be deleting the very government secrets you were supposed to be protecting. Have a nice day!"

I'd heard from one of the Department of Defense clients who had previously worked with the red team that OWCVAOG and his team had a success rate of close to 100 percent. "We don't keep statistics on that," OWCVAOG insisted when I pressed him on an internal measuring stick. "We do get into most of the networks we target. That's because every network has some residual vulnerability. It is up to us, given the time and the resources, to find the vulnerability that allows us to access it."

It may seem unsettling to you (it did at first to me) to think that the digital locks protecting our government's most sensitive info are picked so constantly and seemingly with such ease. But I've been assured that these guys are only making it look easy because they're the best, and that we all should take comfort, because they're on our side. The fact that they catch security flaws early means that, hopefully, we can patch up the holes before the black hats get to them.

And like any good geek at a desk talking to a guy with a really cool job, I wondered where the client finds the members of its superhacker squad. "The bulk is military personnel, civilian government employees and a small cadre of contractors," OWCVAOG says. The military guys mainly conduct the ops (the actual breaking-and-entering stuff), while the civilians and contractors mainly write code to support their endeavors. For those of you looking for a gig in the ultrasecret world of red teaming, this top hacker says the ideal profile is someone with "technical skills, an adversarial mind-set, perseverance and imagination."

Speaking of high-level, top-secret security jobs, this much I now know: the world's most difficult IT department to work for is most certainly lodged within the Pentagon. Network admins at the Defense Department have to constantly fend off foreign governments, criminals, and wannabes trying to crack their security wall, and also worry about ace hackers with the same DOD stamp on their paychecks. Security is an all-important issue for the corporate world, too, but in that environment there is an acceptable level of risk that can be built into the business model.
And while banks build in fraud as part of the cost of doing business, there's no such thing as an acceptable loss when it comes to national security. I spoke about this topic recently with Mark Morrison, chief information assurance officer of the Defense Intelligence Agency. "We meet with the financial community because there are a lot of parallels between what the intelligence community needs to protect and what the financial community needs," Morrison said. "They, surprisingly, have staggeringly high acceptance levels for how much money they're willing to lose. We can't afford to have acceptable loss. So our risk profiles tend to be different, but in the long run, we end up accepting similar levels of risk because we have to be able to provide actionable intelligence to the war fighter."

OWCVAOG agrees that military networks should be held to higher standards of security, but perfectly secure computers are perfectly unusable. "There is a perfectly secure network," he said. "It's one that's shut off. We used to keep our info in safes. We knew that those safes were good, but they were not impenetrable, and were rated on the number of hours it took for people to break into them. This is a similar equation."
FAUST
A clusterer (an unsupervised analytic) analyzes data objects without consulting a known class label (usually none are present). Objects are clustered (grouped) so as to maximize intra-class similarity and minimize inter-class similarity.

FAUST Count Change (FCC) Clusterer
1. Choose a nextD recursion plan, e.g.:
a. initially use D = the diagonal with maximum Standard Deviation (STD), or maximum STD/Spread;
b. always use AM (Average-to-Median);
c. always use AFFA (Average-to-FurthestFromAverage);
d. always use FFAFFF (FurthestFromAverage-to-FurthestFromFurthest);
e. cycle through the diagonals e1, ..., en, e1+e2, ...;
f. cycle through AM, AFFA, FFAFFF; ...
Also choose a DensityThreshold (DT), a DensityUniformityThreshold (DUT), and a Precipitous Count Change Definition (PCCD).
2. If DT (and/or DUT) is not exceeded at a cluster C, partition C by cutting at each gap and each PCC in CoD, using the next D in the recursion plan.

FAUST Anomaly Detector (FAD)
An outlier-detection analytic identifies objects that do not comply with the data model. In most cases outliers are discarded as noise or exceptions, but in some applications, e.g., fraud detection, the rare events are the interesting events. Outliers can be detected with a. statistics (statistical tests assume a distribution/probability model), b. distance, c. density, or d. deviation (deviation methods use a dissimilarity measure and reduce overall dissimilarity by removing "deviation outliers"). Outlier mining can mean: 1. given a set of n objects and a number k, find the top k objects in terms of dissimilarity from the rest of the objects; 2. given a Training Set, identify outlier objects within each class (correctly classified but noticeably dissimilar to their fellow class members); 3. determine "fuzzy" clusters, i.e., assign a weight for each (object, cluster) pair. (Does a dendogram do that?) We believe that the FAUST Count Change clusterer is the best anomaly detector we have at this time. It seems important to be able to identify and remove large clusters in order to find small ones (anomalies).

FAUST Polygon Classifier
For each D, compute the midpoint, or the low and high, of xoD over each class, then build a polygonal hull, tightly for Linear and loosely for Medoid. (For a loose hull, "none of the above" must be examined separately, at great expense.) If we always build a tight polygonal hull, we end up with just one method, the FAUST Polygon classifier: take the above series of D vectors (add additional Ds if you wish, the more the merrier, but watch out for the cost of computing {CkoD}, k=1..K; e.g., add CAFFA (Class Average-to-FurthestFromAverage) and CFFAFFF (Class FurthestFromAverage-to-FurthestFromFurthest)). For each D in the D-series, let lD,k = min(CkoD) (or the first PCI) and hD,k = max(CkoD) (or the last PCD). Then y isa Ck iff y is in Hk, where Hk = {z in Space | lD,k <= Doz <= hD,k for every D in the series}. If y is in the hull of more than one class, say y is in Hi1, ..., Hih, then y can be fuzzy classified by weight(y,k) = OneCount(PCk & PHi1 & ... & PHih) and, if we wish, we can declare y isa Ck where weight(y,k) is a maximum. Or we can let Sphere(y,r) = {z | (y-z)o(y-z) < r^2} vote (but this requires the construction of Sphere(y,r)).
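A minimal sketch of the tight-hull idea just described, under the simplifying assumptions that the data are plain NumPy arrays (not PTrees) and that the D-series is supplied by the caller; lD,k and hD,k are taken as the raw class min and max rather than first-PCI / last-PCD cut points.

```python
# Sketch of the FAUST Polygon (tight hull) classifier on ordinary arrays, not the
# PTree implementation: for every class k and every direction D, record
# l[D,k] = min(Ck o D) and h[D,k] = max(Ck o D); a sample y lies in hull Hk iff
# l[D,k] <= D o y <= h[D,k] for every D in the series.
import numpy as np

def fit_hulls(X, labels, D_series):
    hulls = {}
    for k in np.unique(labels):
        Ck = X[labels == k]
        # one (low, high) interval per direction D
        hulls[k] = [(float((Ck @ D).min()), float((Ck @ D).max())) for D in D_series]
    return hulls

def classify(y, hulls, D_series):
    """Return the list of classes whose hull contains y (possibly empty or >1)."""
    hits = []
    for k, intervals in hulls.items():
        if all(lo <= float(y @ D) <= hi for D, (lo, hi) in zip(D_series, intervals)):
            hits.append(k)
    return hits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
    labels = np.array([0] * 50 + [1] * 50)
    # D-series: the coordinate diagonals e1..e4 plus the normalized main diagonal
    D_series = [np.eye(4)[i] for i in range(4)] + [np.ones(4) / 2.0]
    hulls = fit_hulls(X, labels, D_series)
    print(classify(X[0], hulls, D_series), classify(X[75], hulls, D_series))
```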
FCC on IRIS150. DT=1. PCCD: PCCs must involve a high of at least 5 and a change of at least 60% from that high (at least 2 if the high is 5). Gaps must be at least 3.

Is there a better measure of gap potential than STD? How about normalized STD, NSTD = Std(Xk - min Xk) / Spread(Xk)? For the four IRIS columns the NSTDs are .22, .18, .29, .31.

The FCC clusterer algorithm: set Dk to 1 for each column with NSTD > NSTDT = 0.2, so D = 1011. Make gap and PCC cuts on YoD. If the density is less than DT at a dendogram node C (a cluster), partition C at each gap and PCC in CoD using the next D in the recursion plan.

FCC on IRIS150, 1st round: D=1011, F(x) = (x-p)oD with randomly selected p = (15 6 2 5); 91.3% accurate after the 1st round. 2nd round: D=0100, p = (15 6 2 5); 95.3% accurate after the 2nd round. 3rd round: D=1000, p = (15 6 2 5); 97% accurate after the 3rd round.

[Slide detail omitted: the per-round F-value / Count / Gap projection tables and the resulting cluster composition counts.]

We can develop a fusion method by looking for non-gaps in projections onto the vector connecting the medians of cluster pairs.
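A minimal sketch of the round structure just summarized, assuming plain NumPy arrays rather than PTrees, using only the gap cut (the precipitous-count-change cut and the density test are omitted), and with the gap and NSTD thresholds passed in as parameters.

```python
# Sketch of one FCC-style round on ordinary arrays (not PTrees): pick D from the
# per-column NSTD = std(Xk - min Xk) / spread(Xk), project onto D, and split the
# cluster wherever consecutive projection values are separated by a gap >= gap_min.
# PCC cuts and the density / density-uniformity tests are not modeled here.
import numpy as np

def choose_D(X, nstd_threshold=0.2):
    spread = X.max(axis=0) - X.min(axis=0)
    nstd = X.std(axis=0) / np.where(spread == 0, 1, spread)
    return (nstd > nstd_threshold).astype(float)       # e.g. 1011 for IRIS

def gap_cut(X, D, p, gap_min=3):
    """Split X into sub-clusters at every gap >= gap_min in F(x) = (x - p) o D."""
    F = (X - p) @ D
    order = np.argsort(F)
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order[:-1], order[1:]):
        if F[nxt] - F[prev] >= gap_min:                 # a gap: start a new cluster
            clusters.append(current)
            current = []
        current.append(nxt)
    clusters.append(current)
    return [X[idx] for idx in clusters]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(12, 1, (50, 4))])
    D = choose_D(X)
    parts = gap_cut(X, D, p=X.mean(axis=0))
    print("D =", D, "->", [len(c) for c in parts], "points per cluster")
```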
FCC on IRIS150. DT=1. Same PCCD and gap thresholds as above. 1st round with D=1010 (the highest-STD diagonal): 87.3% accurate after the 1st round. 2nd round on cluster C2 with D=0001 and on cluster C1 with D=0001: 97.3% accurate after the 2nd round.

[Slide detail omitted: per-round F-value / Count / Gap projection tables and cluster composition counts.]
FCC on IRIS150. DT=1. Same PCCD and gap thresholds as above. 1st round with D = 1,-1,1,1 (the highest STD/spread): 91.3% accurate after the 1st round. 2nd round with D = 1,-1,-1,-1 (the highest STD/spread among the directions orthogonal to 1,-1,1,1, namely 1,1,1,-1; 1,1,-1,1; 1,-1,-1,-1): 97.3% accurate after the 2nd round.

[Slide detail omitted: per-round F-value / Count / Gap projection tables and cluster composition counts.]
FCC on SEED150. DT=1. PCCs must involve a high of at least 5 and be at least a 60% change from that high. Gaps must be at least 3. Column NSTDs (NSTD = Std(Xk - min Xk) / Spread(Xk)): 0.28, 0.28, 0.22, 0.47.

The Ultimate PCC (UPCC) clusterer algorithm? Set Dk to 1 for each column with NSTD > NSTDT (= 0.25). Shift the X column values as in Gap Preservation above, giving the shifted table Y. Make gap and PCC cuts on YoD. If the density is less than DT at a dendogram node C (a cluster), partition C at each gap and PCC in CoD using the next D in the recursion plan.

Using UPCC with D=1111, first round only: errs=53. Using UPCC with D=1001, first round only: errs=51. With D=1101, 1st round: 85% accurate; after a 3rd round with D=0010 on the remaining sub-clusters: 97.3% accurate.

[Slide detail omitted: per-round F-value / Count / Gap projection tables and the (L, M, H) composition of each sub-cluster.]
FCC on CONC150 (class counts L=43, M=52, H=55). DT=1. PCCs must involve a high of at least 5 and be at least a 60% change from that high. Gaps must be at least 3. Column NSTDs: 0.25, 0.25, 0.24, 0.22.

COLUMN    1     2     3     4
STD       .017  .005  .001  .037
SPREAD    10    39    112   5.6
STD/SPR   .18   .23   .21   2.38

First-round cuts are shown for D=0001 (gaps >= 1) and for D=1111: 53% accurate after the 1st round. 2nd round with D=1100 on each surviving cluster.

FCC on WINE150 (class counts L=57, M=75, H=18). DT=1. PCCs must involve a high of at least 5 and be at least a 60% change from that high. 1st round: 64% accurate. 2nd round (D=0100 on C2, D=0010 on C4 and C6): 90% accurate.

[Slide detail omitted: per-round F-value / Count / Gap projection tables and the (L, M, H) composition of each sub-cluster.]