460 likes | 838 Views
The Analysis of Patterns. Nello Cristianini. The Value of Patterns. Patterns are everywhere, and people have always been fascinated by them. Detecting patterns confers an advantage to an organism. Temperature and Rainfall in Lake Shasta over 5 years. Patterns Help Us in Many Ways….
E N D
The Analysis of Patterns Nello Cristianini
The Value of Patterns • Patterns are everywhere, and people have always been fascinated by them. • Detecting patterns confers an advantage to an organism Temperature and Rainfall in Lake Shasta over 5 years
Patterns Help Us in Many Ways… e.g., compress, predict, remove errors
Patterns and Intelligence • We care so much about pattern finding skills, that we even use them (partly) to quantify intelligence…
The Instinct for Patterns • We see patterns everywhere • Even where there are no patterns:
Patterns and Randomness • We are poorly equipped to deal with randomness: • 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679... • In first million digits we see Erice’s ZIP code 11 times… • Does it mean anything?
ABRAHAM LINCOLN WAS ELECTED TO CONGRESS IN 1846. JOHN F. KENNEDY WAS ELECTED TO CONGRESS IN 1946. ABRAHAM LINCOLN WAS ELECTED PRESIDENT IN 1860. JOHN F. KENNEDY WAS ELECTED PRESIDENT IN 1960. THE NAME LINCOLN AND KENNEDY EACH CONTAIN SEVEN LETTERS. BOTH WIVES LOST CHILDREN WHILE LIVING IN THE WHITE HOUSE. BOTH PRESIDENTS WERE SHOT ON FRIDAY. BOTH WERE SHOT IN THE HEAD. BOTH SUCCESSORS WERE NAMED JOHNSON. ANDREW JOHNSON, WHO SUCCEEDED LINCOLN, WAS BORN IN 1808. LYNDON JOHNSON, WHO SUCCEEDED KENNEDY, WAS BORN IN 1908. JOHN WILKES BOOTH, REPORTEDLY ASSASSINATED LINCOLN. LEE HARVEY OSWALD, REPORTEDLY ASSASSINATED KENNEDY. BOTH ASSASSINS WERE KNOWN BY THREE NAMES. BOTH NAMES CONTAINED FIFTEEN LETTERS. BOOTH AND OSWALD WERE ASSASSINATED BEFORE THEIR TRIALS. Patterns and Randomness Coincidences ?
Visualizing Patterns • We are naturally equipped to detect CERTAIN types of patterns, not others… 5.9400 8.6100 12.2800 11.6100 20.2800 23.8300 25.8300 27.3900 24.1700 19.0600 11.1700 8.8900 8.3300 7.8900 12.1100 15.3900 19.1100 24.6100 28.1100 25.7800 23.1100 16.8400 13.1700 8.4400 5.8900 10.8300 12.1100 15.7800 18.8300 26.5600 27.5600 25.0000 23.4400 15.5600 10.7200 7.1700 7.8300 11.1700 9.7800 14.9400 20.5000 23.3300 27.8300 29.2200 25.1100 20.6700 12.8900 11.8900 9.1700 9.8300 14.2800 18.5000 19.0000 26.3900 29.6100 26.7200 22.6700 20.3900 13.8900 8.8900
Finding Patterns • We are naturally interested in finding relations in data. • We are naturally ill-equipped in dealing with randomness. • We have developed sophisticated technology to do this for us. • In last decade we have made one more step… • As a society we rely on pattern discovery technology, in many ways
Computational Pattern Finding • We want to find relations • They need to be reliable • They need to be explored efficiently • We want to do it automatically • On MASSIVE amounts of data
The Analysis of Patterns • Data driven approach to • Science • Business • Technology • Modern society relies on our capability to automatically detect reliable patterns in vast sets of data
The Analysis of Patterns • Science: • The Genome Project • Surveys of the Universe • Business: • Amazon automatically exploiting trends and relations in transactions database • Fraud Detection in Credit Card Companies • Technology: • Voice recognition • Handwriting recognition
A Scientific Gold Rush • 1 GATCACAGGT CTATCACCCT ATTAACCACT CACGGGAGCT CTCCATGCAT TTGGTATTTT • 61 CGTCTGGGGG GTGTGCACGC GATAGCATTG CGAGACGCTG GAGCCGGAGC ACCCTATGTC • 121 GCAGTATCTG TCTTTGATTC CTGCCTCATT CTATTATTTA TCGCACCTAC GTTCAATATT • 181 ACAGGCGAAC ATACCTACTA AAGTGTGTTA ATTAATTAAT GCTTGTAGGA CATAATAATA • 241 ACAATTGAAT GTCTGCACAG CCGCTTTCCA CACAGACATC ATAACAAAAA ATTTCCACCA • 301 AACCCCCCCC TCCCCCCGCT TCTGGCCACA GCACTTAAAC ACATCTCTGC CAAACCCCAA • 361 AAACAAAGAA CCCTAACACC AGCCTAACCA GATTTCAAAT TTTATCTTTA GGCGGTATGC • 421 ACTTTTAACA GTCACCCCCC AACTAACACA TTATTTTCCC CTCCCACTCC CATACTACTA • 481 ATCTCATCAA TACAACCCCC GCCCATCCTA CCCAGCACAC ACACACCGCT GCTAACCCCA • 541 TACCCCGAAC CAACCAAACC CCAAAGACAC CCCCCACAGT TTATGTAGCT TACCTCCTCA • 601 AAGCAATACA CTGAAAATGT TTAGACGGGC TCACATCACC CCATAAACAA ATAGGTTTGG • 661 TCCTAGCCTT TCTATTAGCT CTTAGTAAGA TTACACATGC AAGCATCCCC GTTCCAGTGA • 721 GTTCACCCTC TAAATCACCA CGATCAAAAG GGACAAGCAT CAAGCACGCA GCAATGCAGC • 781 TCAAAACGCT TAGCCTAGCC ACACCCCCAC GGGAAACAGC AGTGATTAAC CTTTAGCAAT • 841 AAACGAAAGT TTAACTAAGC TATACTAACC CCAGGGTTGG TCAATTTCGT GCCAGCCACC • 901 GCGGTCACAC GATTAACCCA AGTCAATAGA AGCCGGCGTA AAGAGTGTTT TAGATCACCC • 961 CCTCCCCAAT AAAGCTAAAA CTCACCTGAG TTGTAAAAAA CTCCAGTTGA CACAAAATAG • 1021 ACTACGAAAG TGGCTTTAAC ATATCTGAAC ACACAATAGC TAAGACCCAA ACTGGGATTA • 1081 GATACCCCAC TATGCTTAGC CCTAAACCTC AACAGTTAAA TCAACAAAAC TGCTCGCCAG • 1141 AACACTACGA GCCACAGCTT AAAACTCAAA GGACCTGGCG GTGCTTCATA TCCCTCTAGA
Another Gold Rush • The world wide web contains billion of pages, with text, images, data… • Semantic web, XML-based, provides high quality annotated information… • Soon all books ever written will be in digital form • Are we ready?
2001 2004 2005
The Analysis of Patterns • Traditionally, the role of analyzing data belongs to Statistics. • Or: does it ? • Data analysis performed by physicists, biologists, engineers… each with their own set of tools. • Even the task of making or validating these tools is not just part of statistics.
The Analysis of Patterns • Signal processing • Data mining • Information retrieval • Pattern recognition (*) • Pattern matching • Machine Learning • …
The Analysis of Patterns • Pattern Recognition • Syntactical / Structural • Statistical • Visual • Pattern Discovery vs Pattern Matching • In sequences • In graphs • In images
The Analysis of Patterns • Grammatical Inference • Mining for Association Rules • Patterns in Vector Data (classical multivariate statistics; neural networks; machine learning; etc) • Etc, Etc
The Analysis of Patterns • Many communities working almost independently • Occasionally re-discovering the same things • A small and fairly stable set of ideas • Efficient search for patterns in data • Statistical validation issues • Pattern visualization • Often same tools and concepts
Searching for Patterns • The search problem can be framed within Operational Research / Optimization. • (e.g., Integer Programming, Convex Programming, etc…) • Many key ideas from exact optimization have revolutionized this field in recent years • Where exact solution are theoretically impractical (and only then!) we can use approximations, then heuristic approaches. • Again: same heuristics appear in many fields
Statistics • How do we know that a relation found in a finite set of data is reliable, or significant, or even interesting? • Many issues of hypothesis testing • Classical statistics vs statistical learning theory
What Are Patterns? • This is a rather difficult question to answer. I hope we will have an answer by the end of this meeting. • I encourage all speakers and participants to suggest some definitions.
Gregory Chaitin: "Patterns, Randomness and Information" • Information, Complexity, Patterns, Randomness and Compression. • What are regularities in data? How can they be defined? And quantified? • Predictability and Compressibility are connected. • Randomness can be defined in algorithmic ways.
Gregory Chaitin • Chaitin will explain what it means that a sequence “has no pattern”, and some far reaching consequences • ideas can be traced back through Hermann Weyl to Leibniz in 1686, • connect with Godel & Turing • the question of how math compares & contrasts with physics and with biology
Patterns in Sets of Points (Vectors) • Probably the most developed part of pattern analysis • Includes much multivariate stats, much statistical pattern recognition (e.g., Duda and Hart) and machine learning
Tijl De Bie: Patterns in Sets of Points • Patterns in sets of points: an overview • the role of optimization • examples of patterns • Dimensionality reduction, classification, clustering • Emphasize linear patterns (connect to later kernels talk) • Patterns in sets of points: the myriad virtues of eigenproblems • the eigenvalue problem. • principal component analysis, canonical correlation analysis, Fisher's discriminant, partial least squares, and spectral clustering. • More from thiss area will be covered in Kernel Methods’ talk
Patterns in Sequences • After vectors, probably the most important type of data: • DNA • Text (web) • How to find patterns within and among sequences? • What data structures? What statistical models?
Esko Ukkonen Suffix Tree and Hidden Markov techniques for pattern analysis • Efficient Pattern Discovery in sequences requires appropriate data structures • Suffix tree construction. • linear time array constructions • using suffix trees for finding motifs with gaps • finding cis-regulatory motifs by comparative genomics • Hidden Markov techniques for haplotyping
Dan Gusfield Trees, Arrays, Networks and Optimization for Finding Patterns in Biological Sequences • The use of suffix trees and integer programming for finding optimal virus signatures. • A current treatment of suffix-arrays and their uses. • Algorithms for finding signatures (patterns) of historical recombination and gene-conversion in SNP (binary) sequences.
Raffaele GiancarloPatterns and Compression • Patterns are not just necessary for prediction, they are also needed for data compression. • Many relations between PA and Data Compression. • Raffaele Giancarlo (University of Palermo) - On Indexing and Compression: Two Sides of the Same Coin
Conceptual Foundations • Alberto Apostolico (University of Padova and Georgia Tech) - "Algorithmic and Combinatorial Foundations of Pattern Discovery" • Will discuss various aspects of the interplay between algorithmics and statistics, as well as the notion itself of pattern.
Kernel Methods • An idea: if we are so good at finding (linear) patterns in sets of points… • Why not transforming all other problems into a points problem? • Good idea… • Kernels Methods (from machine learning) can do this “automatically” …
Bernhard Schoelkopf Kernel Methods • Kernel methods: combine ideas from statistics and optimization • State of the art machine learning systems • Operate on general types of data • Work by embedding data into a euclidean space • The structure of the space determined by choosing a special “kernel” function… • KMs connect various aspects of PA…
Patterns in Sets • The most classic textbook example of data mining: You shop at the supermarket, and the market-basket contents are recorded by the computer system at check-out… Discover when some items are associated, when it is possible to predict your next purchase, etc… • (this is what Amazon does automatically…)
Heikki Mannila: Finding frequent patterns • Part I: Finding frequent patterns from data • Discovery of frequent patterns = finding positive conjunctions that are true for a given fraction of the observations • this basic idea can be instantiated in many ways: • finding frequent sets from 0/1 data (association mining) • finding frequent episodes in sequences • finding frequent subgraphs in graphs etc. • efficient algorithms exist -- the levelwise approach • theoretical analysis of the algorithms is not trivial (leads to connections to hypergraph transversals etc.) • Part II: how can the patterns be used? • sometimes interesting in themselves • can be used to approximate the joint distribution • maximum entropy approaches • combining information from several patterns - ordering patterns
When can we trust the patterns we found? • Statistical issues: • Patterns can be the result of chance • Multiple testing increases this risk • Small samples, interest in weak patterns, etc… are other factors • Statistical learning theory and Classical statistics have developed tools to deal with this • These criteria can also guide the search algorithms towards more reliable patterns…
John Shawe-Taylor Statistical Aspects of Pattern Analysis • We want significant / reliable patterns • Reliable: give us predictive power • Significant: cannot be explained by chance • Factors affecting pattern reliability: • Pattern magnitude (how strong is the relation) • Sample size (how large is the support from data?) • Multiple testing (how many other patterns have been tested at the same time) • This translates into classical machine learning and statistics themes
Nicolo' Cesa-Bianchi On-line linear learning algorithms • Machine learning has various ways to model the pattern discovery process. • An approach completely different from classical statistics: on-line learning. • Prediction with expert advice. • Learning with linear experts. • The Perceptron algorithm and its extensions. • On-line learning with kernels. • Mistake bounds. • From mistake bounds to risk bounds
Grammatical Inference • Very classical theme in pattern recognition, based on Chomsky’s theory of formal languages and grammars: • Given a finite sample from a language, infer the grammar that generates it (with various constraints). • A child’s game…
Colin de la Higuera : "Grammatical Inference, a Tutorial" • The lectures will introduce the key ideas of grammatical inference and concentrate specially on the algorithmic aspects. • Some algorithms that will be described are: • The "State merging" family : Gold, Rpni, Edsm... • The "Window" languages : Local and k-testable • Learning with queries. • This class of approaches often goes under the name “Syntactical Pattern Recognition”
Patterns in Graphs • Edwin Hancock (University of York, UK) - ``Pattern Analysis with Graphs and Trees'‘ • Spectral representations of graphs, • Pattern spaces from graph spectra, • Spectral approaches to matching, • Heat kernel methods • Probabilistic and spectral methods for graph matching and clustering. • Applications in computer vision.