Synthesis Generations • First Generation (evaluation) • Perfect speech could be generated • Required perfect setting of the parameters • Human intervention put upper limits on the achievable quality • Second Generation • Memorized pre-stored waveforms for concatenation • Cannot store enough data to concatenate everything we might want • Only allows pitch and timing changes • Third Generation • Introduces statistical models to learn the data's properties • Allows the possibility of modifying the output in many ways
Dynamic Programming • Definition: a recursive algorithm that uses arrays (tables) • Description: • Start with the base case, which initializes the arrays • Each step of the algorithm fills in table entries • Later steps access table entries filled in by earlier steps • Advantages: • Avoids repeating calculations performed during recursion • Uses loops without the overhead of creating activation records • Applications: • Many applications beyond signal processing • Dynamic Time Warping: how close are two sequences? • Hidden Markov Model algorithms
Example: Minimum Edit Distance A useful dynamic programming algorithm • Problem: How can we measure how different one word is from another word (e.g., in a spell checker)? • How many operations will transform one word into another? • Examples: caat --> cat, fplc --> fireplace • Definition: • Levenshtein distance: the smallest number of insertion, deletion, or substitution operations needed to transform one string into another • Each insertion, deletion, or substitution counts as one operation • Requires a two-dimensional array • Rows: source word positions, Columns: target word positions • Cells: distance[r][c] is the distance up to that point
Pseudo Code (minDistance(target, source))

n = number of characters in source
m = number of characters in target
Create array, distance, with dimensions n+1, m+1
FOR r = 0 TO n
    distance[r,0] = r
FOR c = 0 TO m
    distance[0,c] = c
FOR each row r
    FOR each column c
        IF source[r] = target[c] THEN cost = 0 ELSE cost = 1
        distance[r,c] = minimum of
            distance[r-1,c] + 1,       // deletion
            distance[r,c-1] + 1,       // insertion
            distance[r-1,c-1] + cost   // substitution
Result is in distance[n,m]
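Below is a minimal runnable Python sketch of the pseudocode above; the function name and the demo strings at the end are just for illustration.

```python
def min_distance(target: str, source: str) -> int:
    """Levenshtein distance with unit insertion, deletion, and substitution costs."""
    n, m = len(source), len(target)
    # distance[r][c] = edit distance between source[:r] and target[:c]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r                      # delete all r source characters
    for c in range(m + 1):
        distance[0][c] = c                      # insert all c target characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(distance[r - 1][c] + 1,         # deletion
                                 distance[r][c - 1] + 1,         # insertion
                                 distance[r - 1][c - 1] + cost)  # substitution
    return distance[n][m]

print(min_distance("GUMBO", "GAMBOL"))   # 2, matching the example slides
print(min_distance("cat", "caat"))       # 1
```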
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Initialization
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 1
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 2
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 3
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 4
Example • Source: GAMBOL, Target: GUMBO • Algorithm Step: Column 5 • Result: Distance equals 2
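For reference, here is the completed distance table, reconstructed by running the algorithm above (rows are source GAMBOL prefixes, columns are target GUMBO prefixes); the bottom-right cell gives the final distance of 2, matching the stated result.

```
        ""  G  U  M  B  O
    ""   0  1  2  3  4  5
    G    1  0  1  2  3  4
    A    2  1  1  2  3  4
    M    3  2  2  1  2  3
    B    4  3  3  2  1  2
    O    5  4  4  3  2  1
    L    6  5  5  4  3  2
```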
Hidden Markov Model • Motivation • We observe the output • We don't know which internal states the model is in • Goal: Determine the most likely internal (hidden) state sequence • Hence the name, “Hidden” • Definition (Discrete HMM): Ф = (O, S, A, B, Ω) • O = {o1, o2, …, oM} is the set of possible output symbols • S = {1, 2, …, N} is the set of possible internal HMM states • A = {aij} is the transition probability matrix, where aij is the probability of moving from state i to state j • B = {bi(k)} gives the probability of state i outputting ok • Ω = {Ωi} is the set of initial state probabilities, where Ωi is the probability that the system starts in state i
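As a minimal sketch, the definition above can be captured directly in code; the class name, field names, and the toy two-state values below are purely illustrative and not part of the original slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DiscreteHMM:
    outputs: List[str]      # O = {o1, ..., oM}: the possible output symbols
    states: List[int]       # S = {1, ..., N}: the hidden states
    A: List[List[float]]    # A[i][j] = probability of transitioning from state i to j
    B: List[List[float]]    # B[i][k] = probability that state i emits output symbol k
    omega: List[float]      # omega[i] = probability that the system starts in state i

# Hypothetical two-state, two-output model
toy = DiscreteHMM(outputs=["up", "down"], states=[0, 1],
                  A=[[0.8, 0.2], [0.3, 0.7]],
                  B=[[0.6, 0.4], [0.1, 0.9]],
                  omega=[0.5, 0.5])
```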
HMM Applications Given an HMM model and an observation sequence: • Evaluation Problem: What is the probability that the model generated the observations? • Decoding Problem: What is the most likely state sequence S = (s0, s1, s2, …, sT) in the model that produced the observations? • Learning Problem: How can we adjust the parameters of the model to maximize the likelihood that the observations will be correctly recognized?
Natural Language Processing and HMMs • Speech Recognition: Which words generated the observed acoustic signal? • Handwriting Recognition: Which words generated the observed image? • Part-of-speech Tagging: Which parts of speech correspond to the observed words? • Where are the word boundaries in the acoustic signal? • Which morphological word variants match the acoustic signal? • Translation: Which foreign words are in the observed signal? • Speech Synthesis: Which database unit fits the synthesis script? • Hidden Markov Model (HMM) Demo: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/hmms/s3_pg1.html
Natural Language HMM Assumptions • A Stochastic Markov process • System state changes are not deterministic; they vary according to some probability distribution • Discrete • There is a countable set of system states, observed at discrete time steps • Markov Chain: The next state depends solely on the current state • Output Assumption • The output at a given state depends solely on that state P(w1, …, wn) ≈ ∏i=2,n P(wi | wi-1), rather than the exact chain rule P(w1, …, wn) = ∏i=2,n P(wi | w1, …, wi-1) Demonstration of a Stochastic Process: http://cs.sou.edu/~harveyd/classes/cs415/docs/unm/movie.html
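A toy sketch of the bigram (first-order Markov) approximation above; the tiny "corpus" is made up purely for illustration, and a real system would estimate these probabilities from large text collections, with smoothing for unseen pairs.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_prob(sentence: str) -> float:
    """P(w1..wn) ~= P(w1) * product over i of P(wi | wi-1)."""
    words = sentence.split()
    p = unigrams[words[0]] / len(corpus)
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(sentence_prob("the cat sat"))   # (3/9) * (2/3) * (1/2) ~= 0.11
```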
Speech Recognition Example • Observations: The digital signal features • Hidden States: The spoken words that generated the features • Goal: Choose the Word that maximizes P(Word|Observation) • Bayes' Law gives us something we can calculate: • P(Word|Observation) = P(Word) P(Observation|Word) / P(Observation) • Ignore the denominator: P(Observation) is the same for every candidate word, so it does not affect which word maximizes the expression • P(Word) can be looked up from a database • Use bigrams or trigrams to take the context into account • Chain rule: P(w1, …, wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1, w2, …, wn-1) • If there is no such probability in the database, we can use a smoothing algorithm to assign a value to combinations never encountered in training.
HMM: Trellis Model Question: How do we find the most likely sequence?
Probabilities • Forward probability: The probability of seeing the partial observation o1,…,ot and being in state si at time t • Backward probability: The probability of the partial observation ot+1,…,oT, given that the state at time t is si • Joint state-transition probability: ξt(i,j) = P(qt = si, qt+1 = sj | o1,…,oT), the probability of being in state si at time t and going from state si to state sj, given the complete observation o1,…,oT
Forward Probabilities αt(j) = ∑i=1,N {αt-1(i) aij} bj(ot), where αt(j) = P(o1,…,ot, qt = sj | λ) • Notes • λ = the HMM, qt = HMM state at time t, sj = jth state, ot = tth output • aij = probability of transitioning from state si to sj • bj(ot) = probability of state sj emitting observation ot • αt(j) = probability of the observations o1,o2,…,ot and being in state j at time t
Forward Algorithm Pseudo Code What is the likelihood of each possible observed pronunciation?

forward[i,j] = 0 for all i,j; forward[0,0] = 1.0
FOR each time step t
    FOR each state s
        FOR each state transition s to s'
            forward[s',t+1] += forward[s,t] * a(s,s') * b(s',ot)
RETURN ∑ forward[s,tfinal+1] over all states s

Notes • a(s,s') is the transition probability from state s to state s' • b(s',ot) is the probability of state s' given observation ot • Complexity: O(T S2), where S is the number of states and T is the number of time steps
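A minimal runnable Python sketch of the forward algorithm; it initializes from the initial-state probabilities Ω rather than a separate start state, and the two-state model at the bottom is entirely hypothetical.

```python
def forward(A, B, pi, obs):
    """Return P(obs | model) for a discrete HMM.

    A[i][j] : probability of transitioning from state i to state j
    B[i][k] : probability that state i emits output symbol k
    pi[i]   : probability of starting in state i
    obs     : observation sequence as a list of output-symbol indices
    """
    N = len(A)
    # alpha[i] = P(o1..ot, state i at time t)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

# Hypothetical two-state model with output symbols {0, 1}
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(forward(A, B, pi, [0, 0, 1]))   # likelihood of observing the sequence 0, 0, 1
```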
Viterbi Algorithm • Viterbi is an optimally efficient dynamic programming HMM algorithm that traces through a series of possible states to find the most likely cause of an observation • Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum • Forward Algorithm: αt(j) = ∑i=1,N αt-1(i) aij bj(ot) • Viterbi: vt(j) = maxi=1,N vt-1(i) aij bj(ot)
Viterbi Algorithm Pseudo Code What is the most likely state sequence for a word given an observation sequence?

viterbi[i,j] = 0 for all i,j; viterbi[0,0] = 1.0
FOR each time step t
    FOR each state s
        FOR each state transition s to s'
            newScore = viterbi[s,t] * a(s,s') * b(s',ot)
            IF (newScore > viterbi[s',t+1])
                viterbi[s',t+1] = newScore
                save the back pointer (s',t+1) -> (s,t)
RETURN the state sequence obtained by tracing the back pointers from the best final state

Notes • a(s,s') is the transition probability from state s to state s' • b(s',ot) is the probability of state s' given observation ot
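A minimal runnable Python sketch of Viterbi; it mirrors the forward sketch above but takes the maximum over incoming transitions and keeps backpointers so the best state sequence can be recovered. The function name and layout are illustrative, not from the slides.

```python
def viterbi(A, B, pi, obs):
    """Return (best_path_probability, most_likely_state_sequence)."""
    N = len(A)
    v = [pi[i] * B[i][obs[0]] for i in range(N)]   # best score ending in state i
    backpointers = []                              # best predecessor per state, per step
    for t in range(1, len(obs)):
        new_v, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: v[i] * A[i][j])
            new_v.append(v[best_i] * A[best_i][j] * B[j][obs[t]])
            ptr.append(best_i)
        v = new_v
        backpointers.append(ptr)
    # Trace the backpointers from the best final state
    state = max(range(N), key=lambda i: v[i])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return max(v), list(reversed(path))
```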
Markov Example • Problem: Model the probability of stocks being bull, bear, or stable • Observed outputs: up, down, unchanged • Hidden states: bull, bear, stable • (The slide shows the transition/output probability matrix and the initialization matrix as figures, not reproduced here) • Example: What is the probability of observing up five days in a row?
HMM Example • O = {up, down, unchanged (Unch)} • S = {bull (1), bear (2), stable (3)} • Observe 'up, up, down, down, up' • What is the most likely sequence of states for this output?
Forward Probabilities Example X = [up, up], summing over α0,i · ai,c · bc • After the first up: α(bull) = bbull(up) · Ωbull = 0.7 · 0.5 = 0.35, α(bear) = 0.02, α(stable) = 0.09 • After the second up: α(bull) = 0.179, α(bear) = 0.009, α(stable) = 0.036 • Note: α(stable) = 0.35·0.2·0.3 + 0.02·0.2·0.3 + 0.09·0.5·0.3 = 0.0357
Viterbi Example Observed = [up, up], taking the maximum of α0,i · ai,c · bc instead of the sum • After the first up: v(bull) = 0.7 · 0.5 = 0.35, v(bear) = 0.02, v(stable) = 0.09 • After the second up: v(bull) = 0.147, v(bear) = 0.007, v(stable) = 0.021 • Note: v(stable) = 0.021 = 0.35·0.2·0.3, the maximum of 0.35·0.2·0.3, 0.02·0.2·0.3, and 0.09·0.5·0.3
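The following sketch reproduces the numbers on the last two slides. Only a few of the model's probabilities appear explicitly above, so the full transition, output, and initial matrices below are a reconstruction (consistent with the slide arithmetic and with the classic bull/bear/stable textbook example); treat them as an assumption.

```python
# States: 0 = bull, 1 = bear, 2 = stable; outputs: 0 = up, 1 = down, 2 = unchanged
A  = [[0.6, 0.2, 0.2],    # transitions from bull
      [0.5, 0.3, 0.2],    # transitions from bear
      [0.4, 0.1, 0.5]]    # transitions from stable
B  = [[0.7, 0.1, 0.2],    # bull:   P(up), P(down), P(unch)
      [0.1, 0.6, 0.3],    # bear
      [0.3, 0.3, 0.4]]    # stable
pi = [0.5, 0.2, 0.3]
UP = 0

alpha0 = [pi[i] * B[i][UP] for i in range(3)]
print(alpha0)    # [0.35, 0.02, 0.09], as on the slides

alpha1 = [sum(alpha0[i] * A[i][j] for i in range(3)) * B[j][UP] for j in range(3)]
print(alpha1)    # ~[0.179, 0.0085, 0.0357]; the slides round to 0.179, 0.009, 0.036

v1 = [max(alpha0[i] * A[i][j] for i in range(3)) * B[j][UP] for j in range(3)]
print(v1)        # ~[0.147, 0.007, 0.021], matching the Viterbi slide
```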
Backward Probabilities • Similar algorithm to computing the forward probabilities, but in the other direction • Answers the question: Given the HMM model and that the state at time t is i, what is the probability of generating the partial observation ot+1,…,oT? βt(i) = ∑j=1,N {aij bj(ot+1) βt+1(j)}
Backward Probabilities βt(i) = ∑j=1,N {βt+1(j) aij bj(ot+1)}, where βt(i) = P(ot+1,…,oT | qt = si, λ) • Notes • λ = the HMM, qt = HMM state at time t, sj = jth state, ot = tth output • aij = probability of transitioning from state si to sj • bj(ot) = probability of state sj emitting observation ot • βt(i) = probability of the remaining observations ot+1,ot+2,…,oT given state i at time t
Parameters for HMM States • Cepstral coefficients • Why? They are largely statistically independent, which makes them suitable for classifying outputs • Delta coefficients • Why? To overcome the HMM limitation that transitions depend only on the previous state. Speech articulators change slowly, so speech doesn't follow the traditional HMM model; without delta coefficients, an HMM tends to jump too quickly between states • Synthesis requires more parameters than ASR • Examples: additional delta coefficients, duration and F0 modeling, acoustic energy
Cepstral Review • Perform a Fourier transform to go from the time domain to the frequency domain • Warp the frequencies using the Mel scale • Gather the amplitude data into bins (usually 13) • Take the log power of the amplitudes • Perform a discrete cosine transform (no complex numbers) to form the cepstral coefficients • Compute first- and second-order delta coefficients from the cepstral coefficients • Note: Phase data is lost in the process
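As a rough illustration of this pipeline, the sketch below uses the librosa library (an assumption; it is not mentioned in the slides), which wraps the FFT, Mel warping, log, and DCT steps in one call; the file name and sample rate are placeholders.

```python
import librosa

# Load a waveform (placeholder file name and sample rate)
y, sr = librosa.load("speech.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per frame:
# FFT -> Mel filterbank -> log energies -> discrete cosine transform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# First- and second-order delta coefficients computed from the cepstra
delta1 = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

print(mfcc.shape, delta1.shape, delta2.shape)   # (13, number of frames) each
```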
Training Data • Question: How do we establish the transition probabilities between states when that information is not available? • Older method: tedious hand-marking of wave files based on spectrograms • Optimal method: NP-complete, hence intractable • Newer method: the HMM Baum-Welch algorithm is a popular heuristic to automate the process • Strategies • Speech Recognition: train with data from many speakers • Speech Synthesis: train with data from specific speakers
Baum-Welch Algorithm Pseudo Code

Initialize HMM parameters; iterations = 0
DO
    HMM' = HMM; iterations++
    FOR each training data sequence
        Calculate forward probabilities
        Calculate backward probabilities
        Update HMM parameters
UNTIL |HMM - HMM'| < delta OR iterations >= MAX
Re-estimation of State Changes • The numerator sums, over time, the forward/backward ways to arrive at time t, take the transition from state i to state j, and produce the observed output; the denominator sums the forward/backward ways to be in state i at time t a'ij = ∑t=1,T αi(t) aij bj(ot+1) βj(t+1) / ∑t=1,T αi(t) βi(t) • Note: b(ot) is already part of αi(t)
Joint Probabilities The joint probability of being in state i at time t and state j at time t+1 combines the forward value, the transition, the output, and the backward value: αi(t) aij bj(ot+1) βj(t+1)
Re-estimation of Other Probabilities • The probability of an output, o, being observed from a given state, s: b's(o) = (expected number of times in state s observing o) / (expected number of times in state s) • The probability of initially being in state s when observing the output sequence: Ω's = ∑j=1,N α1(s) asj bj(o2) β2(j) / ∑i=1,N ∑j=1,N α1(i) aij bj(o2) β2(j)
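A compact, runnable sketch of one Baum-Welch re-estimation step for a single observation sequence, combining the forward and backward passes with the re-estimation formulas above; the matrix layout follows the earlier sketches and the function name is illustrative.

```python
def baum_welch_step(A, B, pi, obs):
    """One EM re-estimation step for a discrete HMM on one observation sequence."""
    N, T, M = len(A), len(obs), len(B[0])
    # Forward pass: alpha[t][i] = P(o1..ot, state i at t)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Backward pass: beta[t][i] = P(o_{t+1}..o_T | state i at t), beta_T(i) = 1
    beta = [[1.0] * N if t == T - 1 else [0.0] * N for t in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    likelihood = sum(alpha[T-1][i] for i in range(N))
    # gamma[t][i] = P(state i at t | obs); xi[t][i][j] = P(state i at t, j at t+1 | obs)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # Re-estimate initial, transition, and output probabilities from expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi
```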
Summary of HMM Approaches • Discrete • The continuous-valued observed outputs are compared against a codebook of discrete values for the HMM observations • Performs well for smaller dictionaries • Continuous Mixture Density • The observed outputs are fed to the HMM in continuous form • Gaussian mixtures: outputs map to a range of distribution parameters • Applicable to large vocabularies, with a large number of parameters • Semi-Continuous • No per-state mixture of Gaussian densities • A tradeoff between the discrete and continuous-mixture approaches • For large vocabularies: better than discrete, worse than continuous mixture
HMM Limitations • HMM training is a hill-climbing algorithm • It finds local optima, not global ones • It is sensitive to the initial parameter settings • HMMs have trouble modeling the time durations of speech • The first-order Markov and output-independence assumptions don't exactly model speech • Underflow can occur when computing Markov probabilities; for this reason, log probabilities are normally used • Continuous-output model performance is limited by probabilities that incorrectly map to outputs • Successive outputs are in reality interrelated, not independent
Decision Trees Partition the data with a series of questions, each with a discrete set of answers (the slide shows two scatter-plot partitions of the same points: a poor partition and a reasonably good partition)
CART Algorithm Classification and Regression Trees 1. Create a set of questions that can distinguish between the measured variables • Singleton questions: Boolean (yes/no or true/false) answers • Complex questions: many possible answers 2. Initialize the tree with one root node 3. Compute the entropy for a node to be split 4. Pick the question with the greatest entropy gain 5. Split the tree based on step 4 6. Return to step 3 as long as nodes remain to split 7. Prune the tree to the optimal size by removing leaf nodes with minimal improvement Note: We build the tree from the top down; we prune the tree from the bottom up.
Example: Play or Not Play? • Questions • What is the outlook? • What is the temperature? • What is the humidity? • Is it windy? • Goal: Order the questions in the most efficient way
Example Tree for “Do we play?” Goal: Find the optimal tree • Root question: Outlook • Outlook = sunny: ask Humidity (high → No, normal → Yes) • Outlook = overcast: Yes • Outlook = rain: ask Windy (true → No, false → Yes)
Which question do we select first? (The slide compares the candidate splits for each of the four questions; example from witten&eibe.)
Computing Entropy • Entropy: the number of bits needed to store the possible question answers • Formula for the entropy of a question: Entropy(p1, p2, …, pn) = -p1 log2 p1 - p2 log2 p2 - … - pn log2 pn • where pi is the probability of the ith answer to the question and log2 x is the logarithm base 2 of x • Examples: • A coin toss requires one bit (head = 1, tail = 0) • A question with 30 equally likely answers requires ∑i=1,30 -(1/30) log2(1/30) = -log2(1/30) = 4.907 bits
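A small runnable check of the formula and the two examples above; the helper name is just for illustration.

```python
from math import log2

def entropy(probs):
    """Average bits needed to encode one answer drawn with the given probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)   # 0 * log2(0) is taken as 0

print(entropy([0.5, 0.5]))        # coin toss: 1.0 bit
print(entropy([1 / 30] * 30))     # 30 equally likely answers: ~4.907 bits
```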
Example: the question “Outlook” Compute the entropy for the question: What is the outlook? • Entropy(“Outlook” = “Sunny”) = Entropy(0.4, 0.6) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.971 (five outcomes: 2 play, P = 0.4; 3 not play, P = 0.6) • Entropy(“Outlook” = “Overcast”) = Entropy(1.0, 0.0) = -1 log2(1.0) - 0 log2(0.0) = 0.0 (four outcomes, all play: P = 1.0 for play, P = 0.0 for not play) • Entropy(“Outlook” = “Rainy”) = Entropy(0.6, 0.4) = -0.6 log2(0.6) - 0.4 log2(0.4) = 0.971 (five outcomes: 3 play, P = 0.6; 2 not play, P = 0.4) • Entropy(Outlook) = weighted average over Sunny, Overcast, Rainy = 5/14·0.971 + 4/14·0 + 5/14·0.971 = 0.693
Computing the Entropy Gain • Original entropy: Do we play? Entropy(“Play”) = Entropy(9/14, 5/14) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 (14 outcomes: 9 play, P = 9/14; 5 not play, P = 5/14) • Information gain = (information before the split) - (information after the split): gain(“Outlook”) = 0.940 - 0.693 = 0.247 • Information gain for the other weather questions: • gain(“Temperature”) = 0.029 • gain(“Humidity”) = 0.152 • gain(“Windy”) = 0.048 • Conclusion: Ask “What is the outlook?” first
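The sketch below reproduces these gains. The Outlook counts (2/3, 4/0, 3/2 play/not-play) come from the slides; the Temperature, Humidity, and Windy counts are assumed from the standard Witten & Eibe weather data set this example appears to be based on.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def gain(splits, total=14):
    """splits = [(play, not_play), ...] per answer; gain = entropy before - weighted entropy after."""
    before = entropy([9 / 14, 5 / 14])     # the "Do we play?" entropy, 0.940
    after = sum((y + n) / total * entropy([y / (y + n), n / (y + n)]) for y, n in splits)
    return before - after

print(gain([(2, 3), (4, 0), (3, 2)]))   # Outlook      ~0.247
print(gain([(2, 2), (4, 2), (3, 1)]))   # Temperature  ~0.029  (assumed counts)
print(gain([(3, 4), (6, 1)]))           # Humidity     ~0.152  (assumed counts)
print(gain([(6, 2), (3, 3)]))           # Windy        ~0.048  (assumed counts)
```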
Continuing to Split For each child question, do the same thing to form the complete decision tree • Example: After the Outlook = sunny node, we can still ask about temperature, humidity, and windiness (the slide shows the resulting sub-splits)
The final decision tree Note: The splitting stops when further splits don't reduce entropy more than some threshold value
Other Models • Goal: Find database units to use for synthesizing some element of speech • Other approaches • Relax the Markov assumption • Advantage: can better model speech • Disadvantage: complicates the model • Neural nets • Disadvantage: have not been demonstrated to be superior to the HMM approach