Everyone’s a Critic: Memory Models and Uses for an Artificial Turing Judge
W. Joseph MacInnes, Blair C. Armstrong, Dwayne Pare, George S. Cree and Steve Joordens
Turing Test
[Figure labels: H? C?]
• 3 parts to play
  • Computer Program (AI)
  • Human Confederate
  • Judge
• Is it a test for intelligence?
  • Likely necessary but not sufficient
• The unclaimed Loebner Prize is evidence that this is still a hard problem for a conversational agent.
Here Comes the Judge
• Focus on the third party as a first step
• Can a computer agent function as a Turing Judge?
  • Language recognition, not generation
• Can an automated judge tell the difference between human and computer generated conversations?
• Human judges may use
  • Grammar
  • Meaningfulness of reply
  • Relatedness of answer
  • ‘Commonsense’ knowledge
Task: determine which is human/computer
Rationale and Applications
• Use of the judge as an internal critic for language creation/generation for AGI
• Models of human language comprehension and generation likely include a critic
• Garden path sentences suggest that an internal critic evaluates a sentence continuously while reading.
  • the cotton clothing is made of… is grown in the South
  • The first reading interprets ‘cotton’ as an adjective, but it is later reinterpreted as a noun
• Applications in spam detection and in detecting computer generated forum posts for narrow AI.
LSA
• LSA is a corpus-based statistical method for generating representations that capture aspects of word meaning based on the contexts in which words co-occur.
• The text corpus is converted into a word x passage matrix, where the passages can be any unit of text (e.g., sentence, paragraph, essay).
• The elements of the matrix are the frequencies of each target word in each passage.
• The entire matrix is then submitted to singular value decomposition (SVD), the purpose of which is to abstract a lower dimensional (e.g., 300 dimensions) meaning space in which each word is represented as a vector.
• In addition to computational efficiency, this smaller matrix tends to better emphasize the similarities amongst words.
• Following the generation of this compressed matrix, new passages can be compared by taking the cosine between the vectors of the words they contain.
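The steps above can be written compactly as follows. This is a sketch of the standard LSA formulation, not a formula taken from the slides: the symbols $X$, $U$, $\Sigma$, $V$, $k$ and $\mathbf{w}_i$ are notation introduced here, and representing a passage as the sum of its word vectors is an assumption.

```latex
% X is the m x n word-by-passage matrix; X_{ij} is the frequency of word i in passage j.
X = U \Sigma V^{\top}

% Keep only the k largest singular values (e.g. k = 300), with k \ll \min(m, n).
X_k = U_k \Sigma_k V_k^{\top}

% Word i is represented by the i-th row of U_k \Sigma_k.
\mathbf{w}_i = \left( U_k \Sigma_k \right)_{i,:}

% Assumed passage representation: the sum of the vectors of the words in passage p.
\mathbf{p} = \sum_{i \in p} \mathbf{w}_i

% Two passages a and b are compared by the cosine of the angle between their vectors.
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```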
LSA: Example
A1. Humans built the Cylons to make their lives easier.
A2. The Cylons did not like doing work for the humans.
A3. In a surprise attack, the Cylons destroyed the humans that built them.
A4. The Cylons were built by humans to do arduous work.
B1. Some survivors escaped and fled on the Galactica.
B2. The Galactica protected the survivors using its Viper attack ships.
B3. The Cylons were no match for a Viper flown by one of the survivors.
B4. A Viper flown by one of the survivors found Earth and led the Galactica there.
• Given the question
  • “The humans built what?”
• And the possible replies
  • “The humans built the Cylons” and
  • “They built the Galactica and the vipers”
• The question overlaps more in the semantic space with the first reply, which would therefore receive a higher rating from our judge
• The effect of SVD is to reduce the dimensionality of the matrix and to discover word similarities which weren’t explicit in the original text (hence the ‘Latent’ part)
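The example can be tried out with a minimal Python sketch using NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names (`tokenize`, `train_lsa`, `passage_vector`, `judge_score`), the simple letters-only tokenizer, the choice of `k=2`, and the use of summed word vectors for a passage are all hypothetical choices made here. Words that never appear in the training corpus (such as "what" or "vipers") are ignored.

```python
import re
import numpy as np

# Training corpus: the eight passages from the slide.
CORPUS = [
    "Humans built the Cylons to make their lives easier.",
    "The Cylons did not like doing work for the humans.",
    "In a surprise attack, the Cylons destroyed the humans that built them.",
    "The Cylons were built by humans to do arduous work.",
    "Some survivors escaped and fled on the Galactica.",
    "The Galactica protected the survivors using its Viper attack ships.",
    "The Cylons were no match for a Viper flown by one of the survivors.",
    "A Viper flown by one of the survivors found Earth and led the Galactica there.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train_lsa(passages, k):
    """Build the word x passage count matrix and reduce it to k dimensions via SVD."""
    vocab = sorted({w for p in passages for w in tokenize(p)})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(passages)))
    for j, passage in enumerate(passages):
        for w in tokenize(passage):
            counts[index[w], j] += 1.0
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    # Each row of U_k * S_k is the k-dimensional vector for one word.
    return index, U[:, :k] * s[:k]


def passage_vector(text, index, word_vectors):
    """Sum the vectors of the known words in a passage (assumed representation)."""
    rows = [index[w] for w in tokenize(text) if w in index]
    if not rows:
        return np.zeros(word_vectors.shape[1])
    return word_vectors[rows].sum(axis=0)


def judge_score(question, reply, index, word_vectors):
    """Cosine similarity between the question vector and the reply vector."""
    q = passage_vector(question, index, word_vectors)
    r = passage_vector(reply, index, word_vectors)
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    return float(q @ r / denom) if denom > 0 else 0.0


index, word_vectors = train_lsa(CORPUS, k=2)
question = "The humans built what?"
for reply in ("The humans built the Cylons",
              "They built the Galactica and the vipers"):
    print(reply, "->", round(judge_score(question, reply, index, word_vectors), 3))
```

On a corpus this small the scores can be sensitive to the choice of `k`, so no expected values are shown here; a realistic judge would be trained on a far larger corpus with a few hundred dimensions.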
Results
• Rank orderings and performance scores for all judges.
• There was substantial variability in the rank orderings of the different agents by the different human judges.
• The correlation of agent rankings was higher among the human judges (0.83) than between the human judges and our artificial judge, though the artificial judge did reach a respectable 0.63.
“Humanness” was operationalized as the vector cosine similarity for all of the words in the question relative to all of the words in the answer. Error bars are the standard error of the mean. With the exception of ALICE, humans scored significantly higher than the artificial agents.
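To make the comparison of rank orderings concrete, here is a small Python sketch of a rank correlation between two judges. The slides do not say which correlation coefficient was used, so Spearman's rho is an assumption, and the function name `spearman_rho` and the rankings below are hypothetical placeholders, not the study's data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items (no ties)."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two ranked items")
    # Sum of squared differences between the two ranks given to each item.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of five agents (1 = judged most human).
human_judge = [1, 2, 3, 4, 5]
artificial_judge = [2, 1, 3, 4, 5]
print(spearman_rho(human_judge, artificial_judge))  # 0.9
```

A value of 1 means the two judges ordered the agents identically, and -1 means one ordering is the exact reverse of the other.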
Future work
• Better training data
  • LiveJournal ‘conversations’
• Better Judge
  • Topic based (LDA), to try to extract the topic or ‘gist’ of a sentence
  • Pure semantic spaces (LSA) are unlikely to be sufficient
• Better ambitions
  • Full conversational agent
Image from Forbidden Planet: http://forbiddenplanet.co.uk/blog/w-content/uploads/2007/12/Adam%20LevermoreRich%20how%20to%20spot%20a%20cylon.jpg
Potential Fit
[Diagram labels: Comprehension, Context, Grammar, Semantics, Response Generation, Critic ??]