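Computer Seeing and Hearing: A Dream of AI • Zhang Bo (张钹) • School of Information Science and Technology, Tsinghua University • Department of Computer Science and Technology, Tsinghua University • Tsinghua National Laboratory for Information Science and Technology • State Key Laboratory of Intelligent Technology and Systems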
Computer Vision/Hearing • Is it possible? • Yes • No • It is just a daydream!
The Characteristics of Auditory Information (Data) • Ears, Earphones • A continuous wave • Digital data: 20K-100K bits/s • Sparseness (Redundancy) • Noisy
The Characteristics of Visual Information (Data) • Eyes, Digital Camera • Pixel-based (million, ten million bits) • Sparseness (Redundancy) • Noisy • Eyes: a sequence of images • 10^9 bits/sec
The Sparseness of Auditory Signal • Sampling frequency and bit resolution • Broadcast quality: 48 kHz • CD quality: 44 kHz, 16-bit • Radio quality: 22 kHz, 8-bit • Acceptable music: 11 kHz, 4-bit • Acceptable speech: 5 kHz
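The quality tiers above translate into raw data rates by multiplying the sampling rate by the bit resolution. The following is a minimal sketch, assuming uncompressed single-channel PCM; the function name `pcm_bit_rate` and the `TIERS` table are illustrative, and only tiers for which the slide gives both a sampling rate and a bit depth are included.

```python
# Raw (uncompressed) PCM bit rate = sampling rate x bits per sample x channels.
# Assumption: mono audio. CD audio is sampled at 44.1 kHz, often rounded to 44 kHz.
TIERS = {
    "CD quality": (44_100, 16),
    "Radio quality": (22_000, 8),
    "Acceptable music": (11_000, 4),
}


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Return the raw bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


for name, (rate, bits) in TIERS.items():
    print(f"{name}: {pcm_bit_rate(rate, bits):,} bits/s")
```

Under these assumptions the "acceptable music" tier already needs only a small fraction of the CD data rate, which is one way to read the redundancy the slide points to.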
The Sparseness of Visual Signal • Relationship between resolution and recognition rate (conceptual)
An Ill-posed Problem • Microphone (Ears), Camera (Eyes) • Sparse, redundant, noisy data (110000111100011100011000…………) • Existence, Uniqueness, Stability • Speaker-invariant Vowel Representation • Vowel-invariant Speaker Representation • (Object-invariant Representation)
Image Segmentation vs. Recognition • Where is the object? • What is the object? • Which comes first, the chicken or the egg?
Speech Segmentation vs. Recognition • What? Where?
Technical Difficulties (Technology) • Sparse, redundant, noisy data • A Robust Detector • An Invariant Descriptor • Speaker-invariant Vowel Representation • Vowel-invariant Speaker Representation
How Do Humans Solve It? • Top-down feedback • Local connection • Data-driven • High-level a priori knowledge • From egg to chicken
The Relation Between Activation Patterns and Early Stages of Sound Processing • Speech encoding occurs not only in specialized high-level regions but also in the early stages of sound processing. Early sound processing may exhibit complex spectrotemporal receptive fields and may participate in high-level encoding of auditory objects, e.g., via local feedback.
Multi-layer Neural Network with feedback connections • G. E. Hinton et al., The "wake-sleep" algorithm for unsupervised neural networks, Science, vol. 268, 26 May 1995, pp. 1158-1161
Representation RBM: Restricted Boltzmann Machine
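As a rough illustration of how an RBM learns a representation, the sketch below performs one step of contrastive divergence (CD-1) for a binary RBM. It is not taken from the slides: the function `cd1_update`, the learning rate, and the toy data are all assumptions chosen for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 step for a binary RBM on a batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden probabilities given the data, then a binary sample.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Update: data correlations minus reconstruction correlations, batch-averaged.
    n = v0.shape[0]
    W = W + lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis = b_vis + lr * (v0 - p_v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid


# Toy usage with hypothetical sizes: 6 visible units, 4 hidden units.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(8, 6)).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)
W, b_vis, b_hid = cd1_update(data, W, b_vis, b_hid, rng)
```

The hidden probabilities `sigmoid(v @ W + b_hid)` are what would serve as the learned representation of an input `v`; stacking several such layers is outside the scope of this sketch.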
Experimental Results • G. E. Hinton, Learning multiple layers of representation, Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428-434, 2007
Computer Robustly Extractable Features • Sparse, redundant, noisy data • Statistical Approaches • Speech-based Invariant Statistics (Features) • Speaker-invariant Vowel Representation • Vowel-invariant Speaker Representation
Statistical Method • Select a speech training corpus • Extract speech features • Unsupervised learning (Classification) • Classification criterion: Generalization • Which features to extract? Those the computer can detect robustly
Representation at Different Granularities • The coarsest: Global Features (one vector) • An Image • The finest: Pixel-based (1280×800×3 vectors)
Pixel-based Representation: the finest representation • million×3-dimensional vectors • all the details
Global Features: the coarsest representation • Color moments • N: the number of pixels, P: the value of each color • One 9-dimensional vector
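The formula for the color moments did not survive in this transcript. A common definition, assumed here, takes the mean, standard deviation, and cube root of the third central moment of each of the three color channels, which yields the 9-dimensional vector the slide mentions. The function name `color_moments` is illustrative.

```python
import numpy as np


def color_moments(image: np.ndarray) -> np.ndarray:
    """Return a 9-dim global descriptor for an H x W x 3 image.

    Assumed layout of the result: [mean x3, std x3, skewness x3],
    one value per color channel.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)  # N x 3, N = number of pixels
    mean = pixels.mean(axis=0)
    centered = pixels - mean
    std = np.sqrt((centered ** 2).mean(axis=0))
    skew = np.cbrt((centered ** 3).mean(axis=0))  # cube root keeps the sign
    return np.concatenate([mean, std, skew])


# Toy usage on a hypothetical 2 x 2 image whose first channel alternates 0 and 2.
toy = np.zeros((2, 2, 3))
toy[..., 0] = [[0, 2], [0, 2]]
features = color_moments(toy)
```

However large the image, the descriptor stays at nine numbers, which is why it sits at the coarsest end of the granularity scale.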
Representation with Middle Grain-Size • Region-based Representation
Local (Spatial) Feature Region-01 Region-11 Region-12 Foreground vs. Background
Vector Representation • A set of vectors (tens) of different lengths • Weighted Similarity Measure
Region-adaptive Grid Partition Jinhui Yuan (2005…)
Hierarchical (Granularity) Structure • Semantics (text, image) • (X, F, f): the finest space • ([X], [F], [f]): a coarse space • [X]: the quotient space of X, whose elements are equivalence classes • [F]: the quotient structure of F • [f]: the quotient attributes of f • Semantic Gap • Primitive (words, pixels)
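To make the quotient-space idea concrete, the sketch below coarsens a tiny image: X is the set of pixels, the equivalence relation puts pixels of the same 2×2 block in one class, and [f] is taken to be the mean intensity of each class. The choice of the mean as the quotient attribute, the helper `quotient_space`, and the toy image are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def quotient_space(elements, equivalence_key, attribute):
    """Group elements into equivalence classes and average the attribute per class."""
    classes = defaultdict(list)
    for x in elements:
        classes[equivalence_key(x)].append(x)
    # [X] is the set of class labels; [f] is assumed to be the class mean of f.
    return {label: mean(attribute(x) for x in members)
            for label, members in classes.items()}


# Hypothetical 4 x 4 grayscale image stored as {(row, col): intensity}.
image = {(r, c): 10 * r + c for r in range(4) for c in range(4)}

# Two pixels are equivalent when they fall into the same 2 x 2 block.
coarse = quotient_space(
    image.keys(),
    equivalence_key=lambda p: (p[0] // 2, p[1] // 2),
    attribute=lambda p: image[p],
)
print(coarse)
```

Sixteen pixels collapse into four classes, giving a coarser space in which the same kind of question can be asked with fewer elements.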
PM: Pyramid Match (feature space-quantization level) SPM: Spatial Pyramid Match (physical space-grid) FESCO: Feature Spatial Covariant Kernel
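The sketch below illustrates pyramid matching in feature space, assuming PM refers to the pyramid match kernel of Grauman and Darrell: features are histogrammed at successively coarser quantization levels, and matches first found at level i are weighted by 1/2^i. It is simplified to one-dimensional, non-negative integer features, and the function names are illustrative.

```python
import numpy as np


def histogram_intersection(h1, h2) -> float:
    """Number of matched points between two histograms."""
    return float(np.minimum(h1, h2).sum())


def pyramid_match(x, y, value_range: int) -> float:
    """Pyramid match score for two sets of integers in [0, value_range).

    Assumption: value_range is a power of two, so bin widths 1, 2, 4, ...
    nest exactly.
    """
    x, y = np.asarray(x), np.asarray(y)
    num_levels = value_range.bit_length()  # widths 1, 2, ..., value_range
    score, previous = 0.0, 0.0
    for level in range(num_levels):
        width = 2 ** level
        bins = value_range // width
        hx = np.bincount(x // width, minlength=bins)
        hy = np.bincount(y // width, minlength=bins)
        matches = histogram_intersection(hx, hy)
        # Only matches that are new at this level count, at weight 1 / width.
        score += (matches - previous) / width
        previous = matches
    return score


print(pyramid_match([0, 5], [1, 5], value_range=16))
```

A spatial pyramid match applies the same weighting idea to grid cells in image coordinates rather than to quantization bins in feature space.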
Experiments TRECVID 2005, 10 concepts 170 hours news (MSNBC, NBC Nightly News, CNN, LBC, CCTV, NTDTV) TRECVID 2006, 20 concepts 170+150 hours news Keypoint descriptor: 64-dimensional SURF feature (Speeded Up Robust Features) AP: Non-interpolated Average Precision MAP: Mean Average Precision (7 concepts)
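For readers unfamiliar with the metrics, here is a minimal sketch of non-interpolated average precision and its mean over concepts. It assumes each ranked list is given as relevance flags (1 for relevant, 0 otherwise) and, unless `num_relevant` is supplied, that every relevant shot appears somewhere in the list; the function names are illustrative.

```python
def average_precision(ranked_relevance, num_relevant=None) -> float:
    """Non-interpolated AP: mean of precision@k over the ranks k of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    total = num_relevant if num_relevant is not None else hits
    return precision_sum / total if total else 0.0


def mean_average_precision(runs) -> float:
    """MAP: the average of AP over several concepts (one ranked list per concept)."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Hypothetical ranked results for two concepts.
print(average_precision([1, 0, 1, 0]))
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```

Passing `num_relevant` matters when the ranked list is truncated, since relevant shots that were never retrieved should still lower the score.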
TRECVID Data d: training data, t: testing data
Coarse vs. Fine Granulation MAP: 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront
Multi-granulation MAP: 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront
Multi-granulation (2) MAP: 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront
Multi-Granular & Multi-modal • TRECVID 2005 (Video Retrieval Evaluation Conference) • 86.6 hours of news videos (45766 shots in 140 video clips) • Features: A: automatic speech recognition text, T: visual texture, R: color of segmented image regions
PMSRA: Probabilistic Model Supported Rank Aggregation
TRECVID: Text Retrieval Conference (TREC) Video Retrieval Evaluation
Speech Information • The coarsest: Global Features (one vector) • The finest: Sampling
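Speech Features at Different Granularities • Choice of speech unit (granularity): phoneme, syllable, word, … • Choice of speech parameters • MFCC: Mel Frequency Cepstral Coefficients • LSP: Line Spectrum Pair • ICA (Independent Component Analysis) • Multi-(granularity) feature fusion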
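The "Mel" in MFCC refers to a perceptual frequency scale. The sketch below uses one widely used conversion formula (several variants exist, so treat the constants as an assumption) and shows how filter center frequencies spaced evenly in mel become progressively wider apart in Hz. The function names and the 0 to 4000 Hz range are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mel (common 2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min_hz: float, f_max_hz: float):
    """Center frequencies (Hz) of n_filters filters equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(10, 0.0, 4000.0)
print([round(c) for c in centers])
```

A full MFCC pipeline would additionally apply framing, a power spectrum, triangular filters at these centers, a logarithm, and a discrete cosine transform, none of which is shown here.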
3. Structural Model • Temporal Model (HMM) • Spatial Model
Experiments • 4002 Corel images (384×256 or 256×384) • 11 basic (region) concepts • Features: color moment + wavelet • 5 models: 2 without structural knowledge (GMM, SVM) and 3 with structural knowledge (HMM*, MRF*, CRF*)
Spatial Structural Representation • n images, each image has m_i = H×V grids • (a) i.i.d. generative model • (b) i.i.d. discriminative model • (c) 2-dimensional hidden Markov model (2D HMM) • (d) Markov Random Field (MRF) • (e) Conditional Random Field (CRF)