Computer Seeing and Hearing: A Dream of AI

Presentation Transcript


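  1. Computer Seeing and Hearing: A Dream of AI. Bo Zhang (张钹). School of Information Science and Technology, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Tsinghua National Laboratory for Information Science and Technology; State Key Laboratory of Intelligent Technology and Systems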

  2. Computer Vision / Hearing: Is it possible? Yes / No. It is just a daydream!

  3. The Characteristics of Auditory Information (Data)
     • Ears, earphones
     • A continuous wave
     • Digital data: 20K-100K bits/s
     • Sparseness (redundancy)
     • Noisy

  4. The Characteristics of Visual Information (Data)
     • Eyes, digital camera
     • Pixel-based (million, ten million bits)
     • Sparseness (redundancy)
     • Noisy
     • Eyes: a sequence of images
     • 10^9 bits/sec

  5. The Sparseness of Auditory Signal (sampling frequency, bit resolution)
     • Broadcast quality: 48 kHz
     • CD quality: 44 kHz, 16 bits
     • Radio quality: 22 kHz, 8 bits
     • Acceptable music: 11 kHz, 4 bits
     • Acceptable speech: 5 kHz
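
     The pairs above imply a raw data rate of sampling frequency times bit resolution. The sketch below only works through that arithmetic for the three rows that list both values, taking the pairs exactly as they appear in this transcript and assuming a single (mono) channel with no compression. The names `QUALITY_LEVELS` and `raw_bit_rate` are illustrative and do not come from the slides.

```python
# Raw (uncompressed) mono bit rate = sampling frequency x bits per sample.
# Only rows of the slide that list both values are included (assumption:
# the frequency/bit pairs are paired as they appear in the transcript).
QUALITY_LEVELS = {
    "CD quality": (44_000, 16),
    "radio quality": (22_000, 8),
    "acceptable music": (11_000, 4),
}


def raw_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Return the uncompressed single-channel bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample


for name, (rate, bits) in QUALITY_LEVELS.items():
    print(f"{name}: {raw_bit_rate(rate, bits) / 1000:.0f} kbit/s")
```

     Under these assumptions the lowest listed music setting comes to 44 kbit/s, which falls inside the 20K-100K bits/s range quoted on slide 3.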

  6. The Sparseness of Visual Signal: the relationship between resolution and recognition rate (conceptual)

  7. An Ill-posed Problem
     • Microphone (ears) / camera (eyes)
     • Sparse, redundant, noisy data (110000111100011100011000…)
     • Existence, uniqueness, stability
     • Speaker-invariant vowel representation
     • Vowel-invariant speaker representation (object-invariant representation)

  8. Part 1: Segmentation & Recognition

  9. Image Segmentation vs. Recognition: Where is the object? What is the object? Which comes first, the chicken or the egg?

  10. Speech Segmentation vs. Recognition: What? Where?

  11. Technical Difficulties
     • Sparse, redundant, noisy data
     • A robust detector
     • An invariant descriptor
     • Speaker-invariant vowel representation
     • Vowel-invariant speaker representation

  12. How Do Humans Solve It?
     • Top-down feedback
     • Local connection
     • Data-driven
     • High-level a priori knowledge
     • From egg to chicken

  13. The Relation Between Activation Patterns and Early Stages of Sound Processing. Speech encoding occurs not only in specialized high-level regions but also in the early stages of sound processing. Early sound processing may exhibit complex spectrotemporal receptive fields and may participate in the high-level encoding of auditory objects, e.g., via local feedback.

  14. Multi-layer Neural Network with Feedback Connections. G. E. Hinton et al., "The wake-sleep algorithm for unsupervised neural networks", Science, vol. 268, 26 May 1995, pp. 1158-1161

  15. Representation. RBM: Restricted Boltzmann Machine

  16. Experimental Results. G. E. Hinton, "Learning multiple layers of representation", Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428-434, 2007

  17. Part 2: Feature Extraction

  18. Features a Computer Can Extract Robustly
     • Sparse, redundant, noisy data
     • Statistical approaches
     • Speech-based invariant statistics (features)
     • Speaker-invariant vowel representation
     • Vowel-invariant speaker representation

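  19. Statistical Method
     • Select a speech training corpus
     • Extract speech features
     • Unsupervised learning (classification)
     • Classification criterion: generalization
     • Which features to extract? Those a computer can detect robustly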

  20. Representation at Different Granularities (of an image)
     • The coarsest: global features, one vector
     • The finest: pixel-based, 1280×800×3 vectors

  21. Pixel-based Representation: the finest representation
     • million×3-dimensional vectors
     • All the details

  22. Global Features: the coarsest representation. Color moments (N: the number of pixels; P: the value of each color). One 9-dimensional vector
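
     The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below assumes the commonly used definition of color moments: the mean, standard deviation, and cube root of the third central moment of each of three color channels, which yields the 9 values mentioned above. Whether the talk used exactly these three moments is an assumption, and `color_moments` is an illustrative name.

```python
import math


def color_moments(pixels):
    """Return [mean, std, skewness] for each of 3 channels (9 values).

    `pixels` is a list of (c1, c2, c3) tuples; N = len(pixels).
    Assumed definition: first moment, square root of the second central
    moment, and signed cube root of the third central moment.
    """
    n = len(pixels)
    features = []
    for ch in range(3):
        values = [p[ch] for p in pixels]
        mean = sum(values) / n
        second = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root keeps the sign of the third central moment.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        features.extend([mean, math.sqrt(second), skew])
    return features


# A tiny 3-pixel "image" with identical channels, for illustration only.
vector = color_moments([(0, 0, 0), (0, 0, 0), (3, 3, 3)])
print(len(vector))
```

     Whatever the image size, the result is a single short vector, which is what makes this the coarsest representation in the comparison with the pixel-based one.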

  23. Coarse vs. Fine Representation

  24. Representation with Middle Grain-Size: region-based representation

  25. Local (Spatial) Feature: Region-01, Region-11, Region-12; foreground vs. background

  26. Vector Representation: a set of (tens of) vectors with different lengths. Similarity measure: weighted

  27. Region-adaptive Grid Partition Jinhui Yuan (2005…)

  28. Hierarchical (Granular) Structure
     • Top level: semantics (text, image)
     • (X, F, f): the finest space
     • ([X], [F], [f]): a coarse space
     • [X]: the quotient space of X (each element is an equivalence class)
     • [F]: the quotient structure of F
     • [f]: the quotient attributes of f
     • Semantic gap between the levels
     • Bottom level: primitives (words, pixels)
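
     A minimal sketch of the quotient idea, under stated assumptions: elements of X are grouped into equivalence classes to form [X], and a coarse attribute [f] is derived per class. The slide does not say how [f] is computed, so averaging f over each class is purely an assumption here, and the structure [F] is left out. The function name `quotient_space` and its parameters are illustrative.

```python
from collections import defaultdict


def quotient_space(elements, attribute, class_key):
    """Group `elements` into equivalence classes and coarsen an attribute.

    class_key(x) decides which class x belongs to (x ~ y iff keys match).
    The quotient attribute is taken to be the class mean of `attribute`,
    which is an illustrative choice, not one stated in the slides.
    """
    classes = defaultdict(list)
    for x in elements:
        classes[class_key(x)].append(x)
    quotient_attr = {
        key: sum(attribute(x) for x in members) / len(members)
        for key, members in classes.items()
    }
    return dict(classes), quotient_attr


# Eight "pixels" on a line, merged into two coarse cells of four.
cells, cell_values = quotient_space(range(8), lambda x: float(x), lambda x: x // 4)
print(cells, cell_values)
```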

  29. Kernels
     • PM: Pyramid Match (feature space, quantization level)
     • SPM: Spatial Pyramid Match (physical space, grid)
     • FESCO: Feature Spatial Covariant Kernel
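
     To make the "quantization level" idea concrete, here is a small sketch of a pyramid match score on one-dimensional features. It assumes the usual formulation: histogram the two feature sets with bins of width 2^i at level i, count matches by histogram intersection, and weight the matches that first appear at level i by 1/2^i. The slides do not give these details, the score is left unnormalized, and this is not the FESCO kernel. All function names are illustrative.

```python
from collections import Counter


def histogram(points, bin_width):
    """Count how many points fall in each bin of the given width."""
    return Counter(int(p // bin_width) for p in points)


def intersection(h1, h2):
    """Histogram intersection: number of points matched bin by bin."""
    return sum(min(count, h2[key]) for key, count in h1.items())


def pyramid_match(xs, ys, levels):
    """Unnormalized pyramid match score over levels 0..levels."""
    score = 0.0
    previous = 0
    for i in range(levels + 1):
        width = 2 ** i
        current = intersection(histogram(xs, width), histogram(ys, width))
        # Only matches that are new at this level count, weighted by 1/2^i.
        score += (current - previous) / width
        previous = current
    return score


print(pyramid_match([0.5, 5.5], [0.7, 6.5], levels=2))
```

     In this toy case one pair already matches at the finest level and the other only at bin width 4, so the coarse match contributes a quarter as much as the fine one. A spatial pyramid applies the same counting to image-grid cells instead of feature-space bins.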

  30. Concept Detection from Video Shots

  31. Experiments
     • TRECVID 2005: 10 concepts, 170 hours of news (MSNBC, NBC Nightly News, CNN, LBC, CCTV, NTDTV)
     • TRECVID 2006: 20 concepts, 170+150 hours of news
     • Keypoint descriptor: 64-dimensional SURF (Speeded Up Robust Features) feature
     • AP: non-interpolated average precision
     • MAP: mean average precision (7 concepts)
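
     For readers unfamiliar with the two measures, the sketch below computes non-interpolated average precision for one ranked list and the mean over several lists (for example, one list per concept). It assumes every relevant item appears somewhere in the ranked list. Official TRECVID scoring also fixes a result-list depth and normalizes by the total number of relevant shots, which this sketch does not model. Function names are illustrative.

```python
def average_precision(ranked_relevance):
    """Non-interpolated AP for one ranked list of True/False judgments."""
    total_relevant = sum(1 for rel in ranked_relevance if rel)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / total_relevant


def mean_average_precision(runs):
    """MAP: the plain mean of AP over several ranked lists."""
    return sum(average_precision(run) for run in runs) / len(runs)


print(average_precision([True, False, True]))
print(mean_average_precision([[True, False, True], [False, True]]))
```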

  32. TRECVID Data d: training data, t: testing data

  33. Coarse vs. Fine Granulation. MAP over 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront

  34. Multi-granulation. MAP over 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront

  35. Multi-granulation (2). MAP over 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront

  36. Multi-Granular & Multi-modal
     • TRECVID 2005 (Video Retrieval Evaluation Conference)
     • 86.6 hours of news videos (45766 shots in 140 video clips)
     • Features: A: automatic speech recognition text; T: visual texture; R: color of segmented image regions

  37. PMSRA: Probabilistic Model Supported Rank Aggregation

  38. Comparison between Uni-modal and Multi-granular, Multi-modal

  39. TRECVID: Text Retrieval Conference (TREC) Video Retrieval Evaluation

  40. Sound Waves and Spectrograms

  41. Speech Information
     • The coarsest: global features, one vector
     • The finest: sampling

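  42. Speech Features at Different Granularities
     • Choice of speech unit (granularity): phoneme, syllable, word, …
     • Choice of speech parameters:
       MFCC: Mel Frequency Cepstral Coefficients
       LSP: Line Spectrum Pair
       ICA: Independent Component Analysis
     • Fusion of multiple (multi-granularity) features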
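
     As background for the MFCC item, the sketch below shows the mel frequency warping that gives MFCC its name, using one common conversion formula, mel = 2595 · log10(1 + f/700). Other variants of the mel scale exist, and the slides do not say which one was used, so the formula and the helper names here are assumptions. The sketch covers only the placement of filter centers, not the full MFCC pipeline (framing, filter bank energies, logarithm, DCT).

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common formula, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(low_hz, high_hz, n_filters):
    """Center frequencies (Hz) of n_filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (k + 1)) for k in range(n_filters)]


centers = mel_filter_centers(0.0, 4000.0, 10)
print([round(c) for c in centers])
```

     The centers are evenly spaced in mels but increasingly far apart in Hz, so low frequencies are represented at a finer resolution than high ones.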

  43. Part 3: Structural Model
     • Temporal model (HMM)
     • Spatial model

  44. Temporal Structure of Speech: a multi-granular structure

  45. Image Region Annotation: horse, sky, mountain, grass, tree

  46. Region-adaptive Grid Partition (2)

  47. Experiments
     • 4002 Corel images (384×256 or 256×384)
     • 11 basic (region) concepts
     • Features: color moment + wavelet
     • 5 models: 2 without structural knowledge (GMM, SVM) and 3 with structural knowledge (HMM*, MRF*, CRF*)

  48. Image Region Annotation

  49. Image Region Annotation

  50. Spatial Structural Representation: n images, each image has m_i = H×V grids
     (a) i.i.d. generative model
     (b) i.i.d. discriminative model
     (c) 2-dimensional hidden Markov model (2D HMM)
     (d) Markov Random Field (MRF)
     (e) Conditional Random Field (CRF)
