Generative Modeling of Audible Shapes for Object Perception

Generate Audio-Visual Data

Posted by Tab on November 4, 2017

Outline

The paper proposes a pipeline that generates audio-visual data from 3D object shapes and their physical properties:

  • Construct a synthetic audio-visual dataset: Sound-20K
  • Demonstrate that auditory and visual information play complementary roles for object perception tasks
  • Show that the learned representation transfers to real-world scenarios

Figure 1: Sample of audio-visual data

Background

  • Humans can recover rich information from a short audio clip of objects interacting
  • Collecting large-scale audio recordings is challenging:
    • Labeling objects at a finer granularity (e.g., wood type) requires strong domain knowledge (cf. AudioSet);
    • Recorded sound is usually a mixture of multiple sound sources and background noise (e.g., the Greatest Hits dataset)
  • Instead, the paper uses synthesized audio-visual data

Related Work

  • Human Visual and Auditory Perception
  • Physical Object Modeling
    • 3D shape modeling
    • 3D shape attributes
    • Material recognition
  • Synthesizing Data for Learning
    • For tasks like viewpoint estimation and 3D reconstruction
    • Sound simulation works (FEM, BEM, Rayleigh method)
  • Learning from Visual and Auditory Data

Algorithm

Overview

Figure 2: Overview of the three engines in the generative model

Engine 1 & 2

  • Physics Engine
    • Simulates object motion (positions over time) and collisions, which provide the excitations for the vibration stage
    • Uses the Bullet simulation engine (a minimal sketch follows this list)
  • Graphics Engine
    • Renders the video with Blender and its Cycles ray tracer
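
Below is a minimal sketch of what the physics-engine stage might look like with pybullet, the Python binding of Bullet. The paper does not publish its exact simulation setup, so the object files, material parameters, and step counts here are placeholders; the point is only that each simulated contact yields a time, position, and force that the sound engine can consume.

```python
import pybullet as p
import pybullet_data

# Headless physics simulation: drop a small cube onto a plane and log contacts.
p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")
cube = p.loadURDF("cube_small.urdf", basePosition=[0, 0, 1.0])

# Placeholder material parameters (restitution/friction); the dataset varies these.
p.changeDynamics(cube, -1, restitution=0.8, lateralFriction=0.5)
p.changeDynamics(plane, -1, restitution=0.8)

impacts = []
dt = 1.0 / 240.0                       # Bullet's default time step
for step in range(2400):               # 10 seconds of simulation
    p.stepSimulation()
    for c in p.getContactPoints(cube, plane):
        impacts.append({
            "time": step * dt,
            "position": c[5],          # contact position on the cube (world frame)
            "normal_force": c[9],      # applied normal force at this contact
        })

p.disconnect()
print(f"recorded {len(impacts)} contact events")
```

Each entry in `impacts` plays the role of the force term $f$ that excites the vibration model described below.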

Engine 3

Collision to Vibration (via FEM)
The object's surface mesh is first converted to a volumetric tetrahedral mesh.
Vibration equation:
$M\frac{\partial^{2}u}{\partial t^{2}}+Ku=f$
This is reduced to a generalized eigenvalue problem whose eigenvectors $\Phi$ satisfy
$\Phi^TM\Phi=I,\quad \Phi^TK\Phi=D$
The displacement $u$ is then decomposed as
$u=\Phi c$
and the equation becomes a set of decoupled modal oscillators:
$\frac{\partial^{2}c}{\partial t^{2}}+Dc=\Phi^Tf$
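
A minimal sketch of this modal step, assuming scipy is used for the eigendecomposition: scipy.linalg.eigh solves $Kv=\lambda Mv$ and returns M-orthonormal eigenvectors, which is exactly the normalization $\Phi^TM\Phi=I$ above. FEM assembly of $M$ and $K$ from the tetrahedral mesh is omitted; small random symmetric positive-definite matrices stand in for it.

```python
import numpy as np
from scipy.linalg import eigh

def modal_decomposition(M, K, n_modes=20):
    """Solve K Phi = M Phi D with Phi^T M Phi = I.

    M, K : (n, n) FEM mass and stiffness matrices
    Returns the first n_modes mode shapes Phi and the diagonal of D
    (squared angular eigenfrequencies omega_i^2).
    """
    w, Phi = eigh(K, M)                # generalized symmetric eigenproblem
    return Phi[:, :n_modes], w[:n_modes]

# Placeholder matrices standing in for real FEM output.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
M = A @ A.T + 50 * np.eye(50)              # symmetric positive-definite "mass"
B = rng.standard_normal((50, 50))
K = 1e6 * (B @ B.T + 50 * np.eye(50))      # "stiffness", scaled to give audible modes
Phi, omega_sq = modal_decomposition(M, K, n_modes=5)
print("modal frequencies (Hz):", np.sqrt(omega_sq) / (2 * np.pi))
```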

Solving the Wave Equation (via BEM)
Wave equation for the sound pressure $p$:
$(\nabla^2-\frac{1}{v^2}\frac{\partial^{2}}{\partial t^{2}})p(x,t)=0$
Boundary condition from the vibration solution, one per mode $i$:
$\frac{\partial}{\partial n}p_i(x,0)=\Phi_i(x),\quad x\in S$
Converting to the frequency domain yields a Helmholtz problem:
$(\nabla^2+k_i^2)q_i(x)=0,\quad \text{s.t.}\ \frac{\partial}{\partial n}q_i(x)=\Phi_i(x),\ x\in S$
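
The frequency-domain form follows from a time-harmonic ansatz for each mode; filling in the step the outline skips (here $\omega_i$ is the $i$-th modal frequency obtained from $D$ above, and $v$ the speed of sound):

$p_i(x,t)=q_i(x)e^{j\omega_i t}\;\Rightarrow\;\nabla^2 q_i(x)+\frac{\omega_i^2}{v^2}q_i(x)=0\;\Rightarrow\;(\nabla^2+k_i^2)q_i(x)=0,\quad k_i=\frac{\omega_i}{v}$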

Offline-Online Decomposition
Running BEM for every object and collision is expensive, so the sound pressure fields generated by surface points are pre-computed offline; at run time the audio is assembled from these pre-computed fields instead of solving the wave equation from scratch (a toy sketch follows).
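
A toy sketch of what the online stage could look like, assuming the offline stage has already reduced each mode's pre-computed pressure field to a single transfer value at a fixed listener position. All modal values below are placeholders, and the function is an illustration of summing damped modal sinusoids, not the authors' implementation.

```python
import numpy as np

SR = 44100  # audio sample rate (Hz)

def synthesize_impact(mode_freqs, mode_damping, mode_pressure, mode_gains,
                      duration=1.0, sr=SR):
    """Online stage: sum damped sinusoids for one collision.

    mode_freqs    : modal angular frequencies omega_i (rad/s) from the FEM stage
    mode_damping  : per-mode decay rates (placeholder damping values)
    mode_pressure : pre-computed (offline) pressure amplitude of each mode
                    at the listener position
    mode_gains    : how strongly this collision excites each mode (Phi^T f)
    """
    t = np.arange(int(duration * sr)) / sr
    audio = np.zeros_like(t)
    for w, d, pr, g in zip(mode_freqs, mode_damping, mode_pressure, mode_gains):
        audio += g * pr * np.exp(-d * t) * np.sin(w * t)
    return audio / (np.max(np.abs(audio)) + 1e-9)   # normalize to [-1, 1]

# Placeholder modal data standing in for the offline FEM/BEM outputs.
freqs = 2 * np.pi * np.array([440.0, 880.0, 1320.0])
damping = np.array([8.0, 12.0, 20.0])
pressure = np.array([1.0, 0.6, 0.3])
gains = np.array([1.0, 0.5, 0.2])
clip = synthesize_impact(freqs, damping, pressure, gains)
```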


Experiment

1. Audio Synthesis Validation

Figure 3: The authors validate the audio synthesis pipeline through carefully designed physics experiments. They record the sound of four tiles of different materials (a-b) and compare the recorded spectra with those of audio synthesized from the corresponding physical properties (c). They also conduct a behavioral study, asking humans which of two sounds better matches the image; results are shown in (d).

2. The Sound-20K Dataset

  • 20,378 audio-video pairs and other related data
  • Shape
    • 39 3D shapes from ShapeNet (6 super-categories, 21 categories)
    • 3D shape attributes: planarity, non-planarity, cylindrical surfaces, roughness, mainly empty, holes, thin structures, mirror symmetry, and cubic aspect ratio
  • 7 Materials
    • Ceramic, polystyrene, steel, medium-density fiberboard (MDF), wood, polycarbonate, and aluminum
    • Properties: density, friction coefficient, coefficient of restitution, Young's modulus, Poisson's ratio, and Rayleigh damping coefficients
  • 22 Scenarios
    • 3 levels of complexity
    • Independent of shape and material designations
  • Result
    • Dataset of 20,378 videos with audio
    • 2 constraints: shapes have similar dimensions, and relative object sizes are reasonable

Figure 4: Sample videos in the Sound-20K dataset

Figure 5: Analysis of categories and data distribution in the dataset

3. Object Perception with Audio-Visual Data

  • Datasets
    • Sound-20K (20,378 videos with audio)
    • Physics 101 (15,190 videos, 101 objects, 5 scenarios)
      • Annotation: material, mass, and volume
    • The Greatest Hits (977 videos recorded with a drumstick, roughly 48 actions per video, 46,577 actions in total)
      • Annotation: material label of the target object

Method

Figure 6: Method for object perception with audio-visual data

Figure 7: Results of material classification

4. Transferring from Synthetic to Real Data

Figure 8: Results of material classification with transfer learning

Figure 9: Results of attribute recognition with transfer learning

Discussion

  • Applications to video tasks
  • The role of synthesized data
  • How to combine visual and auditory information
  • Applications of simulation and physics

References

  1. Generative Modeling of Audible Shapes for Object Perception. Zhang et al. ICCV 2017.
  2. Visually Indicated Sounds. Owens et al. CVPR 2016.
  3. SoundNet: Learning Sound Representations from Unlabeled Video. Aytar et al. NIPS 2016.
  4. 3D Shape Attributes. Fouhey et al. CVPR 2016.