Outline
The paper proposes a pipeline that generates audio-visual data from 3D object shapes and their physical properties:
- Construct a synthetic audio-visual dataset: Sound-20K
- Demonstrate that auditory and visual information play complementary roles in object perception tasks
- The learned representation transfers to real-world scenarios
Background
- Humans can recover rich information from a short audio clip of objects interacting
- Collecting large-scale audio recordings is challenging:
- Labeling objects at finer granularity (e.g., wood type) requires strong domain knowledge (cf. AudioSet);
- Sound is generally a mixture of multiple sound sources and background noise (e.g., The Greatest Hits dataset)
- Instead, the paper uses synthesized audio-visual data
Related Work
- Human Visual and Auditory Perception
- Physical Object Modeling
- 3D shape modeling
- 3D shape attributes
- Material recognition
- Synthesizing Data for Learning
- For tasks like viewpoint estimation and 3D reconstruction
- Sound simulation work (FEM, BEM, the Rayleigh method)
- Learning from Visual and Auditory Data
Algorithm
Overview
Engines 1 & 2: Physics and Graphics
- Physics Engine
- Maps motion $\to$ object positions and collisions $\to$ vibration excitations
- Uses the Bullet simulation engine (see the first sketch after this list)
- Graphics Engine
- Uses Blender and its Cycles ray tracer (see the second sketch after this list)
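A minimal sketch of what the physics-engine stage could look like with PyBullet: the paper states that Bullet is used, but this script, its asset names, and all parameters are my own illustration, not the authors' code.

```python
# Hypothetical sketch: simulate a falling object with PyBullet and log
# collision events, which would later drive the audio engine.
# Asset names ("plane.urdf", "cube_small.urdf") come from pybullet_data,
# not from the paper.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless mode
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

ground = p.loadURDF("plane.urdf")
obj = p.loadURDF("cube_small.urdf", basePosition=[0, 0, 1.0])

dt = 1.0 / 240.0                                       # Bullet's default timestep
for step in range(480):                                # two seconds of simulation
    p.stepSimulation()
    for contact in p.getContactPoints(obj, ground):
        pos_on_obj = contact[5]                        # contact position on obj
        normal_force = contact[9]                      # contact normal force
        print(f"t={step * dt:.4f}s hit at {pos_on_obj}, force {normal_force:.3f}")

p.disconnect()
```

For the graphics stage, the paper renders with Blender's Cycles; a tiny Blender-Python snippet (runs only inside Blender's bundled interpreter; the output path is a placeholder) might look like:

```python
import bpy  # available only inside Blender

scene = bpy.context.scene
scene.render.engine = 'CYCLES'            # select the Cycles ray tracer
scene.render.filepath = '/tmp/frame_'     # placeholder output path
bpy.ops.render.render(animation=True)     # render the simulated trajectory
```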
Engine 3: Audio
Collision to Vibration (via FEM)
Surface mesh $\to$ a volumetric tetrahedral mesh
Vibration Equation:
$M\frac{\partial^{2}u}{\partial t^{2}}+Ku=f$
Convert this to a generalized eigenvalue problem: find $\Phi$ and diagonal $D$ with
$\Phi^T M \Phi=I, \quad \Phi^T K \Phi=D$
The displacement $u$ then decomposes into modal coordinates $c$:
$u=\Phi c$
and the equation becomes a set of decoupled oscillators, one per mode:
$\frac{\partial^{2}c}{\partial t^{2}}+Dc=\Phi^T f$
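A minimal numerical sketch of this modal step, assuming the mass matrix $M$ and stiffness matrix $K$ have already been assembled from the tetrahedral mesh (the toy matrices below are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import eigh

def modal_decomposition(M, K):
    """Solve the generalized eigenproblem K @ phi = w2 * M @ phi.

    scipy's eigh returns eigenvectors that are M-orthonormal,
    i.e. Phi.T @ M @ Phi = I, and hence Phi.T @ K @ Phi = diag(D).
    """
    D, Phi = eigh(K, M)   # D: squared angular frequencies, ascending
    return D, Phi

# Toy 2-DOF system just to exercise the function.
M = np.diag([1.0, 2.0])
K = np.array([[4.0, -1.0],
              [-1.0, 3.0]])
D, Phi = modal_decomposition(M, K)
print(np.sqrt(D) / (2 * np.pi))   # natural frequencies in Hz

# Each modal coordinate then evolves independently:
#   c_i'' + D_i * c_i = (Phi.T @ f)_i
# one harmonic oscillator per mode (damped once the Rayleigh damping
# coefficients from the material properties are added).
```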
Solving the Wave Equation (via BEM)
Wave Equation:
$(\nabla^2-\frac{1}{v^2}\frac{\partial^{2}}{\partial t^{2}})p(x,t)=0$
Boundary condition based on the vibration modes:
$\frac{\partial p_i}{\partial n}(x,0)=\Phi_i(x),\quad x\in S$
Then convert the equation to the frequency domain, giving a Helmholtz equation per mode with $k_i=\omega_i/v$:
$(\nabla^2+k_i^2)\,q_i(x)=0 \quad \text{s.t.} \quad \frac{\partial q_i}{\partial n}(x)=\Phi_i(x),\; x\in S$
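For completeness, the frequency-domain form follows from a time-harmonic ansatz for each mode (a standard derivation, not spelled out in the notes): substituting
$p_i(x,t)=q_i(x)\,e^{j\omega_i t} \;\Rightarrow\; \frac{1}{v^2}\frac{\partial^{2}p_i}{\partial t^{2}}=-\frac{\omega_i^2}{v^2}\,q_i(x)\,e^{j\omega_i t}$
into the wave equation and dividing out $e^{j\omega_i t}$ gives $(\nabla^2+k_i^2)\,q_i=0$ with $k_i=\omega_i/v$, and the Neumann boundary condition on $p_i$ carries over to $q_i$ on $S$.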
Offline-Online Decomposition
Pre-compute the sound pressure fields generated by surface points offline, so that online synthesis reduces to weighted sums of the cached fields (see the sketch below)
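A hedged sketch of this decomposition, with random placeholder arrays standing in for the precomputed BEM solutions (all shapes and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_modes, n_listeners, n_samples = 32, 1000, 44100

# Offline: for each vibration mode, the BEM solve gives the sound
# pressure at every candidate listener position (placeholders here).
pressure_fields = rng.standard_normal((n_modes, n_listeners))

# Online: the simulation produces modal amplitudes c_i(t); the
# waveform at a listener is a weighted sum of the cached fields.
modal_amplitudes = rng.standard_normal((n_samples, n_modes))
listener = 123
waveform = modal_amplitudes @ pressure_fields[:, listener]
print(waveform.shape)   # (44100,) -> one second of audio at 44.1 kHz
```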
Experiment
1. Audio Synthesis Validation (Figure 3): The authors validate the audio synthesis pipeline through carefully designed physics experiments. They record the sound of four tiles of different materials (a-b) and compare its spectrum with that of audio synthesized using the corresponding physical properties (c). They also conduct a behavioral study, asking humans which of the two sounds matches the image better, with results shown in (d).
2. The Sound-20K Dataset
- 20,378 audio-video pairs and other related data
- Shape
- 39 3D shapes from ShapeNet (6 super-categories, 21 categories)
- 3D shape attributes: planarity, non-planarity, cylindrical parts, roughness, mainly empty, holes, thin structures, mirror symmetry, and cubic aspect ratio
- 7 Materials
- Ceramic, polystyrene, steel, medium-density fiberboard (MDF), wood, polycarbonate, and aluminum
- Properties: density, friction coefficient, coefficient of restitution, Young's modulus, Poisson's ratio, and Rayleigh damping coefficients
- 22 Scenarios
- 3 levels of complexity
- Independent of shape and material designations
- Result
- Dataset of 20,378 videos with audio
- Two constraints: shapes in a scene have similar dimensions & relative object sizes are reasonable
3. Object Perception with Audio-Visual Data
- Datasets
- Sound-20K (20,378 videos with audio)
- Physics 101 (15,190 videos, 101 objects, 5 scenarios)
- Annotation: material, mass and volume
- The Greatest Hits (977 videos of drumstick interactions; ~48 actions per video, 46,577 actions in total)
- Annotation: material label of target object
Method
4. Transferring from Synthetic to Real Data
Discussion
- Applications in video tasks
- On the use of synthesized data
- How to combine visual and auditory information
- Applications of simulation and physics
Reference
- Generative Modeling of Audible Shapes for Object Perception. Zhang et al. ICCV 2017.
- Visually Indicated Sounds. Owens et al. CVPR 2016.
- SoundNet: Learning Sound Representations from Unlabeled Video. Aytar et al. NIPS 2016.
- 3D Shape Attributes. Fouhey et al. CVPR 2016.