Outline
The paper proposes a pipeline that generates audio-visual data from 3D object shapes and their physical properties:
- Construct a synthetic audio-visual dataset: Sound-20K
- Demonstrate that auditory and visual information play complementary roles in object perception tasks
- The learned representation transfers to real-world scenarios
Background
- Humans can recover rich information from a short audio clip of objects interacting
- Collecting large-scale audio recordings is challenging:
- Labeling objects at finer granularity (e.g., wood type) requires strong domain knowledge (cf. AudioSet);
- Sound is generally a mixture of multiple sound sources and background noise (e.g., The Greatest Hits dataset)
- Instead, the paper uses synthesized audio-visual data
Related Work
- Human Visual and Auditory Perception
- Physical Object Modeling
- 3D shape modeling
- 3D shape attributes
- Material recognition
- Synthesizing Data for Learning
- For tasks like viewpoint estimation and 3D reconstruction
- Sound simulation work (FEM, BEM, the Rayleigh method)
- Learning from Visual and Auditory Data
Algorithm
Overview
Engines 1 & 2: Physics and Graphics
- Physics Engine
- Maps motion $\to$ object positions and collisions $\to$ vibration excitations
- Uses the Bullet simulation engine (see the first sketch after this list)
- Graphics Engine
- Uses Blender and its Cycles ray tracer (see the second sketch after this list)
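A minimal sketch of what the physics-engine stage could look like with PyBullet: the paper states that Bullet is used, but this script, its asset names, and all parameters are my own illustration, not the authors' code.

```python
# Hypothetical sketch: simulate a falling object with PyBullet and log
# collision events, which would later drive the audio engine.
# Asset names ("plane.urdf", "cube_small.urdf") come from pybullet_data,
# not from the paper.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless mode
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

ground = p.loadURDF("plane.urdf")
obj = p.loadURDF("cube_small.urdf", basePosition=[0, 0, 1.0])

dt = 1.0 / 240.0                                       # Bullet's default timestep
for step in range(480):                                # two seconds of simulation
    p.stepSimulation()
    for contact in p.getContactPoints(obj, ground):
        pos_on_obj = contact[5]                        # contact position on obj
        normal_force = contact[9]                      # contact normal force
        print(f"t={step * dt:.4f}s hit at {pos_on_obj}, force {normal_force:.3f}")

p.disconnect()
```

For the graphics stage, the paper renders with Blender's Cycles; a tiny Blender-Python snippet (runs only inside Blender's bundled interpreter; the output path is a placeholder) might look like:

```python
import bpy  # available only inside Blender

scene = bpy.context.scene
scene.render.engine = 'CYCLES'            # select the Cycles ray tracer
scene.render.filepath = '/tmp/frame_'     # placeholder output path
bpy.ops.render.render(animation=True)     # render the simulated trajectory
```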
Engine 3: Audio
Collision to Vibration (via FEM)
Surface mesh $\to$ a volumetric tetrahedral mesh
Vibration Equation:
$M\frac{\partial^{2}u}{\partial t^{2}}+Ku=f$
Convert this to a generalized eigenvalue problem: find $\Phi$ and diagonal $D$ with
$\Phi^T M \Phi=I, \quad \Phi^T K \Phi=D$
The displacement $u$ then decomposes into modal coordinates $c$:
$u=\Phi c$
and the equation becomes a set of decoupled oscillators, one per mode:
$\frac{\partial^{2}c}{\partial t^{2}}+Dc=\Phi^T f$
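A minimal numerical sketch of this modal step, assuming the mass matrix $M$ and stiffness matrix $K$ have already been assembled from the tetrahedral mesh (the toy matrices below are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import eigh

def modal_decomposition(M, K):
    """Solve the generalized eigenproblem K @ phi = w2 * M @ phi.

    scipy's eigh returns eigenvectors that are M-orthonormal,
    i.e. Phi.T @ M @ Phi = I, and hence Phi.T @ K @ Phi = diag(D).
    """
    D, Phi = eigh(K, M)   # D: squared angular frequencies, ascending
    return D, Phi

# Toy 2-DOF system just to exercise the function.
M = np.diag([1.0, 2.0])
K = np.array([[4.0, -1.0],
              [-1.0, 3.0]])
D, Phi = modal_decomposition(M, K)
print(np.sqrt(D) / (2 * np.pi))   # natural frequencies in Hz

# Each modal coordinate then evolves independently:
#   c_i'' + D_i * c_i = (Phi.T @ f)_i
# one harmonic oscillator per mode (damped once the Rayleigh damping
# coefficients from the material properties are added).
```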
Solving the Wave Equation (via BEM)
Wave Equation:
$(\nabla^2-\frac{1}{v^2}\frac{\partial^{2}}{\partial t^{2}})p(x,t)=0$
Boundary condition based on the vibration modes:
$\frac{\partial p_i}{\partial n}(x,0)=\Phi_i(x),\quad x\in S$
Then convert the equation to the frequency domain, giving a Helmholtz equation per mode with $k_i=\omega_i/v$:
$(\nabla^2+k_i^2)\,q_i(x)=0 \quad \text{s.t.} \quad \frac{\partial q_i}{\partial n}(x)=\Phi_i(x),\; x\in S$
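For completeness, the frequency-domain form follows from a time-harmonic ansatz for each mode (a standard derivation, not spelled out in the notes): substituting
$p_i(x,t)=q_i(x)\,e^{j\omega_i t} \;\Rightarrow\; \frac{1}{v^2}\frac{\partial^{2}p_i}{\partial t^{2}}=-\frac{\omega_i^2}{v^2}\,q_i(x)\,e^{j\omega_i t}$
into the wave equation and dividing out $e^{j\omega_i t}$ gives $(\nabla^2+k_i^2)\,q_i=0$ with $k_i=\omega_i/v$, and the Neumann boundary condition on $p_i$ carries over to $q_i$ on $S$.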
Offline-Online Decomposition
Pre-compute the sound pressure fields generated by surface points offline, so that online synthesis reduces to weighted sums of the cached fields (see the sketch below)
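A hedged sketch of this decomposition, with random placeholder arrays standing in for the precomputed BEM solutions (all shapes and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_modes, n_listeners, n_samples = 32, 1000, 44100

# Offline: for each vibration mode, the BEM solve gives the sound
# pressure at every candidate listener position (placeholders here).
pressure_fields = rng.standard_normal((n_modes, n_listeners))

# Online: the simulation produces modal amplitudes c_i(t); the
# waveform at a listener is a weighted sum of the cached fields.
modal_amplitudes = rng.standard_normal((n_samples, n_modes))
listener = 123
waveform = modal_amplitudes @ pressure_fields[:, listener]
print(waveform.shape)   # (44100,) -> one second of audio at 44.1 kHz
```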
Experiment
1. Audio Synthesis Validation (Figure 3): The authors validate the audio synthesis pipeline through carefully designed physics experiments. They record the sound of four tiles of different materials (a-b) and compare its spectrum with that of audio synthesized using the corresponding physical properties (c). They also conduct a behavioral study, asking humans which of the two sounds matches the image better, with results shown in (d).
2. The Sound-20K Dataset
- 20,378 audio-video pairs and other related data
- Shape
- 39 3D shapes from ShapeNet (6 super-categories, 21 categories)
- 3D shape attributes: planarity, non-planarity, cylindrical parts, roughness, mainly empty, holes, thin structures, mirror symmetry, and cubic aspect ratio
- 7 Materials
- Ceramic, polystyrene, steel, medium-density fiberboard (MDF), wood, polycarbonate, and aluminum
- Properties: density, friction coefficient, coefficient of restitution, Young's modulus, Poisson's ratio, and Rayleigh damping coefficients
- 22 Scenarios
- 3 levels of complexity
- Independent of shape and material designations
- Result
- Dataset of 20,378 videos with audio
- Two constraints: shapes in a scene have similar dimensions & relative object sizes are reasonable
3. Object Perception with Audio-Visual Data
- Datasets
- Sound-20K (20,378 videos with audio)
- Physics 101 (15,190 videos, 101 objects, 5 scenarios)
- Annotation: material, mass and volume
- The Greatest Hits (977 videos of drumstick interactions; ~48 actions per video, 46,577 actions in total)
- Annotation: material label of target object
Method
4. Transferring from Synthetic to Real Data
Discussion
- Applications in video tasks
- On the use of synthesized data
- How to combine visual and auditory information
- Applications of simulation and physics
Reference
- Generative Modeling of Audible Shapes for Object Perception. Zhang et al. ICCV 2017.
- Visually Indicated Sounds. Owens et al. CVPR 2016.
- SoundNet: Learning Sound Representations from Unlabeled Video. Aytar et al. NIPS 2016.
- 3D Shape Attributes. Fouhey et al. CVPR 2016.