AnimeInterp

Deep Animation Video Interpolation in the Wild

Siyao, Li, et al.

[Paper] [Code] [Data]

Paper Reading

Introduction

Different from natural videos, animation videos have unique characteristics:

  • Texture Insufficiency: Cartoons comprise lines and smooth color pieces. The smooth areas lack texture, which makes it difficult to estimate accurate motion in animation videos.
  • Large Motions: Cartoons express stories via exaggeration. Some of the motions are non-linear and extremely large.

Along with the common challenges of natural video interpolation, such as occlusion handling, video interpolation for animations remains a challenging task.

This paper proposes an effective framework, AnimeInterp[8], with two dedicated modules, Segment-Guided Matching (SGM) and Recurrent Flow Refinement (RFR), working in a coarse-to-fine manner.
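To make the coarse-to-fine structure concrete, here is a minimal sketch of how the three stages fit together; the callables `sgm`, `rfr`, and `warp_and_synthesize` are hypothetical placeholders for the modules described in the sections below, not the repository's actual interfaces.

```python
# Minimal sketch of the coarse-to-fine pipeline, assuming hypothetical
# `sgm`, `rfr`, and `warp_and_synthesize` callables; the repository wires
# these stages together inside its own model classes.
def anime_interp(I0, I1, sgm, rfr, warp_and_synthesize):
    # 1) SGM: coarse, piece-wise optical flow from color-segment matching
    f01_coarse, f10_coarse = sgm(I0, I1)
    # 2) RFR: recurrent refinement of the coarse flow
    f01, f10 = rfr(I0, I1, f01_coarse, f10_coarse)
    # 3) Warping and synthesis: softmax splatting followed by GridNet
    return warp_and_synthesize(I0, I1, f01, f10)
```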


Contributions

  • Formally define and study the animation video interpolation problem for the first time.
  • Propose an effective animation interpolation framework named AnimeInterp with two dedicated modules to resolve the “lack of textures” and “non-linear and extremely large motion” challenges, which outperforms existing state-of-the-art methods both quantitatively and qualitatively.
  • Build a large-scale cartoon triplet dataset called ATD-12K with large content diversity representing many types of animations to test animation video interpolation methods.

Limits

  • Not mentioned

Framework with Dataset and Corresponding Code

Framework

-Dataset

The ATD-12K dataset[8] consists of triplets of animation frames from videos in the wild. It is split into 10k training samples and 2k test samples.

Specific annotations are provided in .json files and include (a loading sketch follows this list):

  • difficulty levels: 0: “Easy”, 1: “Medium”, 2: “Hard”.
  • motion RoI (Region of Interest): x, y, width, height.
  • general_motion_type: "translation", "rotation", "scaling", "deformation".
  • behavior: "speaking", "walking", "eating", "sporting", "fetching", "others".
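A minimal sketch of reading one annotation file is shown below; the exact JSON key names (`difficult`, `motion_RoI`, `general_motion_type`, `behavior`) are assumptions inferred from the fields above and are not verified against ATD-12K.

```python
import json

# Hedged sketch of loading one annotation file; the key names below are
# assumptions based on the annotation fields listed in the text.
def load_annotation(path):
    with open(path) as f:
        ann = json.load(f)
    level = ann.get("difficult")             # 0: Easy, 1: Medium, 2: Hard
    roi = ann.get("motion_RoI")              # e.g. {"x": ..., "y": ..., "width": ..., "height": ...}
    motion = ann.get("general_motion_type")  # "translation", "rotation", ...
    behavior = ann.get("behavior")           # "speaking", "walking", ...
    return level, roi, motion, behavior
```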

-Segment-Guided Matching (SGM Module)

input: \(I_{0}\), \(I_{1}\) - input images

output: \(f_{0\rightarrow1}\), \(f_{1\rightarrow0}\) - coarse optical flow

In this part, '.' refers to the directory models/sgm_model.

1. Color Piece Segmentation

A Laplacian filter is applied to extract contours of the animation frames[1]. [./gen_labelmap.py/dline_of]

The “trapped-ball” algorithm then fills the contours to generate color pieces[1]. [./linefiller & gen_labelmap.py/trapped_ball_processed]

The result is a segmentation map in which the pixels of each color piece are labeled with an identity number. [./linefiller/trappedball_fill.py/build_fill_map]
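Below is a simplified sketch of this step using OpenCV's Laplacian and connected-component labeling in place of the repo's `dline_of` and trapped-ball filling; the trapped-ball algorithm closes small gaps in the contours far more robustly than plain connected components.

```python
import cv2
import numpy as np

# Simplified sketch of color piece segmentation: a Laplacian edge map followed
# by connected-component labeling of the non-contour regions. This stands in
# for the repo's trapped-ball filling and is only illustrative.
def segment_color_pieces(frame_bgr, edge_thresh=10):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Laplacian(gray, cv2.CV_64F, ksize=3)
    non_contour = (np.abs(edges) <= edge_thresh).astype(np.uint8)
    # Each enclosed flat-color region receives its own integer label.
    num_pieces, label_map = cv2.connectedComponents(non_contour)
    return label_map  # HxW map: every color piece is marked by an identity number
```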

2. Feature Collection

Extract features from the relu1_2, relu2_2, relu3_4, and relu4_4 layers of a pretrained VGG-19 model[2]. [./my_models.py/create_VGGFeatNet]

Assemble the features belonging to each segment via super-pixel pooling[3]. [./gen_sgm.py/superpixel_pooling]
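A hedged sketch of the feature collection step is shown below; the torchvision layer indices and the simple per-segment averaging are assumptions standing in for `create_VGGFeatNet` and `superpixel_pooling`, not the repo's exact code.

```python
import torch
import torchvision

# Hedged sketch of feature collection. In torchvision's vgg19.features, the
# indices 3, 8, 17, 26 correspond to relu1_2, relu2_2, relu3_4, relu4_4.
# frame: normalized 3xHxW float tensor; label_map: HxW LongTensor of segment ids.
def collect_segment_features(frame, label_map, num_segments):
    vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
    feats, x = [], frame.unsqueeze(0)
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in (3, 8, 17, 26):
                # Upsample back to frame resolution so pooling can use label_map.
                feats.append(torch.nn.functional.interpolate(
                    x, size=frame.shape[-2:], mode="bilinear", align_corners=False))
    feat = torch.cat(feats, dim=1)[0]          # CxHxW multi-level feature
    # Super-pixel pooling: average the features inside each color piece.
    pooled = torch.zeros(num_segments, feat.shape[0])
    for s in range(num_segments):
        mask = (label_map == s)
        if mask.any():
            pooled[s] = feat[:, mask].mean(dim=1)
    return pooled                               # num_segments x C descriptors
```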

3. Color Piece Matching

Compute the affinity metric \(\mathcal{A}\) [./gen_sgm.py line 553], the distance penalty \(\mathcal{L}_{dist}\) [./gen_sgm.py line 559], the size penalty \(\mathcal{L}_{size}\) [./gen_sgm.py line 564], and the matching map \(\mathcal{M}\) [./gen_sgm.py/mutual_matching].
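The sketch below illustrates the mutual (cyclic) matching idea behind `mutual_matching`, assuming a precomputed `score` matrix that already combines \(\mathcal{A}\), \(\mathcal{L}_{dist}\), and \(\mathcal{L}_{size}\); the exact weighting used in gen_sgm.py is not reproduced here.

```python
import torch

# Hedged sketch of mutual matching over a precomputed score matrix.
# score[i, j]: matching score between segment i of I0 and segment j of I1.
def mutual_matching(score):
    best_fwd = score.argmax(dim=1)   # best match in I1 for each segment of I0
    best_bwd = score.argmax(dim=0)   # best match in I0 for each segment of I1
    matches = {}
    for i, j in enumerate(best_fwd.tolist()):
        # Keep a pair only if the choice is consistent in both directions.
        if best_bwd[j].item() == i:
            matches[i] = j
    return matches                   # segment id in I0 -> segment id in I1
```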

4. Flow Generation

Compute the coarse guidance flow \(f\). [./gen_sgm.py/get_guidance_flow]
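As an illustration of how piece-wise matches become a dense coarse flow, the sketch below assigns every pixel of a matched color piece the displacement between the two segments' centroids; `get_guidance_flow` in the repo is more elaborate than this simplification.

```python
import torch

# Illustrative simplification of flow generation: each matched color piece in
# I0 receives the centroid displacement toward its matched piece in I1.
def guidance_flow(label_map0, label_map1, matches):
    H, W = label_map0.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    flow = torch.zeros(2, H, W)
    for s0, s1 in matches.items():
        m0, m1 = (label_map0 == s0), (label_map1 == s1)
        if m0.any() and m1.any():
            dx = xs[m1].float().mean() - xs[m0].float().mean()
            dy = ys[m1].float().mean() - ys[m0].float().mean()
            flow[0][m0] = dx
            flow[1][m0] = dy
    return flow  # 2xHxW coarse flow f_{0->1}
```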

-Recurrent Flow Refinement Network (RFR Module)

input: \(I_{0}\), \(I_{1}\), \(f_{0\rightarrow1}\), \(f_{1\rightarrow0}\) - input images and coarse optical flow computed by SGM module

output: \(f^{\prime}_{0\rightarrow1}\), \(f^{\prime}_{1\rightarrow0}\) - fine flow

In this part, '.' refers to the directory models/rfr_model.

Inspired by [4], the authors design a transformer-like architecture that recurrently refines the piece-wise coarse flow; its main components are listed below, with a ConvGRU sketch after the list.

  • 3-layer Conv [./rfr_new.py/ErrorAttention]
  • Feature Net [./extractor.py/BasicEncoder]
  • ConvGRU[5] [./update.py/SepConvGRU]
  • Correlation [./corr.py/CorrBlock]
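A minimal ConvGRU cell in the spirit of RAFT's update block [4, 5] is sketched below; the repo's `SepConvGRU` additionally splits each convolution into separable 1x5 and 5x1 passes, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

# Minimal ConvGRU cell (RAFT-style); not the repo's exact SepConvGRU.
class ConvGRU(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=192):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))          # update gate
        r = torch.sigmoid(self.convr(hx))          # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q                 # new hidden state
```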

-Frame Warping and Synthesis

input: \(I_{0}\), \(I_{1}\), \(f^{\prime}_{0\rightarrow1}\), \(f^{\prime}_{1\rightarrow0}\) - input images and fine flow computed by RFR module

output: \(\hat{I}_{1/2}\) - interpolated image

In this part, '.' refers to the directory models.

Generate the intermediate frame by using the splatting and synthesis strategy of Soft-Splat[6].

All features and input frames are softmax splatted via forward warping. [./softsplat.py/ModuleSoftsplat]
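The sketch below shows the softmax-splatting idea with nearest-pixel forward warping; the actual `ModuleSoftsplat` uses bilinear splatting kernels and a CUDA implementation. Overlapping contributions are weighted by \(e^{Z}\) for a per-pixel importance metric \(Z\) and then normalized, which is what the “softmax” refers to.

```python
import torch

# Simplified softmax splatting with nearest-pixel forward warping; the real
# module splats with bilinear kernels. frame: CxHxW, flow: 2xHxW (already
# scaled by the target time t), metric: 1xHxW importance scores Z.
def softmax_splat(frame, flow, metric):
    C, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    tx = (xs + flow[0]).round().long()                    # target x of each pixel
    ty = (ys + flow[1]).round().long()                    # target y of each pixel
    valid = (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H)
    idx = ty[valid] * W + tx[valid]                       # flat target indices
    w = torch.exp(metric[0])[valid]                       # softmax weights exp(Z)
    num = torch.zeros(C, H * W).index_add_(1, idx, frame[:, valid] * w)
    den = torch.zeros(H * W).index_add_(0, idx, w) + 1e-8
    return (num / den).view(C, H, W)                      # forward-warped frame
```

For the midpoint frame, \(I_0\) would be splatted with \(0.5\,f^{\prime}_{0\rightarrow1}\) and \(I_1\) with \(0.5\,f^{\prime}_{1\rightarrow0}\) before both are passed to the synthesis network below.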

All warped frames and features are fed into a GridNet[7] to synthesize the target frame. [./GridNet.py/GridNet]


References

[1] Zhang, Song-Hai, et al. "Vectorizing cartoon animations." IEEE Transactions on Visualization and Computer Graphics 15.4 (2009): 618-629.

[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[3] Liu, Fayao, et al. "Learning depth from single monocular images using deep convolutional neural fields." IEEE Transactions on Pattern Analysis and Machine Intelligence 38.10 (2015): 2024-2039.

[4] Teed, Zachary, and Jia Deng. "RAFT: Recurrent all-pairs field transforms for optical flow." European Conference on Computer Vision. Springer, Cham, 2020.

[5] Cho, Kyunghyun, et al. "On the properties of neural machine translation: Encoder-decoder approaches." arXiv preprint arXiv:1409.1259 (2014).

[6] Niklaus, Simon, and Feng Liu. "Softmax splatting for video frame interpolation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[7] Fourure, Damien, et al. "Residual conv-deconv grid network for semantic segmentation." arXiv preprint arXiv:1707.07958 (2017).

[8] Siyao, Li, et al. "Deep animation video interpolation in the wild." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.