About unmentioned details and ambiguities in the paper! #16
Hi everyone!
For anyone trying to implement this paper or use this repo, a heads-up: the paper contains quite a few ambiguities and missing details, so we reached out to the authors with questions. Because of these gaps, our implementation does not exactly match the original work.
It turns out that the authors used a highly customized backbone, which sequentially combines custom attention layers with a pretrained CNN, a detail they did not mention in the paper. Other elements crucial to the paper's results were also omitted.
As a result, our implementation did not reproduce the best results. To clarify things, we sent the authors the questions below; their responses follow.
OUR QUESTIONS TO AUTHORS:
1. For inputting the content and style images into the Swin (2-stage) backbone, did you normalize them with ImageNet statistics?
2. For inputting the content, style, and stylized images into the VGG19 network for loss computation, did you normalize them with ImageNet statistics?
3. How did you initialize the weights of the MHA blocks and MLP layers?
4. How did you initialize the weights of the decoder (after the style transformer)?
5. In the Style Decoder (Equation 5), should we treat the MHA blocks as shifted-window Swin cross-attentions, or are they ordinary MHA?
6. For the content loss (Equation 6), what is the distance function (Euclidean or Euclidean squared)?
7. For the style loss (Equation 7), what is the distance function (Euclidean or Euclidean squared)?
8. Does the style transformer decoder contain layer normalization before the MHA and MLP layers? We assume the style transformer encoder does not, as you stated the reason in the paper.
9. You mentioned that the optimizer is Adam, with a learning rate of 0.0001 for both the inner and outer loops. Did you use Adam for the outer loop as well?
10. Did you use stochastic depth during training (as in the original Swin implementation)? If so, what is the probability parameter?
11. Which Swin model did you use initially (Swin-B, Swin-L, etc.)? Were its parameters frozen during training?
12. When we downloaded WikiArt, it had over 80,000 images. How did you select 20,000 of them? Is the choice of style images important for training?
13. Did you use the VGG19 model with or without batch normalization?
14. In the Style Encoder (Fig. 3), do all the MHA layers have the same parameters, or are the V projections of Scale and Shift different?
15. By stating "there is only one copy of parameters shared by all Transformer encoder and decoder layers," do you mean there is only one style transformer layer applied multiple times? Is anything else shared within a style transformer layer besides the encoder MHAs?
16. Are the MLP blocks taken from the Swin implementation (hidden-dimension ratio = 4, activation = GELU)?
17. For the MHA blocks in the model, what is the inner dimension? (We assumed it to be 256.)
18. How did you implement the cross-attention in the MHA blocks? Did you directly copy the shifted-window attention from Microsoft's implementation and split the fused qkv weights into separate weights?
19. In the similarity loss (Equation 9), did you use the lower (or upper) triangle of the self cosine-similarity matrix?
20. Did you uniformly sample the number of stacked layers during training?
21. In the Style Encoder (Equation 4), should we process the Ks with MHA and MLP and then use the processed Ks in the cross-attention of the other MHAs, or use the initial Ks at each layer?
22. In the Style Decoder (Equation 5), should we apply the MLP after the Fcs MHA layer? (It exists in Fig. 3 but is absent from the equation.)
23. In the Style Decoder (Fig. 3), what do the Linear Transformations represent? Are they the projections in the MHA blocks?
24. If they are the projections, does that mean we should not project Fcs to the Q matrix?
25. In the Style Decoder (Fig. 3), do the Instance Normalizations contain learnable parameters (affine=True)?
26. In the Style Decoder (Fig. 3), should the Instance Normalization in the key branch be applied after the linear transformation (as shown in the figure)?
THEIR ANSWERS:
1. We do not apply ImageNet normalization.
2. We do not apply ImageNet normalization.
3. We use normal initialization; please refer to the code for details.
4. We use normal initialization; please refer to the code for details.
5. They are window-based attention, as in Swin.
6. The distance function is Euclidean squared (`F.mse_loss` in PyTorch).
7. The distance function is Euclidean squared (`F.mse_loss` in PyTorch).
8. There is no layer normalization. We only apply instance normalization when computing Q and K in the cross-attention.
9. The optimizer for the outer loop is Adam. Since you focus on the zero-shot setting, there is effectively no inner loop because the number of inner iterations is 1.
10. For Swin we do not use stochastic depth.
11. We do not use existing Swin structures. There are 6 Swin blocks, and we replace the MLP part of each block with a CNN whose weights are initialized from VGG and fixed. Only the attention parameters are trainable. Please refer to the code for details.
12. We use the test set of WikiArt, which contains 20,000 images. This part should not influence performance too much.
13. We use the model without batch normalization. You can download it from https://drive.google.com/file/d/1BinnwM5AmIcVubr16tPTqxMjUCE8iu5M/view?usp=sharing
14. All the MHA layers have the same parameters.
15. The whole layer is shared by applying it multiple times, including the MHA and MLP.
16. The MLP in the backbone is actually a CNN, as mentioned in the answer to Question 11. As for the style transformer, the hidden-dimension ratio is 1 and the activation is ReLU. Please refer to the code for details.
17. It is 512.
18. The implementation of the cross-attention is similar to the vanilla Transformer. The difference is that V has 2 parts, namely mean and standard deviation, as mentioned in the paper. Please refer to the code for details.
19. We use the whole similarity matrix.
20. Yes.
21. Use the processed Ks. Please refer to the code for details.
22. Yes; the MLP comes after the MHA.
23. Yes. They are the projections for computing Q, K, and V.
24. F_cs also needs a projection; we missed an "L" in that branch of Fig. 3.
25. No; affine is False.
26. The IN should be before the L. Sorry for the mistake.
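Putting the answers about the decoder cross-attention together (no layer norm; affine-free instance normalization applied to Q and K before their linear projections; V split into mean and standard-deviation parts), a minimal PyTorch sketch might look like the following. This is our reading, not the authors' code: the module name, tensor layout, and the final scale-and-shift combination are assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCrossAttention(nn.Module):
    """Hedged sketch of the decoder cross-attention, per the authors' answers:
    instance normalization (affine=False) is applied to the Q and K inputs
    *before* the linear projections, and V has two parts (mean and std)."""

    def __init__(self, dim=512):
        super().__init__()
        # IN without learnable parameters (affine=False), applied per channel
        self.norm_q = nn.InstanceNorm1d(dim, affine=False)
        self.norm_k = nn.InstanceNorm1d(dim, affine=False)
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        # two V projections: one for the shift (mean), one for the scale (std)
        self.proj_v_mean = nn.Linear(dim, dim)
        self.proj_v_std = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, f_cs, f_s):
        # f_cs: (B, N, C) content/stylized tokens; f_s: (B, M, C) style tokens.
        # InstanceNorm1d expects (B, C, L), hence the transposes.
        q = self.proj_q(self.norm_q(f_cs.transpose(1, 2)).transpose(1, 2))
        k = self.proj_k(self.norm_k(f_s.transpose(1, 2)).transpose(1, 2))
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        mean = attn @ self.proj_v_mean(f_s)  # attended shift statistics
        std = attn @ self.proj_v_std(f_s)    # attended scale statistics
        # Assumed combination: modulate the content tokens (cf. Equation 5)
        return std * f_cs + mean
```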
Note: You can also ask the first two authors of the paper for their own code.
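Since the authors confirmed that the similarity loss (Equation 9) uses the whole cosine self-similarity matrix rather than a triangle, it can be sketched as below. The mean-absolute-difference distance here is our assumption; check the paper for the exact formulation.

```python
import torch
import torch.nn.functional as F

def self_similarity(feat):
    """Full cosine self-similarity matrix: (B, N, C) tokens -> (B, N, N)."""
    feat = F.normalize(feat, dim=-1)  # unit-normalize so dot product = cosine
    return feat @ feat.transpose(1, 2)

def similarity_loss(f_c, f_cs):
    """Sketch of Equation 9 using the *whole* similarity matrix, per the
    authors' answer. The choice of distance (mean absolute difference)
    is an assumption, not confirmed by the authors."""
    return (self_similarity(f_c) - self_similarity(f_cs)).abs().mean()
```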