So are the example with the dogs/wolves and the example in the OP.
As for how hard each is to resolve: the dogs/wolves one might be quite difficult, but for the example in the OP, it wouldn’t be hard to feed in every image (during training) with a randomly chosen background, which would remove the model’s ability to draw any conclusions from the background.
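For what it's worth, the random-background idea could be sketched roughly like this. This is only an illustration under assumptions: it presumes a per-pixel subject mask already exists for each image (which is the hard part in practice), and `randomize_background`, `mask`, and `backgrounds` are made-up names, not anything from the OP. It uses plain Python lists to stay dependency-free; a real training pipeline would do this with array operations.

```python
import random

def randomize_background(image, mask, backgrounds, rng=random):
    """Return a copy of `image` whose non-subject pixels come from a random background.

    image:       rows of pixels (any pixel representation)
    mask:        rows of booleans, True where the subject is
                 (assumed to be available, e.g. from a segmentation step)
    backgrounds: non-empty list of images with the same dimensions as `image`
    rng:         anything with a `choice` method (the `random` module by default)
    """
    background = rng.choice(backgrounds)
    return [
        # Keep the subject pixel where the mask is True, else use the background pixel.
        [px if keep else bg_px for px, keep, bg_px in zip(img_row, mask_row, bg_row)]
        for img_row, mask_row, bg_row in zip(image, mask, background)
    ]
```

Calling this once per training sample, with a fresh random choice each time, means the same subject shows up against many different backgrounds, so the background stops being a usable signal.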
However, this would probably unearth the next issue: the human graders who were probably used to create the original training dataset have their own biases based on race, gender, appearance, etc. This doesn’t even necessarily mean that they were racist, sexist, etc., just that they struggle to detect certain emotions in certain groups of people. The model would then replicate those issues.
Assuming we shrink all spatial dimensions equally: with Z, the diagonal also shrinks, so the two horizontal lines end up closer together and you can no longer fit them into the original horizontal lines. Only once you shrink the Z far enough that it fits within the line width can you fit it into itself again. X, I, and L all work at any arbitrary amount of shrinking, though.