Machine perception models are usually modality-specific and optimised for unimodal benchmarks.