How LightCap Sees and Speaks: Mobile Magic in Only 188 Milliseconds per Image

Authors:
(1) Ning Wang, Huawei Inc.;
(2) Jiangrong Xie, Huawei Inc.;
(3) Hang Luo, Huawei Inc.;
(4) Quanjun Cheng, Huawei Inc.;
(5) Jihao Wu, Huawei Inc.;
(6) Mingbo Jia, Huawei Inc.;
(7) Linlin Li, Huawei Inc.
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Methodology and 3.1 Model Architecture
3.2 Model Training
3.3 Knowledge Distillation
4 Experiments
4.1 Datasets and Metrics and 4.2 Implementation Details
4.3 Ablation Study
4.4 Inference on the Mobile Device and 4.5 Comparison
5 Conclusion and References
A. Implementation Details
B. Visualization Results
C. Results on NoCaps
D. Limitations and Future Work
A. Implementation Details
A.1 Training Details
For the visual concept number, we experimentally set K = 20 to retrieve the top-K concepts for effective feature fusion. We note that the performance decreases slightly when the concept number is fewer than 15. Our visual concept extractor is trained on the VG dataset (Krishna et al. 2017).
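As a concrete illustration of this top-K retrieval, the sketch below scores a bank of concept embeddings against ROI features and keeps the K = 20 best matches. This is not the authors' released code; the tensor names, shapes, and the PyTorch formulation are our assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_concepts(roi_feats: torch.Tensor,
                      concept_bank: torch.Tensor,
                      k: int = 20):
    """roi_feats: (num_rois, dim); concept_bank: (num_concepts, dim)."""
    roi_feats = F.normalize(roi_feats, dim=-1)
    concept_bank = F.normalize(concept_bank, dim=-1)
    # Cosine similarity between every ROI and every concept embedding.
    sim = roi_feats @ concept_bank.t()          # (num_rois, num_concepts)
    # Score each concept by its best-matching ROI, then keep the top-K.
    per_concept = sim.max(dim=0).values         # (num_concepts,)
    scores, idx = per_concept.topk(k)
    return idx, scores
```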
A.2 Evaluation on the Mobile Device
In this work, we test the inference time of the LightCap model on a Huawei P40 phone, whose chipset is the Kirin 990[1]. The detailed inference speeds of the LightCap components are shown in Table 7. To obtain purely greedy decoding, we set the beam size in beam search to 1. The memory consumption is 257 MB on the mobile phone. Our lightweight model takes only about 188 ms to process one image on the CPU of the mobile device, which meets real-world efficiency requirements. It is well recognized that leveraging the NPU or GPU on mobile devices can achieve faster inference, but not all mobile devices are equipped with such powerful chips. Thus, we use the Kirin 990 CPU to test our method (188 ms per image). The inference time on a PC with a TITAN X GPU is about 90 ms.
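For reference, a per-image latency measurement in the spirit of the numbers above can be scripted as follows. This is a hedged sketch: the `model.caption()` interface is a hypothetical placeholder, not the actual LightCap deployment pipeline on the Kirin 990 CPU, and `beam_size=1` mirrors the greedy decoding used for the 188 ms figure.

```python
import time
import statistics

def benchmark_latency(model, images, beam_size: int = 1, warmup: int = 3):
    """Return the mean per-image captioning latency in milliseconds."""
    # Warm-up runs are discarded so caches and lazy initialization
    # do not inflate the measured latency.
    for img in images[:warmup]:
        model.caption(img, beam_size=beam_size)
    latencies = []
    for img in images:
        start = time.perf_counter()
        model.caption(img, beam_size=beam_size)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(latencies)
```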
B. Visualization Results
B.1 Visualization of Retrieved Concepts
We visualize the results of image concept retrieval in Figure 4. In the second column, we show the detection results of the tiny YOLOv5n detector. Although this detector is relatively weak and fails to match state-of-the-art two-stage detectors, it is extremely lightweight with only 1.9M parameters. Moreover, precise bounding boxes are not necessary in our framework. Based on the roughly localized ROIs it provides, we focus on retrieving the visual concepts of the image. As shown in the third column, our visual concept extractor is able to predict accurate and dense object tags that form the concepts of the image.
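To make the point that rough boxes are enough more concrete, the snippet below sketches how ROI features could be pooled from coarse detector boxes with torchvision's roi_align and then handed to concept retrieval. The feature-map source, spatial scale, and pooling size are illustrative assumptions, not the exact LightCap configuration.

```python
import torch
from torchvision.ops import roi_align

def roi_features(feature_map: torch.Tensor,
                 boxes: torch.Tensor,
                 spatial_scale: float = 1 / 32,
                 output_size: int = 1) -> torch.Tensor:
    """feature_map: (1, C, H, W); boxes: (N, 4) in image coordinates."""
    # roi_align expects rows of (batch_index, x1, y1, x2, y2).
    batch_idx = torch.zeros((boxes.size(0), 1), dtype=boxes.dtype)
    rois = torch.cat([batch_idx, boxes], dim=1)
    feats = roi_align(feature_map, rois, output_size=output_size,
                      spatial_scale=spatial_scale, aligned=True)
    # One pooled feature vector per (possibly imprecise) box.
    return feats.flatten(1)                     # (N, C * output_size**2)
```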
B.2 Visualization of Channel Attention
In Figure 5, we visualize the channel attention weights of the retrieved visual concepts. For the image shown in Figure 5, the top-three retrieved visual concepts are dessert, cake, and spoon. These visual concepts are expected to produce channel attention that modulates the raw CLIP features. As shown in the bottom plots of Figure 5, the activated channels are sparse (i.e., only a few channels receive high attention values above 0.8) and most of the channel weights are below 0.5. This verifies our assumption that the raw CLIP features are redundant along the channel dimension. Moreover, the channel attention of dessert and cake is similar, probably because of their high similarity in the semantic space. In contrast, the attention weights produced by spoon differ substantially from those of dessert and cake. It is well recognized that different feature channels represent certain semantic cues, and our approach is able to activate the useful channels with the retrieved concepts for effective image captioning.
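A minimal sketch of such concept-guided channel gating is given below, assuming a single linear layer followed by a sigmoid; the actual LightCap module may differ. The sketch only shows how a mostly sub-0.5 attention vector can selectively amplify a few CLIP feature channels per retrieved concept.

```python
import torch
import torch.nn as nn

class ConceptChannelGate(nn.Module):
    """Gate CLIP feature channels with retrieved concept embeddings."""
    def __init__(self, concept_dim: int, feat_dim: int):
        super().__init__()
        self.fc = nn.Linear(concept_dim, feat_dim)

    def forward(self, clip_feat: torch.Tensor, concept_emb: torch.Tensor):
        """clip_feat: (B, feat_dim); concept_emb: (B, K, concept_dim)."""
        # One attention vector per concept, squashed to (0, 1); most entries
        # stay small, so only a few channels are strongly amplified.
        attn = torch.sigmoid(self.fc(concept_emb))      # (B, K, feat_dim)
        # Modulate the shared CLIP feature with each concept's channel weights.
        return clip_feat.unsqueeze(1) * attn            # (B, K, feat_dim)
```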
B.3 Qualitative Evaluation
Finally, we show the captioning results of our approach on the COCO Karpathy test split (Karpathy and Fei-Fei 2015) in Figure 6, along with the ground-truth (GT) captions. In general, on these in-the-wild images from the COCO Karpathy test split, LightCap generates accurate captions comparable to those of the strong OSCARB. In the third image, the proposed approach even gives a more precise caption than OSCARB, which predicts a woman instead of men. It is worth noting that such strong captioning performance is achieved while consuming only about 2% of the FLOPs of current state-of-the-art captioning models.
[1] https://www.hisilicon.com/en/products/kirin/kirin-flagships/kirin-990