
How LightCap Sees and Speaks: Mobile Magic in Only 188 Milliseconds per Image

Authors:

(1) Ning Wang, Huawei Inc.;

(2) Jiangrong Xie, Huawei Inc.;

(3) Hang Luo, Huawei Inc.;

(4) Qinglin Cheng, Huawei Inc.;

(5) Jihao Wu, Huawei Inc.;

(6) Mingbo Jia, Huawei Inc.;

(7) Linlin Li, Huawei Inc.

Abstract and 1 Introduction

2 Related Work

3 Methodology and 3.1 Model Architecture

3.2 Model Training

3.3 Knowledge Distillation

4 Experiments

4.1 Datasets and Metrics and 4.2 Implementation Details

4.3 Ablation Study

4.4 Inference on the Mobile Device and 4.5 Comparison

5 Conclusion and References

A Implementation Details

B Visualization Results

C Results on NoCaps

D Limitations and Future Work

A Implementation Details

A.1 Training Details

For the visual concept number, we experimentally set K = 20 to retrieve the top-K concepts for effective feature fusion. We note that the performance drops slightly when the concept number is smaller than 15. Our visual concept extractor is trained on the VG dataset (Krishna et al. 2017).
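As a concrete illustration of this retrieval step, the following is a minimal sketch of top-K concept retrieval with K = 20, assuming CLIP-style image features and a bank of concept text embeddings. The names `image_feature`, `concept_bank`, and `retrieve_concepts` are illustrative assumptions, not identifiers from the LightCap codebase.

```python
# Hypothetical sketch of top-K visual concept retrieval (K = 20 as in A.1).
import torch
import torch.nn.functional as F

def retrieve_concepts(image_feature: torch.Tensor,
                      concept_bank: torch.Tensor,
                      k: int = 20):
    """Return the indices and scores of the top-K concepts for one image.

    image_feature: (D,) pooled visual feature of one image (or region).
    concept_bank:  (N, D) text embeddings of the candidate concept vocabulary.
    """
    # Cosine similarity between the image feature and every concept embedding.
    sims = F.cosine_similarity(image_feature.unsqueeze(0), concept_bank, dim=-1)
    topk = sims.topk(k)
    return topk.indices, topk.values
```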

A.2 Inference on the Mobile Device

In this work, we test the inference time of the LightCap model on a Huawei P40 smartphone, whose chipset is the Kirin 990[1]. The detailed inference speed of each component of LightCap is reported in Table 7. To pursue the fastest inference, we set the beam search size to 1. The memory usage on the mobile phone is 257 MB. Our lightweight model takes only about 188 ms to process one image on the CPU of the mobile device, which meets real-world efficiency requirements. It is well recognized that leveraging the NPU or GPU on mobile devices can achieve even faster inference, but not all mobile devices are equipped with powerful chips. Therefore, we test our method on the Kirin 990 CPU (188 ms per image). For reference, the inference time on a PC with a TITAN X GPU is about 90 ms.
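For readers who want to run a comparable measurement, below is a rough sketch of how per-image CPU latency with beam size 1 could be timed. `caption_model` and its `generate(..., num_beams=1)` interface are placeholders assumed here for illustration, not the paper's actual API.

```python
# Rough latency measurement on CPU, mirroring the setting in A.2
# (beam size 1, single image). Reports the average milliseconds per image.
import time
import torch

def measure_latency(caption_model, image_tensor, warmup: int = 3, runs: int = 10):
    caption_model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            # Warm-up iterations so one-time setup costs are excluded.
            caption_model.generate(image_tensor, num_beams=1)
        start = time.perf_counter()
        for _ in range(runs):
            caption_model.generate(image_tensor, num_beams=1)
        elapsed = (time.perf_counter() - start) / runs
    return elapsed * 1000.0  # average ms per image
```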

B Visualization Results

We visualize the image concept retrieval results in Figure 4. The second column shows the detection results of the tiny YOLOv5n detector.

Table 7: Inference latency of the proposed LightCap on the CPU.

Although this detector is relatively weak and cannot match two-stage detectors in accuracy, it is very lightweight with only 1.9M parameters. Moreover, precise bounding boxes are not necessary in our framework. Based on the coarse RoIs it provides, we focus on retrieving the visual concepts of the image. As shown in the third column, our visual concept extractor is able to predict accurate and dense object tags to form the image concepts.
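As a rough sketch of this pipeline, the snippet below loads the tiny YOLOv5n detector through `torch.hub` (one common way to obtain it, not necessarily the authors' setup) and crops coarse RoIs that could then be fed to a concept retrieval step such as the hypothetical `retrieve_concepts` helper sketched earlier.

```python
# Sketch: obtain coarse RoIs with the tiny YOLOv5n detector (~1.9M params).
# Precise boxes are not required downstream; the crops only seed concept retrieval.
import torch
from PIL import Image

detector = torch.hub.load('ultralytics/yolov5', 'yolov5n', pretrained=True)

def coarse_rois(image_path: str, max_boxes: int = 10):
    image = Image.open(image_path).convert('RGB')
    results = detector(image)              # run the lightweight detector
    boxes = results.xyxy[0][:max_boxes]    # rows of (x1, y1, x2, y2, conf, cls)
    crops = [image.crop(tuple(int(v) for v in b[:4].tolist())) for b in boxes]
    return crops                           # coarse regions for concept retrieval
```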

Figure 4: From left to right: the input image, the object detection results, and the concept retrieval results. All test images are from COCO (Lin et al. 2014).

B.2 Visualization of the Channel Attention

In Figure 5, we visualize the channel attention of the retrieved visual concepts. For the image shown in Figure 5, the top-3 visual concepts are dessert, cake, and spoon. These visual concepts are expected to attend to the channels to refine the raw CLIP features.

Figure 5: In the upper part, we show the predicted caption of the image, the ground-truth (GT) captions, and our predicted visual concepts. In the lower part, we show the channel attention weights of the top-3 concepts (i.e., dessert, cake, and spoon).


As shown in the bottom of Figure 5, only a few channels are strongly activated (i.e., only a few channels receive attention values above 0.8) and most of the channel weights are below 0.5. This verifies our assumption that the raw CLIP features are redundant in the channel dimension. Moreover, the channel attention weights of dessert and cake are similar, probably because of their high similarity in the semantic space. In contrast, the attention weights induced by spoon are quite different from those of dessert and cake. It is well recognized that different feature channels represent certain semantic cues, and our approach is able to activate informative channels using the retrieved concepts for effective image captioning.
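A concept-conditioned channel attention of this kind could be sketched as below, with a sigmoid gate producing per-channel weights in [0, 1] that re-weight the raw CLIP features, consistent with the sparse activations described above. The module and dimension names are assumptions for illustration only, not the paper's implementation.

```python
# Minimal sketch of concept-conditioned channel attention.
import torch
import torch.nn as nn

class ConceptChannelAttention(nn.Module):
    def __init__(self, concept_dim: int, feature_dim: int):
        super().__init__()
        # Map a concept embedding to one gate value per feature channel.
        self.gate = nn.Sequential(
            nn.Linear(concept_dim, feature_dim),
            nn.Sigmoid(),
        )

    def forward(self, clip_features: torch.Tensor, concept_emb: torch.Tensor):
        # clip_features: (B, feature_dim) raw CLIP image features
        # concept_emb:   (B, concept_dim) embedding of one retrieved concept
        weights = self.gate(concept_emb)   # (B, feature_dim), mostly small values
        return clip_features * weights     # re-weight (activate) useful channels
```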

B.3 Qualitative Evaluation

Finally, we show the captioning results of our approach on the COCO Karpathy split (Karpathy and Fei-Fei 2015) in Figure 6, along with the ground-truth (GT) captions. In general, on these unconstrained images from the COCO Karpathy test split, LightCap generates accurate captions comparable with those of the strong OSCAR-B. The proposed approach even gives more accurate captions than OSCAR-B on the third image, where OSCAR-B predicts a woman instead of a man. It is worth noting that our model achieves such promising results with only about 2% of the FLOPs of current state-of-the-art captioning models.


[1] https://www.hisilicon.com/en/products/kirin/kirin-flagships/kirin-990
