Challenges in the retrieval of information on the web: From keyword
In traditional information retrieval, user queries and documents are represented as lists of keywords, and retrieval is based on keyword matching. However, simple keyword matching faces several challenges. First, it cannot clearly understand user intent. In particular, the positive or negative sentiment of a query may be misjudged, so that unwanted results are retrieved by mistake. Second, it cannot combine synonymous expressions, which reduces the diversity of results [18]. Third, it cannot handle spelling errors and will return irrelevant results. Query alteration is therefore used to address these challenges. Unfortunately, it is hard to cover all kinds of query alterations, especially newly emerging ones.
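The synonym and misspelling failures of keyword matching can be illustrated with a minimal sketch. The function name `keyword_match` and the two toy documents are hypothetical and chosen only for illustration; real keyword retrieval systems use inverted indexes and term weighting rather than plain set overlap.

```python
def keyword_match(query, documents):
    """Return indices of documents sharing at least one keyword with the query."""
    query_terms = set(query.lower().split())
    hits = []
    for idx, doc in enumerate(documents):
        doc_terms = set(doc.lower().split())
        # A document matches only if it contains a literal query term.
        if query_terms & doc_terms:
            hits.append(idx)
    return hits


# Toy corpus (hypothetical): two pages about the same kind of product.
docs = [
    "cheap laptop deals",
    "affordable notebook computer offers",
]

exact_hits = keyword_match("laptop", docs)      # finds only the first page
synonym_hits = keyword_match("notebook", docs)  # misses the synonymous "laptop" page
typo_hits = keyword_match("laptpo", docs)       # a misspelling matches nothing
```

Both documents are relevant to either product query, yet each query retrieves only the page that happens to use the same word, and the misspelled query retrieves nothing.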
With the great success of deep learning in natural language processing, both queries and documents can be represented more meaningfully as semantic embedding vectors. Since embedding-based retrieval addresses the three challenges above, it has been widely adopted in modern information retrieval systems, enabling new state-of-the-art retrieval quality and performance. A series of previous studies focused on deep embedding models, from DSSM [21], CDSSM [46], LSTM-RNN [38], and ARC-I [20] to transformer-based embedding models [10, 16, 39, 40, 45, 53, 54]. Compared with traditional keyword matching, they have shown impressive gains with brute-force nearest neighbor search on some small datasets.
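Brute-force nearest neighbor search over embeddings can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the relevance score and using hand-written three-dimensional vectors in place of real model outputs; the names `cosine` and `brute_force_search` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_search(query_vec, doc_vecs, k):
    """Score every document against the query and return the top-k ids."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    # Highest similarity first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings (assumed values, not produced by any real model).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
query_vec = [1.0, 0.0, 0.0]

top2 = brute_force_search(query_vec, doc_vecs, k=2)
```

Every query is compared against every document, so the cost grows linearly with the collection size, which is why this approach is practical only on small datasets.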
Because brute-force vector search incurs extremely high computation cost and query latency, much research has focused on approximate nearest neighbor (ANN) algorithms and system design for large-scale vector search [5–7, 11, 19, 24–26, 41, 48]. These approaches can be divided into partition-based and graph-based solutions. Partition-based solutions, such as SPANN [11], divide the entire vector space into a large number of clusters and, at query time, search only the small number of clusters closest to the query. Graph-based solutions, such as DiskANN [48], build a neighborhood graph over the entire dataset and, when a query arrives, perform a best-first traversal from a set of fixed starting points. Both approaches work well on some uniformly distributed datasets.
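The partition-based idea can be sketched in a few lines. This is a simplified illustration and not the SPANN algorithm itself: the centroids are supplied by hand rather than learned by clustering, and the names `build_partitions`, `partition_search`, and `nprobe` are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_partitions(vectors, centroids):
    """Assign each vector id to the cluster of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest].append(vid)
    return partitions


def partition_search(query, vectors, centroids, partitions, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in partitions[c]]
    candidates.sort(key=lambda vid: (sq_dist(query, vectors[vid]), vid))
    return candidates[:k]


# Toy data (assumed): five 2-D vectors and three hand-picked centroids.
vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]]
centroids = [[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]]

partitions = build_partitions(vectors, centroids)
result = partition_search([4.9, 5.1], vectors, centroids, partitions,
                          k=2, nprobe=1)
```

With `nprobe=1` only one of the three clusters is scanned. Raising `nprobe` trades query latency for accuracy, since a true neighbor that falls in an unprobed cluster is never examined.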
Unfortunately, applying embedding-based retrieval in the web scenario raises several new challenges. First, web-scale data requires large models, high embedding dimensions, and a large-scale training set to ensure sufficient knowledge coverage. Second, the performance gains of state-of-the-art embedding models verified on small datasets cannot be transferred directly to a web-scale dataset (see section 4.4). Third, embedding models need to work with ANN systems to serve large-scale data efficiently. However, differences in training data distribution may affect the accuracy and system performance of an ANN algorithm, which can significantly reduce result accuracy compared with pairing the same model with brute-force search. Distill-VQ [52] verifies this with the coCondenser [17] model and a Faiss IVFPQ ANN index on the MS MARCO [35] and NQ [28] datasets. Moreover, even the same training data distribution can lead to different distributions of embedding vectors, which in turn leads to different ranking trends for embedding models under brute-force search (KNN) and approximate nearest neighbor search (see section 4.6).
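The accuracy gap between brute-force and approximate search is commonly quantified as recall@k: the fraction of the exact top-k neighbors that the ANN index also returns. The sketch below is a minimal illustration of that measure; the function name `recall_at_k` and the document ids are hypothetical and are not results from the paper.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k neighbors that the approximate search also found."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


# Hypothetical result lists for one query (ids invented for illustration).
exact = [12, 7, 3, 40, 5]    # ranking from brute-force (KNN) search
approx = [12, 3, 9, 7, 41]   # ranking from an ANN index

recall = recall_at_k(exact, approx, k=5)
```

Averaging this value over a query set gives one number per model and index combination, which makes it possible to compare how much accuracy each embedding model loses when it is served through an ANN index instead of brute-force search.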
Authors:
(1) Qi Chen, Microsoft, Beijing, China;
(2) Xiubo Geng, Microsoft, Beijing, China;
(3) Corby Rosset, Microsoft, Redmond, United States;
(4) Carolyn Buractaon, Microsoft, Redmond, United States;
(5) Jingwen Lu, Microsoft, Redmond, United States;
(6) Tao Shen, University of Technology Sydney, Sydney, Australia; work done at Microsoft;
(7) Kun Zhou, Microsoft, Beijing, China;
(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States; work done at Microsoft;
(9) Yeyun Gong, Microsoft, Beijing, China;
(10) Paul Bennett, Spotify, New York, United States; work done at Microsoft;
(11) Nick Craswell, Microsoft, Redmond, United States;
(12) Xing Xie, Microsoft, Beijing, China;
(13) Fan Yang, Microsoft, Beijing, China;
(14) Bryan Tower, Microsoft, Redmond, United States;
(15) Nikhil Rao, Microsoft, Mountain View, United States;
(16) Anlei Dong, Microsoft, Mountain View, United States;
(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;
(18) Zheng Liu, Microsoft, Beijing, China;
(19) Mingqin Li, Microsoft, Redmond, United States;
(20) Chuanjie Liu, Microsoft, Beijing, China;
(21) Zengzhong Li, Microsoft, Redmond, United States;
(22) Rangan Majumder, Microsoft, Redmond, United States;
(23) Jennifer Neville, Microsoft, Redmond, United States;
(24) Andy Oakley, Microsoft, Redmond, United States;
(25) Knut Magne Risvik, Microsoft, Oslo, Norway;
(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;
(27) Manik Varma, Microsoft, Bengaluru, India;
(28) Yujing Wang, Microsoft, Beijing, China;
(29) Linjun Yang, Microsoft, Redmond, United States;
(30) Mao Yang, Microsoft, Beijing, China;
(31) Ce Zhang, ETH Zürich, Zürich, Switzerland; work done at Microsoft.