With the increasing popularity of mobile and voice-assisted devices, extracting short and precise answer passages to open-domain questions is becoming an important information retrieval (IR) task. The recently released large-scale corpus for answer passage retrieval, WikiPassageQA, was shown to be challenging for both traditional retrieval models and neural architectures. One of the classic approaches to improving retrieval effectiveness across tasks is automatic query expansion (QE). QE is the process of reformulating a user’s query by adding terms, with the goal of retrieving more relevant information. Word embeddings are commonly employed to obtain QE terms by taking advantage of the low-dimensional semantic space they form.
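As a rough illustration of embedding-based QE, the sketch below ranks candidate terms by cosine similarity to the centroid of the query terms and appends the top k. This is one common scheme, shown under stated assumptions, and is not necessarily the exact procedure used in this thesis or by Diaz et al. The toy vectors, the `EMBEDDINGS` table, and the `expand_query` helper are hypothetical; in practice the vectors would come from a trained embedding model.

```python
import math

# Hypothetical toy embeddings for illustration only; real vectors would be
# learned by a word embedding model trained on a global or local corpus.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "engine": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.95],
    "fruit": [0.05, 0.25, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query_terms, embeddings, k=2):
    """Append the k vocabulary terms closest to the query centroid."""
    known = [t for t in query_terms if t in embeddings]
    if not known:
        # No query term has an embedding, so the query is left unchanged.
        return list(query_terms)

    dim = len(embeddings[known[0]])
    centroid = [
        sum(embeddings[t][i] for t in known) / len(known) for i in range(dim)
    ]

    # Score every vocabulary term that is not already part of the query.
    candidates = [
        (cosine(centroid, vec), term)
        for term, vec in embeddings.items()
        if term not in query_terms
    ]
    candidates.sort(reverse=True)
    return list(query_terms) + [term for _, term in candidates[:k]]


expanded = expand_query(["car"], EMBEDDINGS, k=2)
```

With these toy vectors, the query `["car"]` is expanded with its two nearest neighbours in the embedding space. Which neighbours are returned depends entirely on the corpus the embeddings were trained on, which is the effect the local versus global comparison examines.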
Recently, Diaz et al. showed that, for document ranking tasks, QE using word embeddings trained on a local query-specific corpus performed better than QE using embeddings trained on an entire global corpus. We aim to examine the effectiveness of QE, specifically using locally trained word embeddings, in the new context of answer passage retrieval. Additionally, a query-specific corpus can be small, with a limited vocabulary, which poses a challenge for training word embedding models. Since the extent to which a limited vocabulary influences the semantic information captured by word embeddings is relatively unexplored, we compare two word embedding models, CBOW and IWE, in this thesis. Although it shares the same underlying training philosophy, the IWE model differs from CBOW in two respects: it incorporates subword information and uses a convolutional neural network to learn the context representation.
Our results corroborate the findings of Diaz et al.: query-specific data is also beneficial in the task of retrieving passages for open-domain questions. Word embeddings trained on a global corpus fail to capture the nuances of the query-specific language present in the answer passages. We also found that IWE word embeddings capture more semantic information than CBOW word embeddings when trained on local data with a limited vocabulary. Our experiments show that both components of the IWE model contribute to the improved quality of the word embeddings and, consequently, to better QE terms. Our work can be extended by applying the same methodology in other domains or by using different word embedding models to obtain QE terms. The insights from our thesis can help researchers make informed decisions when choosing word embedding models and training data for their IR and natural language understanding tasks.