UTDRM: unsupervised method for training debunked-narrative retrieval models

A key task in the fact-checking workflow is to establish whether the claim under investigation has already been debunked or fact-checked before. This is essentially a retrieval task where a misinformation claim is used as a query to retrieve from a corpus of debunks. Prior debunk retrieval methods have typically been trained on annotated pairs of misinformation claims and debunks. The novelty of this paper is an Unsupervised Method for Training Debunked-Narrative Retrieval Models (UTDRM) in a zero-shot setting, eliminating the need for human-annotated pairs. This approach leverages fact-checking articles for the generation of synthetic claims and employs a neural retrieval model for training. Our experiments show that UTDRM tends to match or exceed the performance of state-of-the-art methods on seven datasets, which demonstrates its effectiveness and broad applicability. The paper also analyses the impact of various factors on UTDRM's performance, such as the quantity of fact-checking articles utilised, the number of synthetically generated claims employed, the proposed entity inoculation method, and the usage of large language models for retrieval.


Introduction
Automated fact-checking systems are pivotal not only for combatting false information on digital media but also for reducing the workload of fact-checkers [1,2]. A key functionality of these systems is the retrieval of already debunked narratives for misinformation claims, which essentially means retrieving previously fact-checked similar claims [2-4]. This function is accomplished by training debunked-narrative retrieval models that utilise misinformation claims as queries to retrieve relevant debunked narratives.
Previous methods for training debunked-narrative retrieval models rely heavily on annotated pairs of misinformation claims and debunks [2,4,5]. However, the process of manually creating annotated pairs is time-consuming, labour-intensive, and often limited in scale, which can impede the performance of the retrieval models.
In this paper, we propose an Unsupervised method for Training Debunked-Narrative Retrieval Models (UTDRM) that utilises synthetic claims to overcome the limitation of relying on manual annotations (see Fig. 1). Moreover, we hypothesise that UTDRM has the potential to detect topical misinformation by generating claims from incoming topical fact-checks, thereby expanding its overall impact. Furthermore, our proposed entity inoculation method (Sect. 6.3) addresses the pressing challenge of similar false narratives evolving with different entities [6]. Our inspiration for this approach stems from an independent analysis, which noted similar misinformation claims involving distinct entities. For example, misinformation about crocodile sightings during floods varies across locations: Hyderabad, Patna, Bengaluru, and Florida (see Appendix A.4 for more examples). By replacing named entities in generated claims, entity inoculation enhances the robustness of our UTDRM method, directly addressing the issue of narrative adaptability (see Sect. 6.3).
Figure 1 End-to-end pipeline for UTDRM: a two-step method involving the generation of topical claims and the training of a neural retrieval model
In particular, the research question addressed in this study is: how to train efficient debunked-narrative retrieval models without relying on human-annotated data?
The main contributions of this paper are:
• UTDRM, a two-step method for training debunked-narrative retrieval models that achieves comparable or superior retrieval scores to supervised models, all without relying on annotations. Figure 1 illustrates UTDRM's end-to-end pipeline.
• A large-scale dataset of synthetic topical claims created using topical claim generation techniques based on text-to-text transformer-based models and large language models (LLMs).
• A comprehensive performance evaluation of UTDRM on seven publicly available datasets, demonstrating its effectiveness and generalisability in retrieving accurate debunks for misinformation in tweets, political debates, or speeches.
• Extensive ablation experiments that assess the impact of different factors on UTDRM's performance. This includes: (1) the volume of fact-checking articles utilised, (2) the number of synthetically generated claims used for training, (3) the proposed entity inoculation method, and (4) the usage of LLMs, such as Large Language Model Meta AI (LLaMA 2) and Chat Generative Pre-trained Transformer (ChatGPT), for retrieval.
In the following sections, we discuss related work (Sect. 2) and our proposed UTDRM method (Sect. 3). Section 4 presents the various experimental methods and the datasets used for evaluation. The results and ablation experiments are presented in Sect. 5 and Sect. 6 respectively. Finally, we conclude the paper in Sect. 8.

Related work
Information retrieval involves the search and retrieval of relevant documents from a collection in response to a query. Initially, conventional lexical methods, such as Okapi Best Match 25 (BM25) [7], Term Frequency-Inverse Document Frequency (TF-IDF) weighting [8], the Query Likelihood model (QL) [9], and Divergence From Randomness (DFR) [10], were the primary information retrieval techniques, and they demonstrated the effectiveness of lexical and statistical approaches. However, these traditional approaches faced challenges in addressing lexical gaps and semantic issues in relevance matching [11]. In response to these challenges, recent Transformer-based methods [12] aim to harness the power of deep learning to enhance performance [13]. In the following sections, we review related work in two main areas: supervised and unsupervised methods for debunked-narrative retrieval.

Supervised training methods
Many existing methods for training debunked-narrative retrieval models rely on supervised learning techniques, which typically leverage annotated pairs of misinformation claims and fact-checking articles as training data [3,14-18]. For instance, Shaar et al. [16] train a pairwise learning-to-rank model for identifying debunked narratives. They also release the Snopes and Politifact datasets [16], which we use for evaluation in this paper (Sect. 4.1). Similarly, Vo and Lee [19] train a ranking model that incorporates both textual and visual features to retrieve previously fact-checked content, while Shaar et al. [20] employ the Transformer-XH [21] to examine the role of context in political debates. On the other hand, Kazemi et al. [5,22] address debunked-narrative retrieval as a binary classification problem and train a support vector machine model to classify misinformation tweets. However, formulating the task as a classification problem does not scale computationally because of its quadratic complexity.
The Conference and Labs of the Evaluation Forum (CLEF) CheckThat! Lab shared tasks in 2020, 2021 and 2022 [2,3,14,23] focus on the debunked-narrative retrieval task and release different datasets for training and testing. In this paper, we utilise all of these CLEF test datasets for evaluation (Sect. 4.1). Teams in CLEF 22 use diverse methods, such as Sentence-T5 and GPT-Neo for re-ranking [24], Simple Contrastive Learning of Sentence Embeddings (SimCSE) [25], and data augmentation techniques like back translation [26]. We utilise the state-of-the-art performance demonstrated by the shared task winners as a benchmark for comparing against our UTDRM method (Sect. 4.2). Supervised training approaches require annotated training data, which can be costly and time-consuming to collect; this research proposes a novel alternative. By utilising fact-checking articles from professional fact-checking organisations, our method generates high-quality training data without the need for annotations. This methodology yields high scores in debunked-narrative retrieval (Sect. 5).

Unsupervised training methods
In recent years, unsupervised training methods for information retrieval have gained significant interest [13,27-30]. Our proposed UTDRM method falls within this category. These unsupervised methods aim to overcome the challenges associated with acquiring annotated training data by utilising large corpora of unlabelled documents. For example, Lee et al. [27] introduce the Inverse Cloze Task (ICT) for training models using synthetic query-passage pairs by uniformly sampling sentences from random passages. Alternatively, the Transformer-based Denoising AutoEncoder (TSDAE) [28] encodes sentences in which 60% of the tokens have been randomly deleted and trains the decoder to reconstruct the original sentences. Similarly, methods like SimCSE [25] and Contrastive Tension [31] focus on minimising the distance between embeddings from the same sentence. ICT, TSDAE, and SimCSE are among the unsupervised methods employed for comparison with our proposed UTDRM method (as discussed in Sect. 4.2).
Another line of unsupervised work explores query generation as an alternative way to improve retrieval performance. For example, Nogueira et al. [32,33] enhance traditional BM25 search by expanding passages with synthetic queries. On the other hand, Ma et al. [34] propose a zero-shot learning approach for passage retrieval using synthetic question generation, while Wang et al. [29] introduce Generative Pseudo Labeling (GPL), an unsupervised domain adaptation method that combines a T5-based query generator with pseudo labelling from a cross-encoder. However, these methods are not suitable for our specific use case, since generating claims from fact-checking articles is a novel task in itself, and relying on pre-trained query generation models trained for different purposes is therefore not appropriate. Additionally, the use of Margin Mean Squared Error (MarginMSE) [35] in GPL, which relies on a cross-encoder trained on Microsoft Machine Reading Comprehension (MSMARCO) data, may not be effective for our specific debunked-narrative retrieval task. This is because general information retrieval tasks typically take general queries as input, whereas the task in this paper specifically focuses on false claims on social media and in political debates (Sect. 4.1).
While existing unsupervised methods show promising results, there is still room for improvement in retrieval performance and applicability. UTDRM aims to address these challenges by utilising unsupervised learning techniques tailored specifically for training debunked-narrative retrieval models. It focuses on generating high-quality topical misinformation claims from fact-checking articles (Sect. 3.1) which, to the best of our knowledge, has not been explored in previous work. These generated claims are employed to train the retrieval model in a zero-shot setting (Sect. 3.2).
Finally, this study is the first to assess the performance of LLMs (LLaMA 2 and ChatGPT) as listwise re-rankers on seven publicly available debunked-narrative retrieval datasets (Sect. 6.4). This assessment is conducted to examine how LLMs perform in comparison to other unsupervised methods, including our UTDRM.

UTDRM: unsupervised method for training debunked-narrative retrieval models
Debunked-narrative retrieval is a key task in a typical fact-checking workflow, where verification professionals determine whether the claim or content that they need to verify has already been debunked in a publicly available debunking article posted by another fact-checking organisation. This is essentially a retrieval task, where a misinformation claim serves as the query to extract relevant debunked claims (or fact-checked claims) from a database of already published, publicly available debunking articles. It must be noted that if a claim has not already been debunked in a published article, there may not be suitable matches. This section presents our proposed UTDRM method, which consists of two steps: (i) generation of topical claims (Sect. 3.1); and (ii) training of a debunked-narrative retrieval model (Sect. 3.2). Figure 1 illustrates the end-to-end pipeline for UTDRM.

Topical claim generation
We synthetically generate topical claims that resemble misinformation claims based on the debunked information provided by professional fact-checkers. To accomplish this, we propose two novel methods: the use of the Text-to-Text Transfer Transformer (T5) and ChatGPT as claim generators. In this work, we specifically investigate the zero-shot scenario, where annotated pairs of social media posts and debunked claims are unavailable, and only a large corpus of fact-checks is available.

T5 claim generator
The T5 claim generator is a sequence-to-sequence model based on the text-to-text transfer transformer (T5) [36]. We choose the T5 model because of its proven effectiveness in various sequence-to-sequence tasks in prior research [29,32,36]. T5 is used to generate claims from fact-checking articles by framing the task as an encoder-decoder problem. The encoder is trained to understand and represent the fact-checking articles, while the decoder generates potential misinformation claims that can be effectively debunked using the corresponding fact-checking articles.
To train the T5 claim generator, we first create a corpus of fact-checking articles published by different fact-checking organisations, namely Boomlive, Agence France-Presse (AFP) and Politifact. We choose these fact-checking websites for their wide topic coverage, deferring the comparison of claim generators trained on different websites to future research. A total of 23,901 fact-checking articles were collected. For each fact-checking article, we collect the debunked claim statement, the title and the main body of the article. During fine-tuning, the input to the T5 model consists of the title and the main body of the fact-checking article, and the model is trained to generate the debunked claim statement. Since the generated claims are conditioned on the fact-checking article, they remain closely related to the actual claims being debunked in the fact-checking article. Please refer to Appendix A.1 for hyperparameter details.

ChatGPT claim generator
We use ChatGPT (gpt-3.5-turbo) to generate tweets that are relevant to the debunked claims of the fact-checking articles collected above. To achieve this, we provide an input prompt instructing the model to generate five different tweets about the text, ensuring that the generated tweets are not fact-checks or debunks. Additionally, we encourage diversity of hashtags in the generated tweets to enhance their variability. The input prompt is:
Generate ten different tweets about the text delimited by triple backticks. Make sure that generated tweets should not be a fact-check or a debunk. Also, tweets should have different hashtags. '''{Debunked Claim}'''
In summary, we use ChatGPT in conjunction with the T5 claim generator due to our observation that ChatGPT generates claims that are more diverse (Table 2) and closely resemble actual tweet claims (Sect. 3.1.3). Additionally, both the T5 and ChatGPT claim generators can address emerging topics by generating claims from incoming topical fact-checks. These generated claims serve as valuable inputs for training our neural retrieval model (Sect. 3.2).
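As an illustration of the prompting step, the following minimal sketch fills the prompt template quoted in this section and post-processes a model response. The helper names `build_prompt` and `parse_tweets` are hypothetical, the assumption that the model returns one tweet per line (optionally numbered or bulleted) is ours, and no API call is made here.

```python
import re

# Triple backticks are built programmatically to keep this listing readable.
FENCE = "`" * 3

def build_prompt(debunked_claim: str) -> str:
    """Fill the prompt template quoted in the text with one debunked claim."""
    return (
        "Generate ten different tweets about the text delimited by triple backticks. "
        "Make sure that generated tweets should not be a fact-check or a debunk. "
        "Also, tweets should have different hashtags. "
        f"{FENCE}{debunked_claim}{FENCE}"
    )

def parse_tweets(response_text: str, limit: int = 3) -> list[str]:
    """Split a response into tweets, assuming one tweet per line.

    Leading list markers such as '1.', '2)' or '-' are stripped. The first
    `limit` tweets are kept (three ChatGPT claims per article are used in
    the experiments).
    """
    tweets = []
    for line in response_text.splitlines():
        cleaned = re.sub(r"^\s*(\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            tweets.append(cleaned)
    return tweets[:limit]
```

The string returned by `build_prompt` would be sent as the user message to the chat model; how the response is requested and stored is left out of this sketch.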

Generated claims
Table 1 showcases sample claims generated from T5 and ChatGPT. We present five random instances of debunked claims alongside three generated claims from each model. In the first example, T5 produces three claims pertaining to Senator Kamala Harris potentially violating laws during a visit to an Ohio voting site, while ChatGPT generates alternative claims with similar themes. Similarly, for the other examples, T5 and ChatGPT generate diverse variations of claims related to Dr Kafeel Khan's involvement in a farmers' rally in Delhi and a supposed COVID-19 cure by a Pondicherry University student.
In summary, both T5 and ChatGPT generate different types of claims with variations in wording, focus, and emphasis, while still conveying similar information related to the original debunked claims. Moreover, our analysis reveals that the claims generated by T5 exhibit simplicity and a higher level of similarity to the debunked claims. On the other hand, the claims generated by ChatGPT demonstrate greater diversity and closely resemble actual tweets, often incorporating hashtags (as shown in Table 2, Sect. 3.1.4). Notably, some of the ChatGPT-generated claims ask questions while stating the debunked claim (last example in Table 1). Finally, by using both T5 and ChatGPT, we can capture a broader range of claim styles and ensure comprehensive coverage for training debunked-narrative retrieval models.

Quality and diversity
Table 2 evaluates the generated claims using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [37] and self Bilingual Evaluation Understudy (selfBLEU) [38] metrics. Following previous work [29,32], our evaluation does not involve human assessment. Instead, we rely on automatic metrics to assess the quality of generated claims. ROUGE measures the proximity of the generated claims to the reference debunked claims, while selfBLEU assesses the diversity among the generated claims. The choice of these metrics is justified by their close alignment with our research objectives, emphasising both quality and diversity as crucial evaluation criteria. We generate a total of six claims (three from each claim generator) from the collected fact-checking articles (Sect. 3.1), as this yields the best scores during experiments (see Sect. 6.2). The results in Table 2 indicate that T5 outperforms ChatGPT in ROUGE scores across all n-gram levels, indicating higher overlap with the reference debunked claims. This performance difference can be attributed to the fine-tuning of T5 in the T5 claim generator. Further evaluation of retrieval models trained on generated claims will provide insights into the claim quality and their alignment with task requirements (Sect. 5).
Table 2 also presents the selfBLEU scores, which compute the similarity between the generated claims, with lower scores indicating higher diversity. T5 exhibits higher selfBLEU scores across all n-gram levels, indicating more similarity among its generated claims. In contrast, ChatGPT achieves lower selfBLEU scores, suggesting greater diversity and distinctiveness in its generated claims.
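To make the two metric families concrete, the sketch below computes a simplified ROUGE-N recall and a simplified selfBLEU for a single n-gram order. It is an illustration under stated assumptions, not the scoring code behind Table 2: tokenisation is plain lower-cased whitespace splitting, the selfBLEU variant uses clipped n-gram precision without a brevity penalty or averaging over n-gram orders, and all function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def clipped_precision(candidate, references, n=1):
    """BLEU-style clipped n-gram precision of a candidate against references."""
    cand = ngrams(candidate.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(count, max_ref[gram]) for gram, count in cand.items()) / total

def self_bleu_n(claims, n=1):
    """Average precision of each claim against all other claims.

    Lower values mean the claims share fewer n-grams, i.e. are more diverse.
    """
    if len(claims) < 2:
        raise ValueError("self_bleu_n needs at least two claims")
    scores = [
        clipped_precision(claim, claims[:i] + claims[i + 1:], n)
        for i, claim in enumerate(claims)
    ]
    return sum(scores) / len(scores)
```

Under this sketch, a set of near-identical claims scores close to 1 on `self_bleu_n`, while claims with no shared n-grams score 0, which matches the reading of selfBLEU as an inverse diversity measure.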

Neural retrieval model
The neural retrieval model is a transformer model fine-tuned on pairs of generated claims and original debunked claim statements using multiple negatives ranking loss (MNRL) [39,40]. Consider a dataset of synthetically generated claims g = (g_1, ..., g_N) along with their corresponding debunked claim statements d = (d_1, ..., d_N). During fine-tuning, each batch of size K pairs a generated claim g_i with its relevant debunked claim statement d_i, which is the same debunked claim used for generating g_i. The remaining K - 1 elements in the batch are irrelevant debunked claim statements, which are the hard negatives mined using a pretrained retrieval model. Every debunked claim statement d_j is a negative candidate for generated claim g_i if i ≠ j. The loss for a single batch of size K is defined as

$$\mathcal{L} = -\frac{1}{K}\sum_{i=1}^{K}\log\frac{\exp\big(\mathrm{Sim}(f_\theta(g_i), f_\theta(d_i))\big)}{\sum_{j=1}^{K}\exp\big(\mathrm{Sim}(f_\theta(g_i), f_\theta(d_j))\big)},$$

where f_θ is the sentence encoder using the transformer model and Sim is the similarity between the encoded embeddings. We employ the cosine similarity function with the mean-pooling technique due to its proven effectiveness in prior research [41]. MNRL aims to maximise the similarity between the generated claim and its relevant debunked claim statement while minimising the similarity with irrelevant statements. Hyperparameter details are in Appendix A.1.
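The MNRL objective described in this section can be illustrated with a small pure-Python sketch. It assumes the common in-batch softmax cross-entropy form of the loss with cosine similarity, operates on already computed embedding vectors, and introduces an illustrative `scale` factor that is not specified in the text; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(claim_embs, debunk_embs, scale=1.0):
    """In-batch multiple negatives ranking loss.

    claim_embs[i] and debunk_embs[i] form the positive pair; every
    debunk_embs[j] with j != i acts as a negative for claim i.
    `scale` multiplies the similarities (assumption: 1.0 means plain cosine).
    """
    batch_size = len(claim_embs)
    total = 0.0
    for i in range(batch_size):
        sims = [scale * cosine(claim_embs[i], d) for d in debunk_embs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(sims)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        total += log_denominator - sims[i]
    return total / batch_size
```

The loss is small when each claim embedding is closer to its own debunk than to the other debunks in the batch, and grows when a negative is more similar than the positive.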

Experimental setup

Evaluation datasets
We evaluate the models on the test set of seven publicly available datasets. The datasets are divided into two types based on whether the claims are sourced from Twitter or from political debates or speeches:
• Twitter-based datasets: Snopes [16] and CLEF CheckThat! Lab task datasets, which include CLEF 22 2A [3], CLEF 21 2A [14] and CLEF 20 2A [2].
• Political-based datasets: Politifact [16] and CLEF CheckThat! Lab task datasets, which include CLEF 22 2B [3] and CLEF 21 2B [14].
To assess the diversity of domains, we calculate the pairwise domain overlap between all the claims in the datasets using a weighted Jaccard similarity measure [42]. Figure 2 shows a heatmap illustrating the pairwise weighted Jaccard similarity scores. Apart from CLEF 22 2B and CLEF 21 2B, the results indicate a relatively low overlap among most datasets, suggesting that the evaluation of UTDRM is conducted on diverse data.
To avoid any data leakage between the evaluation datasets and the fact-checking articles utilised for claim generation (Sect. 3.1), we exclude all fact-checking articles that exhibit a Jaccard similarity of 0.5 or higher between the debunked claim statements. Please note that these articles are removed from the corpus used for claim generation, not from the evaluation datasets.
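The leakage filter can be sketched as follows. This is a minimal illustration that assumes Jaccard similarity is computed over lower-cased whitespace tokens, which the text does not specify; `jaccard` and `filter_leaky_claims` are hypothetical names.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def filter_leaky_claims(training_claims, evaluation_claims, threshold=0.5):
    """Keep only training-side debunked claims whose similarity to every
    evaluation-side debunked claim is below the threshold."""
    return [
        claim
        for claim in training_claims
        if all(jaccard(claim, other) < threshold for other in evaluation_claims)
    ]
```

For example, a training claim sharing two of four distinct tokens with an evaluation claim reaches the 0.5 threshold and is dropped, while an unrelated claim is kept.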

Out-of-the-box models
We use two strong out-of-the-box pre-trained models for information retrieval. We test these models in their default configuration without any supervision from the generated claims to assess their zero-shot performance. The models are: (1) the Sentence-Transformers model all-mpnet-base-v2, based on Masked and Permuted Pre-training for Language Understanding (MPNet) [45], which has been trained on a large and diverse dataset of over a billion training examples. (2) Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE), which is a RoBERTa [46] model fine-tuned on the MSMARCO dataset [47] with hard negatives selected using approximate nearest neighbor search [48].

Unsupervised methods
We use five different unsupervised methods which utilise the same set of fact-checking articles for training as used in the claim generation process (Sect. 3.1): (1) ICT [27] is employed to generate pseudo-claims by uniformly sampling sentences from the fact-checking articles. MNRL loss (Sect. 3.2) is then applied to train the model using the pairs of pseudo-claims and debunked claim statements. (2) Back-Translation (BT) [49] involves translating all debunked claim statements to Hindi and then back to English. The resulting pairs of back-translated claims and original debunked claim statements are then used for training the model with MNRL loss. (3) SimCSE [25] encodes the same debunked claim statement twice with different dropout masks and utilises MNRL loss for training. (4) TSDAE [28] pre-trains a retrieval model using a denoising autoencoder. It encodes debunked claim statements in which 60% of the tokens have been randomly deleted, and the decoder reconstructs the original debunked claim statements [28]. All unsupervised methods employ a distilled version of RoBERTa-base [46] as the underlying model. Hyperparameter details are in Appendix A.1.

Supervised methods
We also report previous State-Of-The-Art (SOTA) performance achieved by the winners of the shared tasks on the test set, as published in their respective papers [2,3,14,16].Please note that these supervised methods benefit from annotated training data, which enables them to utilise specific information pertaining to real-world instances of misinformation claims and their corresponding debunks.
For example, the winning team of CLEF 22 2A [24] uses Sentence-T5 [50] for candidate selection and GPT-Neo [51] for re-ranking. The winning team in CLEF 22 2B [52] employs a combination of semantic and lexical similarity features between claims and debunks for retrieval. In CLEF 21 2A [14], the top-performing team utilises a combination of TF-IDF, Sentence-BERT, and Lambda Multiple Additive Regression Trees (LambdaMART) for ranking [53], while the winning team in CLEF 21 2B [54] combines the Sentence-BERT model with a custom neural network to get the final list of debunks sorted by relevance. The top-performing team in CLEF 20 2A [55] uses a fine-tuned RoBERTa model for retrieval.
Lastly, for Snopes and Politifact, we directly report scores from Shaar et al. [16], who utilise a pairwise learning-to-rank model for debunk retrieval.

Experimental details
UTDRM is tested on two models: a distilled version of the RoBERTa-base model (UTDRM-RoBERTa) and the MPNet model (UTDRM-MPNet) (Sect. 5). We generate six topical claims (three from each claim generator) for each of the 23,901 collected fact-checking articles (Sect. 3.1), as this approach yields the best scores during experiments (Sect. 6.2). Following previous work [29], we employ nucleus sampling during generation, using a Top-k value of 25 and a Top-p value of 0.95. For the ChatGPT claim generator, we keep all API parameters at their default values, except for the temperature, which is set to 0.7 to ensure diversity. The total cost of using ChatGPT to generate the claims was 14 GBP. Finally, a total of 143,406 (23,901 × 6) generated claims are used for training the neural retrieval model.
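The decoding settings quoted above (Top-k of 25 and Top-p of 0.95) can be illustrated on a toy next-token distribution. The sketch below assumes one common ordering, in which the top-k cut is applied first, the surviving probabilities are renormalised, and the nucleus (top-p) cut is applied to that renormalised distribution; the function names are hypothetical and a real generator applies this filtering to model logits at every decoding step.

```python
import random

def top_k_top_p_filter(probs, top_k=25, top_p=0.95):
    """Restrict a token->probability mapping to its top-k tokens, then to the
    smallest prefix whose renormalised cumulative probability reaches top_p.
    Returns the renormalised distribution over the kept tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    k_mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {token: p / kept_mass for token, p in kept}

def sample_token(probs, rng, top_k=25, top_p=0.95):
    """Draw one token from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Sampling from the truncated distribution, rather than always taking the most likely token, is what allows several distinct claims to be generated from the same fact-checking article.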

Evaluation metrics
For evaluation, we employ two widely used ranking metrics [3,14]: Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP). MRR computes the score based on the highest-ranked relevant debunk for each misinformation tweet and is defined as

$$\mathrm{MRR} = \frac{1}{|C|}\sum_{i=1}^{|C|}\frac{1}{\mathrm{rank}_i},$$

where |C| is the number of input claims used as queries and rank_i is the rank of the relevant debunk for the ith claim. The higher the MRR score, the better. MAP, on the other hand, measures the precision of the system in returning relevant results for a given query. We use two variations of MAP: MAP@1 and MAP@5, which evaluate the top one and top five retrieved documents, respectively. A higher MAP@k score indicates better performance.

Results and discussion
BM25 and out-of-the-box models. These models consistently achieve high retrieval scores across all metrics, with MPNet outperforming the others (Table 3, columns 3-5). This indicates that leveraging models trained on other information retrieval datasets can improve retrieval effectiveness (Sect. 4.2). However, it is important to note that there are variations in performance among the datasets, suggesting that the models' effectiveness might depend on the specific characteristics of the dataset.
Among the Twitter-based datasets, MPNet stands out as the best-performing model with the highest average scores. It achieves an average MAP@1 score of 0.841, MAP@5 score of 0.886, and MRR score of 0.888. In contrast, when considering the political-based datasets (Politifact, CLEF 22 2B-EN, and CLEF 21 2B-EN), BM25 emerges as the top-performing model with an average MAP@1 score of 0.353, MAP@5 score of 0.406, and MRR score of 0.446, indicating its effectiveness in retrieving relevant information from political speech datasets. Overall, the average scores show that the models perform better on the Twitter-based datasets than on the political-based datasets, which suggests that political-based claims pose greater challenges for the models.
Table 3 Performance of BM25, out-of-the-box, unsupervised and SOTA supervised models. The first part of the table shows the individual and average results for Twitter-based datasets, while the second part shows the individual and average results for political-based datasets. UTDRM results are highlighted in blue. The highest scores for each dataset and metric are in bold
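For readers who want to reproduce scores of the kind discussed in this section, the following minimal sketch computes MRR and MAP@k from ranked lists of debunk identifiers. It is an illustration rather than the official shared-task scorer: in particular, normalising average precision by min(number of relevant debunks, k) is an assumption, and the function names are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the highest-ranked relevant debunk, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Average precision over the top-k results for one query."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    denominator = min(len(relevant_ids), k)  # normalisation is an assumption
    return score / denominator if denominator else 0.0

def mrr(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query claim."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def map_at_k(runs, k):
    """Mean of average_precision_at_k over all query claims."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

With a single relevant debunk per claim, MAP@1 reduces to the fraction of claims whose top result is correct, which is why it is always the lowest of the three reported scores.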
Unsupervised methods. Table 3 reports the results of the unsupervised methods, including the baselines BT, ICT, SimCSE and TSDAE (columns 6-9), as well as the proposed UTDRM-RoBERTa (Table 3, column 10). All these methods utilise a distilled RoBERTa model, as described in Sect. 4.2. Among the baselines, ICT achieves the highest scores across all metrics, followed by SimCSE and TSDAE. However, our proposed UTDRM-RoBERTa achieves the highest average scores for both Twitter-based and political-based datasets, followed by ICT and SimCSE. Additionally, the table reveals that each method has its own strengths and weaknesses on different datasets. For instance, UTDRM-RoBERTa performs well on all datasets except Politifact, where it is surpassed by ICT.
Furthermore, given the impressive performance of the out-of-the-box MPNet model, we also test UTDRM on the MPNet model (Table 3 column 11).UTDRM-MPNet outperforms all other methods, achieving the highest scores across all evaluation metrics.
Supervised methods. Table 3 (last column) reports the results for the previous SOTA methods (Sect. 4.2). These methods benefit from annotated training data, allowing them to leverage specific information about real-life misinformation claims and debunked claim statements (Sect. 4.2). In contrast, UTDRM does not have access to any annotated training data. Surprisingly, the UTDRM-MPNet model, despite being an unsupervised method, achieves comparable or even superior retrieval scores compared to the SOTA supervised models. This demonstrates the effectiveness of UTDRM without the need for any annotations.
Summary. We find that the choice of method depends on specific requirements, data availability, and the desired performance-resource trade-off. UTDRM-RoBERTa and UTDRM-MPNet consistently yield the highest retrieval scores, while the out-of-the-box models offer viable alternatives without the need for any training data whatsoever for debunked-narrative retrieval. Additionally, our proposed method, UTDRM, has the potential to detect topical misinformation claims by generating claims from incoming topical fact-checks, thus allowing it to address emerging topics and contribute to the timely detection of misinformation.

Influence of fact-checking articles
Table 4 shows the results of UTDRM-MPNet when trained using different numbers of fact-checking articles (1K, 5K, 10K, and All). Due to space limitations, the table reports only the MAP@1 and MRR metrics.
The results suggest that the size of the corpus does have a positive effect on the performance of UTDRM, but the extent of the improvement may vary depending on the specific dataset and corpus size being used (Table 4). For instance, CLEF 21 2A (a Twitter-based dataset) shows an increasing trend until the number of fact-checking articles reaches 10K, after which performance remains relatively constant. On the other hand, for political-based datasets, the average performance continues to increase as the number of fact-checking articles increases, suggesting that a larger corpus of fact-checking articles has a more pronounced impact on improving retrieval performance.

Influence of the generated claims
Table 5 shows the results of UTDRM-MPNet using different numbers of generated claims for training (N = 2, 6, 10, 20). It should be noted that the proportion of claims generated using T5 and ChatGPT is kept the same for all cases. The individual performance of models trained on T5 and ChatGPT generated claims separately is generally lower (Appendix A.2 and A.3).
Table 5 demonstrates an overall improvement in performance as the number of generated claims increases from N = 2 to N = 6 and N = 10 across most datasets.However, performance either declines or stabilises beyond N = 10.For instance, in the Snopes dataset, MAP@1 and MRR scores show a slight decline from N = 6 to N = 20.Similar trends are observed in the CLEF 22 2A, CLEF 21 2A, and CLEF 20 2A datasets, where MAP@1 performance peaks at N = 6 and then plateaus or slightly decreases.In contrast, the CLEF 22 2B and CLEF 21 2B datasets reach their peak performance at N = 10.In general, the results suggest that N = 6 is the optimal value for the number of generated claims, as it yields the highest average retrieval performance, while going beyond this range may introduce noise and decrease performance.

Influence of entity inoculation
We propose an entity inoculation method, which involves replacing a random named entity in the generated claims with another random named entity to simulate real-world scenarios where similar misinformation narratives spread with different entities (see Appendix A.4 for examples).By training the model with these modified claims, it is expected to become more robust in retrieving debunked narratives regardless of the specific entities involved.Table 6 presents the results of entity inoculation using different entity types: geopolitical entities (GPE), person (PERSON), and organisation name (ORG), as well as a combined approach that uses all types.The Default column represents the performance of UTDRM-MPNet without entity inoculation (from Table 3).
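A minimal sketch of the replacement step is given below. It assumes that named entities have already been identified by a NER tagger and are passed in as (surface form, label) pairs, and that the replacement is drawn from a pool of entities of the same type; both choices are illustrative assumptions, and the function name `inoculate` is hypothetical.

```python
import random

def inoculate(claim, entities, pool, rng, allowed_types=("GPE", "PERSON", "ORG")):
    """Replace one random named entity in a claim with another random entity.

    claim:    the generated claim text
    entities: list of (surface_form, label) pairs found in the claim
    pool:     mapping from label to candidate replacement entities
    rng:      random.Random instance, so that runs can be reproduced
    """
    candidates = [
        (surface, label)
        for surface, label in entities
        if label in allowed_types and surface in claim
    ]
    if not candidates:
        return claim  # nothing to replace
    surface, label = rng.choice(candidates)
    replacements = [entity for entity in pool.get(label, []) if entity != surface]
    if not replacements:
        return claim  # no alternative entity of this type available
    # Only the first occurrence of the chosen entity is replaced.
    return claim.replace(surface, rng.choice(replacements), 1)
```

Restricting `allowed_types` to a single label corresponds to the per-type settings (GPE, PERSON, ORG), while passing all three corresponds to the combined setting.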
Entity inoculation shows positive results on political-based datasets with an average increase of two MRR points with the combined approach as compared to the Default performance without entity inoculation.This indicates the effectiveness of entity inoculation in handling misinformation narratives in political contexts.On the other hand, for Twitterbased datasets, the impact of entity inoculation is less pronounced.While entity inoculation shows benefits in making models' entities agnostic, we hypothesise that its effectiveness may be limited to datasets that contain cases where similar narratives are spread with different entities.Examples of such false narratives can be found in Appendix A.4.
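The replacement step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that named entities have already been tagged by some NER system and are passed in as (text, type) pairs, that the replacement is drawn from a pool of entities of the same type, and that only the first occurrence is substituted. The names `inoculate`, `entities`, and `pool` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[str, str]  # (surface text, entity type), assumed to come from an NER tagger


def inoculate(
    claim: str,
    entities: List[Entity],
    pool: Dict[str, List[str]],
    rng: random.Random,
    allowed_types: Sequence[str] = ("GPE", "PERSON", "ORG"),
) -> str:
    """Replace one randomly chosen named entity with another entity of the same type.

    Assumption: same-type replacement from a fixed pool; the paper only states that
    a random entity is replaced with another random entity.
    """
    # Keep entities of an allowed type that occur in the claim and have an alternative.
    candidates = [
        (text, label)
        for text, label in entities
        if label in allowed_types
        and text in claim
        and any(alt != text for alt in pool.get(label, []))
    ]
    if not candidates:
        return claim  # nothing to replace, leave the claim unchanged
    text, label = rng.choice(candidates)
    replacement = rng.choice([alt for alt in pool[label] if alt != text])
    return claim.replace(text, replacement, 1)  # substitute the first occurrence only


claim = "Kamala Harris broke Ohio election laws."
entities = [("Kamala Harris", "PERSON"), ("Ohio", "GPE")]
pool = {"PERSON": ["Kamala Harris", "Barack Obama"], "GPE": ["Ohio", "Texas"]}

gpe_only = inoculate(claim, entities, pool, random.Random(0), allowed_types=("GPE",))
combined = inoculate(claim, entities, pool, random.Random(0))
```

Restricting `allowed_types` to a single type mirrors the per-type columns of Table 6, while the default tuple corresponds to the combined setting.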

Influence of large language models (LLMs)
Large Language Models (LLMs) have consistently demonstrated impressive performance across a wide range of natural language processing (NLP) tasks [56,57]. However, their application in information retrieval tasks remains an ongoing area of research, with the aim of optimising their ability to retrieve relevant information from large corpora in response to a given input query [58,59]. Therefore, to assess the performance of LLMs in comparison to our UTDRM method, we employ a Listwise Re-ranker with a Large Language Model (LRL) [59] to re-rank the Top-k documents retrieved by the initial stage ranker. In this context, the LLM is provided with an instruction template. Because of LLM memory constraints, the input may exceed the maximum input sequence length. In such cases, we implement progressive re-ranking (M = 20) following the approach of Ma et al. [59]. This technique re-ranks M debunks at a time and incrementally shifts the window by M/2 towards the beginning of the retrieved debunks, leading to an enhancement in the top-ranked results. In this work, we test two types of LLMs: (1) the open-source LLaMA 2 13B [56, 57]; and (2) the private LLM ChatGPT (gpt-3.5-turbo). LLaMA 2 was hosted on our local server (2x24GB NVIDIA GeForce RTX 3090), and for ChatGPT we use the OpenAI API. The total cost of testing using ChatGPT was 20 GBP.
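The progressive re-ranking procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of Ma et al. [59]: the LLM call is abstracted as a hypothetical callable `rerank_window` that takes a short list of debunks and returns the same items in a new order, and the window starts at the end of the retrieved list and moves towards the beginning in steps of M/2.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def progressive_rerank(
    docs: Sequence[T],
    rerank_window: Callable[[List[T]], List[T]],
    window: int = 20,
) -> List[T]:
    """Re-rank `docs` with a sliding window that moves from the tail to the head.

    `rerank_window` stands in for the listwise LLM re-ranker (an assumption here);
    it must return a permutation of the items it receives.
    """
    ranked = list(docs)
    step = max(1, window // 2)  # shift by M/2 after each window
    end = len(ranked)
    while True:
        start = max(0, end - window)
        ranked[start:end] = rerank_window(ranked[start:end])
        if start == 0:
            break  # the window has reached the beginning of the list
        end -= step
    return ranked


# Toy stand-in for the LLM: items are integers and a larger value means more relevant.
def toy_reranker(items: List[int]) -> List[int]:
    return sorted(items, reverse=True)


first_stage = list(range(50))  # worst-case order: the most relevant item is last
reranked = progressive_rerank(first_stage, toy_reranker, window=20)
```

Because consecutive windows overlap by half, the best items of each window are carried into the next one, which is why the procedure improves the top of the list without ever passing the full list to the model at once.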
Table 7 shows the results of LRL using BM25 and UTDRM-MPNet as first-stage rankers. For example, "BM25+ChatGPT" (column 5 of Table 7) signifies that BM25 performs the first-stage ranking and ChatGPT conducts the second-stage ranking. Following the methodology from prior work [59], the LLM is used to re-rank 100 documents on top of BM25 and 20 documents on top of UTDRM. The results indicate that ChatGPT outperforms LLaMA 2 across all datasets and metrics. Moreover, we find that re-ranking on top of UTDRM yields superior scores compared to re-ranking on top of BM25 (Table 7). Figure 3 visually depicts the average MRR performance of different retrieval methods. For the Twitter-based datasets, although UTDRM achieves the highest average scores, UTDRM+ChatGPT outperforms UTDRM in Snopes (MAP@1) and in CLEF 21 2A-EN (MAP@1 and MRR). For the political-based datasets, notably, UTDRM+ChatGPT beats UTDRM and attains the highest performance in MAP@1, MAP@5, and MRR across all datasets.
While LLMs exhibit impressive performance, it is important to consider the trade-offs, one of which is retrieval cost and latency. We conduct experiments to measure the time taken per claim to retrieve debunks for each method, and we observe notable differences in retrieval speed. Figure 3 compares the retrieval times and average MRR performance of different retrieval methods. We find that BM25+LLaMA2 and BM25+ChatGPT exhibit longer retrieval times, averaging around 80 seconds and 50 seconds per claim, respectively. In contrast, UTDRM+LLaMA2 and UTDRM+ChatGPT significantly reduce retrieval time, taking only 8 seconds and 5 seconds per claim, respectively, possibly due to the smaller number of debunks to be re-ranked. Remarkably, UTDRM-MPNet on its own achieves an exceptionally low retrieval time of just 0.04 seconds per claim. These findings underscore that, despite LLMs' impressive performance in relevance ranking, they often come at the cost of extended retrieval times, whereas our proposed UTDRM-MPNet approach offers both high relevance and exceptional retrieval speed.

Error analysis
The evaluation of UTDRM would be incomplete without a thorough examination of the types of errors it may produce. To address this, we manually review cases where the retrieval model fails to rank the most relevant debunked claim at the top. We conduct this analysis by inspecting the retrieved debunked claims for 50 randomly selected cases from the Snopes and Politifact datasets. We find that the primary cause of such errors is a misinformation claim being associated with multiple debunked claims (19 out of 50). For instance, the false claim "African Union warning African citizens against the safety of travelling to the United States" in Snopes has multiple relevant debunked claims. In such instances, the model assigns similarly high scores to all relevant debunked claims, even though each misinformation claim is linked to a single debunked claim in the dataset. This highlights inconsistencies in the existing datasets and the need for further improvement.
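A small worked example may help to show why this first error type is penalised by the metrics. The sketch below assumes the common single-gold-label formulation, in which the reciprocal rank is 1 divided by the position of the annotated debunk and the per-query precision at rank 1 is either 1 or 0; the function names and debunk identifiers are hypothetical.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked_ids: Sequence[Hashable], gold_id: Hashable) -> float:
    """Return 1/position of the annotated debunk, or 0.0 if it is not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / position
    return 0.0


def hit_at_1(ranked_ids: Sequence[Hashable], gold_id: Hashable) -> float:
    """With a single annotated debunk, precision at rank 1 is 1.0 or 0.0."""
    return 1.0 if len(ranked_ids) > 0 and ranked_ids[0] == gold_id else 0.0


# Two debunks are equally relevant, but only "debunk_A" is annotated as gold.
ranking = ["debunk_B", "debunk_A", "debunk_C"]
rr = reciprocal_rank(ranking, "debunk_A")  # 0.5
top1 = hit_at_1(ranking, "debunk_A")       # 0.0
```

Even though the top-ranked debunk is relevant in this toy case, the query counts as a miss at rank 1 and contributes only 0.5 to the mean reciprocal rank, which is the effect described above.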
The second type of error occurs when the retrieved debunked claim is not entirely relevant, but there is some degree of relevance to the input misinformation claim (16 out of 50). For instance, for the claim "Governor Christie has endorsed many of the ideas that Barack Obama supports, whether it is gun control or the appointment of Sonia Sotomayor", the top retrieved debunked claim discusses Governor Chris Christie and Barack Obama sharing similar views on gay marriage. This highlights the challenge of distinguishing closely related debunked claims, emphasising the need for continued refinement in retrieval models for enhanced precision. Moreover, we hypothesise that this may also be attributed to limitations in the claim generation model, where it generates claims that, while not entirely irrelevant, are only tangentially related to the intended debunked claim. Such cases suggest that errors from claim generation propagate into the retrieval process and point to the need for improvement in the claim generation model.
The third category, accounting for 15 out of 50 cases, involves errors that occur when a misinformation claim lacks sufficient context to find the relevant debunked claim. For example, one of the misinformation claims in the Politifact dataset states "very few children", which is ambiguous and makes finding a relevant debunk challenging. Moreover, the task becomes even more challenging when misinformation claims span multiple modalities, such as combining text and images. For instance, one of the misinformation claims is an X (formerly Twitter) post stating "Botswana condemns remarks made by President Trump", along with an image containing details of the remarks. In such cases, retrieval models also require the information contained in the image, as the text of the tweet alone is not sufficient. This motivates future work on multimodal debunked-narrative retrieval, where models can exploit joint information from different modalities.

Conclusion
This paper presents UTDRM, an unsupervised method for training debunked-narrative retrieval models that effectively overcomes the reliance on manually annotated training data. UTDRM introduces a novel approach to synthetically generate large-scale topical claims from fact-checking articles. A comprehensive comparison with other out-of-the-box, unsupervised, and supervised models confirms the efficacy of UTDRM in retrieving accurate debunked claims. In general, UTDRM-MPNet and UTDRM-RoBERTa consistently achieve the highest scores across all datasets, with UTDRM-MPNet exhibiting slightly better performance.
Furthermore, this study emphasises the importance of corpus size, demonstrating that larger corpora contribute to improved retrieval performance. The paper also examines how different factors, such as the quantity of synthetically generated claims used and the entity inoculation method, influence the performance of UTDRM. While entity inoculation shows benefits in making models entity-agnostic, its effectiveness may be limited to cases involving narratives that adapt and propagate with different entities.
Additionally, this paper experiments with state-of-the-art LLMs as listwise re-rankers and compares them to our UTDRM method. While LLMs exhibit slight performance improvements over UTDRM on some datasets, their use comes at the cost of lower computational efficiency, making UTDRM a more practical choice for real-time applications.
Finally, UTDRM allows models to adapt and learn from synthetically generated topical claims in real time, thus providing significant benefits in combating ever-evolving topical misinformation.

Limitations and future work
The present work acknowledges certain limitations and identifies several avenues for future improvement. Firstly, this study focused solely on English-language datasets and did not explore cross-lingual retrieval. However, the UTDRM approach can be replicated and adapted to other languages using pre-trained multilingual language models. Conducting cross-lingual experiments would provide a more comprehensive understanding of UTDRM's performance and applicability in diverse linguistic contexts, thereby extending its potential impact in combating misinformation on a global scale. Additionally, future work can include testing on a broader range of fact-checking articles and exploring novel approaches to further improve the information retrieval models used in UTDRM.

A.1 Hyperparameters
For the T5 claim generator, we fine-tune the base variant of the T5 model (https://huggingface.co/t5-base) using a constant learning rate of 1e-4 for 2 epochs, with a batch size of 12. The maximum number of input tokens is 512, and the maximum number of output tokens is set to 64.
The training details for the neural retrieval model are as follows. UTDRM-RoBERTa is fine-tuned for two epochs with a batch size of 64 and a learning rate of 4e-5. UTDRM-MPNet is fine-tuned for one epoch with a batch size of 64 and a learning rate of 8e-7. The maximum input sequence length is set to 350, the optimiser used is AdamW, and we use linear warmup as the learning rate scheduler. Hard negatives for training the neural retrieval model are mined using the all-mpnet-base-v2 and all-MiniLM-L12-v2 models because of their demonstrated efficacy. Both UTDRM-RoBERTa and UTDRM-MPNet are validated using the respective dataset's validation set, and we manually tune the hyperparameters based on the evaluation metrics (Sect. 4.3). The hyperparameter bounds are as follows: 1) epochs range from 1 to 5, 2) learning rate ranges from 1e-7 to 1e-5, and 3) batch size ranges from 8 to 64, limited by the GPU requirements of the model. The training time for each epoch ranges from 10 to 15 minutes.
For the baselines, BT and ICT use the same hyperparameters as UTDRM-RoBERTa to ensure a fair comparison. For SimCSE and TSDAE, we use the same hyperparameters as stated by the authors in their respective papers [25,28]. Finally, all experiments are conducted on a machine with a 24GB NVIDIA GeForce RTX 3090.
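For convenience, the settings reported in this appendix can be collected in a single configuration object. The Python dictionary below is only a summary sketch: the values are those stated above, while the dictionary name, the key names, and the grouping of the sequence length, optimiser, and scheduler under a shared entry are assumptions made for illustration.

```python
# Summary of the hyperparameters reported in Appendix A.1.
# Key names and structure are hypothetical; values are taken from the text.
UTDRM_HYPERPARAMETERS = {
    "t5_claim_generator": {
        "model": "t5-base",
        "learning_rate": 1e-4,  # constant learning rate
        "epochs": 2,
        "batch_size": 12,
        "max_input_tokens": 512,
        "max_output_tokens": 64,
    },
    "utdrm_roberta": {"epochs": 2, "batch_size": 64, "learning_rate": 4e-5},
    "utdrm_mpnet": {"epochs": 1, "batch_size": 64, "learning_rate": 8e-7},
    "retrieval_shared": {
        "max_seq_length": 350,
        "optimizer": "AdamW",
        "scheduler": "linear warmup",
        "hard_negative_miners": ["all-mpnet-base-v2", "all-MiniLM-L12-v2"],
    },
}
```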
From Table 8, we can observe that the model generally performs better on Twitter-based datasets, with the highest MAP@1 and MRR values of 0.945 and 0.962 respectively, recorded on the CLEF 20 2A dataset with N = 6 and N = 10 generated claims. In contrast, performance on political-based datasets is comparatively lower, with the highest MAP@1 and MRR values of 0.512 and 0.612 respectively, both recorded on the Politifact
In summary, Table 10 highlights the importance of our adopted approach of entity inoculation, as detailed in Sect. 6.3. This method involves replacing one randomly chosen named entity in the generated claims with another random named entity, with the intent to mimic real-world scenarios where similar misinformation narratives disseminate involving different entities. This emphasises both the adaptability and resilience of misinformation, underlining the need for effective methods like entity inoculation to detect debunked narratives.

Figure 3
Retrieval times (in seconds per query) and average MRR performance (scaled by a factor of 100) comparison of different retrieval methods

Table 1
Sample claims generated from the T5 and ChatGPT claim generators (Sect. 3.1). The debunked claim statements from the fact-checking articles used to generate the claims are also mentioned

Kamala harris "said "this weekend, yells at people in line to vote in ohio, may have broken laws."
1. Did Sen. Kamala Harris cross the line during her Ohio voting site visit? Investigation underway. #votingrights #legalissues
2. Sen. Kamala Harris broke ohio election laws.
2. Sen. Kamala Harris is facing accusations of breaking the law during a recent visit to an Ohio voting site. Stay tuned for updates!
3. Kamala harris broke ohio election laws by speaking to people in line to vote outside a polling site.
3. Did Sen. Kamala Harris break the law during a recent visit to an Ohio voting site? The controversy continues to unfold. #VoterFraud

Dr Kafeel Khan was present at farmers tractor rally in Delhi.
1. A picture worth a thousand words - Dr. Kafeel Khan driving a tractor at the farmers rally in Delhi on Republic Day. #Solidarity #India
2. Dr Kafeel Khan sitting on a red tractor in Delhi
2. Dr. Kafeel Khan showing his solidarity with farmers as he drives a tractor at the Republic Day rally in Delhi. #FarmersProtest #India
3. Dr Kafeel Khan Was Present On Tractor Rally By Farmers On Republic Day
3. A powerful image of Dr. Kafeel Khan joining the farmers tractor rally in Delhi on Republic Day. #Solidarity #FarmersRights

Pondicherry university student found a coronavirus cure by mixing ginger, black pepper powder and honey for five days.
1. #Inspiration This is the kind of news that inspires us all. A student from Pondicherry University has made a groundbreaking discovery in the fight against COVID-19. Let's support and celebrate our young innovators!
2. "An Indian student from Pondicherry university found a home remedy cure for COVID-19 which is for the very first time accepted by WHO."
2. The world is one step closer to defeating COVID-19 thanks to a brilliant mind from Pondicherry University. Let's celebrate this incredible achievement! #science #innovation
3. BJP member of parliament, Manaskar Singh, found "a home remedy cure for COVID-19, which is for the very first time accepted by WHO".
3. #Hope A glimmer of hope in these trying times! A student from Pondicherry University has uncovered a potential cure for COVID-19. Let's keep our fingers crossed for a breakthrough!

Photos show money seised from the Income Tax raids conducted at the house of MK Stalin's daughter Senthamarai
1. The rich get richer? Income tax raids at MK Stalin's daughter's house expose a hoard of cash and gold. Will they be held accountable? #corruption #India
2. Photos shows the money seized at the house of MK Stalins daughter Senthamarai Sabareesan
2. Shocking revelation! Income tax raids at MK Stalin's daughter's house reveal cash and gold - photos don't lie! #corruption #IndiaPolitics
3. Photos show cash and gold taken in Income Tax raids at the house of MK Stalins daughter Senthamarai
3. The truth is out! Income tax raids at MK Stalin's daughter's house reveal a stash of cash and gold. Will justice prevail? #corruption #India

Table 2
ROUGE and selfBLEU scores for claims generated from the T5 and ChatGPT claim generators. Lower selfBLEU scores indicate higher diversity, while higher ROUGE scores indicate greater overlap with the reference debunked claims

Table 3
Results of the UTDRM evaluation, divided into two parts: the top part presents the individual and average results for Twitter-based datasets (Snopes, CLEF 22 2A-EN, CLEF 21 2A-EN & CLEF 20 2A-EN), while the bottom part showcases the individual and average results for political-based datasets (Politifact, CLEF 22 2B-EN & CLEF 21 2B-EN)

Table 4
Influence of fact-checking articles on UTDRM. The highest scores for each dataset and metric are in bold

Table 5
Influence of the generated claims on UTDRM. The highest scores for each dataset and metric are in bold

Table 6
Influence of entity inoculation on UTDRM. UTDRM is the default UTDRM-MPNet performance from Table 3. The highest scores for each dataset and metric are in bold

Table 7
Influence of large language models (LLaMA 2 and ChatGPT) as a second-stage retriever to re-rank the top candidate claims retrieved by BM25 and UTDRM. UTDRM is the default UTDRM-MPNet performance from Table 3. UTDRM+ChatGPT signifies that UTDRM-MPNet performs the initial ranking, and ChatGPT conducts the second-stage ranking. The highest scores for each dataset and metric are in bold

Table 10
Examples showcasing the variation of similar debunked claims across multiple entities and contexts, with corresponding fact-check links. The text in bold shows the difference in named entities between the claims