Accepted at IEEE MIPR 2025
NexusIndex is an advanced fake news detection framework that integrates multimodal embeddings, vectorized proximity layers, and the FAISSNexusIndex layer to significantly enhance retrieval efficiency and detection accuracy. It leverages Transformer-based models for text and MobileNet V3 for image analysis, combined with an adaptive semi-supervised learning approach that dynamically refines the model with evolving misinformation.
git clone https://github.com/solmazsm/NexusIndex.git
cd NexusIndex
We evaluate NexusIndex on several datasets, including:
Politifact: A well-known dataset for fact-checking.
GossipCop: A dataset containing news articles, labeled as real or fake.
Hugging Face: The GossipCop dataset is available on Hugging Face.
ABC News: A large-scale dataset used for semi-supervised learning with pseudo-labeling.
WELFake: A text-based dataset containing real and fake news articles.
These datasets are used to train and test the fake news detection model.
All the News: This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020. It is an expanded edition of the original All the News dataset, which was compiled in early 2017. While the original dataset contains more than 100,000 articles, the new dataset’s greater size and breadth should allow researchers to study a wider selection of media.
To enhance the performance of fake news detection, we propose integrating a threshold-based pseudo-labeling strategy within the NexusIndex framework. This approach first trains the initial model on a labeled dataset, then applies the trained model to predict class probabilities on an unlabeled dataset.
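The selection step of threshold-based pseudo-labeling can be sketched as follows. This is a minimal illustration, not the NexusIndex implementation: the function name `select_pseudo_labels`, the default threshold of 0.9, and the sample probabilities are all assumptions made for the example. It assumes a binary classifier that outputs the probability of the "real" class (label 1), with 0 = fake, and keeps only the unlabeled examples whose predicted class probability reaches the threshold.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[float], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (index, pseudo_label) pairs for confident predictions.

    probs[i] is the predicted probability that unlabeled example i is
    real (label 1); 1 - probs[i] is the probability that it is fake (label 0).
    The threshold value is a hypothetical default, not taken from the paper.
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1.0]")

    selected = []
    for i, p in enumerate(probs):
        if p >= threshold:
            selected.append((i, 1))  # confidently real
        elif 1.0 - p >= threshold:
            selected.append((i, 0))  # confidently fake
        # otherwise the example stays unlabeled for this round
    return selected


# Hypothetical model outputs on five unlabeled articles.
unlabeled_probs = [0.97, 0.55, 0.03, 0.91, 0.40]
pseudo = select_pseudo_labels(unlabeled_probs, threshold=0.9)
print(pseudo)
```

In a full semi-supervised loop, the selected examples would typically be added to the labeled set and the model retrained. A higher threshold admits fewer but cleaner pseudo-labels.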
WELFake: The WELFake dataset consists of 72,134 news articles, with 35,028 classified as real and 37,106 as fake. This dataset was created by merging four well-known news datasets: Kaggle, McIntire, Reuters, and BuzzFeed Political. The goal of this merger was to mitigate the risk of overfitting in machine learning classifiers and to offer a larger corpus of text data to enhance the training process for fake news detection models.
The dataset contains four columns: serial number (starting from 0), title (the news headline), text (the news content), and label (0 = fake, 1 = real).
The CSV file has 78,098 data entries, of which only 72,134 are accessed through the data frame.
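As a quick sanity check on this layout, the label distribution can be tallied with the standard library. The sketch below is illustrative only: the column names (`id`, `title`, `text`, `label`) and the three sample rows are assumptions standing in for the real file, whose header may differ.

```python
import csv
import io
from collections import Counter

# Tiny hypothetical sample in the four-column layout described above.
SAMPLE_CSV = """id,title,text,label
0,Headline A,Body of article A,1
1,Headline B,Body of article B,0
2,Headline C,Body of article C,0
"""

# Label convention stated for WELFake: 0 = fake, 1 = real.
LABEL_NAMES = {0: "fake", 1: "real"}


def count_labels(csv_file) -> Counter:
    """Count parsed records per class from a file-like CSV source."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[LABEL_NAMES[int(row["label"])]] += 1
    return counts


counts = count_labels(io.StringIO(SAMPLE_CSV))
print(counts)
```

For an actual CSV on disk, an opened file handle (with `newline=""`, as the `csv` module recommends) can be passed in place of the `io.StringIO` object.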
NexusIndex evaluates the model performance using several metrics:
© 2025 Solmaz Seyed Monir. All rights reserved.