A Linear SVM system that evaluates the credibility of news articles and classifies each one as Fake or Real, trained on the WELFake dataset.
This project uses a Linear SVM model trained on the WELFake dataset to classify news articles as real or fake. The system combines machine learning with rule-based legitimacy checks for improved accuracy on current news.
The model was trained and evaluated on 20,000 articles with the following metrics:
| Metric | Value |
|---|---|
| Accuracy | 94.30% |
| Precision | 93.96% |
| Recall | 94.88% |
| F1-Score | 94.42% |
- Accuracy: Percentage of correctly classified articles (both real and fake)
- Precision: Of articles predicted as fake, how many were actually fake
- Recall: Of all actual fake articles, how many were correctly identified
- F1-Score: Harmonic mean of precision and recall
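As a quick sanity check on the definitions above, the sketch below computes each metric from confusion-matrix counts and confirms that the reported F1-Score follows from the reported precision and recall. The function names are illustrative and are not taken from `model.ipynb`; "fake" is treated as the positive class, matching the definitions above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all articles (real and fake) that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Of the articles predicted as fake, the share that were actually fake."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of all actual fake articles, the share that were correctly identified."""
    return tp / (tp + fn)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Plug in the precision and recall reported in the table above.
reported_f1 = f1_score(0.9396, 0.9488)
print(f"{reported_f1:.2%}")  # prints 94.42%
```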
- Linear SVM classification model
- TF-IDF feature extraction (10,000 features)
- Legitimacy checks for news agencies and journalistic language
- Real-time credibility scoring (0–100)
- Confidence percentage display
- Current event detection (2024–2025 topics)
Follow these steps to run the project locally.
git clone https://github.com/ashvin2005/AI_ML_project.git
cd AI_ML_project
pip install -r requirements.txt
If Streamlit is not installed properly:
pip install streamlit
streamlit run app.py
Open your browser at:
http://localhost:8501
├── app.py
├── model.ipynb
├── requirements.txt
├── svm_model.joblib
├── tfidf_vectorizer.joblib
└── .gitignore
- Source: WELFake Dataset
- Size: 72,000+ news articles
- Labels: Real (0) and Fake (1)
- Training Sample: 20,000 articles (Milestone-1)
- Full dataset (72K) planned for Milestone-2
- Lowercasing
- URL removal
- Punctuation removal
- Digit removal
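The four cleaning steps above could be implemented roughly as follows. This is a minimal sketch, not the code from `model.ipynb`: the function name `clean_text`, the URL pattern, the order of the steps, and the final whitespace collapse are all assumptions.

```python
import re
import string

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Translation table that deletes ASCII punctuation only.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Apply lowercasing, URL removal, punctuation removal, and digit removal."""
    text = text.lower()                # lowercasing
    text = URL_PATTERN.sub(" ", text)  # URL removal (before punctuation is stripped)
    text = text.translate(PUNCT_TABLE) # punctuation removal
    text = re.sub(r"\d+", " ", text)   # digit removal
    return " ".join(text.split())      # collapse leftover whitespace (assumption)


print(clean_text("Breaking: Visit https://example.com NOW, 100% true!"))
# prints: breaking visit now true
```

URLs are removed before punctuation here so that the `://` and dots are still present when the pattern runs; reversing the order would leave URL fragments behind.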
- TF-IDF vectorization
- Unigrams and bigrams
- 10,000 maximum features
- Linear Support Vector Machine (SVM)
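The feature extraction and model settings above map fairly directly onto scikit-learn. The sketch below assumes `TfidfVectorizer` and `LinearSVC` with otherwise default hyperparameters, and uses a tiny made-up corpus in place of WELFake. The project stores the vectorizer and model as separate `.joblib` files, so the single `Pipeline` here is only a convenience for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_pipeline(max_features: int = 10_000):
    """TF-IDF (unigrams + bigrams, English stop words) followed by a linear SVM."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),         # unigrams and bigrams
        max_features=max_features,  # 10,000 maximum features
        stop_words="english",
    )
    return make_pipeline(vectorizer, LinearSVC())


# Toy stand-in data (hypothetical); labels follow the README: 0 = Real, 1 = Fake.
texts = [
    "officials reported new budget figures according to the ministry",
    "the central bank said rates remain unchanged this quarter",
    "shocking secret cure doctors hate revealed click now",
    "unbelievable miracle trick exposes hidden conspiracy truth",
]
labels = [0, 0, 1, 1]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
predictions = pipeline.predict(texts)
```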
- News agency identifiers (Reuters, AP, etc.)
- Journalistic language (said, according to, reported)
- Current events (2024, 2025, Gaza, Ukraine, COVID, elections)
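One plausible way to implement these rule-based checks is to count how often each category's terms appear in the text. The sketch below is an assumption about the approach, not the logic in `app.py`: the dictionary layout, the function name, and the matching rules are hypothetical, and only the terms listed above are used. How the counts feed into the final score is not documented here, so the sketch stops at counting.

```python
import re

# Terms taken from the list above; the category keys are hypothetical names.
LEGITIMACY_TERMS = {
    "news_agency": ["Reuters", "AP"],
    "journalistic_language": ["said", "according to", "reported"],
    "current_events": ["2024", "2025", "Gaza", "Ukraine", "COVID", "elections"],
}


def count_legitimacy_indicators(text: str) -> dict:
    """Count whole-word matches of each category's terms in the text."""
    counts = {}
    for category, terms in LEGITIMACY_TERMS.items():
        total = 0
        for term in terms:
            # Short acronyms such as "AP" are matched case-sensitively so that
            # ordinary lowercase words do not count; everything else ignores case.
            is_short_acronym = term.isupper() and len(term) <= 3
            flags = 0 if is_short_acronym else re.IGNORECASE
            total += len(re.findall(rf"\b{re.escape(term)}\b", text, flags))
        counts[category] = total
    return counts


sample = (
    "Officials said on Tuesday, according to Reuters, "
    "that elections in Ukraine will proceed in 2025."
)
print(count_legitimacy_indicators(sample))
# prints: {'news_agency': 1, 'journalistic_language': 2, 'current_events': 3}
```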
- Enter the news article text in the text area (minimum 30 characters)
- Click the "Analyze" button
- View the results:
  - Classification: Real News or Fake News
  - Credibility Score: 0–100
  - Confidence Percentage
  - Debug Info: Legitimacy and fake indicator counts
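The same flow can be sketched outside the Streamlit UI. The function below is hypothetical and is not the code in `app.py`: it enforces the 30-character minimum, classifies with a fitted model and vectorizer, and derives a confidence percentage by passing the SVM decision margin through a logistic function. That confidence mapping is an assumption made for illustration, and the credibility score is left out because its formula is not described here.

```python
import math

MIN_CHARS = 30                              # minimum input length from the steps above
LABELS = {0: "Real News", 1: "Fake News"}   # label encoding stated in this README


def analyze(text: str, model, vectorizer) -> dict:
    """Classify one article with a fitted linear SVM and TF-IDF vectorizer."""
    if len(text.strip()) < MIN_CHARS:
        raise ValueError(f"Text must be at least {MIN_CHARS} characters long.")
    features = vectorizer.transform([text])
    # Signed distance from the separating hyperplane; positive means class 1 (Fake).
    margin = float(model.decision_function(features)[0])
    label = 1 if margin > 0 else 0
    # Assumed mapping: logistic function of the absolute margin, as a percentage.
    confidence = 100.0 / (1.0 + math.exp(-abs(margin)))
    return {
        "classification": LABELS[label],
        "confidence": round(confidence, 1),
        "margin": margin,
    }


# Possible usage with the saved artifacts from this repository:
# import joblib
# model = joblib.load("svm_model.joblib")
# vectorizer = joblib.load("tfidf_vectorizer.joblib")
# print(analyze("Paste an article of at least 30 characters here.", model, vectorizer))
```

A linear SVM does not produce calibrated probabilities, so a confidence derived this way is best read as a relative indicator rather than a true probability.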
| Detail | Value |
|---|---|
| Algorithm | Linear Support Vector Machine (SVM) |
| Features | TF-IDF with 10,000 max features |
| Stop Words | English |
| Training Time | ~2–3 minutes (20,000 articles) |
| Future Scope | Training on full 72K dataset (Milestone-2) |
- Model trained mainly on 2016–2018 data
- May not recognize very recent events or names
- Best suited for formal news articles
- Short texts (<30 characters) not supported
- Scikit-learn: https://scikit-learn.org/stable/
- Streamlit: https://docs.streamlit.io/
- Pandas: https://pandas.pydata.org/docs/
- NLTK: https://www.nltk.org/