Methodology

Our transparent methodology for analyzing research integrity

Rule-Based COI Detection Algorithm

The Research Integrity Project uses a rule-based algorithm to evaluate conflicts of interest (COI) in scientific papers. The methodology combines semantic extraction by language models with fixed rules and thresholds, and follows international standards and guidelines (ICMJE, COPE, WAME, CONSORT, PRISMA, DOAJ).

Objective: Produce a structured estimate of COI risk and editorial credibility, using only the paper's text as the source, with reproducible and explainable results.

Detection Algorithm Flow

1 Ingestion and Preprocessing

The system extracts and cleans the content of the scientific document.

2 Semantic Extraction with AI

The language model identifies structured textual facts:

Identifies

  • Author and institution names
  • Potential funders
  • Explicit/implicit COI declarations
  • Fragments about funding and sponsors

Detects

  • Language patterns (promotional vs critical)
  • Presence/absence of limitations
  • Companies, foundations, organizations
  • Relationships between authors and sponsors

Output: A set of structured textual facts that feed the algorithm's rules. The AI does not decide "by eye"; it only extracts objective information.
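
As a minimal sketch of what such structured facts could look like, the extraction output might be held in a typed record like the one below. The class name `ExtractedFacts`, its field names, and the tone labels are hypothetical and chosen only for illustration; the methodology itself does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedFacts:
    """Hypothetical container for the facts produced by the extraction step."""

    authors: list[str] = field(default_factory=list)
    institutions: list[str] = field(default_factory=list)
    funders: list[str] = field(default_factory=list)
    # None means no COI declaration was found in the text.
    coi_statement: Optional[str] = None
    funding_fragments: list[str] = field(default_factory=list)
    # Assumed labels: "promotional", "neutral", "critical".
    tone: str = "neutral"
    has_limitations: bool = False
    organizations: list[str] = field(default_factory=list)
    # (author, sponsor) pairs for relationships stated in the paper.
    author_sponsor_links: list[tuple[str, str]] = field(default_factory=list)


facts = ExtractedFacts(
    authors=["A. Author"],
    funders=["Example Pharma Inc."],
    tone="promotional",
    author_sponsor_links=[("A. Author", "Example Pharma Inc.")],
)
```

Keeping the record free of any score or risk level mirrors the separation described above: extraction yields facts, and the rules consume them.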

Predatory Journal Detection Module

A specialized module runs in parallel to detect potential predatory publishing practices:

1. Metadata Extraction

The system extracts specific metadata from the PDF:

  • ISSN (International Standard Serial Number)
  • Journal Name (Normalized)
  • Publisher Name
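
The sketch below illustrates one way the ISSN and normalized journal name could be derived from extracted text. The function names and the normalization policy (strip accents, lowercase, collapse punctuation) are assumptions made for illustration; only the ISSN check-digit rule (weights 8 down to 2, modulo 11, with `X` standing for 10) is the standard one.

```python
import re
import unicodedata
from typing import Optional

# Four digits, optional hyphen, three digits, then a digit or X as check character.
ISSN_PATTERN = re.compile(r"\b(\d{4})-?(\d{3})([\dXx])\b")


def is_valid_issn(issn: str) -> bool:
    """Verify the ISSN check digit (weights 8..2, modulo 11)."""
    digits = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", digits):
        return False
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


def find_issn(text: str) -> Optional[str]:
    """Return the first candidate in the text that passes the check digit."""
    for match in ISSN_PATTERN.finditer(text):
        candidate = f"{match.group(1)}-{match.group(2)}{match.group(3).upper()}"
        if is_valid_issn(candidate):
            return candidate
    return None


def normalize_journal_name(name: str) -> str:
    """Assumed policy: drop accents, lowercase, turn punctuation into spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).split())
```

Validating the check digit filters out most eight-digit strings that merely resemble an ISSN, such as page ranges.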

2. Blacklist Matching

Cross-references extracted data against curated blacklists:

  • Beall's List (Archived)
  • PredatoryJournals.org
  • Community Contributed Lists
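
A minimal sketch of the cross-referencing step follows. The `BlacklistEntry` record, the `match_blacklist` function, and the matching priority (ISSN first, then journal name, then publisher) are assumptions for illustration; the sketch also assumes names have already been normalized and uses exact comparison only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlacklistEntry:
    source: str                      # e.g. "Beall's List (Archived)"
    journal_name: str                # normalized journal name
    issn: Optional[str] = None
    publisher: Optional[str] = None  # normalized publisher name


def match_blacklist(entries, issn=None, journal_name=None, publisher=None):
    """Return (entry, matched_field) for the first hit, or None if nothing matches."""
    # Assumed priority: the most specific identifier is tried first.
    checks = [("issn", issn), ("journal_name", journal_name), ("publisher", publisher)]
    for field_name, value in checks:
        if not value:
            continue  # this piece of metadata was not extracted
        for entry in entries:
            if getattr(entry, field_name) == value:
                return entry, field_name
    return None


entries = [
    BlacklistEntry("Beall's List (Archived)", "example journal of everything", issn="1234-5679"),
    BlacklistEntry("Community Contributed Lists", "another example journal", publisher="example press"),
]
hit = match_blacklist(entries, journal_name="unknown title", publisher="example press")
```

Returning the matched field alongside the entry keeps the result explainable: the report can state which identifier triggered the match and which list it came from.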

3. External Verification (AI Internet Scan)

If the internal database match is inconclusive, a second-layer AI scan is triggered:

  • Checks the open web for predatory signals (e.g., fast review times, spam complaints)
  • Verifies against online watchlists and indexes
  • Provides a confidence score and evidence summary

New: Database Enrichment

If the external scan detects a high risk but the internal database is silent, users can manually add the journal to the internal database, improving future detection for all users.

Impact: If a match is found (Internal or External), the paper is flagged as HIGH RISK (Score 80-100), overriding other dimension scores.
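
One possible reading of this flow is sketched below: the external scan runs only when the internal database yields no match, and a flagged paper is forced into the 80-100 band. Both function names are hypothetical, and clamping the computed score to a floor of 80 is an assumption; the methodology states only the resulting range.

```python
from typing import Callable


def is_predatory_match(internal_match: bool, external_scan: Callable[[], bool]) -> bool:
    """Run the external scan only when the internal database has no match."""
    if internal_match:
        return True
    return external_scan()


def apply_high_risk_override(score: float, matched: bool) -> float:
    """Force a flagged paper into the 80-100 band; leave other papers unchanged."""
    if not matched:
        return score
    return min(100.0, max(80.0, score))
```

Passing the external scan as a callable keeps the slower second-layer check lazy, so it is never invoked when the internal lists already give an answer.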

Data Sources & Access

We believe in transparency. Our predatory journal detection relies on open databases.

Open Data

Download our full aggregated database of predatory journals used in the analysis (CSV, compatible with Excel).

3 Rule Application by Dimensions

From the extracted facts, a score from 0 to 100 is calculated for each of the five dimensions.

1. Disclosure & Funding Transparency

Transparency of declared conflicts and funding

  • No COI or funding section in a sensitive study → score 75-90, level 'high'
  • COI statement declaring 'no conflicts' + clear funding → score 20-35, level 'low'
  • Funding present but no mention of COI → score 40-60, level 'medium'
  • Vague declarations → score 60-75, level 'high'
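
The four rules of this dimension can be sketched as a small decision function. The function name, the boolean inputs, and the order in which the rules are tested are assumptions; the methodology lists the rules but does not state their precedence or how a single score is chosen within a range, so the sketch returns the range itself.

```python
from typing import NamedTuple, Optional


class RuleResult(NamedTuple):
    low: int     # lower bound of the score range
    high: int    # upper bound of the score range
    level: str   # 'low', 'medium' or 'high'
    rule: str    # human-readable rule that fired


def disclosure_rule(
    has_coi_section: bool,
    has_funding_section: bool,
    declares_no_conflicts: bool = False,
    funding_is_clear: bool = False,
    is_vague: bool = False,
    is_sensitive_study: bool = False,
) -> Optional[RuleResult]:
    """Map extracted facts to one rule of dimension 1 (assumed precedence)."""
    if not has_coi_section and not has_funding_section and is_sensitive_study:
        return RuleResult(75, 90, "high", "no COI or funding section in a sensitive study")
    if is_vague:
        return RuleResult(60, 75, "high", "vague declarations")
    if has_funding_section and not has_coi_section:
        return RuleResult(40, 60, "medium", "funding present, COI not mentioned")
    if has_coi_section and declares_no_conflicts and has_funding_section and funding_is_clear:
        return RuleResult(20, 35, "low", "'no conflicts' declared with clear funding")
    return None  # no listed rule covers this combination
```

Returning the rule text with the range lets the report cite the exact rule applied.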

2. Funding-Outcome Alignment

Relationship between funding and results

  • Commercial sponsor + very positive results + no criticism → score 60-85, 'high'
  • Public/academic sponsor + balanced discussion → score 20-40, 'low'
  • No identifiable sponsor → score 40-55, 'medium'
  • Sponsor + favorable results + promotional language → score 70-90, 'high'

3. Author-Institution-Sponsor Network

Network of authors, institutions, and funders

  • Several authors employed by the funding company → score 70-90, 'high'
  • Diverse academic affiliations with no commercial ties → score 20-40, 'low'
  • Missing or generic affiliations → score 60-80, 'high'
  • Single institution that is also the sponsor → score 55-75, 'high'
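
As an illustration of how the network facts might be derived, the sketch below links authors to funders by comparing affiliation names with funder names. The function names, the exact-name (case-insensitive) matching, and the threshold of two authors for "several" are assumptions. The rule for diverse academic affiliations is left out because it would need a classification of institution types that is not described here, and "generic" affiliations are not detected, only missing ones.

```python
from typing import Optional


def authors_linked_to_funders(affiliations: dict[str, list[str]], funders: list[str]) -> list[str]:
    """Authors with at least one affiliation whose name equals a funder's name."""
    funder_names = {f.casefold() for f in funders}
    return sorted(
        author
        for author, institutions in affiliations.items()
        if any(inst.casefold() in funder_names for inst in institutions)
    )


def network_rule(
    affiliations: dict[str, list[str]],
    funders: list[str],
    several: int = 2,  # assumed meaning of "several authors"
) -> Optional[tuple[int, int, str]]:
    """Return (low, high, level) for the first matching rule, else None."""
    institutions = {inst.casefold() for insts in affiliations.values() for inst in insts}
    funder_names = {f.casefold() for f in funders}
    if not institutions:
        return (60, 80, "high")   # affiliations missing
    if len(authors_linked_to_funders(affiliations, funders)) >= several:
        return (70, 90, "high")   # several authors employed by the funder
    if len(institutions) == 1 and institutions <= funder_names:
        return (55, 75, "high")   # single institution that is also the sponsor
    return None
```

In practice, name matching would likely need fuzzier comparison (abbreviations, subsidiaries), which this sketch does not attempt.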

4. Journal / Editorial Integrity

Editorial integrity and predatory journal risk

  • Predatory journal detected (blacklist) → score 100, 'high' (CRITICAL)
  • Signs of a predatory journal (in the text) → score 70-90, 'high'
  • Evidence of peer review and ethics policies → score 20-40, 'low'
  • Insufficient information about the journal → score 40-60, 'medium'

5. Textual Bias & Reporting Quality

Language bias and reporting quality

  • Promotional language + no limitations → score 60-80, 'high'
  • Sober language + honest limitations → score 20-40, 'low'
  • Lack of methodological transparency (CONSORT/PRISMA) → +10-20 points
  • Excessive self-citation + promotional tone → score 55-75, 'high'
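
Unlike the other rules, the methodological-transparency rule adds points to a score instead of assigning a range. A minimal sketch of that adjustment is shown below; the function name, the default penalty of 15, and capping the result at 100 are assumptions.

```python
def apply_transparency_penalty(base_score: int, lacks_transparency: bool, penalty: int = 15) -> int:
    """Add 10-20 points when CONSORT/PRISMA-style transparency is missing."""
    if not 10 <= penalty <= 20:
        raise ValueError("penalty must be between 10 and 20 points")
    if not lacks_transparency:
        return base_score
    # Keep the adjusted score inside the 0-100 scale.
    return min(100, base_score + penalty)
```

For example, a base score of 70 with the default penalty becomes 85, while a base score of 95 with a penalty of 20 is capped at 100.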

4 Global Score Calculation

The global score is the unweighted mean of the five dimension scores, and the risk level follows from fixed thresholds:

overall_score = (dim1 + dim2 + dim3 + dim4 + dim5) / 5

  • 0-33: LOW
  • 34-66: MEDIUM
  • 67-100: HIGH
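
The calculation can be sketched in a few lines. The function names are illustrative, and the handling of fractional averages (a value such as 33.4 stays in the lower band until the next integer boundary is reached) is an assumption, since the thresholds are stated for whole numbers only.

```python
def overall_score(dimension_scores: list[float]) -> float:
    """Unweighted mean of the five dimension scores."""
    if len(dimension_scores) != 5:
        raise ValueError("exactly five dimension scores are required")
    return sum(dimension_scores) / 5


def risk_level(score: float) -> str:
    """Map a 0-100 score to LOW (0-33), MEDIUM (34-66) or HIGH (67-100)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 34:
        return "LOW"
    if score < 67:
        return "MEDIUM"
    return "HIGH"


score = overall_score([30, 80, 75, 40, 60])  # (30 + 80 + 75 + 40 + 60) / 5 = 57.0
level = risk_level(score)                    # "MEDIUM"
```

Because the mean is unweighted, a single high dimension can be offset by four low ones, which is why the predatory-journal match is handled as a separate override.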

5 Report Generation

The model writes a natural-language report with a fixed structure:

  • Fixed Structure: always the same sections
  • Stable Labels: consistent levels
  • Explicit Rules: direct references
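
To illustrate what a fixed structure with stable labels could mean in practice, the sketch below renders one section per dimension, always in the same order, and rejects any level outside the three labels. The `render_report` function and the line layout are hypothetical; only the dimension names and level labels come from the methodology.

```python
DIMENSIONS = [
    "Disclosure & Funding Transparency",
    "Funding-Outcome Alignment",
    "Author-Institution-Sponsor Network",
    "Journal / Editorial Integrity",
    "Textual Bias & Reporting Quality",
]
LEVELS = ("low", "medium", "high")


def render_report(results: dict[str, tuple[int, str, str]]) -> str:
    """Render (score, level, rule) per dimension in a fixed order.

    Raises KeyError if a dimension is missing and ValueError on an unknown level.
    """
    lines = []
    for name in DIMENSIONS:
        score, level, rule = results[name]
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        lines.append(f"{name}: {score}/100 ({level})")
        lines.append(f"  Rule applied: {rule}")
    return "\n".join(lines)
```

Iterating over a fixed list instead of over the input keeps the section order identical across reports, whatever order the results arrive in.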

Role of the AI Model

The AI does not decide risk levels by intuition. Its specific function is:

1. Extract

Information from text

2. Map

Findings to predefined rules

3. Draft

Report with fixed structure

Current Algorithm Limitations

Privacy & Data Security

What We Store:

  • Analysis results and metadata

What We Don't Store:

  • Full PDF file contents permanently
  • Personal user information
  • IP addresses or tracking data associated with analyses