Methodology
Our transparent methodology for analyzing research integrity
Rule-Based COI Detection Algorithm
The Research Integrity Project uses a rule-based algorithm to evaluate conflicts of interest (COI) in scientific papers. The methodology combines semantic extraction by language models with fixed rules and thresholds, and follows international standards and guidelines (ICMJE, COPE, WAME, CONSORT, PRISMA, DOAJ).
Detection Algorithm Flow
1 Ingestion and Preprocessing
The system extracts and cleans the scientific document content:
- Plain text: title, authors, affiliations, main sections
- Structure detection: Abstract, Introduction, Methods, Results, Discussion, Conclusions
- Key section identification: Conflict of Interest, Funding, Acknowledgements
- Cleanup: removal of extraction artifacts and duplicate spaces
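A minimal sketch of the cleanup and section-detection steps listed above, using only the standard library. The heading lists come from this page, but the matching strategy (a heading alone on its own line) and the function names are assumptions for illustration, not the project's actual implementation.

```python
import re

# Section names taken from the list above; matching them as stand-alone lines is an assumption.
MAIN_SECTIONS = ["Abstract", "Introduction", "Methods", "Results", "Discussion", "Conclusions"]
KEY_SECTIONS = ["Conflict of Interest", "Funding", "Acknowledgements"]


def clean_text(raw: str) -> str:
    """Drop form feeds and soft hyphens, collapse runs of spaces/tabs, trim each line."""
    text = raw.replace("\x0c", "\n").replace("\u00ad", "")
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.splitlines())


def split_sections(text: str) -> dict:
    """Return {canonical heading: body} for every known heading found on its own line."""
    headings = MAIN_SECTIONS + KEY_SECTIONS
    canonical = {h.lower(): h for h in headings}
    pattern = re.compile(
        r"^(" + "|".join(re.escape(h) for h in headings) + r")[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section body runs until the next recognized heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[canonical[m.group(1).lower()]] = text[m.end():end].strip()
    return sections


sample = "Abstract\nWe  study   X.\nFunding\nSponsored by\tACME.\nConflict of Interest:\nNone declared.\n"
sections = split_sections(clean_text(sample))
```

Real PDFs often number or abbreviate their headings, so a production version would likely need a more tolerant matcher than this one.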
2 Semantic Extraction with AI
The language model identifies structured textual facts:
Identifies
- Author and institution names
- Potential funders
- Explicit/implicit COI declarations
- Fragments about funding and sponsors
Detects
- Language patterns (promotional vs critical)
- Presence/absence of limitations
- Companies, foundations, organizations
- Relationships between authors and sponsors
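One way to keep the extraction step auditable is to have the model return its findings in a fixed schema that is validated before any rule runs. The sketch below assumes the model is asked for JSON; the `ExtractedFacts` class, its field names, and the allowed tone values are hypothetical and only mirror the items listed above.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ExtractedFacts:
    """Hypothetical container for the facts listed above (field names are assumptions)."""
    authors: List[str] = field(default_factory=list)
    institutions: List[str] = field(default_factory=list)
    funders: List[str] = field(default_factory=list)
    coi_statement: Optional[str] = None        # verbatim declaration, if one was found
    funding_statement: Optional[str] = None
    tone: str = "unknown"                      # "promotional", "critical", or "unknown"
    has_limitations: Optional[bool] = None     # None means it could not be determined
    author_sponsor_links: List[Tuple[str, str]] = field(default_factory=list)


ALLOWED_TONES = {"promotional", "critical", "unknown"}


def parse_facts(model_output: str) -> ExtractedFacts:
    """Parse the model's JSON, ignore unknown keys, and coerce out-of-schema values."""
    data = json.loads(model_output)
    known = {k: v for k, v in data.items() if k in ExtractedFacts.__dataclass_fields__}
    facts = ExtractedFacts(**known)
    if facts.tone not in ALLOWED_TONES:
        facts.tone = "unknown"
    # JSON has no tuples, so author/sponsor pairs arrive as two-element lists.
    facts.author_sponsor_links = [tuple(pair) for pair in facts.author_sponsor_links]
    return facts


raw = '{"authors": ["A. Smith"], "funders": ["ACME"], "tone": "salesy", "extra": 1, "author_sponsor_links": [["A. Smith", "ACME"]]}'
facts = parse_facts(raw)
```

Rejecting or neutralizing anything outside the schema keeps later rule application dependent only on fields with a known meaning.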
Predatory Journal Detection Module
A specialized module runs in parallel to detect potential predatory publishing practices:
1. Metadata Extraction
The system extracts specific metadata from the PDF:
- ISSN (International Standard Serial Number)
- Journal Name (Normalized)
- Publisher Name
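As an illustration of the metadata step, the sketch below pulls ISSN candidates out of text, keeps only those whose check digit is valid under the standard ISSN modulus-11 scheme, and normalizes a journal name for comparison. The function names and the exact normalization (ASCII folding, lowercasing, stripping punctuation) are assumptions; the page does not say how names are normalized.

```python
import re
import unicodedata

ISSN_RE = re.compile(r"\b(\d{4})-(\d{3}[\dXx])\b")


def is_valid_issn(issn: str) -> bool:
    """Validate the check digit: weights 8..2 on the first seven digits, modulus 11."""
    digits = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", digits):
        return False
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))


def find_issns(text: str) -> list:
    """Return valid ISSNs in order of first appearance, without duplicates."""
    found = []
    for m in ISSN_RE.finditer(text):
        issn = f"{m.group(1)}-{m.group(2).upper()}"
        if is_valid_issn(issn) and issn not in found:
            found.append(issn)
    return found


def normalize_journal_name(name: str) -> str:
    """Fold accents to ASCII, lowercase, and reduce punctuation/whitespace to single spaces."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", folded.lower()).strip()


issns = find_issns("ISSN 0378-5955 (print), ISSN: 2049-3630, mistyped 1234-5678")
```

Validating the check digit filters out page ranges and other number pairs that merely look like an ISSN.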
2. Blacklist Matching
Cross-references extracted data against curated blacklists:
- Beall's List (Archived)
- PredatoryJournals.org
- Community Contributed Lists
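The matching step could look roughly like the sketch below: an exact ISSN or exact normalized-name hit counts as a match, while a near-identical name is reported as inconclusive. The `BlacklistEntry` structure, the three outcome labels, and the 0.85 similarity cutoff are assumptions; the page does not define what makes a match inconclusive.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BlacklistEntry:
    name: str                 # journal name, already normalized (lowercase, no punctuation)
    issn: Optional[str]       # may be missing in community lists
    source: str               # which list the entry came from


def match_blacklist(
    issn: Optional[str], name: str, entries: List[BlacklistEntry], cutoff: float = 0.85
) -> Tuple[str, Optional[BlacklistEntry]]:
    """Return ("match" | "inconclusive" | "no_match", matching entry or None)."""
    # 1. Exact ISSN match is the strongest signal.
    if issn:
        for entry in entries:
            if entry.issn == issn:
                return "match", entry
    # 2. Exact match on the normalized journal name.
    by_name = {entry.name: entry for entry in entries}
    if name in by_name:
        return "match", by_name[name]
    # 3. A close but not identical name is treated as inconclusive (assumed policy).
    close = difflib.get_close_matches(name, list(by_name), n=1, cutoff=cutoff)
    if close:
        return "inconclusive", by_name[close[0]]
    return "no_match", None


# Hypothetical entry for illustration only; it is not taken from any real list.
entries = [BlacklistEntry("journal of advanced everything", "1234-5679", "example list")]
status, hit = match_blacklist(None, "journal of advanced everythin", entries)
```

Under this policy, an "inconclusive" result is the kind of case that would be handed to the external verification step described next.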
3. External Verification (AI Internet Scan)
If the internal database match is inconclusive, a second-layer AI scan is triggered:
- Checks open web for predatory signals (e.g., fast review times, spam complaints).
- Verifies against online watchlists and indexes.
- Provides a confidence score and evidence summary.
New: Database Enrichment
If the external scan detects a high risk but the internal database is silent, users can manually add the journal to the internal database, improving future detection for all users.
Data Sources & Access
We believe in transparency. Our predatory journal detection relies on open databases.
External Resources
Open Data
Download our full aggregated database of predatory journals used in the analysis.
Download Full Database (CSV for Excel)
3 Rule Application by Dimensions
From the extracted facts, a score from 0 to 100 is calculated for each of the five dimensions.
Transparency of declared conflicts and funding
- No COI or funding section in a sensitive study → score 75-90, level 'high'
- COI declared as 'no conflicts' + clear funding → score 20-35, level 'low'
- Funding present but no COI mention → score 40-60, level 'medium'
- Vague declarations → score 60-75, level 'high'
Relationship between funding and results
- Commercial sponsor + very positive results + no criticism → score 60-85, 'high'
- Public/academic sponsor + balanced discussion → score 20-40, 'low'
- No identifiable sponsor → score 40-55, 'medium'
- Sponsor + favorable results + promotional language → score 70-90, 'high'
Author-institution-funder network
- Several authors employed by the funding company → score 70-90, 'high'
- Diverse academic affiliations with no commercial ties → score 20-40, 'low'
- Missing or generic affiliations → score 60-80, 'high'
- Single institution = sponsor → score 55-75, 'high'
Editorial integrity and predatory journal risk
- Predatory journal detected (blacklist) → score 100, 'high' (CRITICAL)
- Predatory journal signals (in text) → score 70-90, 'high'
- Evidence of peer review and ethics policies → score 20-40, 'low'
- Insufficient information about the journal → score 40-60, 'medium'
Language bias and reporting quality
- Promotional language + no limitations → score 60-80, 'high'
- Sober language + honest limitations → score 20-40, 'low'
- Lack of methodological transparency (CONSORT/PRISMA) → +10-20 points
- Excessive self-citation + promotional tone → score 55-75, 'high'
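To show how fixed rules can turn extracted facts into a dimension score, here is a sketch for the first dimension only (transparency of declared conflicts and funding). The score ranges and levels are the ones listed above; the `Facts` fields, the highest-risk-first rule order, and the choice of the range midpoint as the score are assumptions, since the page does not say how a value inside a range is picked.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Facts:
    """Hypothetical boolean facts produced by the extraction step."""
    sensitive_study: bool
    has_coi_section: bool
    declares_no_conflicts: bool
    has_funding_section: bool
    funding_is_clear: bool
    statements_are_vague: bool


@dataclass(frozen=True)
class Rule:
    label: str
    applies: Callable[[Facts], bool]
    score_range: Tuple[int, int]
    level: str


# Ranges and levels come from the list above; the ordering is an assumption.
TRANSPARENCY_RULES: List[Rule] = [
    Rule("no COI or funding section in a sensitive study",
         lambda f: f.sensitive_study and not f.has_coi_section and not f.has_funding_section,
         (75, 90), "high"),
    Rule("vague declarations",
         lambda f: f.statements_are_vague, (60, 75), "high"),
    Rule("funding present but no COI mention",
         lambda f: f.has_funding_section and not f.has_coi_section, (40, 60), "medium"),
    Rule("'no conflicts' declared and funding clear",
         lambda f: f.declares_no_conflicts and f.funding_is_clear, (20, 35), "low"),
]


def apply_rules(facts: Facts, rules: List[Rule]) -> Optional[Tuple[int, str, str]]:
    """Return (score, level, rule label) for the first rule that applies, else None."""
    for rule in rules:
        if rule.applies(facts):
            low, high = rule.score_range
            return (low + high) // 2, rule.level, rule.label   # midpoint is an assumption
    return None


paper = Facts(sensitive_study=True, has_coi_section=False, declares_no_conflicts=False,
              has_funding_section=True, funding_is_clear=True, statements_are_vague=False)
result = apply_rules(paper, TRANSPARENCY_RULES)
```

Returning the label of the rule that fired lets the final report cite the exact rule behind each score.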
4 Global Score Calculation
The global score and risk level are calculated from the five dimension scores. The resulting risk level is one of LOW, MEDIUM, or HIGH.
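This page does not state the aggregation formula or the cutoffs between levels, so the following is only a sketch of one plausible scheme: a weighted mean of the five dimension scores with equal weights by default, mapped to a level by two cutoffs. The dimension keys, the weights, and the cutoff values 40 and 60 are all placeholders chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical keys for the five dimensions described in step 3.
DIMENSIONS = (
    "transparency",
    "funding_vs_results",
    "network",
    "editorial_integrity",
    "language_bias",
)


def global_score(scores: Dict[str, float], weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of the dimension scores (equal weights unless given; an assumption)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def risk_level(score: float, medium_from: float = 40.0, high_from: float = 60.0) -> str:
    """Map a 0-100 score to a level; both cutoffs are placeholders, not published values."""
    if score >= high_from:
        return "HIGH"
    if score >= medium_from:
        return "MEDIUM"
    return "LOW"


example = {"transparency": 30, "funding_vs_results": 70, "network": 80,
           "editorial_integrity": 100, "language_bias": 70}
overall = global_score(example)
level = risk_level(overall)
```

Keeping the weights and cutoffs as explicit parameters makes it easy to substitute the project's real values once they are known.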
5 Report Generation
The model writes a natural-language report with a fixed structure:
- Always the same sections
- Consistent levels
- Direct references
Role of the AI Model
The AI does not decide risk levels by intuition. Its specific function is to:
1. Extract information from the text
2. Map findings to predefined rules
3. Draft the report with a fixed structure
Current Algorithm Limitations
- Based solely on the text of the paper provided
- For the COI analysis, no access to external COI forms, trial registries, or other external databases
- Predatory journal detection is based on known blacklists (may not cover new journals)
- The algorithm indicates COI risk; it does not prove that a conflict legally exists
- Tool for critical reading and activism, not a court of truth
Privacy & Data Security
What We Store:
- Analysis results and metadata
What We Don't Store:
- Full PDF file contents permanently
- Personal user information
- IP addresses or tracking data associated with analyses