How to use AI to screen papers for systematic reviews

Jan 6, 2026

Screening thousands of papers by hand is one of the most time-consuming steps in a systematic review, and it is where many projects stall. Research published in BMC Medical Research Methodology shows that the average systematic review takes 67.3 weeks to complete, with abstract screening and full-text review ranking among the most labor-intensive tasks. Meanwhile, a study in PLOS ONE found that human reviewers have a total error rate of nearly 11% during abstract screening, meaning even careful manual work misses relevant studies or lets irrelevant ones through. AI screening tools for systematic reviews use machine learning to show the most relevant records first, which can cut screening workload by up to 95% and reduce the risk of human error without sacrificing the rigor that PRISMA demands.

This guide walks you through exactly how AI-assisted screening works, compares the leading tools — including ASReview, Rayyan, and ScholarDock — and gives you a practical, step-by-step workflow for integrating AI into your next systematic review while staying fully PRISMA-compliant.

What is AI systematic review screening?

AI systematic review screening is the use of machine learning algorithms — most commonly active learning — to semi-automate the process of deciding which studies from a database search should be included or excluded in a systematic review. Instead of reviewing every title and abstract in random order, an AI model learns from your early inclusion and exclusion decisions and then reorders the remaining records so the most likely relevant papers appear first.

In practical terms, this means a reviewer screens a small seed set of papers, the algorithm trains on those decisions, and the queue is continuously reprioritized. As more decisions are made, the model becomes increasingly confident about which remaining records are irrelevant, allowing the reviewer to stop screening earlier while still capturing all or nearly all relevant studies.

AI screening does not replace human judgment. It accelerates it. The reviewer still makes every include or exclude decision — the AI simply changes the order in which papers are presented and flags when continued screening is unlikely to find new relevant records.

How active learning differs from keyword search

Traditional database searches rely on Boolean keyword strings. Active learning works on the records those searches return: it uses the content of titles and abstracts, not just the presence of specific keywords, to predict relevance. Depending on how the tool represents text, the model can rank a paper highly even when it uses different terminology for the same research question. It cannot recover studies your search never retrieved, but it makes broader, more sensitive searches practical, because the extra irrelevant records sink to the bottom of the queue.

Why manual screening is the biggest bottleneck

If you have ever run a systematic review, you know the pain: a well-constructed search strategy returns thousands, sometimes tens of thousands, of records. According to data from the PROSPERO registry, the number of citations retrieved per review ranges from 27 to over 92,000, yet only about 2.94% of those records end up included in the final review. For a search that returns 5,000 records, that is roughly 147 included studies and more than 4,850 rejections. Reviewers therefore spend the vast majority of their screening time reading and rejecting irrelevant papers.

The real cost of manual screening

  • Time. A full systematic review averages over 67 weeks from registration to publication. Screening alone — title-abstract review followed by full-text assessment — accounts for a large share of that timeline. One case study published in the Journal of Clinical Epidemiology documented that a review completed with automation tools still required 61 person-hours, with abstract screening and full-text screening among the top five most time-consuming tasks.

  • Money. Research estimates that a single-screening systematic review costs approximately £13,825, while a dual-screening review costs around £35,781 — much of which goes toward the labor of reading and categorizing papers.

  • Error. A study analyzing 139,467 citations across 86 reviewers found a combined false-inclusion and false-exclusion rate of 10.76% during abstract screening. Single-reviewer screening performs even worse: a randomized controlled trial showed that solo reviewers missed 13% of relevant studies on average, with inexperienced screeners missing up to 58%.

These numbers make a strong case for AI-assisted screening — not to eliminate human reviewers, but to make their time dramatically more productive and their decisions more consistent.

How AI screening tools actually work

Most AI screening tools for systematic reviews rely on a technique called active learning, a branch of machine learning where the model actively selects which data points a human should label next. Here is the general workflow:

  1. Import your search results. Export citations from databases like PubMed, Scopus, or Web of Science and upload them to the screening tool.

  2. Label a seed set. The reviewer screens a small number of records (typically 5–20) to give the model initial examples of both relevant and irrelevant papers.

  3. Model training. The algorithm trains on the labeled records, analyzing features of titles and abstracts such as word frequency, semantic similarity, and topic distribution.

  4. Prioritized queue. The remaining unlabeled records are reordered so the papers most likely to be relevant appear at the top of the queue.

  5. Iterative refinement. Each time the reviewer screens another record, the model retrains and updates the queue. Over time, the reviewer encounters a long stretch of irrelevant records — a signal that most relevant papers have already been found.

  6. Stopping rule. The reviewer decides when to stop, based on the model's confidence, the number of consecutive irrelevant records, or a predefined threshold (for example, screening until 95% or 99% estimated recall is reached).
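The six steps above can be sketched in a few lines of Python. This is a toy illustration, not the algorithm used by any of the tools discussed here: the scoring model is a simple smoothed word-count classifier, the function names (tokenize, train, score, screen) are hypothetical, the titles are made up, and the "reviewer" is simulated by a set of known answers.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on non-letters, drop very short tokens (crude features)."""
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return [tok for tok in cleaned.split() if len(tok) > 2]

def train(labeled):
    """Count tokens per class from {record_id: (text, is_relevant)}."""
    counts = {True: Counter(), False: Counter()}
    for text, is_relevant in labeled.values():
        counts[is_relevant].update(tokenize(text))
    return counts

def score(counts, text):
    """Sum of smoothed log-odds that each token comes from a relevant record."""
    vocab = len(set(counts[True]) | set(counts[False])) + 1
    n_rel = sum(counts[True].values())
    n_irr = sum(counts[False].values())
    total = 0.0
    for tok in tokenize(text):
        p_rel = (counts[True][tok] + 1) / (n_rel + vocab)
        p_irr = (counts[False][tok] + 1) / (n_irr + vocab)
        total += math.log(p_rel / p_irr)
    return total

def screen(records, seed_labels, ask_reviewer, budget):
    """Retrain, present the top-ranked unlabeled record, store the decision, repeat."""
    labeled = {rid: (records[rid], label) for rid, label in seed_labels.items()}
    order = []
    for _ in range(budget):
        unlabeled = [rid for rid in records if rid not in labeled]
        if not unlabeled:
            break
        counts = train(labeled)  # step 3: retrain on every decision so far
        best = max(unlabeled, key=lambda rid: score(counts, records[rid]))  # step 4
        labeled[best] = (records[best], ask_reviewer(best))  # step 5: human decides
        order.append(best)
    return order

# Made-up titles, for illustration only.
records = {
    1: "Exercise therapy for chronic low back pain: randomized trial",
    2: "Deep learning for image segmentation in satellite imagery",
    3: "Yoga versus physiotherapy exercise for back pain in adults",
    4: "Economic policy and inflation targeting in emerging markets",
    5: "Pilates exercise program reduces chronic back pain",
    6: "Soil microbiome diversity in alpine meadows",
    7: "Satellite monitoring of glacier retreat",
    8: "Back pain outcomes after supervised exercise: cohort study",
}
truly_relevant = {1, 3, 5, 8}      # stands in for the human reviewer
seed = {1: True, 2: False}         # step 2: one relevant, one irrelevant example
order = screen(records, seed, lambda rid: rid in truly_relevant, budget=3)
print(order)
```

In this toy run the three remaining back-pain titles are the first three records presented, so the simulated reviewer finds every relevant paper after three decisions instead of six. Real tools use far richer text representations and classifiers, but the retrain-rank-label loop has the same shape.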

Key metrics to understand

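  • Work Saved over Sampling (WSS@95): The share of records you can leave unscreened once you have found 95% of all relevant studies, minus the 5% you would expect to save by screening in random order and stopping at the same recall. Top-performing tools regularly achieve WSS@95 values above 80%, which means reaching 95% recall after screening less than 15% of the total records.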

  • Recall: The proportion of all relevant studies that the AI successfully surfaces. For systematic reviews, recall must be very high — typically 95% or above.
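Both metrics are easy to compute once you have a fully labeled dataset in the order the tool presented it, which is how simulation studies evaluate screening tools. The sketch below assumes the commonly used definition WSS@r = (N - k) / N - (1 - r), where N is the total number of records and k is the number screened when recall r is first reached. The function names recall_at and wss_at are hypothetical, and the data is synthetic.

```python
import math

def recall_at(ranked_labels, k):
    """Share of all relevant records found within the first k screened records."""
    total_relevant = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total_relevant

def wss_at(ranked_labels, target_recall=0.95):
    """Work Saved over Sampling at the given recall level.

    ranked_labels: 1 (relevant) or 0 (irrelevant), in screening order.
    """
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    # Small tolerance so floating-point noise does not push the target up by one.
    needed = math.ceil(target_recall * total_relevant - 1e-9)
    found = 0
    for k, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return (n - k) / n - (1 - target_recall)
    raise ValueError("target recall is never reached")

# Synthetic example: 1,000 records, 20 relevant. The ranking puts 18 of them
# first, the 19th at position 150, and the last one at position 900.
labels = [0] * 1000
for i in list(range(18)) + [149, 899]:
    labels[i] = 1

wss95 = wss_at(labels, 0.95)
recall_150 = recall_at(labels, 150)
print(round(wss95, 2), round(recall_150, 2))
```

Here 95% recall (19 of 20 relevant records) is reached after 150 of 1,000 records, so 85% of the records are left unscreened and WSS@95 is 85% minus 5%, or 80%.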

Best AI tools for systematic review screening compared

Several tools now offer AI-assisted screening for systematic reviews. Here is how the leading options compare for research teams in 2026.

ASReview

ASReview is a free, open-source active learning tool developed at Utrecht University. It is one of the most rigorously validated AI screening tools available, with published benchmarks showing it can reduce screening workload by up to 95% while identifying 95% of relevant articles. A real-time comparison study in rheumatology found ASReview reduced screening time by 83% on average.

  • Best for: Researchers who want full transparency, open-source code, and the ability to run simulations to validate screening performance.

  • Limitations: Focused primarily on title-abstract screening. Does not include built-in collaboration features, data extraction, or project management. Requires some technical comfort for installation and setup.

Rayyan

Rayyan is a web-based platform designed specifically for collaborative systematic review screening. Its AI engine learns from reviewer decisions and assigns relevance ratings to unscreened records, with the platform claiming to reduce screening time by up to 90%. Rayyan also handles deduplication and supports blind screening by multiple reviewers.

  • Best for: Teams that need collaborative screening with multiple reviewers working in parallel, especially those who want a simple web interface.

  • Limitations: AI features are less transparent than ASReview. The free plan limits the number of active reviews. Primarily a screening tool — does not support data extraction, synthesis, or project-level research management.

Covidence

Covidence is a comprehensive systematic review management platform endorsed by Cochrane. It supports the full review pipeline from screening through data extraction and risk-of-bias assessment. Its AI screening feature uses machine learning to rank studies by predicted relevance.

  • Best for: Cochrane reviewers and teams that need an end-to-end review platform with structured data extraction forms and quality assessment tools.

  • Limitations: Subscription-based pricing can be expensive for unfunded researchers. The AI component is less configurable than ASReview.

DistillerSR

DistillerSR offers AI-driven screening automation alongside data extraction and reporting. Its machine learning models can be trained to auto-classify low-relevance records, and it supports customizable workflows for complex reviews.

  • Best for: Large institutional teams conducting multiple concurrent reviews with strict regulatory requirements.

  • Limitations: Enterprise pricing puts it out of reach for many independent researchers and smaller teams.

ScholarDock

ScholarDock, a research project and reference management platform, approaches screening differently by embedding it within your broader research workflow. Rather than treating screening as an isolated task, ScholarDock lets you organize screened papers directly into project-specific reference libraries, tag and annotate sources as you screen, and connect screening decisions to your ongoing research projects. ScholarDock's AI automatically suggests related sources, extracts key findings, and helps you maintain living literature reviews that evolve as your systematic review progresses.

  • Best for: Research teams that need screening connected to project management, reference organization, and collaborative workspaces — all in one platform instead of switching between a screening tool, a reference manager, and a project tracker.

  • Key advantage: While standalone screening tools handle one step of the review, ScholarDock connects every step — from initial literature search through screening, extraction, annotation, and citation — in a single workspace.

Step-by-step: how to set up AI screening for your systematic review

Follow this workflow to integrate AI-assisted screening into your next review without compromising methodological rigor.

Step 1: Develop your search strategy as usual

AI screening does not replace a well-designed search strategy. Use the PICO framework (Population, Intervention, Comparison, Outcome) to define your research question, then build comprehensive Boolean search strings and run them across relevant databases (PubMed, Scopus, Web of Science, Embase, and others relevant to your discipline). Document every search exactly as PRISMA 2020 requires.

Step 2: Export and deduplicate

Export your results in a standard format (RIS, BibTeX, or CSV). Import them into your screening tool and run deduplication. Tools like Rayyan and ScholarDock handle deduplication automatically. If using ASReview, you may want to deduplicate beforehand using a reference manager or a tool like Systematic Review Accelerator.
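If you need to deduplicate before importing, a rough first pass can be scripted. The sketch below is one possible approach, not the method any particular tool uses: it treats two records as duplicates when they share a normalized DOI or a normalized title plus year. The field names ("title", "year", "doi") and the example records, including the DOI, are made up for illustration, and real exports usually need fuzzier matching than this.

```python
import re

def record_keys(record):
    """Return the identifiers under which a record could match a duplicate."""
    keys = []
    doi = (record.get("doi") or "").strip().lower()
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)  # strip resolver prefix
    if doi:
        keys.append(("doi", doi))
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    if title:
        keys.append(("title", title, str(record.get("year") or "")))
    return keys

def deduplicate(records):
    """Keep the first record seen for each DOI or title-plus-year key."""
    seen = set()
    unique = []
    for record in records:
        keys = record_keys(record)
        if any(key in seen for key in keys):
            continue  # matches an earlier record on DOI or on title and year
        seen.update(keys)
        unique.append(record)
    return unique

# Made-up records, as they might arrive from three different databases.
exported = [
    {"title": "Exercise Therapy for Back Pain.", "year": 2021, "doi": "10.1000/abc123"},
    {"title": "Exercise therapy for back pain", "year": 2021, "doi": ""},
    {"title": "Exercise therapy for back pain: a trial", "year": 2021,
     "doi": "https://doi.org/10.1000/ABC123"},
    {"title": "Soil microbiome diversity", "year": 2019},
]
unique = deduplicate(exported)
print(len(exported), "->", len(unique))
```

Keep the counts before and after deduplication; you will need them for the PRISMA flow diagram.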

Step 3: Define inclusion and exclusion criteria

Write explicit, unambiguous criteria before you begin screening. These should map directly to your registered protocol. Share them with all reviewers to ensure consistency.

Step 4: Screen a seed set

Begin screening records one at a time. Most active learning tools need at least one relevant and one irrelevant record to start learning. Aim to label 10–20 records as your seed set. Choose a mix of clearly relevant and clearly irrelevant papers to give the algorithm strong initial signal.

Step 5: Let the AI prioritize

Once the model has enough labels, it will reorder your queue. Continue screening from the top of the prioritized list. You should notice relevant papers appearing more frequently at the beginning and becoming rarer as you progress.

Step 6: Monitor recall and decide when to stop

Track how many consecutive irrelevant records you have seen. A common heuristic is to keep screening until you have seen 50–100 consecutive irrelevant records without finding a new relevant study. Some tools provide a real-time estimate of recall to help you decide. For high-stakes reviews, consider screening the full dataset or having a second reviewer screen a random 10% sample of the unscreened records to validate the stopping decision.
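The consecutive-irrelevant heuristic and the random sample check are simple enough to script if your tool does not track them for you. The sketch below is a minimal illustration with hypothetical function names. It assumes the validation sample is drawn from the records left unscreened when you stop, and the decision history is synthetic.

```python
import random

def stopping_point(decisions, window=50):
    """Number of records screened when `window` consecutive irrelevant
    decisions are first reached, or None if that never happens.

    decisions: True (relevant) or False (irrelevant), in screening order.
    """
    run = 0
    for position, is_relevant in enumerate(decisions, start=1):
        run = 0 if is_relevant else run + 1
        if run >= window:
            return position
    return None

def validation_sample(unscreened_ids, fraction=0.10, seed=0):
    """Reproducible random sample of unscreened records for a second reviewer."""
    ids = sorted(unscreened_ids)
    k = max(1, round(fraction * len(ids)))
    return random.Random(seed).sample(ids, k)

# Synthetic history: 5 relevant, 10 irrelevant, 1 relevant, then 60 irrelevant.
decisions = [True] * 5 + [False] * 10 + [True] + [False] * 60
stop_at = stopping_point(decisions, window=50)
sample = validation_sample(range(1000, 1200), fraction=0.10, seed=42)
print(stop_at, len(sample))
```

With a window of 50, the rule fires at record 66: the late relevant record at position 16 resets the count, which is exactly the behavior you want from a conservative rule. If the second reviewer finds a relevant study in the sample, resume screening.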

Step 7: Proceed to full-text review

Move included records to full-text assessment. This step is still overwhelmingly manual, though AI tools like ScholarDock can help by extracting key information from PDFs and highlighting relevant sections, saving time during full-text evaluation.

Step 8: Document everything

Record the AI tool used, the version, the model settings, the stopping rule, and the estimated recall in your methods section. This is essential for PRISMA compliance and reproducibility.
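One way to make sure nothing is forgotten is to save these details in a small machine-readable file alongside your screening data. The sketch below is only a suggestion: the ScreeningLog class and its field names are hypothetical and not part of PRISMA or any tool, and every value shown is a placeholder to replace with what you actually ran.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ScreeningLog:
    """Hypothetical record of the settings a methods section should report."""
    tool: str
    tool_version: str
    model_settings: dict
    stopping_rule: str
    records_total: int
    records_screened: int
    estimated_recall: Optional[float] = None  # leave None if the tool gives no estimate

log = ScreeningLog(
    tool="<tool name>",
    tool_version="<version you ran>",
    model_settings={"feature_extraction": "<as configured>", "classifier": "<as configured>"},
    stopping_rule="50 consecutive irrelevant records",
    records_total=5000,      # placeholder count
    records_screened=900,    # placeholder count
)

serialized = json.dumps(asdict(log), indent=2)
print(serialized)
```

The resulting JSON can be attached as supplementary material and copied into the methods section almost verbatim.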

How to maintain PRISMA compliance when using AI screening

The PRISMA 2020 statement provides the gold standard for reporting systematic reviews. When you incorporate AI-assisted screening, you need to transparently report it. Here is what to include:

  1. Methods section. Name the AI tool, specify the version and model used, describe the active learning approach, and explain your stopping criteria.

  2. PRISMA flow diagram. Report the total number of records identified, the number screened by AI-assisted methods, the number excluded at title-abstract stage, and the number assessed at full-text stage. If you used a stopping rule and did not screen every record, state this and report the estimated recall.

  3. Limitations. Acknowledge that AI-assisted screening introduces a probability-based element — while recall is typically very high, there is a small chance that relevant studies were in the unscreened portion of the dataset.

  4. Supplementary materials. Where possible, share the labeled dataset and model output so other researchers can reproduce or validate your screening decisions.
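Before drawing the flow diagram, it is worth checking that your counts add up, especially when a stopping rule leaves some records unscreened. The helper below is a minimal sketch: the function name and dictionary keys are illustrative labels, not official PRISMA wording, and the numbers in the example are made up.

```python
def prisma_counts(identified, duplicates_removed, screened, excluded_title_abstract):
    """Derive the remaining flow-diagram counts and check they are consistent."""
    after_dedup = identified - duplicates_removed
    not_screened = after_dedup - screened              # left unread by the stopping rule
    full_text_assessed = screened - excluded_title_abstract
    if min(after_dedup, not_screened, full_text_assessed) < 0:
        raise ValueError("counts are inconsistent; check your inputs")
    return {
        "records_identified": identified,
        "duplicates_removed": duplicates_removed,
        "records_after_deduplication": after_dedup,
        "records_screened": screened,
        "records_not_screened": not_screened,
        "excluded_at_title_abstract": excluded_title_abstract,
        "full_text_assessed": full_text_assessed,
    }

# Made-up numbers for illustration.
flow = prisma_counts(
    identified=5000,
    duplicates_removed=800,
    screened=1200,
    excluded_title_abstract=1050,
)
for label, count in flow.items():
    print(f"{label}: {count}")
```

In this example 3,000 records were never screened, which is the number readers need to see next to your stopping rule and estimated recall.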

The upcoming PRISMA-AI reporting extension, currently in development, will provide more specific guidance on reporting AI use in systematic reviews. In the meantime, full transparency about your workflow is the best practice.

Common mistakes to avoid with AI-assisted screening

Even with strong tools, researchers can undermine their screening quality with avoidable mistakes. Here are the most common pitfalls:

  • Skimping on the seed set. Starting with too few labeled examples gives the model weak signal. Always screen at least 10–20 records, including both relevant and irrelevant examples, before relying on AI prioritization.

  • Using AI screening as the sole reviewer. AI reorders papers — it does not make inclusion decisions. Every record still needs a human decision. For dual-screening reviews, both reviewers should use the AI-prioritized queue independently.

  • Stopping too early. Aggressive stopping rules save time but risk missing relevant studies. Use a conservative threshold (at least 50 consecutive irrelevant records) and validate with a random sample check.

  • Not reporting the AI workflow. Failure to document the tool, model, and stopping criteria violates PRISMA transparency requirements and makes your review non-reproducible.

  • Treating screening in isolation. Screening is one step in a multi-stage process. If your screened papers end up in a disconnected spreadsheet with no link to your project notes, annotations, or citation library, you lose efficiency downstream. Platforms like ScholarDock solve this by keeping your screened references, project tasks, and collaborative notes in one connected workspace.

How ScholarDock connects screening to your full research workflow

Most AI screening tools solve one problem — getting through the title-abstract pile faster. But systematic reviews do not end at screening. After you decide which papers to include, you still need to organize references, extract data, annotate findings, track progress across team members, and connect everything to your broader research project.

This is where ScholarDock, a research project and reference management platform, stands apart. Instead of exporting your screened papers from one tool and importing them into another, ScholarDock keeps your entire workflow in a single workspace:

  • Organize screened papers directly into project-specific reference libraries with tags, annotations, and custom metadata.

  • Collaborate with your team on screening decisions, data extraction, and manuscript writing — with full visibility into who is working on what.

  • Use AI to surface related sources you may have missed, summarize key findings from included papers, and automatically organize your references.

  • Track your review's progress from protocol registration through search, screening, extraction, and manuscript submission — all in one dashboard.

  • Build living literature reviews that update as new research is published, so your systematic review stays current beyond its initial publication date.

If your research team is tired of juggling a screening tool, a reference manager, a shared drive, and a project tracker, ScholarDock brings your entire systematic review workflow — from first search to final citation — into one connected workspace.

Key takeaways

AI-assisted screening is no longer experimental. It is a validated, practical approach that can reduce screening workload by 80–95% in published benchmarks while maintaining high recall. The key is choosing the right tool for your team's needs, following a structured workflow, and documenting everything for PRISMA compliance.

For teams that only need screening, open-source tools like ASReview offer excellent performance with full transparency. For teams that need collaboration, Rayyan and Covidence add multi-reviewer workflows. And for research teams that want screening connected to the rest of their research lifecycle — reference management, project tracking, team collaboration, and AI-powered knowledge structuring — ScholarDock is the platform that brings it all together.