AI as a Research Partner
The scientific enterprise is being restructured around a new set of tools. Researchers who understand what AI can and cannot do — and who approach it as a collaborator rather than an oracle — are finding meaningful acceleration in their work. Those who approach it uncritically are generating new categories of error. This module maps the territory honestly: the landscape that has changed, the capabilities that are genuinely useful, and the limits that are genuinely dangerous.
The researcher's new landscape
Something significant happened across disciplines around 2022 and 2023. Researchers at bench and desk began widely adopting large language models not as curiosities but as everyday tools integrated into their workflows. The shift was not driven by a single breakthrough but by a threshold crossing: the models became good enough at enough tasks that the friction of integration became worthwhile. A biologist drafting a grant introduction, a political scientist synthesizing a literature, a chemist explaining a reaction mechanism to a collaborator in another discipline — all of these found something genuinely useful in the new tools.
The adoption, however, outpaced understanding. Surveys of graduate students at research universities show adoption rates for AI writing assistance exceeding 70%, while the same surveys reveal substantial misunderstanding of how these tools work, what they know, and, critically, what they can fabricate. The gap between adoption and understanding is where the most consequential errors live. This course exists to close that gap for researchers working at the level where errors matter most.
The landscape change is also institutional. Major journals, including Nature, Science, Cell, PLOS ONE, and hundreds of others, have issued AI disclosure policies. Funding agencies including NIH and NSF have issued guidance on the use of generative AI in grant applications and peer review. Universities are updating their academic integrity policies. The infrastructure around research is adapting to the reality that AI is now embedded in the scientific process, whether or not any given institution has formally acknowledged it.
As of 2024, most major journals require disclosure of AI tool use in manuscript preparation. The practical question is not whether to disclose, but how to document AI use accurately and completely. Developing a personal documentation practice — noting which tools you used, for what tasks, and how you verified the outputs — is now a professional requirement, not a nicety.
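A documentation practice like this can be as light as a structured log. The sketch below is one possible shape, not a prescribed format: the `AIUseRecord` fields, the `tasks_by_tool` helper, and the sample entries are all hypothetical and chosen only to mirror the three things named above (tool, task, verification).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AIUseRecord:
    """One entry in a personal AI-use log (field names are illustrative)."""
    used_on: str        # ISO date of the session
    tool: str           # which assistant or service was used
    task: str           # what it was used for
    verification: str   # how the output was checked


def tasks_by_tool(records):
    """Group logged tasks under each tool, de-duplicated and sorted."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.tool, set()).add(record.task)
    return {tool: sorted(tasks) for tool, tasks in grouped.items()}


# Hypothetical entries, invented for illustration only.
log = [
    AIUseRecord("2024-03-04", "Claude", "clarity feedback on introduction",
                "reviewed each suggested edit by hand"),
    AIUseRecord("2024-03-11", "ChatGPT", "debugging a plotting script",
                "re-ran the script and compared figures to raw data"),
    AIUseRecord("2024-03-12", "Claude", "simulated reviewer objections",
                "checked each objection against the manuscript"),
]

summary = tasks_by_tool(log)
for tool, tasks in summary.items():
    print(f"{tool}: {'; '.join(tasks)}")
```

A per-tool summary of this kind is a convenient starting point when a journal or funder asks for a disclosure statement; the verification notes stay in the log for your own records.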
What AI can and cannot do in research contexts
The confusion about AI in research stems largely from conflating very different capabilities under a single umbrella label. "AI" in research contexts encompasses tools with radically different strengths: large language models trained on text corpora, specialized retrieval systems that search academic databases, code interpreters that execute analytical scripts, and multimodal systems that reason about figures and data tables. Understanding which tool to reach for requires understanding what each one actually does.
Large language models like Claude, ChatGPT (GPT-4), and Gemini are extraordinarily capable at tasks involving language: explaining concepts clearly, restructuring arguments, improving prose clarity, identifying logical gaps in reasoning, generating code, and synthesizing ideas. They have broad knowledge of established scientific concepts up to their training cutoff. They are genuinely useful for understanding a field you are entering, getting an explanation of a technique you haven't encountered before, or getting feedback on the clarity of your writing.
What these models cannot reliably do, and where the most dangerous errors live, is accurately report specific empirical facts, recent findings, or specific citations. The mechanism matters here: language models generate text by predicting which token is likely to come next, given the preceding text and the patterns learned in training. They are not retrieving stored facts from a database; they are producing fluent text that sounds like it should be accurate. When asked for a specific citation, a language model will often produce author names, journal names, and titles that sound plausible but are entirely fabricated. This phenomenon, called hallucination, is not a bug awaiting a simple patch; it is a fundamental property of how these systems generate text.
Never cite a source that you have not personally verified exists. Researchers across every discipline have submitted manuscripts, grant applications, and conference papers citing AI-generated references that do not exist. When reviewers or editors check these citations — and they do — the consequences range from manuscript rejection to formal misconduct investigations. No AI tool that generates text, including the most advanced models available today, is a reliable source of specific citations. Verify every reference independently.
The safer workflow: use AI to identify topics and authors to search for, then perform the actual search through Google Scholar, PubMed, Semantic Scholar, or your institution's library database. Only cite what you have read.
The collaboration mindset vs. the delegation mindset
The most important conceptual distinction for researchers using AI is between collaboration and delegation. Delegation means handing a task to AI and accepting its output: asking ChatGPT to write your literature review, asking it to identify the key papers in a field, asking it to analyze your data and tell you the results. Delegation treats AI as an autonomous agent whose outputs can be trusted with minimal verification. For research work, this mindset produces errors at scale.
Collaboration means engaging with AI as a tool that extends your own capacity, but where your judgment, domain expertise, and critical evaluation remain the final authority on every output. You use AI to help you think through a problem — to generate options you might not have considered, to identify weaknesses in an argument, to draft a section that you then revise extensively based on your own expert knowledge. The outputs are starting points, not endpoints. Errors are caught because you, the domain expert, evaluate everything.
The collaboration mindset is more cognitively demanding than delegation. It requires maintaining engagement rather than offloading work. But it is the only mindset compatible with research integrity, because research outputs carry your name and reputation, and the standard of accuracy in research — especially empirical claims and citations — is substantially higher than AI tools can reliably meet without expert human oversight.
Think of AI as a very well-read but unreliable graduate student: someone who can discuss the literature fluently but who sometimes confidently invents facts when their knowledge has gaps. You would never submit their draft without careful review. You would absolutely use their facility with the literature to explore ideas and get a first draft on the page. That calibration (use the capability, verify the specifics) is the professional workflow.
Calibrated expectations
Calibrated expectations require understanding not just what AI gets wrong, but the pattern of its errors. AI errors in research contexts cluster in predictable ways, and knowing the patterns allows you to apply scrutiny precisely where it is needed.
Specific empirical claims are the highest-risk category. Statements like "a 2022 meta-analysis found effect sizes ranging from 0.4 to 0.7" or "the largest study to date enrolled 47,000 participants" are exactly the kind that LLMs will generate fluently and that may be entirely fabricated. Any specific number, date, sample size, effect size, or percentage in AI-generated text should be treated as unverified until you have located the source yourself.
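One way to apply that rule mechanically is to flag every sentence in an AI-generated draft that contains a number, so each can be traced to a source before it survives into your own text. The following is a rough heuristic sketch, not a fact-checker: the `flag_unverified` function and the regular expressions are assumptions made for illustration, and they will miss claims stated in words ("nearly half") and split imperfectly on abbreviations.

```python
import re

# Integers, thousands-separated numbers, decimals, and percentages.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")

# Naive sentence boundary: terminal punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def flag_unverified(text):
    """Return (sentence, numbers) pairs for sentences containing any number."""
    flagged = []
    for sentence in SENTENCE_END.split(text.strip()):
        numbers = NUMBER.findall(sentence)
        if numbers:
            flagged.append((sentence, numbers))
    return flagged


# Sample draft built from the example statements in the text above.
draft = (
    "A 2022 meta-analysis found effect sizes ranging from 0.4 to 0.7. "
    "The method is widely used. "
    "The largest study to date enrolled 47,000 participants."
)

flags = flag_unverified(draft)
for sentence, numbers in flags:
    print(f"VERIFY {numbers}: {sentence}")
```

The output is a to-do list, nothing more: a flagged sentence is not necessarily wrong, and an unflagged one is not necessarily right.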
Recent developments are the second highest-risk category. LLMs have training cutoffs, often several months to more than a year before their public release, and the models remain in use long after that. Fields that move quickly, such as machine learning, genomics, and clinical trials, may look very different from what a model knows. AI is an unreliable narrator about recent science.
Interdisciplinary connections are generally safer. AI is often quite good at drawing conceptual connections across fields, suggesting analogies, and explaining how a method used in one field might apply to another. These are generative, exploratory suggestions that you will evaluate yourself — so the lower reliability of specific facts matters less.
Writing, structure, and clarity are the safest categories. Using AI to improve the clarity of a paragraph you have already written, to suggest an alternative structure for an argument, or to identify jargon that needs defining for a general audience — these tasks leverage what LLMs are genuinely good at, with minimal risk of propagating errors into your scientific record.
How top researchers are actually using AI today
Across disciplines, a set of high-value AI use patterns has emerged among researchers who are both enthusiastic adopters and rigorous scientists. These workflows share a common structure: AI is used to extend capacity in well-defined tasks where human verification is tractable, rather than to replace judgment in tasks where the stakes of error are high.
The research AI toolkit
The landscape of AI tools for research has differentiated significantly. Different tools are optimized for different tasks, and using the right tool for each task matters considerably. Here is an honest assessment of the major tools as of 2024-2025, organized by primary use case.
Literature and evidence retrieval
Elicit (elicit.org) is purpose-built for research synthesis. Unlike general LLMs, Elicit actually searches the academic literature, primarily sourced from Semantic Scholar's database of over 200 million papers. It extracts structured information from papers: population, methods, outcomes, sample sizes. For systematic review-adjacent work, it is substantially more reliable than asking a general LLM about the literature, because it retrieves actual papers and extracts from their actual text. It has limitations (coverage of non-English literature and very recent papers is imperfect), but it is among the more trustworthy AI tools for literature questions.
Consensus (consensus.app) focuses specifically on scientific consensus questions: "Does X intervention work?" or "What does the evidence say about Y?" It searches peer-reviewed literature and attempts to surface the direction of evidence across multiple studies. Best used for well-studied questions where substantial literature exists.
Semantic Scholar (semanticscholar.org) is an AI-powered academic search engine developed by the Allen Institute for AI. Its AI features include citation context analysis (why papers cite each other) and research recommendation. The underlying database is one of the largest and most comprehensive available for free.
Perplexity (perplexity.ai) is a general-purpose AI search assistant that cites its sources inline. It is faster and more transparent than asking ChatGPT about a topic, because you can see exactly where claims come from. Its answers still require verification, but the inline citations give you a place to start.
General-purpose AI assistants
Claude (Anthropic) is consistently rated highly for long-context tasks — reading and analyzing entire papers, synthesizing multiple documents, careful reasoning about complex arguments. Its extended context window (up to 200,000 tokens in some versions) makes it well-suited for tasks that require holding large amounts of text in context simultaneously.
ChatGPT (OpenAI, GPT-4 and later) is the most widely used and has the broadest ecosystem of plugins and integrations. The Code Interpreter (now Advanced Data Analysis) feature can execute Python code, making it useful for data analysis tasks where you want to see the computation run in real time.
| Tool | Best for | Key limitation |
|---|---|---|
| Elicit | Systematic literature synthesis, extracting data from papers | Coverage limited; recent papers may be missing |
| Consensus | Directional evidence questions on well-studied topics | Struggles with contested or emerging fields |
| Semantic Scholar | Citation network exploration, finding related papers | AI features still developing; search, not synthesis |
| Perplexity | Quick exploratory questions with cited sources | Source quality varies; still requires verification |
| Claude | Long documents, careful reasoning, writing feedback | Training cutoff; cannot browse live academic databases |
| ChatGPT | Code generation, data analysis, broad tasks | Citation hallucination; training cutoff |
Developing an AI research workflow
The researchers who get the most value from AI tools are those who have developed explicit, documented workflows — not ad hoc approaches where they reach for AI when they feel like it. An explicit workflow has three advantages: it allows you to identify where AI is and isn't adding value for your specific work; it creates documentation of AI use that satisfies disclosure requirements; and it prevents the gradual creep toward over-reliance that undermines the critical judgment that makes the workflow work.
A basic research workflow might designate specific AI-appropriate tasks at each stage of a project: orientation (using AI to get oriented in a literature before doing structured searches), drafting (using AI to produce initial drafts of sections for which you already have the evidence), analysis (using AI to write and debug code for analysis you then validate), and review preparation (using AI to simulate reviewer objections before submission). Within each designated task, document which tool you used and how you verified outputs.
The workflow should also designate AI-inappropriate tasks where AI is explicitly not part of the process: identifying which papers to cite (you do this yourself, through literature search), drawing empirical conclusions (you draw these from your own reading of primary sources), and making interpretive claims about the meaning of your findings (this requires your own expert judgment and cannot be outsourced).
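Writing the designations down makes them harder to drift away from. The sketch below encodes the AI-appropriate and AI-inappropriate tasks described above as a simple lookup. The `Policy` enum, the `WORKFLOW` table, the task labels, and the choice to treat unlisted tasks as human-only are all assumptions made for illustration; adapt them to your own project stages.

```python
from enum import Enum


class Policy(Enum):
    AI_ASSISTED = "AI-assisted, output verified by the researcher"
    HUMAN_ONLY = "no AI involvement"


# Task labels paraphrase the stages described in the text above.
WORKFLOW = {
    "orientation": Policy.AI_ASSISTED,
    "drafting": Policy.AI_ASSISTED,
    "analysis code": Policy.AI_ASSISTED,
    "review preparation": Policy.AI_ASSISTED,
    "citation selection": Policy.HUMAN_ONLY,
    "empirical conclusions": Policy.HUMAN_ONLY,
    "interpretive claims": Policy.HUMAN_ONLY,
}


def policy_for(task):
    """Look up a task's policy; unlisted tasks default to human-only.

    The conservative default is a design assumption of this sketch.
    """
    return WORKFLOW.get(task, Policy.HUMAN_ONLY)


for task in ("drafting", "citation selection", "figure captions"):
    print(f"{task}: {policy_for(task).value}")
```

Defaulting unknown tasks to human-only means a new kind of task has to be added to the table deliberately before AI becomes part of it, which is one guard against gradual over-reliance.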
The researchers who are most effectively using AI are those who treat it as a tool that extends their reach without replacing their judgment. They use AI heavily in low-stakes, high-volume tasks (drafting, orienting, code generation) and use it minimally or not at all in high-stakes, hard-to-verify tasks (empirical claims, citation selection). They verify everything that will appear in a final document. Their work is faster and the quality of their thinking is arguably sharper, because having a capable interlocutor surfaces questions they might not otherwise have considered. That productive relationship depends on maintaining the expert judgment that lets you evaluate the AI's outputs critically.