Data Analysis and Visualization
AI has emerged as a capable coding partner for researchers who need to analyse data but are not professional software developers. The ability to generate, explain, and debug Python and R code has made statistical computing more accessible than at any previous point in the history of science. What has not changed — and what AI cannot substitute for — is the statistical judgment required to choose the right analysis, interpret its outputs correctly, and recognise when results are spurious. This module teaches you to leverage AI's genuine capability while keeping your expert judgment in control.
AI as a coding partner for data analysis
The dominant paradigm that has emerged among researchers using AI for data work is the coding partner model: you provide the scientific question, the data structure, and the analytical intent; AI translates that intent into executable code that you then review, test, and modify. This differs in an important way from asking AI to "analyse my data and tell me the results": the latter delegates the analytical judgment that is irreplaceable, while the former preserves it.
Both Claude and ChatGPT (with Advanced Data Analysis / Code Interpreter enabled) are effective partners for Python and R code generation. Claude is often preferred for long, complex scripts where maintaining a coherent structure across many functions matters; ChatGPT's Code Interpreter is particularly useful when you want to execute code in real time and see the output directly in the chat interface. For most routine analytical tasks, the choice between them is less important than learning to prompt effectively.
The key to effective code generation prompts is specificity about your data structure. Before asking for analysis code, describe your dataframe: the variables it contains, their types (continuous, categorical, ordinal), the unit of observation, any relevant groupings, and sample size. An AI that knows it is working with a wide-format dataframe of 847 participants across 12 repeated-measures time points in a 2x3 factorial design will generate substantially more useful code than one that receives a vague description of "my dataset."
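One way to produce that description consistently is to generate it from the dataframe itself. The following is a minimal sketch, assuming a pandas dataframe; the helper name describe_for_prompt, its parameters, and the example data are hypothetical and introduced only for illustration.

```python
import pandas as pd


def describe_for_prompt(df: pd.DataFrame, unit: str, max_levels: int = 10) -> str:
    """Build a plain-text description of a dataframe to paste into a prompt."""
    lines = [f"Dataframe with {len(df)} rows; each row is one {unit}.", "Columns:"]
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            detail = f"numeric, range {col.min()} to {col.max()}"
        else:
            # Treat everything non-numeric as categorical and list its levels.
            levels = sorted(col.dropna().astype(str).unique())
            shown = ", ".join(levels[:max_levels])
            detail = f"categorical, {len(levels)} levels: {shown}"
        lines.append(f"- {name} ({col.dtype}): {detail}; {col.isna().sum()} missing")
    return "\n".join(lines)


# Hypothetical example data, not taken from any real study.
example = pd.DataFrame(
    {
        "participant_id": ["p01", "p01", "p02", "p02"],
        "condition": ["control", "control", "treatment_a", "treatment_a"],
        "timepoint": [1, 2, 1, 2],
        "score": [3.5, 4.0, 2.5, None],
    }
)
print(describe_for_prompt(example, unit="participant at one timepoint"))
```

The output still needs a human pass: the helper cannot know whether an integer column is ordinal or continuous, or what the experimental design is, so add those details by hand before sending the prompt.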
Prompt patterns that work
Effective prompting for data analysis code follows a consistent pattern. Start by describing the data, then the question, then any specific requirements for the output:
"I have a pandas dataframe called df with columns: participant_id (string), condition (categorical: 'control', 'treatment_a', 'treatment_b'), timepoint (integer 1–4), score (continuous float). I want to run a mixed ANOVA with condition as a between-subjects factor and timepoint as a within-subjects factor. Please write Python code using pingouin that will run this analysis and print the results table, then test the assumption of sphericity and apply the Greenhouse-Geisser correction if it's violated."
This level of specificity produces code that is immediately useful rather than generic, and it surfaces assumptions in the AI's response that you can then verify. Contrast this with "how do I run a mixed ANOVA in Python?" — which produces a generic tutorial that may not match your data structure at all.
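One assumption worth verifying yourself, whatever code the AI returns, is that the data really match the design described in the prompt. The sketch below assumes long-format data with the column names used in the example prompt; check_mixed_design is a hypothetical helper, not part of pingouin or pandas.

```python
import pandas as pd


def check_mixed_design(df, subject, between, within):
    """Return a list of problems that would undermine a mixed ANOVA on df."""
    problems = []

    # Each subject should belong to exactly one between-subjects group.
    groups_per_subject = df.groupby(subject)[between].nunique()
    switched = groups_per_subject[groups_per_subject > 1].index.tolist()
    if switched:
        problems.append(f"subjects in more than one {between}: {switched}")

    # Each subject should have exactly one row per within-subjects level.
    expected = set(df[within].dropna().tolist())
    for subj, rows in df.groupby(subject):
        seen = rows[within].tolist()
        missing = sorted(expected - set(seen))
        if missing:
            problems.append(f"{subj} is missing {within} levels: {missing}")
        if len(seen) != len(set(seen)):
            problems.append(f"{subj} has duplicate {within} rows")
    return problems


# Hypothetical long-format data: p2 has no row for timepoint 2.
example = pd.DataFrame(
    {
        "participant_id": ["p1", "p1", "p2"],
        "condition": ["control", "control", "treatment_a"],
        "timepoint": [1, 2, 1],
        "score": [3.5, 4.0, 2.5],
    }
)
problems = check_mixed_design(example, "participant_id", "condition", "timepoint")
print(problems or "design looks consistent")
```

An empty list does not mean the analysis is appropriate; it only means the data layout is consistent with the design you described.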
Generating visualization code
Data visualization is one of the highest-value applications of AI in research workflows, because the cognitive overhead of matplotlib customization in particular is disproportionate to the intellectual content of the task. Writing the code to produce a correctly formatted, publication-ready figure in matplotlib is tedious and error-prone. It is exactly the kind of task that benefits from AI assistance, because the output is immediately verifiable: you can see at a glance whether the figure is right.
Matplotlib and seaborn for Python users
Matplotlib is the foundational visualization library and is required for any fine-grained control over figure appearance. The challenge is its verbose, low-level API: producing a properly formatted figure with custom tick labels, appropriate font sizes, and a clean legend requires many lines of boilerplate code, which AI generates well. Ask AI to produce a figure with your exact specifications (axes labels, color scheme, font family, figure size in inches at a specific DPI) and you will typically get code that needs only minor tweaks to produce the figure you want, rather than having to start from scratch.
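To give a sense of the boilerplate involved, here is a minimal sketch of a single-panel figure with explicit size, DPI, fonts, tick labels, and legend. The data values, colors, and dimensions are hypothetical placeholders, not recommendations from any journal.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical group means and standard errors, for illustration only.
labels = ["Control", "Treatment A", "Treatment B"]
means = [4.1, 5.3, 4.8]
errors = [0.3, 0.4, 0.35]
positions = list(range(len(labels)))

plt.rcParams.update({"font.family": "sans-serif", "font.size": 9})

fig, ax = plt.subplots(figsize=(3.5, 2.5), dpi=300)  # figure size is in inches
ax.bar(positions, means, yerr=errors, capsize=3, color="#4C72B0", label="Mean ± SE")
ax.set_xlabel("Condition")
ax.set_ylabel("Score")
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.spines["top"].set_visible(False)    # remove the box around the plot area
ax.spines["right"].set_visible(False)
ax.legend(frameon=False)
fig.tight_layout()
```

Calling fig.savefig with a .pdf or .png filename then writes the figure to disk. Every line here is something you can check visually, which is what makes this kind of code low-risk to delegate.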
Seaborn builds on matplotlib with higher-level functions for statistical visualization. Its default aesthetics are considerably better than matplotlib's defaults for most research purposes, and its API for grouped plots (stripplots, boxplots with overlaid points, violin plots) is substantially simpler. AI prompts for seaborn visualization tend to produce more immediately usable results with fewer iterations, particularly for visualizing distributions and group comparisons. A useful prompt structure: "Using seaborn, create a figure with two panels side by side. Left panel: a violin plot of [variable] by [group], with individual data points overlaid using stripplot. Right panel: a heatmap of the correlation matrix of [variables]. Use the 'ticks' style and a colorblind-friendly palette."
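The code such a prompt tends to produce looks roughly like the sketch below. It runs on synthetic data with hypothetical column names, and it assumes seaborn 0.11 or later (for set_theme); exact colors and defaults vary between seaborn versions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data standing in for a real study; column names are hypothetical.
rng = np.random.default_rng(seed=0)
n = 90
df = pd.DataFrame(
    {
        "group": np.repeat(["control", "treatment_a", "treatment_b"], n // 3),
        "score": rng.normal(loc=5.0, scale=1.0, size=n),
        "age": rng.normal(loc=35.0, scale=8.0, size=n),
        "reaction_time": rng.normal(loc=450.0, scale=60.0, size=n),
    }
)

sns.set_theme(style="ticks", palette="colorblind")
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(9, 4))

# Left panel: distribution per group, with individual observations on top.
sns.violinplot(data=df, x="group", y="score", inner=None, ax=ax_left)
sns.stripplot(data=df, x="group", y="score", color="black", size=2.5, alpha=0.6, ax=ax_left)

# Right panel: correlation matrix of the numeric variables.
corr = df[["score", "age", "reaction_time"]].corr()
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="vlag", square=True, ax=ax_right)

sns.despine(ax=ax_left)
fig.tight_layout()
```

Fixing vmin and vmax at -1 and 1 keeps the heatmap's color scale comparable across datasets; without them, the scale adapts to whatever correlations happen to be present.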
ggplot2 for R users
R users working with ggplot2 have a particularly good experience with AI code generation because ggplot2's grammar-of-graphics structure makes prompts map naturally to code layers. Describe your desired visualization using ggplot2 layer language — geom_point, geom_smooth, facet_wrap, scale_color_brewer — and AI will generate code that is easy to read and modify. A useful pattern: ask AI to generate the basic plot, then ask it to add layers progressively ("now add a facet by condition," "add a custom theme that matches journal specifications for the Journal of Experimental Psychology").
Exploratory data analysis
Exploratory data analysis (EDA) is a natural fit for AI assistance. A complete EDA prompt: "Given a dataframe df with the columns listed above, write Python code that: (1) prints the shape, dtypes, and missing value counts; (2) plots a histogram for each continuous variable; (3) plots a correlation heatmap of all numeric variables; (4) for each categorical variable, plots a bar chart of value counts; (5) plots boxplots of each continuous variable broken down by [key grouping variable]. Arrange all figures in a grid with seaborn and save as a single PDF." This produces a complete initial exploration in one code block.
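A response to that kind of prompt might resemble the following sketch. Note two deliberate simplifications: it uses pandas and matplotlib directly rather than seaborn, and it writes one page per variable to a multi-page PDF instead of a single grid. The function name quick_eda and its parameters are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages


def quick_eda(df: pd.DataFrame, group_col: str, pdf_path: str) -> pd.DataFrame:
    """Print a structural summary, write exploratory plots to one PDF, return the summary."""
    summary = pd.DataFrame({"dtype": df.dtypes.astype(str), "n_missing": df.isna().sum()})
    print("shape:", df.shape)
    print(summary)

    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in df.columns if c not in numeric]

    with PdfPages(pdf_path) as pdf:
        for col in numeric:
            # Histogram and grouped boxplot for each continuous variable.
            fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
            ax_hist.hist(df[col].dropna(), bins=20)
            ax_hist.set_title(f"{col}: distribution")
            df.boxplot(column=col, by=group_col, ax=ax_box)
            ax_box.set_title(f"{col} by {group_col}")
            fig.suptitle("")  # drop the automatic pandas super-title
            fig.tight_layout()
            pdf.savefig(fig)
            plt.close(fig)

        for col in categorical:
            # Bar chart of value counts for each categorical variable.
            fig, ax = plt.subplots(figsize=(4, 3))
            df[col].value_counts().plot.bar(ax=ax, title=f"{col}: value counts")
            fig.tight_layout()
            pdf.savefig(fig)
            plt.close(fig)

        if len(numeric) > 1:
            # Correlation heatmap of all numeric variables.
            fig, ax = plt.subplots(figsize=(4, 3.5))
            image = ax.imshow(df[numeric].corr().to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
            ax.set_xticks(range(len(numeric)))
            ax.set_xticklabels(numeric, rotation=45, ha="right")
            ax.set_yticks(range(len(numeric)))
            ax.set_yticklabels(numeric)
            fig.colorbar(image, ax=ax)
            fig.tight_layout()
            pdf.savefig(fig)
            plt.close(fig)
    return summary
```

A call such as quick_eda(df, group_col="condition", pdf_path="eda_report.pdf") would write the report next to your script. Identifier columns like participant_id are treated as categorical here, so you may want to drop them before calling the function.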
Using AI to explain statistical outputs
One of the genuinely undervalued uses of AI in data analysis is explaining statistical output in plain language. Researchers whose primary expertise is not statistics — biologists, clinical researchers, psychologists running their first multilevel models — sometimes produce correct output from correctly specified models but are uncertain about how to interpret the specific numbers in the results table. AI can bridge this gap, but with an important caveat: the explanation is only as good as the output you provide, and AI can generate confident but wrong interpretations if the question is ambiguous or the output is unusual.
The most reliable use of AI for interpretation is asking it to translate standard output — a regression table, an ANOVA table, a model summary from lme4 — into plain language that describes the findings without technical jargon. Paste the output into your conversation with a description of what the model was testing, and ask: "Explain what this output tells me about the relationship between [predictor] and [outcome], in language suitable for a results section aimed at a general academic audience." The result will typically be a reasonable draft of plain-language interpretation that you then verify against your own understanding of the analysis.
The risk of confident but wrong AI output is not theoretical; it is a documented, recurring problem. AI systems make specific categories of statistical errors: choosing the wrong test for data distributions, ignoring assumption violations, misinterpreting interaction effects, confusing statistical significance with practical significance, and generating incorrect code for edge cases. A model specification that looks syntactically correct may be statistically wrong for your data structure. Never submit AI-generated analyses without manually checking that the code actually does what you intended, that assumptions are met, and that the results are plausible given your understanding of the data.
The most dangerous scenario: AI generates analysis code, you run it, the output looks reasonable, and you report results that are in fact the product of a miscoded model. This has happened in published research and has required corrections and retractions. Validation is not optional.
Data cleaning prompts
Data cleaning is tedious, repetitive, and highly amenable to AI code generation. Common tasks — recoding variables, handling missing values, reshaping from wide to long format, merging dataframes, standardizing string columns — all follow predictable patterns that AI generates reliably. The key is describing your data structure precisely and specifying exactly what transformation you need.
Effective data cleaning prompts describe the before and after states: "I have a column 'age' that contains values like '25 years', '30', 'twenty-eight', and NaN. I want to convert it to an integer column with NaN for any values that cannot be cleanly converted to a number. Write Python code using pandas to do this and print a count of the values before and after."
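For the age example, a reasonable response could look like this sketch. It assumes that only plain digits, optionally followed by "year" or "years", count as cleanly convertible; spelled-out numbers such as 'twenty-eight' become missing rather than being guessed at. The function name clean_age is hypothetical.

```python
import numpy as np
import pandas as pd


def clean_age(series: pd.Series) -> pd.Series:
    """Convert free-text ages to nullable integers; unparseable entries become <NA>."""
    text = series.astype(str).str.strip()
    # Accept plain digits, optionally followed by "year" or "years"; reject everything else.
    digits = text.str.extract(r"^(\d+)(?:\s*years?)?$", expand=False)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


raw = pd.Series(["25 years", "30", "twenty-eight", np.nan], name="age")
cleaned = clean_age(raw)

print("non-missing before:", raw.notna().sum())
print("non-missing after: ", cleaned.notna().sum())
# List the entries that were present but could not be converted, so they can be reviewed.
lost = raw[raw.notna() & cleaned.isna()].tolist()
print("values lost in conversion:", lost)
```

Printing the values lost in conversion matters more than the counts: it shows you exactly which entries the rule rejected, so you can decide whether to fix them by hand or widen the pattern.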
AI is particularly useful for handling the idiosyncratic problems that appear in real research data: inconsistent date formats across entries from different data collectors, categorical variables with slight spelling variations, merge keys that match on most but not all records. These problems have no general solution that works across all datasets — they require case-by-case code — and AI is well-suited to generating that code from a precise description of the problem.
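For the merge-key problem in particular, a useful first step is to find out which records fail to match before deciding how to fix them. The sketch below uses the indicator option of pandas merge; the tables, column names, and the audit_merge helper are hypothetical.

```python
import pandas as pd


def audit_merge(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Outer-merge two tables and return the rows whose key matched on one side only."""
    merged = left.merge(right, on=key, how="outer", indicator=True)
    print(merged["_merge"].value_counts())
    return merged[merged["_merge"] != "both"]


# Hypothetical tables: one survey key carries a stray trailing space.
survey = pd.DataFrame({"participant_id": ["P01", "P02", "P03 "], "score": [4.0, 5.5, 3.0]})
demographics = pd.DataFrame({"participant_id": ["P01", "P02", "P03"], "age": [25, 31, 40]})

unmatched = audit_merge(survey, demographics, key="participant_id")
print(unmatched)

# Normalise the key, then audit again to confirm that every record now matches.
survey["participant_id"] = survey["participant_id"].str.strip()
still_unmatched = audit_merge(survey, demographics, key="participant_id")
print("unmatched rows after cleaning:", len(still_unmatched))
```

Inspecting the unmatched rows is what turns a vague "some records did not merge" into the precise problem description that AI needs to generate the right fix.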
When to trust AI code vs. verify manually
Not all statistical code carries equal risk of error, and calibrating your verification effort appropriately is a professional skill. A heuristic framework:
No-code AI analysis tools
Not all researchers want to write code, and a growing ecosystem of no-code AI analysis tools has emerged for quantitative research. Two platforms worth knowing:
Julius AI (julius.ai) allows you to upload a dataset and ask analytical questions in natural language: "What are the correlations between all numeric variables?", "Show me a regression predicting Y from X1, X2, and X3", "Are there significant group differences between conditions A and B on the primary outcome?" It generates both the analysis and a visualization, and shows the code it used, which you can inspect. It is not a substitute for statistical expertise, but for researchers doing routine analyses who want to move quickly, it reduces the gap between question and output substantially.
DataChat (datachat.ai) offers a similar natural language interface for data analysis with a stronger focus on collaborative team workflows and auditability of the analytical path. Its design philosophy emphasizes showing users exactly what operations were performed on the data, which aligns well with reproducibility requirements.
Reproducibility requirements when using AI
Reproducibility is a foundational norm of science, and AI-assisted analysis introduces new reproducibility challenges that the field is still working through. The central issue is that AI-generated code may rely on specific library versions, random seeds, or default behaviors that change across versions — meaning that rerunning the same AI-generated code in a different environment may produce different results.
The professional standard that is emerging: treat AI-generated analysis code the same way you would treat any analysis code — document it, version-control it, specify your computing environment (language version, package versions), and include it in your supplementary materials when you publish. The fact that AI generated the initial version of the code does not change your obligation to understand it, document it, and make it available for inspection.
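Recording the computing environment can itself be scripted. The following is a minimal sketch using only the Python standard library; the package list and seed value are illustrative assumptions, and snapshot_environment is a hypothetical helper name.

```python
import json
import platform
import random
import sys
from importlib import metadata


def snapshot_environment(packages, seed):
    """Collect interpreter, package versions, and the random seed for the analysis log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "seed": seed,
    }


SEED = 20240101  # arbitrary illustrative value
random.seed(SEED)  # seeds only the standard-library generator

snapshot = snapshot_environment(["numpy", "pandas", "pingouin"], seed=SEED)
print(json.dumps(snapshot, indent=2))
```

Libraries such as NumPy keep their own random state, so seed those separately and record the same value. Saving the printed JSON alongside your analysis script gives reviewers what they need to rebuild the environment.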
Additionally, disclose AI tool use in your methods section. A brief statement such as "Initial data cleaning and visualization scripts were generated using [tool name] and subsequently reviewed and validated by the authors" is accurate, informative, and increasingly expected. The scientific community's trust in your analysis depends on transparency about your process, not just your results.
The most defensible workflow for AI-assisted data analysis:
(1) Describe your data structure and analytical question to AI in precise terms.
(2) Review the generated code before running it: does the model specification match your intended design? Does the code test assumptions?
(3) Run the code on a subset of your data first and verify that the output looks plausible.
(4) Check assumption tests embedded in the output: are any violated? Does the code address violations?
(5) Cross-validate critical results using an independent implementation in a different library or language.
(6) Document the code, the AI tool used, and your validation steps.
(7) Make the code available with your data upon publication.
This workflow takes longer than accepting AI output uncritically but produces results you can defend.
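Step (5) can be illustrated on a small scale. The sketch below checks a least-squares slope and intercept from NumPy against the textbook closed-form formulas written in plain Python. The data are hypothetical, and a hand calculation stands in here for the "different library or language" a real cross-check would use.

```python
import math

import numpy as np

# Hypothetical paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Implementation 1: library routine (degree-1 least-squares fit).
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# Implementation 2: closed-form formulas, written without NumPy.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope_manual = sxy / sxx
intercept_manual = mean_y - slope_manual * mean_x

# The two routes should agree to within floating-point error.
agree = math.isclose(slope_np, slope_manual, rel_tol=1e-9) and math.isclose(
    intercept_np, intercept_manual, rel_tol=1e-9, abs_tol=1e-9
)
print(f"numpy:  slope={slope_np:.6f}, intercept={intercept_np:.6f}")
print(f"manual: slope={slope_manual:.6f}, intercept={intercept_manual:.6f}")
print("implementations agree:", agree)
```

Agreement between two implementations rules out coding mistakes in one of them; it does not tell you the model was the right choice, which remains a matter of statistical judgment.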