Structuring Research Data for AI Analysis: A Practical Guide
Learn how to organize research trial data effectively for AI-assisted analysis. A comprehensive guide covering database principles, data formats, and documentation strategies.
In the era of AI-assisted data analysis, the way we structure our research data has never been more important. Proper data organization is the foundation that enables powerful AI tools like Claude, ChatGPT, and specialized statistical packages to help you extract meaningful insights from your trials.
This guide will show you how to structure your research data using database principles, even if you're working in Excel. By following these practices, you'll make your data more accessible, reproducible, and ready for AI-assisted analysis.
Outline
This guide is organized around three pillars:
- Structured data ➡️ tabular data such as CSVs or carefully constructed Excel tabs (one per data type)
- Metadata ➡️ a description of what each CSV or tab contains (column names, units, data types, descriptions, etc.)
- Markdown files ➡️ at minimum, markdown files documenting the data, the experimental design and data collection, and the statistical analysis
Why Data Structure Matters for AI Analysis
Modern AI tools are incredibly powerful, but they rely on well-structured input data to provide accurate analysis. When your data is properly organized:
- AI can understand your data structure without extensive manual explanation
- Analysis is reproducible and can be verified by others
- Errors are easier to catch because patterns are consistent
- Collaboration is seamless across team members and institutions
- Future analysis is simpler when you revisit the data months or years later
The key insight: Think of Excel as a database, not a notebook.
Core Principle: Database-First Thinking
What Does This Mean?
Instead of treating Excel like a paper notebook where you might add notes, merged cells, colors for meaning, and multiple types of data mixed together, treat each sheet (or CSV file) as a database table with these characteristics:
- One row = one observation
- One column = one variable
- Every cell contains only one type of data
- Column names are descriptive and consistent
- No merged cells, no color-coded data, no formulas in data cells
The Problem with "Notebook-Style" Data
Here's what we often see in research data (and what to avoid):
❌ BAD EXAMPLE - Don't do this:
| Pig Study - Fall 2024 | | | |
| Treatment A | Rep 1 | Rep 2 | Rep 3 |
| Week 1 wt | 25.3 | 26.1 | 24.8 |
| Week 2 wt | 31.2 | 32.5 | 30.9 |
| Treatment B | Rep 1 | Rep 2 | Rep 3 |
| Week 1 wt | 24.9 | 25.5 | 25.2 |
Why is this bad?
- Merged cells (the title)
- Multiple data types mixed together (treatments, weeks, measurements)
- Wide format makes it hard to add new weeks or treatments
- No clear animal IDs
- Can't easily filter or query this data
- AI tools struggle to parse this structure
The Database Approach
Here's the same data structured properly:
✅ GOOD EXAMPLE - Do this:
animal_id | treatment | replicate | week | weight_kg | measurement_date
----------|-----------|-----------|------|-----------|------------------
A001 | A | 1 | 1 | 25.3 | 2024-09-15
A002 | A | 2 | 1 | 26.1 | 2024-09-15
A003 | A | 3 | 1 | 24.8 | 2024-09-15
A001 | A | 1 | 2 | 31.2 | 2024-09-22
A002 | A | 2 | 2 | 32.5 | 2024-09-22
A003 | A | 3 | 2 | 30.9 | 2024-09-22
B001 | B | 1 | 1 | 24.9 | 2024-09-15
B002 | B | 2 | 1 | 25.5 | 2024-09-15
B003 | B | 3 | 1 | 25.2 | 2024-09-15
Why is this better?
- Each row is one measurement
- Each column is one variable
- Easy to add new animals, weeks, or measurements
- Can filter, sort, and query easily
- AI tools can immediately understand the structure
- Ready for statistical analysis
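To see the payoff concretely, here is a minimal Python sketch (plain dicts standing in for CSV rows; the values mirror the table above) showing how easily long-format data can be filtered:

```python
# Long-format rows: one dict per measurement, mirroring the table above
rows = [
    {"animal_id": "A001", "treatment": "A", "week": 1, "weight_kg": 25.3},
    {"animal_id": "A002", "treatment": "A", "week": 1, "weight_kg": 26.1},
    {"animal_id": "A001", "treatment": "A", "week": 2, "weight_kg": 31.2},
    {"animal_id": "B001", "treatment": "B", "week": 1, "weight_kg": 24.9},
]

# Filter: all week-1 weights for treatment A
week1_a = [r["weight_kg"] for r in rows
           if r["treatment"] == "A" and r["week"] == 1]
# week1_a -> [25.3, 26.1]
```

The notebook-style layout above can't be queried this way at all; the database layout makes it a one-liner.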
Long Format vs. Wide Format
Understanding when to use each format is crucial for effective data management.
Wide Format
When to use: When you have a manageable number of repeated measurements and want a compact view.
Example:
animal_id | treatment | weight_week1 | weight_week2 | weight_week3
----------|-----------|--------------|--------------|-------------
A001 | A | 25.3 | 31.2 | 36.8
A002 | A | 26.1 | 32.5 | 38.2
Advantages:
- Compact and easy to read
- Each animal is one row
- Good for data entry
Disadvantages:
- Hard to add new time points (must add new columns)
- Difficult to analyze with most statistical software
- Can't easily handle missing data
Long Format (Recommended for Analysis)
When to use: For data analysis, statistical modeling, and AI-assisted analysis.
Example:
animal_id | treatment | week | weight_kg
----------|-----------|------|----------
A001 | A | 1 | 25.3
A001 | A | 2 | 31.2
A001 | A | 3 | 36.8
A002 | A | 1 | 26.1
A002 | A | 2 | 32.5
A002 | A | 3 | 38.2
Advantages:
- Flexible - easy to add new time points
- Works naturally with statistical software (R, Python, SAS)
- Handles missing data gracefully
- Ready for mixed models and repeated measures analysis
- AI tools can process this format efficiently
My Recommendation: Store your data in long format. You can always create wide format views when needed for presentations or reports.
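If data arrives in wide format, you can reshape it programmatically rather than by hand. A minimal Python sketch using plain dicts (column names like `weight_week1` are assumptions matching the example above; at scale, pandas' `melt` or tidyr's `pivot_longer` do the same job):

```python
def wide_to_long(rows, id_cols, prefix, var_name, value_name):
    """Reshape wide rows (one column per week) into long rows
    (one row per measurement). Columns starting with `prefix`
    are treated as measurements, e.g. 'weight_week1' -> week 1."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col.startswith(prefix):
                record = {c: row[c] for c in id_cols}  # carry the IDs along
                record[var_name] = int(col[len(prefix):])
                record[value_name] = value
                long_rows.append(record)
    return long_rows

wide = [{"animal_id": "A001", "treatment": "A",
         "weight_week1": 25.3, "weight_week2": 31.2, "weight_week3": 36.8}]
long_rows = wide_to_long(wide, ["animal_id", "treatment"],
                         "weight_week", "week", "weight_kg")
# Each animal now contributes one row per week
```

Going the other direction (long to wide) is just a pivot, which is why storing long costs you nothing.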
The Three-Table Approach
For most research trials, organize your data into three primary tables:
1. Animal/Subject Metadata Table
This table contains one row per animal/subject with all the identifying information and fixed characteristics.
Example: animal_data.csv
animal_id | breed | sex | birth_date | sire_id | dam_id | pen | treatment | initial_weight_kg
----------|-------|------|------------|---------|--------|-----|-----------|------------------
A001 | Duroc | M | 2024-07-15 | S123 | D456 | 1 | A | 20.5
A002 | Duroc | F | 2024-07-16 | S124 | D457 | 1 | A | 19.8
A003 | York | M | 2024-07-15 | S125 | D458 | 2 | B | 20.1
What to include:
- Unique identifier (animal_id, subject_id, plot_id)
- Fixed characteristics (sex, breed, genotype, variety)
- Treatment assignment
- Pen/block/replicate assignment
- Genealogy (if applicable)
- Any baseline measurements
2. Measurement Data Table(s)
This table contains time-series measurements with one row per measurement event.
Example: weight_data.csv
animal_id | measurement_date | age_days | weight_kg | body_condition_score | technician
----------|------------------|----------|-----------|---------------------|------------
A001 | 2024-09-15 | 62 | 25.3 | 3 | JD
A001 | 2024-09-22 | 69 | 31.2 | 3 | JD
A001 | 2024-09-29 | 76 | 36.8 | 4 | SM
A002 | 2024-09-15 | 61 | 26.1 | 3 | JD
A002 | 2024-09-22 | 68 | 32.5 | 3 | JD
What to include:
- Animal/subject ID (links to metadata table)
- Measurement date/time
- All measurements taken at that time
- Who collected the data
- Any relevant conditions or notes
Pro tip: If you have very different types of measurements (e.g., weights vs. blood samples vs. behavioral observations), consider separate tables for each measurement type.
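Derived values like `age_days` are best computed from the metadata table rather than typed by hand, so they can never drift out of sync with `birth_date`. A small sketch using Python's standard library (dates taken from the tables above):

```python
from datetime import date

def age_days(birth_date, measurement_date):
    """Age in days at measurement, computed from the metadata
    table's birth_date rather than entered manually."""
    return (measurement_date - birth_date).days

# A001 was born 2024-07-15 (metadata table) and weighed on 2024-09-15
print(age_days(date(2024, 7, 15), date(2024, 9, 15)))  # 62
```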
3. Pen/Group Level Data Table
If you collect data at the group level (like feed intake per pen), use a separate table.
Example: pen_feed_data.csv
pen_id | measurement_date | feed_delivered_kg | feed_remaining_kg | water_consumption_L
-------|------------------|-------------------|-------------------|---------------------
1 | 2024-09-15 | 50.0 | 5.2 | 45.3
1 | 2024-09-16 | 50.0 | 4.8 | 47.1
2 | 2024-09-15 | 50.0 | 6.1 | 43.8
2 | 2024-09-16 | 50.0 | 5.5 | 44.2
What to include:
- Pen/group ID
- Date of measurement
- All group-level measurements
- Environmental data (temperature, humidity, etc.)
Column Naming Best Practices
Good column names make your data self-documenting and easier to work with.
Rules for Column Names
- Use lowercase with underscores: `animal_id`, not `AnimalID` or `Animal ID`
- Be descriptive but concise: `weight_kg`, not `wt` or `animal_weight_in_kilograms`
- Include units when applicable: `weight_kg`, `height_cm`, `temperature_celsius`
- Use consistent prefixes for related variables: `feed_delivered_kg`, `feed_remaining_kg`, `feed_wasted_kg`
- Avoid special characters: no spaces, periods, or special characters except underscore
- Start with a letter: not a number or underscore
- Avoid reserved words: don't use `date`, `time`, `group`, `if`, `for`, etc. (programming keywords)
Examples
❌ Bad column names:
- Wt (not descriptive)
- Animal ID (has space)
- Weight(kg) (has parentheses)
- 1st_measurement (starts with number)
- data (reserved word, not specific)
✅ Good column names:
- weight_kg
- animal_id
- measurement_date
- initial_weight_kg
- feed_conversion_ratio
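These rules are easy to enforce mechanically before data collection starts. A hedged sketch in Python (the `RESERVED` set here is illustrative — extend it with the reserved words of whatever analysis language you use):

```python
import keyword
import re

# Illustrative reserved words: the guide's examples plus Python's keywords
RESERVED = {"date", "time", "group"} | set(keyword.kwlist)

def valid_column_name(name):
    """Check a column name against the rules above: lowercase with
    underscores, starts with a letter, no spaces or special
    characters, not a reserved word."""
    if name in RESERVED:
        return False
    return re.fullmatch(r"[a-z][a-z0-9_]*", name) is not None

print(valid_column_name("weight_kg"))        # True
print(valid_column_name("Animal ID"))        # False: space and capitals
print(valid_column_name("1st_measurement"))  # False: starts with a number
```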
One Data Type Per Column
Each column should contain only ONE type of data in a consistent format.
Common Mistakes
❌ Mixing Text and Numbers
| weight |
|--------|
| 25.3 |
| 26.1 |
| sick | # Should be in a separate 'notes' column
| 24.8 |
❌ Mixing Units
| weight |
|--------|
| 25.3kg |
| 57lb | # Convert to one unit
| 26.1kg |
❌ Multiple Values in One Cell
| treatment |
|-----------|
| A, B | # Should be separate rows or columns
✅ Correct Approach
Use separate columns for different data types:
animal_id | weight_kg | status | notes
----------|-----------|--------|------------------
A001 | 25.3 | normal |
A002 | 26.1 | normal |
A003 | | sick | not weighed due to illness
A004 | 24.8 | normal |
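Mixed-type cells like the `sick` example above are easy to catch programmatically. A minimal sketch, with rows as dicts of strings as Python's `csv` module would produce them:

```python
def check_numeric_column(rows, col):
    """Return (row_index, value) pairs for cells that are neither
    blank (missing) nor parseable as a number."""
    problems = []
    for i, row in enumerate(rows):
        cell = row.get(col, "").strip()
        if cell == "":
            continue  # blank means missing, per the conventions above
        try:
            float(cell)
        except ValueError:
            problems.append((i, cell))
    return problems

rows = [{"weight": "25.3"}, {"weight": "26.1"},
        {"weight": "sick"}, {"weight": "24.8"}]
print(check_numeric_column(rows, "weight"))  # [(2, 'sick')]
```

Running a check like this before analysis turns a silent parsing failure into an explicit to-do item.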
The Three-Document System
To make your data truly ready for AI analysis, supplement your data files with three markdown documentation files:
1. DATA.md - Data Dictionary
This file describes every table and every column in detail.
Example structure:
# Data Dictionary
## animal_data.csv
Description: One row per animal with baseline characteristics and treatment assignments.
| Column Name | Data Type | Description | Values/Units |
|-------------------|-----------|------------------------------------------------|---------------------------|
| animal_id | string | Unique identifier for each animal | Format: A001, B002, etc. |
| breed | string | Breed of animal | Duroc, Yorkshire, Landrace|
| sex | string | Sex of animal | M (male), F (female) |
| birth_date | date | Date of birth | YYYY-MM-DD format |
| sire_id | string | Unique identifier for sire | Format: S001, S002, etc. |
| dam_id | string | Unique identifier for dam | Format: D001, D002, etc. |
| pen | integer | Pen number animal was housed in | 1-20 |
| treatment | string | Treatment group assignment | A, B, C, Control |
| initial_weight_kg | numeric | Weight at start of trial (kilograms) | Measured at day 0 |
## weight_data.csv
Description: Repeated weight measurements for each animal over time.
| Column Name | Data Type | Description | Values/Units |
|------------------------|-----------|--------------------------------------|----------------------|
| animal_id | string | Links to animal_data.csv | Format: A001, etc. |
| measurement_date | date | Date weight was measured | YYYY-MM-DD format |
| age_days | integer | Age of animal in days | Calculated from birth|
| weight_kg | numeric | Body weight in kilograms | Measured to 0.1 kg |
| body_condition_score | integer | Body condition on 1-5 scale | 1=thin, 5=fat |
| technician | string | Initials of person collecting data | JD, SM, etc. |
## pen_feed_data.csv
Description: Daily feed delivery and consumption by pen.
| Column Name | Data Type | Description | Values/Units |
|-----------------------|-----------|------------------------------------------|----------------|
| pen_id | integer | Pen number | 1-20 |
| measurement_date | date | Date of measurement | YYYY-MM-DD |
| feed_delivered_kg | numeric | Amount of feed added to feeder (kg) | Daily total |
| feed_remaining_kg | numeric | Amount of feed left at measurement (kg) | Manual measure |
| water_consumption_L | numeric | Water consumed by pen (liters) | From meter |

2. EXPERIMENTAL_DESIGN.md - Study Design Documentation
This file explains how the trial was designed and how data was collected.
Example structure:
# Experimental Design
## Study Title
Effects of Dietary Lysine Levels on Growth Performance in Finishing Pigs
## Objectives
1. Evaluate growth performance (ADG, ADFI, G:F) across four dietary lysine levels
2. Determine optimal lysine level for finishing pigs (100-130 kg BW)
3. Assess economic returns for each treatment
## Experimental Design
**Design Type:** Randomized Complete Block Design (RCBD)
**Blocking Factor:** Initial body weight
**Treatments:**
- Treatment A: 0.70% SID lysine (control)
- Treatment B: 0.85% SID lysine
- Treatment C: 1.00% SID lysine
- Treatment D: 1.15% SID lysine
**Experimental Units:** Individual pigs
**Replicates:** 8 pigs per treatment (32 pigs total)
**Housing:** 4 pigs per pen, 8 pens total
## Animals
- Species: Pigs (Sus scrofa domesticus)
- Genetics: Duroc x (Large White x Landrace)
- Initial weight: 100 ± 5 kg
- Initial age: ~150 days
- Source: University research farm
## Timeline
- **Day -7 to -1:** Acclimation period, all pigs on common diet
- **Day 0:** Initial weights, treatment assignment, start experimental diets
- **Day 0-56:** Treatment period
- **Day 56:** Final weights, trial end
## Measurements
### Weekly (Every 7 days)
- Individual body weight (kg)
- Body condition score (1-5 scale)
### Daily (by pen)
- Feed delivered (kg)
- Feed remaining (kg) - measured at same time each day
- Water consumption (L) - from automatic meter
### End of Trial
- Backfat depth (mm) - ultrasound
- Loin depth (mm) - ultrasound
## Data Collection Protocols
### Body Weights
- Collected between 7:00-9:00 AM
- Pigs fasted for 12 hours before weighing
- Scale calibrated daily with 50kg standard weight
- Two technicians present, one reading scale, one recording
### Feed Measurements
- Feed delivered at 6:00 AM daily
- Remaining feed measured at 5:00 PM (23 hours later)
- Feeders cleaned weekly (Sunday evening)
## Randomization
Animals were blocked by initial weight into 8 blocks. Within each block, 4 animals were randomly assigned to treatments A, B, C, D using a random number generator in R.
Random seed: 12345
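Stepping outside the example document for a moment: the blocked, seeded assignment described above was done in R, but the same logic is a few lines in any language. A sketch in Python's standard library (block and animal IDs are hypothetical):

```python
import random

def assign_treatments(blocks, treatments, seed):
    """Randomly assign treatments within each weight block (RCBD).
    `blocks` maps block number -> list of animal IDs; each block
    holds exactly one animal per treatment."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    assignment = {}
    for block_id in sorted(blocks):
        shuffled = list(treatments)
        rng.shuffle(shuffled)  # randomize within the block only
        for animal, trt in zip(blocks[block_id], shuffled):
            assignment[animal] = trt
    return assignment

blocks = {1: ["A001", "A002", "A003", "A004"],
          2: ["A005", "A006", "A007", "A008"]}
assignment = assign_treatments(blocks, ["A", "B", "C", "D"], seed=12345)
# Every block receives each treatment exactly once
```

Recording the seed (as the example design does) is what makes the randomization auditable later.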
## Power Analysis
Expected to detect a 0.10 kg/day difference in ADG with 80% power at α=0.05.
Calculation based on σ=0.15 kg/day from previous trials.

3. STATS.md - Analysis Plan
This file explains what analysis should be performed and what models to use.
Example structure:
# Statistical Analysis Plan
## Primary Analysis
### Response Variables (Y variables)
1. **Average Daily Gain (ADG)**
- Calculate as: (final_weight_kg - initial_weight_kg) / days_on_trial
- Units: kg/day
2. **Average Daily Feed Intake (ADFI)**
- Calculate per pen, then average for individual pigs in pen
- Units: kg/day
3. **Feed Conversion Ratio (FCR)**
- Calculate as: ADFI / ADG
- Units: kg feed / kg gain
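The derived variables above are simple arithmetic and, per the Excel tips later in this guide, belong in analysis code rather than in spreadsheet cells. A sketch (example values are illustrative, using the design's 56-day trial):

```python
def adg(initial_kg, final_kg, days_on_trial):
    """Average Daily Gain in kg/day."""
    return (final_kg - initial_kg) / days_on_trial

def fcr(adfi_kg_day, adg_kg_day):
    """Feed Conversion Ratio: kg feed per kg gain."""
    return adfi_kg_day / adg_kg_day

# e.g. a pig growing from 100.2 to 128.2 kg over the 56-day trial
print(adg(100.2, 128.2, 56))             # ≈ 0.5 kg/day
print(fcr(2.5, adg(100.2, 128.2, 56)))   # ≈ 5.0 kg feed / kg gain
```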
### Primary Statistical Model
**Model type:** Analysis of Variance (ANOVA) / Linear Mixed Model
**Fixed Effects:**
- Treatment (4 levels: A, B, C, D)
**Random Effects:**
- Block (8 levels)
- Pen (nested within treatment)
**Model equation:**

Y_ijk = μ + Treatment_i + Block_j + Pen_ij + ε_ijk

where:
- Y_ijk = response for pig k in pen ij
- μ = overall mean
- Treatment_i = fixed effect of treatment i
- Block_j = random effect of block j
- Pen_ij = random effect of pen in treatment i and block j
- ε_ijk = residual error
**Assumptions to check:**
- Normality of residuals (QQ plot, Shapiro-Wilk test)
- Homogeneity of variance (Levene's test, residual plots)
- Independence of observations
### Pairwise Comparisons
If overall treatment effect is significant (p < 0.05):
- Compare all treatments using Tukey's HSD
- Report adjusted p-values
### Regression Analysis
Fit linear and quadratic regression models with lysine level (continuous) as predictor:
Y = β0 + β1*lysine_level + ε (linear)
Y = β0 + β1*lysine_level + β2*lysine_level^2 + ε (quadratic)
Compare models using AIC to determine if quadratic term is needed.
## Secondary Analysis
### Growth Curves
Fit mixed model with repeated measures:
weight_kg = β0 + β1*days + β2*treatment + β3*days*treatment + (1 + days | animal_id) + ε
This allows animal-specific intercepts and slopes.
### Feed Efficiency Over Time
Calculate weekly feed efficiency and test if it changes over time or differs by treatment.
## Software
Primary analysis in R using:
- `lme4` package for mixed models
- `emmeans` package for pairwise comparisons
- `ggplot2` for graphics
## Significance Level
α = 0.05 for all tests
## Missing Data
- If < 5% of data missing: complete case analysis
- If 5-20% missing: multiple imputation with 5 imputations
- If > 20% missing: investigate cause and consider sensitivity analysis
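A decision rule like this can be encoded directly so every analyst applies the same thresholds. A sketch (strategy labels paraphrase the plan above):

```python
def missing_data_strategy(n_missing, n_total):
    """Pick the handling strategy from the thresholds in the plan above."""
    rate = n_missing / n_total
    if rate < 0.05:
        return "complete case analysis"
    elif rate <= 0.20:
        return "multiple imputation (5 imputations)"
    else:
        return "investigate cause; consider sensitivity analysis"

print(missing_data_strategy(2, 100))   # complete case analysis
print(missing_data_strategy(12, 100))  # multiple imputation (5 imputations)
```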
## Reporting
Report for each response variable:
- Least squares means ± SE for each treatment
- P-value for overall treatment effect
- Pairwise comparison results (if significant)
- Effect size (Cohen's d for pairwise comparisons)
- R² or pseudo-R² for variance explained
Include:
- ANOVA table
- Residual diagnostic plots
- Treatment means plot with error bars
Practical Tips for Implementation
Starting a New Trial
1. Before data collection starts:
   - Create your DATA.md file with all planned tables and columns
   - Write your EXPERIMENTAL_DESIGN.md documenting the study design
   - Create your STATS.md file with the analysis plan
   - Set up your Excel workbook or CSV files with proper column headers
2. During data collection:
   - Enter data consistently - use the same format every time
   - Use data validation in Excel to prevent errors (dropdown lists, number ranges)
   - Keep a separate "issues log" for unusual events or deviations
   - Back up your data regularly (ideally daily)
3. Quality checks:
   - Regularly check for missing values
   - Look for outliers and investigate them
   - Verify that all dates are in the correct format
   - Ensure all animal IDs match between tables
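That last check, matching IDs across tables, is a one-liner with sets. A sketch using plain dicts (the shape `csv.DictReader` would yield):

```python
def orphan_ids(metadata_rows, measurement_rows, key="animal_id"):
    """IDs that appear in the measurement table but not in the
    metadata table -- usually typos or missing metadata entries."""
    known = {row[key] for row in metadata_rows}
    seen = {row[key] for row in measurement_rows}
    return sorted(seen - known)

metadata = [{"animal_id": "A001"}, {"animal_id": "A002"}]
weights = [{"animal_id": "A001"}, {"animal_id": "A0O2"}]  # typo: letter O, not zero
print(orphan_ids(metadata, weights))  # ['A0O2']
```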
Working with AI Tools
When you provide your data to AI assistants (like Claude):
1. Share your documentation first:
   - Upload DATA.md, EXPERIMENTAL_DESIGN.md, and STATS.md
   - This gives context for the analysis
2. Then share your data files:
   - The AI can now understand the structure
   - It knows what analysis you want
   - It can spot issues or inconsistencies
3. Ask specific questions:
   - "Run the primary analysis as specified in STATS.md"
   - "Check my data for outliers and missing values"
   - "Create a visualization of growth curves by treatment"
Excel-Specific Tips
If you're using Excel:
1. Use separate sheets for each table
   - Sheet 1: animal_data
   - Sheet 2: weight_data
   - Sheet 3: pen_feed_data
2. Freeze the header row (View → Freeze Top Row)
3. Use Tables (Ctrl+T or Cmd+T)
   - Converts your data to a formal Table object
   - Automatically extends formulas
   - Makes filtering easier
4. Use data validation (Data → Data Validation)
   - Set up dropdown lists for categorical variables
   - Set number ranges for numeric variables
5. Name your sheets clearly
   - Match your CSV/table names
   - No spaces in names
6. Avoid formulas in data cells
   - Calculate derived variables in R/Python/SAS, not Excel
   - If you must use formulas, put them in separate columns clearly labeled as "calculated"
Common Pitfalls to Avoid
1. Color-Coded Data
Don't do this:
- Red text for outliers
- Yellow cells for estimated values
- Green for good data
Instead:
- Add a column called `data_quality` with values: "measured", "estimated", "outlier"
- Add a column called `notes` for explanations
2. Multiple Header Rows
Don't do this:
| Study 2024 | |
| Pig Growth Trial | |
| animal_id | weight_kg |
| A001 | 25.3 |
Instead: Put all metadata in a separate sheet or the documentation files.
3. Merged Cells for Repeated Values
Don't do this:
| animal_id | week | weight |
|-----------|------|--------|
| A001 | 1 | 25.3 |
| | 2 | 31.2 | # Merged cells for A001
| | 3 | 36.8 |
Instead: Repeat the value in every row:
| animal_id | week | weight |
|-----------|------|--------|
| A001 | 1 | 25.3 |
| A001 | 2 | 31.2 |
| A001 | 3 | 36.8 |
4. Units in Data Cells
Don't do this:
| weight |
|---------|
| 25.3 kg |
| 26.1 kg |
Instead: Put units in the column name:
| weight_kg |
|-----------|
| 25.3 |
| 26.1 |
5. Missing Data Indicators
Don't do this: Use "NA", "N/A", "-", "missing", blank cells inconsistently
Instead: Be consistent:
- Leave cells truly blank for missing data
- Or use NA consistently
- Document what missing data means in DATA.md
Real-World Example: Complete Pig Growth Trial
Let me show you what a complete, properly structured dataset looks like for a pig growth trial.
animal_data.csv
animal_id,breed,sex,birth_date,sire_id,dam_id,pen,treatment,block,initial_weight_kg
A001,Duroc,M,2024-07-15,S123,D456,1,A,1,100.2
A002,Yorkshire,F,2024-07-16,S124,D457,1,B,1,101.5
A003,Duroc,M,2024-07-15,S125,D458,2,C,1,99.8
A004,Yorkshire,F,2024-07-17,S126,D459,2,D,1,100.9

weight_data.csv
animal_id,measurement_date,age_days,weight_kg,body_condition_score,technician,notes
A001,2024-09-15,62,100.2,3,JD,initial weight
A001,2024-09-22,69,107.5,3,JD,
A001,2024-09-29,76,115.2,3,SM,
A001,2024-10-06,83,122.8,4,JD,
A002,2024-09-15,61,101.5,3,JD,initial weight
A002,2024-09-22,68,109.2,3,JD,
A002,2024-09-29,75,117.8,3,SM,
A002,2024-10-06,82,126.1,4,JD,

pen_feed_data.csv
pen_id,date,feed_delivered_kg,feed_remaining_kg,feed_consumed_kg,water_consumption_L,temperature_c
1,2024-09-15,50.0,5.2,44.8,45.3,22.1
1,2024-09-16,50.0,4.8,45.2,47.1,22.5
1,2024-09-17,50.0,6.1,43.9,46.8,21.8
2,2024-09-15,50.0,6.1,43.9,43.8,22.1
2,2024-09-16,50.0,5.5,44.5,44.2,22.5
2,2024-09-17,50.0,7.2,42.8,45.1,21.8

With this structure:
- Each file serves a clear purpose
- All tables can be linked by animal_id or pen_id
- Data is ready for statistical analysis
- AI tools can immediately understand and work with it
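Because every table shares a key, joining them for analysis is mechanical. A standard-library sketch of the `animal_id` link (at scale, pandas' `merge` or dplyr's `left_join` do the same thing):

```python
def left_join(measurements, metadata, key="animal_id"):
    """Attach each measurement row's metadata (treatment, pen, ...)
    by looking up the shared key."""
    lookup = {row[key]: row for row in metadata}
    return [{**row, **lookup.get(row[key], {})} for row in measurements]

metadata = [{"animal_id": "A001", "treatment": "A", "pen": 1}]
weights = [{"animal_id": "A001", "measurement_date": "2024-09-22",
            "weight_kg": 107.5}]
joined = left_join(weights, metadata)
# joined[0] now carries both the weight and the treatment assignment
```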
Conclusion
Structuring your research data properly is an investment that pays dividends throughout your project:
- During data collection: Reduces errors and makes data entry consistent
- During analysis: Makes statistical analysis straightforward and reproducible
- During review: Allows collaborators and reviewers to understand your data
- During publication: Makes data sharing and archiving simple
- For future use: Enables re-analysis with new methods or questions
Key Takeaways
- Think database, not notebook - Structure data for machines to read
- One observation per row - Use long format for analysis
- One data type per column - Be consistent
- Use the three-table approach - Separate metadata, measurements, and group data
- Document everything - Create DATA.md, EXPERIMENTAL_DESIGN.md, and STATS.md
- Follow naming conventions - lowercase_with_underscores
- Be consistent - Use the same structure across all your trials
By following these principles, you'll make your data ready for modern AI-assisted analysis while also making it more accessible to your colleagues, reviewers, and your future self.
Additional Resources
Tools for Data Management
- R packages: `dplyr`, `tidyr` for data manipulation
- Python libraries: `pandas` for data handling
- Excel features: Tables, Data Validation, Power Query
- Database software: Access, SQLite for larger datasets
Further Reading
- Tidy Data by Hadley Wickham (foundational paper on data structure)
- R for Data Science - free online book
- FAIR Data Principles - Findable, Accessible, Interoperable, Reusable
Questions or Feedback?
If you have questions about structuring your research data or want to discuss specific challenges with your trials, feel free to reach out on LinkedIn.
Last updated: November 20, 2024