Structuring Research Data for AI Analysis: A Practical Guide
Learn how to organize research trial data effectively for AI-assisted analysis. A comprehensive guide covering database principles, data formats, and documentation strategies.
In the era of AI-assisted data analysis, the way we structure our research data has never been more important. Proper data organization is the foundation that enables powerful AI tools like Claude, ChatGPT, and specialized statistical packages to help you extract meaningful insights from your trials.
This guide will show you how to structure your research data using database principles, even if you're working in Excel. By following these practices, you'll make your data more accessible, reproducible, and ready for AI-assisted analysis.
Outline
This guide is organized around three pillars:
- Structured data ➡️ tabular data such as CSVs or carefully constructed Excel tabs (one per data type)
- Metadata ➡️ a description of what each CSV or tab contains (column names, units, data types, descriptions, etc.)
- Markdown files ➡️ at minimum, markdown files documenting the data, the experimental design and data collection, and the statistical analysis
Why Data Structure Matters for AI Analysis
Modern AI tools are incredibly powerful, but they rely on well-structured input data to provide accurate analysis. When your data is properly organized:
- AI can understand your data structure without extensive manual explanation
- Analysis is reproducible and can be verified by others
- Errors are easier to catch because patterns are consistent
- Collaboration is seamless across team members and institutions
- Future analysis is simpler when you revisit the data months or years later
The key insight: Think of Excel as a database, not a notebook.
Core Principle: Database-First Thinking
What Does This Mean?
Instead of treating Excel like a paper notebook where you might add notes, merged cells, colors for meaning, and multiple types of data mixed together, treat each sheet (or CSV file) as a database table with these characteristics:
- One row = one observation
- One column = one variable
- Every cell contains only one type of data
- Column names are descriptive and consistent
- No merged cells, no color-coded data, no formulas in data cells
The Problem with "Notebook-Style" Data
Here's what we often see in research data (and what to avoid):
❌ BAD EXAMPLE - Don't do this:
| Pig Study - Fall 2024 | | | |
| Treatment A | Rep 1 | Rep 2 | Rep 3 |
| Week 1 wt | 25.3 | 26.1 | 24.8 |
| Week 2 wt | 31.2 | 32.5 | 30.9 |
| Treatment B | Rep 1 | Rep 2 | Rep 3 |
| Week 1 wt | 24.9 | 25.5 | 25.2 |
Why is this bad?
- Merged cells (the title)
- Multiple data types mixed together (treatments, weeks, measurements)
- Wide format makes it hard to add new weeks or treatments
- No clear animal IDs
- Can't easily filter or query this data
- AI tools struggle to parse this structure
The Database Approach
Here's the same data structured properly:
✅ GOOD EXAMPLE - Do this:
animal_id | treatment | replicate | week | weight_kg | measurement_date
----------|-----------|-----------|------|-----------|------------------
A001 | A | 1 | 1 | 25.3 | 2024-09-15
A002 | A | 2 | 1 | 26.1 | 2024-09-15
A003 | A | 3 | 1 | 24.8 | 2024-09-15
A001 | A | 1 | 2 | 31.2 | 2024-09-22
A002 | A | 2 | 2 | 32.5 | 2024-09-22
A003 | A | 3 | 2 | 30.9 | 2024-09-22
B001 | B | 1 | 1 | 24.9 | 2024-09-15
B002 | B | 2 | 1 | 25.5 | 2024-09-15
B003 | B | 3 | 1 | 25.2 | 2024-09-15
Why is this better?
- Each row is one measurement
- Each column is one variable
- Easy to add new animals, weeks, or measurements
- Can filter, sort, and query easily
- AI tools can immediately understand the structure
- Ready for statistical analysis
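To see the payoff concretely, here is a minimal Python sketch (plain dicts standing in for CSV rows; the values mirror the table above) showing how easily long-format data can be filtered:

```python
# Long-format rows: one dict per measurement, mirroring the table above
rows = [
    {"animal_id": "A001", "treatment": "A", "week": 1, "weight_kg": 25.3},
    {"animal_id": "A002", "treatment": "A", "week": 1, "weight_kg": 26.1},
    {"animal_id": "A001", "treatment": "A", "week": 2, "weight_kg": 31.2},
    {"animal_id": "B001", "treatment": "B", "week": 1, "weight_kg": 24.9},
]

# Filter: all week-1 weights for treatment A
week1_a = [r["weight_kg"] for r in rows
           if r["treatment"] == "A" and r["week"] == 1]
# week1_a -> [25.3, 26.1]
```

The notebook-style layout above can't be queried this way at all; the database layout makes it a one-liner.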
Long Format vs. Wide Format
Understanding when to use each format is crucial for effective data management.
Wide Format
When to use: When you have a manageable number of repeated measurements and want a compact view.
Example:
animal_id | treatment | weight_week1 | weight_week2 | weight_week3
----------|-----------|--------------|--------------|-------------
A001 | A | 25.3 | 31.2 | 36.8
A002 | A | 26.1 | 32.5 | 38.2
Advantages:
- Compact and easy to read
- Each animal is one row
- Good for data entry
Disadvantages:
- Hard to add new time points (must add new columns)
- Difficult to analyze with most statistical software
- Can't easily handle missing data
Long Format (Recommended for Analysis)
When to use: For data analysis, statistical modeling, and AI-assisted analysis.
Example:
animal_id | treatment | week | weight_kg
----------|-----------|------|----------
A001 | A | 1 | 25.3
A001 | A | 2 | 31.2
A001 | A | 3 | 36.8
A002 | A | 1 | 26.1
A002 | A | 2 | 32.5
A002 | A | 3 | 38.2
Advantages:
- Flexible - easy to add new time points
- Works naturally with statistical software (R, Python, SAS)
- Handles missing data gracefully
- Ready for mixed models and repeated measures analysis
- AI tools can process this format efficiently
My Recommendation: Store your data in long format. You can always create wide format views when needed for presentations or reports.
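If data arrives in wide format, you can reshape it programmatically rather than by hand. A minimal Python sketch using plain dicts (column names like `weight_week1` are assumptions matching the example above; at scale, pandas' `melt` or tidyr's `pivot_longer` do the same job):

```python
def wide_to_long(rows, id_cols, prefix, var_name, value_name):
    """Reshape wide rows (one column per week) into long rows
    (one row per measurement). Columns starting with `prefix`
    are treated as measurements, e.g. 'weight_week1' -> week 1."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col.startswith(prefix):
                record = {c: row[c] for c in id_cols}  # carry the IDs along
                record[var_name] = int(col[len(prefix):])
                record[value_name] = value
                long_rows.append(record)
    return long_rows

wide = [{"animal_id": "A001", "treatment": "A",
         "weight_week1": 25.3, "weight_week2": 31.2, "weight_week3": 36.8}]
long_rows = wide_to_long(wide, ["animal_id", "treatment"],
                         "weight_week", "week", "weight_kg")
# Each animal now contributes one row per week
```

Going the other direction (long to wide) is just a pivot, which is why storing long costs you nothing.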
The Three-Table Approach
For most research trials, organize your data into three primary tables:
1. Animal/Subject Metadata Table
This table contains one row per animal/subject with all the identifying information and fixed characteristics.
Example: animal_data.csv
animal_id | breed | sex | birth_date | sire_id | dam_id | pen | treatment | initial_weight_kg
----------|-------|------|------------|---------|--------|-----|-----------|------------------
A001 | Duroc | M | 2024-07-15 | S123 | D456 | 1 | A | 20.5
A002 | Duroc | F | 2024-07-16 | S124 | D457 | 1 | A | 19.8
A003 | York | M | 2024-07-15 | S125 | D458 | 2 | B | 20.1
What to include:
- Unique identifier (animal_id, subject_id, plot_id)
- Fixed characteristics (sex, breed, genotype, variety)
- Treatment assignment
- Pen/block/replicate assignment
- Genealogy (if applicable)
- Any baseline measurements
2. Measurement Data Table(s)
This table contains time-series measurements with one row per measurement event.
Example: weight_data.csv
animal_id | measurement_date | age_days | weight_kg | body_condition_score | technician
----------|------------------|----------|-----------|---------------------|------------
A001 | 2024-09-15 | 62 | 25.3 | 3 | JD
A001 | 2024-09-22 | 69 | 31.2 | 3 | JD
A001 | 2024-09-29 | 76 | 36.8 | 4 | SM
A002 | 2024-09-15 | 61 | 26.1 | 3 | JD
A002 | 2024-09-22 | 68 | 32.5 | 3 | JD
What to include:
- Animal/subject ID (links to metadata table)
- Measurement date/time
- All measurements taken at that time
- Who collected the data
- Any relevant conditions or notes
Pro tip: If you have very different types of measurements (e.g., weights vs. blood samples vs. behavioral observations), consider separate tables for each measurement type.
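Derived values like `age_days` are best computed from the metadata table rather than typed by hand, so they can never drift out of sync with `birth_date`. A small sketch using Python's standard library (dates taken from the tables above):

```python
from datetime import date

def age_days(birth_date, measurement_date):
    """Age in days at measurement, computed from the metadata
    table's birth_date rather than entered manually."""
    return (measurement_date - birth_date).days

# A001 was born 2024-07-15 (metadata table) and weighed on 2024-09-15
print(age_days(date(2024, 7, 15), date(2024, 9, 15)))  # 62
```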
3. Pen/Group Level Data Table
If you collect data at the group level (like feed intake per pen), use a separate table.
Example: pen_feed_data.csv
pen_id | measurement_date | feed_delivered_kg | feed_remaining_kg | water_consumption_L
-------|------------------|-------------------|-------------------|---------------------
1 | 2024-09-15 | 50.0 | 5.2 | 45.3
1 | 2024-09-16 | 50.0 | 4.8 | 47.1
2 | 2024-09-15 | 50.0 | 6.1 | 43.8
2 | 2024-09-16 | 50.0 | 5.5 | 44.2
What to include:
- Pen/group ID
- Date of measurement
- All group-level measurements
- Environmental data (temperature, humidity, etc.)
Column Naming Best Practices
Good column names make your data self-documenting and easier to work with.
Rules for Column Names
- Use lowercase with underscores: `animal_id`, not `AnimalID` or `Animal ID`
- Be descriptive but concise: `weight_kg`, not `wt` or `animal_weight_in_kilograms`
- Include units when applicable: `weight_kg`, `height_cm`, `temperature_celsius`
- Use consistent prefixes for related variables: `feed_delivered_kg`, `feed_remaining_kg`, `feed_wasted_kg`
- Avoid special characters: no spaces, periods, or special characters except underscore
- Start with a letter: not a number or underscore
- Avoid reserved words: don't use `date`, `time`, `group`, `if`, `for`, etc. (programming keywords)
Examples
❌ Bad column names:
- Wt (not descriptive)
- Animal ID (has space)
- Weight(kg) (has parentheses)
- 1st_measurement (starts with number)
- data (reserved word, not specific)
✅ Good column names:
- weight_kg
- animal_id
- measurement_date
- initial_weight_kg
- feed_conversion_ratio
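These rules are easy to enforce mechanically before data collection starts. A hedged sketch in Python (the `RESERVED` set here is illustrative — extend it with the reserved words of whatever analysis language you use):

```python
import keyword
import re

# Illustrative reserved words: the guide's examples plus Python's keywords
RESERVED = {"date", "time", "group"} | set(keyword.kwlist)

def valid_column_name(name):
    """Check a column name against the rules above: lowercase with
    underscores, starts with a letter, no spaces or special
    characters, not a reserved word."""
    if name in RESERVED:
        return False
    return re.fullmatch(r"[a-z][a-z0-9_]*", name) is not None

print(valid_column_name("weight_kg"))        # True
print(valid_column_name("Animal ID"))        # False: space and capitals
print(valid_column_name("1st_measurement"))  # False: starts with a number
```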
One Data Type Per Column
Each column should contain only ONE type of data in a consistent format.
Common Mistakes
❌ Mixing Text and Numbers
| weight |
|--------|
| 25.3 |
| 26.1 |
| sick | # Should be in a separate 'notes' column
| 24.8 |
❌ Mixing Units
| weight |
|--------|
| 25.3kg |
| 57lb | # Convert to one unit
| 26.1kg |
❌ Multiple Values in One Cell
| treatment |
|-----------|
| A, B | # Should be separate rows or columns
✅ Correct Approach
Use separate columns for different data types:
animal_id | weight_kg | status | notes
----------|-----------|--------|------------------
A001 | 25.3 | normal |
A002 | 26.1 | normal |
A003 | | sick | not weighed due to illness
A004 | 24.8 | normal |
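Mixed-type cells like the `sick` example above are easy to catch programmatically. A minimal sketch, with rows as dicts of strings as Python's `csv` module would produce them:

```python
def check_numeric_column(rows, col):
    """Return (row_index, value) pairs for cells that are neither
    blank (missing) nor parseable as a number."""
    problems = []
    for i, row in enumerate(rows):
        cell = row.get(col, "").strip()
        if cell == "":
            continue  # blank means missing, per the conventions above
        try:
            float(cell)
        except ValueError:
            problems.append((i, cell))
    return problems

rows = [{"weight": "25.3"}, {"weight": "26.1"},
        {"weight": "sick"}, {"weight": "24.8"}]
print(check_numeric_column(rows, "weight"))  # [(2, 'sick')]
```

Running a check like this before analysis turns a silent parsing failure into an explicit to-do item.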
The Three-Document System
To make your data truly ready for AI analysis, supplement your data files with three markdown documentation files:
1. DATA.md - Data Dictionary
This file describes every table and every column in detail.
Example structure:
# Data Dictionary
## animal_data.csv
Description: One row per animal with baseline characteristics and treatment assignments.
| Column Name | Data Type | Description | Values/Units |
|-------------------|-----------|------------------------------------------------|---------------------------|
| animal_id | string | Unique identifier for each animal | Format: A001, B002, etc. |
| breed | string | Breed of animal | Duroc, Yorkshire, Landrace|
| sex | string | Sex of animal | M (male), F (female) |
| birth_date | date | Date of birth | YYYY-MM-DD format |
| sire_id | string | Unique identifier for sire | Format: S001, S002, etc. |
| dam_id | string | Unique identifier for dam | Format: D001, D002, etc. |
| pen | integer | Pen number animal was housed in | 1-20 |
| treatment | string | Treatment group assignment | A, B, C, Control |
| initial_weight_kg | numeric | Weight at start of trial (kilograms) | Measured at day 0 |
## weight_data.csv
Description: Repeated weight measurements for each animal over time.
| Column Name | Data Type | Description | Values/Units |
|------------------------|-----------|--------------------------------------|----------------------|
| animal_id | string | Links to animal_data.csv | Format: A001, etc. |
| measurement_date | date | Date weight was measured | YYYY-MM-DD format |
| age_days | integer | Age of animal in days | Calculated from birth|
| weight_kg | numeric | Body weight in kilograms | Measured to 0.1 kg |
| body_condition_score | integer | Body condition on 1-5 scale | 1=thin, 5=fat |
| technician | string | Initials of person collecting data | JD, SM, etc. |
## pen_feed_data.csv
Description: Daily feed delivery and consumption by pen.
| Column Name | Data Type | Description | Values/Units |
|-----------------------|-----------|------------------------------------------|----------------|
| pen_id | integer | Pen number | 1-20 |
| measurement_date | date | Date of measurement | YYYY-MM-DD |
| feed_delivered_kg | numeric | Amount of feed added to feeder (kg) | Daily total |
| feed_remaining_kg | numeric | Amount of feed left at measurement (kg) | Manual measure |
| water_consumption_L | numeric | Water consumed by pen (liters) | From meter |

2. EXPERIMENTAL_DESIGN.md - Study Design Documentation
This file explains how the trial was designed and how data was collected.
Example structure:
# Experimental Design
## Study Title
Effects of Dietary Lysine Levels on Growth Performance in Finishing Pigs
## Objectives
1. Evaluate growth performance (ADG, ADFI, G:F) across four dietary lysine levels
2. Determine optimal lysine level for finishing pigs (100-130 kg BW)
3. Assess economic returns for each treatment
## Experimental Design
**Design Type:** Randomized Complete Block Design (RCBD)
**Blocking Factor:** Initial body weight
**Treatments:**
- Treatment A: 0.70% SID lysine (control)
- Treatment B: 0.85% SID lysine
- Treatment C: 1.00% SID lysine
- Treatment D: 1.15% SID lysine
**Experimental Units:** Individual pigs
**Replicates:** 8 pigs per treatment (32 pigs total)
**Housing:** 4 pigs per pen, 8 pens total
## Animals
- Species: Pigs (Sus scrofa domesticus)
- Genetics: Duroc x (Large White x Landrace)
- Initial weight: 100 ± 5 kg
- Initial age: ~150 days
- Source: University research farm
## Timeline
- **Day -7 to -1:** Acclimation period, all pigs on common diet
- **Day 0:** Initial weights, treatment assignment, start experimental diets
- **Day 0-56:** Treatment period
- **Day 56:** Final weights, trial end
## Measurements
### Weekly (Every 7 days)
- Individual body weight (kg)
- Body condition score (1-5 scale)
### Daily (by pen)
- Feed delivered (kg)
- Feed remaining (kg) - measured at same time each day
- Water consumption (L) - from automatic meter
### End of Trial
- Backfat depth (mm) - ultrasound
- Loin depth (mm) - ultrasound
## Data Collection Protocols
### Body Weights
- Collected between 7:00-9:00 AM
- Pigs fasted for 12 hours before weighing
- Scale calibrated daily with 50kg standard weight
- Two technicians present, one reading scale, one recording
### Feed Measurements
- Feed delivered at 6:00 AM daily
- Remaining feed measured at 5:00 PM (23 hours later)
- Feeders cleaned weekly (Sunday evening)
## Randomization
Animals were blocked by initial weight into 8 blocks. Within each block, 4 animals were randomly assigned to treatments A, B, C, D using a random number generator in R.
Random seed: 12345
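Stepping outside the example document for a moment: the blocked, seeded assignment described above was done in R, but the same logic is a few lines in any language. A sketch in Python's standard library (block and animal IDs are hypothetical):

```python
import random

def assign_treatments(blocks, treatments, seed):
    """Randomly assign treatments within each weight block (RCBD).
    `blocks` maps block number -> list of animal IDs; each block
    holds exactly one animal per treatment."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    assignment = {}
    for block_id in sorted(blocks):
        shuffled = list(treatments)
        rng.shuffle(shuffled)  # randomize within the block only
        for animal, trt in zip(blocks[block_id], shuffled):
            assignment[animal] = trt
    return assignment

blocks = {1: ["A001", "A002", "A003", "A004"],
          2: ["A005", "A006", "A007", "A008"]}
assignment = assign_treatments(blocks, ["A", "B", "C", "D"], seed=12345)
# Every block receives each treatment exactly once
```

Recording the seed (as the example design does) is what makes the randomization auditable later.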
## Power Analysis
Expected to detect a 0.10 kg/day difference in ADG with 80% power at α=0.05.
Calculation based on σ=0.15 kg/day from previous trials.

3. STATS.md - Analysis Plan
This file explains what analysis should be performed and what models to use.
Example structure:
# Statistical Analysis Plan
## Primary Analysis
### Response Variables (Y variables)
1. **Average Daily Gain (ADG)**
- Calculate as: (final_weight_kg - initial_weight_kg) / days_on_trial
- Units: kg/day
2. **Average Daily Feed Intake (ADFI)**
- Calculate per pen, then average for individual pigs in pen
- Units: kg/day
3. **Feed Conversion Ratio (FCR)**
- Calculate as: ADFI / ADG
- Units: kg feed / kg gain
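The derived variables above are simple arithmetic and, per the Excel tips later in this guide, belong in analysis code rather than in spreadsheet cells. A sketch (example values are illustrative, using the design's 56-day trial):

```python
def adg(initial_kg, final_kg, days_on_trial):
    """Average Daily Gain in kg/day."""
    return (final_kg - initial_kg) / days_on_trial

def fcr(adfi_kg_day, adg_kg_day):
    """Feed Conversion Ratio: kg feed per kg gain."""
    return adfi_kg_day / adg_kg_day

# e.g. a pig growing from 100.2 to 128.2 kg over the 56-day trial
print(adg(100.2, 128.2, 56))             # ≈ 0.5 kg/day
print(fcr(2.5, adg(100.2, 128.2, 56)))   # ≈ 5.0 kg feed / kg gain
```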
### Primary Statistical Model
**Model type:** Analysis of Variance (ANOVA) / Linear Mixed Model
**Fixed Effects:**
- Treatment (4 levels: A, B, C, D)
**Random Effects:**
- Block (8 levels)
- Pen (nested within treatment)
**Model equation:**

Y_ijk = μ + Treatment_i + Block_j + Pen_ij + ε_ijk

where:
- Y_ijk = response for pig k in pen ij
- μ = overall mean
- Treatment_i = fixed effect of treatment i
- Block_j = random effect of block j
- Pen_ij = random effect of pen in treatment i and block j
- ε_ijk = residual error
**Assumptions to check:**
- Normality of residuals (QQ plot, Shapiro-Wilk test)
- Homogeneity of variance (Levene's test, residual plots)
- Independence of observations
### Pairwise Comparisons
If overall treatment effect is significant (p < 0.05):
- Compare all treatments using Tukey's HSD
- Report adjusted p-values
### Regression Analysis
Fit linear and quadratic regression models with lysine level (continuous) as predictor:
Y = β0 + β1*lysine_level + ε (linear)
Y = β0 + β1*lysine_level + β2*lysine_level^2 + ε (quadratic)
Compare models using AIC to determine if quadratic term is needed.
## Secondary Analysis
### Growth Curves
Fit mixed model with repeated measures:
weight_kg = β0 + β1*days + β2*treatment + β3*days*treatment + (1 + days | animal_id) + ε
This allows animal-specific intercepts and slopes.
### Feed Efficiency Over Time
Calculate weekly feed efficiency and test if it changes over time or differs by treatment.
## Software
Primary analysis in R using:
- `lme4` package for mixed models
- `emmeans` package for pairwise comparisons
- `ggplot2` for graphics
## Significance Level
α = 0.05 for all tests
## Missing Data
- If < 5% of data missing: complete case analysis
- If 5-20% missing: multiple imputation with 5 imputations
- If > 20% missing: investigate cause and consider sensitivity analysis
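A decision rule like this can be encoded directly so every analyst applies the same thresholds. A sketch (strategy labels paraphrase the plan above):

```python
def missing_data_strategy(n_missing, n_total):
    """Pick the handling strategy from the thresholds in the plan above."""
    rate = n_missing / n_total
    if rate < 0.05:
        return "complete case analysis"
    elif rate <= 0.20:
        return "multiple imputation (5 imputations)"
    else:
        return "investigate cause; consider sensitivity analysis"

print(missing_data_strategy(2, 100))   # complete case analysis
print(missing_data_strategy(12, 100))  # multiple imputation (5 imputations)
```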
## Reporting
Report for each response variable:
- Least squares means ± SE for each treatment
- P-value for overall treatment effect
- Pairwise comparison results (if significant)
- Effect size (Cohen's d for pairwise comparisons)
- R² or pseudo-R² for variance explained
Include:
- ANOVA table
- Residual diagnostic plots
- Treatment means plot with error bars
Practical Tips for Implementation
Starting a New Trial
1. Before data collection starts:
   - Create your DATA.md file with all planned tables and columns
   - Write your EXPERIMENTAL_DESIGN.md documenting the study design
   - Create your STATS.md file with the analysis plan
   - Set up your Excel workbook or CSV files with proper column headers
2. During data collection:
   - Enter data consistently - use the same format every time
   - Use data validation in Excel to prevent errors (dropdown lists, number ranges)
   - Keep a separate "issues log" for unusual events or deviations
   - Back up your data regularly (ideally daily)
3. Quality checks:
   - Regularly check for missing values
   - Look for outliers and investigate them
   - Verify that all dates are in the correct format
   - Ensure all animal IDs match between tables
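That last check, matching IDs across tables, is a one-liner with sets. A sketch using plain dicts (the shape `csv.DictReader` would yield):

```python
def orphan_ids(metadata_rows, measurement_rows, key="animal_id"):
    """IDs that appear in the measurement table but not in the
    metadata table -- usually typos or missing metadata entries."""
    known = {row[key] for row in metadata_rows}
    seen = {row[key] for row in measurement_rows}
    return sorted(seen - known)

metadata = [{"animal_id": "A001"}, {"animal_id": "A002"}]
weights = [{"animal_id": "A001"}, {"animal_id": "A0O2"}]  # typo: letter O, not zero
print(orphan_ids(metadata, weights))  # ['A0O2']
```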
Working with AI Tools
When you provide your data to AI assistants (like Claude):
1. Share your documentation first:
   - Upload DATA.md, EXPERIMENTAL_DESIGN.md, and STATS.md
   - This gives context for the analysis
2. Then share your data files:
   - The AI can now understand the structure
   - It knows what analysis you want
   - It can spot issues or inconsistencies
3. Ask specific questions:
   - "Run the primary analysis as specified in STATS.md"
   - "Check my data for outliers and missing values"
   - "Create a visualization of growth curves by treatment"
Excel-Specific Tips
If you're using Excel:
1. Use separate sheets for each table
   - Sheet 1: animal_data
   - Sheet 2: weight_data
   - Sheet 3: pen_feed_data
2. Freeze the header row (View → Freeze Top Row)
3. Use Tables (Ctrl+T or Cmd+T)
   - Converts your data to a formal Table object
   - Automatically extends formulas
   - Makes filtering easier
4. Use data validation (Data → Data Validation)
   - Set up dropdown lists for categorical variables
   - Set number ranges for numeric variables
5. Name your sheets clearly
   - Match your CSV/table names
   - No spaces in names
6. Avoid formulas in data cells
   - Calculate derived variables in R/Python/SAS, not Excel
   - If you must use formulas, put them in separate columns clearly labeled as "calculated"
Common Pitfalls to Avoid
1. Color-Coded Data
Don't do this:
- Red text for outliers
- Yellow cells for estimated values
- Green for good data
Instead:
- Add a column called `data_quality` with values: "measured", "estimated", "outlier"
- Add a column called `notes` for explanations
2. Multiple Header Rows
Don't do this:
| Study 2024 | |
| Pig Growth Trial | |
| animal_id | weight_kg |
| A001 | 25.3 |
Instead: Put all metadata in a separate sheet or the documentation files.
3. Merged Cells for Repeated Values
Don't do this:
| animal_id | week | weight |
|-----------|------|--------|
| A001 | 1 | 25.3 |
| | 2 | 31.2 | # Merged cells for A001
| | 3 | 36.8 |
Instead: Repeat the value in every row:
| animal_id | week | weight |
|-----------|------|--------|
| A001 | 1 | 25.3 |
| A001 | 2 | 31.2 |
| A001 | 3 | 36.8 |
4. Units in Data Cells
Don't do this:
| weight |
|---------|
| 25.3 kg |
| 26.1 kg |
Instead: Put units in the column name:
| weight_kg |
|-----------|
| 25.3 |
| 26.1 |
5. Missing Data Indicators
Don't do this: Use "NA", "N/A", "-", "missing", blank cells inconsistently
Instead: Be consistent:
- Leave cells truly blank for missing data
- Or use NA consistently
- Document what missing data means in DATA.md
Real-World Example: Complete Pig Growth Trial
Let me show you what a complete, properly structured dataset looks like for a pig growth trial.
animal_data.csv
animal_id,breed,sex,birth_date,sire_id,dam_id,pen,treatment,block,initial_weight_kg
A001,Duroc,M,2024-07-15,S123,D456,1,A,1,100.2
A002,Yorkshire,F,2024-07-16,S124,D457,1,B,1,101.5
A003,Duroc,M,2024-07-15,S125,D458,2,C,1,99.8
A004,Yorkshire,F,2024-07-17,S126,D459,2,D,1,100.9

weight_data.csv
animal_id,measurement_date,age_days,weight_kg,body_condition_score,technician,notes
A001,2024-09-15,62,100.2,3,JD,initial weight
A001,2024-09-22,69,107.5,3,JD,
A001,2024-09-29,76,115.2,3,SM,
A001,2024-10-06,83,122.8,4,JD,
A002,2024-09-15,61,101.5,3,JD,initial weight
A002,2024-09-22,68,109.2,3,JD,
A002,2024-09-29,75,117.8,3,SM,
A002,2024-10-06,82,126.1,4,JD,

pen_feed_data.csv
pen_id,date,feed_delivered_kg,feed_remaining_kg,feed_consumed_kg,water_consumption_L,temperature_c
1,2024-09-15,50.0,5.2,44.8,45.3,22.1
1,2024-09-16,50.0,4.8,45.2,47.1,22.5
1,2024-09-17,50.0,6.1,43.9,46.8,21.8
2,2024-09-15,50.0,6.1,43.9,43.8,22.1
2,2024-09-16,50.0,5.5,44.5,44.2,22.5
2,2024-09-17,50.0,7.2,42.8,45.1,21.8

With this structure:
- Each file serves a clear purpose
- All tables can be linked by animal_id or pen_id
- Data is ready for statistical analysis
- AI tools can immediately understand and work with it
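Because every table shares a key, joining them for analysis is mechanical. A standard-library sketch of the `animal_id` link (at scale, pandas' `merge` or dplyr's `left_join` do the same thing):

```python
def left_join(measurements, metadata, key="animal_id"):
    """Attach each measurement row's metadata (treatment, pen, ...)
    by looking up the shared key."""
    lookup = {row[key]: row for row in metadata}
    return [{**row, **lookup.get(row[key], {})} for row in measurements]

metadata = [{"animal_id": "A001", "treatment": "A", "pen": 1}]
weights = [{"animal_id": "A001", "measurement_date": "2024-09-22",
            "weight_kg": 107.5}]
joined = left_join(weights, metadata)
# joined[0] now carries both the weight and the treatment assignment
```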
Conclusion
Structuring your research data properly is an investment that pays dividends throughout your project:
- During data collection: Reduces errors and makes data entry consistent
- During analysis: Makes statistical analysis straightforward and reproducible
- During review: Allows collaborators and reviewers to understand your data
- During publication: Makes data sharing and archiving simple
- For future use: Enables re-analysis with new methods or questions
Key Takeaways
- Think database, not notebook - Structure data for machines to read
- One observation per row - Use long format for analysis
- One data type per column - Be consistent
- Use the three-table approach - Separate metadata, measurements, and group data
- Document everything - Create DATA.md, EXPERIMENTAL_DESIGN.md, and STATS.md
- Follow naming conventions - lowercase_with_underscores
- Be consistent - Use the same structure across all your trials
By following these principles, you'll make your data ready for modern AI-assisted analysis while also making it more accessible to your colleagues, reviewers, and your future self.
Additional Resources
Tools for Data Management
- R packages: `dplyr`, `tidyr` for data manipulation
- Python libraries: `pandas` for data handling
- Excel features: Tables, Data Validation, Power Query
- Database software: Access, SQLite for larger datasets
Further Reading
- Tidy Data by Hadley Wickham (foundational paper on data structure)
- R for Data Science - free online book
- FAIR Data Principles - Findable, Accessible, Interoperable, Reusable
Questions or Feedback?
If you have questions about structuring your research data or want to discuss specific challenges with your trials, feel free to reach out on LinkedIn.
Last updated: November 20, 2024