ChatGPT for Data Analysis

Introduction

Most data analysis tools require you to write code, configure an environment, or know the right library. ChatGPT's Advanced Data Analysis mode — also known as Code Interpreter — removes all of that friction. You upload a file, describe what you want to understand, and ChatGPT writes and runs Python code on your behalf, shows you the output, and explains what it found.

This capability is transformative for anyone who works with data but does not have a programming background. It is equally powerful for developers who want to skip repetitive data wrangling and jump straight to insights. This chapter covers how to use Advanced Data Analysis effectively: uploading files, exploring datasets, generating charts, cleaning data, and running a complete sales analysis from end to end.

1. What Is Advanced Data Analysis (Code Interpreter)?

Advanced Data Analysis is a ChatGPT Plus feature available in GPT-4o. When enabled, ChatGPT can:

Accept file uploads (CSV, Excel, JSON, PDF, images, and more)
Write and execute Python code in a sandboxed environment
Display charts, tables, and computation results inline in the chat
Download processed files back to your computer

This is different from asking ChatGPT to "write pandas code" that you then run yourself. In Code Interpreter mode, ChatGPT actually runs the code and shows you the result.

How to Enable It

Start a new conversation in ChatGPT Plus (GPT-4o).
Click the paperclip icon in the message bar to attach a file.
Upload your CSV or Excel file.
Once the file is visible in the chat, you can start asking questions.

Code Interpreter activates automatically when you upload a file to a GPT-4o conversation.

2. Uploading Files and First Exploration

The first thing to do with any new dataset is understand its shape: how many rows, how many columns, what types of data, and whether there are obvious quality issues.

The Exploration Prompt Sequence

Once you have uploaded your file, run these prompts in order:

Step 1 — Basic overview:

Give me an overview of this dataset. How many rows
and columns does it have? What are the column names
and data types? Show me the first 5 rows.

Step 2 — Summary statistics:

Show me descriptive statistics for all numeric
columns: mean, median, min, max, and standard
deviation. Highlight any columns with unusually
high variance or suspicious values.

Step 3 — Missing data check:

Check for missing values in every column. Tell me:
- Which columns have missing values
- How many missing values in each
- What percentage of that column is missing
- Your recommendation for how to handle each case

Step 4 — Duplicate check:

Check for duplicate rows. If any exist, show me
examples and ask me whether I want them removed.

These four steps take under five minutes and give you a complete picture of your dataset's health before you do any analysis.

3. Common Dataset Formats and What to Upload

Format	How to Upload	Notes
CSV	Directly via attachment	Most reliable; UTF-8 encoding preferred
Excel (.xlsx)	Directly via attachment	ChatGPT reads the first sheet by default; specify sheet name if needed
JSON	Directly via attachment	Works well for flat JSON arrays
Google Sheets	Export as CSV, then upload	Direct Google Sheets integration is not available in ChatGPT
PDF with tables	Upload PDF	Works for simple tables; complex PDFs may need manual extraction

Tip: If your Excel file has multiple sheets, tell ChatGPT which one to use:

The file has 3 sheets. Please use the sheet named
"Sales_Data" for this analysis.

4. Generating Charts and Visualisations

ChatGPT Code Interpreter uses Python's matplotlib and seaborn libraries to generate charts. You do not need to know these libraries — just describe what you want.

Chart Types and When to Use Them

Chart Type	Best For	Example Prompt
Bar chart	Comparing categories	"Show me total sales by product category as a bar chart"
Line chart	Trends over time	"Plot monthly revenue as a line chart for 2024"
Pie chart	Share of a whole	"Show the share of each payment method used"
Scatter plot	Relationship between two variables	"Is there a correlation between discount percentage and order value?"
Histogram	Distribution of a numeric variable	"Show the distribution of order values"
Heatmap	Correlations between multiple variables	"Show a correlation heatmap for all numeric columns"
Box plot	Spread and outliers	"Show the distribution of delivery times by city"

Example Chart Prompts

Sales trend:

Create a line chart showing monthly total sales
revenue for 2024 and 2025 on the same plot.
Use different colours for each year.
Add a clear title and axis labels.

Category comparison:

Create a horizontal bar chart showing total
revenue by product category, sorted from
highest to lowest. Show the exact value at
the end of each bar.

Correlation analysis:

Create a scatter plot with "discount_percentage"
on the x-axis and "order_value" on the y-axis.
Add a trend line. Does a higher discount correlate
with higher or lower order values in this data?

5. Writing and Running Pandas Code

Even if you know pandas, it is faster to describe what you want in English and let ChatGPT write and run the code. The result appears instantly — no local environment setup needed.

Useful Operations to Request

Filtering:

Show me all orders where the order value was
above ₹5,000 and the customer city was either
Mumbai or Delhi.

Grouping and aggregation:

Group the data by product category and calculate:
total revenue, total units sold, average order
value, and number of unique customers.
Sort by total revenue descending.

Adding calculated columns:

Add a new column called "revenue" calculated as
units_sold multiplied by unit_price.
Then add a column called "profit" calculated as
revenue multiplied by the margin percentage
in the "margin" column.

Date-based operations:

The "order_date" column is in DD/MM/YYYY format.
Convert it to a proper datetime column. Then add
new columns for: year, month name, day of week,
and whether the order was placed on a weekend.

Merging datasets:

I have uploaded two files: "orders.csv" and
"customers.csv". Merge them on the "customer_id"
column. Keep all orders, and add the customer's
name, city, and tier (from the customers file)
to each row.

6. Cleaning Data

Real-world data is messy. Code Interpreter is excellent at cleaning — standardising formats, handling nulls, removing duplicates, and fixing inconsistencies.

Common Cleaning Tasks

Standardising text:

The "city" column has inconsistent capitalisation
and some entries have extra spaces (e.g., " mumbai",
"DELHI", "Bangalore "). Clean this column so all
city names are in Title Case with no extra spaces.

Handling missing values:

The "phone_number" column has 12% missing values.
Replace missing values with the string "Not Provided".
For the "order_value" column (3% missing), replace
missing values with the median order value.

Removing duplicates:

Remove duplicate rows based on "order_id" — keep
the first occurrence and drop the rest.
Tell me how many rows were removed.

Fixing data types:

The "order_value" column is stored as text with
the ₹ symbol (e.g., "₹1,299.00"). Convert it to
a numeric float column.

Outlier detection:

In the "order_value" column, identify any values
that are more than 3 standard deviations from
the mean. Show me those rows — they may be
data entry errors.

7. Worked Example — Analysing a Sales Dataset to Find the Top-Performing Product

Let us walk through a complete data analysis task from file upload to business insight.

Scenario: You are a business analyst at a D2C skincare brand in Pune. Your manager wants to know which product generated the most revenue in Q4 2024, whether any product is gaining momentum month-on-month, and which city has the highest average order value.

The dataset is an Excel file named q4_sales_2024.xlsx with columns:

order_id — unique identifier
order_date — date of purchase (DD/MM/YYYY)
product_name — name of the product
category — skincare / haircare / wellness
units_sold — number of units in the order
unit_price — price per unit in rupees
customer_city — city of the customer
payment_method — UPI / Credit Card / COD / EMI

Step 1 — Upload and Explore

Upload the file and run:

Give me an overview of this dataset.
Rows, columns, data types, first 5 rows,
and a check for missing values.

Suppose ChatGPT reports: 4,800 rows, 8 columns, the order_date column is a string (not datetime), and units_sold has 23 missing values.

Step 2 — Clean the Data

Fix these issues:
1. Convert order_date from string DD/MM/YYYY
   to a proper datetime column
2. For missing units_sold values, fill with 1
   (assume single-unit orders)
3. Add a "revenue" column = units_sold x unit_price

Step 3 — Find the Top-Performing Product

Which product generated the highest total revenue
in Q4 2024 (October, November, December)?
Show me the top 10 products by total revenue
as a table and as a bar chart.

ChatGPT runs the code, returns a table, and renders a bar chart. Suppose "Vitamin C Brightening Serum" is clearly number one with ₹18.4 lakh in revenue.

Step 4 — Analyse Month-on-Month Momentum

For the top 5 products by total revenue, show me
their monthly revenue for October, November, and
December 2024 as a line chart.
Which product showed the strongest growth trend
across these three months?

This reveals that while the Serum had the highest total revenue, the "Retinol Night Cream" grew 40% month-on-month — a potentially more important insight for future inventory planning.

Step 5 — City-Level Analysis

Which 5 cities have the highest average order value?
Show as a table with: city, total orders, total revenue,
and average order value. Sort by average order value
descending.

Step 6 — Export the Cleaned Data

Export the cleaned DataFrame (with the revenue column
and corrected date column) as a new CSV file called
"q4_sales_2024_cleaned.csv" so I can download it.

ChatGPT generates a download link for the processed file.

Summary of findings to present to your manager:

Write a 5-bullet executive summary of the key findings
from this analysis, suitable for a non-technical manager.
Use INR figures and specific product names.

This complete workflow — from raw Excel file to polished business insights — takes roughly 20–30 minutes in Code Interpreter, compared to several hours of manual pandas work.

8. Limitations of Code Interpreter

Understanding what Code Interpreter cannot do is as important as knowing what it can.

Limitation	Workaround
Session memory does not persist	Re-upload your file in a new session if needed
Cannot connect to live databases or APIs	Export data from your database to CSV first
Generated charts cannot be directly embedded in reports	Download the chart image and insert it manually
File size limit (approximately 512 MB)	Split large datasets into multiple files or aggregate upstream
Cannot run code outside Python (e.g., R, SQL against a live DB)	Ask for Python equivalents of R or SQL operations
Computations on very large datasets can time out	Aggregate or sample the data before uploading

Common Pitfalls

1. Not checking the first 5 rows before asking analysis questions. If your date column is formatted incorrectly or your numeric column has currency symbols, ChatGPT will either error or produce wrong results. Always explore first, clean second, analyse third.

2. Uploading the wrong sheet of an Excel file. If your workbook has multiple sheets, ChatGPT defaults to the first. Always specify the sheet name if your data is not on sheet 1.

3. Assuming ChatGPT knows your business context. "Which product is performing well?" requires context. What does "performing well" mean in your business — revenue, units sold, profit margin, growth rate? Always define your success metric.

4. Not downloading the cleaned file. After a data cleaning session, always ask ChatGPT to export the cleaned DataFrame as a CSV. Otherwise you will need to repeat the cleaning in the next session.

5. Trusting charts without checking the underlying numbers. A chart can look convincing but be built on a calculation error. After generating a chart, ask: "Show me the underlying data table that this chart is based on."

6. Ignoring outliers. An average order value of ₹850 sounds reasonable, but if 3 orders were ₹50,000 each, they are skewing the average. Always ask ChatGPT to check for outliers before reporting averages.

Practice Exercises

Create a simple CSV file in Excel or Google Sheets with at least 50 rows of made-up sales data (product, date, quantity, price). Upload it to ChatGPT Code Interpreter and ask for the full exploration sequence: overview, statistics, missing values, duplicates.
Ask ChatGPT to generate a correlation heatmap for the numeric columns in your dataset. Then ask it to interpret the heatmap in plain language: which variables are most strongly related?
Upload a real CSV export from any app you use (Zerodha portfolio export, Swiggy order history, Splitwise expenses). Ask ChatGPT to tell you 3 interesting things it notices in your data.
Practice data cleaning: introduce deliberate errors into your CSV — mixed case in a text column, currency symbols in a number column, a few blank rows. Upload it and ask ChatGPT to identify and fix all data quality issues.
Run the full worked example workflow on a dataset of your choice. End by asking ChatGPT to write a 5-bullet executive summary of the findings and a 3-point recommendation for a business decision based on the data.

Summary

Advanced Data Analysis (Code Interpreter) in ChatGPT Plus allows you to upload files and have ChatGPT write, run, and explain Python code on your behalf — no local environment needed.
Always follow the four-step exploration sequence before analysis: overview, summary statistics, missing value check, duplicate check.
Specify chart type, axes, labels, and groupings explicitly — the more precise your chart prompt, the more useful the output.
Code Interpreter handles the full data wrangling pipeline: type conversion, cleaning text, handling nulls, removing duplicates, adding calculated columns, and merging files.
The worked sales analysis (top product, growth trend, city-level AOV) is a repeatable template for any business dataset.
Always ask ChatGPT to export the cleaned file for download so your work is preserved beyond the session.
Check the underlying data table behind every chart and verify outliers before reporting aggregate statistics.