Chapter 15 of 20

Building Applications with the ChatGPT API

Learn streaming responses, conversation history management, function calling, JSON mode, and build a complete CLI chatbot in Python with production best practices.

Meritshot · 15 min read

The previous chapter covered the fundamentals: getting a key, understanding tokens, and making single API calls. Now we go deeper. Real applications are more complex — they stream responses so users see text as it generates, they maintain multi-turn conversation history, they call external tools via function calling, and they need to handle errors and rate limits without crashing.

This chapter builds up to a complete, working CLI chatbot in Python that you can run on your own machine today.


1. Streaming Responses

By default, the API waits until the model has finished generating the entire response, then returns it all at once. For a short response this is fine. For a 500-word essay or a long piece of code, the user stares at a blank screen for several seconds before seeing anything. That is a poor experience.

Streaming solves this: the API sends tokens as they are generated, and your application displays them progressively — just like watching ChatGPT type in the web interface.

Enabling Streaming in Python

Add stream=True to your API call. The response is now an iterator of chunks rather than a single object:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain the history of the Indian rupee in 3 paragraphs."}
    ],
    stream=True
)

# Print each token as it arrives
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

print()  # newline after stream ends

The flush=True argument to print() ensures the output appears immediately rather than buffering. Without it, Python may batch terminal output and the streaming effect is lost.
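
The loop above prints tokens but discards two things an application usually needs afterwards: the complete text and the reason the stream ended. The following is a minimal sketch of collecting both. The helper names collect_stream and fake_chunk are hypothetical, and fake_chunk only imitates the shape of a streamed chunk so the example runs without an API key; in real use you would pass the stream object returned by the API call.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Print streamed text as it arrives; return (full_text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        if not chunk.choices:
            continue  # tolerate chunks that carry no choices
        choice = chunk.choices[0]
        if choice.delta.content:
            print(choice.delta.content, end="", flush=True)
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason  # set on the final chunk
    print()
    return "".join(parts), finish_reason

def fake_chunk(content=None, finish_reason=None):
    """Hypothetical stand-in that mimics the shape of a streamed chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)])

demo = [fake_chunk("The rupee "), fake_chunk("has a long history"), fake_chunk(None, "length")]
text, reason = collect_stream(demo)
if reason == "length":
    print("[response was cut off by the token limit]")
```

Returning the finish reason alongside the text lets the caller decide what to do with a truncated answer instead of silently presenting it as complete.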

Streaming in a Web Application

In a web application (Flask, FastAPI, Django), you would typically return the streaming response as a Server-Sent Events (SSE) stream. The browser receives chunks and updates the DOM progressively, much like the ChatGPT web interface. The pattern is:

import json
import os

from flask import Flask, Response, stream_with_context
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

@app.route("/chat")
def chat():
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Tell me about Bengaluru's tech ecosystem."}],
            stream=True
        )
        for chunk in stream:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            content = chunk.choices[0].delta.content
            if content:
                # JSON-encode each token so that newlines inside it cannot
                # break SSE framing; the browser decodes it with JSON.parse
                yield f"data: {json.dumps(content)}\n\n"

    return Response(stream_with_context(generate()), mimetype="text/event-stream")
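
One detail of SSE framing is easy to miss: a blank line ends an event, so a token that itself contains newlines (common in generated code) would split or corrupt the event if written raw. The sketch below shows one way to handle this by JSON-encoding each token. The helpers sse_event and parse_sse are hypothetical names used only for illustration, and parse_sse stands in for what the browser would do.

```python
import json

def sse_event(token: str) -> str:
    """Frame one token as a Server-Sent Event with a JSON-encoded payload."""
    return f"data: {json.dumps(token)}\n\n"

def parse_sse(stream_text: str) -> list:
    """Decode a buffer of events produced by sse_event back into tokens."""
    tokens = []
    for event in stream_text.split("\n\n"):
        if event.startswith("data: "):
            tokens.append(json.loads(event[len("data: "):]))
    return tokens

# Tokens containing newlines survive the round trip unchanged
tokens = ["def f():", "\n", "    return 1", "\n\n", "done"]
wire = "".join(sse_event(t) for t in tokens)
assert parse_sse(wire) == tokens
```

On the browser side, the equivalent of parse_sse would be calling JSON.parse on each event's data field.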

2. Managing Conversation History

The API is stateless. Every request you make starts fresh — the model has no memory of previous calls in the same session unless you explicitly include prior messages in the messages array.

This means your application is responsible for:

  1. Storing messages as the conversation progresses
  2. Appending both user messages and assistant responses to the history
  3. Sending the full history with each new API call

The Conversation History Pattern

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Start with a system prompt
conversation_history = [
    {
        "role": "system",
        "content": "You are a knowledgeable assistant about Indian mutual funds. Answer clearly and always recommend users consult a SEBI-registered advisor for investment decisions."
    }
]

def chat(user_message: str) -> str:
    # Add the new user message to history
    conversation_history.append({"role": "user", "content": user_message})
    
    # Call the API with the full history
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_history
    )
    
    # Extract the assistant's reply
    assistant_message = response.choices[0].message.content
    
    # Append it to history for the next turn
    conversation_history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message

# Example conversation
print(chat("What is the difference between ELSS and PPF?"))
print(chat("Which one is better for tax saving?"))
print(chat("Can you explain the lock-in period for the first one you mentioned?"))

The third message ("the first one you mentioned") works because the model has access to the full prior exchange in conversation_history.

The Context Window Problem

Every model has a maximum context window — the total number of tokens it can process in a single request (input + output combined). For gpt-4o-mini, this is 128,000 tokens. That sounds enormous, but a long conversation accumulates quickly, especially if responses are detailed.

When history grows too large, you have three options:

Option A — Truncation: Keep only the last N messages. Simple, but the model loses early context.

MAX_HISTORY = 20  # keep last 20 messages
if len(conversation_history) > MAX_HISTORY + 1:  # +1 for system message
    # Always keep system message + last MAX_HISTORY messages
    conversation_history = [conversation_history[0]] + conversation_history[-(MAX_HISTORY):]
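
Counting messages is only a rough proxy for size, since one long reply can outweigh ten short ones. A possible refinement is to trim to an approximate token budget instead. The sketch below is illustrative: estimate_tokens and trim_to_budget are hypothetical helpers, and the figure of roughly 4 characters per token is a rule of thumb for English text, not an exact count (a tokenizer library such as tiktoken would give real numbers).

```python
def estimate_tokens(message: dict) -> int:
    """Very rough estimate: about 4 characters per token, plus a small per-message overhead."""
    return len(message["content"]) // 4 + 4

def trim_to_budget(history: list, max_tokens: int) -> list:
    """Keep the system message plus the newest messages that fit within max_tokens."""
    system, rest = history[0], history[1:]
    budget = max_tokens - estimate_tokens(system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return [system] + kept[::-1]  # restore chronological order

history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_to_budget(history, max_tokens=35)
print([m["content"][0] for m in trimmed])
```

With this budget the oldest user message no longer fits, so only the system message and the two newest messages are kept.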

Option B — Summarisation: Periodically ask the model to summarise the conversation so far, then replace the accumulated history with a single summary message.
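
A minimal sketch of the summarisation approach follows. The function compact_history is a hypothetical helper, and the summarise argument is a placeholder for whatever produces the summary; in a real application it would most likely make its own API call asking the model to condense the older messages. Here a stub is passed in so the example runs offline.

```python
from typing import Callable

def compact_history(history: list, summarise: Callable[[list], str], keep_last: int = 4) -> list:
    """Replace older messages with one summary message; keep the newest keep_last as-is."""
    system, rest = history[0], history[1:]
    if len(rest) <= keep_last:
        return history  # nothing worth summarising yet
    split = len(rest) - keep_last
    older, recent = rest[:split], rest[split:]
    summary_message = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarise(older)}",
    }
    return [system, summary_message] + recent

history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(6):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

# Stub summariser (an assumption for the demo); a real one would call the model
compacted = compact_history(history, summarise=lambda msgs: f"{len(msgs)} earlier messages", keep_last=2)
print(len(history), "->", len(compacted))
```

Keeping the last few messages verbatim preserves the immediate context, while the summary carries forward the gist of everything older.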

Option C — Semantic search: Store messages in a vector database and retrieve only the most relevant prior messages for each new query (Retrieval-Augmented Generation). This is more complex but scales to very long conversations.
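
To make the retrieval idea concrete without a vector database, the toy sketch below ranks past messages by similarity to the new query. It deliberately swaps real embeddings for plain word counts with cosine similarity, which is far cruder than what a production system would use; vectorise, cosine, and retrieve are hypothetical names for illustration only.

```python
import math
import re
from collections import Counter

def vectorise(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, past_messages: list, top_k: int = 2) -> list:
    """Return the top_k past messages most similar to the query."""
    q = vectorise(query)
    ranked = sorted(past_messages, key=lambda m: cosine(q, vectorise(m["content"])), reverse=True)
    return ranked[:top_k]

past = [
    {"role": "user", "content": "What is the lock-in period for ELSS funds?"},
    {"role": "user", "content": "Suggest a good biryani place in Hyderabad"},
    {"role": "assistant", "content": "ELSS funds have a three year lock-in period"},
]
relevant = retrieve("ELSS lock-in period", past, top_k=2)
for m in relevant:
    print(m["content"])
```

Only the retrieved messages, plus the system prompt and the new question, would then be sent to the model, which keeps the request small however long the stored conversation grows.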


3. Function Calling (Tool Use)

Function calling is one of the most powerful API features. It allows you to define functions that the model can "call" when it determines that an external data source or action is needed to answer the user's question. The model does not actually execute your code — it generates a structured JSON object specifying which function to call and with what arguments. Your application then calls the real function and returns the result to the model.

The Flow

1. User asks a question
2. You send the question + function definitions to the API
3. Model responds with a "tool_call" instead of a text answer
4. Your application executes the real function with the model's arguments
5. You send the function's result back to the model
6. Model generates a final natural-language response using the result

Defining Tools

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for an Indian company listed on NSE or BSE.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker_symbol": {
                        "type": "string",
                        "description": "The NSE/BSE ticker symbol, e.g. RELIANCE, TCS, INFY"
                    },
                    "exchange": {
                        "type": "string",
                        "enum": ["NSE", "BSE"],
                        "description": "The stock exchange"
                    }
                },
                "required": ["ticker_symbol"]
            }
        }
    }
]

Handling a Tool Call

import json

def get_stock_price(ticker_symbol: str, exchange: str = "NSE") -> dict:
    # In a real app, this would call a market data API
    # Here we return mock data
    mock_prices = {"RELIANCE": 2950.50, "TCS": 3820.00, "INFY": 1680.25}
    price = mock_prices.get(ticker_symbol.upper())
    if price is not None:
        return {"ticker": ticker_symbol, "price": price, "currency": "INR", "exchange": exchange}
    return {"error": f"Ticker {ticker_symbol} not found"}

# Map tool names to the real Python functions they stand for
AVAILABLE_FUNCTIONS = {"get_stock_price": get_stock_price}

def chat_with_tools(user_message: str) -> str:
    # Uses the client and tools objects defined earlier in this chapter
    messages = [
        {"role": "system", "content": "You are a helpful stock market assistant for Indian investors."},
        {"role": "user", "content": user_message}
    ]
    
    # First call: model decides whether to use a tool
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"  # model decides when to use tools
    )
    
    choice = response.choices[0]
    
    # No tool call: return the direct response
    if not choice.message.tool_calls:
        return choice.message.content
    
    # Add the assistant's tool-call message to the message history
    messages.append(choice.message)
    
    # The model may request several calls in one turn;
    # every tool_call_id must receive its own "tool" message
    for tool_call in choice.message.tool_calls:
        function = AVAILABLE_FUNCTIONS.get(tool_call.function.name)
        if function is None:
            function_result = {"error": f"Unknown function {tool_call.function.name}"}
        else:
            function_args = json.loads(tool_call.function.arguments)
            # Execute the real function
            function_result = function(**function_args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(function_result)
        })
    
    # Second call: model generates a natural-language response using the results
    final_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    return final_response.choices[0].message.content

print(chat_with_tools("What is the current price of TCS?"))

Function calling is the mechanism behind AI agents that can search the web, query databases, send emails, or call any API you define.
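
Because the model writes the arguments, it is worth checking them before the real function runs. The sketch below shows one possible validation step for the get_stock_price tool. The helper parse_stock_args and its ticker pattern are assumptions made for illustration; a real system would validate against whatever symbols its market data provider actually accepts.

```python
import json
import re

ALLOWED_EXCHANGES = ("NSE", "BSE")
TICKER_PATTERN = re.compile(r"[A-Za-z0-9&-]{1,20}")  # illustrative rule only

def parse_stock_args(raw_arguments: str) -> dict:
    """Parse and validate model-generated arguments for get_stock_price."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("Arguments must be a JSON object.")
    unexpected = set(args) - {"ticker_symbol", "exchange"}
    if unexpected:
        raise ValueError(f"Unexpected arguments: {sorted(unexpected)}")
    ticker = args.get("ticker_symbol")
    if not isinstance(ticker, str) or not TICKER_PATTERN.fullmatch(ticker):
        raise ValueError("ticker_symbol must be a short ticker-like string.")
    exchange = args.get("exchange", "NSE")
    if exchange not in ALLOWED_EXCHANGES:
        raise ValueError("exchange must be NSE or BSE.")
    return {"ticker_symbol": ticker.upper(), "exchange": exchange}

print(parse_stock_args('{"ticker_symbol": "tcs"}'))
```

If validation fails, one reasonable option is to send the error text back to the model as the tool result, so it can correct itself or explain the problem to the user.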


4. Structured Outputs — JSON Mode

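Sometimes you need the model to return data in a specific format that your application can parse: structured JSON rather than prose. The API offers two mechanisms for this: JSON mode, covered below, which is designed to return syntactically valid JSON, and Structured Outputs (a response_format of type "json_schema"), which additionally constrains the response to a schema you supply.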

JSON Mode

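Add response_format={"type": "json_object"} to your call. The model will then return syntactically valid JSON, but the schema is still controlled only through your prompt. Note that the API requires the word "JSON" to appear somewhere in your messages when this mode is enabled, as it does in the example below: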

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction assistant. Always respond with valid JSON."
        },
        {
            "role": "user",
            "content": """Extract the following details from this job posting and return as JSON:
            
            Job posting: "We are hiring a Senior Python Developer at our Pune office. 
            CTC: ₹18–24 LPA. Requirements: 5+ years Python, Django, PostgreSQL. 
            Apply by 31 July 2026."
            
            Extract: job_title, location, salary_range, required_skills (list), application_deadline"""
        }
    ],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
print(data)

Output:

{
  "job_title": "Senior Python Developer",
  "location": "Pune",
  "salary_range": "₹18–24 LPA",
  "required_skills": ["Python", "Django", "PostgreSQL"],
  "application_deadline": "31 July 2026"
}
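
Since JSON mode does not enforce a schema, it is prudent to check the parsed result before using it. The sketch below validates the fields requested in the prompt above. The helper parse_job_posting and the REQUIRED_FIELDS table are hypothetical, written for this example only.

```python
import json

# Field names come from the prompt above; the expected types are assumptions
REQUIRED_FIELDS = {
    "job_title": str,
    "location": str,
    "salary_range": str,
    "required_skills": list,
    "application_deadline": str,
}

def parse_job_posting(raw: str) -> dict:
    """Parse the model's JSON reply and check that it has the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object.")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field} should be of type {expected_type.__name__}.")
    return data

sample = """{"job_title": "Senior Python Developer", "location": "Pune",
"salary_range": "18-24 LPA", "required_skills": ["Python", "Django", "PostgreSQL"],
"application_deadline": "31 July 2026"}"""
print(parse_job_posting(sample)["required_skills"])
```

In real use you would pass response.choices[0].message.content instead of the sample string, and treat a ValueError as a signal to retry or log the response.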

When to Use Structured Output

Structured output is essential when:

  • Your application needs to parse the model's response programmatically
  • You are feeding the model's output into a database, a UI component, or another system
  • You want consistent, predictable response shapes rather than free-form prose
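
When a consistent shape matters, the API also supports a stricter option than JSON mode, usually called Structured Outputs, in which you supply a JSON Schema through response_format. The sketch below shows what such a definition might look like for the job posting example. The name job_posting and the exact field list are assumptions for illustration, and model support should be checked against the current documentation.

```python
# Hypothetical schema for the job posting example; in strict mode every
# property is listed under "required" and additional properties are disallowed
job_posting_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "job_posting",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "job_title": {"type": "string"},
                "location": {"type": "string"},
                "salary_range": {"type": "string"},
                "required_skills": {"type": "array", "items": {"type": "string"}},
                "application_deadline": {"type": "string"},
            },
            "required": [
                "job_title",
                "location",
                "salary_range",
                "required_skills",
                "application_deadline",
            ],
            "additionalProperties": False,
        },
    },
}

# It would be passed in place of the JSON mode setting, for example:
# client.chat.completions.create(..., response_format=job_posting_format)
print(sorted(job_posting_format["json_schema"]["schema"]["properties"]))
```

The trade-off is more upfront definition in exchange for not having to describe the schema in the prompt and hope the model follows it.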

5. Building a Complete CLI Chatbot in Python

Now let us combine history management, streaming, and a system prompt into a complete, working CLI chatbot. This is the kind of tool you could actually use for daily work.

Full Code

Save this as chatbot.py (it needs the openai and python-dotenv packages installed):

import os
import sys
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are a helpful assistant for Indian professionals and students.
You answer questions clearly and concisely. When asked about financial, legal, or medical topics,
provide helpful general information while recommending consultation with a qualified professional.
You are familiar with Indian context: rupees, Indian companies, Indian law, Indian education system."""

MAX_HISTORY_MESSAGES = 20  # beyond system prompt

def truncate_history(history: list) -> list:
    """Keep system message + last MAX_HISTORY_MESSAGES messages."""
    if len(history) <= MAX_HISTORY_MESSAGES + 1:
        return history
    return [history[0]] + history[-(MAX_HISTORY_MESSAGES):]

def stream_response(messages: list) -> str:
    """Stream the model's response and return the full text."""
    full_response = ""
    
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
        temperature=0.7,
        max_tokens=1000
    )
    
    print("\nAssistant: ", end="", flush=True)
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
            full_response += delta.content
    print("\n")
    
    return full_response

def main():
    print("=== ChatBot (type 'quit' or 'exit' to stop, 'clear' to reset) ===\n")
    
    conversation_history = [
        {"role": "system", "content": SYSTEM_PROMPT}
    ]
    
    while True:
        try:
            user_input = input("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nGoodbye!")
            sys.exit(0)
        
        if not user_input:
            continue
        
        if user_input.lower() in ("quit", "exit"):
            print("Goodbye!")
            break
        
        if user_input.lower() == "clear":
            conversation_history = [{"role": "system", "content": SYSTEM_PROMPT}]
            print("Conversation cleared.\n")
            continue
        
        # Add user message
        conversation_history.append({"role": "user", "content": user_input})
        
        # Truncate if needed
        conversation_history = truncate_history(conversation_history)
        
        try:
            assistant_reply = stream_response(conversation_history)
            conversation_history.append({"role": "assistant", "content": assistant_reply})
        except Exception as e:
            print(f"\nError: {e}\n")
            # Remove the failed user message from history
            conversation_history.pop()

if __name__ == "__main__":
    main()

Run it:

python chatbot.py

You now have a streaming, multi-turn chatbot in the terminal with history management and graceful error handling.


6. Production Best Practices

Moving from a working script to a reliable production service requires handling the messy realities of the real world: API errors, rate limits, and unexpected inputs.

Rate Limits

OpenAI enforces rate limits on two dimensions:

  • RPM (Requests Per Minute) — how many calls you can make per minute
  • TPM (Tokens Per Minute) — how many tokens you can process per minute

Rate limits vary by tier. New accounts have lower limits; as you spend more, limits increase. When you exceed a rate limit, the API returns a 429 error.

Retry with Exponential Backoff

The standard pattern for handling rate limits and transient errors is exponential backoff — wait a short time after the first failure, longer after the second, and so on:

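import os
import random
import time

from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

# Note: the OpenAI Python client also retries some errors on its own
# (see its max_retries option); this function adds explicit control on top
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def call_with_retry(messages: list, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=500
            )
            return response.choices[0].message.content
        
        except (RateLimitError, APIConnectionError) as e:
            # 429 responses and network failures are usually transient
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"{type(e).__name__}. Retrying in {wait_time:.1f} seconds...")
            time.sleep(wait_time)
        
        except APIStatusError as e:
            # Only APIStatusError carries status_code; retry server errors only,
            # since other statuses (e.g. 400, 401) will not succeed on retry
            if e.status_code in (500, 503) and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"API error {e.status_code}. Retrying in {wait_time:.1f} seconds...")
                time.sleep(wait_time)
            else:
                raise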
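
To see what schedule this produces, the sketch below computes the waits on their own, without calling the API. The function backoff_delays is a hypothetical helper, and the cap parameter is an addition not present in the code above: it stops the doubling from growing without bound when max_retries is large.

```python
import random

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0, rng=None) -> list:
    """Return the wait in seconds before each retry.

    Each wait is min(cap, base * 2**attempt) plus jitter between 0 and 1.
    max_retries attempts means at most max_retries - 1 waits.
    """
    rng = rng or random.Random()
    return [min(cap, base * 2 ** attempt) + rng.uniform(0, 1) for attempt in range(max_retries - 1)]

# A seeded generator makes the demo repeatable
delays = backoff_delays(5, rng=random.Random(0))
print([round(d, 1) for d in delays])
```

The random jitter matters when many clients are rate limited at the same moment: without it, they would all retry in lockstep and collide again.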

Input Validation

Before sending user input to the API, validate and sanitise it:

MAX_INPUT_LENGTH = 4000  # characters

def validate_input(user_input: str) -> str:
    if not user_input or not user_input.strip():
        raise ValueError("Input cannot be empty.")
    if len(user_input) > MAX_INPUT_LENGTH:
        raise ValueError(f"Input too long. Maximum {MAX_INPUT_LENGTH} characters.")
    return user_input.strip()

Cost Controls

Set usage limits in your OpenAI account dashboard to prevent unexpected bills. In your application, log token usage per request and aggregate daily costs:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def call_with_cost_tracking(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    
    usage = response.usage
    # gpt-4o-mini pricing (approximate, check current rates)
    input_cost = (usage.prompt_tokens / 1_000_000) * 0.15
    output_cost = (usage.completion_tokens / 1_000_000) * 0.60
    total_cost = input_cost + output_cost
    
    logger.info(
        f"Tokens: {usage.prompt_tokens} in / {usage.completion_tokens} out | "
        f"Cost: ${total_cost:.6f}"
    )
    
    return response.choices[0].message.content
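
The function above logs the cost of each request, but aggregating by day needs a little more state. The following is a minimal in-memory sketch of that aggregation. DailyCostTracker is a hypothetical class, the prices reuse the approximate gpt-4o-mini rates from the code above, and a real service would persist the totals rather than hold them in memory.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

# Approximate gpt-4o-mini rates from the example above; check current pricing
INPUT_USD_PER_MILLION = 0.15
OUTPUT_USD_PER_MILLION = 0.60

class DailyCostTracker:
    """Accumulate estimated spend per calendar day and compare it to a budget."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self.totals = defaultdict(float)

    def record(self, prompt_tokens: int, completion_tokens: int, day: Optional[date] = None) -> float:
        """Add one request's estimated cost to the day's total and return that cost."""
        day = day or date.today()
        cost = (prompt_tokens / 1_000_000) * INPUT_USD_PER_MILLION
        cost += (completion_tokens / 1_000_000) * OUTPUT_USD_PER_MILLION
        self.totals[day] += cost
        return cost

    def over_budget(self, day: Optional[date] = None) -> bool:
        """True once the day's total has reached the budget."""
        day = day or date.today()
        return self.totals.get(day, 0.0) >= self.daily_budget_usd

tracker = DailyCostTracker(daily_budget_usd=0.001)
demo_day = date(2026, 1, 1)
first_cost = tracker.record(prompt_tokens=1000, completion_tokens=500, day=demo_day)
print(f"${first_cost:.6f}", tracker.over_budget(demo_day))
```

In use, you would call tracker.record(usage.prompt_tokens, usage.completion_tokens) after each response and refuse or defer new requests once over_budget() returns True.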

Model Fallback

If your primary model is unavailable or too slow, fall back to a faster, cheaper model:

def call_with_fallback(messages: list) -> str:
    for model in ["gpt-4o", "gpt-4o-mini"]:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=10.0  # 10-second timeout
            )
            return response.choices[0].message.content
        except Exception as e:
            logger.warning(f"Model {model} failed: {e}. Trying next.")
    raise RuntimeError("All models failed.")

Environment-Based Configuration

Avoid hardcoding model names, temperature values, or token limits. Use environment variables or a config file:

import os

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
TEMPERATURE = float(os.environ.get("OPENAI_TEMPERATURE", "0.7"))
MAX_TOKENS = int(os.environ.get("OPENAI_MAX_TOKENS", "1000"))

This lets you change behaviour across environments (development, staging, production) without code changes.
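
Values read from the environment are plain strings, so a typo only surfaces when the first API call fails. One possible refinement is to load and validate everything once at startup. In the sketch below, Settings and load_settings are hypothetical names, and the env parameter exists only so the function can be exercised with a plain dictionary.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    model: str
    temperature: float
    max_tokens: int

def load_settings(env=None) -> Settings:
    """Read settings from the environment (or a dict) and validate them once."""
    env = os.environ if env is None else env
    settings = Settings(
        model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        temperature=float(env.get("OPENAI_TEMPERATURE", "0.7")),
        max_tokens=int(env.get("OPENAI_MAX_TOKENS", "1000")),
    )
    if not 0.0 <= settings.temperature <= 2.0:
        raise ValueError("OPENAI_TEMPERATURE must be between 0 and 2.")
    if settings.max_tokens <= 0:
        raise ValueError("OPENAI_MAX_TOKENS must be a positive integer.")
    return settings

# Passing a dict makes the behaviour easy to check without touching os.environ
print(load_settings({"OPENAI_MODEL": "gpt-4o", "OPENAI_TEMPERATURE": "0.2"}))
```

Failing fast at startup makes a misconfigured deployment obvious immediately, rather than after the first user request.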


Common Pitfalls

Pitfall 1 — Not handling the stateless API correctly. Forgetting to append the assistant's response to the conversation history means the model loses context on the next turn. Every turn must append both the user message and the resulting assistant message.

Pitfall 2 — Allowing unbounded history growth. Without truncation or summarisation, long conversations will eventually exceed the context window, causing errors. Implement a history management strategy from day one.

Pitfall 3 — No retry logic in production. The OpenAI API occasionally returns 429 (rate limit) or 5xx (server error) responses. Without retry logic, your application fails on these transient errors. Exponential backoff is the standard solution.

Pitfall 4 — Treating model-generated function arguments as trusted. When you define tools, the model generates the function arguments. Validate those arguments before passing them to real functions, especially if they interact with databases or external APIs. A maliciously crafted user message could attempt prompt injection to manipulate the arguments.

Pitfall 5 — Not setting a timeout. API calls can occasionally hang. Set a timeout parameter to prevent your application from waiting indefinitely.

Pitfall 6 — Ignoring finish_reason in streaming. In a streaming response, the final chunk includes the finish_reason. If it is length, the response was cut off. Your application should handle this gracefully rather than presenting a truncated answer as complete.

Pitfall 7 — Over-engineering for day one. You do not need vector databases, caching layers, and model fallback on your first prototype. Build simple, observe real usage patterns, then optimise what actually causes problems.


Practice Exercises

  1. Extend the CLI chatbot from section 5 to display the token usage and estimated cost in rupees at the end of each response. Use the approximate rate of ₹85 per USD.

  2. Add a /summarise command to the CLI chatbot that, when typed, asks the model to summarise the conversation so far and replaces the full history with the summary as a single system-level context message.

  3. Implement function calling for a simple use case: define a get_weather function (return mock data) and a convert_currency function (convert USD to INR at a fixed rate). Build a chatbot that uses these tools when relevant.

  4. Build a batch processing script that reads 20 customer support emails from a text file (one per line), classifies each as "Billing", "Technical", "Returns", or "General" using the API with temperature=0 and response_format={"type": "json_object"}, and writes the results to a CSV file.

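  5. Implement the exponential backoff retry function from section 6 and test it. Note that an invalid API key produces a 401 authentication error, which should fail immediately rather than be retried. To exercise the retry path, simulate rate-limit or server errors (for example, by temporarily substituting a stub that raises them) and confirm that the waits grow as expected.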


Summary

  • Streaming (stream=True) sends tokens progressively to the user, eliminating blank-screen wait times and significantly improving perceived performance.
  • Conversation history is managed entirely by your application — append both user messages and assistant responses to the messages array after each turn.
  • History management (truncating to the last N messages, summarising, or retrieving only relevant messages) is necessary for long conversations to stay within the model's context window.
  • Function calling allows the model to request execution of your application's functions — the model specifies which function and with what arguments, your code executes it and returns the result, and the model uses that result in its final response.
  • JSON mode (response_format={"type": "json_object"}) makes the model return valid, parseable JSON, essential for data extraction and any integration with downstream systems.
  • The complete CLI chatbot combines several of these patterns: a system prompt, history management, streaming, error handling, and a graceful command loop.
  • Production deployments require retry logic with exponential backoff for rate limit and server errors, input validation, cost tracking, and configuration via environment variables rather than hardcoded values.
  • Build simply first — add complexity (vector search, caching, fallback models) only when real usage reveals the need.