DataFrames & Series
1. What is Pandas and why is it used?
Pandas is an open-source Python library built on top of NumPy that provides fast, flexible, and expressive data structures for working with structured (tabular) and time series data. It is used by data analysts and scientists for data manipulation, cleaning, exploration, and analysis. Its two primary data structures are the Series (one-dimensional labelled array) and the DataFrame (two-dimensional labelled table), which make working with real-world datasets intuitive.
2. What is the difference between a Series and a DataFrame?
A Series is a one-dimensional labelled array that can hold any data type — integers, strings, floats, or Python objects — with an associated index. A DataFrame is a two-dimensional labelled data structure with rows and columns, similar to a spreadsheet or SQL table. A DataFrame can be thought of as a dictionary of Series objects sharing the same index, where each column is a Series.
3. How do you create a DataFrame from a dictionary?
A DataFrame is created from a dictionary using pd.DataFrame(data), where data is a dictionary whose keys become column names and values (lists or arrays of equal length) become column data. For example, pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]}) creates a two-column DataFrame. You can also create DataFrames from lists of dictionaries, NumPy arrays, CSV files, Excel files, and SQL query results.
4. What is the difference between loc and iloc?
loc is label-based indexing that selects rows and columns by their index labels and column names, inclusive of both endpoints when slicing, for example df.loc[0:2, 'name':'age']. iloc is integer position-based indexing that selects rows and columns by their integer positions (zero-based), exclusive of the end when slicing, for example df.iloc[0:2, 0:2]. Use loc when working with named labels and iloc when working with positional indices.
5. How do you select multiple columns from a DataFrame?
Multiple columns are selected by passing a list of column names inside double brackets: df[['col1', 'col2', 'col3']]. This returns a new DataFrame containing only the specified columns. Single brackets df['col1'] return a Series. You can also use df.loc[:, ['col1', 'col2']] for label-based selection or df.iloc[:, [0, 1]] for position-based selection when you need to combine row and column selection.
6. What is the index in a Pandas DataFrame?
The index is a set of labels that uniquely identifies each row in a DataFrame or element in a Series. By default, Pandas creates a RangeIndex (0, 1, 2, …) when a DataFrame is created. The index can be set to any column using df.set_index('column_name') and reset back to the default using df.reset_index(). A meaningful index (such as dates or IDs) improves readability and enables faster label-based lookups.
7. How do you check the shape, data types, and summary of a DataFrame?
df.shape returns a tuple (rows, columns). df.dtypes returns the data type of each column. df.info() provides a concise summary including column names, non-null counts, and data types. df.describe() generates descriptive statistics (count, mean, standard deviation, min, quartiles, max) for numerical columns. df.head(n) and df.tail(n) show the first and last n rows respectively.
8. How do you rename columns in a DataFrame?
Columns are renamed using df.rename(columns={'old_name': 'new_name'}, inplace=True) to rename specific columns. To rename all columns at once, you can assign a new list directly: df.columns = ['col1', 'col2', 'col3']. The inplace=True parameter modifies the DataFrame in place; without it, a new DataFrame is returned. You can also use df.add_prefix('prefix_') or df.add_suffix('_suffix') to bulk-rename columns.
9. How do you add a new column to a DataFrame?
A new column is added by assigning a value directly: df['new_col'] = df['col1'] + df['col2']. You can also use df.assign(new_col=lambda x: x['col1'] * 2), which returns a new DataFrame without modifying the original. insert() allows adding a column at a specific position: df.insert(2, 'new_col', values). New columns can be created from expressions, constants, NumPy functions, or values from another Series.
10. What is the difference between copy() and a view in Pandas?
When you slice a DataFrame, Pandas may return a view (a reference to the original data) or a copy (a new independent object), depending on the operation. Modifying a view modifies the original DataFrame, which can cause the SettingWithCopyWarning. To avoid ambiguity, explicitly use df.copy() to create an independent copy: new_df = df[mask].copy(). This is especially important when you intend to modify a subset of a DataFrame.
Data Cleaning
11. How do you detect and handle missing values?
Missing values are detected using df.isnull().sum() to count NaNs per column or df.isnull() to get a boolean mask. They are handled by dropping rows or columns with df.dropna(), filling with a constant using df.fillna(value), forward-filling with df.ffill(), backward-filling with df.bfill(), or filling with the column mean/median/mode using df.fillna(df.mean()). The strategy depends on the proportion of missing values and domain context.
12. How do you drop duplicate rows?
Duplicate rows are identified with df.duplicated(), which returns a boolean Series marking duplicates. All duplicates including the first occurrence are found with df.duplicated(keep=False). Duplicates are removed using df.drop_duplicates(), which by default keeps the first occurrence. You can specify which columns to consider using subset=['col1', 'col2'] and control which duplicate to keep with the keep parameter ('first', 'last', or False).
13. How do you convert data types in a DataFrame?
Data types are converted using df['col'].astype(dtype) — for example, df['age'].astype(int) or df['price'].astype(float). pd.to_numeric(df['col'], errors='coerce') converts to numeric and sets non-convertible values to NaN. pd.to_datetime(df['date']) parses date strings into datetime objects. df['category'].astype('category') creates a category dtype which saves memory for columns with low cardinality. df.convert_dtypes() infers the best dtype for each column automatically.
14. How do you apply a function to a column?
Functions are applied to a column using df['col'].apply(func), where func can be a lambda or a named function. For element-wise operations on multiple columns, df.apply(func, axis=1) applies the function row-wise. df['col'].map(dict_or_func) is used for element-wise substitution using a dictionary or function. df.applymap(func) applies a function element-wise across the entire DataFrame. Vectorised operations (arithmetic, string methods) are always faster than apply.
15. How do you handle outliers in Pandas?
Outliers are identified using the IQR method: Q1 = df['col'].quantile(0.25), Q3 = df['col'].quantile(0.75), IQR = Q3 - Q1, then flagging values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. They can also be detected using z-scores. Outliers are handled by capping (clipping) values using df['col'].clip(lower, upper), removing the rows, transforming the column with log or square root, or imputing with the median.
16. How do you strip whitespace and standardise string columns?
String operations are accessed via the .str accessor. Whitespace is removed with df['col'].str.strip(). Case is standardised with .str.lower() or .str.upper(). Substrings are replaced with .str.replace('old', 'new', regex=False). Characters are split with .str.split(', '). Patterns are matched with .str.contains('pattern'). All .str methods are vectorised and much faster than applying Python string methods with apply().
17. How do you replace values in a DataFrame?
Values are replaced using df.replace(old_value, new_value) for a specific value or df.replace({'col': {'old': 'new'}}) for column-specific replacements. A dictionary can map multiple replacements at once: df['status'].replace({'Y': 1, 'N': 0}). df.where(condition, other) replaces values where the condition is False with other. df.mask(condition, other) replaces values where the condition is True. These are used for standardising categorical values and fixing data entry errors.
18. What is the difference between isnull() and isna()?
isnull() and isna() are aliases — they are identical in behaviour and both return a boolean DataFrame or Series where True indicates a missing (NaN or None) value. Similarly, notnull() and notna() are aliases that return True for non-missing values. There is no practical difference between them; isna() was added later to mirror the naming convention used in other parts of the Pandas API.
19. How do you handle categorical variables for machine learning?
Categorical variables are encoded for machine learning using label encoding (df['col'].map({'A': 0, 'B': 1}) or LabelEncoder) for ordinal categories, or one-hot encoding for nominal categories using pd.get_dummies(df, columns=['col'], drop_first=True). The drop_first=True parameter drops the first dummy variable to avoid multicollinearity. pd.Categorical creates memory-efficient category dtypes. For high-cardinality columns, target encoding or frequency encoding may be preferred.
20. How do you sort a DataFrame?
A DataFrame is sorted by one or more columns using df.sort_values(by='col', ascending=False). Multiple columns are sorted with df.sort_values(by=['col1', 'col2'], ascending=[True, False]). Sorting by the index uses df.sort_index(). The inplace=True parameter modifies the original DataFrame, while without it a new sorted DataFrame is returned. nlargest(n, 'col') and nsmallest(n, 'col') return the n rows with the largest or smallest values.
Filtering & Selection
21. How do you filter rows based on a condition?
Rows are filtered using boolean indexing: df[df['age'] > 30] returns all rows where the age column is greater than 30. Multiple conditions are combined using & (AND), | (OR), and ~ (NOT): df[(df['age'] > 30) & (df['city'] == 'Mumbai')]. Each condition must be wrapped in parentheses. df.query('age > 30 and city == "Mumbai"') provides a more readable string-based alternative for complex filters.
22. What is the isin() method?
isin() checks whether each element of a Series is contained in a given list of values, returning a boolean Series. It is used to filter rows where a column's value belongs to a specific set: df[df['city'].isin(['Mumbai', 'Delhi', 'Bangalore'])]. The inverse filter uses ~df['city'].isin(...). It is equivalent to using multiple | conditions but is more concise and readable for large lists of values.
23. How do you use the query() method?
df.query('expression') filters rows using a string-based query expression, which is more readable than boolean indexing for complex conditions. For example, df.query('age > 30 and salary > 50000'). Column names with spaces are referenced using backticks: df.query('first name == "Alice"'). External Python variables are referenced with @: min_age = 25; df.query('age > @min_age'). query() is evaluated using numexpr for performance on large DataFrames.
24. How do you select rows based on the index?
Rows are selected by index label using df.loc['label'] for a single row or df.loc['label1':'label2'] for a range. Integer position-based selection uses df.iloc[0] for a single row or df.iloc[0:5] for a range. df.loc[[0, 2, 5]] selects specific rows by label. When the index is a datetime, df.loc['2024-01'] selects all rows from January 2024. df.at['label', 'col'] and df.iat[row, col] access single scalar values efficiently.
25. What is the where() and mask() method?
df.where(condition, other=NaN) keeps values where the condition is True and replaces values where False with other. df.mask(condition, other=NaN) is the inverse — it replaces values where the condition is True. Both return a new DataFrame by default. For example, df['salary'].where(df['salary'] > 50000, 0) sets salaries below 50,000 to 0. These are vectorised alternatives to conditional replacement.
26. How do you sample rows from a DataFrame?
df.sample(n=10) randomly selects 10 rows without replacement. df.sample(frac=0.2) selects 20% of rows. df.sample(n=10, replace=True) samples with replacement (bootstrap). Setting random_state ensures reproducibility. df.sample(n=10, weights='column') samples rows with probability proportional to the values in the specified column. Sampling is used for exploratory analysis on large datasets and creating train-test splits.
27. How do you use str.contains() for text filtering?
df[df['col'].str.contains('pattern')] filters rows where the column contains a specific substring or regex pattern. By default it is case-sensitive; case=False makes it case-insensitive. na=False treats NaN values as False to avoid errors. regex=True (default) allows regular expression patterns. To filter for rows that do NOT contain the pattern, use ~df['col'].str.contains('pattern'). This is useful for searching text data such as product names or descriptions.
28. How do you use between() for range filtering?
df[df['col'].between(lower, upper)] selects rows where the column values fall within a range, inclusive of both endpoints by default. For example, df[df['age'].between(25, 35)] returns rows where age is between 25 and 35 inclusive. The inclusive parameter controls whether the endpoints are included: 'both' (default), 'neither', 'left', or 'right'. It is more readable than combining two comparison conditions with &.
29. What does at and iat do?
df.at[row_label, col_label] provides label-based access to a single scalar value and is significantly faster than df.loc for accessing individual elements. df.iat[row_pos, col_pos] provides integer position-based access to a single scalar value and is faster than df.iloc for single-element access. Both are optimised for single-cell reads and writes and should be preferred over loc and iloc when accessing or setting individual cell values in a loop.
30. How do you use nlargest() and nsmallest()?
df.nlargest(n, 'col') returns the n rows with the highest values in the specified column, sorted in descending order. df.nsmallest(n, 'col') returns the n rows with the lowest values, sorted in ascending order. Both are more efficient than sorting the entire DataFrame with sort_values and then slicing. Multiple columns can be specified for breaking ties. They are useful for finding top performers, identifying outliers, or creating leaderboards.
Aggregation & GroupBy
31. How does groupby() work?
df.groupby('col') splits the DataFrame into groups based on the unique values of the specified column, then you apply an aggregation function to each group. For example, df.groupby('department')['salary'].mean() calculates the mean salary per department. Multiple columns can be grouped: df.groupby(['dept', 'year'])['sales'].sum(). The result is a new DataFrame with the grouping columns as the index. reset_index() is used to convert the group keys back to regular columns.
32. What aggregation functions are commonly used with groupby()?
Common aggregation functions include sum(), mean(), median(), min(), max(), count(), std(), var(), and nunique() (count of unique values). agg() allows applying multiple functions at once: df.groupby('dept').agg({'salary': ['mean', 'max'], 'age': 'median'}). Custom functions can also be passed: df.groupby('dept').agg(lambda x: x.quantile(0.9)). transform() returns a result with the same shape as the original DataFrame, useful for adding group statistics back.
33. What is the difference between agg(), transform(), and apply() in groupby?
agg() reduces each group to a scalar value (e.g., mean, sum), returning a summary DataFrame with one row per group. transform() returns a result with the same shape as the original DataFrame, broadcasting the group result back to each row — useful for adding group means to each row. apply() is the most flexible but slowest option, allowing the user to define arbitrary functions that receive a group DataFrame and return a scalar, Series, or DataFrame.
34. How do you calculate the running total or cumulative sum?
A cumulative sum is calculated using df['col'].cumsum(), which returns a Series with the running total. Within groups, use df.groupby('group')['col'].cumsum(). Similarly, cumprod() computes the running product, cummax() the running maximum, and cummin() the running minimum. These are useful for tracking running sales totals, cumulative returns in financial analysis, or any metric where the accumulated value over time is meaningful.
35. How do you use pivot_table()?
pd.pivot_table(df, values='sales', index='region', columns='product', aggfunc='sum', fill_value=0) creates a spreadsheet-style pivot table from a DataFrame. values specifies the column to aggregate, index sets the row grouping, columns sets the column grouping, aggfunc specifies the aggregation function, and fill_value replaces NaNs in the result. Multiple aggregation functions can be specified with a list. pivot_table() is the standard way to cross-tabulate and summarise data.
36. What is crosstab()?
pd.crosstab(df['col1'], df['col2']) computes a frequency table of the joint distribution of two or more categorical variables. Unlike pivot_table(), it always uses count as the default aggregation but can compute proportions with normalize=True. margins=True adds row and column totals. It is useful for examining the relationship between categorical variables and is commonly used in chi-square tests of independence and exploratory data analysis.
37. How do you resample time series data?
For DatetimeIndex data, df.resample('M').sum() resamples to monthly frequency and sums within each period. Common offset aliases include 'D' (daily), 'W' (weekly), 'M' (month-end), 'Q' (quarterly), and 'A' or 'Y' (annually). Any aggregation function can follow: mean(), max(), ohlc() (for financial data). resample() is equivalent to groupby() but specialised for time-based grouping and can handle upsampling with forward fill or interpolation.
38. How do you use rolling() and expanding()?
df['col'].rolling(window=7).mean() computes a 7-period moving average, where each result is the mean of the current and previous 6 values. min_periods controls the minimum observations required for a valid result. df['col'].expanding().mean() computes the cumulative mean using all data points up to the current row (equivalent to cumsum() / cumcount()). Rolling windows are used for smoothing time series data, detecting trends, and computing moving averages in financial analysis.
39. How do you rank values within groups?
df.groupby('group')['col'].rank(method='dense', ascending=False) assigns ranks within each group. The method parameter controls how ties are handled: 'average' assigns the mean rank to tied values, 'min' assigns the minimum rank, 'max' the maximum, 'first' assigns ranks in order of appearance, and 'dense' assigns consecutive integer ranks without gaps. pct=True returns percentage ranks (percentiles). Ranking within groups is useful for finding top-N per category.
40. What does value_counts() do?
df['col'].value_counts() returns a Series containing the counts of unique values in descending order. normalize=True returns relative frequencies (proportions). dropna=False includes NaN in the count. bins=5 creates equal-width bins for numerical data, similar to a histogram. df['col'].value_counts().reset_index() converts the result to a DataFrame with columns for the value and its count. It is the quickest way to understand the distribution of categorical or discrete variables.
Merging & Advanced
41. What is the difference between merge(), join(), and concat()?
pd.merge(df1, df2, on='key', how='inner') combines two DataFrames based on common column values, supporting inner, left, right, and outer joins. df1.join(df2, how='left') joins on the index by default. pd.concat([df1, df2], axis=0) stacks DataFrames vertically (rows) or axis=1 horizontally (columns) without considering key columns. merge() is the most flexible and is used for SQL-style joins, while concat() is for appending or combining DataFrames with similar structure.
42. What are the different join types in merge()?
An inner join (how='inner') returns only rows with matching keys in both DataFrames. A left join (how='left') returns all rows from the left DataFrame and matching rows from the right, filling NaN for non-matches on the right. A right join (how='right') is the mirror of a left join. A full outer join (how='outer') returns all rows from both DataFrames, filling NaN for non-matches on either side. The default is inner join.
43. How do you melt and unpivot a DataFrame?
pd.melt(df, id_vars=['id', 'name'], value_vars=['q1', 'q2', 'q3']) transforms a wide-format DataFrame (multiple columns for the same variable) into a long-format DataFrame with a single column for variable names and another for values. The var_name and value_name parameters rename the variable and value columns. df.stack() is an alternative that moves column labels to the innermost row index level. melt() is the inverse of pivot().
44. How do you pivot a DataFrame from long to wide format?
df.pivot(index='date', columns='product', values='sales') reshapes the DataFrame from long to wide format, moving unique values in the columns column to become new column headers. It requires unique (index, column) combinations; if there are duplicates, pivot_table() must be used instead as it handles duplicates by aggregating. df.unstack() moves the innermost index level to columns. Pivoting is the inverse of melt().
45. How do you handle datetime data in Pandas?
Datetime columns are parsed using pd.to_datetime(df['date']). Once converted, date components are accessed via the .dt accessor: df['date'].dt.year, .dt.month, .dt.day, .dt.dayofweek, .dt.hour. Arithmetic between datetime columns produces a Timedelta. Setting the datetime column as the index enables time-based slicing: df.loc['2024-01':'2024-06']. dt.floor('D') rounds to the day and dt.strftime('%Y-%m') formats datetimes as strings.
46. What is a MultiIndex and how do you work with it?
A MultiIndex (hierarchical index) allows multiple levels of indexing along an axis. It is created using pd.MultiIndex.from_tuples(), pd.MultiIndex.from_product(), or as the result of groupby(), pivot_table(), or stack(). Data is accessed using df.loc[('level0', 'level1')] or df.xs('level0', level=0). reset_index() flattens a MultiIndex back to columns. swaplevel() swaps two index levels and sort_index() sorts by index levels.
47. How do you use cut() and qcut() for binning?
pd.cut(df['age'], bins=[0, 18, 35, 60, 100], labels=['Youth', 'Young Adult', 'Middle Aged', 'Senior']) divides a continuous column into fixed-width bins with custom labels. pd.qcut(df['salary'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4']) divides into quantile-based bins of equal frequency. cut() is used when bin boundaries are meaningful (age groups, score ranges), while qcut() is used when equal-frequency bins are needed regardless of value ranges.
48. How do you read and write files in Pandas?
CSV files are read with pd.read_csv('file.csv') and written with df.to_csv('file.csv', index=False). Excel files use pd.read_excel('file.xlsx', sheet_name='Sheet1') and df.to_excel('file.xlsx', index=False). JSON uses pd.read_json() and df.to_json(). SQL uses pd.read_sql('SELECT * FROM table', conn) with a SQLAlchemy connection. Parquet (column-oriented format) uses pd.read_parquet() and is preferred for large datasets due to better compression and read performance.
49. How do you optimise memory usage in Pandas?
Memory is reduced by downcasting numeric types using pd.to_numeric(df['col'], downcast='integer') or assigning smaller dtypes manually: df['age'].astype('int8'). Converting low-cardinality string columns to 'category' dtype significantly reduces memory. Reading large CSVs in chunks using pd.read_csv('file.csv', chunksize=10000) avoids loading everything into memory. Specifying usecols in read_csv() reads only needed columns. df.memory_usage(deep=True) shows memory consumption per column.
50. What is pipe() and when is it used?
df.pipe(func, *args, **kwargs) calls func(df, *args, **kwargs) and is used to chain multiple DataFrame operations in a readable, functional style. It enables method chaining without nested function calls: (df.pipe(clean_data).pipe(add_features).pipe(normalise_columns)) is cleaner than normalise_columns(add_features(clean_data(df))). pipe() is especially useful for building modular, testable data transformation pipelines and is considered a best practice in production Pandas code.