user
In Python how to iterate on dataset rows?
alphonsio

Iterating over rows in a dataset in Python can be done efficiently using the Pandas library. Here are a few methods to iterate over the rows of a DataFrame:

# Method 1: Using iterrows()
This function is used to generate an iterator that yields index and row data as pairs. It's a straightforward way to iterate through DataFrame rows.

import pandas as pd

# Example DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

for index, row in df.iterrows():
    print(f"Index: {index}, Name: {row['Name']}, Age: {row['Age']}")

Note: iterrows() returns each row as a Pandas Series which can be slightly slower if processing large data sets because of type conversion.

Method 2: Using itertuples()
This method is generally faster than iterrows() because it returns named tuples of the rows and does not convert the data type.

for row in df.itertuples():
    print(f"Index: {row.Index}, Name: {row.Name}, Age: {row.Age}")

Method 3: Vectorized operations (Recommended for performance)

Instead of iterating over rows, it's often more efficient to use vectorized operations when working with Pandas to perform calculations across rows. This takes advantage of Pandas' and NumPy's optimization and is much faster than row-wise iteration.

# Applying a function to a column to create a new column
df['New_Age'] = df['Age'] + 10
print(df)

Method 4: Applying Functions
For more complex operations or when you must apply a function row-wise, you can use the apply() method with axis=1.

def process(row):
    return f"{row['Name']} is {row['Age']} years old."

df['Description'] = df.apply(process, axis=1)
print(df)

Conclusion

While iterating over rows can be done with iterrows() or itertuples(), for larger datasets or operations needing high performance, it's advisable to use vectorized operations provided by Pandas or using apply() with a function.

Choose the method based on your specific scenario, considering the balance between code readability and execution performance. For large datasets, focus on vectorized operations for best performance.