Python for Data Science
Python has become the de facto language for data science. This chapter covers how to set up a working environment, the core libraries you will use most often, and a few practices that keep analysis code organized.
Setting Up Your Environment
Installing Python and Required Tools
# Check Python version
python --version
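# Install pip if it is missing (recent Python installers usually bundle it)
python -m ensurepip --upgrade
# Alternative: bootstrap pip with the official get-pip.py script
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py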
# Install essential packages
pip install numpy pandas matplotlib seaborn jupyter scikit-learn
Essential Python Libraries for Data Science
NumPy - Numerical Computing
import numpy as np
# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Basic operations
print(arr.mean())    # Mean
print(arr.std())     # Population standard deviation (ddof=0)
print(matrix.shape)  # Array dimensions: (2, 2)
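The operations above work on the whole array at once. As a minimal sketch of the same idea, the example below applies element-wise arithmetic and shows the difference between the population and sample standard deviation; the variable names are illustrative and not part of the original example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Element-wise arithmetic: no explicit Python loop is needed
doubled = arr * 2
squared = arr ** 2

# std() uses the population formula by default (ddof=0);
# pass ddof=1 to get the sample standard deviation instead
population_std = arr.std()
sample_std = arr.std(ddof=1)

print(doubled)
print(squared)
print(population_std, sample_std)
```

The sample value is slightly larger because it divides by n - 1 rather than n, which matters most for small arrays like this one.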
Pandas - Data Manipulation
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 22, 35],
    'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
# Basic operations
print(df.head())      # View the first rows (up to 5 by default)
print(df.describe())  # Summary statistics for the numeric columns
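Beyond viewing and summarizing, most day-to-day Pandas work involves selecting columns, filtering rows, and sorting. The sketch below rebuilds the same small DataFrame so that it runs on its own; the threshold of 25 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "Age": [28, 22, 35],
    "City": ["New York", "Paris", "London"],
})

# Select a single column (returns a Series)
ages = df["Age"]

# Keep only the rows where a condition holds (boolean mask)
over_25 = df[df["Age"] > 25]

# Sort by a column, oldest first
by_age = df.sort_values("Age", ascending=False)

print(ages.mean())
print(over_25["Name"].tolist())  # ['John', 'Peter']
print(by_age["Name"].tolist())   # ['Peter', 'John', 'Anna']
```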
Working with Jupyter Notebooks
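Jupyter Notebooks are interactive computing environments that keep code, output, and documentation together in one file, which suits exploratory data science work. A typical session: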
- Start Jupyter:
jupyter notebook
- Create a new notebook
- Write and execute code cells
- Add markdown documentation
Best Practices
Code Organization
# Imports at the top
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Constants
DATA_PATH = "data/dataset.csv"
RANDOM_SEED = 42
# Functions
def load_data(path):
    """Load data from a CSV file."""
    return pd.read_csv(path)

def analyze_data(df):
    """Return summary statistics for the numeric columns."""
    return df.describe()
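A constant such as RANDOM_SEED earns its place when every random step in a script reads from it, so that reruns produce the same numbers. The sketch below is one possible way to do that, assuming NumPy's `default_rng` generator; `make_sample` and `main` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Constant shared by every random step in the script
RANDOM_SEED = 42


def make_sample(n_rows, seed=RANDOM_SEED):
    """Build a small synthetic DataFrame (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"value": rng.normal(size=n_rows)})


def main():
    first = make_sample(5)
    second = make_sample(5)
    # The same seed yields the same data on every call
    print(first.equals(second))


if __name__ == "__main__":
    main()
```

Putting the entry point behind `if __name__ == "__main__":` lets other modules or notebooks import the functions without running the script.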
Data Science Workflow
- Import required libraries
- Load and inspect data
- Clean and preprocess
- Analyze and visualize
- Document findings
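The steps above can be sketched end to end on a tiny dataset. This is an illustration under stated assumptions rather than a fixed recipe: the CSV content is made up and held in memory instead of a real file, `clean_data` is a hypothetical helper, and the visualization step is left out to keep the example short.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file
RAW_CSV = """name,age,city
John,28,New York
Anna,,Paris
Peter,35,London
Peter,35,London
"""


def load_data(source):
    """Load data from a CSV file path or file-like object."""
    return pd.read_csv(source)


def clean_data(df):
    """Drop duplicate rows and rows with a missing age."""
    cleaned = df.drop_duplicates().dropna(subset=["age"])
    return cleaned.reset_index(drop=True)


def analyze_data(df):
    """Return summary statistics for the age column."""
    return df["age"].describe()


# Load and inspect
raw = load_data(io.StringIO(RAW_CSV))
print(raw.shape)         # (4, 3)
print(raw.isna().sum())  # missing values per column

# Clean and preprocess
clean = clean_data(raw)

# Analyze
summary = analyze_data(clean)
print(summary)
```

Keeping each step in its own function makes it easier to rerun one stage in a notebook cell and to document what each stage changed.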
Exercises
- Create a NumPy array and perform basic operations
- Load a CSV file into a Pandas DataFrame
- Create a simple data visualization
- Write a function to clean data
In the next chapter, we'll dive deeper into data analysis fundamentals.