Introduction
Data preprocessing is a crucial step in building an effective stock market prediction AI system. The quality and nature of the data directly impact the model’s performance and accuracy. This article delves into various techniques and considerations for preprocessing stock market data, providing sample code to illustrate key concepts.
Gathering Data
The first step in data preprocessing is gathering relevant stock market data. This data can be obtained from various sources, such as financial APIs (e.g., Alpha Vantage, Yahoo Finance), stock exchanges, or proprietary databases.
Sample Code for Data Collection
Here’s an example of how to fetch historical stock data using the Alpha Vantage API:
import requests
import pandas as pd
from io import StringIO

def fetch_stock_data(symbol, api_key, outputsize='full'):
    # Request daily adjusted prices from Alpha Vantage as CSV
    url = (
        f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
        f'&symbol={symbol}&outputsize={outputsize}&apikey={api_key}&datatype=csv'
    )
    response = requests.get(url)
    data = pd.read_csv(StringIO(response.text))
    # Index by date and sort oldest-to-newest (the API returns newest rows first)
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data.set_index('timestamp', inplace=True)
    data.sort_index(inplace=True)
    return data
api_key = 'your_alpha_vantage_api_key'
symbol = 'AAPL'
data = fetch_stock_data(symbol, api_key)
print(data.head())
Cleaning Data
Once the data is collected, the next step is cleaning it. This involves handling missing values, removing duplicates, and ensuring the data is consistent.
Handling Missing Values
Missing values can be handled in several ways, such as filling them with the mean or median, carrying forward the previous observation, or interpolating between known values.
# Forward-fill missing values with the previous day's value
data.ffill(inplace=True)
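If a forward fill is not appropriate for a given column, the gaps can instead be interpolated or replaced with a summary statistic. A minimal sketch, using the same data frame and its datetime index:
# Alternative: interpolate linearly along the time index
data_interp = data.interpolate(method='time')
# Alternative: replace remaining gaps in a single column with its mean
close_filled = data['close'].fillna(data['close'].mean())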
Removing Duplicates
Ensure there are no duplicate records in the dataset.
# Remove duplicate rows
data.drop_duplicates(inplace=True)
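Because the frame is indexed by timestamp, it can also be worth checking for repeated index entries, which drop_duplicates (comparing row values) will not necessarily remove. A small, optional sketch on the same frame:
# Keep only the first row for any repeated timestamp in the index
data = data[~data.index.duplicated(keep='first')]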
Feature Engineering
Feature engineering is the process of creating new features from the existing data to better capture the underlying patterns. For stock market prediction, this might involve creating technical indicators such as moving averages, RSI (Relative Strength Index), or MACD (Moving Average Convergence Divergence).
Technical Indicators
- Moving Average
# Calculate the 50-day moving average
data['50_MA'] = data['close'].rolling(window=50).mean()
- Relative Strength Index (RSI)
def calculate_rsi(data, window=14):
    # Split daily price changes into gains and losses
    delta = data['close'].diff(1)
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    # Simple rolling averages (a common simplification of Wilder's smoothing)
    avg_gain = gain.rolling(window=window).mean()
    avg_loss = loss.rolling(window=window).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

data['RSI'] = calculate_rsi(data)
- Moving Average Convergence Divergence (MACD)
def calculate_macd(data, short_window=12, long_window=26, signal_window=9):
    # MACD line: short-period EMA minus long-period EMA of the close
    short_ema = data['close'].ewm(span=short_window, adjust=False).mean()
    long_ema = data['close'].ewm(span=long_window, adjust=False).mean()
    macd = short_ema - long_ema
    # Signal line: EMA of the MACD line itself
    signal = macd.ewm(span=signal_window, adjust=False).mean()
    return macd, signal

data['MACD'], data['Signal_Line'] = calculate_macd(data)
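Many feature sets also keep the MACD histogram, the gap between the MACD line and its signal line; a one-line sketch building on the two columns just created:
# Optional extra feature: MACD histogram
data['MACD_Hist'] = data['MACD'] - data['Signal_Line']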
Normalization and Scaling
Most models, particularly neural networks, train better when the input features are on a comparable scale. Normalization and scaling adjust the range of the data so that no single feature dominates the others during training.
Min-Max Scaling
Min-Max scaling transforms the data to a fixed range, typically [0, 1].
from sklearn.preprocessing import MinMaxScaler
# The rolling indicators leave NaN in the earliest rows; drop them before scaling
features = ['close', '50_MA', 'RSI', 'MACD', 'Signal_Line']
data = data.dropna(subset=features)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data[features])
scaled_df = pd.DataFrame(scaled_data, index=data.index, columns=features)
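Note that fitting the scaler on the full dataset lets information from the test period leak into the training features. A stricter variant, sketched below assuming the chronological 80/20 split used in the next step, fits the scaler on the training rows only and then applies it to the held-out rows:
# Fit on the training window only, then transform the held-out window
train_size = int(len(data) * 0.8)
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(data[features].iloc[:train_size])
test_scaled = scaler.transform(data[features].iloc[train_size:])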
Splitting Data
Splitting the data into training and testing sets is crucial for evaluating the model's performance. A common split is 80% for training and 20% for testing; for time series the split should be chronological rather than shuffled, so the test set contains only data from after the training period.
train_size = int(len(scaled_df) * 0.8)
train_data = scaled_df[:train_size]
test_data = scaled_df[train_size:]
Creating Sequences
For time series data like stock prices, creating sequences of past values to predict future values is common. This is particularly useful for models like LSTMs.
import numpy as np

def create_sequences(data, sequence_length=50):
    # Each sample is a window of the previous `sequence_length` rows;
    # the target is the next value of the first column (the scaled close)
    x, y = [], []
    for i in range(sequence_length, len(data)):
        x.append(data[i-sequence_length:i])
        y.append(data[i, 0])
    return np.array(x), np.array(y)

sequence_length = 50
x_train, y_train = create_sequences(train_data.values, sequence_length)
x_test, y_test = create_sequences(test_data.values, sequence_length)
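As a quick sanity check, the resulting arrays should already have the three-dimensional shape that Keras-style LSTM layers expect, (samples, sequence_length, n_features). A small check, assuming the five feature columns above:
# Expect x_train to be (n_samples, 50, 5): 50 time steps of 5 scaled features
print(x_train.shape, y_train.shape)
assert x_train.shape[1:] == (sequence_length, train_data.shape[1])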
Sample Code for Data Preprocessing
Here’s the complete code for preprocessing stock market data:
import requests
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.preprocessing import MinMaxScaler

# Fetch stock data
def fetch_stock_data(symbol, api_key, outputsize='full'):
    url = (
        f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
        f'&symbol={symbol}&outputsize={outputsize}&apikey={api_key}&datatype=csv'
    )
    response = requests.get(url)
    data = pd.read_csv(StringIO(response.text))
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data.set_index('timestamp', inplace=True)
    data.sort_index(inplace=True)  # the API returns newest rows first
    return data

api_key = 'your_alpha_vantage_api_key'
symbol = 'AAPL'
data = fetch_stock_data(symbol, api_key)

# Data cleaning
data.ffill(inplace=True)
data.drop_duplicates(inplace=True)

# Feature engineering
data['50_MA'] = data['close'].rolling(window=50).mean()

def calculate_rsi(data, window=14):
    delta = data['close'].diff(1)
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    avg_gain = gain.rolling(window=window).mean()
    avg_loss = loss.rolling(window=window).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

data['RSI'] = calculate_rsi(data)

def calculate_macd(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['close'].ewm(span=short_window, adjust=False).mean()
    long_ema = data['close'].ewm(span=long_window, adjust=False).mean()
    macd = short_ema - long_ema
    signal = macd.ewm(span=signal_window, adjust=False).mean()
    return macd, signal

data['MACD'], data['Signal_Line'] = calculate_macd(data)

# Data normalization (drop the NaN rows left by the rolling windows first)
features = ['close', '50_MA', 'RSI', 'MACD', 'Signal_Line']
data = data.dropna(subset=features)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data[features])
scaled_df = pd.DataFrame(scaled_data, index=data.index, columns=features)

# Split data (chronological, no shuffling)
train_size = int(len(scaled_df) * 0.8)
train_data = scaled_df[:train_size]
test_data = scaled_df[train_size:]

# Create sequences
def create_sequences(data, sequence_length=50):
    x, y = [], []
    for i in range(sequence_length, len(data)):
        x.append(data[i-sequence_length:i])
        y.append(data[i, 0])
    return np.array(x), np.array(y)

sequence_length = 50
x_train, y_train = create_sequences(train_data.values, sequence_length)
x_test, y_test = create_sequences(test_data.values, sequence_length)
print(f'x_train shape: {x_train.shape}, y_train shape: {y_train.shape}')
print(f'x_test shape: {x_test.shape}, y_test shape: {y_test.shape}')
Conclusion
Data preprocessing is a critical step in developing a stock market prediction AI system. By carefully gathering, cleaning, and transforming data, you can significantly improve the performance of your predictive models. The techniques and sample code provided in this article serve as a foundation for your preprocessing tasks, enabling you to build more accurate and reliable stock market prediction models.