Introduction
Data preprocessing is a crucial step in building an effective stock market prediction AI system. The quality and nature of the data directly impact the model’s performance and accuracy. This article delves into various techniques and considerations for preprocessing stock market data, providing sample code to illustrate key concepts.
Gathering Data
The first step in data preprocessing is gathering relevant stock market data. This data can be obtained from various sources, such as financial APIs (e.g., Alpha Vantage, Yahoo Finance), stock exchanges, or proprietary databases.
Sample Code for Data Collection
Here’s an example of how to fetch historical stock data using the Alpha Vantage API:
import requests
import pandas as pd
from io import StringIO

def fetch_stock_data(symbol, api_key, outputsize='full'):
    # Request daily adjusted prices from Alpha Vantage as CSV
    url = (
        f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
        f'&symbol={symbol}&outputsize={outputsize}&apikey={api_key}&datatype=csv'
    )
    response = requests.get(url)
    data = pd.read_csv(StringIO(response.text))
    # Index by date and sort oldest-to-newest (the API returns newest rows first)
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data.set_index('timestamp', inplace=True)
    data.sort_index(inplace=True)
    return data
api_key = 'your_alpha_vantage_api_key'
symbol = 'AAPL'
data = fetch_stock_data(symbol, api_key)
print(data.head())
Cleaning Data
Once the data is collected, the next step is cleaning it. This involves handling missing values, removing duplicates, and ensuring the data is consistent.
Handling Missing Values
Missing values can be handled in several ways, such as filling them with the mean or median, carrying forward the previous observation, or interpolating between known values.
# Forward-fill missing values with the previous day's value
data.ffill(inplace=True)
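If a forward fill is not appropriate for a given column, the gaps can instead be interpolated or replaced with a summary statistic. A minimal sketch, using the same data frame and its datetime index:
# Alternative: interpolate linearly along the time index
data_interp = data.interpolate(method='time')
# Alternative: replace remaining gaps in a single column with its mean
close_filled = data['close'].fillna(data['close'].mean())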
Removing Duplicates
Ensure there are no duplicate records in the dataset.
# Remove duplicate rows
data.drop_duplicates(inplace=True)
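Because the frame is indexed by timestamp, it can also be worth checking for repeated index entries, which drop_duplicates (comparing row values) will not necessarily remove. A small, optional sketch on the same frame:
# Keep only the first row for any repeated timestamp in the index
data = data[~data.index.duplicated(keep='first')]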
Feature Engineering
Feature engineering is the process of creating new features from the existing data to better capture the underlying patterns. For stock market prediction, this might involve creating technical indicators such as moving averages, RSI (Relative Strength Index), or MACD (Moving Average Convergence Divergence).
Technical Indicators
- Moving Average
# Calculate the 50-day moving average
data['50_MA'] = data['close'].rolling(window=50).mean()
- Relative Strength Index (RSI)
def calculate_rsi(data, window=14):
    # Split daily price changes into gains and losses
    delta = data['close'].diff(1)
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    # Simple rolling averages (a common simplification of Wilder's smoothing)
    avg_gain = gain.rolling(window=window).mean()
    avg_loss = loss.rolling(window=window).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

data['RSI'] = calculate_rsi(data)
- Moving Average Convergence Divergence (MACD)
def calculate_macd(data, short_window=12, long_window=26, signal_window=9):
    # MACD line: short-period EMA minus long-period EMA of the close
    short_ema = data['close'].ewm(span=short_window, adjust=False).mean()
    long_ema = data['close'].ewm(span=long_window, adjust=False).mean()
    macd = short_ema - long_ema
    # Signal line: EMA of the MACD line itself
    signal = macd.ewm(span=signal_window, adjust=False).mean()
    return macd, signal

data['MACD'], data['Signal_Line'] = calculate_macd(data)
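Many feature sets also keep the MACD histogram, the gap between the MACD line and its signal line; a one-line sketch building on the two columns just created:
# Optional extra feature: MACD histogram
data['MACD_Hist'] = data['MACD'] - data['Signal_Line']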
Normalization and Scaling
Most models, particularly neural networks, train better when the input features are on a comparable scale. Normalization and scaling adjust the range of the data so that no single feature dominates the others during training.
Min-Max Scaling
Min-Max scaling transforms the data to a fixed range, typically [0, 1].
from sklearn.preprocessing import MinMaxScaler
# The rolling indicators leave NaN in the earliest rows; drop them before scaling
features = ['close', '50_MA', 'RSI', 'MACD', 'Signal_Line']
data = data.dropna(subset=features)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data[features])
scaled_df = pd.DataFrame(scaled_data, index=data.index, columns=features)
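Note that fitting the scaler on the full dataset lets information from the test period leak into the training features. A stricter variant, sketched below assuming the chronological 80/20 split used in the next step, fits the scaler on the training rows only and then applies it to the held-out rows:
# Fit on the training window only, then transform the held-out window
train_size = int(len(data) * 0.8)
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(data[features].iloc[:train_size])
test_scaled = scaler.transform(data[features].iloc[train_size:])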
Splitting Data
Splitting the data into training and testing sets is crucial for evaluating the model's performance. A common split is 80% for training and 20% for testing; for time series the split should be chronological rather than shuffled, so the test set contains only data from after the training period.
train_size = int(len(scaled_df) * 0.8)
train_data = scaled_df[:train_size]
test_data = scaled_df[train_size:]
Creating Sequences
For time series data like stock prices, creating sequences of past values to predict future values is common. This is particularly useful for models like LSTMs.
import numpy as np

def create_sequences(data, sequence_length=50):
    # Each sample is a window of the previous `sequence_length` rows;
    # the target is the next value of the first column (the scaled close)
    x, y = [], []
    for i in range(sequence_length, len(data)):
        x.append(data[i-sequence_length:i])
        y.append(data[i, 0])
    return np.array(x), np.array(y)

sequence_length = 50
x_train, y_train = create_sequences(train_data.values, sequence_length)
x_test, y_test = create_sequences(test_data.values, sequence_length)
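As a quick sanity check, the resulting arrays should already have the three-dimensional shape that Keras-style LSTM layers expect, (samples, sequence_length, n_features). A small check, assuming the five feature columns above:
# Expect x_train to be (n_samples, 50, 5): 50 time steps of 5 scaled features
print(x_train.shape, y_train.shape)
assert x_train.shape[1:] == (sequence_length, train_data.shape[1])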
Sample Code for Data Preprocessing
Here’s the complete code for preprocessing stock market data:
import requests
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.preprocessing import MinMaxScaler

# Fetch stock data
def fetch_stock_data(symbol, api_key, outputsize='full'):
    url = (
        f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
        f'&symbol={symbol}&outputsize={outputsize}&apikey={api_key}&datatype=csv'
    )
    response = requests.get(url)
    data = pd.read_csv(StringIO(response.text))
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data.set_index('timestamp', inplace=True)
    data.sort_index(inplace=True)  # the API returns newest rows first
    return data

api_key = 'your_alpha_vantage_api_key'
symbol = 'AAPL'
data = fetch_stock_data(symbol, api_key)

# Data cleaning
data.ffill(inplace=True)
data.drop_duplicates(inplace=True)

# Feature engineering
data['50_MA'] = data['close'].rolling(window=50).mean()

def calculate_rsi(data, window=14):
    delta = data['close'].diff(1)
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    avg_gain = gain.rolling(window=window).mean()
    avg_loss = loss.rolling(window=window).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

data['RSI'] = calculate_rsi(data)

def calculate_macd(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['close'].ewm(span=short_window, adjust=False).mean()
    long_ema = data['close'].ewm(span=long_window, adjust=False).mean()
    macd = short_ema - long_ema
    signal = macd.ewm(span=signal_window, adjust=False).mean()
    return macd, signal

data['MACD'], data['Signal_Line'] = calculate_macd(data)

# Data normalization (drop the NaN rows left by the rolling windows first)
features = ['close', '50_MA', 'RSI', 'MACD', 'Signal_Line']
data = data.dropna(subset=features)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data[features])
scaled_df = pd.DataFrame(scaled_data, index=data.index, columns=features)

# Split data (chronological, no shuffling)
train_size = int(len(scaled_df) * 0.8)
train_data = scaled_df[:train_size]
test_data = scaled_df[train_size:]

# Create sequences
def create_sequences(data, sequence_length=50):
    x, y = [], []
    for i in range(sequence_length, len(data)):
        x.append(data[i-sequence_length:i])
        y.append(data[i, 0])
    return np.array(x), np.array(y)

sequence_length = 50
x_train, y_train = create_sequences(train_data.values, sequence_length)
x_test, y_test = create_sequences(test_data.values, sequence_length)
print(f'x_train shape: {x_train.shape}, y_train shape: {y_train.shape}')
print(f'x_test shape: {x_test.shape}, y_test shape: {y_test.shape}')
Conclusion
Data preprocessing is a critical step in developing a stock market prediction AI system. By carefully gathering, cleaning, and transforming data, you can significantly improve the performance of your predictive models. The techniques and sample code provided in this article serve as a foundation for your preprocessing tasks, enabling you to build more accurate and reliable stock market prediction models.