Machine Learning Algorithm Simple Strategy Applied to Brazilian Oil Stock

Jan 30, 2024

This is my first machine learning-based trading algorithm experience, so I’ve decided to keep things simple.

The Strategy

The strategy consists of using, as features, other ETFs that are expected to be highly correlated with our target stock price. I’ve chosen a Brazilian oil producer as the target (PetroRio, ticker: PRIO3) and just two base features, Brent and WTI prices. We can take a look at the yearly rolling correlation of the stock with the oil prices:

We can see that the stock behaves similarly to the features, and also that correlation can be a misleading measure, since it swings between negative and positive depending on the period. Still, most of the time there seems to be a high positive correlation between the PRIO3 stock price and these features (a really simplistic analysis here).
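To make the rolling correlation concrete, here is a minimal, dependency-free sketch. The function names and the toy prices are made up for illustration, and the window of 3 stands in for a yearly window (roughly 252 trading days, which is my assumption). With pandas, something like `stock.rolling(252).corr(oil)` should give the same kind of series.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)

def rolling_correlation(xs, ys, window):
    """Correlation over each trailing window; None until the window is full."""
    out = []
    for end in range(1, len(xs) + 1):
        if end < window:
            out.append(None)
        else:
            out.append(pearson(xs[end - window:end], ys[end - window:end]))
    return out

# Toy prices, made up for illustration only (not real PRIO3 or Brent data)
stock = [10.0, 11.0, 12.0, 11.0, 10.0, 9.0]
oil = [50.0, 52.0, 54.0, 55.0, 56.0, 57.0]
rolling = rolling_correlation(stock, oil, window=3)
```

In this toy series the correlation moves from +1 in the first full window to -1 in the last ones, the same kind of sign flip described above.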

With this data in hand, we want to train a model to detect price peaks so we can trade the stock. We construct our signal based on the triple-barrier method:

We calculate the volatility of the stock and check whether, in the next x days (I’m using 10), the price rises at least 2 volatilities above the current price. If so, this is mapped as an entry signal, as in the code below:

import pandas as pd

def apply_triple_barrier(price_series, profit_take, holding_period, dates=None):
    # get_daily_volatility is defined elsewhere; it should return a Series of
    # daily volatility (as a fraction of price) aligned with price_series
    vol = get_daily_volatility(price_series)
    price_series = price_series.reindex(vol.index)
    n = len(price_series)
    events = []

    # Iterate by position so the logic works for integer and datetime indexes
    for t1 in range(n):
        buy_price = price_series.iloc[t1]
        volatility = vol.iloc[t1]

        # Upper barrier: profit_take volatilities above the current price
        upper_barrier = buy_price + profit_take * volatility * buy_price

        # Vertical barrier: end of the holding period, capped at the last bar
        t2 = min(t1 + holding_period, n - 1)

        # Label 1 (take-profit) if the price reaches the upper barrier on any
        # day after t1 up to and including t2, otherwise 0
        window = price_series.iloc[t1 + 1:t2 + 1]
        label = int(window.max() >= upper_barrier)

        returns = (price_series.iloc[t2] - buy_price) / buy_price

        # Store the index labels of both the entry and the vertical barrier
        events.append([price_series.index[t1], price_series.index[t2], label,
                       returns, volatility, buy_price, upper_barrier])

    # Build the DataFrame once, after the loop
    return pd.DataFrame(events, columns=['i', 't2', 'label', 'returns',
                                         'volatility', 'buy_price', 'upper_barrier'])
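As a worked example of the labeling rule, here is a small sketch on plain lists. `label_entry` and the toy numbers are hypothetical, and I'm assuming the window runs from the day after entry through the last day of the holding period. With a price of 40.0, a daily volatility of 2% and `profit_take=2`, the upper barrier is 40.0 * (1 + 2 * 0.02) = 41.6.

```python
def label_entry(prices, vols, t1, profit_take=2.0, holding_period=10):
    """Return (label, upper_barrier) for an entry at position t1."""
    upper_barrier = prices[t1] * (1 + profit_take * vols[t1])
    # Vertical barrier, capped at the last available price
    t2 = min(t1 + holding_period, len(prices) - 1)
    # Days after the entry, up to and including the vertical barrier
    window = prices[t1 + 1:t2 + 1]
    label = int(any(p >= upper_barrier for p in window))
    return label, upper_barrier

# Toy prices and a constant 2% daily volatility, for illustration only
prices = [40.0, 40.5, 41.0, 41.8, 41.2, 40.9]
vols = [0.02] * len(prices)

hit, barrier = label_entry(prices, vols, t1=0, holding_period=3)
miss, _ = label_entry(prices, vols, t1=3, holding_period=2)
```

The first entry gets label 1 because 41.8 crosses the 41.6 barrier within three days. The second gets label 0 because the price never comes close to its own barrier.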

With the labels available, we need to adjust the feature format to feed the algorithm. For the train and test sets I’m using the data available from yfinance through the end of 2021, with a separate holdout set from 2022 to now for the backtest.

backtest_df = df.loc["2022":].reset_index().copy(deep=True)
df = df.loc[:"2021"].reset_index().copy(deep=True)

We are going to build simple moving averages (I’m using 2-, 7-, 20-, and 50-day windows) and the volatility of each feature, so that each row of the final dataframe holds the SMAs and volatilities computed from the past 50 days, like this:
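A minimal sketch of this feature construction, using only the standard library. The function names, the column naming scheme and the 20-day volatility window are my assumptions, since none of them are specified above. I'm also reading "volatility" as the standard deviation of simple daily returns.

```python
from statistics import stdev

def sma(values, window):
    """Simple moving average; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def rolling_volatility(values, window):
    """Sample std of simple daily returns over each trailing window (window >= 2)."""
    returns = [values[i] / values[i - 1] - 1 for i in range(1, len(values))]
    out = [None] * len(values)
    # returns[j] is the return of day j + 1, so day i uses returns[i - window:i]
    for i in range(window, len(values)):
        out[i] = stdev(returns[i - window:i])
    return out

def build_features(series_by_name, sma_windows=(2, 7, 20, 50), vol_window=20):
    """One dict of features per day, with keys such as 'brent_sma_7'."""
    n = len(next(iter(series_by_name.values())))
    rows = [{} for _ in range(n)]
    for name, values in series_by_name.items():
        for w in sma_windows:
            for i, v in enumerate(sma(values, w)):
                rows[i][f"{name}_sma_{w}"] = v
        for i, v in enumerate(rolling_volatility(values, vol_window)):
            rows[i][f"{name}_vol_{vol_window}"] = v
    return rows

# Toy Brent prices with short windows, for illustration only
brent = [80.0, 82.0, 81.0, 83.0, 84.0]
rows = build_features({"brent": brent}, sma_windows=(2, 3), vol_window=2)
```

Rows that still contain `None` (the first 49 days when the longest window is 50) would need to be dropped before training.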

Baseline Model

As a baseline model I’ve chosen a Random Forest classifier with the following parameters:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, min_samples_split=100, max_depth=5, random_state=1)

Fitting the model to the data and evaluating it on the test set, we got a precision of 0.83 and a recall of 0.83, a pretty good result, along with a good-looking confusion matrix:

The number of false positives is really low, which is exactly what we need in an algorithm like this. We can afford false negatives, because they only mean we skip some valid signals. A lot of false positives, that is, signals to buy the stock when we shouldn't, would probably lose us money.
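For reference, precision and recall follow directly from the confusion-matrix counts: precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts below are hypothetical and are not the ones from the matrix above. They only show how few false positives push precision up even when recall is lower.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, for illustration only
precision, recall = precision_recall(tp=40, fp=5, fn=15)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a strategy that only buys, precision is the share of buy signals that were actually worth taking, which is why false positives are the costlier mistake here.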

XGBoost Model

Now we move to XGBoost, personally my favorite ML algorithm. For this project I kept the model training really simple, just setting early stopping and AUC as the evaluation metric:

import xgboost as xgb

xgboost_params = {
    'objective': 'binary:logistic',  # Binary classification
    'eval_metric': 'auc',  # AUC as the evaluation metric
    'early_stopping_rounds': 10,  # Early stopping (constructor argument in XGBoost >= 1.6)
    'seed': 42,  # Random seed
    'verbosity': 1,
}

xgboost_model = xgb.XGBClassifier(**xgboost_params)

xgboost_model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    verbose=200, 
)

The confusion matrix looks like this:

XGBoost outperformed the Random Forest, reducing both false positives and false negatives. The model's feature importances also look interesting (top 20 features shown):

Backtest

After training the model, I put the holdout data (2022 to now) into the correct format and applied the algorithm to generate the signals:

# Fragment: simulates one trade for a predicted entry at position idx_entry
if idx_entry <= n:
    buy_price = price_df.iloc[idx_entry].STOCK_PRICE
    volatility = vol_df.loc[idx_entry].STOCK_PRICE
    upper_barrier = buy_price + profit_take * volatility * buy_price
    stop_loss = buy_price - volatility * buy_price
    # Prices over the holding period, starting at the entry day
    window = price_df.iloc[idx_entry:idx_entry + holding_period]

    if window.STOCK_PRICE.max() >= upper_barrier:
        # Sells at the window's maximum price, which assumes perfect exit timing
        best = window.iloc[window.STOCK_PRICE.to_numpy().argmax()]
        sell_price = best.STOCK_PRICE
        sell_date = best.Date
        trade_type = 'reach_barrier'

    elif (window.STOCK_PRICE <= stop_loss).any():
        # Exit at the first price at or below the stop
        stop = window[window.STOCK_PRICE <= stop_loss].iloc[0]
        sell_price = stop.STOCK_PRICE
        sell_date = stop.Date
        trade_type = 'stop_loss'

    else:
        # Neither barrier was touched: exit when the holding period ends
        exit_row = price_df.iloc[idx_entry + holding_period]
        sell_price = exit_row.STOCK_PRICE
        sell_date = exit_row.Date
        trade_type = 'timeout'
    trades.append([entry_date, buy_price, sell_date, sell_price, trade_type])

We can get three types of results: “reach_barrier” means a successful trade that hit the profit target, “stop_loss” means the price hit the stop and we lose money, and “timeout” means the holding period ended without touching either barrier. The final data frame with the results looks like this:

We can check the cumulative returns of the backtest:
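One way to compute such a curve is to compound the return of each trade in sequence. This is a sketch under my own assumptions: trades are taken one after another with the full capital, they never overlap, and there are no fees or slippage. The function name and the toy trades are hypothetical.

```python
def cumulative_returns(trades):
    """Compound per-trade returns; trades are (buy_price, sell_price) pairs."""
    equity = 1.0  # start with one unit of capital
    curve = []
    for buy_price, sell_price in trades:
        equity *= sell_price / buy_price
        curve.append(equity - 1.0)  # cumulative return after this trade
    return curve

# Toy trades, for illustration only: +5%, -5%, +5%
trades = [(40.0, 42.0), (42.0, 39.9), (41.0, 43.05)]
curve = cumulative_returns(trades)
```

Summing the per-trade returns instead of compounding them would give a different number, so it is worth stating which convention a reported total uses.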

We're getting an astonishing 126% profit in the period! It looks really good. Actually, too good, maybe? I’ve also run cross-validation to start investigating overfitting, getting these results:

array([0.86575875, 0.85992218, 0.86159844, 0.85380117, 0.88304094])

These are the accuracy results for each fold, and they look very consistent.
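To put a number on "consistent", the fold scores can be summarized by their mean and standard deviation. This small sketch uses the five values listed above and only the standard library.

```python
from statistics import mean, stdev

# Per-fold accuracy values reported above
fold_accuracy = [0.86575875, 0.85992218, 0.86159844, 0.85380117, 0.88304094]

avg = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across folds
print(f"accuracy: {avg:.3f} +/- {spread:.3f}")
```

The mean comes out at roughly 0.865 with a spread of about one percentage point across folds.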

Conclusion and Next Steps

My first impression of the strategy is very positive, and I’m excited to put the algorithm to trade in real life to see what we get. Keep in mind that this was a first experience and a lot can be done to better adjust the model and also evaluate the results more carefully. For my next steps, I want to build code to trade in real life and also experiment with other stocks and features, intraday data, and implement risk control and hedging methods.