Machine Learning for Finance
Machine Learning for Asset Managers
Financial problems pose a particular challenge to those legacy methods, because economic systems exhibit a degree of complexity that is beyond the grasp of classical statistical tools (López de Prado 2019b). As a consequence, machine learning (ML) plays an increasingly important role in finance. Only a few years ago, it was rare to find ML applications outside short-term price prediction, trade execution, and setting of credit ratings. Today, it is hard to find a use case where ML is not being deployed in some form. This trend is unlikely to change, as larger data sets, greater computing power, and more efficient algorithms all conspire to unleash a golden age of financial ML. The ML revolution creates opportunities for dynamic firms and challenges for antiquated asset managers. Firms that resist this revolution will likely share Kodak’s fate.
Machine Learning for Asset Managers is concerned with answering a different challenge: how can we use ML to build better financial theories? This is not a philosophical or rhetorical question. Whatever edge you aspire to gain in finance, it can only be justified in terms of someone else making a systematic mistake from which you benefit. Without a testable theory that explains your edge, the odds are that you do not have an edge at all. A historical simulation of an investment strategy’s performance (backtest) is not a theory; it is a (likely unrealistic) simulation of a past that never happened (you did not deploy that strategy years ago; that is why you are backtesting it!). Only a theory can pin down the clear cause–effect mechanism that allows you to extract profits against the collective wisdom of the crowds – a testable theory that explains factual evidence as well as counterfactual cases (x implies y, and the absence of y implies the absence of x). Asset managers should focus their efforts on researching theories, not backtesting trading rules. ML is a powerful tool for building financial theories.
1.7.2 ML Is a Black Box
This is perhaps the most widespread myth surrounding ML. Every research laboratory in the world uses ML to some extent, so clearly ML is compatible with the scientific method. Not only is ML not a black box, but as Section 6 explains, ML-based research tools can be more insightful than traditional statistical methods (including econometrics). ML models can be interpreted through a number of procedures, such as PDP, ICE, ALE, Friedman’s H-stat, MDI, MDA, global surrogate, LIME, and Shapley values, among others. See Molnar (2019) for a detailed treatment of ML interpretability. Whether someone applies ML as a black box or as a white box is a matter of personal choice. The same is true of many other technical subjects. I personally do not care much about how my car works, and I must confess that I have never lifted the hood to take a peek at the engine (my thing is math, not mechanics). So, my car remains a black box to me. I do not blame the engineers who designed my car for my lack of curiosity, and I am aware that the mechanics who work at my garage see my car as a white box. Likewise, the assertion that ML is a black box reveals how some people have chosen to apply ML; it is not a universal truth.
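To make one of the interpretability procedures listed above concrete, here is a minimal sketch of the idea behind MDA (mean decrease accuracy, also known as permutation importance): shuffle one feature at a time and measure how much the model's error grows. The data, the `predict` function, and all names are hypothetical, and the "model" is a fixed function rather than a fitted one, purely to keep the example short.

```python
import random

def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical data: the target depends on feature 0 only; feature 1 is noise.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * x0 for x0, _ in X]

def predict(row):
    # Stand-in for a trained model (an assumption); it ignores feature 1.
    return 3.0 * row[0]

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling the informative feature raises the error, while shuffling the irrelevant one leaves it unchanged, which is the sense in which the procedure opens the box.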
Marcos M. López de Prado, "Machine Learning for Asset Managers" (Elements in Quantitative Finance), Cambridge University Press, Apr 30, 2020.
Understanding Artificial Intelligence and Its Capabilities
The Surprisingly Tricky Problem of Defining AI
We can start to put some clarity around the idea of AI by dividing it into two high-level categories: broad AI, sometimes also called ‘general AI’ or ‘strong AI’, and narrow AI, sometimes called ‘weak AI’. General AI is the stuff of science fiction, the idea of a machine that fuses a human being’s ability to perform a wide variety of tasks, conduct highly generalized reasoning, apply common sense, and solve problems creatively with a computer’s ability to apply rapid computation to vast stores of data. While an exciting, and to some frightening, idea, most experts expect broad AI to remain the exclusive purview of films and novels for at least the next few decades.
By contrast, narrow AI is capable of effectively addressing a highly specific problem, such as playing chess at a high level, or recognizing if there is a cat in a photo. Significant advancements have been made in this realm over the past 50 years, with particularly notable advancements having taken place over the course of the past decade. Given that this text is intended to serve as a practical companion for readers interested in the impact of new technologies on financial services, we will not conduct further explorations of theoretical applications of broad AI, and instead confine our discussion exclusively to narrow AI and its applications.
Why Artificial Intelligence Matters
To put it simply, AI matters because it is the only tool at our disposal that will enable us to detect patterns, derive insights, and drive actions from the truly staggering quantities of data that are being created every day. Even with the most powerful traditional computing techniques, no human or group of humans could effectively analyze these streams of data.
Selected AI Techniques
It may be useful to readers to have a passing familiarity with some of the most commonly used AI techniques. Here we will explore three such techniques: machine learning, neural networks (including deep learning), and genetic and evolutionary algorithms.
1. Machine Learning
Much like AI itself, the term machine learning, which was coined in 1959 by Arthur Samuel, is the subject of many definitional arguments and incorporates under its banner a range of techniques. For example, machine learning comes in supervised, unsupervised, and reinforcement variants, each of which is suited to different tasks.
2. Neural Networks and Deep Learning
AI implementations using a neural network structure draw on insights from the human brain. The basic unit of the human nervous system is the neuron, a simple cell capable of transmitting an electrical signal in response to stimuli. Neurons in the human brain are connected to each other via junctions called synapses. Learning in an individual consists of shifts in the strength of the connections at those synapses. A neural network seeks to replicate aspects of this densely interconnected system of neurons, as well as the process of learning through adjusting the strength of their connections to one another, all within the digital world.
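The idea of learning by adjusting connection strengths can be sketched with a single artificial neuron trained by the classic perceptron rule. This is a deliberately tiny illustration, not a deep network: the OR-gate data, the function names, and the learning rate are all assumptions made for the example.

```python
def step(z):
    """Fire (1) if the weighted input is positive, otherwise stay silent (0)."""
    return 1 if z > 0 else 0

def train_neuron(samples, epochs=10, learning_rate=1.0):
    weights = [0.0, 0.0]  # connection strengths, initially zero
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = step(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - output
            # Strengthen or weaken each connection in proportion to its input.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical training set: the logical OR of two binary inputs.
OR_GATE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_neuron(OR_GATE)
predictions = [
    step(sum(w * x for w, x in zip(weights, inputs)) + bias)
    for inputs, _ in OR_GATE
]
print(weights, bias, predictions)
```

The neuron never receives the rule for OR; it only sees examples and nudges its weights after each mistake until the mistakes stop.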
3. Genetic and Evolutionary Algorithms
The final AI approach that we will consider is the family of genetic and evolutionary algorithms. This approach applies the principles of evolution found in nature to the process of training an AI model by incorporating features such as Darwinian natural selection and the randomness of mutations. Under this process, a population of different possible models is created and then winnowed down through natural selection, with only those that produce the best results surviving. A new generation of algorithms is then created alongside the surviving ones through ‘mutations’ which introduce randomized changes to portions of the code of the surviving models. The natural selection process can then be run again and again to produce a truly best-of-breed algorithm.
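The select-then-mutate loop described above can be sketched in a few lines. As a simplifying assumption, each "model" here is just a pair of numbers (slope and intercept of a line) rather than a piece of code, and the target line, population sizes, and mutation scale are hypothetical choices for illustration.

```python
import random

# Hypothetical task: recover the line y = 2x + 1 from sample points.
DATA = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]

def loss(model):
    """Mean squared error of a (slope, intercept) pair on DATA; lower is fitter."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in DATA) / len(DATA)

def evolve(generations=50, population_size=20, survivors=5, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=loss)        # natural selection: fittest first
        elite = population[:survivors]   # only the best models survive
        history.append(loss(elite[0]))
        children = []
        while len(elite) + len(children) < population_size:
            slope, intercept = rng.choice(elite)
            # Mutation: a small random change to a surviving model.
            children.append((slope + rng.gauss(0, 0.5), intercept + rng.gauss(0, 0.5)))
        population = elite + children    # survivors live alongside their mutants
    return min(population, key=loss), history

best, history = evolve()
print(best, loss(best))
```

Because survivors are carried over unchanged, the best loss can never get worse from one generation to the next. Many practical variants also recombine pairs of survivors (crossover), which this sketch omits.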

Henri Arslanian and Fabrice Fischer, “The Future of Finance: The Impact of FinTech, AI, and Crypto on Financial Services”, Springer International Publishing; Palgrave Macmillan, 2019. https://doi.org/10.1007/978-3-030-14533-0.


Hariom Tatsat, Sahil Puri and Brad Lookabaugh, “Machine Learning and Data Science Blueprints for Finance: From Building Trading Strategies to Robo-Advisors Using Python”, 1st Edition, O'Reilly Media, December, 2020.
Prediction
Classification and Regression
There are two major types of supervised machine learning problems, called classification and regression.
In classification, the goal is to predict a class label, which is a choice from a predefined list of possibilities. In Chapter 1 we used the example of classifying irises into one of three possible species. Classification is sometimes separated into binary classification, which is the special case of distinguishing between exactly two classes, and multiclass classification, which is classification between more than two classes. You can think of binary classification as trying to answer a yes/no question. Classifying emails as either spam or not spam is an example of a binary classification problem. In this binary classification task, the yes/no question being asked would be “Is this email spam?”
For regression tasks, the goal is to predict a continuous number, or a floating-point number in programming terms (or real number in mathematical terms). Predicting a person’s annual income from their education, their age, and where they live is an example of a regression task. When predicting income, the predicted value is an amount, and can be any number in a given range. Another example of a regression task is predicting the yield of a corn farm given attributes such as previous yields, weather, and number of employees working on the farm. The yield again can be an arbitrary number.
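The difference between the two problem types shows up in the output: a label for classification, a number for regression. The sketch below uses one simple technique, k-nearest neighbors, for both. The feature choices and every data value are made up for illustration, and the features are left unscaled to keep the example short.

```python
from collections import Counter

def nearest_neighbors(train_X, query, k):
    """Indices of the k training points closest to query (squared Euclidean distance)."""
    dists = [(sum((a - b) ** 2 for a, b in zip(x, query)), i) for i, x in enumerate(train_X)]
    return [i for _, i in sorted(dists)[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the neighbors' class labels."""
    idx = nearest_neighbors(train_X, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: average of the neighbors' numeric targets."""
    idx = nearest_neighbors(train_X, query, k)
    return sum(train_y[i] for i in idx) / k

# Hypothetical emails described by [exclamation marks, links].
emails = [[8, 5], [7, 6], [9, 4], [0, 1], [1, 0], [1, 1]]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]
label_busy = knn_classify(emails, labels, [6, 5])
label_plain = knn_classify(emails, labels, [0, 0])

# Hypothetical people described by [years of education, age], with annual income.
people = [[12, 25], [16, 30], [18, 40], [12, 50], [20, 45]]
incomes = [30000.0, 52000.0, 75000.0, 41000.0, 90000.0]
income_estimate = knn_regress(people, incomes, [16, 32], k=2)

print(label_busy, label_plain, income_estimate)
```

The classifier can only ever answer with one of the predefined labels, whereas the regressor can return any value between the incomes it has seen.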
Generalization, Overfitting, and Underfitting
In supervised learning, we want to build a model on the training data and then be able to make accurate predictions on new, unseen data that has the same characteristics as the training set that we used. If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set. We want to build a model that is able to generalize as accurately as possible.
Building a model that is too complex for the amount of information we have, as our novice data scientist did, is called overfitting. Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data. On the other hand, if your model is too simple—say, “Everybody who owns a house buys a boat”—then you might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set. Choosing too simple a model is called underfitting.
The trade-off between overfitting and underfitting is illustrated in Figure 2-1.
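The trade-off can also be observed numerically. The sketch below is not the book's example: it uses k-nearest-neighbors regression on synthetic data (a sine curve plus noise, both assumptions) because the single parameter k controls model complexity. A small k follows every training point, while k equal to the training set size predicts one constant for everything.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points whose inputs are closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, data, k):
    """Mean squared error on data of a k-NN model built from train."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def sample(n, rng):
    # Hypothetical data-generating process: sine curve plus Gaussian noise.
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]

rng = random.Random(0)
train, test = sample(40, rng), sample(40, rng)

errors = {}
for k in (1, 5, 40):  # most complex, intermediate, simplest
    errors[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:>2}  train MSE={errors[k][0]:.3f}  test MSE={errors[k][1]:.3f}")
```

With k=1 the training error is exactly zero because every training point is its own nearest neighbor, yet the error on unseen data is not zero: the model has memorized noise. With k=40 the model is too simple and does poorly even on the training set. An intermediate k typically generalizes best, though the exact numbers depend on the random draw.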

Andreas Müller and Sarah Guido, “Introduction to Machine Learning with Python: A Guide for Data Scientists”, O'Reilly Media, 1st edition, November 15, 2016.

Stuart J. Russell and Peter Norvig, “Artificial Intelligence - A Modern Approach”, Fourth Edition, Global Edition, by Pearson Education, 2022.

Current and Future Machine Learning Applications in Finance
Let’s take a look at some promising machine learning applications in finance. The case studies presented in this book cover all the applications mentioned here.
Algorithmic Trading
Algorithmic trading (or simply algo trading) is the use of algorithms to conduct trades autonomously. With origins going back to the 1970s, algorithmic trading (sometimes called Automated Trading Systems, which is arguably a more accurate description) involves the use of automated preprogrammed trading instructions to make extremely fast, objective trading decisions.
Machine learning stands to push algorithmic trading to new levels. Not only can more advanced strategies be employed and adapted in real time, but machine learning–based techniques can offer even more avenues for gaining special insight into market movements.
Most hedge funds and financial institutions do not openly disclose their machine learning–based approaches to trading (for good reason), but machine learning is playing an increasingly important role in calibrating trading decisions in real time.
Portfolio Management and Robo-Advisors
Asset and wealth management firms are exploring potential artificial intelligence (AI) solutions for improving their investment decisions and making use of their troves of historical data.
One example of this is the use of robo-advisors, algorithms built to calibrate a financial portfolio to the goals and risk tolerance of the user. Additionally, they provide automated financial guidance and service to end investors and clients. A user enters their financial goals (e.g., to retire at age 65 with $250,000 in savings), age, income, and current financial assets. The advisor (the allocator) then spreads investments across asset classes and financial instruments in order to reach the user’s goals.
The system then calibrates to changes in the user’s goals and real-time changes in the market, aiming always to find the best fit for the user’s original goals. Robo-advisors have gained significant traction among consumers who do not need a human advisor to feel comfortable investing.
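The recalibration step can be pictured as rebalancing: compare the current holdings with the target allocation and compute the trades that close the gap. The sketch below is a minimal illustration with hypothetical asset classes, amounts, and target weights; a real robo-advisor would derive the targets from the user's goals and risk tolerance and would account for costs and taxes.

```python
def rebalance(holdings, target_weights):
    """Cash amount to buy (+) or sell (-) per asset class to restore the target weights."""
    total = sum(holdings.values())
    return {
        asset: weight * total - holdings.get(asset, 0.0)
        for asset, weight in target_weights.items()
    }

# Hypothetical portfolio that has drifted away from a 60/30/10 target.
holdings = {"equities": 70000.0, "bonds": 25000.0, "cash": 5000.0}
target_weights = {"equities": 0.6, "bonds": 0.3, "cash": 0.1}

trades = rebalance(holdings, target_weights)
for asset, amount in trades.items():
    print(f"{asset}: {amount:+.2f}")
```

Because the target weights sum to one, the trades net to zero: the sale of the overweight asset class funds the purchases of the underweight ones.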
Fraud Detection
Fraud is a massive problem for financial institutions and one of the foremost reasons to leverage machine learning in finance.
Loans/Credit Card/Insurance Underwriting
Underwriting could be described as a perfect job for machine learning in finance, and indeed there is a great deal of worry in the industry that machines will replace a large swath of underwriting positions that exist today.
Automation and Chatbots
Automation is patently well suited to finance. It reduces the strain that repetitive, low-value tasks put on human employees. It tackles the routine, everyday processes, freeing up teams to finish their high-value work. In doing so, it drives enormous time and cost savings.
Adding machine learning and AI into the automation mix adds another level of support for employees. With access to relevant data, machine learning and AI can provide an in-depth data analysis to support finance teams with difficult decisions. In some cases, it may even be able to recommend the best course of action for employees to approve and enact.
Risk Management
Machine learning techniques are transforming how we approach risk management. All aspects of understanding and controlling risk are being revolutionized through the growth of solutions driven by machine learning. Examples range from deciding how much a bank should lend a customer to improving compliance and reducing model risk.
Asset Price Prediction
Asset price prediction is among the most frequently discussed and most sophisticated areas in finance. Predicting asset prices allows one to understand the factors that drive the market and to speculate on asset performance. Traditionally, asset price prediction was performed by analyzing past financial reports and market performance to determine what position to take for a specific security or asset class. However, with a tremendous increase in the amount of financial data, the traditional approaches for analysis and stock-selection strategies are being supplemented with ML-based techniques.
Derivative Pricing
Recent machine learning successes, as well as the fast pace of innovation, indicate that ML applications for derivatives pricing should become widely used in the coming years. The world of Black-Scholes models, volatility smiles, and Excel spreadsheet models should wane as more advanced methods become readily available.
The classic derivative pricing models are built on several impractical assumptions to reproduce the empirical relationship between the underlying input data (strike price, time to maturity, option type) and the price of the derivatives observed in the market. Machine learning methods do not rely on such assumptions; they simply try to estimate a function between the input data and the price, minimizing the difference between the results of the model and the target.
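The contrast can be made concrete with a short sketch. The first function is the standard Black-Scholes formula for a European call, which rests on assumptions such as constant volatility. The second is the quantity a data-driven model would try to minimize: the mean squared difference between its output and the target prices. The sample "market" below is hypothetical and is generated from Black-Scholes itself purely for illustration, and `intrinsic_value` is a deliberately naive stand-in for a candidate model.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, maturity, rate, vol):
    """Black-Scholes price of a European call (no dividends, constant vol and rate)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * normal_cdf(d1) - strike * math.exp(-rate * maturity) * normal_cdf(d2)

def pricing_mse(model, samples):
    """Training objective: mean squared gap between model output and target price."""
    return sum((model(*inputs) - price) ** 2 for inputs, price in samples) / len(samples)

def intrinsic_value(spot, strike, maturity, rate, vol):
    # Naive candidate model (an assumption): ignores time value entirely.
    return max(spot - strike, 0.0)

# Hypothetical targets: Black-Scholes prices for strikes 80, 90, ..., 120.
samples = []
for k in range(80, 121, 10):
    inputs = (100.0, float(k), 1.0, 0.05, 0.2)
    samples.append((inputs, black_scholes_call(*inputs)))

atm_price = black_scholes_call(100.0, 100.0, 1.0, 0.05, 0.2)
baseline_error = pricing_mse(intrinsic_value, samples)
print(f"at-the-money call: {atm_price:.4f}, naive model MSE: {baseline_error:.4f}")
```

The at-the-money call comes out at roughly 10.45. A learned model would replace `intrinsic_value` with a flexible function whose parameters are tuned to push `pricing_mse` toward zero on observed market prices instead of formula-generated ones.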
The faster deployment times achieved with state-of-the-art ML tools are just one of the advantages that will accelerate the use of machine learning in derivatives pricing.
Sentiment Analysis
Sentiment analysis involves the perusal of enormous volumes of unstructured data, such as videos, transcriptions, photos, audio files, social media posts, articles, and business documents, to determine market sentiment. Sentiment analysis is crucial for all businesses in today’s workplace and is an excellent example of machine learning in finance.
The most common use of sentiment analysis in the financial sector is the analysis of financial news, in particular for predicting the behaviors and possible trends of markets. The stock market moves in response to myriad human-related factors, and the hope is that machine learning will be able to replicate and enhance human intuition about financial activity by discovering new trends and telling signals. However, many of the future applications of machine learning will be in understanding social media, news trends, and other data sources related to predicting the sentiments of customers toward market developments. They will not be limited to predicting stock prices and trades.
Trade Settlement
Trade settlement is the process of transferring securities into the account of a buyer and cash into the seller’s account following a transaction of a financial asset. Although the majority of trades are settled automatically, with little or no interaction by human beings, about 30% of trades need to be settled manually. Machine learning can not only identify the reason for failed trades but also analyze why the trades were rejected, provide a solution, and predict which trades may fail in the future. What usually would take a human being five to ten minutes to fix, machine learning can do in a fraction of a second.
Money Laundering
A United Nations report estimates that the amount of money laundered worldwide per year is 2%–5% of global GDP. Machine learning techniques can analyze internal, publicly existing, and transactional data from a client’s broader network in an attempt to spot money laundering signs.

Hariom Tatsat, Sahil Puri and Brad Lookabaugh, “Machine Learning and Data Science Blueprints for Finance: From Building Trading Strategies to Robo-Advisors Using Python”, 1st Edition, O'Reilly Media, December, 2020.
