Machine Learning: the “So What” of Crypto Data Ecosystems

Davanti Research
Sep 30, 2024

The past few years have seen significant advances in high-performance tooling and frameworks for extracting archive data from ETH and ETH-compatible chains. Frameworks such as Cryo and Alloy / Foundry (all extensions and buildouts from the RETH community), together with fast dataframe libraries such as Polars, have catalyzed a convergence of the ETH data community around high-performance data tooling.

These frameworks have streamlined crypto data engineering, but what is the end game? What are the killer apps that superior data tooling can enable? We at Davanti believe that machine learning / AI models and feature stores built on a mature data ecosystem are a large part of the “so what” and end-game narrative. Predictive modeling and inference that produce actionable guidance for consumers and decision makers are the catalyst for better applications and wider adoption.

Below, we build out a full-stack ML engineering pipeline on UniswapV2 data. Of course, any model is only as good as its underlying data, so we show how we extracted the log data with the tools above (i.e. Cryo). We also introduce Shadow, a product that potentially changes the game for systematic, context-rich on-chain data engineering.

Building Predictive Models on UniswapV2

We aim to build a simple machine learning pipeline to (a) predict future long-term price action within liquidity pools, and (b) discover features that describe systematic drivers of risk. Studies like this have substantial potential applications in quantitative finance, including:

  • Systematic trading strategies
  • Risk and attribution of returns
  • Factors for quantitative portfolio management
  • Correlations with macroeconomic indicators and traditional assets

Through this buildout, we explore requirements for a production-ready ML operational pipeline, from data engineering to systematic feature generation, to model training and model performance monitoring. As any model is constrained by the quality of data, that’s where we’ll start.

On-Chain Data Extraction

We ran RETH archive nodes locally and used Cryo to extract UniswapV2 Swap log events.

  • RETH is an open-source, Rust-based ETH node implementation created by the Paradigm team. We’ve found the RETH node to be the easiest to set up and run locally, with a total archive database size of < 2TB at the time of this case study. With full archive snapshots from Merkle.io, archive and full nodes can be set up quickly without having to sync from the genesis block.
  • Cryo is a high-performance data extraction framework, built on Ethers.rs (again, from Paradigm), that can extract archive data from any ETH node (not necessarily RETH). It leverages parallelization, Parquet storage and Polars to execute as fast as the ETH node will allow.

This combination of a local RETH node and Cryo enabled us to extract Swap (and other) log events at lightning-fast speed. Individual event types are extracted via Cryo, then grouped by transaction hash and combined. These include simple user-initiated swaps as well as more complicated, multi-leg swap transactions. Token metadata is extracted separately, since tokens are created before their corresponding Uniswap liquidity pools. The model can be trained on a historical dataset, while online inference / predictions run as new blocks are introduced.
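As a concrete illustration, here is a minimal Polars sketch of that grouping step, starting from the Parquet files a Cryo logs extraction produces. The file pattern, column names and hex encoding are assumptions about a typical extract and should be adjusted to the actual schema:

```python
import polars as pl

# topic0 of the UniswapV2 Pair `Swap` event:
# keccak256("Swap(address,uint256,uint256,uint256,uint256,address)")
SWAP_TOPIC0 = "0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822"

# Read the Parquet files produced by a cryo `logs` extraction.
# The path pattern and column names below assume a typical extract with
# hex-encoded fields; adjust them to match your own schema.
logs = pl.scan_parquet("data/ethereum__logs__*.parquet")

swaps = (
    logs
    .filter(pl.col("topic0") == SWAP_TOPIC0)  # keep only Swap events
    .select(["block_number", "transaction_hash", "address", "data"])
    .rename({"address": "pair_address"})
)

# One transaction can touch several pools (multi-leg swaps),
# so collect all swap legs per transaction hash.
per_tx = (
    swaps
    .group_by("transaction_hash")
    .agg(
        pl.col("pair_address").alias("pools_touched"),
        pl.len().alias("n_swap_legs"),
        pl.col("block_number").first(),
    )
)

print(per_tx.collect().head())
```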

The diagram below illustrates the extraction and merge process at a high level; each node in the DAG represents a single step in building the final golden transactional dataset.

This final dataset is the primary input to the ML pipeline steps that follow, i.e. [a] feature generation / feature store, [b] model training / inference, and [c] feature and model monitoring. The arrival of a new block kicks off the components of the DAG below.

Data pipeline using RETH + Cryo (+ DAG framework)

Shadow: Context-Rich Log Events

The RETH + Cryo configuration, while streamlined and performant, is still in many ways similar to a traditional setup for on-chain data engineering / data extraction.

Shadow, however, changes the landscape for on-chain data engineering. Built on Reth Execution Extensions (ExEx), Shadow lets engineers modify and extend on-chain contracts to emit new events with richer contextual information, drawing on state available at the time of a transaction’s execution. Previously, data engineers had to extract that additional information separately and merge it (where possible); if the original contract never emitted the contextual information, it simply was not available for use.

To learn more about how Shadow works behind the scenes, see this blog post as well as a recent technical talk.

The revised data flow (below) using Shadow RETH significantly simplifies the data engineering pipeline compared to ex-post extraction of log events. The new Shadow Trade event for UniswapV2 is emitted inline with each transaction execution, so we no longer need to maintain an intricate DAG to extract, compress and merge the various logs / events. Shadow can also stream log information to a growing set of cloud databases.

With some basic Solidity and a study of the structure of the contract of interest, an engineer can now write, configure and deploy a full end-to-end streaming ETL into their cloud database.

Data pipeline using Shadow-RETH

ML Feature Store / Model Buildout

With the final datasets and data extraction processes in place, we have the foundations to build systematic predictive models. We’ll briefly discuss feature engineering and feature generation on the base dataset, and then turn to model evaluation and interpretability.

Calculation of Price Impact to Assess Generalizable Targets

Developing an accurate and generalizable predictive model requires us to properly quantify price impact.

Price impact is the change in an asset’s market price caused by the trade itself. Its magnitude is a function of the trade’s size relative to the available liquidity. Swapping a large amount of tokens in a low-liquidity pool can cause significant dislocations in the executed price as the exchange formula adjusts to the available liquidity. A predictive model must quantify that price impact to prevent leakage: otherwise the trade’s own mechanical impact bleeds into the target the model is trying to predict.

Using the token reserves before the trade and the amount received from the swap, we can isolate the price impact and compute forward price changes on a generalizable basis using the constant-product relationship:

amount_out = reserve_out - (k / (reserve_in + amount_in))

where reserve_out is the reserve of the token purchased, reserve_in is the reserve of the token sold, amount_in is the amount of the token sold, and k = reserve_in * reserve_out is the constant product.
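In code, that relationship (and the exact UniswapV2 variant, which also charges a 0.3% fee on the input amount) looks roughly like the sketch below; the reserve and trade values in the example are purely illustrative:

```python
def amount_out_no_fee(reserve_in: float, reserve_out: float, amount_in: float) -> float:
    """Constant-product output ignoring the 0.3% LP fee (the formula above)."""
    k = reserve_in * reserve_out
    return reserve_out - k / (reserve_in + amount_in)

def amount_out_univ2(reserve_in: float, reserve_out: float, amount_in: float) -> float:
    """UniswapV2 getAmountOut, including the 0.3% fee taken on the input side."""
    amount_in_with_fee = amount_in * 997
    return (amount_in_with_fee * reserve_out) / (reserve_in * 1000 + amount_in_with_fee)

def price_impact(reserve_in: float, reserve_out: float, amount_in: float) -> float:
    """Relative gap between the pre-trade mid price and the realized execution price."""
    mid_price = reserve_out / reserve_in               # price before the trade
    out = amount_out_univ2(reserve_in, reserve_out, amount_in)
    exec_price = out / amount_in                       # realized price of this trade
    return 1.0 - exec_price / mid_price

# Example: selling 10 ETH into a pool holding 1,000 ETH / 2,000,000 USDC
print(round(price_impact(1_000.0, 2_000_000.0, 10.0), 4))  # ~0.0128, i.e. about 1.3% impact
```

Keeping the no-fee form and the exact UniswapV2 form side by side makes it explicit how much of an observed price change is mechanical impact versus fee.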

Feature Engineering / Feature Generation

The data engineering pipelines produced the golden datasets, but training generalizable models on them requires additional pipelines that encode parameter choices and constraints, as well as the feature definitions fed into the target machine learning model.

At a high level, this case study involved the following steps:

  • Define transaction types on WETH contract
  • Merge pool features
  • Filter out stablecoin LPs and sell transactions
  • Filter MEV and multi-leg LP (i.e. aggregation) transactions
  • Filter out small transactions and small pools / pools with low liquidity
  • Feature engineering: create longitudinal features for LPs (i.e. price/outcomes)
  • Feature engineering: create categorical features for wallets
  • Training / evaluation splits

The feature engineering and training steps are expressed as a directed acyclic graph (i.e. DAG, below). For this case study we utilized Dagster (which has both local and hosted options), but any DAG execution framework will work (e.g. Airflow, Kubeflow, etc). The main takeaway is the need to execute a series of repeatable / systematic steps in the flow.
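As a rough illustration of what such a flow can look like in Dagster, the sketch below defines a few software-defined assets mirroring the steps above. The asset names, column names and filter thresholds are hypothetical placeholders, not the actual pipeline:

```python
from dagster import Definitions, asset
import polars as pl

@asset
def golden_swaps() -> pl.DataFrame:
    """Merged swap/pool dataset produced by the data-engineering pipeline (path assumed)."""
    return pl.read_parquet("data/golden_swaps.parquet")

@asset
def filtered_swaps(golden_swaps: pl.DataFrame) -> pl.DataFrame:
    """Drop MEV, multi-leg, and low-liquidity transactions (column names illustrative)."""
    return golden_swaps.filter(
        (~pl.col("is_mev"))
        & (~pl.col("is_multi_leg"))
        & (pl.col("pool_liquidity_eth") > 10)
    )

@asset
def lp_features(filtered_swaps: pl.DataFrame) -> pl.DataFrame:
    """Longitudinal price/outcome features per liquidity pool."""
    return filtered_swaps.group_by("pair_address").agg(
        pl.col("price_impact").mean().alias("avg_price_impact"),
        pl.len().alias("n_swaps"),
    )

# Dagster wires the DAG from the asset dependencies (function arguments) above.
defs = Definitions(assets=[golden_swaps, filtered_swaps, lp_features])
```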

This DAG was executed for a batch of blocks to produce the training set for a single model. However, one could imagine a system that runs continuously, compiling / updating features as new blocks are produced and processed by the data engineering pipeline. The ML feature engineering and model training pipeline is an extension of the data pipeline, optimized for systematic machine learning model buildout.
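A continuously running variant could be as simple as a loop that watches the node for new blocks and triggers a feature refresh. The sketch below, using web3.py against the local node, is one possible shape; refresh_features is a stand-in for whatever mechanism actually re-runs the DAG:

```python
import time
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # local RETH node

def refresh_features(block_number: int) -> None:
    """Placeholder: trigger the feature-engineering DAG up through this block."""
    print(f"updating features through block {block_number}")

last_seen = w3.eth.block_number
while True:
    head = w3.eth.block_number
    if head > last_seen:
        refresh_features(head)   # new block(s): kick off the incremental update
        last_seen = head
    time.sleep(12)               # roughly one Ethereum slot
```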

Machine Learning Model Evaluation and Interpretability

The longitudinal LP features and characteristics produced by the feature engineering / generation pipeline are the inputs to the predictive ML model. Model evaluation produces both predictions and the strength of each feature utilized in training. These feature strengths can be used in model inference or as factors in risk and attribution models.

The outputs below from the training process show the marginal importance of the top features (i.e. the features most important to predictive ability). The specific model algorithm utilized (e.g. linear models, categorical models, ensemble methods) matters less than the outcomes and feature importance scores, which can be utilized in inference and factor analysis.
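For reference, a minimal sketch of producing such importance scores with an off-the-shelf gradient-boosted model is shown below; the dataset path, column names and target are assumptions, and any comparable learner could be substituted:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# One row per LP/observation from the feature-engineering DAG, with the forward
# (price-impact-adjusted) return as the target. Assumes numeric/encoded features.
features = pd.read_parquet("data/lp_features.parquet")
target = "forward_return"
X = features.drop(columns=[target])
y = features[target]

# Time-ordered split: no shuffling, so the holdout is strictly later data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = GradientBoostingRegressor(n_estimators=300, max_depth=3)
model.fit(X_train, y_train)

# Rank features by their marginal importance to the model's predictions.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(10))
print("holdout R^2:", model.score(X_test, y_test))
```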

We can see how LP characteristics interact with engineered time-series features in the predictive machine learning model:

We can also examine how the interactions between LP characteristics and time-series features play out within the model:

Ultimately, the most significant features found by the model include LP characteristics (i.e. reserve values) as well as wallet characteristics. These attributes can be verified on Etherscan or any block explorer of your choice.

Conclusions: So What?

The data pipeline, ML pipeline and trained model form a full-stack solution specific to UniswapV2 LPs, but they illustrate what is possible with robust machine learning and feature engineering pipelines. The same (or a similar) study can be produced for other DEX or lending protocols to generate both systematic insights and features that can be utilized cross-functionally.

Efficient and performant extraction of protocol data is only the beginning. We imagine robust machine learning engineering ecosystems enabling the following:

  • Parallel features across different protocols, all accessible from a network of feature stores
  • An ecosystem to define new features on top of golden on-chain datasets
  • Overlay of real-world time series (e.g.: macroeconomic or hyperlocal data) as connected features
  • Systematic machine learning pipelines executed on standardized on-chain cross-functional features
  • A marketplace for monetizing protocol-specific datasets and ML feature sets

Machine learning / predictive modeling will catalyze the next generation of innovations as well as greater interest and onboarding of customers and decision makers. This “so what” extends from a mature and vibrant data ecosystem and will lead to both increased adoption and real-world relevance.

Davanti (davanti.xyz) is a research company focused on AI and machine learning for decentralized financial markets.
