Causal AI, exploring the integration of causal reasoning into machine learning
This article provides a practical introduction to the potential of causal graphs.
It's aimed at anyone who wants to understand more about:
- What causal graphs are and how they work
- A worked case study in Python illustrating how to build causal graphs
- How they compare to ML
- The key challenges and future considerations
The full notebook can be found here:
Causal graphs help us disentangle causes from correlations. They are a key part of the causal inference/causal ML/causal AI toolbox and can be used to answer causal questions.
Often referred to as a DAG (directed acyclic graph), a causal graph contains nodes and edges; edges link nodes that are causally related.
There are two ways to determine a causal graph:
- Expert domain knowledge
- Causal discovery algorithms
For now, we will assume we have expert domain knowledge to determine the causal graph (we will cover causal discovery algorithms further down the line).
The objective of ML is to classify or predict as accurately as possible given some training data. There is no incentive for an ML algorithm to ensure the features it uses are causally linked to the target, and no guarantee that the direction (positive/negative effect) and strength of each feature will align with the true data-generating process. ML won't take into account the following situations:
- Spurious correlations: two variables have a spurious correlation when they share a common cause, e.g. high temperatures increasing both the number of ice cream sales and shark attacks (see the sketch after this list).
- Confounders: a variable affects both your treatment and outcome, e.g. demand affecting how much we spend on marketing and how many new customers sign up.
- Colliders: a variable is affected by two independent variables, e.g. Quality of customer care -> User satisfaction <- Size of company
- Mediators: two variables are (indirectly) linked through a mediator, e.g. Regular exercise -> Cardiovascular fitness (the mediator) -> Overall health
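To make the first of these pitfalls concrete, here is a minimal simulation sketch (the variable names are illustrative, not part of the case study): two variables with no causal link between them become correlated purely because they share a common cause.
import numpy as np

np.random.seed(0)

# Common cause: daily temperature
temperature = np.random.normal(loc=25, scale=5, size=1000)

# Ice cream sales and shark attacks both depend on temperature, not on each other
ice_cream_sales = 10 * temperature + np.random.normal(loc=0, scale=20, size=1000)
shark_attacks = 0.5 * temperature + np.random.normal(loc=0, scale=2, size=1000)

# A clear positive correlation despite no causal link between the two variables
print(np.corrcoef(ice_cream_sales, shark_attacks)[0, 1])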
Because of these complexities and the black-box nature of ML, we can't be confident in its ability to answer causal questions.
Given a known causal graph and observed data, we can train a structural causal model (SCM). An SCM can be thought of as a series of causal models, one per node. Each model uses one node as a target and its direct parents as features. If the relationships in our observed data are linear, an SCM will be a series of linear equations that could be modelled by a series of linear regression models. If the relationships are non-linear, they could be modelled with a series of boosted trees.
The key difference to traditional ML is that an SCM models causal relationships and accounts for spurious correlations, confounders, colliders and mediators.
It is common to use an additive noise model (ANM) for each non-root node (meaning it has at least one parent). This allows us to use a range of machine learning algorithms (plus a noise term) to estimate each non-root node:
Y := f(X) + N
Root nodes can be modelled using a stochastic model to describe their distribution.
An SCM can be seen as a generative model, as it can generate new samples of data; this allows it to answer a range of causal questions. It generates new data by sampling from the root nodes and then propagating the data through the graph.
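To make this concrete, here is a minimal hand-rolled sketch (no causal library, illustrative names only) of a two-node SCM in which the root node is drawn from a stochastic model and the non-root node follows the additive noise form Y := f(X) + N:
import numpy as np

np.random.seed(0)
n_samples = 1000

# Root node: described by a stochastic model (here, simply a normal distribution)
X = np.random.normal(loc=0, scale=1, size=n_samples)

# Non-root node: additive noise model Y := f(X) + N, with a linear f for illustration
def f(x):
    return 2.0 * x

N = np.random.normal(loc=0, scale=0.5, size=n_samples)
Y = f(X) + N  # generate new samples of Y by propagating X through the graph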
The value of an SCM is that it allows us to answer causal questions by calculating counterfactuals and simulating interventions:
- Counterfactuals: using historically observed data to calculate what would have happened to y if we had changed x, e.g. what would have happened to the number of customers churning if we had reduced call waiting time by 20% last month?
- Interventions: very similar to counterfactuals (and often used interchangeably), but interventions simulate what would happen in the future, e.g. what will happen to the number of customers churning if we reduce call waiting time by 20% next year? (see the sketch after this list for how each looks in code)
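To give a flavour of how each question looks in code, here is a minimal sketch using the gcm module from the dowhy package (introduced properly later in this article); it assumes a fitted causal_model and an observed DataFrame df already exist:
from dowhy import gcm

# Intervention: simulate future samples where call waiting time is reduced by 20%
df_intervention = gcm.interventional_samples(
    causal_model,
    {'Call waiting time': lambda x: x * 0.80},
    num_samples_to_draw=1000)

# Counterfactual: replay the historically observed data under the same change
df_counterfactual = gcm.counterfactual_samples(
    causal_model,
    {'Call waiting time': lambda x: x * 0.80},
    observed_data=df)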
There are a number of KPIs that the customer service team monitors. One of these is call waiting times. Increasing the number of call centre staff will decrease call waiting times.
But how will decreasing call waiting time impact customer churn levels? And will this offset the cost of additional call centre staff?
The Data Science team is asked to build and evaluate the business case.
The population of interest is customers who make an inbound call. The following time-series data is collected daily:
In this example, we use time-series data, but causal graphs can also work with customer-level data.
In this example, we use expert domain knowledge to determine the causal graph.
# Create node lookup for channels
node_lookup = {0: 'Demand',
               1: 'Call waiting time',
               2: 'Call abandoned',
               3: 'Reported problems',
               4: 'Discount sent',
               5: 'Churn'
}

total_nodes = len(node_lookup)
# Create adjacency matrix - this is the base for our graph
graph_actual = np.zeros((total_nodes, total_nodes))
# Create graph using expert domain knowledge
graph_actual[0, 1] = 1.0 # Demand -> Call waiting time
graph_actual[0, 2] = 1.0 # Demand -> Call abandoned
graph_actual[0, 3] = 1.0 # Demand -> Reported problems
graph_actual[1, 2] = 1.0 # Call waiting time -> Call abandoned
graph_actual[1, 5] = 1.0 # Call waiting time -> Churn
graph_actual[2, 3] = 1.0 # Call abandoned -> Reported problems
graph_actual[2, 5] = 1.0 # Call abandoned -> Churn
graph_actual[3, 4] = 1.0 # Reported problems -> Discount sent
graph_actual[3, 5] = 1.0 # Reported problems -> Churn
graph_actual[4, 5] = 1.0 # Discount sent -> Churn
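It can be worth sanity-checking that the adjacency matrix really does encode a DAG (no cycles); a minimal check using networkx (assumed imported as nx) might look like this:
# Convert the adjacency matrix to a directed graph and verify it is acyclic
g = nx.from_numpy_array(graph_actual, create_using=nx.DiGraph)
g = nx.relabel_nodes(g, node_lookup)
assert nx.is_directed_acyclic_graph(g), 'The causal graph must not contain cycles'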
Next, we need to generate data for our case study.
We want to generate some data that will allow us to compare calculating counterfactuals using causal graphs vs ML (to keep things simple, ridge regression).
As we identified the causal graph in the last section, we can use this knowledge to create a data-generating process.
def data_generator(max_call_waiting, inbound_calls, call_reduction):
    '''
    A data-generating function that has the flexibility to reduce the value of node 1 (Call waiting time) - this enables us to calculate ground-truth counterfactuals

    Args:
        max_call_waiting (int): Maximum call waiting time in seconds
        inbound_calls (int): Total number of inbound calls (observations in data)
        call_reduction (float): Reduction to apply to call waiting time

    Returns:
        DataFrame: Generated data
    '''
    df = pd.DataFrame(columns=node_lookup.values())

    df[node_lookup[0]] = np.random.randint(low=10, high=max_call_waiting, size=(inbound_calls)) # Demand
    df[node_lookup[1]] = (df[node_lookup[0]] * 0.5) * (call_reduction) + np.random.normal(loc=0, scale=40, size=inbound_calls) # Call waiting time
    df[node_lookup[2]] = (df[node_lookup[1]] * 0.5) + (df[node_lookup[0]] * 0.2) + np.random.normal(loc=0, scale=30, size=inbound_calls) # Call abandoned
    df[node_lookup[3]] = (df[node_lookup[2]] * 0.6) + (df[node_lookup[0]] * 0.3) + np.random.normal(loc=0, scale=20, size=inbound_calls) # Reported problems
    df[node_lookup[4]] = (df[node_lookup[3]] * 0.7) + np.random.normal(loc=0, scale=10, size=inbound_calls) # Discount sent
    df[node_lookup[5]] = (0.10 * df[node_lookup[1]]) + (0.30 * df[node_lookup[2]]) + (0.15 * df[node_lookup[3]]) + (-0.20 * df[node_lookup[4]]) # Churn

    return df
# Generate data
np.random.seed(999)
df = data_generator(max_call_waiting=600, inbound_calls=10000, call_reduction=1.00)

sns.pairplot(df)
We now have an adjacency matrix that represents our causal graph and some data. We use the gcm module from the dowhy Python package to train an SCM.
It's important to think about what causal mechanism to use for the root and non-root nodes. If you look at our data generator function, you will see that all of the relationships are linear, so choosing ridge regression should be sufficient.
# Setup graph
graph = nx.from_numpy_array(graph_actual, create_using=nx.DiGraph)
graph = nx.relabel_nodes(graph, node_lookup)

# Create SCM
causal_model = gcm.InvertibleStructuralCausalModel(graph)
causal_model.set_causal_mechanism('Demand', gcm.EmpiricalDistribution()) # Root node
causal_model.set_causal_mechanism('Call waiting time', gcm.AdditiveNoiseModel(gcm.ml.create_ridge_regressor())) # Non-root node
causal_model.set_causal_mechanism('Call abandoned', gcm.AdditiveNoiseModel(gcm.ml.create_ridge_regressor())) # Non-root node
causal_model.set_causal_mechanism('Reported problems', gcm.AdditiveNoiseModel(gcm.ml.create_ridge_regressor())) # Non-root node
causal_model.set_causal_mechanism('Discount sent', gcm.AdditiveNoiseModel(gcm.ml.create_ridge_regressor())) # Non-root node
causal_model.set_causal_mechanism('Churn', gcm.AdditiveNoiseModel(gcm.ml.create_ridge_regressor())) # Non-root node

gcm.fit(causal_model, df)
You could also use the auto assignment function to automatically assign the causal mechanisms instead of manually assigning them, as sketched below.
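A minimal sketch of the auto-assignment approach (under the same graph and data as above) might look like this:
# Let gcm choose an appropriate causal mechanism for each node based on the data
causal_model = gcm.InvertibleStructuralCausalModel(graph)
gcm.auto.assign_causal_mechanisms(causal_model, df)
gcm.fit(causal_model, df)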
For more information on the gcm package, see the docs:
We also use ridge regression to create a baseline comparison. Looking back at the data generator, we can see that it correctly estimates the coefficients for each variable. However, in addition to directly influencing churn, call waiting time also indirectly influences churn through abandoned calls, reported problems and discounts sent.
When it comes to estimating counterfactuals, it will be interesting to see how the SCM compares to ridge regression.
# Ridge regression
y = df['Churn'].copy()
X = df.iloc[:, 1:-1].copy()
model = RidgeCV()
model = model.fit(X, y)
y_pred = model.predict(X)

print(f'Intercept: {model.intercept_}')
print(f'Coefficient: {model.coef_}')
# Ground truth: [0.10 0.30 0.15 -0.20]
Before we move on to calculating counterfactuals using causal graphs and ridge regression, we need a ground-truth benchmark. We can use our data generator to create counterfactual samples in which call waiting time has been reduced by 20%.
We couldn't do this with real-world problems, but this method allows us to assess how effective the causal graph and ridge regression are.
# Set call reduction to 20%
reduce = 0.20
call_reduction = 1 - reduce

# Generate counterfactual data
np.random.seed(999)
df_cf = data_generator(max_call_waiting=600, inbound_calls=10000, call_reduction=call_reduction)
We can now estimate what would have happened if we had decreased the call waiting time by 20%, using our 3 methods:
- Ground truth (from the data generator)
- Ridge regression
- Causal graph
We see that ridge regression significantly underestimates the impact on churn, whilst the causal graph is very close to the ground truth.
# Ground truth counterfactual
ground_truth = round((df['Churn'].sum() - df_cf['Churn'].sum()) / df['Churn'].sum(), 2)

# Causal graph counterfactual
df_counterfactual = gcm.counterfactual_samples(causal_model, {'Call waiting time': lambda x: x*call_reduction}, observed_data=df)
causal_graph = round((df['Churn'].sum() - df_counterfactual['Churn'].sum()) / (df['Churn'].sum()), 3)

# Ridge regression counterfactual
ridge_regression = round((df['Call waiting time'].sum() * 1.0 * model.coef_[0] - (df['Call waiting time'].sum() * call_reduction * model.coef_[0])) / (df['Churn'].sum()), 3)
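Printing the three estimates side by side makes the comparison easy to read (the exact values depend on the random seed):
# Compare the three estimates of the reduction in churn
print(f'Ground truth: {ground_truth}')
print(f'Causal graph: {causal_graph}')
print(f'Ridge regression: {ridge_regression}')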
This was a simple example to start you thinking about the power of causal graphs.
For more complex situations, there are a number of challenges that would need some consideration:
- What assumptions are made, and what is the impact of these being violated?
- What if we don't have the expert domain knowledge to identify the causal graph?
- What if there are non-linear relationships?
- How damaging is multicollinearity?
- What if some variables have lagged effects?
- How can we deal with high-dimensional datasets (lots of variables)?
All of these points will be covered in future blogs.
If you're interested in learning more about causal AI, I highly recommend the following resources:
“Meet Ryan, a seasoned Lead Data Scientist with a specialised focus on employing causal techniques within business contexts, spanning Marketing, Operations, and Customer Service. His proficiency lies in unravelling the intricacies of cause-and-effect relationships to drive informed decision-making and strategic improvements across diverse organisational functions.”