Credit Risk Scoring with Machine Learning: A B2B Guide

Machine learning-based credit risk scoring lets you assign a default probability to each B2B customer or counterparty, automatically, before you extend trade credit, renew a contract, or issue a policy. When your data is usable, it consistently outperforms manual review and rule-based scorecards.

The condition is important. A credit scoring model trained on too few defaults, or on data that does not truly reflect payment behavior, will give you false confidence. This guide explains what works, what does not, and what the EU AI Act now requires.

Note on scope: this article covers B2B solvency and default risk (trade credit, insurance, supplier risk). If you are looking to rank prospects by conversion likelihood, that is a different problem covered in our article on AI lead scoring from MQL to SQL. The models, features, and objectives are distinct.

How machine learning credit risk scoring actually works

The core task is a binary classification problem: given a set of features describing a customer, predict whether they will default or pay significantly late within a defined horizon. Default is typically defined as 90 days past due (DPD) or more.

In practice, the pipeline unfolds in five steps.

1. Data collection: payment history, financial ratios, external signals
2. Label definition: define "default" (90 DPD, write-off, court filing)
3. Feature engineering: DPD trends, DSO, sector dummies, financial health
4. Model training: XGBoost / LightGBM with cross-validation
5. Scoring and explainability: SHAP values per customer, score into your ERP or CRM

What machine learning adds over a classical scorecard is the ability to detect non-linear interactions. A customer with a high DSO (days sales outstanding) who belongs to a resilient sector and whose payment velocity is actually improving is very different from one with the same DSO and a deteriorating trend. A gradient boosting model captures this. A linear model does not.

From the field

"The biggest surprise for credit teams is always the same," says Anas Rabhi, founder of Tensoria. "The most predictive feature is rarely the one they thought it was. Payment velocity trend over the last three months consistently outranks the current DSO in terms of predictive power. A customer paying slowly but accelerating is far safer than one paying at terms but slowing down."

What data do you need to assess customer solvency with machine learning

The minimum viable dataset for a reliable credit scoring model has three components. Skipping any of them forces you into rule-based approximations, which are still useful but less discriminant.

Internal payment data (non-negotiable)

Data field	Minimum requirement	What it enables
Invoice history	2 to 3 years, per customer	DPD distribution, seasonal payment patterns
Payment dates	Actual vs. due date per invoice	Computes days past due (DPD) at invoice level
Default labels	At least 50 to 100 confirmed defaults	The training target without which the model cannot learn
Customer identifier	SIRET, VAT number, or unique ID	Allows joining with external sources
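
The invoice-level fields in the table above are enough to derive both DPD and a first default label. The following is a minimal sketch, assuming hypothetical invoice tuples and the 90 DPD default definition used in this guide; the customer IDs, dates, and the `as_of` cutoff are illustrative, not real data.

```python
from datetime import date
from typing import Optional

DEFAULT_DPD_THRESHOLD = 90  # "90 DPD" default definition used in this guide


def days_past_due(due: date, paid: Optional[date], as_of: date) -> int:
    """DPD for one invoice; unpaid invoices are measured at the as_of date."""
    settled = paid if paid is not None else as_of
    return max((settled - due).days, 0)


# Hypothetical invoices: (customer_id, due_date, payment_date or None if unpaid)
invoices = [
    ("C001", date(2024, 1, 31), date(2024, 2, 10)),
    ("C001", date(2024, 2, 29), date(2024, 3, 5)),
    ("C002", date(2024, 1, 31), None),
    ("C002", date(2024, 2, 29), date(2024, 3, 1)),
]
as_of = date(2024, 6, 30)  # observation cutoff (assumption)

# Worst DPD observed per customer across all invoices.
worst_dpd = {}
for customer_id, due, paid in invoices:
    dpd = days_past_due(due, paid, as_of)
    worst_dpd[customer_id] = max(worst_dpd.get(customer_id, 0), dpd)

# Binary training target: 1 if the customer ever reached the default threshold.
default_label = {c: int(d >= DEFAULT_DPD_THRESHOLD) for c, d in worst_dpd.items()}

print(worst_dpd)
print(default_label)
```

In a real project the label would also incorporate write-offs and court filings, and features would be computed only from data available before the label window to avoid leakage.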

External financial data (strongly recommended)

Balance sheet ratios: debt-to-equity, current ratio, interest coverage. Available via providers like Altares, Ellisphere, Bureau van Dijk (Orbis), or Infogreffe filings.
Sector classification: NAF/NACE code. Sector default rates vary by a factor of 3 to 5 across industries.
Legal events: court filings, payment injunctions (injonctions de payer), safeguard proceedings. These are among the most powerful leading indicators, typically appearing 60 to 90 days before actual default.
Company age and size: employee count, years in operation. Young, small companies in cyclical sectors carry structurally higher risk.

Behavioral and soft signals (optional, high marginal value)

Dispute frequency: customers who regularly dispute invoices often have cash flow stress.
Communication responsiveness: declining response time to payment reminders is an early warning signal extractable from your CRM logs.
Order pattern changes: sudden drops in order volume or average basket size.

When machine learning is not yet worth it

If your portfolio has fewer than 50 confirmed defaults in the training window, a supervised model will be statistically unreliable. The class imbalance (typically 2 to 5% default rate) combined with a low absolute count makes the model sensitive to noise rather than signal. In this case, a well-structured rule-based scorecard combined with simple logistic regression is a more honest starting point, and can be upgraded once your labeled dataset grows.
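
This go/no-go rule is simple enough to encode as a first check during a data audit. The sketch below assumes a hypothetical helper, `recommend_approach`, and uses the 50-default threshold stated above; the portfolio sizes are made-up examples.

```python
MIN_CONFIRMED_DEFAULTS = 50  # threshold stated in this guide


def recommend_approach(n_customers: int, n_defaults: int) -> dict:
    """Return the default rate and a coarse recommendation for the portfolio."""
    default_rate = n_defaults / n_customers if n_customers else 0.0
    if n_defaults >= MIN_CONFIRMED_DEFAULTS:
        approach = "supervised_ml"
    else:
        approach = "rule_based_scorecard_plus_logistic_regression"
    return {
        "n_defaults": n_defaults,
        "default_rate": round(default_rate, 4),
        "approach": approach,
    }


# Hypothetical portfolios
small = recommend_approach(n_customers=1200, n_defaults=31)
large = recommend_approach(n_customers=4000, n_defaults=140)

print(small)
print(large)
```

The check is driven by the absolute number of defaults, not by the number of customers, which matches the reasoning in the paragraph above.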

Which algorithms to use for a B2B credit scoring model

The choice of algorithm matters less than the quality of your features and labels. That said, some algorithms are consistently better suited to structured credit data.

Logistic regression with scorecard binning

Baseline / Regulatory reference

The statistical backbone of traditional credit scoring (Basel II/III scorecards). Fully interpretable, easy to audit, and accepted by regulators as a reference. Gini coefficient typically 0.35 to 0.50 on well-structured data.

Best for: regulatory documentation, small datasets, first POC
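
One common way to implement scorecard binning is the weight of evidence (WoE) transform, where each bin of a variable is replaced by the log ratio of its share of good payers to its share of defaulters. The sketch below assumes that technique (this guide does not prescribe a specific binning method) and uses hypothetical bin counts.

```python
import math


def woe_table(bins):
    """bins: {label: (n_good, n_bad)} -> ({label: WoE}, information value).

    Assumes every bin has at least one good and one bad customer;
    empty bins would need merging or smoothing first.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good
        pct_bad = bad / total_bad
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv


# Hypothetical counts per bin of "worst DPD over the last 12 months"
dpd_bins = {
    "0-5 days": (600, 6),
    "6-30 days": (300, 14),
    "31+ days": (100, 20),
}

woe, iv = woe_table(dpd_bins)
for label, value in woe.items():
    print(f"{label}: WoE = {value:+.3f}")
print(f"Information value = {iv:.3f}")
```

With this sign convention, a positive WoE marks a bin dominated by good payers and a negative WoE marks a riskier bin. The WoE-encoded variables are then fed to the logistic regression.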

Gradient boosting: XGBoost, LightGBM, CatBoost

Production standard / Recommended for most cases

The de facto standard for tabular credit data. Captures non-linear feature interactions, handles missing values natively (XGBoost, LightGBM, and CatBoost all do), and pairs with SHAP for per-prediction explainability. Gini coefficient typically 0.55 to 0.75.

Best for: portfolios with 500+ labeled examples, production deployment, explainability requirements

Survival models: Cox proportional hazards, DeepHit

Advanced

Predict when a customer will default, not just whether they will. Valuable for portfolio provisioning and expected credit loss (ECL) calculations under IFRS 9. Requires more data science expertise to deploy and maintain.

Best for: IFRS 9 ECL modeling, insurance risk pricing, long-horizon credit portfolios

Neural networks and deep tabular models

Rarely justified for credit scoring

TabNet and similar architectures occasionally beat gradient boosting on very large, high-dimensional credit datasets. In practice, the explainability cost and calibration complexity rarely justify the marginal AUC gain over XGBoost for B2B portfolios.

Best for: consumer credit at scale (millions of accounts), not standard B2B portfolios

A key metric to require from any provider: the Gini coefficient (or equivalent AUC-ROC) measured on a holdout set, not on training data, and compared to your current rule-based baseline. A model with Gini 0.60 is not automatically "good" if your existing scorecard already achieves 0.55 with half the operational complexity.
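
The Gini coefficient and AUC-ROC are linked by the identity Gini = 2 × AUC − 1, so either can be derived from the other. A minimal sketch of the comparison described above, assuming a tiny hypothetical holdout set where higher scores mean higher predicted default risk:

```python
def auc_roc(labels, scores):
    """Probability that a random defaulter is scored riskier than a random
    non-defaulter; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    return 2 * auc_roc(labels, scores) - 1


# Hypothetical holdout set: 1 = defaulted, 0 = paid
holdout_labels = [0, 0, 0, 0, 0, 0, 1, 1]
model_scores = [0.02, 0.05, 0.10, 0.20, 0.03, 0.40, 0.35, 0.60]
baseline_scores = [0.1, 0.1, 0.2, 0.3, 0.1, 0.3, 0.2, 0.3]  # rule-based scorecard

model_gini = gini(holdout_labels, model_scores)
baseline_gini = gini(holdout_labels, baseline_scores)
print(f"Model Gini:    {model_gini:.3f}")
print(f"Baseline Gini: {baseline_gini:.3f}")
print(f"Uplift:        {model_gini - baseline_gini:.3f}")
```

A real holdout set needs far more than eight observations for the estimate to be stable; the point here is only that the uplift over the baseline, not the absolute Gini, is the number to ask for.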

Explainability and the EU AI Act: what you must implement now

Credit scoring AI is explicitly listed in Annex III of the EU AI Act as a high-risk AI system when used to evaluate the creditworthiness of natural persons. High-risk obligations, including human oversight, explainability, and conformity assessment, apply from August 2026 (with a possible extension to December 2027 under the Digital Omnibus proposal).

For B2B scoring of legal entities, the classification is currently grayer, but the regulatory direction is clear: documenting your model and providing decision explanations is no longer optional in any serious deployment.

SHAP values: the practical standard

SHAP (SHapley Additive exPlanations) is the tool that makes gradient boosting models auditable. For each customer, SHAP computes the marginal contribution of every feature to the predicted score. A credit manager can then see a statement like this:

Example SHAP output for a credit decision

Customer: Dupont Logistics SAS. Risk score: 38/100 (high risk).
Main downward factors: DPD trend increasing over 3 months (score impact: -21), debt-to-equity ratio above 2.5 (impact: -14), sector (road freight) in 90th percentile for default rate (impact: -8).
Mitigating factors: 4-year relationship with zero past write-offs (+9), order volume stable (+6).
Recommended action: reduce credit limit by 30%, flag for monthly review.
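
To make the idea of "marginal contribution" concrete, the sketch below computes exact Shapley values by brute force for a toy scoring function. Everything in it is an assumption for illustration: `risk_score` is a hand-written stand-in, not a trained model, and the numbers do not reproduce the example above. In production you would typically rely on a library such as `shap`, which uses far more efficient algorithms for tree models.

```python
from itertools import permutations


def risk_score(x):
    """Hypothetical toy score (0 to 100, lower = riskier)."""
    score = 70.0
    score -= 0.5 * x["dpd_trend_days"]
    score -= 5.0 * max(x["debt_to_equity"] - 1.0, 0.0)
    # Interaction: a worsening payment trend hurts more in a high-risk sector.
    if x["high_risk_sector"] and x["dpd_trend_days"] > 0:
        score -= 6.0
    return score


def shapley_values(model, customer, baseline):
    """Average each feature's marginal effect over all feature orderings,
    switching features one by one from the baseline to the customer's value."""
    features = list(customer)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = customer[f]
            updated = model(current)
            contrib[f] += updated - previous
            previous = updated
    return {f: total / len(orderings) for f, total in contrib.items()}


baseline = {"dpd_trend_days": 0, "debt_to_equity": 1.0, "high_risk_sector": False}
customer = {"dpd_trend_days": 20, "debt_to_equity": 2.5, "high_risk_sector": True}

phi = shapley_values(risk_score, customer, baseline)
print("Score:", risk_score(customer), "vs. baseline:", risk_score(baseline))
for feature, value in sorted(phi.items(), key=lambda kv: kv[1]):
    print(f"  {feature}: {value:+.1f}")
```

The contributions always sum to the gap between the customer's score and the baseline score, which is what lets a credit manager read them as a decomposition of the decision. The interaction penalty is split evenly between the two features that jointly cause it.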

Non-discrimination: proxy variables are the real risk

Simply dropping protected attributes (gender, nationality) is insufficient. A 2024 Stanford Law working paper on AI discrimination in creditworthiness assessment documents how variables like postal code, payment channel, and sector codes carry strong statistical correlations with protected characteristics, so they can act as proxies even when the protected attributes themselves are excluded. A fairness audit is a prerequisite for any deployment affecting individuals.
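
A basic fairness audit starts by comparing outcomes across groups. The sketch below computes approval rates and a disparate impact ratio on hypothetical decisions. The group names and counts are invented, and the 0.8 flag threshold is an assumption borrowed from the "four-fifths" rule of thumb used in US employment-selection guidance; it is not a threshold set by the EU AI Act or GDPR.

```python
from collections import defaultdict

FLAG_THRESHOLD = 0.8  # illustrative rule of thumb, not a legal EU threshold


def approval_rates(decisions):
    """decisions: iterable of (group, approved) -> {group: approval rate}."""
    totals, approved = defaultdict(int), defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}


def disparate_impact_ratio(rates, reference_group):
    """Each group's approval rate divided by the reference group's rate."""
    return {g: r / rates[reference_group] for g, r in rates.items()}


# Hypothetical audit sample: group membership is used for auditing only,
# never as a model input.
decisions = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 56 + [("B", False)] * 44
)

rates = approval_rates(decisions)
ratios = disparate_impact_ratio(rates, reference_group="A")
flagged = [g for g, r in ratios.items() if r < FLAG_THRESHOLD]

print(rates)
print(ratios)
print("Groups to investigate:", flagged)
```

A flagged group is a prompt to investigate which features drive the gap, not proof of discrimination on its own.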

For a full overview of how the EU AI Act affects your AI projects, see our guide on EU AI Act compliance for SMEs.

Human override remains mandatory

Under both the EU AI Act and good credit practice, the model score should feed into a decision, not replace the decision. The credit manager must be able to override the model with a documented reason. Build this into your workflow from day one, not as an afterthought.
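
One way to build the override into the workflow is to make the decision record itself refuse an undocumented override. This is a minimal sketch under assumed names (`CreditDecision`, `record_decision`, and all field names are hypothetical), not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CreditDecision:
    customer_id: str
    model_score: int
    model_recommendation: str
    final_decision: str
    decided_by: str
    override_reason: Optional[str]
    decided_at: str


def record_decision(customer_id, model_score, model_recommendation,
                    final_decision, decided_by, override_reason=None):
    """Create an audit record; reject overrides that lack a written reason."""
    is_override = final_decision != model_recommendation
    if is_override and not (override_reason and override_reason.strip()):
        raise ValueError("an override needs a documented reason")
    return CreditDecision(
        customer_id=customer_id,
        model_score=model_score,
        model_recommendation=model_recommendation,
        final_decision=final_decision,
        decided_by=decided_by,
        override_reason=override_reason if is_override else None,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )


# Hypothetical usage: the credit manager keeps the limit despite the model.
decision = record_decision(
    customer_id="C002",
    model_score=38,
    model_recommendation="reduce_limit",
    final_decision="keep_limit",
    decided_by="credit.manager@example.com",
    override_reason="Parent company guarantee received this week",
)
print(decision)
```

Storing the model recommendation next to the final decision also gives you the data to measure, later on, how often overrides turned out to be right.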

The most predictive features in B2B credit scoring

Based on both published research and our own work on credit portfolios, these are the feature families that consistently carry the most predictive signal, ranked by typical importance.

Feature family	Key variables	Typical importance rank
Payment behavior trend	DPD change over 3/6/12 months, DSO trend	1st (most predictive)
Legal and court signals	Payment injunctions, safeguard proceedings, liens	2nd
Financial ratios	Debt-to-equity, current ratio, interest coverage, Altman Z-score	3rd
Sector and size	NAF/NACE code, employee count, company age	4th
Relationship history	Years as customer, previous disputes, concentration of orders	5th
Behavioral / CRM signals	Response time to reminders, dispute rate, order pattern changes	6th (high marginal value if available)

A 2023 study published on ScienceDirect on machine learning and SMB credit risk found that payment behavior variables contributed over 40% of total feature importance in gradient boosting models trained on SMB portfolios, outperforming financial statement variables by a wide margin. This aligns with what we observe in practice: companies often have outdated balance sheets but very current payment data.
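
The top-ranked feature family, payment behavior trend, can be computed as a simple slope over recent periods. The sketch below assumes monthly average DPD per customer as input and uses an ordinary least-squares slope as the trend measure; both the customer names and the values are hypothetical.

```python
def trend_slope(values):
    """Ordinary least-squares slope of values against periods 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical monthly average DPD over the last three months (oldest first)
monthly_avg_dpd = {
    "slow_but_improving": [45, 38, 30],
    "on_time_but_slipping": [0, 6, 14],
}

features = {
    customer: {
        "current_dpd": series[-1],
        "dpd_slope_3m": trend_slope(series),  # days of DPD gained per month
    }
    for customer, series in monthly_avg_dpd.items()
}

for customer, row in features.items():
    print(customer, row)
```

Looking only at `current_dpd`, the first customer appears riskier; the slope shows it is improving by several days per month while the second is deteriorating. The same function applied to 6 and 12 monthly values gives the longer-window variants listed in the table.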

Credit risk scoring vs. lead scoring: why the distinction matters

Both are ML-based scoring models. Both assign a number to a customer. The resemblance stops there.

Credit risk scoring

Question: will this customer pay?
Target: default or late payment (binary)
Features: payment history, financial ratios, legal signals
User: credit manager, CFO, finance team
Regulation: EU AI Act high-risk (Annex III) and GDPR Art. 22 when natural persons are scored; grayer for legal entities
Wrong call cost: bad debt, provisioning, write-offs

Lead scoring (sales)

Question: will this prospect convert?
Target: MQL to SQL conversion (binary)
Features: firmographic data, website behavior, CRM activity
User: SDR, account executive, RevOps
Regulation: GDPR, but not EU AI Act high-risk
Wrong call cost: wasted sales time, missed pipeline

Using the same model for both, or conflating the two use cases in a project brief, leads to poor feature selection and misaligned success metrics. We cover the sales-side architecture in detail in our article on AI lead scoring from MQL to SQL.

Similarly, credit risk scoring is distinct from ML-based fraud and anomaly detection, which targets transactional outliers rather than long-horizon solvency predictions.

How to implement a credit scoring model: timeline and scope

A realistic implementation unfolds in three phases. Each phase has a clear deliverable so you can assess value before committing to the next.

Phase 1: Data audit and feasibility (2 to 3 weeks)

Profile your payment history: coverage, completeness, default count, class imbalance. Define the default label. Assess whether supervised ML is feasible or whether a rule-based scorecard is the right first step. Deliverable: a data readiness report with a go/no-go recommendation and feature roadmap.

Phase 2: Model development and validation (4 to 6 weeks)

Feature engineering, model training (gradient boosting baseline + comparison models), backtesting on holdout set, Gini / KS statistic measurement, SHAP explainability layer. Deliverable: a validated model with documented performance metrics and an explainability dashboard.

Phase 3: Production integration and monitoring (3 to 4 weeks)

Score push into your ERP or credit management tool (Sage, SAP, Oracle, Salesforce), human override workflow, alert rules for score degradation, monthly model monitoring. Deliverable: live scoring with drift detection and quarterly retraining schedule.
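
For the drift detection mentioned in this phase, one widely used measure is the Population Stability Index (PSI), which compares the share of customers in each score band at training time with the share observed today. The sketch below assumes PSI as the drift metric (this guide does not name one), hypothetical band counts, and the conventional rule-of-thumb thresholds of 0.10 and 0.25.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two distributions over the same bands."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the shares at eps so empty bands do not cause log(0).
        e_pct = max(expected / expected_total, eps)
        a_pct = max(actual / actual_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value


def drift_status(psi_value):
    """Rule-of-thumb interpretation (assumed thresholds, adjust to your policy)."""
    if psi_value < 0.10:
        return "stable"
    if psi_value < 0.25:
        return "monitor"
    return "investigate_and_consider_retraining"


# Hypothetical customer counts per score band (riskiest band first)
training_distribution = [100, 200, 400, 200, 100]
this_month_stable = [105, 195, 410, 190, 100]
this_month_shifted = [250, 300, 300, 100, 50]

for name, counts in [("stable", this_month_stable), ("shifted", this_month_shifted)]:
    value = psi(training_distribution, counts)
    print(f"{name}: PSI = {value:.4f} -> {drift_status(value)}")
```

PSI only detects a change in the score distribution; it says nothing about whether the model still ranks defaulters correctly, so it complements rather than replaces periodic Gini measurement on fresh outcomes.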

Total timeline from kickoff to production: 9 to 13 weeks for a standard B2B portfolio. A first usable score on existing customers can often be delivered at the end of Phase 2, before the full production integration, allowing early validation by the credit team.

Before committing to Phase 2, running a structured AI audit on your data and use case is the most reliable way to verify feasibility and size the expected Gini improvement over your current method.
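
Alongside Gini, the KS statistic named in Phase 2 is straightforward to compute on a holdout set: it is the largest gap between the cumulative score distributions of defaulters and non-defaulters. A minimal sketch, assuming hypothetical holdout data where higher scores mean higher predicted default risk:

```python
def ks_statistic(labels, scores):
    """Maximum distance between the empirical score CDFs of the two classes."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Hypothetical holdout set: 1 = defaulted, 0 = paid
holdout_labels = [0, 0, 0, 0, 0, 0, 1, 1]
holdout_scores = [0.02, 0.05, 0.10, 0.20, 0.03, 0.40, 0.35, 0.60]

ks = ks_statistic(holdout_labels, holdout_scores)
print(f"KS statistic: {ks:.3f}")
```

A KS of 0 means the model cannot separate the two groups at any cutoff; a KS of 1 means some cutoff separates them perfectly. As with Gini, compare it against the value your current method achieves on the same holdout data.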

FAQ: credit risk scoring machine learning

Q: What data do you need to build a credit scoring model?

At minimum, you need 2 to 3 years of payment history per counterparty, invoice-level data (amounts, due dates, days past due), and a label identifying customers who defaulted or paid late. Enriching with financial data (balance sheet ratios, Altman Z-score inputs) and external signals (sector, size, court filings) significantly improves accuracy. To train a reliable model, you need at least 500 to 1,000 labeled examples with a sufficient default rate (typically above 3 to 5%), and in any case no fewer than 50 confirmed defaults.

Q: Is AI credit scoring classified as high-risk under the EU AI Act?

Yes. AI systems used to evaluate the creditworthiness of natural persons or to establish their credit score are classified as high-risk under the EU AI Act (Annex III). High-risk obligations including human oversight, explainability, and conformity assessment apply from August 2026, with a proposed extension to December 2027 under the Digital Omnibus amendment. B2B scoring of legal entities currently sits in a grayer zone, but documenting your methodology and providing decision explanations is best practice regardless.

Q: What is the difference between lead scoring and credit risk scoring?

Lead scoring predicts conversion probability: will this prospect become a customer? Credit risk scoring predicts default or late-payment probability: will this customer pay on time? The two models use different features, different training labels, and serve different teams (sales vs. finance or credit control). They are complementary, not substitutable.

Q: Which algorithm works best for B2B credit scoring?

Gradient boosting models (XGBoost, LightGBM, CatBoost) typically outperform logistic regression and neural networks on structured credit data as measured by the Gini coefficient. Calibration should be verified on a holdout set rather than assumed, and corrected if the predicted probabilities drift from observed default rates. Gradient boosting also pairs naturally with SHAP values for explainability. Logistic regression remains useful as a regulatory baseline or when the dataset is small. Neural networks rarely add value over gradient boosting on tabular credit data unless you have very large datasets with hundreds of features.

Q: How do you explain a credit score to a credit manager?

SHAP (SHapley Additive exPlanations) is the standard tool. It computes the marginal contribution of each feature to a specific prediction, so you can tell a credit manager: "This customer's score is 42 out of 100. The main downward drivers are days past due above 30 (score impact: -18) and a debt-to-equity ratio above 2 (impact: -12)." This serves the credit team's need for actionable reasons and supports the EU AI Act's explainability requirements for high-risk AI systems.

Q: How do you avoid proxy discrimination in a credit scoring model?

Simply dropping protected attributes (gender, nationality) is insufficient. Variables like postal code, payment channel, or sector can act as proxies. The correct approach is to run a disparate impact analysis across protected groups, use fairness-aware training constraints if needed, and document your methodology. Under the EU AI Act and GDPR Article 22, automated credit decisions affecting individuals must be explainable and challengeable.

Q: Can a small portfolio use machine learning for credit scoring?

It depends on the default count, not the customer count. If you have fewer than 50 confirmed defaults in your history, a supervised ML model will be unreliable. In that case, rule-based scoring (payment behavior rules + financial ratio thresholds) combined with a simple logistic regression is a more honest and maintainable starting point. As your portfolio grows and defaults accumulate, you can migrate to a gradient boosting model.

Q: What return on investment can you expect from a credit scoring model?

A well-calibrated model typically reduces bad debt by 20 to 40% compared to manual or rule-based assessment, and cuts credit review time by 50 to 70%. ROI depends heavily on your current bad debt rate and portfolio volume. Projects with an annual bad debt exposure above 200,000 EUR generally show payback within 6 to 12 months. Below that threshold, a lighter rule-based approach may deliver better cost-adjusted returns.