Machine learning-based credit risk scoring lets you assign a default probability to each B2B customer or counterparty, automatically, before you extend trade credit, renew a contract, or issue a policy. When your data is usable, it consistently outperforms manual review and rule-based scorecards.
The condition is important. A credit scoring model trained on too few defaults, or on data that does not truly reflect payment behavior, will give you false confidence. This guide explains what works, what does not, and what the EU AI Act now requires.
Note on scope: this article covers B2B solvency and default risk (trade credit, insurance, supplier risk). If you are looking to rank prospects by conversion likelihood, that is a different problem covered in our article on AI lead scoring from MQL to SQL. The models, features, and objectives are distinct.
How machine learning credit risk scoring actually works
The core task is a binary classification problem: given a set of features describing a customer, predict whether they will default or pay significantly late (typically 90 days past due or more) within a defined horizon.
In practice, the pipeline unfolds in five steps:
- Data collection: payment history, financial ratios, external signals
- Label definition: define "default" (90 DPD, write-off, court filing)
- Feature engineering: DPD trends, DSO, sector dummies, financial health
- Model training: XGBoost / LightGBM with cross-validation
- Explanation and deployment: SHAP values per customer, score pushed into your ERP or CRM
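The label-definition step is the easiest one to get subtly wrong, so here is a minimal sketch of it. It assumes invoices are available as (due date, paid date) pairs and uses the 90 DPD threshold only; the function names, the sample invoices, and the choice to age unpaid invoices to an observation date are illustrative assumptions, not a prescribed method.

```python
from datetime import date

DEFAULT_DPD = 90  # assumed default threshold, in days past due


def days_past_due(due, paid, as_of):
    """Days late for one invoice; unpaid invoices are aged to `as_of`."""
    settled = paid if paid is not None else as_of
    return max((settled - due).days, 0)  # early payments count as 0, not negative


def label_default(invoices, as_of):
    """Return 1 if any invoice reached DEFAULT_DPD days past due, else 0."""
    worst = max(days_past_due(due, paid, as_of) for due, paid in invoices)
    return int(worst >= DEFAULT_DPD)


# Hypothetical invoices as (due_date, paid_date or None if still unpaid)
good_payer = [(date(2024, 1, 31), date(2024, 2, 10)),
              (date(2024, 2, 29), date(2024, 2, 27))]
bad_payer = [(date(2024, 1, 31), date(2024, 2, 5)),
             (date(2024, 3, 31), None)]

as_of = date(2024, 7, 15)
print(label_default(good_payer, as_of))  # 0
print(label_default(bad_payer, as_of))   # 1
```

Write-offs and court filings would be additional conditions joined with a logical OR; they are left out here to keep the sketch short.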
What machine learning adds over a classical scorecard is the ability to detect non-linear interactions. A customer with a high DSO (days sales outstanding) who belongs to a resilient sector and whose payment velocity is actually improving is very different from one with the same DSO and a deteriorating trend. A gradient boosting model captures this. A linear model does not.
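To make the "improving versus deteriorating" distinction concrete, a trend feature can be as simple as the slope of recent monthly lateness. The sketch below is one possible definition, assuming you already have an average DPD per month for each customer; the function name and the sample values are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Hypothetical average DPD per month over the last three months, oldest first
slow_but_improving = [45, 38, 30]
on_time_but_slipping = [0, 6, 14]

print(trend_slope(slow_but_improving))    # -7.5 days per month (improving)
print(trend_slope(on_time_but_slipping))  # +7.0 days per month (deteriorating)
```

Feeding both the current level and this slope to a tree-based model lets it learn that the same level means different things depending on the direction of travel.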
From the field
"The biggest surprise for credit teams is always the same," says Anas Rabhi, founder of Tensoria. "The most predictive feature is rarely the one they thought it was. Payment velocity trend over the last three months consistently outranks the current DSO in terms of predictive power. A customer paying slowly but accelerating is far safer than one paying at terms but slowing down."
What data do you need to assess customer solvency with machine learning
The minimum viable dataset for a reliable credit scoring model has three components. Skipping any of them forces you into rule-based approximations, which are still useful but discriminate less well between good and bad payers.
Internal payment data (non-negotiable)
| Data field | Minimum requirement | What it enables |
|---|---|---|
| Invoice history | 2 to 3 years, per customer | DPD distribution, seasonal payment patterns |
| Payment dates | Actual vs. due date per invoice | Computes days past due (DPD) at invoice level |
| Default labels | At least 50 to 100 confirmed defaults | The training target without which the model cannot learn |
| Customer identifier | SIRET, VAT number, or unique ID | Allows joining with external sources |
External financial data (strongly recommended)
- Balance sheet ratios: debt-to-equity, current ratio, interest coverage. Available via providers like Altares, Ellisphere, Bureau van Dijk (Orbis), or Infogreffe filings.
- Sector classification: NAF/NACE code. Sector default rates vary by a factor of 3 to 5 across industries.
- Legal events: court filings, payment injunctions (injonctions de payer), safeguard proceedings. These are the most powerful leading indicators, typically appearing 60 to 90 days before actual default.
- Company age and size: employee count, years in operation. Young, small companies in cyclical sectors carry structurally higher risk.
Behavioral and soft signals (optional, high marginal value)
- Dispute frequency: customers who regularly dispute invoices often have cash flow stress.
- Communication responsiveness: declining response time to payment reminders is an early warning signal extractable from your CRM logs.
- Order pattern changes: sudden drops in order volume or average basket size.
When machine learning is not yet worth it
If your portfolio has fewer than 50 confirmed defaults in the training window, a supervised model will be statistically unreliable. The class imbalance (typically 2 to 5% default rate) combined with a low absolute count makes the model sensitive to noise rather than signal. In this case, a well-structured rule-based scorecard combined with simple logistic regression is a more honest starting point, and can be upgraded once your labeled dataset grows.
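This check is simple enough to run before any modeling work. The sketch below applies the 50-default threshold from the paragraph above to a list of 0/1 labels; the function name, the returned keys, and the sample portfolio are illustrative assumptions.

```python
MIN_DEFAULTS = 50  # threshold taken from the text above


def recommend_approach(labels):
    """Summarise a list of 0/1 default labels and suggest a starting approach."""
    defaults = sum(labels)
    rate = defaults / len(labels) if labels else 0.0
    if defaults >= MIN_DEFAULTS:
        approach = "supervised_ml"
    else:
        approach = "scorecard_plus_logistic_regression"
    return {"defaults": defaults, "default_rate": rate, "approach": approach}


# Hypothetical portfolio: 1,000 customers, 30 confirmed defaults (3%)
labels = [1] * 30 + [0] * 970
print(recommend_approach(labels))
```

A 3% default rate looks normal here, yet the absolute count of 30 is what makes a supervised model fragile.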
Which algorithms to use for a B2B credit scoring model
The choice of algorithm matters less than the quality of your features and labels. That said, some algorithms are consistently better suited to structured credit data.
Logistic regression with scorecard binning
Baseline / Regulatory reference. The statistical backbone of traditional credit scoring (Basel II/III scorecards). Fully interpretable, easy to audit, and accepted by regulators as a reference. Gini coefficient typically 0.35 to 0.50 on well-structured data.
Gradient boosting: XGBoost, LightGBM, CatBoost
Production standard. The de facto standard for tabular credit data. Captures non-linear feature interactions, handles missing values natively (LightGBM, CatBoost), and pairs with SHAP for per-prediction explainability. Gini coefficient typically 0.55 to 0.75.
Survival models: Cox proportional hazards, DeepHit
Advanced. Predict when a customer will default, not just whether they will. Valuable for portfolio provisioning and expected credit loss (ECL) calculations under IFRS 9. Requires more data science expertise to deploy and maintain.
Neural networks and deep tabular models
Rarely justified for credit scoring. TabNet and similar architectures occasionally beat gradient boosting on very large, high-dimensional credit datasets. In practice, the explainability cost and calibration complexity rarely justify the marginal AUC gain over XGBoost for B2B portfolios.
A key metric to require from any provider: the Gini coefficient (or equivalent AUC-ROC) measured on a holdout set, not on training data, and compared to your current rule-based baseline. A model with Gini 0.60 is not automatically "good" if your existing scorecard already achieves 0.55 with half the operational complexity.
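The Gini coefficient and AUC-ROC are linked by Gini = 2 × AUC − 1, so either one can be derived from the other. The sketch below computes both from scratch on a tiny hypothetical holdout set and compares a model with a baseline; the data and variable names are made up for illustration, and scores are assumed to be oriented so that higher means more likely to default.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in sample size, which is
    fine for a sketch but not for a large portfolio.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one default and one non-default")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient derived from AUC."""
    return 2 * auc_roc(y_true, scores) - 1


# Hypothetical holdout labels (1 = default) and predicted default probabilities
y_holdout = [0, 0, 0, 0, 1, 0, 1, 1]
model_pd = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]
baseline_pd = [0.10, 0.30, 0.20, 0.50, 0.25, 0.40, 0.45, 0.80]

uplift = gini(y_holdout, model_pd) - gini(y_holdout, baseline_pd)
print(f"model Gini:    {gini(y_holdout, model_pd):.3f}")
print(f"baseline Gini: {gini(y_holdout, baseline_pd):.3f}")
print(f"uplift:        {uplift:.3f}")
```

The uplift over the baseline, measured on the same holdout set, is the number to weigh against the extra operational complexity.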
Explainability and the EU AI Act: what you must implement now
Credit scoring AI is explicitly listed in Annex III of the EU AI Act as a high-risk AI system when used to evaluate the creditworthiness of natural persons. High-risk obligations, including human oversight, explainability, and conformity assessment, apply from August 2026 (with a possible extension to December 2027 under the Digital Omnibus proposal).
For B2B scoring of legal entities, the classification is currently grayer, but the regulatory direction is clear: documenting your model and providing decision explanations is no longer optional in any serious deployment.
SHAP values: the practical standard
SHAP (SHapley Additive exPlanations) is the tool that makes gradient boosting models auditable. For each customer, SHAP computes the marginal contribution of every feature to the predicted score. A credit manager can then see a statement like this:
Example SHAP output for a credit decision
Customer: Dupont Logistics SAS. Score: 38/100 (high risk; lower scores indicate higher risk).
Main downward factors: DPD trend increasing over 3 months (score impact: -21), debt-to-equity ratio above 2.5 (impact: -14), sector (road freight) in 90th percentile for default rate (impact: -8).
Mitigating factors: 4-year relationship with zero past write-offs (+9), order volume stable (+6).
Recommended action: reduce credit limit by 30%, flag for monthly review.
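To show where per-feature impacts like these come from, the sketch below computes exact Shapley values by brute force for a toy scoring function. Everything in it is an assumption made for illustration: the scoring formula, the feature names, and the use of a single "portfolio average" customer as the reference point. In production you would typically rely on the shap library, which computes these values efficiently for tree models, rather than on this enumeration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings.

    Features not yet "switched on" keep their baseline value. The cost grows
    factorially with the number of features, so this is only for toy cases.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = x[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(f):
    """Hypothetical 0-100 score (higher = safer) with one interaction term."""
    score = 70.0
    score -= 4.0 * f["dpd_trend"]          # days per month of deterioration
    score -= 5.0 * f["debt_to_equity"]
    score += 2.0 * f["years_as_customer"]
    score -= 1.5 * f["dpd_trend"] * f["debt_to_equity"]  # leverage amplifies trend
    return score


customer = {"dpd_trend": 3.0, "debt_to_equity": 2.5, "years_as_customer": 4.0}
portfolio_avg = {"dpd_trend": 0.0, "debt_to_equity": 1.0, "years_as_customer": 2.0}

phi = shapley_values(toy_score, customer, portfolio_avg)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

The contributions sum exactly to the gap between this customer's score and the reference score, which is the property that makes statements like the one above auditable.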
Non-discrimination: proxy variables are the real risk
Simply dropping protected attributes (gender, nationality) is insufficient. A 2024 Stanford Law working paper on AI discrimination in creditworthiness assessment documents how variables like postal code of residence, payment channel, and sector codes carry strong statistical correlations with protected characteristics. A fairness audit is a prerequisite for any deployment affecting individuals.
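A first-pass proxy screen can be sketched as a correlation check between each candidate feature and a protected indicator that is held out of the model and used only for testing. The feature names, the sample data, and the 0.5 threshold below are hypothetical, and a linear correlation will miss non-linear or multi-variable proxies, so treat this as a starting point rather than a fairness audit.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def flag_proxies(features, protected, threshold=0.5):
    """Names of features whose |correlation| with `protected` exceeds threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, protected)) > threshold]


# Hypothetical audit sample: protected indicator is NOT a model input
protected = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "pays_by_cheque": [1, 1, 0, 0, 0, 0, 1, 0],
    "invoice_count": [12, 7, 9, 8, 11, 10, 9, 10],
}

print(flag_proxies(features, protected))  # ['pays_by_cheque']
```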
For a full overview of how the EU AI Act affects your AI projects, see our guide on EU AI Act compliance for SMEs.
Human override remains mandatory
Under both the EU AI Act and good credit practice, the model score should feed into a decision, not replace the decision. The credit manager must be able to override the model with a documented reason. Build this into your workflow from day one, not as an afterthought.
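One way to make the "documented reason" non-optional is to enforce it in the decision record itself. The sketch below is an assumed data structure, not a reference to any particular ERP or credit tool; all field names and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CreditDecision:
    customer_id: str
    model_score: int        # 0-100, as produced by the model
    model_action: str       # action suggested by the model
    final_action: str       # action actually taken by the credit manager
    decided_by: str
    override_reason: Optional[str] = None

    def __post_init__(self):
        # Reject any override that does not carry a non-empty justification.
        if self.is_override and not (self.override_reason or "").strip():
            raise ValueError("an override requires a documented reason")

    @property
    def is_override(self):
        return self.final_action != self.model_action


# The credit manager keeps the limit despite the model's suggestion
decision = CreditDecision(
    customer_id="C-001",
    model_score=38,
    model_action="reduce_limit",
    final_action="maintain_limit",
    decided_by="credit.manager",
    override_reason="Parent company guarantee received last week",
)
print(decision.is_override)  # True
```

Storing these records also gives you a dataset of overrides, which is useful later for checking whether the model or the humans were right.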
The most predictive features in B2B credit scoring
Based on both published research and our own work on credit portfolios, these are the feature families that consistently carry the most predictive signal, ranked by typical importance.
| Feature family | Key variables | Typical importance rank |
|---|---|---|
| Payment behavior trend | DPD change over 3/6/12 months, DSO trend | 1st (most predictive) |
| Legal and court signals | Payment injunctions, safeguard proceedings, liens | 2nd |
| Financial ratios | Debt-to-equity, current ratio, interest coverage, Altman Z-score | 3rd |
| Sector and size | NAF/NACE code, employee count, company age | 4th |
| Relationship history | Years as customer, previous disputes, concentration of orders | 5th |
| Behavioral / CRM signals | Response time to reminders, dispute rate, order pattern changes | 6th (high marginal value if available) |
A 2023 study published on ScienceDirect on machine learning and SMB credit risk found that payment behavior variables contributed over 40% of total feature importance in gradient boosting models trained on SMB portfolios, outperforming financial statement variables by a wide margin. This aligns with what we observe in practice: companies often have outdated balance sheets but very current payment data.
Credit risk scoring vs. lead scoring: why the distinction matters
Both are ML-based scoring models. Both assign a number to a customer. The resemblance stops there.
Credit risk scoring
- Question: will this customer pay?
- Target: default or late payment (binary)
- Features: payment history, financial ratios, legal signals
- User: credit manager, CFO, finance team
- Regulation: EU AI Act high-risk (Annex III) and GDPR Art. 22 when natural persons are scored; grayer for legal entities
- Wrong call cost: bad debt, provisioning, write-offs
Lead scoring (sales)
- Question: will this prospect convert?
- Target: MQL to SQL conversion (binary)
- Features: firmographic data, website behavior, CRM activity
- User: SDR, account executive, RevOps
- Regulation: GDPR, but not EU AI Act high-risk
- Wrong call cost: wasted sales time, missed pipeline
Using the same model for both, or conflating the two use cases in a project brief, leads to poor feature selection and misaligned success metrics. We cover the sales-side architecture in detail in our article on AI lead scoring from MQL to SQL.
Similarly, credit risk scoring is distinct from ML-based fraud and anomaly detection, which targets transactional outliers rather than long-horizon solvency predictions.
How to implement a credit scoring model: timeline and scope
A realistic implementation unfolds in three phases. Each phase has a clear deliverable so you can assess value before committing to the next.
Phase 1: Data audit and feasibility (2 to 3 weeks)
Profile your payment history: coverage, completeness, default count, class imbalance. Define the default label. Assess whether supervised ML is feasible or whether a rule-based scorecard is the right first step. Deliverable: a data readiness report with a go/no-go recommendation and feature roadmap.
Phase 2: Model development and validation (4 to 6 weeks)
Feature engineering, model training (gradient boosting baseline + comparison models), backtesting on holdout set, Gini / KS statistic measurement, SHAP explainability layer. Deliverable: a validated model with documented performance metrics and an explainability dashboard.
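The KS statistic mentioned here is the largest gap between the cumulative score distributions of defaulters and non-defaulters. A minimal sketch on hypothetical holdout data, assuming higher scores mean higher predicted default probability:

```python
def ks_statistic(y_true, scores):
    """Maximum gap between the score CDFs of defaulters and non-defaulters."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one default and one non-default")
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Hypothetical holdout labels (1 = default) and predicted default probabilities
y_holdout = [0, 0, 0, 0, 1, 0, 1, 1]
model_pd = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]

print(f"KS: {ks_statistic(y_holdout, model_pd):.2f}")  # KS: 0.80
```

The threshold at which the gap peaks is also a natural first candidate for an accept/review cut-off.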
Phase 3: Production integration and monitoring (3 to 4 weeks)
Score push into your ERP or credit management tool (Sage, SAP, Oracle, Salesforce), human override workflow, alert rules for score degradation, monthly model monitoring. Deliverable: live scoring with drift detection and quarterly retraining schedule.
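The text does not name a drift metric, so the sketch below uses the Population Stability Index (PSI) as one common choice for comparing the score distribution at training time with the one seen in production. The bin edges, sample scores, and function name are assumptions; a frequently cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that cut-off should be validated on your own portfolio.

```python
from bisect import bisect_right
from math import log


def psi(expected, actual, edges):
    """Population Stability Index between two score samples over fixed bins."""
    def shares(sample):
        counts = [0] * (len(edges) - 1)
        for s in sample:
            if not edges[0] <= s <= edges[-1]:
                raise ValueError(f"score {s} outside bin range")
            # Last bin is closed on the right so the top edge is included.
            index = min(bisect_right(edges, s) - 1, len(edges) - 2)
            counts[index] += 1
        floor = 1e-6  # avoids log(0) when a bin is empty
        return [max(c / len(sample), floor) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


edges = [0, 25, 50, 75, 100]

# Hypothetical scores: training-time distribution vs. this month's production
training = [10] * 10 + [40] * 30 + [60] * 40 + [90] * 20
production = [10] * 25 + [40] * 35 + [60] * 30 + [90] * 10

print(f"PSI: {psi(training, production, edges):.3f}")  # PSI: 0.243
```

Running this monthly on scores, and on the top input features, gives an early signal that retraining should be brought forward.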
Total timeline from kickoff to production: 9 to 13 weeks for a standard B2B portfolio. A first usable score on existing customers can often be delivered at the end of Phase 2, before the full production integration, allowing early validation by the credit team.
Before committing to Phase 2, running a structured AI audit on your data and use case is the most reliable way to verify feasibility and size the expected Gini improvement over your current method.
Further reading
- EU AI Act Compliance for SMEs: Concrete obligations, key deadlines, and a compliance checklist for businesses using AI in regulated contexts.
- AI Lead Scoring: From MQL to SQL: How ML-based scoring works for sales qualification, distinct from credit and solvency risk.
- ML Fraud and Anomaly Detection: Transactional outlier detection as a complement to long-horizon credit risk modeling.
- Enterprise Data Readiness for AI: How to assess whether your data is ready before launching a machine learning project.
- Why AI Projects Fail: The most common root causes, including poor label quality and insufficient default counts in credit scoring projects.
- AI Audit: Method and Cost: How to scope and evaluate an AI project before committing to build.
- AI audit service: Structured review of your data, use case, and business case for a credit scoring project before any build investment.