What if you could predict how your customers will respond to your product or service offering with a quantifiably greater level of confidence? Using 1st party data (buyer history, clickstream data, etc.) and 3rd party data (demographics, consumer segmentation attributes, etc.), companies are now building powerful predictive modeling and persona segmentation capabilities to reach their customers with a message that converts. We cover the various types of possible models in our blog post titled Marketing Science: Predictive Modeling for Today's Marketer. Knowing what your ideal customer looks like allows you to prioritize customers and prospects for better conversion, and scores can be delivered as fast as real time. Modeling not only helps prioritize sales and marketing budget, resources, and people power, but can also help with acquisition and with defining the right channels to pursue.

This blog will provide 1) a primer on propensity modeling, 2) types of lead scoring, 3) some finer details on the modeling process, 4) testing a model for effectiveness, and 5) implementing a model.

Propensity Modeling

Propensity modeling is predicting the likelihood, or propensity, of a lead, person, or target company to convert on your product or service offering. A propensity model can increase your marketing efficiency and sales conversions because it scores each lead, prospect, or customer by how mathematically similar or dissimilar it is to your ideal customer or prospect. A recent Demand Gen Report survey asked marketing executives about their demand generation priorities for 2018. The most important was a focus on lead quality over quantity. A strong lead scoring program can help your business realize efficiencies in several ways: it can reduce wasted advertising dollars, focus your salespeople on the prospects most likely to buy instead of the whole universe of prospects, and shift your marketing department's time and resources from spray and pray campaigns to audiences that are more likely to convert.

Types of Data

Utilizing both 1st and 3rd party data sources, you can enhance the efficacy of your lead scoring program. First party data (aka your business’ data) can be highly predictive and tell you a lot about where your customers are coming from, who is buying the most, and provide insight into what you have done right and wrong as a business. Third party consumer data can tell you things you don’t know about your customers while also providing an entirely new set of quality data points to be used in a predictive model. There are many beneficial types and forms of data within your business and from partners. However, it’s important to know what data to use and why. You can’t just throw data at a business problem and expect it to fix everything. Curating data is a critical component of the data science process. Solving big problems only becomes possible with quality data.

Types of Lead Scoring

There are two primary methods of lead scoring: 1) additive and 2) modeling.

The Additive Method

The additive method is typically a good first step because it’s relatively simple. You can start by assigning specific values for the milestones in your sales process. The leads with the highest score are often considered most engaged.

An example is assigning 1 point for each webpage visit, 5 points for filling out a lead form, 10 points for using the online chat function, and so on. As a lead continues to engage, the score increases. Recency is critical, and you should consider degrading the lead score by a percentage as time passes. To make the additive method even more accurate, the point structure can be based upon attribution analyses. It is easy to see how this method is a great start but can quickly become very complex. As the additive lead score becomes more complex, you should consider the modeling method.
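The additive method with recency decay can be sketched in a few lines of Python. This is a minimal illustration, not a recommended point structure: the point values come from the example above, while the function name `additive_score`, the 10% monthly decay rate, and the sample events are all hypothetical.

```python
from datetime import date

# Point values taken from the example above; other milestones would need their own values.
POINTS = {"page_visit": 1, "lead_form": 5, "online_chat": 10}

def additive_score(events, today, monthly_decay=0.10):
    """Sum points for each (event_type, event_date) pair, degrading older events.

    monthly_decay is an assumed rate: every full 30 days of age reduces
    an event's points by that fraction, compounding.
    """
    score = 0.0
    for event_type, event_date in events:
        age_in_months = (today - event_date).days // 30
        score += POINTS[event_type] * (1 - monthly_decay) ** age_in_months
    return score

# Hypothetical activity history for a single lead.
lead_events = [
    ("page_visit", date(2018, 1, 5)),
    ("lead_form", date(2018, 2, 20)),
    ("online_chat", date(2018, 3, 1)),
]
print(round(additive_score(lead_events, today=date(2018, 3, 6)), 2))
```

Here the two-month-old page visit contributes less than its full point, while the recent form fill and chat count in full. The decay rate itself is a business decision and should be tuned to the length of your sales cycle.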

The Modeling Method

The modeling method is more complex but typically more accurate. The goal of this method is to assign a conversion probability to every lead record. Since a lead scoring model estimates the probability of a binary outcome (conversion vs no conversion), regression is the most common method for creating a lead score model. While linear regression is easier to understand, it’s not the optimal method. Linear regression fits a straight line to the available data, so its predictions are not bounded and can fall below 0 or above 1, which makes no sense as a probability. Logistic regression is a better method because it passes the weighted inputs through the logistic (sigmoid) function, which produces an S-shaped curve that always stays between 0 and 1. Note that simple logistic regression uses a single variable to predict the outcome. As lead scoring should take numerous variables into account for better predictability, multiple logistic regression is needed.
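To make the idea concrete, here is a small sketch of how a multiple logistic regression turns several lead attributes into a single probability. The weights, intercept, and predictor names are invented for illustration and are not fitted to any real data; in practice a statistics package would estimate them from your training set.

```python
import math

def conversion_probability(features, weights, intercept):
    """Combine several predictors into a weighted sum, then apply the logistic function."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical predictors: [web visits, filled lead form (0/1), used online chat (0/1)]
weights = [0.15, 1.2, 0.8]   # illustrative only, not estimated from data
intercept = -3.0

cold_lead = [1, 0, 0]
warm_lead = [12, 1, 1]
print(conversion_probability(cold_lead, weights, intercept))
print(conversion_probability(warm_lead, weights, intercept))
```

Whatever values go in, the output stays between 0 and 1, which is what lets it be read as a conversion probability.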

Visual Overview: Linear Regression vs Logistic Regression



Modeling Process

To create a statistical propensity model, you need to split your data into “train” and “test” datasets. The exact split varies, though a common recommendation is to train your model on between 50% and 75% of your available data. The remaining 25% to 50% is the holdout and is used to validate, or test, your model.
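A train/test split can be as simple as shuffling the records and cutting the list. The sketch below assumes a 70/30 split and a fixed random seed so the split is repeatable; the function name and the `lead_id` records are placeholders.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split them into train and holdout lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Placeholder lead records; real records would carry the predictor fields and the outcome.
leads = [{"lead_id": i} for i in range(100)]
train, test = train_test_split(leads, train_fraction=0.7)
print(len(train), len(test))
```

Shuffling before the cut matters: if the records are sorted by date or by source, an unshuffled split would train and test on systematically different leads.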

Businesses have massive amounts of data. Third party datasets can add fidelity to your data, giving you greater coverage and accuracy. This creates another issue: how do you select the fields to use in the model when there are hundreds of possible options? This is where feature selection can help. Feature selection is an iterative part of the modeling process in which you select the features, or variables, to use as predictors in the model. A quick note: the modeler should work closely with the Subject Matter Expert (SME), who in many cases is a business leader such as a strategic marketing or sales leader, to understand the important kinds of data that are being or can be collected by the organization.

Effective modeling comes from understanding and trust in the model. This trust can be built by avoiding collinearity. Collinearity is the use of multiple customer data points that are correlated (i.e., measure much the same thing). When correlated predictors are used together, the model cannot cleanly separate their individual effects, which makes its coefficients unstable and harder to interpret. As an example, demographic values such as level of education and income are often related since a higher level of education typically relates to an increased salary. Another example might be purchase history and income as these can be correlated depending on the product or service you offer.
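One simple way to screen for collinearity is to compute the correlation between every pair of candidate predictors and flag the pairs that move together. The sketch below uses the Pearson correlation coefficient with an assumed cutoff of 0.8; the data values, field names, and threshold are all made up for illustration.

```python
import math
from itertools import combinations

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def flag_collinear(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair whose |r| meets the assumed threshold."""
    flagged = []
    for (name_a, values_a), (name_b, values_b) in combinations(features.items(), 2):
        r = pearson_correlation(values_a, values_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Made-up values for seven leads.
features = {
    "years_of_education": [12, 12, 14, 16, 16, 18, 20],
    "income_thousands": [35, 40, 48, 62, 70, 85, 110],
    "web_visits": [3, 9, 1, 7, 2, 8, 4],
}
print(flag_collinear(features))
```

In this toy data only education and income are flagged, so a modeler would typically keep one of the two (or combine them) rather than feed both into the model.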

Evaluating the Model

Once the model has been “trained” (developed), it must be tested against the holdout (“test”) data referenced earlier. Since you already know the individual conversion of each record in the test set, as well as the overall conversion rate, you can see how well the model performs against holdout data it hasn’t seen. Basically, the model scores the holdout data which you then compare against the actual conversion data to see how the model performed.

There are two common methods of visualizing the effectiveness of the model: 1) the Confusion Matrix and 2) the Receiver Operating Characteristic (ROC) curve. Both methods analyze the four possible classification outcomes, best represented as the Confusion Matrix:

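Confusion Matrix Example

                        Predicted: Convert     Predicted: No Convert
Actual: Convert         True Positive          False Negative
Actual: No Convert      False Positive         True Negative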

The Confusion Matrix provides a grid that lets you calculate the ratio of accurate predictions (both True Positive and True Negative) as well as inaccurate predictions (False Positive and False Negative). With this information, it is easy to see how often the model accurately predicts lead conversion. After the model is put into production, it is still useful to re-evaluate its effectiveness as new conversions come in. Investigating the inaccurately predicted conversions provides insight for future model improvement.
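The four cells of the Confusion Matrix can be counted directly from the holdout data. The following sketch assumes that a lead is predicted to convert when its score is at least 0.5; that cutoff, the function names, and the eight sample leads are illustrative assumptions.

```python
def confusion_matrix(actual, predicted_probability, threshold=0.5):
    """Count the four classification outcomes at an assumed probability cutoff."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for converted, probability in zip(actual, predicted_probability):
        predicted_to_convert = probability >= threshold
        if predicted_to_convert and converted:
            counts["TP"] += 1
        elif predicted_to_convert and not converted:
            counts["FP"] += 1
        elif converted:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

def accuracy(counts):
    """Share of all holdout leads that the model classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Hypothetical holdout set: 1 = converted, 0 = did not convert.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

matrix = confusion_matrix(actual, scores)
print(matrix, accuracy(matrix))
```

Moving the threshold trades False Positives against False Negatives, so the right cutoff depends on whether a wasted sales call or a missed buyer costs your business more.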

The other visual representation of model accuracy is the ROC curve, which plots the true positive rate against the false positive rate of test data classifications as the classification threshold varies. A diagonal line from the lower left to the upper right represents a model that does no better than a 50/50 guess, or even chance. It is often included in the graphic for reference. The better the model, the more the plotted line should pull to the top left of the plot, maximizing the area under the curve. The faster the line rises, the better the model.
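The points on a ROC curve, and the area under it, can be computed without any plotting library. This sketch sweeps the threshold across every distinct score in a made-up holdout set and uses the trapezoid rule for the area; it assumes the holdout contains at least one converted and one non-converted lead.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes actual contains both outcomes (1 = converted, 0 = not converted).
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        false_pos = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the ROC points; 0.5 is chance, 1.0 is perfect."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical holdout set: 1 = converted, 0 = did not convert.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

curve = roc_points(actual, scores)
print(area_under_curve(curve))
```

An area near 0.5 corresponds to the diagonal reference line, while values closer to 1.0 mean the model ranks converters above non-converters more consistently.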

ROC Curve Example


Implementing and Maintaining the Model

Once the model has been trained and tested, it can be moved into production. Typically, old leads will be scored to determine any hot leads that have been left out of the pool. As new leads enter the system or current leads engage in new activity, the lead score will be updated in near-real time to prioritize and qualify the leads. Again, the goal is to surface leads with the highest likelihood of conversion as quickly as possible.

Return on investment can be tentatively calculated off the test data using purchase history and pipeline volume. A true ROI can be calculated once the model has been implemented and enough leads have converted. As with any sales and marketing ROI calculation, it’s important to test and refine the model over time as results and new data sources come into the organization.

The benefits of utilizing multiple logistic regression for lead scoring should now be clear. Once any model is put into production, however, new business developments may alter its efficacy, so the model must be maintained. Continued effectiveness is important to ensure accurate classification and, in turn, stronger sales and marketing messages going to the right target audience. The timing of model updates can be driven by several factors, including seasonality, introduction of new data feeds, industry needs, or even a drop in ROI. Monitoring and maintenance of the lead scoring model will ensure continued reliability and success.

Finally, depending on factors such as industry, types of customers, and purchase frequency, it may be important to update a model more or less frequently. There are resource and price constraints to real time data, but the returns may outweigh the costs.


Predictive modeling can help sales, marketing, and other functions of the business (e.g. finance, HR) better plan resources. Spending time on the right things rather than the wrong things can have a dramatic effect on the bottom line. The experience, technical know-how, data attribution resources, and go-to-market strategy can be hard to put together, but with the right partner you can put an impressive, high performing model in place. It’s important to take inventory of your gaps in business intelligence, data processing, data quality, data visualization, and data modeling to roadmap your future as a data driven organization.

Contributing Writer:

Matt Hendrickson

Principal Data Analyst & Consultant for Cruz Street Digital