
Digital Transformation helps Microsoft weed out fake marketing leads

Microsoft, a leader in digital transformation, has showcased how it solved its fake leads problem.

“Fake leads” is the problem to tackle

When people sign up via online forms, they sometimes give a fake name, company name, email, or phone number. They may submit randomly typed characters (keyboard gibberish) or use profanity. Or, they may accidentally make a small typographical error, but otherwise the name is real—so we don’t want to classify the lead as junk.

The abundance of fake lead names across Microsoft subsidiaries results in:

·         Lost productivity for our global marketers and sellers. Fake names waste an enormous amount of time since sellers rely on accurate information to follow up with leads.

·         Lost revenue opportunities. Among thousands of fake lead names, there could be one legitimate opportunity.

Each day, thousands of people sign up using thousands of web forms. But, in any month, many of the lead names—whether a company or a person—are fake.

The solution to tackle “Fake leads”

Improving data quality is critical. To do that, and to determine if names are real or fake, Microsoft built a machine learning solution that uses:

·         Microsoft Machine Learning Server (previously Microsoft R Server).

·         A data quality service that integrates machine learning models. When a company name enters the marketing system, the system calls their data quality service, which immediately checks if it’s a fake name.

So far, machine learning has reduced the number of fake company names that enter Microsoft’s marketing system, at scale. Their solution has prevented thousands of names from being routed to marketers and sellers. Filtering out junk leads has made their marketing and sales teams more efficient, allowing them to focus on real leads and help customers better.

Microsoft Machine Learning Server

Microsoft needed a scalable way to eliminate fake names across millions of records and to build and operationalize their machine learning model—in other words, they wanted a systematic, automated approach with measurable gains. They chose Machine Learning Server, in part, because:

·         It can handle their large datasets—which enables them to train and score their model.

·         It has the computing power that they need.

·         They can control how they scale their model and operationalize for high-volume business requests.

·         Access is based on user name and password, which are securely stored in Azure Key Vault.

·         It helps expose the model as a secure API that can be integrated with other systems and improved separately.

The difference between a rule-based model and machine learning

Rule-based model

Experts create static rules to cover common scenarios. As new scenarios occur, new rules are written. A static, rules-based model can make it hard to capture varying types of keyboard gibberish (like akljfalkdjg). With static rules, Microsoft’s marketers must waste time sorting through the fake leads and deciphering misleading or confusing information.
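To make the limitation concrete, here is a minimal sketch of what a static, rules-based filter might look like. The three rules and the blocklist entries are hypothetical and chosen only for illustration; the article does not describe Microsoft's actual rules.

```python
import re

# Hypothetical static rules of the kind a rules-based filter might contain.
BLOCKED_WORDS = {"test", "asdf", "none"}      # hand-maintained blocklist
REPEATED_CHAR = re.compile(r"(.)\1{3,}")      # same character 4+ times in a row
NO_VOWELS = re.compile(r"^[^aeiou]+$")        # no vowels anywhere in the name


def is_fake_by_rules(name: str) -> bool:
    """Return True if any static rule flags the name as fake."""
    cleaned = name.strip().lower()
    if not cleaned:
        return True
    if cleaned in BLOCKED_WORDS:
        return True
    if REPEATED_CHAR.search(cleaned):
        return True
    if NO_VOWELS.match(cleaned):
        return True
    return False


for candidate in ["asdf", "aaaaaa", "akljfalkdjg", "Microsoft"]:
    print(candidate, is_fake_by_rules(candidate))
```

The gibberish string akljfalkdjg is not on the blocklist, has no long run of one character, and contains a vowel, so none of these rules catches it. Covering it would require yet another hand-written rule, which is the maintenance burden described above.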

Machine Learning

Algorithms are used to train the model and make intelligent predictions. The model is built and trained on data that is labeled and classified at the beginning of the process. Then, as new data enters the model, the algorithm categorizes it, saving valuable time. Microsoft used the Naive Bayes classifier algorithm to categorize names as real/fake. This choice was influenced by how LinkedIn detects spam names in its social network.

Scenarios where the model is used

Microsoft’s business team identified their subsidiaries worldwide that are most affected by fake names. Now, they are weeding out fake names so that marketers and sellers don’t have to. Going forward, they plan to:

·         Create a lead data quality metric with more lead-related signals and other machine learning models that allow them to stack-rank their leads. The goal is to give a list to their sellers and marketers that suggests which leads to call first and which to call next.

·         Make contact information visible to their sellers and marketers when they’re talking on the phone with leads. For example, if the phone number that someone gave in an online form is real, but the company name isn’t, their seller can ask the lead to confirm the company name.

Choosing the technology

Microsoft incorporated the following technologies into their solution:

·         The programming language R and the Naive Bayes classifier algorithm, for training and building the model. This choice is based, in part, on the approach that LinkedIn uses.

·         Machine Learning Server, whose machine learning, R, and artificial intelligence (AI) capabilities help them build and operationalize their model.

·         Their data quality service, which integrates with the machine learning models to determine whether a person or company name is fake.

Designing the approach

Microsoft designed their overall architecture and process to work as follows:

1.       Marketing leads enter their data quality and enrichment service, where their team does fake-name detection, data matching, validation, and enrichment. They combine these data activities using a 590-megabyte model. Their training data consists of about 1.5 million real company names and 208,312 fake (profanity and gibberish) company names. Before they train the model, they remove commonly used company suffixes such as Private, Ltd., and Inc.

2.       They generate n-grams (combinations of contiguous letters) of three to seven characters and calculate probabilities that each n-gram belongs to the real/fake name dataset in the model. For example, the three-letter n-grams of the name “Microsoft” would be “Mic,” “icr,” “cro,” and so on. The training process computes how often the n-grams occur in real/fake company names and stores the computation in the model.

3.       They have four virtual machines that run Machine Learning Server. One serves as a web node and three serve as compute nodes. They have more compute nodes so that they can scale to handle the volume of requests that they have. The architecture gives them the ability to scale up or down by adding/removing compute nodes as needed based on the volume of requests. The provider calls a web API hosted on the web node, with company name as input.

4.       The web API calls the scoring function on the compute node. This scoring function generates n-grams from the input company name and calculates the frequencies of these n-grams in the real/fake training dataset.

5.       To determine whether the input company name is real or fake, the predict function in R uses these calculated n-gram frequencies stored in the model, along with the Naive Bayes rule.
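The preprocessing in steps 1 and 2 can be sketched in a few lines. This is an illustrative Python sketch, not Microsoft's R code. The suffix set holds only the three suffixes named above, the function names are hypothetical, and stripping only trailing suffixes is an assumption.

```python
from typing import List

# Only the suffixes named in the article; a production list would be longer.
COMPANY_SUFFIXES = {"private", "ltd", "inc"}


def strip_suffixes(name: str) -> str:
    """Drop trailing company suffixes such as 'Private', 'Ltd.', or 'Inc.'."""
    tokens = name.split()
    while tokens and tokens[-1].strip(".,").lower() in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def char_ngrams(text: str, min_n: int = 3, max_n: int = 7) -> List[str]:
    """Return every run of contiguous characters of length min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


name = strip_suffixes("Microsoft Inc.")
print(name)
print(char_ngrams(name, 3, 3))
```

With the default range of three to seven characters, a nine-letter name such as "Microsoft" produces 25 n-grams (7 + 6 + 5 + 4 + 3), which become the features for the classifier.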

To summarize, the scoring function that’s used during prediction generates the n-grams. It uses the frequencies of each n-gram in the real/fake name dataset that’s stored in the model to compute the probability of the company name belonging to the real/fake name dataset. Then, it uses these computed probabilities to determine if the company name is fake.
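The training and scoring flow summarized above can be illustrated with a small character n-gram Naive Bayes classifier. This is a sketch under stated assumptions, not Microsoft's implementation, which used an R Naive Bayes package. Add-one smoothing, log-space scoring, priors taken from class proportions, lowercasing, the class name NgramNaiveBayes, and the toy training names are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def char_ngrams(text: str, min_n: int = 3, max_n: int = 7) -> List[str]:
    """Contiguous character runs of length min_n..max_n, lowercased."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # smoothing constant (assumed)
        self.counts: Dict[str, Counter] = {}     # n-gram counts per class
        self.totals: Dict[str, int] = {}         # total n-grams per class
        self.priors: Dict[str, float] = {}       # share of names per class
        self.vocab = set()

    def fit(self, names: Iterable[str], labels: Iterable[str]) -> None:
        class_sizes = Counter()
        for name, label in zip(names, labels):
            class_sizes[label] += 1
            self.counts.setdefault(label, Counter()).update(char_ngrams(name))
        n_names = sum(class_sizes.values())
        for label, counter in self.counts.items():
            self.totals[label] = sum(counter.values())
            self.priors[label] = class_sizes[label] / n_names
            self.vocab.update(counter)

    def log_scores(self, name: str) -> Dict[str, float]:
        """Log prior plus the sum of smoothed log n-gram likelihoods."""
        vocab_size = len(self.vocab)
        scores = {}
        for label, counter in self.counts.items():
            denom = self.totals[label] + self.alpha * vocab_size
            score = math.log(self.priors[label])
            for gram in char_ngrams(name):
                score += math.log((counter[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name: str) -> str:
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Toy, hypothetical training data; the real model used far more names.
real_names = ["microsoft", "contoso", "fabrikam", "northwind"]
fake_names = ["asdfgh", "qwerty", "akljfal", "zxcvbn"]
model = NgramNaiveBayes()
model.fit(real_names + fake_names, ["real"] * 4 + ["fake"] * 4)
print(model.predict("Microsoft"), model.predict("asdfjkl"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and smoothing keeps an n-gram that never appeared in training from driving a class probability to zero.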

What Microsoft learned about business, technical, and design considerations

·         Ideally, the business problem should be solved within your organization itself rather than outsourcing it. Your organization will have deeper historical knowledge of the business domain, which helps to design the most relevant solution.

·         Having good training and test data is crucial. Most of the work Microsoft did was labeling their test data, analyzing how Naive Bayes performed compared to rxLogisticRegression and rxFastTrees algorithms, determining how accurate their model was, and updating their model where needed.

·         When you design a machine learning model, it’s important to identify how to effectively label the raw data. Unlabeled data has no information to explain or categorize it. Microsoft labels the names as fake/real and applies the machine learning model. This model takes new, unlabeled data and predicts a likely label for it.

·         Even in machine learning, you risk having false positives and negatives, so you need to keep analyzing predictions and retraining the model. Crowdsourcing is an effective way to analyze whether the predictions from the model are correct; otherwise, these can be time-consuming tasks. In Microsoft’s case, due to certain constraints they faced, they didn’t use crowdsourcing, but they plan to do so in the future.
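Analyzing predictions for false positives and false negatives, as described in the last point, usually starts with a confusion count over a labeled test set. The sketch below is one generic way to do that. The function names and the five sample labels are made up for illustration, and "fake" is treated as the positive class.

```python
from typing import Dict, List


def confusion_counts(actual: List[str], predicted: List[str],
                     positive: str = "fake") -> Dict[str, int]:
    """Count true/false positives and negatives, treating `positive` as the target."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def error_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """False positive rate = fp / (fp + tn); false negative rate = fn / (fn + tp)."""
    negatives = counts["fp"] + counts["tn"]
    positives = counts["tp"] + counts["fn"]
    return {
        "false_positive_rate": counts["fp"] / negatives if negatives else 0.0,
        "false_negative_rate": counts["fn"] / positives if positives else 0.0,
    }


# Made-up labels for five leads, purely for illustration.
actual = ["fake", "fake", "real", "real", "real"]
predicted = ["fake", "real", "real", "fake", "real"]
counts = confusion_counts(actual, predicted)
print(counts, error_rates(counts))
```

A false positive here is a real lead wrongly discarded (a lost opportunity), while a false negative is a fake lead that still reaches a seller (lost time), so the two rates are worth tracking separately when deciding whether to retrain.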

Operationalizing with Machine Learning Server vs. other Microsoft technologies

Some other technical and design considerations included deciding which Microsoft technologies to use for creating machine learning models. Microsoft offers great options such as Machine Learning Server, SQL Server 2017 Machine Learning Services (previously SQL Server 2016 R Services), and Azure Machine Learning Studio. Here are some tips to help you decide which to use for creating and operationalizing your model:

·         If you don’t depend on SQL Server for your model, Machine Learning Server is a great option. You can use the libraries in R and Python to build the model, and you can easily operationalize R and Python models. This option allows you to scale out as needed and lets you control the version of R packages that you want to use for modeling.

·         If you have training data in SQL Server and want to build a model that’s close to your training data, SQL Server 2017 Machine Learning Services works well—but there are dependencies on SQL Server and limits on model size.

·         If your model is simple, you could build it in SQL Server as a stored procedure without using libraries. This option works well for simpler models that aren’t hard to code. You can get good accuracy and use fewer resources, which saves money.

·         If you’re doing experiments and want quick learning, Azure Machine Learning Studio is a great choice. As your training dataset grows and you want to scale your models for high-volume requests, consider Machine Learning Server and SQL Server 2017 Machine Learning Services.

Challenges and roadblocks Microsoft faced

·         Having good training data. High-quality training data begins with a collection of company names that are clearly classified as real or fake—ideally, from companies around the world. Microsoft feeds that information into their model for it to start learning the patterns of real or fake company names. It takes a while to build and refine this data, and it’s an iterative process.

·         Identifying and manually labeling the training and test dataset. Microsoft manually labeled thousands of records as real or fake, which takes a lot of time and effort. Instead, one can take advantage of crowdsourcing services if possible, to avoid manual labeling. With these services, one can submit company names through a secure API and a human says if the company name is real or fake.

·         Deciding which product to use for operationalizing the model. Microsoft tried different technologies, but found computing limitations and versioning dependencies between the R Naive Bayes package they used and what was available in Azure Machine Learning Studio at the time. Microsoft chose Machine Learning Server because it addressed those issues, had the computing power they needed, and helped them easily scale out their model.

·         Configuring load balancing. When Microsoft’s Machine Learning Server web node receives many requests, it randomly chooses which of the three compute nodes to send each request to. This can leave one node overutilized while another is underutilized. They would prefer a round-robin approach, where all nodes are used equally to better distribute the load. This can be achieved by placing an Azure load balancer between the web node and the compute nodes.
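The difference between random and round-robin dispatch in the last point can be seen in a small simulation. This sketch only models the two selection strategies in-process. It does not represent how Machine Learning Server or Azure Load Balancer route traffic, and the node names are hypothetical.

```python
import itertools
import random
from collections import Counter

COMPUTE_NODES = ["compute-1", "compute-2", "compute-3"]  # hypothetical names


def random_dispatch(n_requests: int, seed: int = 0) -> Counter:
    """Send each request to a compute node chosen uniformly at random."""
    rng = random.Random(seed)
    return Counter(rng.choice(COMPUTE_NODES) for _ in range(n_requests))


def round_robin_dispatch(n_requests: int) -> Counter:
    """Send requests to the compute nodes in a fixed rotating order."""
    rotation = itertools.cycle(COMPUTE_NODES)
    return Counter(next(rotation) for _ in range(n_requests))


print("random:     ", dict(random_dispatch(300)))
print("round robin:", dict(round_robin_dispatch(300)))
```

Round robin guarantees that per-node request counts never differ by more than one, whereas random selection only evens out on average, so short bursts can land unevenly on a single node.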

Measurable benefits Microsoft has seen so far

The gains Microsoft has made thus far are just the beginning. So far, Machine Learning Server has helped them in the following ways:

·         With the machine learning model, their system tags about 5 to 9 percent more fake records than the static model. This means the system prevented 5 to 9 percent more fake names from going to marketers and sellers. Over time, this represents a vast number of fake names that their sellers do not have to sort through. As a result, marketer and seller productivity is enhanced.

·         They have captured more gibberish data and most profanities, with fewer false positives and false negatives. They have a high degree of accuracy, with an error rate of +/– 0.2 percent.

·         Their time to respond to requests has improved. With 10,000 data classifications of real/fake in 16 minutes and 200,000 classifications in 3 hours 13 minutes, they have ensured that their data quality service meets service level agreements for performance and response time. They plan to improve response time by slightly modifying the algorithm in Python.
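The response-time figures in the last point are easier to compare as throughput. The short calculation below converts them to classifications per second. It uses only the numbers quoted above, and the helper name is hypothetical.

```python
def per_second(classifications: int, hours: int = 0, minutes: int = 0) -> float:
    """Average classifications per second over the given duration."""
    seconds = hours * 3600 + minutes * 60
    return classifications / seconds


small_batch = per_second(10_000, minutes=16)            # 10,000 in 16 minutes
large_batch = per_second(200_000, hours=3, minutes=13)  # 200,000 in 3 h 13 min

print(f"10,000 batch:  {small_batch:.1f} per second")
print(f"200,000 batch: {large_batch:.1f} per second")
```

That works out to roughly 10.4 classifications per second for the smaller batch and roughly 17.3 per second for the larger one, so the quoted figures imply higher average throughput at the larger batch size.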

Next steps

Microsoft is excited about how their digital transformation journey has already enabled them to innovate and be more efficient. They will build on this momentum by learning more about business needs and delivering other machine learning solutions. Their roadmap includes:

·         Ensuring that their machine learning model delivers value end-to-end. Machine learning is just one link in the chain that reaches all the way to sellers and marketers around the world. The whole chain needs to work well.

·         Expanding their set of models and making business processes and lead quality more AI-driven vs. rule-driven.

·         Operationalizing other machine learning models, so that they get a holistic view of a lead.

·         Addressing issues created from sites that create fake registrations.

By improving data quality at scale, Microsoft is enabling marketers and sellers to focus on customers and to sell their products, services, and subscriptions more efficiently.