Connecting Technology and Business.

Digital Transformation helps Microsoft weed out fake marketing leads

Microsoft has showcased how, as a leader in digital transformation, it solved its fake-leads problem

“Fake leads” is the problem to tackle

When people sign up via online forms, they sometimes give a fake name, company name, email, or phone number. They may submit randomly typed characters (keyboard gibberish) or use profanity. Or they may accidentally make a small typographical error in an otherwise real name, in which case the lead should not be classified as junk.

The abundance of fake lead names across Microsoft subsidiaries results in:

·         Lost productivity for our global marketers and sellers. Fake names waste an enormous amount of time since sellers rely on accurate information to follow up with leads.

·         Lost revenue opportunities. Among thousands of fake lead names, there could be one legitimate opportunity.

Each day, thousands of people sign up using thousands of web forms. But, in any month, many of the lead names—whether a company or a person—are fake.

The solution to tackle “Fake leads”

Improving data quality is critical. To do that, and to determine if names are real or fake, Microsoft built a machine learning solution that uses:

·         Microsoft Machine Learning Server (previously Microsoft R Server).

·         A data quality service that integrates machine learning models. When a company name enters the marketing system, the system calls their data quality service, which immediately checks if it’s a fake name.

So far, machine learning has reduced the number of fake company names that enter Microsoft’s marketing system, at scale. Their solution has prevented thousands of names from being routed to marketers and sellers. Filtering out junk leads has made their marketing and sales teams more efficient, allowing them to focus on real leads and help customers better.

Microsoft Machine Learning Server

Microsoft needed a scalable way to eliminate fake names across millions of records and to build and operationalize their machine learning model—in other words, they wanted a systematic, automated approach with measurable gains. They chose Machine Learning Server, in part, because:

·         It can handle their large datasets—which enables them to train and score their model.

·         It has the computing power that they need.

·         They can control how they scale their model and operationalize for high-volume business requests.

·         Access is based on user name and password, which are securely stored in Azure Key Vault.

·         It helps expose the model as a secure API that can be integrated with other systems and improved separately.

The difference between a rule-based model and machine learning

Rule-based model

Experts create static rules to cover common scenarios. As new scenarios occur, new rules are written. A static, rules-based model can make it hard to capture varying types of keyboard gibberish (like akljfalkdjg). With static rules, Microsoft’s marketers must waste time sorting through the fake leads and deciphering misleading or confusing information.
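To make the limitation concrete, here is a minimal Python sketch of a static rule set. The rules, the tiny blocklist, and the function name looks_fake_by_rules are hypothetical illustrations, not the rules Microsoft actually used.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKLIST = {"test", "asdf", "qwerty", "none", "n/a"}

def looks_fake_by_rules(name):
    """Return True if any hand-written rule flags the name as fake."""
    cleaned = name.strip().lower()
    if not cleaned:
        return True   # Rule 1: empty input
    if cleaned in BLOCKLIST:
        return True   # Rule 2: known junk values
    if re.fullmatch(r"(.)\1{3,}", cleaned):
        return True   # Rule 3: one character repeated four or more times
    if len(cleaned) > 5 and not re.search(r"[aeiou]", cleaned):
        return True   # Rule 4: long string with no vowels
    return False

print(looks_fake_by_rules("qwerty"))       # True: caught by the blocklist
print(looks_fake_by_rules("akljfalkdjg"))  # False: no rule matches this gibberish
print(looks_fake_by_rules("Contoso"))      # False: treated as real
```

The gibberish example from the text slips through because it contains vowels and is not on the blocklist. Catching it would take yet another hand-written rule, which is the maintenance burden the paragraph describes.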

Machine Learning

Algorithms are used to train the model and make intelligent predictions. The model is built and trained on data that is labeled and classified at the beginning of the process. Then, as new data enters the model, the algorithm categorizes it, which saves valuable time. Microsoft used the Naive Bayes classifier algorithm to categorize names as real or fake. This choice was influenced by how LinkedIn detects spam names on its social network.

Scenarios where the model is used

Microsoft’s business team identified their subsidiaries worldwide that are most affected by fake names. Now, they are weeding out fake names so that marketers and sellers don’t have to. Going forward, they plan to:

·         Create a lead data quality metric with more lead-related signals and other machine learning models that allow them to stack-rank their leads. The goal is to give a list to their sellers and marketers that suggests which leads to call first and which to call next.

·         Make contact information visible to their sellers and marketers when they’re talking on the phone with leads. For example, if the phone number that someone gave in an online form is real, but the company name isn’t, their seller can ask the lead to confirm the company name.

Choosing the technology

Microsoft incorporated the following technologies into their solution:

·         The programming language R and the Naive Bayes classifier algorithm for training and building the model. This choice is based, in part, on the approach that LinkedIn uses.

·         Machine Learning Server, whose machine learning, R, and artificial intelligence (AI) capabilities help them build and operationalize their model.

·         Their data quality service, which integrates with the machine learning models to determine whether a person or company name is fake.

Designing the approach

Microsoft designed their overall architecture and process to work as follows:

1.       Marketing leads enter their data quality and enrichment service, where their team does fake-name detection, data matching, validation, and enrichment. They combine these data activities using a 590-megabyte model. Their training data consists of about 1.5 million real company names and 208,312 fake (profanity and gibberish) company names. Before they train the model, they remove commonly used company suffixes such as Private, Ltd., and Inc.

2.       They generate n-grams (combinations of contiguous letters) of three to seven characters and calculate the probability that each n-gram belongs to the real or the fake name dataset in the model. For example, the three-letter n-grams of the name “Microsoft” are “Mic,” “icr,” “cro,” and so on. The training process computes how often the n-grams occur in real and fake company names and stores the computation in the model.

3.       They have four virtual machines that run Machine Learning Server. One serves as a web node and three serve as compute nodes. Using several compute nodes lets them handle their volume of requests, and the architecture lets them scale up or down by adding or removing compute nodes as that volume changes. The provider calls a web API hosted on the web node, with company name as input.

4.       The web API calls the scoring function on the compute node. This scoring function generates n-grams from the input company name and calculates the frequencies of these n-grams in the real/fake training dataset.

5.       To determine whether the input company name is real or fake, the predict function in R uses these calculated n-gram frequencies stored in the model, along with the Naive Bayes rule.

To summarize, the scoring function that’s used during prediction generates the n-grams. It uses the frequencies of each n-gram in the real/fake name dataset that’s stored in the model to compute the probability of the company name belonging to the real/fake name dataset. Then, it uses these computed probabilities to determine if the company name is fake.
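The training and scoring flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Microsoft built its model in R, while this sketch uses a hand-rolled multinomial Naive Bayes with add-one smoothing. The class and function names, the short suffix list, and the tiny training sets are all hypothetical.

```python
import math
import re
from collections import Counter

# Suffixes named in the article; a production list would be longer (assumption).
SUFFIXES = {"private", "ltd", "inc"}

def normalize(name):
    """Lowercase the name and drop common company suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in SUFFIXES)

def char_ngrams(text, n_min=3, n_max=7):
    """All contiguous character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

class NgramNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant (assumption)
        self.counts = {"real": Counter(), "fake": Counter()}
        self.names_seen = Counter()

    def train(self, names, label):
        """Store how often each n-gram occurs in names with this label."""
        for name in names:
            self.counts[label].update(char_ngrams(normalize(name)))
            self.names_seen[label] += 1

    def log_score(self, name, label):
        """Log prior plus the summed log likelihood of the name's n-grams."""
        vocab = len(set(self.counts["real"]) | set(self.counts["fake"]))
        total = sum(self.counts[label].values())
        score = math.log(self.names_seen[label] / sum(self.names_seen.values()))
        for gram in char_ngrams(normalize(name)):
            score += math.log((self.counts[label][gram] + self.alpha)
                              / (total + self.alpha * vocab))
        return score

    def predict(self, name):
        return max(("real", "fake"), key=lambda lbl: self.log_score(name, lbl))

# Toy training data, invented for this example.
model = NgramNaiveBayes()
model.train(["Microsoft", "Contoso Ltd.", "Fabrikam Inc.", "Northwind Traders"], "real")
model.train(["akljfalkdjg", "asdfasdf", "qwerqwer", "jkljkljkl"], "fake")

print(char_ngrams("microsoft", 3, 3)[:3])  # ['mic', 'icr', 'cro']
print(model.predict("Microsoft Inc."))     # real
print(model.predict("asdfjkl"))            # fake
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing constant keeps an unseen n-gram from forcing a probability of zero.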

What Microsoft learned about business, technical, and design considerations

·         Ideally, the business problem should be solved within your organization itself rather than outsourcing it. Your organization will have deeper historical knowledge of the business domain, which helps to design the most relevant solution.

·         Having good training and test data is crucial. Most of the work Microsoft did was labeling their test data, analyzing how Naive Bayes performed compared to rxLogisticRegression and rxFastTrees algorithms, determining how accurate their model was, and updating their model where needed.

·         When you design a machine learning model, it’s important to identify how to effectively label the raw data. Unlabeled data has no information to explain or categorize it. Microsoft labels the names as fake or real and applies the machine learning model. This model takes new, unlabeled data and predicts a likely label for it.

·         Even in machine learning, you risk having false positives and negatives, so you need to keep analyzing predictions and retraining the model. Crowdsourcing is an effective way to analyze whether the predictions from the model are correct; otherwise, these can be time-consuming tasks. In Microsoft’s case, due to certain constraints they faced, they didn’t use crowdsourcing, but they plan to do so in the future.
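One simple way to keep analyzing predictions is to compare them against a manually labeled test set and count the false positives and false negatives. The sketch below assumes "fake" is the positive class. The function name and the eight sample labels are invented for illustration and are not Microsoft's data.

```python
def confusion_counts(actual, predicted, positive="fake"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # fake name correctly flagged
        elif p == positive:
            fp += 1   # real name wrongly flagged as fake
        elif a == positive:
            fn += 1   # fake name that slipped through
        else:
            tn += 1   # real name correctly passed
    return tp, fp, fn, tn

# Hypothetical hand-labeled test set and model output.
actual    = ["fake", "fake", "real", "real", "real", "fake", "real", "real"]
predicted = ["fake", "real", "real", "fake", "real", "fake", "real", "real"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)             # share of flagged names that were fake
recall = tp / (tp + fn)                # share of fake names that were flagged
error_rate = (fp + fn) / len(actual)   # share of all names misclassified

print(f"precision={precision:.2f} recall={recall:.2f} error_rate={error_rate:.2f}")
# precision=0.67 recall=0.67 error_rate=0.25
```

Tracking these numbers after each retraining shows whether the model is improving or drifting.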

Operationalizing with Machine Learning Server vs. other Microsoft technologies

Some other technical and design considerations included deciding which Microsoft technologies to use for creating machine learning models. Microsoft offers great options such as Machine Learning Server, SQL Server 2017 Machine Learning Services (previously SQL Server 2016 R Services), and Azure Machine Learning Studio. Here are some tips to help you decide which to use for creating and operationalizing your model:

·         If you don’t depend on SQL Server for your model, Machine Learning Server is a great option. You can use the libraries in R and Python to build the model, and you can easily operationalize R and Python models. This option allows you to scale out as needed and lets you control the version of R packages that you want to use for modeling.

·         If you have training data in SQL Server and want to build a model that’s close to your training data, SQL Server 2017 Machine Learning Services works well—but there are dependencies on SQL Server and limits on model size.

·         If your model is simple, you could build it in SQL Server as a stored procedure without using libraries. This option works well for simpler models that aren’t hard to code. You can get good accuracy and use fewer resources, which saves money.

·         If you’re doing experiments and want quick learning, Azure Machine Learning Studio is a great choice. As your training dataset grows and you want to scale your models for high-volume requests, consider Machine Learning Server and SQL Server 2017 Machine Learning Services.

Challenges and roadblocks Microsoft faced

·         Having good training data. High-quality training data begins with a collection of company names that are clearly classified as real or fake—ideally, from companies around the world. Microsoft feeds that information into their model for it to start learning the patterns of real or fake company names. It takes a while to build and refine this data, and it’s an iterative process.

·         Identifying and manually labeling the training and test dataset. Microsoft manually labeled thousands of records as real or fake, which takes a lot of time and effort. Instead, one can take advantage of crowdsourcing services if possible, to avoid manual labeling. With these services, one can submit company names through a secure API and a human says if the company name is real or fake.

·         Deciding which product to use for operationalizing the model. Microsoft tried different technologies, but found computing limitations and versioning dependencies between the R Naive Bayes package they used and what was available in Azure Machine Learning Studio at the time. Microsoft chose Machine Learning Server because it addressed those issues, had the computing power they needed, and helped them easily scale out their model.

·         Configuring load balancing. When Microsoft’s Machine Learning Server web node gets lots of requests, it randomly chooses which of the three compute nodes to send each request to. This can leave one node overutilized while another is underutilized. They would prefer a round-robin approach, where all nodes are used equally to better distribute the load. This can be achieved by placing an Azure load balancer between the web node and the compute nodes.
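The difference between random and round-robin dispatch can be illustrated with a small simulation. The node names and request count are hypothetical, and the code only models the distribution idea. It does not reproduce how Machine Learning Server or Azure Load Balancer route traffic.

```python
import itertools
import random
from collections import Counter

NODES = ["compute-1", "compute-2", "compute-3"]  # hypothetical node names

def dispatch_random(num_requests, rng):
    """Send each request to a randomly chosen node."""
    return Counter(rng.choice(NODES) for _ in range(num_requests))

def dispatch_round_robin(num_requests):
    """Send requests to the nodes in a fixed rotation."""
    rotation = itertools.cycle(NODES)
    return Counter(next(rotation) for _ in range(num_requests))

rng = random.Random(42)  # seeded so a run is repeatable
random_load = dispatch_random(30, rng)
round_robin_load = dispatch_round_robin(30)

print("random:     ", dict(random_load))
print("round-robin:", dict(round_robin_load))  # every node receives 10 requests
```

Over short bursts, random assignment can hand one node noticeably more work than another, while the rotation keeps the counts within one request of each other.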

Measurable benefits Microsoft has seen so far

The gains Microsoft has made thus far are just the beginning. So far, Machine Learning Server has helped them in the following ways:

·         With the machine learning model, their system tags about 5 to 9 percent more fake records than the static model. This means the system prevented 5 to 9 percent more fake names from going to marketers and sellers. Over time, this represents a vast number of fake names that their sellers do not have to sort through. As a result, marketer and seller productivity is enhanced.

·         They have captured more gibberish data and most profanities, with fewer false positives and false negatives. They have a high degree of accuracy, with an error rate of +/– 0.2 percent.

·         Their time to respond to requests has improved. With 10,000 data classifications of real/fake in 16 minutes and 200,000 classifications in 3 hours 13 minutes, they have ensured that their data quality service meets service level agreements for performance and response time. They plan to improve response time by slightly modifying the algorithm in Python.
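The two timing figures above can be converted into throughput and average time per classification with simple arithmetic. The helper name below is made up for this example, and the numbers come straight from the text.

```python
def throughput(records, hours=0, minutes=0):
    """Return (records per second, milliseconds per record)."""
    seconds = hours * 3600 + minutes * 60
    return records / seconds, 1000 * seconds / records

small = throughput(10_000, minutes=16)             # 10,000 in 16 minutes
large = throughput(200_000, hours=3, minutes=13)   # 200,000 in 3 hours 13 minutes

print(f"{small[0]:.1f} records/s, {small[1]:.1f} ms each")  # 10.4 records/s, 96.0 ms each
print(f"{large[0]:.1f} records/s, {large[1]:.1f} ms each")  # 17.3 records/s, 57.9 ms each
```

On these figures the larger batch runs at a higher rate per record, which suggests that fixed overhead is spread over more classifications.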

Next steps

Microsoft is excited about how their digital transformation journey has already enabled them to innovate and be more efficient. They will build on this momentum by learning more about business needs and delivering other machine learning solutions. Their roadmap includes:

·         Ensuring that their machine learning model delivers value end-to-end. Machine learning is just one link in the chain that reaches all the way to sellers and marketers around the world. The whole chain needs to work well.

·         Expanding their set of models and making business processes and lead quality more AI-driven vs. rule-driven.

·         Operationalizing other machine learning models, so that they get a holistic view of a lead.

·         Addressing issues created from sites that create fake registrations.

By improving data quality at scale, Microsoft is enabling marketers and sellers to focus on customers and to sell their products, services, and subscriptions more efficiently.

A free ticket to kickstart your Digital Transformation journey with Microsoft

Microsoft Azure

You can start your digital transformation journey today - your first mile is free.

Access a number of services available in Microsoft Azure without paying a penny (or rupee). Some are free for the first 12 months, while many are always free. On top of this, you also get pocket money of ₹13,300 to spend in the first month of your journey.

Let us first look at what services are always offered for free by Microsoft

1.       Do you want to quickly create powerful cloud apps using a fully-managed platform? Get 10 web, mobile or API apps with Azure App Service with 1 GB storage

2.       Wish to build apps faster with serverless architecture? You can now send 1 million requests and get 400,000 GB-s (gigabyte-seconds) of resource consumption with Azure Functions

3.       Are you looking to simplify the deployment, management, and operations of Kubernetes, an open-source system for automating the deployment, scaling, and management of containerized applications that groups the containers making up an application into logical units for easy management and discovery? Use Azure Container Service to cluster virtual machines.

4.       Are you planning for Identity and Access Management on the Cloud for your organization? Store 50,000 objects with Azure Active Directory with Single Sign-On (SSO) for 10 apps per user.

5.       Do you want to try managing the identity and access of your customers? Get 50,000 monthly stored users and 50,000 authentications per month with Azure Active Directory B2C

6.       You can build and operate always-on, scalable and distributed microservice apps using Azure Service Fabric

7.       Do you want to complement your IDE to share code, track work and ship software for any language, all in a single pack? The first 5 users are free with Visual Studio Team Services

8.       Get actionable insights through application performance management and instant analytics - Unlimited nodes (server or platform-as-a-service instance) with Application Insights and 1 GB of telemetry data included per month

9.       You can quickly provision software product development and test environment for Linux and Windows applications at the Azure DevTest Labs and use it without limit

10.   Enterprises can use Machine learning with 100 modules and 1 hour per experiment with 10 GB included storage at the Azure Machine Learning Studio – just drag and drop to deploy a solution – no coding

11.   Capitalize on the free policy assessment and recommendations with Azure Security Center where you get unified security management and advanced threat protection across hybrid cloud workloads.

12.   Get unlimited personalized recommendations and Azure best practices with Azure Advisor

13.   Start connecting IoT assets, and monitor and manage them with Azure IoT Hub. The free edition includes 8,000 messages per day with 0.5 KB message meter size

14.   Start delivering high availability and network performance to your applications using the public load-balanced IP with Azure Load Balancer

15.   Integrate your data in a hybrid environment. You can now experiment with 5 low frequency activities with Azure Data Factory

16.   If you develop mobile or web apps, add cloud search with 50 MB of storage for 10,000 hosted documents and 3 indexes per service with Azure Search

17.   Get a free namespace and push 1 million notifications to any platform from any back end with Azure Notification Hubs

18.   Manage compute power without limit using Azure Batch for cloud-scale job scheduling and cluster management

19.   Automate your processes and manage the cloud with 500 free minutes of job run time with Azure Automation

20.   Get more value from your data assets – include unlimited users and 5,000 catalog objects with Azure Data Catalog

21.   Detect human faces, compare similar ones and organize images – 30,000 transactions per month processing at 20 transactions per minute with Face API

22.   Run 5,000 speech-to-text and text-to-speech transactions per month with Bing Speech API

23.   Easily conduct real-time text translation with a simple REST API call – free 2 million characters included for Translator Text API

24.   Transform your log data into actionable insights using this free 500 MB-per-day analysis plus 7-day retention period with Log Analytics

25.   Run 1 job collection, 5 jobs per collection and 3,600 job executions on simple or complex recurring schedules for free with Scheduler

26.   Get your first 50 private virtual networks free with Azure Virtual Network

27.   Unlimited inbound Inter-VNet data transfer

These services listed below are free for the first 12 months

1.       Deploy 1 or more Azure B1S General Purpose Virtual Machines for Microsoft Windows Server (1 core 1GB RAM, 2 GB SSD Disk space) and run them for 750 hours (aggregate)

2.       Deploy 1 or more Azure B1S General Purpose Virtual Machines for Linux (1 core 1GB RAM, 2 GB SSD Disk space) and run them for 750 hours (aggregate)

3.       Get 128 GB of Managed Disks (as a combination of two 64 GB (P6) SSD disks, plus 1 GB snapshot and 2 million I/O operations) for persistent, secure disk storage for your VMs in Azure

4.       Get 5 GB of LRS-Hot Blob Storage – a massively scalable object storage for unstructured data - with 2 million read, 2 million write and 2 million write/list operations

5.       Get 5 GB of LRS File Storage, simple, secure and fully managed file shares, with 2 million read, 2 million list and 2 million other file operations

6.       Deploy an SQL Database Standard S0 instance with 250 GB data and 10 database transaction units

7.       Deploy a globally distributed multi-model database service with Azure Cosmos DB to store 5 GB of data with 400 reserved request units

8.       15 GB of bandwidth for outbound data transfer with free unlimited inbound transfer.

There is one service that remains free after the first 12 months

1.       5 GB of bandwidth for outbound data transfer, with free unlimited inbound transfer, remains free after the first 12 months.

The Azure free account is available to all new customers of Azure. If you have never had an Azure free trial or have never been a paying Azure customer, you are eligible. You don’t have to pay anything at all at the start.

Please access the FAQ here for further details.