In our previous chapters, we defined and demystified Artificial Intelligence and machine learning, and helped market researchers explore new ways to expand their understanding of consumer behavior. But insights are only impactful when the ingested data is of the utmost quality: AI cannot accurately answer questions based on inaccurate data. A key role of a data scientist is to select and compile the data inputs that best speak to the question at hand and, through the correct choice of algorithms, transform them into actionable insights that a brand can harness for a competitive edge.
A paramount rule underlying this process is that the quality of the output is determined first and foremost by the quality of the input. Good data is essential for actionable and accurate results, and in this chapter we will discuss some of the lessons we’ve learned when processing and analyzing data.
Ask The Right Questions
When transforming data into insights, it’s important to ask the right questions. You could ask the machine to divide users into groups based on shared past behavior, or ask how likely it is for a new user to be interested in a certain brand. However, all questions should be grounded in historical data and existing patterns in order to be measurable and actionable.
For example, the internet is full of software programs, known as bots, designed to pollute web data. Because these bots perform deterministic tasks, following a set of repetitive behaviors, their patterns are easy to observe and discover in the data. Once a program is employed to spot and block these bots, the system can then predict which users are actual people and which are not.
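One simple signal of the deterministic behavior described above is timing regularity: bots often fire requests at clockwork intervals, while human activity is irregular. The sketch below is purely illustrative, not Dstillery's actual detection logic; the function name and thresholds are assumptions made for the example.

```python
from statistics import pstdev

def looks_like_bot(event_times, min_events=5, max_jitter=0.05):
    """Flag a user whose events arrive at near-constant intervals.

    event_times: event timestamps in seconds, sorted ascending.
    Thresholds here are illustrative, not production-tuned values.
    """
    if len(event_times) < min_events:
        return False  # not enough evidence either way
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0:
        return True  # many events at the same instant: clearly automated
    # Coefficient of variation: low values mean clockwork regularity.
    return pstdev(gaps) / mean_gap < max_jitter

# A "user" hitting the site every 30 seconds on the dot:
print(looks_like_bot([0, 30, 60, 90, 120, 150]))   # True
# Human-like irregular visits:
print(looks_like_bot([0, 12, 95, 130, 400, 460]))  # False
```

In practice, a timing check like this would be one feature among many; real systems combine it with signals such as user-agent strings, click patterns, and IP reputation.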
New forms of fraud are harder to predict, since the behavioral pattern does not yet exist in past data. You may remember Methbot from a couple of years ago, an example of a spoofing algorithm. As a new type of fraud, it could not be predicted from historical patterns, but it was easy to spot the abnormalities by understanding what fraudulent behavior looks like. We saw a suspicious volume of bid requests for a website and also noticed that the exchange those bid requests were coming from had more bid requests than all of the other exchanges combined. Digging into it, we realized that this was fraudulent traffic driven by the spoofing algorithm and were able to counteract it.
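The red flag in the Methbot story above can be expressed as a simple rule: flag any exchange whose bid-request volume exceeds that of all other exchanges combined. This is a minimal sketch of that one check; the exchange names and counts are invented for illustration.

```python
def suspicious_exchanges(bid_counts):
    """Flag exchanges sending more bid requests than all others combined.

    bid_counts maps exchange name -> bid-request volume over some window.
    """
    total = sum(bid_counts.values())
    # An exchange is suspicious if its volume exceeds everyone else's sum,
    # i.e. n > total - n.
    return [name for name, n in bid_counts.items() if n > total - n]

counts = {"exchange_a": 1_200, "exchange_b": 900, "exchange_c": 8_000}
print(suspicious_exchanges(counts))  # ['exchange_c']
```

A rule this crude only catches the most blatant anomalies, but it illustrates the broader point: even novel fraud tends to distort aggregate statistics in ways a human or a monitor can notice.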
Ensure Data Quality
There is a common concept in computer science called GIGO – Garbage In, Garbage Out – which denotes that results are only as good as the data that goes in: bad data inevitably leads to meaningless, unactionable results.
So how do you ensure data quality? There is no “one size fits all” solution, but luckily, suspicious data tends to exhibit a few telltale signs. Typically, if a data set seems too good to be true, it probably is. Such extremes should be thoroughly investigated until the problem is discovered and studied, and new countermeasures can be developed.
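A basic way to surface "too good to be true" values is a z-score check: flag anything far from the mean relative to the spread of the data. This is a toy sketch with made-up click-through rates; the threshold is illustrative, and real pipelines would use more robust statistics (medians, percentiles) on much larger samples.

```python
def flag_outliers(values, z_thresh=3.0):
    """Return values more than z_thresh standard deviations from the mean.

    A crude 'telltale sign' detector; threshold and data are illustrative.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0:
        return []  # no spread, nothing stands out
    return [v for v in values if abs(v - mean) / std > z_thresh]

# Nine plausible click-through rates and one implausibly high one:
ctrs = [0.010, 0.012, 0.009, 0.011, 0.010,
        0.013, 0.008, 0.011, 0.009, 0.250]
print(flag_outliers(ctrs, z_thresh=2.5))  # [0.25]
```

Flagging is only the first step; as noted above, each extreme still needs a human investigation to determine whether it is fraud, a measurement bug, or a genuine phenomenon.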
For instance, at Dstillery, we utilize geo-location data from programmatic advertising events triggered by mobile devices, so we see a lot of location data. There would be instances of a device pinging a location, say Times Square in New York City, and then in the next minute reporting its location as Juneau, Alaska. We’ve come to refer to this as “The Superman Effect” since, clearly, it would be impossible for that device to have traveled such a great distance in the blink of an eye. You might be surprised to hear that when it comes to location data, we typically discard 65-70 percent in the cleaning process. Our algorithms rely on accurate, time-stamped location data. Therefore, filtering out bogus, random data is crucial to trusting your data and telling the complete story.
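A "Superman Effect" filter can be sketched as a speed check: compute the great-circle distance between consecutive pings and discard any ping that implies an impossible travel speed. This is an assumption-laden illustration, not Dstillery's actual cleaning pipeline; the 1,000 km/h cutoff and the sample coordinates are chosen for the example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # illustrative cutoff near commercial jet speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def filter_superman(pings):
    """Drop pings implying impossible travel speed from the last kept ping.

    pings: list of (timestamp_seconds, lat, lon), sorted by time.
    """
    kept = []
    for t, lat, lon in pings:
        if kept:
            t0, lat0, lon0 = kept[-1]
            hours = (t - t0) / 3600.0
            if hours <= 0:
                continue  # duplicate timestamp: can't compute a speed
            speed = haversine_km(lat0, lon0, lat, lon) / hours
            if speed > MAX_SPEED_KMH:
                continue  # Times Square -> Juneau in a minute: discard
        kept.append((t, lat, lon))
    return kept

pings = [
    (0,   40.7580, -73.9855),   # Times Square
    (60,  58.3019, -134.4197),  # Juneau one minute later: impossible
    (120, 40.7590, -73.9845),   # back near Times Square: plausible
]
print(filter_superman(pings))   # the Juneau ping is dropped
```

Real cleaning is more involved (GPS jitter, centroid defaults, stale caches all look different), but the principle is the same: physical constraints give you a hard test that bogus data fails.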
Account for Sampling Bias in Data
We’ve discussed the environment and infrastructure, but we must also discuss the data collection itself. Sampling bias, wherein the collection of data systematically favors some outcomes over others, is a pitfall that is all too easy to fall into. While not necessarily a bad thing, it needs to be taken into account, as it can skew the insights drawn from a given question.
Calling a list of phone numbers randomly selected from the area being surveyed is a common practice in marketing. But this is not a random sample of the target population! It will miss people living in the area who do not have a phone, or who have a cell phone with an area code from outside the region being surveyed. This method therefore systematically excludes certain types of consumers in the area. In the online world there is a similar type of sampling bias: people who don’t use the internet, who delete their cookies every day, or who use browsers that allow anonymous browsing are excluded from the sample.
Caution must also be used to avoid introducing hidden bias into the data through assumptions made while scrubbing or aggregating it. That is why, as a rule of thumb for AI, the full firehose of raw data is better for modeling than aggregated, processed, “clean” data sets. If you want to model sales performance for different point-of-sale locations, rather than using aggregated sales data for each point of sale (which may include bias introduced by the way the aggregation was done), it’s better to use the raw sales transactions that fed that aggregation, messy as they may be.
The Job Is Never Done
Ensuring and maintaining data quality is a constant process, wherein every tweak and fix can improve the output and produce more accurate and actionable insights. The landscape is constantly evolving and the ability to adapt will determine whether or not a company is successful. Our final chapter will discuss our predictions of where AI and marketing are going and what exciting new frontiers they will challenge.