With machine learning and AI rapidly transforming market research, brands and researchers face a dilemma: how do I trust the data being provided to me via a black-box algorithm?
Indeed, according to the Forbes Insights and KPMG "2016 Global CEO Outlook," 84% of CEOs are concerned about the quality of the data they're basing their decisions on. And with good reason: poor data quality is pervasive within the insights industry, especially in the area of unstructured social and voice-of-customer data, where accuracy challenges are significant. And poor data quality necessarily leads to poor insights, and to poor business decisions made from those insights.
There is good reason for this: human language is complex. Sarcasm, slang and implicit meanings abound. Different techniques are used to analyze the data, ranging from traditional rules-based approaches to machine learning/AI and deep learning. This has led to great variance in data quality and vast inconsistency in how the data is coded for sentiment, emotion, relevance and more. Studies have shown that some techniques pervasive in the industry achieve only 60 percent precision and about 15-30 percent relevancy through primitive Boolean queries, while more contemporary approaches yield different results. To complicate matters further, not all analysis is done at the same level: some solutions provide record-level analysis, while others provide highly granular "facet-level" analysis. Non-standard performance testing techniques and marketing hype simply add to the confusion.
In the meantime, social data is rapidly being adopted by more and more researchers, given its robust, raw and real-time nature. This vast, unstructured, unprompted discussion can be a gold mine for researchers who know how to harness it. It is increasingly used in areas ranging from brand tracking to segmentation, marketing mix modeling and more, where researchers are pressed to deliver "better, faster, cheaper" insights. In too many cases, though, users of this data are forced to "eyeball" results or simply accept the data quality "as is." QA-ing data sets that often run to millions of records is simply not realistic.
Clearly the status quo is untenable. For the industry to grow to the next level it must enable insights professionals and organizations to more confidently leverage and mainstream this data. This requires opening up the black box and openly providing “labels” for classifier performance.
This, of course, should seem reasonable to anyone involved in consumer insights. Yet it would represent a bit of a revolution in the world of social and voice-of-customer analysis. While traditional survey approaches have long provided margins of error and confidence intervals, text analytics solutions have offered no corresponding accuracy measure for social and voice data. Yet those performance scores exist.
The F1 score (and its cousins F2 and F0.5) combines two measures: precision and recall. Precision is essentially how often the algorithm's classifications match a "gold standard" of accuracy, while recall measures how many of the available "signals" in the data are actually captured. Historically, when precision goes up, recall goes down, and vice versa: one can be more accurate when evaluating a smaller number of signals, and the more signals analyzed, the more chances for error, so precision often declines. Showing a precision number without a recall number can therefore be misleading, and has led to much misinformation and misunderstanding in the marketplace. Of course one can reach 90% precision by analyzing only the most obvious records and leaving the harder-to-analyze ones as mixed or neutral. The combination of precision and recall produces the F1 score, which you can read more about here.
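To make the trade-off concrete, here is a minimal sketch of how precision, recall and the F-family of scores relate. The function names and the example counts are illustrative only; note how a classifier hitting the 60 percent precision and 30 percent recall figures cited above lands at a modest F1 of 0.4:

```python
def precision(tp: int, fp: int) -> float:
    """Of the signals the classifier flagged, what share were correct?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of the signals actually present in the data, what share were captured?"""
    return tp / (tp + fn)

def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F-score: beta=1 gives F1; beta=2 (F2) weights recall more heavily,
    beta=0.5 (F0.5) weights precision more heavily."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Illustrative counts: 60 true positives, 40 false positives, 140 missed signals.
p = precision(60, 40)   # 0.6  — looks respectable on its own...
r = recall(60, 140)     # 0.3  — ...but most signals were never captured
f1 = f_beta(p, r)       # 0.4  — the combined score tells the fuller story
```

This is why a vendor quoting precision alone can paint a flattering but incomplete picture: the same system scored on F1 looks very different.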
While this sounds obvious and simple, there are a few complexities that will require the industry to collaborate and reach consensus. First, we need to agree on how to measure precision in social data. Human beings themselves, when asked to evaluate a social conversation for expressed sentiment and emotion, often differ. In fact, in our experience, and depending on the complexity of the category, humans agree on average only 65-80 percent of the time. Analyst bias is rampant. Context matters. At Converseon we advocate a precision score measured against three independent humans who view the same record and agree. This also requires approaches that ensure inter-coder reliability where there is a difference of opinion. How well the algorithm matches that agreement is, in our view, the human gold standard for language analysis, and should be established as the benchmark for measurement.
Labeling classifiers with these scores would give researchers confidence, demystify performance claims from different vendors, expand the use of this valuable information, and help ensure the data is used responsibly and effectively. In fact, feeding this data into models or reporting insights based on it without these scores can be quite misleading, even dangerous.
Expressions of opinion through social platforms are powerful and transformative. Whether in politics, culture or simple expressions of brand and product opinion, consumers are demanding to be heard, and we, as an industry, have a responsibility to analyze and report on those opinions accurately. On the flip side, brands have every right to know that the information they receive from solution providers in this space is clear and reliable, and at what precision and recall level.
For social data to be more broadly leveraged, users of this data should no longer be required to blindly trust results, or to evaluate data quality themselves, before using it for important insight and modeling work. In fact, broad adoption of this data into critical functional areas requires nothing less.