The Facebook–Cambridge Analytica scandal of 2018 changed the game for social audience analytics. Users of social networking sites now have (slightly) more privacy, which is certainly a good thing, but this also means that marketers have less data to work with.
Pre-scandal, Facebook played fast and loose with personal data protection. An app developer could access a user’s entire friends list — their audience — and, at the very least, see names and Facebook ID numbers, if not full profiles (and often enough, Facebook’s confusing privacy controls let developers see some users’ full profiles).
The advantage of that approach, from a data perspective, was that audience analytics were easy to generate. The disadvantage was that apps could access data about users who had never opted in to the app.
Post-scandal, Facebook Inc. and other major sites cracked down on the data that third-party developers could access. Instagram completely removed the ability to view a user’s followers. Facebook made it very difficult to get permission to access a user’s friends, and rarely grants it. Facebook also removed the ability to interact in detail with personal profiles, instead pushing users toward the ‘Influencer’ model: a Facebook Business Page and an Instagram Professional account, linked together, are now required to authorize third-party access to data. Even then, developers cannot access a list of a Page’s followers.
Other platforms, like YouTube and Pinterest, followed a similar path and cracked down on the ability to look at followers or subscribers. Newer networks, like TikTok and Snapchat, simply chose not to offer public APIs at all. Only Twitter has left its platform relatively open.
The result of all this is that audience data is very hard to come by. Facebook offers only four paltry audience demographic metrics, for Pages and Instagram Professional accounts: Country, City, Locale, and Age/Gender buckets. Beyond those, there is no such thing as “official audience data” from any of the social networks.
So what is the techie marketer to do? Either use legally-grey web scrapers to harvest follower data, or get clever.
There are two issues with web scraping. First, it exists in a legal grey area: some precedent indicates scraping is legal (Intel v. Hamidi), while other cases have decided it isn’t (Facebook v. Power.com). Second, in the EU and UK, both covered by the GDPR, harvesting data without the user’s knowledge or consent is illegal outright, and precedent shows that the ‘legitimate interests’ argument isn’t the strong foundation data scrapers hoped it would be.
We at Tidal Labs don’t like legal grey areas, so we went for option #2: get clever.
First, we lean on the fact that hundreds of thousands of users have signed up for our platform and opted in to answering a handful of demographic questions, like their relationship status, whether they have children, their household income, and so on. Then we lean on the fact that those same users have connected their social accounts and allowed us to look at the type of content they write.
Using those two pieces of data, we’re able to build a machine learning model that relates a person’s writing to their demographic attributes, to a high level of accuracy (85%-90%, which, as far as these things go in the machine learning world, is quite good). Let’s call this the ‘content-to-demographics’ model.
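To make the idea concrete, here is a deliberately tiny sketch of a content-to-demographics classifier — a Naive Bayes model over word counts, with made-up posts and labels. This is an illustration of the technique, not our production model, which is far larger and trained on real opted-in data:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class DemographicsClassifier:
    """Toy Naive Bayes text classifier: predicts a demographic label
    from a person's writing. Illustrative only."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior + log likelihood with add-one smoothing
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data: opted-in users' posts paired with a
# demographic answer they gave us.
model = DemographicsClassifier().fit(
    ["stroller nap diaper baby", "diaper baby bedtime stroller",
     "raid boss loot stream", "loot stream raid patch"],
    ["has_children", "has_children", "no_children", "no_children"],
)
```

In practice we train a separate model per demographic question and measure that 85%–90% accuracy on held-out users who answered the question.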
That approach doesn’t give us audience data, though — it only lets us ‘guess’ a person’s demographics based on their writing.
Working with what we’re legally, ethically, and morally allowed to obtain, we start with the person’s Twitter followers, which we do have access to. We take a random sample of followers, run them through our content-to-demographics model, and start building the audience profile that way. We don’t look at every follower; instead, we choose a sample size large enough to be statistically meaningful. For example, sampling 500 followers of a person with 25,000 followers gives us a margin of error below 5% at a 95% confidence level. In practice, if we were estimating the gender breakdown of that audience, we could say something like ‘we are 95% certain the audience is 30% male, +/- 5%’. We choose the number of followers to sample precisely so that the margin of error stays below 5%.
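The sampling math above can be checked directly. This sketch computes the worst-case margin of error for an estimated proportion (p = 0.5) with a finite population correction; the numbers — 500 sampled from 25,000, z = 1.96 for 95% confidence — come from the example above:

```python
import math

def margin_of_error(n, population, p=0.5, z=1.96):
    """Worst-case margin of error for a proportion estimated from a
    simple random sample of n followers, with finite population correction."""
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc

def required_sample_size(population, target_moe=0.05, p=0.5, z=1.96):
    """Smallest sample that keeps the margin of error under target_moe."""
    n0 = (z ** 2) * p * (1 - p) / (target_moe ** 2)  # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))

moe = margin_of_error(500, 25_000)       # ~0.043, i.e. under the 5% target
n_needed = required_sample_size(25_000)  # ~379 followers would already suffice
```

The second function captures the last sentence above: given a follower count, it tells us how many followers to sample to hit the sub-5% target.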
If we stopped here, we’d already have a pretty good audience model based on Twitter data. But we don’t stop there: someone may have a significantly different audience on Instagram or YouTube than on Twitter. Since we can’t look at followers on Instagram or YouTube, we need to get even more clever: we look not at the followers but at the commenters. We apply the same content-to-demographics model to all the comments we can find, which gives us a similar demographic breakdown of the Instagram or YouTube commenters. From there, we use a ‘transference model’ to blend the data from all the various sources together (this model knows how to correct for statistical differences between the audiences) and generate the user’s overall audience report!
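The details of our transference model are beyond the scope of this post, but its blending step can be sketched as a weighted average of per-platform demographic distributions. The platform names, weights, and gender splits below are purely illustrative:

```python
def blend_audiences(platform_profiles, weights):
    """Blend per-platform demographic distributions into one overall
    audience profile, weighted by each platform's relative importance
    (e.g. audience size). A simplified stand-in for the transference
    model, which also corrects for cross-platform statistical differences.

    platform_profiles: {platform: {bucket: share}}, shares summing to 1.0
    weights: {platform: relative weight}
    """
    total_weight = sum(weights[p] for p in platform_profiles)
    blended = {}
    for platform, distribution in platform_profiles.items():
        w = weights[platform] / total_weight
        for bucket, share in distribution.items():
            blended[bucket] = blended.get(bucket, 0.0) + w * share
    return blended

# Illustrative numbers only: inferred gender splits per platform,
# weighted by follower counts.
overall = blend_audiences(
    {"twitter":   {"male": 0.30, "female": 0.70},
     "instagram": {"male": 0.40, "female": 0.60}},
    {"twitter": 25_000, "instagram": 75_000},
)
# overall["male"] == 0.25 * 0.30 + 0.75 * 0.40 == 0.375
```

Because each input distribution sums to 1 and the weights are normalized, the blended profile is itself a valid distribution — which is what makes it easy to bolt on new data sources later.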
We love this approach at Tidal Labs for a handful of reasons:
- It does not rely on scrapers, and instead uses the official social network APIs
- It does not store personal data of any audience member, follower, or engager
- The transference model makes it easy to add new data sources to the analysis
- We built it in-house and manage it ourselves, and don’t need to rely on third parties for our analysis
- It’s fast and accurate, and can build audience reports for unknown users in a matter of minutes
While it is harder for marketers to get good audience data today than it was in, say, 2016, it’s certainly still possible. It just needs a little more thoughtfulness and caution than it used to.