In my first post on this subject I wrote about the potential alternatives to Google Analytics – motivated of course by the Schrems II rulings that started appearing earlier in 2022 – see refresher article on the summary of Schrems II implications if needed. During that search for a viable EU alternative, I found three possibilities at the enterprise level (*see footnote). One of those analytics tools stood out to me – Piwik PRO – and this is the one on which I have decided to do a series of deep-dive assessments.
Post #1 examined a key privacy feature – private clouds. This is post #2 about an equally important part of data privacy: device fingerprinting.
DISCLOSURE: I have a working relationship with Piwik PRO. That is, the digital advisory agency I work for is migrating a GA4 client to them. That said, I am not being compensated for this post, nor was it requested by Piwik PRO. All posts are my own independent thoughts.
Feature #2 – Removal of Device Fingerprints
All collected data contains potential device identifiers – and by extension, user identifiers. The visitor’s IP address is an obvious one. However, there are many others that may seem benign at first, but when enough of them are stitched together, can be used like a jigsaw to build up a picture of the visitor.
This is one of the reasons why Google Analytics fails the Schrems II test – despite the fact that GA4 does not store IP addresses. Essentially, Google tracks hundreds of millions of visitors across its vast ecosystem, not just traffic from your website. It therefore has the potential to stitch ALL of those data points together.
Why Jigsaw Puzzles?
It’s a direct analogy, as there no longer exists such a thing as “anonymous” data. Essentially, like a jigsaw puzzle, the more pieces (data points) you have, the easier it is to identify the subject – it’s only a matter of time. Because companies like Google continuously suck up vast quantities of data about individuals, it is trivial for them to do this.
See these excellent articles/studies showing how anonymous data is not so anonymous:
- nature.com/articles/s41467-019-10933-3 – only 15 data points required!
- zdnet.com/article/mozilla-research-browsing-histories-are-unique-enough-to-reliably-identify-users/ – 150 web histories was all it took!
Of course such identifiers can be hashed (passed through a one-way function) to make them unreadable. But that is simply a fingerprint by another name, because hashed identifiers can be so unique that they become the ultimate identifier. All data collection vendors face the same problem. I wanted to know if Piwik PRO can handle this and if so, how. It turns out, they have thought about this already…
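To see why a plain hash is still a fingerprint, here is a minimal Python sketch. The device attributes, the function name and the choice of SHA-256 are my own assumptions for illustration, not any vendor's actual implementation. The point is only that the same device details always produce the same digest, so the digest can be matched across visits just like the raw values.

```python
import hashlib


def unsalted_fingerprint(attributes: dict[str, str]) -> str:
    """Hash device attributes with no salt (hypothetical helper)."""
    # Sort the keys so attribute order does not change the result.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up device details, for illustration only.
device = {
    "ip": "203.0.113.7",
    "os": "Mac OS X 12.5.1",
    "browser": "Safari 9.0.2",
    "language": "en-GB",
}

monday = unsalted_fingerprint(device)
friday = unsalted_fingerprint(device)

# The digest is unreadable, but it is identical on both days,
# so it still works as a stable identifier for this device.
print(monday == friday)  # True
```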
Salt & Deletion
Piwik PRO creates a session_id based on the visitor’s IP address, operating system, browser name, browser version, browser language and enabled browser plugins – very similar to other vendors, including Google Analytics. But then it hashes it AND, importantly, adds a salt to the mix. A salt in cryptography is random data added to the input before it is hashed to enhance security, and in this case the salt expires after 30 minutes of visitor inactivity. In addition, the device details within the raw collected data are processed so that only aggregate information is stored. For example, this is a raw/unprocessed user-agent string from a Mac OS X-based computer using the Safari browser:
Mozilla/5.0 (Macintosh; Intel Mac OS X 12_5_1) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9
After processing by Piwik PRO, it is truncated and shown in reports as:
OS = Mac 12.5; Browser = Safari 9.0
The truncation removes the device detail from the reports, yet remains granular enough to conduct detailed analysis.
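As a rough illustration of this kind of truncation, the Python sketch below reduces the user-agent string above to major.minor versions. The function name and the regular expressions are my own assumptions and only cover this Safari-on-Mac example; a real parser handles far more browsers and devices, and this is not Piwik PRO's actual code.

```python
import re


def truncate_user_agent(user_agent: str) -> str:
    """Keep only major.minor OS and browser versions (hypothetical helper)."""
    # "Mac OS X 12_5_1" -> ("12", "5"); the patch number is dropped.
    os_match = re.search(r"Mac OS X (\d+)[_.](\d+)", user_agent)
    # "Version/9.0.2" -> ("9", "0"); the patch number is dropped.
    browser_match = re.search(r"Version/(\d+)\.(\d+)", user_agent)
    if os_match is None or browser_match is None:
        return "OS = unknown; Browser = unknown"
    os_version = f"{os_match.group(1)}.{os_match.group(2)}"
    browser_version = f"{browser_match.group(1)}.{browser_match.group(2)}"
    return f"OS = Mac {os_version}; Browser = Safari {browser_version}"


raw = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_5_1) AppleWebKit/601.3.9 "
    "(KHTML, like Gecko) Version/9.0.2 Safari/601.3.9"
)
print(truncate_user_agent(raw))  # OS = Mac 12.5; Browser = Safari 9.0
```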
But what happens next is also important, because the raw/logged data at this point in time still contains device identifiers. Piwik PRO deletes these identifiers on a 6-hour cycle, and they cannot be recovered (more details). In short, these protections make any jigsaw building pretty much impossible for you as the website owner.
In addition, Piwik PRO is obviously not Google, meaning there is no connection between Piwik PRO data collected from your website/app and any data being collected elsewhere. Even if you subsequently upload visitor data to an advertising platform, e.g. to target lookalikes on Google Ads, Facebook etc., your users’ device fingerprint information is simply not present.
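Going back to the salting step, here is a minimal Python sketch of why an expiring salt matters. I am assuming a keyed hash (HMAC-SHA256) with a random 16-byte salt; the helper names and device details are hypothetical, and Piwik PRO's real implementation may well differ. Once the old salt is thrown away, the same device produces a different session ID, so visits cannot be linked through the hash.

```python
import hashlib
import hmac
import secrets


def new_salt() -> bytes:
    """Generate a fresh random salt (hypothetical helper)."""
    return secrets.token_bytes(16)


def session_id(attributes: dict[str, str], salt: bytes) -> str:
    """Keyed hash of the device attributes, with the salt as the key."""
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hmac.new(salt, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up device details, for illustration only.
device = {
    "ip": "203.0.113.7",
    "os": "Mac OS X 12.5.1",
    "browser": "Safari 9.0.2",
    "language": "en-GB",
}

salt_a = new_salt()
first_hit = session_id(device, salt_a)
second_hit = session_id(device, salt_a)  # same salt, same session

salt_b = new_salt()  # the old salt has expired and been discarded
later_visit = session_id(device, salt_b)

print(first_hit == second_hit)   # True: hits within a session still group together
print(first_hit == later_visit)  # False: the later visit cannot be linked to the first
```

Without the discarded salt, nobody (including you) can recompute the earlier session ID from the device details.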
Why I Like This Feature
For me, the approach described above is simply a great example of “privacy by design” – someone thought about this from the beginning. Of course, for troubleshooting reasons you may need very granular details, i.e. device fingerprints, and you have those for 6 hours. But beyond that, such identifiers become a privacy risk to your organisation. Therefore, it’s much better to remove this data, and a 6-hour clean-up cycle provides a nice balance in my opinion. Note, the processed data retention is 14 months and can be upgraded (similar to GA4).
Stored Google Analytics data retains device identifiers – they are not truncated, or removed beyond the data retention settings of your property. Note, the GA4 data retention settings impact all data, not just device identifiers – and the minimum is 2 months. In addition, if you enable Consent Mode, Google collects what it calls “cookieless pings” – that is, data is still collected even when a visitor has explicitly said no to consent. This is something that really irks me. I detail the problem in this post: Google Consent Mode – Why it breaks privacy laws.
Summary of Alternatives – Feature #2
Configuring your web analytics tool to use expiring salts and hashes for session IDs is a good thing for privacy. Combining this with the truncation of device identifiers and the deletion of raw/unprocessed data within 6 hours makes it pretty much impossible – either for yourself as the website owner, or the vendor collecting data on your behalf – to stitch it together and identify your visitors via the jigsaw effect. It’s a great example of privacy by design.
*How I Define Enterprise Analytics
For my own purposes I use the term “enterprise analytics” to mean paid products aimed at organisations with >10M hits per month – and potentially a lot more. Apart from the ability to collect vast quantities of data, an enterprise tool for me needs to meet a few criteria:
- Have an SLA and provide 1:1 support – either via a partner or direct from the provider.
- Have export functions such as an API and data warehousing capabilities e.g. BigQuery.
- Go beyond database limitations of e.g. MySQL (MySQL is a great product, but it struggles with very large data sets).
- Integrate with other tools e.g. Google Ads, Search Console, Data Studio et al.
- Be deployable via a Tag Management Solution.
- Be an established provider with existing enterprise users i.e. I take into account the wisdom of the crowd.