Normal, well-behaved browsers and bots send their user agents as you'd expect. However, all sorts of issues can arise with user agents outside the norm.
As well as finding problems with user agents, our preprocessor will sanitize and clean as much junk out of any malformed user agents as possible.
User agents can get encoded strangely by a proxy or a script that isn't handling them correctly. Maybe they were stored incorrectly in a database, or a script didn't transmit them properly or failed to escape special characters, and now an already messy user agent is even worse.
When you send user agents to our API, we run them through our preprocessor first to remove as many issues as we can, including incorrectly encoded user agents, unescaped characters and more.
For example, if your visitors are sending incorrectly encoded user agent strings, we can help you identify and fix these problems, restoring a clean string such as:
Mozilla/5.0 (Linux; Android 10; POT-LX1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Mobile Safari/537.36
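To illustrate the kind of encoding repair described above, here is a minimal sketch that undoes percent-encoding, including double encoding. It is not our preprocessor's actual implementation; the function name `decode_user_agent`, the `max_rounds` limit, and the simulated "mangled" input are assumptions made for this example.

```python
from urllib.parse import quote, unquote


def decode_user_agent(raw: str, max_rounds: int = 3) -> str:
    """Percent-decode a user agent repeatedly until it stops changing."""
    current = raw
    for _ in range(max_rounds):
        decoded = unquote(current)
        if decoded == current:
            break  # nothing left to decode
        current = decoded
    # Collapse stray runs of whitespace into single spaces.
    return " ".join(current.split())


clean = (
    "Mozilla/5.0 (Linux; Android 10; POT-LX1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Mobile Safari/537.36"
)
# Simulate a script that percent-encoded the string twice (hypothetical input).
mangled = quote(quote(clean))
repaired = decode_user_agent(mangled)
```

The round limit is a simple safeguard so that an unusual input cannot be decoded indefinitely.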
We've seen a lot of user agents like this:
Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) CriOS/30.0.1599.12 Mobile/11A465 Safari/8536.25 (3B92C18B-D9DE-4CB7-A02A-22FD2AF17C8F)
So we tidy them up like this:
Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) CriOS/30.0.1599.12 Mobile/11A465 Safari/8536.25
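A clean-up like the one above could be approximated with a regular expression that matches GUID-shaped tokens. This is only a sketch under our own assumptions (the names `GUID_RE` and `strip_guids` are made up for illustration), not the rule set the preprocessor really uses.

```python
import re

# 8-4-4-4-12 hexadecimal groups, the usual textual GUID/UUID layout.
_GUID = r"[0-9A-Fa-f]{8}-(?:[0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}"
# Match a GUID wrapped in its own parentheses, or a bare GUID,
# together with any whitespace in front of it.
GUID_RE = re.compile(r"\s*(?:\(" + _GUID + r"\)|\b" + _GUID + r"\b)")


def strip_guids(user_agent: str) -> str:
    """Remove GUID-shaped fragments from a user agent string."""
    return GUID_RE.sub("", user_agent).strip()


before = (
    "Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 "
    "(KHTML, like Gecko) CriOS/30.0.1599.12 Mobile/11A465 Safari/8536.25 "
    "(3B92C18B-D9DE-4CB7-A02A-22FD2AF17C8F)"
)
after = strip_guids(before)
```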
Many user agents have device IDs or other random identifiers in them. We've seen a few "anonymizer" plugins which, for some strange reason, add unique random strings to user agents (we have no idea why; surely this would make you more targetable?). It doesn't matter, though: our preprocessor does a great job of stripping them out.
This is very important when you need to save user agents to your database or system. Otherwise you'll end up with thousands of "almost-identical" user agents: normal user agents containing a single random GUID or string, each of which your system will save as a unique record.
For example, we identify and remove the random number at the end of this user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64; WOW64; rv:41.0) Gecko/20100101 Firefox/49.0.2 (x86 de) Anonymisiert durch AlMiSoft Browser-Anonymisierer 96034752
So it becomes:
Mozilla/5.0 (Windows NT 10.0; Win64; x64; WOW64; rv:41.0) Gecko/20100101 Firefox/49.0.2 (x86 de) Anonymisiert durch AlMiSoft Browser-Anonymisierer
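A targeted rule for this particular anonymizer could look roughly like the sketch below. The pattern and the name `strip_anonymizer_id` are assumptions for illustration; the real preprocessor may handle this case differently.

```python
import re

# Hypothetical targeted rule: a run of digits directly after the
# "Browser-Anonymisierer" marker at the very end of the string.
ANONYMIZER_ID_RE = re.compile(r"(Browser-Anonymisierer)\s+\d+\s*$")


def strip_anonymizer_id(user_agent: str) -> str:
    """Drop the trailing random number but keep the marker text."""
    return ANONYMIZER_ID_RE.sub(r"\1", user_agent)


before = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; WOW64; rv:41.0) "
    "Gecko/20100101 Firefox/49.0.2 (x86 de) Anonymisiert durch "
    "AlMiSoft Browser-Anonymisierer 96034752"
)
after = strip_anonymizer_id(before)
```

Anchoring the rule to the marker text, rather than stripping any trailing number, avoids touching legitimate version or build numbers in other user agents.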
There are too many to list here, but we continue to identify user agents with weird random fragments in them, and we keep extending our preprocessor to remove them. For example, we turn something like:
Mozilla/5.0 (Linux; Android 5.1.1; S3 Build/LMY49F; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/77.0.3865.120 MQQBrowser/6.2 TBS/045230 Safari/537.36 scancode_vc/213 scancode_vcname/2.45.50 scancode_cuid/|0 scancode_token/1_XPXQH3c5HRPtFHkSwi3sCCURmT25QfxM scancode_channel/20200826huidu zyb_jsBridge/1, jsBridge_jsInterface/1 jsBridge_isNewJsBridge/1 jsBridge_vc/2.3.2 jsBridge_os_version/5.1.1
Into this still messy but no longer random user agent:
Mozilla/5.0 (Linux; Android 5.1.1; S3 Build/LMY49F; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/77.0.3865.120 MQQBrowser/6.2 TBS/045230 Safari/537.36 scancode_vc/213 scancode_vcname/2.45.50 scancode_cuid/|0 scancode_token scancode_channel/20200826huidu zyb_jsBridge/1, jsBridge_jsInterface/1 jsBridge_isNewJsBridge/1 jsBridge_vc/2.3.2 jsBridge_os_version/5.1.1
And instead of having thousands of records in your database that differ only in their random fragment, our preprocessor keeps it neat for you by reducing them all down to a single sanitized user agent.
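The effect on stored records can be sketched as follows, using the `scancode_token` fragment from the example above. The regular expression, the function name, and the second and third token values are invented for illustration, and the user agents are shortened to the relevant part.

```python
import re

# Keep the "scancode_token" key but drop the random value after the slash.
TOKEN_RE = re.compile(r"\b(scancode_token)/\S+")


def strip_scancode_token(user_agent: str) -> str:
    return TOKEN_RE.sub(r"\1", user_agent)


# Shortened user agents; only the first token value comes from a real example,
# the other two are made up.
raw_user_agents = [
    "Safari/537.36 scancode_vc/213 scancode_token/1_XPXQH3c5HRPtFHkSwi3sCCURmT25QfxM scancode_channel/20200826huidu",
    "Safari/537.36 scancode_vc/213 scancode_token/1_AAAAexampleTokenAAAA scancode_channel/20200826huidu",
    "Safari/537.36 scancode_vc/213 scancode_token/1_BBBBexampleTokenBBBB scancode_channel/20200826huidu",
]

# Three distinct raw strings collapse to one sanitized value.
unique_sanitized = {strip_scancode_token(ua) for ua in raw_user_agents}
```

Storing the sanitized string as the key means each new random token maps to the same record instead of creating a new one.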
We've got lots of user-agent-specific checks and fixes for all the weird user agents we see.
Some unknown web browser extension or script adds strange random fragments to its user agent.
In our extensive collection of user agents, we've found these fragments in several different places: in the middle, at the end, and so on. We identify and remove them, returning a far neater user agent string in the user_agent_sanitized field, which you can optionally use.
We've noticed what look like MAC addresses in a number of user agents. This is very strange: browsers, scripts, and programs shouldn't be sending their MAC address in their user agents, so we identify them and can remove them if you want.
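Detecting MAC-address-shaped fragments could be sketched like this. The pattern, the function name, and the example user agent are all hypothetical; this is an illustration of the idea rather than the preprocessor's real logic.

```python
import re

# Six pairs of hex digits separated by ":" or "-", e.g. 00:1A:2B:3C:4D:5E.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def strip_mac_addresses(user_agent: str) -> str:
    """Remove MAC-shaped fragments and tidy the leftover whitespace."""
    without_macs = MAC_RE.sub("", user_agent)
    return " ".join(without_macs.split())


# Made-up user agent containing a made-up MAC address.
example = "ExampleApp/1.0 00:1A:2B:3C:4D:5E Linux"
cleaned = strip_mac_addresses(example)
```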
We've seen a lot of user agents with what look like hashes or checksums in them, and we work to remove them. For example, we turn:
Podbean/iOS (http://podbean.com) 4.5.1 - 42e1a53e413871204ad43bf95c9c5c94
Into:
Podbean/iOS (http://podbean.com) 4.5.1
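One way to approximate this kind of hash removal is to look for hexadecimal runs whose length matches a common digest size. The sketch below assumes lengths of 32, 40, and 64 characters (MD5, SHA-1, and SHA-256 style); the pattern and the name `strip_hashes` are our own choices for the example.

```python
import re

# A hex run of a typical digest length, optionally preceded by " -".
HASH_RE = re.compile(
    r"(?:\s+-)?\s+\b(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})\b"
)


def strip_hashes(user_agent: str) -> str:
    """Remove hash-like hexadecimal fragments from a user agent."""
    return HASH_RE.sub("", user_agent).strip()


before = "Podbean/iOS (http://podbean.com) 4.5.1 - 42e1a53e413871204ad43bf95c9c5c94"
after = strip_hashes(before)
```

Restricting the match to exact digest lengths keeps shorter hexadecimal tokens, such as build identifiers, intact.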
We built the preprocessor to maintain our own user agent database, and now you can get the same features if you want.
As well as sanitizing user agents, we can also spot many different things that are "wrong" with a user agent, so that you don't get tricked.