User agents are completely "free text": whatever software sends web requests can send its user agent in any format imaginable. When you're handling normal, legitimate user agents, you'll still see a huge variety of formats, but with a bit of head scratching it generally all still "sort of makes sense"...
However, user agents can also come in some absolutely crazy formats, which are either subtly "wrong" and "weird", or completely malformed.
We have used our extensive exposure to user agents and our huge collection of them to identify many ways that a user agent can be "weird". The user agent parsing API will tell you if any user agents you send seem "weird" to us. This page has a detailed description of all the ways we know of and detect that a user agent can be weird.
Here is more detail about each reason a user agent might be marked weird, including a variety of user agents that match the given criteria.
is_weird_reason_code | Description of reason | Some sample user agents which trigger this check |
---|---|---|
fake_version_number | There are a lot of user agents with fake version numbers in the wild; in particular, fake Chrome Version numbers. As such, we now maintain a detailed list of every single real Chrome version number and check Chrome on Desktop version numbers against it. If the version number isn't real, it is marked as fake_version_number. Find out more about Chrome user agents with fake version numbers. |
|
didnt_have_software_version_but_needs_one | Some user agents should always have a version number in them; if the version number is missing then it's obvious that the user agent is fake or has been tampered with. |
|
software_version_int_too_big | If the version number is very large (in the thousands), the software is marked as weird, because the only way this would happen is if the user agent is fake or tampered with. Note that this is not a blanket test: we maintain a whitelist of software which is allowed to have large version numbers and won't get marked as weird. |
|
software_version_int_not_reasonable | This is similar to software_version_int_too_big, but is more subtle: it looks for version numbers which are too high, based on what we know to be the latest version number for that software. For example, if the latest version of Chrome on Desktop is 131 and a user agent reports that it's Chrome 136, then we mark it as software_version_int_not_reasonable. We also check that the version number isn't too low (e.g. Chrome version 0). |
|
found_weird_fragment_from_list | We maintain a list of fragments which instantly indicate that something is weird about the user agent. Any user agent with one or more of those fragments will be marked as weird. This list contains things like "c:\", "was here", "firstname", "whatismybrowser", all sorts of HTML tags, various groupings of symbols, "_utm", and so on. These various "weird" fragments are based off the hundreds of millions of user agents we've examined and found problems with; often it's clear that some kind of malfunctioning code has sent the user agent in their request, or that someone has intentionally changed their user agent to something silly or impossible. |
|
has_contradictory_info | Some obviously fake user agents seem to just throw everything in there to see what sticks... we mark that as weird. |
|
has_impossible_info | This is when we detect "impossible" things, for example Internet Explorer on iPhone. It's similar to has_contradictory_info. |
|
has_long_weird_alphanum | Some user agents have long, inexplicable alpha-numeric strings in them. Often they seem added by the manufacturer or an app or extension on the browser. They often appear in otherwise normal-looking user agents. If we see them, we'll flag them as weird. Note that the user agent sanitizer will neatly remove some of the more common instances of these fragments, leaving you with a normal user agent. |
|
rubbish_fragments | Some user agents have weird rubbish fragments in them; randomized groups of characters and/or symbols. We detect these in lots of different formats/configurations. |
|
started_with_weird_fragment_from_list | The user agent started with a particular fragment that we know real user agents shouldn't start with. |
|
has_encoded_character | There's some kind of encoding problem with the user agent; most likely due to a malfunctioning bot/script. |
|
escaping_problem | There's some kind of escaping problem with the user agent; most likely due to a malfunctioning bot/script. |
|
too_long | The user agent was too long. This one is simple: normal user agents shouldn't be too long. Barring one or two exceptions, any user agent that's too long is considered weird. Often you can see why it's ended up so long: malfunctioning software has repeated a fragment over and over, or multiple user agents have been concatenated, or there are just really random junk fragments. |
|
too_short | User agents shouldn't be too short either. |
|
duplicate_fragments | The user agent contains duplicate fragments. Note: this is a new check and the detection sophistication is still evolving. It will catch some duplication but admittedly not all of it yet. We continue to improve the algorithm. |
|
x_on_y | We went through a phase of seeing hundreds of thousands of user agents that had a sort of "X on Y" fragment in them. They don't appear as frequently these days, but the user agents are obviously weird, and we filter them out. There are many variants of them too; some look like people just changing their user agent by hand, others look like a bot sending the "simple_software_string" from our API back through the API as the user agent. There are a few exceptions to this rule: some valid user agents have an "X on Y" style fragment, and these are noted in the example column too. |
FYI: These user agents are examples of ones that have an "X on Y" fragment but which we don't consider weird:
|
one_big_long_string | If the user agent is all just one big long string, then it will be marked as weird. This mostly seems to happen due to malfunctioning software sending the request; often it seems that all the spaces have been stripped out. The user agent must be at least 60 characters for this check to take place, otherwise legitimate user agents such as Baiduspider-image+(+http://www.baidu.com/search/spider.htm) would get incorrectly marked as weird. There are also some exceptions, such as Facebook's Mobile App user agent which doesn't have any spaces in it. |
These user agents are examples of ones which are all one big string but which we don't consider weird:
|
started_with_bracket | User agents shouldn't start with brackets; it's almost always a sign of a problem. |
|
surrounded_by_quotation_marks | This often happens because a bot is sending its user agent incorrectly. If you see it happen a lot with your API requests, ensure that you're not accidentally double-escaping the strings (make sure you always use a proper JSON library instead of escaping strings yourself!). |
|
surrounded_by_apostrophes | This is basically the same problem as surrounded_by_quotation_marks. |
|
entirely_symbols | If the user agent consists entirely of symbols and numbers, it's weird. |
|
has_obvious_rubbish_fragment | Our systems constantly see user agents with weird, random, obviously wrong/fake fragments in them. We detect a variety of these weird fragments and mark them as weird. In some cases they seem to be some kind of "anonymizer" extension doing it, other times they seem like some kind of software defect. |
|
has_a_time_stamp | We've seen lots of user agents that have time stamps in them. This is not normal behaviour. |
|
is_a_hash_32 | Occasionally, we've been sent user agents via the API which seem to be encoded as some kind of hash or md5sum. |
|
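mismatched_parentheses | The parentheses in the user agent don't match up, which usually means it has been cut off part-way through. Obviously truncated user agents indicate a problem. |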
|
multiple_mozilla | Some user agents get concatenated for one reason or another, and we check for this. Maybe the visitor's user-agent-changing extension is faulty, or there might be a problem with the way you're handling user agents on your server. |
|
multiple_trident | User agents that have multiple Trident fragments. This isn't normal and as such we mark them as "weird". Similar to the multiple_mozilla reason code, it can also indicate that a user agent has been concatenated with another one. |
|
one_big_alphanum | User agents that are one big string of alphanumeric characters get marked as weird. |
|
words_with_spaces | The user agent is literally just one or more words with spaces in between; not even version fragments or symbols, just words and spaces. Often these get marked as is_spam too, as some organizations apparently try to spam their name all over the internet with user agent strings? Other people have changed their user agent to things like Stop trying to track me and other fake things. |
|
missing_too_much_parse_data_to_not_be_weird | There wasn't enough information found in the parse for this to not be considered "weird". In all "normal" user agents, there's enough information to at least work out what the user agent string represents, but if it's been flagged as missing_too_much_parse_data_to_not_be_weird then it means we couldn't figure out enough info from the user agent to not assume that it's "weird". Be careful with this one; it mainly exists so that our user agent listing doesn't include user agents that we haven't added detection for yet. These might be legitimate user agents that we haven't added detection for yet, although it will also catch user agents that are also quite strange but which didn't get caught by the other "weirdness" checks. |
|
missing_mozilla_fragment | The user agent is missing the very common Mozilla fragment at the start. This check doesn't apply to all user agents; many bots don't have it or require it, but normal browsers do. This seems to happen when a bot or extension isn't sending the user agent properly. |
|
invalid_trident | The Trident/ fragment (used by Internet Explorer) is invalid. |
|
operating_platform_code_too_long | We haven't seen many instances of user agents like this, but when they appear it's usually because the user agent is malformed in some way. |
|
ended_with_weird_fragment_from_list | There are a few fragments we've noticed that some weird user agents end with. |
|
surrounded_by_type_declaration | An odd little check, but we've seen user agents being sent with what looks like some kind of type declaration around the user agent string. If we see that, we'll mark it as weird, because something's sending the user agent wrong. |
|
all_lower_case | If the user agent comes through entirely in lower case, then something is weird. |
|
csv_fragment | Sometimes malformed user agents come through that look like they're part of a CSV file or a JSON structure. |
|
regex_fragment | Sometimes malformed user agents come through that look like someone accidentally sent a regex through. |
|
found_weird_fragment | These kinds of user agents are fairly rare and seem to be the product of some user agent anonymizer. |
|
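To give a feel for how a few of the simpler checks in the table could work, here is a minimal Python sketch. It is an illustration only and not the API's actual detection logic: the function name `simple_weird_reason`, the order of the checks, and the choice of which characters count as "brackets" are all assumptions. The only value taken from the table is the 60 character minimum for one_big_long_string. It also ignores the exceptions and whitelists described above.

```python
import json
import re
from typing import Optional

# Minimum length stated in the one_big_long_string row of the table.
ONE_BIG_STRING_MIN_LENGTH = 60


def simple_weird_reason(user_agent: str) -> Optional[str]:
    """Return a reason code for a few easy-to-reproduce checks, or None.

    Hypothetical helper: the real API applies many more checks and exceptions.
    """
    ua = user_agent
    if len(ua) >= 2 and ua[0] == '"' and ua[-1] == '"':
        return "surrounded_by_quotation_marks"
    if len(ua) >= 2 and ua[0] == "'" and ua[-1] == "'":
        return "surrounded_by_apostrophes"
    # Assumption: "brackets" is taken to mean (, [ and {.
    if ua.startswith(("(", "[", "{")):
        return "started_with_bracket"
    if ua.count("(") != ua.count(")"):
        return "mismatched_parentheses"
    if ua.count("Mozilla/") > 1:
        return "multiple_mozilla"
    if ua.count("Trident/") > 1:
        return "multiple_trident"
    # Only long user agents are tested, so short bot strings are left alone.
    if len(ua) >= ONE_BIG_STRING_MIN_LENGTH and not re.search(r"\s", ua):
        return "one_big_long_string"
    if any(c.isalpha() for c in ua) and ua == ua.lower():
        return "all_lower_case"
    return None


# Serialising a user agent to JSON one time too many wraps it in quotation
# marks, which is one way a surrounded_by_quotation_marks user agent can arise.
double_encoded = json.dumps("Mozilla/5.0 (X11; Linux x86_64)")
```

Passing `double_encoded` to `simple_weird_reason` trips the quotation mark check, while the Baiduspider-image user agent mentioned in the table passes because it is shorter than 60 characters.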
The primary reason for caring is that normal, legitimate software doesn't usually send user agents that are "weird". So if you are really concerned with weeding out "bad" traffic, checking the is_weird and is_weird_reason_code fields is a great way to get a better idea about the software sending the request.
With the addition of the is_weird_reason_code field, you now get a very clear idea about exactly why we consider a particular user agent "weird" - depending on your business logic you may decide to do something different with that particular user agent/web request.
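As one possible way of turning the reason code into business logic, the sketch below maps reason codes to actions. The action names, the `choose_action` function, and the grouping of reason codes are hypothetical and only illustrate the idea; which codes you treat strictly or leniently depends entirely on your own traffic.

```python
from typing import Optional

# Hypothetical policy: reason codes that strongly suggest a fake user agent.
BLOCK_REASONS = {
    "fake_version_number",
    "has_impossible_info",
    "has_contradictory_info",
}

# Hypothetical policy: reason codes that may still be legitimate software,
# so a person (or a slower pipeline) should look before anything is blocked.
REVIEW_REASONS = {
    "missing_too_much_parse_data_to_not_be_weird",
    "duplicate_fragments",
}


def choose_action(is_weird: bool, reason_code: Optional[str]) -> str:
    """Pick an action for a request based on the weirdness fields."""
    if not is_weird:
        return "allow"
    if reason_code in BLOCK_REASONS:
        return "block"
    if reason_code in REVIEW_REASONS:
        return "review"
    # Any other (or missing) reason code: let it through but record it.
    return "flag"
```

Keeping the policy in plain sets like this makes it easy to move a reason code from one group to another as you learn what your own traffic looks like.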
When we detect a user agent as "weird", usually all the other parsing detection runs as normal. As such, even if there's a fragment that shows it's obviously a tampered with user agent and we mark it as such, there still might be other fragments (perhaps a Browser name, or Operating System) which get detected as normal. So just because a user agent is marked as is_weird, you might still be able to use some of the parse data from it.
On any of the paid API plans you will see an is_weird key in the parse section of the response (for the normal Parse or the Batch Parse endpoints). If the user agent parser thinks that the user agent is "weird", the value in that key will be true.
With the release of v2.4.6, the parse results will also contain an optional key: is_weird_reason_code. If a user agent is marked as is_weird: true, the is_weird_reason_code will also include a text slug that indicates why the user agent was marked as weird.
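Because is_weird_reason_code is optional, client code should not assume it is present. The following Python sketch reads both fields defensively from an already-decoded JSON response. The top-level `"parse"` key name, the `weird_status` helper, and the `sample` dictionary are assumptions made for illustration; the sample is hand-written and is not real API output.

```python
from typing import Any, Mapping, Optional, Tuple


def weird_status(response: Mapping[str, Any]) -> Tuple[bool, Optional[str]]:
    """Return (is_weird, reason_code) from a decoded parse response.

    Assumes the parse section lives under a "parse" key; adjust if your
    response is shaped differently.
    """
    parse = response.get("parse") or {}
    is_weird = bool(parse.get("is_weird", False))
    # The reason code is optional, so fall back to None when it is absent.
    reason = parse.get("is_weird_reason_code") if is_weird else None
    return is_weird, reason


# Hand-written example payload (not captured from the API).
sample = {"parse": {"is_weird": True, "is_weird_reason_code": "too_short"}}
is_weird, reason = weird_status(sample)
```

Returning `None` for a missing reason code keeps older responses, which only carry is_weird, working with the same code path.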
This is a list of all the reasons we might mark a user agent as "weird", including some sample user agents which indicate the problem. Every single sample user agent here is one that our website or API has seen in the wild. It goes to show how prevalent these issues are and how important it is to look out for them. Note that a user agent might actually have more than one reason why it would be considered "weird", but the key will only ever contain one value.
Note that a user agent that is marked as is_weird might still also be marked as is_abusive and/or is_spam, etc.