Tuesday, November 07, 2006

Input Validation: What is an alpha-num character?

A few days ago, I wrote that I didn't agree with Eric Lippert about using a regex to filter alphanumeric input. Take a registration system that uses such validation: it would rule out my daughter Iséa because of the accented é. (It would also rule out all text in non-Latin scripts such as Greek, Russian, Japanese...)
I said I would post some code to do such alphanum validation.

Unicode Categories

The idea behind such code is to loop over all chars in the string and examine their Unicode categories: each char must match Char.IsLetter() and/or Char.IsNumber(). This means Greek letters, accented French letters, Japanese ideograms et al. are accepted (yes, the docs for IsLetter() say alphabetic, but it includes ideograms). Even Myanmar digits, actually.

That's much better than a simple regex, but it's still not enough. The input might include composite characters, where the letter and its diacritic mark(s) are coded as two (or more) separate characters. For example, é is either U+00E9 or the pair U+0065 U+0301. Hence our check should also accept the NonSpacingMark Unicode category.
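To make the idea concrete, here is a minimal sketch in Python rather than a .NET language. It assumes that Python's unicodedata general categories line up with what Char.IsLetter() and Char.IsNumber() accept (all "L*" and "N*" categories), and that "Mn" corresponds to NonSpacingMark. The function name is_alphanumeric and the decision to reject empty input are my own choices for illustration.

```python
import unicodedata


def is_alphanumeric(text: str) -> bool:
    """Return True if every char is a letter, a number, or a non-spacing mark."""
    if not text:
        return False  # assumption: empty input is rejected
    for ch in text:
        category = unicodedata.category(ch)
        # "L*" = letters (Lu, Ll, Lt, Lm, Lo), "N*" = numbers (Nd, Nl, No),
        # "Mn" = non-spacing marks such as U+0301 COMBINING ACUTE ACCENT.
        if category[0] not in ("L", "N") and category != "Mn":
            return False
    return True


if __name__ == "__main__":
    # Precomposed é, decomposed e + U+0301, and a name with a hyphen.
    for sample in ("Is\u00e9a", "Ise\u0301a", "Jean-Pierre"):
        print(repr(sample), is_alphanumeric(sample))
```

Both spellings of Iséa pass, while Jean-Pierre fails because of the hyphen. Note that this sketch does not check that a non-spacing mark actually follows a base letter; a stricter version might normalize the string first.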

Writing such a little routine in an ASP.NET-compatible language is left as an exercise for the reader, given the links in the paragraph above. I admit that I don't know what it would look like in PHP, even though I have some non-negligible experience in that language.

Want to make similar checks in your C++ Win32 apps? GetStringTypeW is your friend.

Inclusion Set vs Exclusion Set

But the more I think about it, the more I wonder if such alphanumerical checks are a good idea at all. Unless you want to validate input for a very specific format (e.g. a Belgian car license number), validation that consists of checking whether all chars are within a given set is flawed by design: in most cases, you just can't define the acceptable set. Example: alphanumerical chars are not enough for first-name validation, since one needs the "-" as well (such as in Jean-Pierre). It's actually easier to work the other way around: check that no character belongs to a given set of unacceptable characters, such as apostrophes, which are SQL string delimiters.

The problem with my suggestion is that you let slip some unacceptable chars that you're not aware of. Security zealots would tell me that they prefer to force me to write Isea instead of Iséa rather than take the risk of leaving a door open.

I agree with them. BTW, is it true that there was once a vulnerability in Windows based on the use of the Turkish dotless i? That would prove that even though a regex is far too restrictive, using an exclusion set is asking for security holes.

Why exclude characters at all?

What to do then? Well, we can usually rely on our programming platform (language, DB driver author, whatever...) to provide such safety checks for us. Better yet, this check shouldn't reject unacceptable data; it should rather escape it in a way that keeps the DB happy. That allows company names with an apostrophe to be typed correctly.
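As an illustration, here is a small Python sketch using the standard sqlite3 module. Strictly speaking, the "?" placeholder binds the value separately from the SQL text instead of escaping it, but the effect is the one described above: the apostrophe reaches the DB as plain data. The table, the helper names and the company name are all made up for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical companies table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE companies (name TEXT)")
    return conn


def add_company(conn: sqlite3.Connection, name: str) -> None:
    # The "?" placeholder lets the driver pass the value as data,
    # so quotes in the name never become part of the SQL statement.
    conn.execute("INSERT INTO companies (name) VALUES (?)", (name,))


def company_names(conn: sqlite3.Connection) -> list:
    """Return the stored names in insertion order."""
    rows = conn.execute("SELECT name FROM companies ORDER BY rowid")
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_db()
    add_company(conn, "Serge's Bakery")  # hypothetical name with an apostrophe
    add_company(conn, "x'); DROP TABLE companies; --")  # stored as plain text
    print(company_names(conn))
```

Nothing is rejected here: both strings are stored exactly as typed, and the table is still there afterwards.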

Great, we're back to another security measure enumerated by Eric: Use your DB API to escape input. Of course, pay attention to escaping it for HTML/WML/whatever rendering as well. It is obviously a little more difficult than simply accepting US-ASCII alphanum chars only. But security tradeoffs should not become an excuse for promoting incompetence: We must do our homework!
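The rendering side can be sketched the same way. This Python snippet assumes plain HTML output and uses the standard html.escape function; the render_greeting helper and its markup are hypothetical, and other output formats (WML, attributes, JavaScript...) would need their own escaping rules.

```python
import html


def render_greeting(name: str) -> str:
    """Build a small HTML fragment, escaping the user-supplied name."""
    # html.escape replaces &, <, > and (by default) both quote characters,
    # so the name is displayed as text instead of being parsed as markup.
    return "<p>Hello, " + html.escape(name) + "!</p>"


if __name__ == "__main__":
    print(render_greeting("Is\u00e9a"))
    print(render_greeting("<script>alert(1)</script>"))
```

The accented name goes through untouched, while the markup characters in the second input are turned into entities.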

The Short Version

All this to say that input validation is sometimes not a good idea: it's better to make sure you can safely accept all input.
And Unicode categories are cool ;-)

2 Comments:

At 11:59 AM, Anonymous said...

I'm not qualified to give my comments on the content of your post, way above my head and all, so I'm going to comment on the format and more specifically on one of my pet peeves: the use of i.e. vs e.g. Both are from Latin, respectively 'id est' ('that is') and 'exempli gratia' ('for example'). You wrote:

Unless you want to validate input for a very specific format (i.e. a Belgian car license number)

But it should have been 'e.g.' because a Belgian car license number is just one of the possible formats, not the only one your argument is valid for. See also http://www.suite101.com/article.cfm/8707/52862.

Sorry for being a nitpicking asshole.

 
At 12:10 PM, Serge Wautier said...

I stand corrected!

> Sorry for being a nitpicking asshole.

No problem. I am one as well :-)

 
