.NET Framework - encoding.ascii

Asked By Da on 30-Jun-08 12:58 PM
I have the following code section that I thought would strip out all the
non-ascii characters from  a string after decoding it.  Unfortunately the
non-ascii characters are still in the string.
What am I doing wrong?

Dim plainText As String
plainText = "t═e"
Dim plainTextBytes() As Byte
Dim enc As Encoding = Encoding.ASCII
plainTextBytes = enc.GetBytes(plainText)
Dim str As String
str = enc.GetString(plainTextBytes).ToString
MessageBox.Show("before " & str)

Dim decodeString As String = enc.GetString(plainTextBytes)
MessageBox.Show("after " & decodeString)


Any help would be greatly appreciated.
Dan




Stephany Young replied on 30-Jun-08 01:50 PM
Before we start, let's get rid of the trees so that we can see the wood:

Dim plainText As String = "t═e"

MessageBox.Show("before " & plainText)

Dim plainTextBytes As Byte() = Encoding.ASCII.GetBytes(plainText)

Dim decodeString As String = Encoding.ASCII.GetString(plainTextBytes)

MessageBox.Show("after " & decodeString)

The 1st MessageBox.Show displays before tâ•e and the 2nd displays after
t???e which is exactly correct and is also what is expected.

The reason that the 'apparent' space does not show between the • and the e
in the 1st MessageBox.Show is that, in a proportional font, that particular
ANSI character, (144 decimal), has either no width or is very narrow.  I may
be corrected on this but I think it is called an emSpace, which I interpret
as meaning that it is 1 em wide, which is 1 point which is also 1/72 of an
inch.

The two preceding characters have the codes 226 decimal and 8226 decimal
respectively. The first of these is in the ANSI range but the second
requires 2 bytes to represent and therefore is true Unicode.

The documentation for the Encoding.ASCII.GetBytes method states
categorically that it 'encodes all the characters in the specified string
into a sequence of bytes'. Nowhere does it give the impression that it
'strips' characters out.  When it comes across a non-ASCII character (any
character code above 127) it substitutes the byte &H3F (63 decimal) which, of
course, is the ? character. ASCII control characters are encoded as
themselves, not substituted. Therefore, the after t???e displayed by the 2nd
MessageBox.Show is correct.
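
As a minimal sketch of that substitution (the module and function names are made up for illustration and are not from the documentation), the following round-trips a string through Encoding.ASCII. Each character above code 127 comes back as ?, and the length of the string does not change:

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module AsciiRoundTripDemo
    ' Hypothetical helper: encode to ASCII bytes, then decode them again.
    Function RoundTripAscii(ByVal text As String) As String
        Dim bytes As Byte() = Encoding.ASCII.GetBytes(text)
        Return Encoding.ASCII.GetString(bytes)
    End Function

    Sub Main()
        ' ChrW(226) and ChrW(8226) are both outside the ASCII range.
        Dim sample As String = "t" & ChrW(226) & ChrW(8226) & "e"
        Console.WriteLine(RoundTripAscii(sample))   ' t??e
    End Sub
End Module
```

Nothing is removed here: one substitute byte is written for every character that ASCII cannot represent.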

If you really want the non-ASCII and ASCII 'unprintable' characters stripped
out then you can use any number of algorithms that REMOVE the 'offending'
characters from the string.  One such algorithm is demonstrated:

Dim plainText As String = "t═e"

MessageBox.Show("before " & plainText)

Dim decodeString As String = String.Empty

For _i As Integer = 0 To plainText.Length - 1
    ' Keep only printable ASCII characters (codes 32 to 126).
    If AscW(plainText(_i)) >= 32 AndAlso AscW(plainText(_i)) < 127 Then
        decodeString &= plainText(_i)
    End If
Next

MessageBox.Show("after " & decodeString)


Now, the 1st MessageBox.Show displays before tâ•e and the 2nd displays after
te which is what you appear to want.
Da replied on 30-Jun-08 03:37 PM
Stephany,
thanks for the code.
When I ran your code the returned value included the ? character.  I changed
your code to:
Dim decodeString As String
Dim i As Integer
decodeString = String.Empty
For i = 0 To plainText.Length - 1
If plainTextBytes(i) >= 32 And _
plainTextBytes(i) <= 126 And plainTextBytes(i) <> 63 Then
decodeString += plainText(i)
End If
Next

and only the "te" appeared.

I assume that Encoding.ascii.getbytes converts all valid ASCII characters to
their hex value and Encoding.ascii.getstring returns the printable character
of valid ASCII characters.  If any value is not valid the hex value of 63 is
used.

Again thanks for your help.  It was very informative.
Dan
Stephany Young replied on 30-Jun-08 10:14 PM
It's important that you understand what happens and it is not clear that you
actually do. Your use of inappropriate terminology is what causes this
suspicion.

There is no conversion to any 'hex' value. A 'hex' value is nothing more
than a readable representation of something.  More importantly, 63 is the
DECIMAL representation of the ? character. 3F is the 'hex' representation.

You don't have to assume anything. The documentation for the
Encoding.ASCII.GetBytes method, that I referred you to, tells you EXACTLY
what it does.  Note the use, in the documentation, of the phrase 'all
characters'.

In general, if the character code for a character in the specified
string is in the range 0 to 127 inclusive, then the actual character is
used; otherwise the ? character is used as a substitute. If there are any
exceptions to the general rule, I have yet to encounter any.

When you used the code fragment I posted, and you still got a ? in your
result then the 'before' string must have contained a ? character, which of
course is perfectly valid. It is, of course a punctuation mark that
indicates a question and therefore can appear in all sorts of strings.

It is apparent that you do not have Option Strict set. If this is the case
then I strongly recommend that you set it ON and leave it that way.

When you turn it on you will find that your code will not compile without
warnings and may not even compile at all.

The main thing is that you should not be excluding ? from the new string.
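
One way to keep a genuine ? while dropping the non-ASCII characters, offered as a sketch and not as the only approach, is to ask for an ASCII encoding whose replacement fallback is an empty string. The helper name StripNonAscii is hypothetical; Encoding.GetEncoding, EncoderReplacementFallback and DecoderReplacementFallback are the framework's own:

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module StripNonAsciiDemo
    ' Hypothetical helper: drop every character that ASCII cannot encode.
    Function StripNonAscii(ByVal text As String) As String
        ' An empty replacement string means unencodable characters are
        ' omitted instead of being replaced with "?".
        Dim enc As Encoding = Encoding.GetEncoding( _
            "us-ascii", _
            New EncoderReplacementFallback(String.Empty), _
            New DecoderReplacementFallback(String.Empty))
        Return enc.GetString(enc.GetBytes(text))
    End Function

    Sub Main()
        ' The trailing ? is part of the original text and must survive.
        Dim sample As String = "t" & ChrW(226) & ChrW(8226) & "e?"
        Console.WriteLine(StripNonAscii(sample))   ' te?
    End Sub
End Module
```

Note that this keeps ASCII control characters (codes 0 to 31 and 127); if those must go as well, a character-by-character filter is still needed.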
Andrew Morton replied on 01-Jul-08 03:49 AM
I have no knowledge of whether or not 144 represents an em space in some
code page, but, for the sake of completeness, that isn't the size of an em:
http://en.wikipedia.org/wiki/Em_%28typography%29

Andrew
Cor Ligthert [MVP] replied on 01-Jul-08 07:06 AM
Stephany,


There are no ANSI characters, so please don't add to the confusion.
ASCII is a 7-bit character code system, while EBCDIC is an 8-bit one.

Most character sets used outside real mainframes derive from ASCII. However,
when those use the most significant bit, the same byte value can represent a
different character in each code page.

Cor
Da replied on 02-Jul-08 06:12 PM
Stephany,
Thanks for the information.
As to your point that the original string had a "?" character included, that
is not true. Maybe the confusion is that I don't understand the function
AscW(plainText(_i)).  In all the testing that I have done, all encoding
functions change a decimal value greater than 127 to a decimal 63.  I can
remove all the "?" from the string but I will also remove the intended "?".
Dan
Stephany Young replied on 02-Jul-08 07:33 PM
The AscW(Char) method returns an Integer value representing the character
code corresponding to a Unicode character. This can be 0 through 65535.  The
returned value is independent of the culture and code page settings for the
current thread.

Using a subscript to access the individual characters makes use of the fact
that you can treat a String as if it were an array of Char.

So, the code fragment:

decodeString = String.Empty

For _i As Integer = 0 To plainText.Length - 1
    ' Keep only printable ASCII characters (codes 32 to 126).
    If AscW(plainText(_i)) >= 32 AndAlso AscW(plainText(_i)) < 127 Then
        decodeString &= plainText(_i)
    End If
Next

does nothing more than append all characters from plainText that have their
character codes in the range 32 to 126 inclusive, to decodeString which is,
initially, empty.

Therefore, if you end up with a ? (character code 63) in decodeString, then
it was present in plainText. QED.

If you use another methodology in an attempt to 'remove' non-ASCII and/or
ASCII non-printable characters from a string then you may end up with a
different result, because the culture and/or code page settings for the
current thread may be taken into account.

There is another factor that could come into play here, and that is that one
or more of the characters in plainText may have a character code of 0 (NUL).
If you display such a string with MessageBox.Show, among other methods, then
those characters after the NUL will NOT be displayed.

For example, the string "ABCDE" & ChrW(0) & "?" would be displayed as
ABCDE, the removal of the NUL character having 'exposed' the "?" that you
didn't realise was actually present.

One way of detecting the presence of a NUL character is:

If plainText.Contains(ChrW(0)) Then
' NUL character is present
Else
' NUL character is NOT present
End If

Note that the ChrW(Integer) method is, effectively, the reverse of the
AscW(Char) method in that it returns the character associated with the
specified character code. The character code can be in the range -32768
through 65535 but the values -32768 through -1 are treated the same as
values in the range 32768 through 65535.
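
A short sketch of how the two functions mirror each other (the module name is arbitrary; the character codes are the ones mentioned earlier in this thread):

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module CharCodeDemo
    Sub Main()
        ' AscW returns the Unicode character code of a Char.
        Console.WriteLine(AscW("?"c))               ' 63
        ' ChrW is the reverse: code in, character out.
        Console.WriteLine(AscW(ChrW(8226)))         ' 8226
        ' Negative codes map onto the range 32768 through 65535.
        Console.WriteLine(ChrW(-1) = ChrW(65535))   ' True
    End Sub
End Module
```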
Herfried K. Wagner [MVP] replied on 02-Jul-08 08:30 PM
If the OP is thinking about adopting this solution for longer strings, I
suggest taking a look at the 'StringBuilder' class for faster string
concatenations.
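
A minimal sketch of the same filter written with StringBuilder follows. The helper name KeepPrintableAscii is made up for illustration, and the 32 to 126 range is the one used in the loop posted earlier in the thread:

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module StringBuilderStripDemo
    ' Hypothetical helper: keep only printable ASCII (codes 32 to 126).
    Function KeepPrintableAscii(ByVal text As String) As String
        ' Pre-size the builder; the result can never be longer than the input.
        Dim sb As New StringBuilder(text.Length)
        For Each c As Char In text
            Dim code As Integer = AscW(c)
            If code >= 32 AndAlso code < 127 Then
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim sample As String = "t" & ChrW(226) & ChrW(8226) & ChrW(144) & "e"
        Console.WriteLine(KeepPrintableAscii(sample))   ' te
    End Sub
End Module
```

For a handful of characters the difference is negligible; the builder only pays off when the strings get long.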

--
M S   Herfried K. Wagner
M V P  <URL:http://dotnet.mvps.org/>
V B   <URL:http://dotnet.mvps.org/dotnet/faqs/>
Stephany Young replied on 02-Jul-08 09:15 PM
The actual methodology used in any given situation will, of course, need to
take into account things like performance requirements and the size of
objects along with any number of other considerations.  That goes without
saying Herfried.

All 'we' are doing here is demonstrating one (among a myriad) of
methodologies that can be used to achieve the correct result. Obviously
achieving the correct result is the most important consideration.

The main thrust of the whole thing is that any given methodology can have
pitfalls, depending on the circumstances, given the nuances of
string-handling when various cultures and/or code pages come into the mix.