[darcs-users] Debugging issue1739-escape-multibyte-chars-correctly.sh on tn23

Wed Mar 31 21:28:43 UTC 2010

Hello,

The patch

> Wed Mar 31 01:45:35 CEST 2010  Petr Rockai <me at mornfall.net>
>   * Avoid a direct multibyte string in issue1739 test that confuses ShellHarness.

changes the test issue1739-escape-multibyte-chars-correctly.sh, which fails on the tn23 builder, so that it reads the patch name from a file instead of getting it via the -m option on the command line. This change does not seem to have removed the problem.

To debug, I did:

> thorkil-naurs-intel-mac-mini:~/tn/buildbot/darcs/tn23-Intel-Mac/tn23 6.8.3/build thorkilnaur$ darcs changes -v --last=1
> Wed Mar 31 21:10:19 CEST 2010  naur at post11.tele.dk
>   * Issue #1739 debugging 2
>     hunk ./src/Darcs/Lock.hs 37
>     -import Data.List ( inits )
>     +import Data.List ( inits, intersperse )
>     hunk ./src/Darcs/Lock.hs 73
>     +
>     +import Debug.Trace
>     +
>     hunk ./src/Darcs/Lock.hs 290
>     +traceIO :: String -> IO String -> IO String
>     +traceIO s x = do c <- x; return (trace (s ++ ": <" ++ c ++ "> (" ++ concat (intersperse " " $ map (show . fromEnum) c) ++ ")") c)
>     +
>     hunk ./src/Darcs/Lock.hs 296
>     -readLocaleFile f = decodeLocale `fmap` B.readFile (toFilePath f)
>     +readLocaleFile f = traceIO "readLocaleFile" (decodeLocale `fmap` B.readFile (toFilePath f))
>     hunk ./tests/issue1739-escape-multibyte-chars-correctly.sh 52
>     +not echo Fail on purpose to become able to see the trace output
>     +
> thorkil-naurs-intel-mac-mini:~/tn/buildbot/darcs/tn23-Intel-Mac/tn23 6.8.3/build thorkilnaur$ 

This lets us inspect the contents of the file read by readLocaleFile after they have passed through decodeLocale. readLocaleFile is the function used to read the patch name from the message.txt file in the issue1739-escape-multibyte-chars-correctly.sh test.

The result is (http://buildbot.darcs.net/builders/tn23%206.8.3/builds/211/steps/test/logs/stdio):

> Running issue1739-escape-multibyte-chars-correctly.sh ... failed.
> Probable reason :
> ## I would use the builtin !, but that has the wrong semantics.
> not () { "$@" && exit 1 || :; }
> 
> # trick: OS-detection (if needed)
> abort_windows () {
> if echo $OS | grep -i windows; then
>   echo This test does not work on Windows
>   exit 200
> fi
> }
> 
> pwd() {
>     ghc --make "$TESTS_WD/hspwd.hs"
>     "$TESTS_WD/hspwd"
> }
> 
> # switch locale to latin9 if supported if there's a locale command, skip test
> # otherwise
> switch_to_latin9_locale () {
>     if ! which locale ; then
>         echo "no locale command"
>         exit 200 # skip test
>     fi
> 
>     latin9_locale=`locale -a | grep @euro | head -n 1`
>     if [ -z "$latin9_locale" ]; then
>             echo "no latin9 locale found"
>             exit 200 # skip, we can't switch away from UTF-8
>     fi
> 
>     echo "Using locale $latin9_locale"
>     export LC_ALL=$latin9_locale
>     echo "character encoding is now `locale charmap`"
> }
> 
> # we want escaping, otherwise output of non-ASCII characters is unreliable
> export DARCS_DONT_ESCAPE_ANYTHING=0
> + export DARCS_DONT_ESCAPE_ANYTHING=0
> + DARCS_DONT_ESCAPE_ANYTHING=0
> 
> rm -rf R
> + rm -rf R
> mkdir R
> + mkdir R
> cd R
> + cd R
> darcs init
> + darcs init
> 
> echo garbelbolf > aargh
> + echo garbelbolf
> darcs add aargh
> + darcs add aargh
> echo -e '\xe2\x80\x9e\x54\x61\x20\x4d\xc3\xa8\x72\x65\xe2\x80\x9d' > message.txt
> + echo -e '\xe2\x80\x9e\x54\x61\x20\x4d\xc3\xa8\x72\x65\xe2\x80\x9d'
> darcs record --logfile=message.txt -A 'Petra Testa van der Test <test at example.com>' -a > rec.txt
> + darcs record --logfile=message.txt -A 'Petra Testa van der Test <test at example.com>' -a
> readLocaleFile: <â€žTa MÃ¨reâ€
> > (226 128 158 84 97 32 77 195 168 114 101 226 128 157 10)
> readLocaleFile: <â€žTa MÃ¨reâ€
> > (226 128 158 84 97 32 77 195 168 114 101 226 128 157 10)
> readLocaleFile: <â€žTa MÃ¨reâ€
> ***END OF DESCRIPTION***
> 
> Place the long patch description above the ***END OF DESCRIPTION*** marker.
> The first line of this file will be the patch name.
> 
> 
> This patch contains the following changes:
> 
> A ./aargh
> > (226 128 158 84 97 32 77 195 168 114 101 226 128 157 10 42 42 42 69 78 68 32 79 70 32 68 69 83 67 82 73 80 84 73 79 78 42 42 42 10 10 80 108 97 99 101 32 116 104 101 32 108 111 110 103 32 112 97 116 99 104 32 100 101 115 99 114 105 112 116 105 111 110 32 97 98 111 118 101 32 116 104 101 32 42 42 42 69 78 68 32 79 70 32 68 69 83 67 82 73 80 84 73 79 78 42 42 42 32 109 97 114 107 101 114 46 10 84 104 101 32 102 105 114 115 116 32 108 105 110 101 32 111 102 32 116 104 105 115 32 102 105 108 101 32 119 105 108 108 32 98 101 32 116 104 101 32 112 97 116 99 104 32 110 97 109 101 46 10 10 10 84 104 105 115 32 112 97 116 99 104 32 99 111 110 116 97 105 110 115 32 116 104 101 32 102 111 108 108 111 119 105 110 103 32 99 104 97 110 103 101 115 58 10 10 65 32 46 47 97 97 114 103 104 10)
> darcs changes > log.txt
> + darcs changes
> cat log.txt
> + cat log.txt
> Wed Mar 31 22:36:25 CEST 2010  Petra Testa van der Test <test at example.com>
>   * [_<U+00E2>_][_<U+0080>_][_<U+009E>_]Ta M[_<U+00C3>_][_<U+00A8>_]re[_<U+00E2>_][_<U+0080>_][_<U+009D>_]
> grep '<U+201E>' log.txt
> + grep '<U+201E>' log.txt
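
The escaped patch name at the end of this log follows directly from the code points shown in the trace. Here is a rough sketch of that mapping in shell. It is inferred only from the log output, not taken from the darcs source, and escape_codepoints is a hypothetical helper name. It assumes that printable ASCII is passed through and everything else is written as [_<U+XXXX>_].

```shell
#!/bin/sh
# Hypothetical helper: turn a list of decimal code points into the
# escaped form seen in the log. Printable ASCII (32..126) is emitted
# as is; every other code point becomes [_<U+XXXX>_].
escape_codepoints() {
    for cp in "$@"; do
        if [ "$cp" -ge 32 ] && [ "$cp" -le 126 ]; then
            # %b expands the \0ddd octal escape into the character itself
            printf '%b' "\\0$(printf '%03o' "$cp")"
        else
            printf '[_<U+%04X>_]' "$cp"
        fi
    done
    echo
}

# Code points from the two traces, without the trailing newline (10).
failing="226 128 158 84 97 32 77 195 168 114 101 226 128 157"
passing="8222 84 97 32 77 232 114 101 8221"

# Unquoted on purpose so that the lists split into separate arguments.
escape_codepoints $failing | grep '<U+201E>' || echo "no match for the byte-wise code points"
escape_codepoints $passing | grep '<U+201E>'
```

With the byte-wise code points there is no U+201E to escape, so the grep in the test has nothing to find.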

If we run the build with LC_ALL=da_DK.UTF-8, the issue1739-escape-multibyte-chars-correctly.sh test actually succeeds (http://buildbot.darcs.net/builders/tn23%206.8.3/builds/207/steps/test/logs/stdio). But in order to be able to inspect the trace output, I have changed the test to fail on purpose at its end. With this change, we get (http://buildbot.darcs.net/builders/tn23%206.8.3/builds/209/steps/test/logs/stdio):

> Running issue1739-escape-multibyte-chars-correctly.sh ... failed.
> Probable reason :
> ## I would use the builtin !, but that has the wrong semantics.
> not () { "$@" && exit 1 || :; }
> 
> # trick: OS-detection (if needed)
> abort_windows () {
> if echo $OS | grep -i windows; then
>   echo This test does not work on Windows
>   exit 200
> fi
> }
> 
> pwd() {
>     ghc --make "$TESTS_WD/hspwd.hs"
>     "$TESTS_WD/hspwd"
> }
> 
> # switch locale to latin9 if supported if there's a locale command, skip test
> # otherwise
> switch_to_latin9_locale () {
>     if ! which locale ; then
>         echo "no locale command"
>         exit 200 # skip test
>     fi
> 
>     latin9_locale=`locale -a | grep @euro | head -n 1`
>     if [ -z "$latin9_locale" ]; then
>             echo "no latin9 locale found"
>             exit 200 # skip, we can't switch away from UTF-8
>     fi
> 
>     echo "Using locale $latin9_locale"
>     export LC_ALL=$latin9_locale
>     echo "character encoding is now `locale charmap`"
> }
> 
> # we want escaping, otherwise output of non-ASCII characters is unreliable
> export DARCS_DONT_ESCAPE_ANYTHING=0
> + export DARCS_DONT_ESCAPE_ANYTHING=0
> + DARCS_DONT_ESCAPE_ANYTHING=0
> 
> rm -rf R
> + rm -rf R
> mkdir R
> + mkdir R
> cd R
> + cd R
> darcs init
> + darcs init
> 
> echo garbelbolf > aargh
> + echo garbelbolf
> darcs add aargh
> + darcs add aargh
> echo -e '\xe2\x80\x9e\x54\x61\x20\x4d\xc3\xa8\x72\x65\xe2\x80\x9d' > message.txt
> + echo -e '\xe2\x80\x9e\x54\x61\x20\x4d\xc3\xa8\x72\x65\xe2\x80\x9d'
> darcs record --logfile=message.txt -A 'Petra Testa van der Test <test at example.com>' -a > rec.txt
> + darcs record --logfile=message.txt -A 'Petra Testa van der Test <test at example.com>' -a
> readLocaleFile: <„Ta Mère”
> > (8222 84 97 32 77 232 114 101 8221 10)
> readLocaleFile: <„Ta Mère”
> > (8222 84 97 32 77 232 114 101 8221 10)
> readLocaleFile: <„Ta Mère”
> ***END OF DESCRIPTION***
> 
> Place the long patch description above the ***END OF DESCRIPTION*** marker.
> The first line of this file will be the patch name.
> 
> 
> This patch contains the following changes:
> 
> A ./aargh
> > (8222 84 97 32 77 232 114 101 8221 10 42 42 42 69 78 68 32 79 70 32 68 69 83 67 82 73 80 84 73 79 78 42 42 42 10 10 80 108 97 99 101 32 116 104 101 32 108 111 110 103 32 112 97 116 99 104 32 100 101 115 99 114 105 112 116 105 111 110 32 97 98 111 118 101 32 116 104 101 32 42 42 42 69 78 68 32 79 70 32 68 69 83 67 82 73 80 84 73 79 78 42 42 42 32 109 97 114 107 101 114 46 10 84 104 101 32 102 105 114 115 116 32 108 105 110 101 32 111 102 32 116 104 105 115 32 102 105 108 101 32 119 105 108 108 32 98 101 32 116 104 101 32 112 97 116 99 104 32 110 97 109 101 46 10 10 10 84 104 105 115 32 112 97 116 99 104 32 99 111 110 116 97 105 110 115 32 116 104 101 32 102 111 108 108 111 119 105 110 103 32 99 104 97 110 103 101 115 58 10 10 65 32 46 47 97 97 114 103 104 10)
> darcs changes > log.txt
> + darcs changes
> cat log.txt
> + cat log.txt
> Wed Mar 31 21:07:19 CEST 2010  Petra Testa van der Test <test at example.com>
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> grep '<U+201E>' log.txt
> + grep '<U+201E>' log.txt
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> grep '<U+201D>' log.txt
> + grep '<U+201D>' log.txt
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> grep '<U+00E8>' log.txt
> + grep '<U+00E8>' log.txt
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> 
> # locale should not matter
> LC_ALL=C darcs changes > log.txt
> + LC_ALL=C
> + darcs changes
> grep '<U+201E>' log.txt
> + grep '<U+201E>' log.txt
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> grep '<U+201D>' log.txt
> + grep '<U+201D>' log.txt
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> grep '<U+00E8>' log.txt
> + grep '<U+00E8>' log.txt
>   * [_<U+201E>_]Ta M[_<U+00E8>_]re[_<U+201D>_]
> 
> not echo Fail on purpose to become able to see the trace output
> + not echo Fail on purpose to become able to see the trace output
> + echo Fail on purpose to become able to see the trace output
> Fail on purpose to become able to see the trace output
> + exit 1

The main difference is between the trace output with LC_ALL=''

> readLocaleFile: <â€žTa MÃ¨reâ€
> > (226 128 158 84 97 32 77 195 168 114 101 226 128 157 10)

where the individual bytes of the UTF-8 sequence have passed through decodeLocale unchanged, and the trace output with LC_ALL=da_DK.UTF-8

> readLocaleFile: <„Ta Mère”
> > (8222 84 97 32 77 232 114 101 8221 10)

where decodeLocale has converted the UTF-8 sequence to proper Char values, that is, Unicode code points. I believe UTF-32 is the appropriate name for this representation.
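
The two decimal sequences can be reproduced outside darcs. The following is only a sketch, assuming a POSIX shell with iconv, od and awk available. The helper names msg_bytes and codepoints are made up for illustration. ISO-8859-1 is used as a stand-in for "every byte becomes one character", which is what the LC_ALL='' trace looks like; it is not necessarily what Haskeline does internally.

```shell
#!/bin/sh
# The 14 bytes that the test writes to message.txt, given as octal
# escapes so that a plain POSIX printf understands them. The trailing
# newline (10) is left out.
msg_bytes() {
    printf '\342\200\236Ta M\303\250re\342\200\235'
}

# Decode stdin from the charset named in $1 and print the resulting
# code points in decimal, separated by single spaces. UTF-32BE gives
# four bytes per character, which awk folds back into one number.
codepoints() {
    iconv -f "$1" -t UTF-32BE | od -An -v -tu1 | awk '
        { for (i = 1; i <= NF; i++) {
              v = v * 256 + $i
              if (++n % 4 == 0) { printf "%s%d", sep, v; sep = " "; v = 0 }
          } }
        END { print "" }'
}

as_bytes=$(msg_bytes | codepoints ISO-8859-1)
as_utf8=$(msg_bytes | codepoints UTF-8)

echo "one character per byte: $as_bytes"
echo "decoded as UTF-8:       $as_utf8"
```

The first line corresponds to the trace from the failing build, the second to the trace from the build with LC_ALL=da_DK.UTF-8.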

Now, decodeLocale is

> ByteStringUtils.hs:decodeLocale = unsafePerformIO . runInputT defaultSettings . decode

where decode is really System.Console.Haskeline.Encoding.decode:

> -- | Convert a 'ByteString' from the console's encoding into a Unicode 'String'.
> decode :: MonadIO m => ByteString -> InputT m String
> decode str = do
>     decoder <- asks decodeForTerm
>     liftIO $ decoder str

I am slightly puzzled to see "the console's encoding" being used to decode the contents of a file. Something similar happens when interpreting the string following the -m option, thus explaining why reading the patch name from a file hasn't really changed the situation.

In any case, I don't really know how to proceed from here, but perhaps these additional details will help someone else to understand the root cause of this problem better and get us closer to a solution.

Thanks and best regards
Thorkil