Jonathan & Tielman,
Thanks a lot, I think that I can make something out of that. The input is a
dos file (untainted by Windows) and I am sure that you are right about
I didn't have perldoc installed until very recently; somehow it never got on. So I
have only just realised what a great resource it is.
I do note that when using the editor in linux, either gedit when working on
my pearls, or the inbuilt one in mc when having a squizz in the directory,
that many of the characters don't show up. Like my Alt-127 (DEL or little
house) shows up as a block (an outline square character). And things like C-
cedilla also show as a block. I suppose I should look at the editor
settings. I know that gedit does not detect it as cp437 so a bit of
investigation needed there.
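
One way to sidestep the editor's guess is to convert a copy of the file to UTF-8 first. The following is only a minimal sketch, assuming the data really is raw cp437 bytes and that the core Encode module (shipped with perl 5.8) is available; the sub name cp437_to_utf8 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw cp437 bytes into UTF-8 bytes that a
# UTF-8 aware editor should be able to display.
sub cp437_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp437', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

# In cp437, 0x87 is c-cedilla, so this spells "Garcon" with a cedilla.
my $utf8 = cp437_to_utf8("Gar\x87on");
print $utf8, "\n";
```

Note that Encode's cp437 table maps byte 0x7F to the DEL control character rather than to the little house glyph, so that one may still need handling by hand.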
This of course fouls up tr///, because it only translates discrete characters, and I
don't think it very smart to paste in blocks which you can't read even if
they would work (which I think they do, at least in regexes, which I
tried out before I found \x), and I don't think you can put \x7F, for
instance, into tr///.
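
For what it is worth, perlop describes tr/// as accepting the same backslash escapes as double-quoted strings, so the \x form should be usable there as well. A minimal sketch, assuming the input is raw cp437 bytes; the sub name strip_cp437_accents and the deliberately incomplete table are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a handful of cp437 accented bytes with
# plain ASCII letters. The table is illustrative, not complete.
sub strip_cp437_accents {
    my ($s) = @_;
    # cp437 bytes: 80=C-cedilla 81=u-umlaut 82=e-acute 83=a-circumflex
    #              84=a-umlaut  85=a-grave  87=c-cedilla 7F=house/DEL
    $s =~ tr/\x80\x81\x82\x83\x84\x85\x87\x7F/Cueaaac /;
    return $s;
}

print strip_cp437_accents("Gar\x87on caf\x82"), "\n";   # prints: Garcon cafe
```

The search list and the replacement list are matched position by position, so each hex escape lines up with one plain letter, and 0x7F becomes a space here.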
Written in dos, modified on linux perl (5.8), sent over to Windoze!
Hopefully one day to some or other sql database but that is too big a step
> It depends how your input data is formatted.
> If it is UTF8 encoded, in double bytes, and if you have perl 5.6.1 and
> above, use Unicode::Normalize:
> use Unicode::Normalize qw/:all/;
> $string =~ s/([\x80-\xFF])/substr(decompose($1),0,1)/eg;
> If it's formatted by one or other Windows app, with special characters
> such as the Euro symbol single byte encoded, then your input is probably
> CP1252. I'm not sure about the available conversion modules.
> Alternatively, just run your whole input file through the libiconv C
> iconv --from-code=ISO-8859-1 --to-code=UTF-8
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Anne
> Sent: 08 May 2008 17:20
> To: [email protected]
> Subject: [Za-pm] ascii high order character conversion
> I have data prepared on a dos programme that involves high order
> characters, like European letters with umlauts, cedillas, acute
> accents, etc.
> I have a dos utility that I wrote that converts all of these to plain
> unaccented characters, a simple replacement operation. The reason being
> that in moving the data to Windows it does not show them correctly, and this
> was the easiest way to go at the time. Now I am away from that route and
> want to build this into my perl database conversion routine (convert
> proprietary to delimited).
> Now I am wondering if there is an easier way in perl than doing a s/// for
> each of the characters used. I looked in the Perl Cookbook, and had a
> wander through the CPAN modules, but nothing struck me as specific for the
> task in hand.
> Not that lines of s/// wouldn't do the job, but I wondered if there was a
> more concise way of programming this to convert either to the plain
> unaccented character or to the correct Windows character.
> [Maybe I must study the "perlebcdic Considerations for running Perl on
> EBCDIC platforms" found on CPAN, which looks like it might be a guide.
> It suggests tr///; will absorb this evening.]
> Had hoped for a ready module from CPAN, but see nothing.
> Any ideas gratefully received on what must have been a common problem
> years back?
> Anne Wainwright
> Za-pm mailing list
> [email protected]