Commit f4733988 authored by Jason Rhinelander's avatar Jason Rhinelander
Browse files

Added script to generate lc/uc pairs; added 0-1ffff.txt containing script output for 0-1FFFF.

parent 7ae015c2
Lower: abcdefghijklmnopqrstuvwxyzµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķĺļľŀłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżžſƃƅƈƌƒƕƙƚƞơƣƥƨƭưƴƶƹƽƿDždžLjljNjnjǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯDzdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȣȥȧȩȫȭȯȱȳȼɓɔɖɗəɛɠɣɨɩɯɲɵʀʃʈʊʋʒʔάέήίαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϐϑϕϖϙϛϝϟϡϣϥϧϩϫϭϯϰϱϲϵϸϻабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹԁԃԅԇԉԋԍԏաբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕẛạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹἀἁἂἃἄἅἆἇἐἑἒἓἔἕἠἡἢἣἤἥἦἧἰἱἲἳἴἵἶἷὀὁὂὃὄὅὑὓὕὗὠὡὢὣὤὥὦὧὰάὲέὴήὶίὸόὺύὼώᾀᾁᾂᾃᾄᾅᾆᾇᾐᾑᾒᾓᾔᾕᾖᾗᾠᾡᾢᾣᾤᾥᾦᾧᾰᾱᾳιῃῐῑῠῡῥῳⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩⰰⰱⰲⰳⰴⰵⰶⰷⰸⰹⰺⰻⰼⰽⰾⰿⱀⱁⱂⱃⱄⱅⱆⱇⱈⱉⱊⱋⱌⱍⱎⱏⱐⱑⱒⱓⱔⱕⱖⱗⱘⱙⱚⱛⱜⱝⱞⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⴀⴁⴂⴃⴄⴅⴆⴇⴈⴉⴊⴋⴌⴍⴎⴏⴐⴑⴒⴓⴔⴕⴖⴗⴘⴙⴚⴛⴜⴝⴞⴟⴠⴡⴢⴣⴤⴥabcdefghijklmnopqrstuvwxyz𐐨𐐩𐐪𐐫𐐬𐐭𐐮𐐯𐐰𐐱𐐲𐐳𐐴𐐵𐐶𐐷𐐸𐐹𐐺𐐻𐐼𐐽𐐾𐐿𐑀𐑁𐑂𐑃𐑄𐑅𐑆𐑇𐑈𐑉𐑊𐑋𐑌𐑍𐑎𐑏
Upper: ABCDEFGHIJKLMNOPQRSTUVWXYZΜÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞŸĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮIIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŹŻŽSƂƄƇƋƑǶƘȽȠƠƢƤƧƬƯƳƵƸƼǷDŽDŽLJLJNJNJǍǏǑǓǕǗǙǛƎǞǠǢǤǦǨǪǬǮDZDZǴǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȢȤȦȨȪȬȮȰȲȻƁƆƉƊƏƐƓƔƗƖƜƝƟƦƩƮƱƲƷɁΆΈΉΊΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΣΤΥΦΧΨΩΪΫΌΎΏΒΘΦΠϘϚϜϞϠϢϤϦϨϪϬϮΚΡϹΕϷϺАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸԀԂԄԆԈԊԌԎԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՐՑՒՓՔՕՖḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔṠẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸἈἉἊἋἌἍἎἏἘἙἚἛἜἝἨἩἪἫἬἭἮἯἸἹἺἻἼἽἾἿὈὉὊὋὌὍὙὛὝὟὨὩὪὫὬὭὮὯᾺΆῈΈῊΉῚΊῸΌῪΎῺΏᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾸᾹᾼΙῌῘῙῨῩῬῼⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏⰀⰁⰂⰃⰄⰅⰆⰇⰈⰉⰊⰋⰌⰍⰎⰏⰐⰑⰒⰓⰔⰕⰖⰗⰘⰙⰚⰛⰜⰝⰞⰟⰠⰡⰢⰣⰤⰥⰦⰧⰨⰩⰪⰫⰬⰭⰮⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢႠႡႢႣႤႥႦႧႨႩႪႫႬႭႮႯႰႱႲႳႴႵႶႷႸႹႺႻႼႽႾႿჀჁჂჃჄჅABCDEFGHIJKLMNOPQRSTUVWXYZ𐐀𐐁𐐂𐐃𐐄𐐅𐐆𐐇𐐈𐐉𐐊𐐋𐐌𐐍𐐎𐐏𐐐𐐑𐐒𐐓𐐔𐐕𐐖𐐗𐐘𐐙𐐚𐐛𐐜𐐝𐐞𐐟𐐠𐐡𐐢𐐣𐐤𐐥𐐦𐐧
#!/usr/bin/perl
# Takes a unicode value range (e.g. 20-1ff), and prints out all the lower-case
# characters in that range that have upper-case versions. Note that this isn't
# exactly the same as what you get if you do uc($string)--that also handles
# equivelants that aren't single character, such as ß -> SS, but this does
# not.
use utf8;
use bytes();
use strict;
use warnings;
use Getopt::Long qw(:config gnu_getopt);;
use POSIX qw/ceil/;
use Term::ReadKey qw/GetTerminalSize/;
use Term::ANSIColor;
use Unicode::UCD qw/charinfo/;
use Encode;
Unicode::UCD::openunicode(\my $unicode_fh, 'UnicodeData.txt');
my ($term_cols, $term_lines) = GetTerminalSize();
my $bold = color 'bold';
my $reverse = color 'reverse';
my $underline = color 'underline';
my $reset = color 'reset';
binmode STDOUT, ":utf8";
utf8::is_utf8($_) or utf8::decode($_) or die "Invalid input (not UTF-8): $_\n" for @ARGV;
my $arg = qr{
(
(?:0?x | [Uu]\+?)? [[:xdigit:]]+ (?:[_ ][[:xdigit:]]+)* # Hex, such as: 0x203d, x203d, 0x20_3d, etc. single _'s or spaces [which need to be shell escaped] allowed. Also allows the U+0123 system.
)
|
(
(?:utf)?8\+? [[:xdigit:]]+ (?:[_ ][[:xdigit:]]+)* # Hex such as UTF8+E2_80_BD or 8+42, representing a UTF8 representation
)
|
(
0?b[01]+(?:[_ ][01]+)* # Binary, such as: 0b100000_00111101, b00100000_00111101, b10000000111101, etc. single _'s allowed.
)
|
(.)
}six;
@ARGV = ('00-ff') if not @ARGV;
my $error;
my @chars;
for (@ARGV) {
if (/^($arg)(?:-($arg))?\z/) {
my ($char, $to) = ($1, $6);
push @chars, defined $to
? (codepoint($char) .. codepoint($to))
: codepoint($char);
}
else {
warn "Invalid input: `$_'\n";
$error++;
}
}
if ($error) {
print "\n";
usage();
exit 1;
}
my $lc_chars = '';
my $uc_chars = '';
for (@chars) {
my $uinfo = charinfo($_) or next;
my $upper = $uinfo->{upper} or next;
$lc_chars .= chr;
$uc_chars .= chr(hex $upper);
}
print "Lower: $lc_chars\n";
print "Upper: $uc_chars\n";
sub codepoint {
my $value = shift;
$value =~ /^$arg$/ or die "Cannot compute codepoint of `$value'";
my ($hex, $utf8_code, $bin, $chr) = ($1, $2, $3, $4);
$hex =~ s/^(?:U\+?|0?x)//i if defined $hex;
if (defined $utf8_code) {
$utf8_code =~ s/.*\+//;
(my $c = $utf8_code) =~ y/ _//d;
my $bytes = join '', map bytes::chr(hex $_), split /(?=(?:[[:xdigit:]]{2})+$)/, $c;
utf8::decode($bytes) or die "Invalid input: 'UTF8+$utf8_code' does not appear to be a valid UTF-8 sequence\n";
length($bytes) == 1 or die "Invalid input: 'UTF8+$utf8_code' decodes to multiple characters\n";
return ord $bytes;
}
return defined $hex ? hex($hex) : defined $bin ? oct($bin) : ord $chr
}
sub usage {
print <<USAGE;
Usage: $0 [OPTION]... [FROM-TO]...
Displays UTF-8 characters from FROM to TO, or from each character in the list
provided. FROM, TO and CHARACTER can be hexadecimal numbers (optionally
beginning with 0x or the unicode-style U+ prefix), binary numbers (beginning
with 0b), or single UTF-8 characters. If no FROM-TO pairs nor CHARACTERs are
specified, characters from 0 to 255 are displayed.
The programs output is two lines, one beginning with "Lower: " followed by the
utf8 lower-case characters, the second is "Upper: " followed by the utf8
upper-case characters.
USAGE
}
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment