Language Data, Computer & Keyboard Language
First of all, we have Steve Jobs to thank for the fonts and the look of type on the personal computer, which supports the look of language.
21ST CENTURY LINGUISTIC RIGHTS
Since its inception, the Internet's domain-name system has made a point of accommodating only English-language characters. That provision has helped streamline the engineering of the Web, but according to delegates at a recent United Nations summit, it has left speakers of Russian, Arabic, Lao and the like out in the cold. At the meeting, held in Athens, speakers argued that the Web's apparent love of English has marginalized many surfers in developing nations. "I think the digital divide is not as important as the linguistic divide," said Adama Samassékou, president of the African Academy of Languages. Help for non-English speakers may be on the way: Although it took years of work, Web browsers including Mozilla's Firefox and Microsoft's Internet Explorer now support characters from other languages.
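One way a non-English name can coexist with an ASCII-only domain-name system is to convert it to an ASCII-compatible form before lookup. The sketch below uses Python's built-in "idna" codec to show the idea. The domain bücher.example is a made-up illustration, the helper names are hypothetical, and the article does not say which mechanism the browsers it mentions actually use.

```python
def to_ascii_domain(name: str) -> str:
    """Convert a domain name to an ASCII-compatible form."""
    # Python's built-in "idna" codec encodes each non-ASCII label
    # as an "xn--" label and leaves plain ASCII labels unchanged.
    return name.encode("idna").decode("ascii")


def to_unicode_domain(name: str) -> str:
    """Convert an ASCII-compatible domain name back to its readable form."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_form = to_ascii_domain("bücher.example")
    print(ascii_form)
    print(to_unicode_domain(ascii_form))
```

The name that travels through the domain-name system stays within English letters, digits and hyphens, while the reader still sees the name in its own script.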
Expert Jeff Allen - Haitian Creole Language Technologies - Language Data Distribution
dotSUB - Any Video Any Language
Multilingual Translation System Receives Over 2 Million Euro in EU Funding
All citizens, regardless of native tongue, shall have the same
access to knowledge on the Internet. The MOLTO project, coordinated
by University of Gothenburg, Sweden, receives more than 2 million
euro in project support from the EU to create a reliable translation
tool that covers a majority of the EU languages.
'It has so far been impossible to produce a translation tool that
covers entire languages,' says Aarne Ranta, professor at the
Department of Computer Science and Engineering at the University of
Gothenburg, Sweden.
Google Translate is a widely used translation programme that
gradually improves the quality of translations through machine
learning - the system learns from its own mistakes via system
feedback, but tries to do without explicit grammatical rules.
In contrast, MOLTO is being developed in the opposite direction,
meaning it begins with precision and grammar, while wide coverage
comes later. We wanted to work with a translation technique that is
so accurate that people who produce texts can use our translations
directly. We have now started to move from precision to increased
coverage, meaning that we have started to add more languages to the
tool and database.
Professor Ranta is the coordinator of the MOLTO (Multilingual
On-Line Translation) project, which includes three universities and
two companies. The project is to receive 25 million SEK (2.375 million euro)
in EU funding over three years. The grant falls in the Machine
Translation category, and one requirement has been that the system
be developed to include a majority of EU's official languages.
The technique used in MOLTO is based on type theory, just like the
technique used by Professor Thierry Coquand when introducing
mathematical formulas into computer software. In Coquand's project,
type theory serves as a bridge between programming language and
mathematics, while in MOLTO it is used to bridge natural languages.
The advantage of type theory is that each 'type' expresses content
in a language-independent manner. This feature is used in speech
technology to transfer meaning from one human language to another.
It is time-consuming to implement the system. First, all words
needed for the field of application must be inserted in the language
database. Each word is then provided with a type that indicates all
possible meanings of the word. Finally, the grammar needs to be
defined. At this point, the system needs to be told all the possible
combinations of different types, which alternative expressions there
are, in which forms the words can occur and how they should be
ordered.
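The steps described above (a word list, a type for each word, and rules for forms and ordering) can be illustrated with a toy example. This is a minimal sketch and not MOLTO's actual code: the two-word vocabulary, the names Statement, parse and translate, and the choice of English and French are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product

# Language-independent part: constants grouped by type.
ITEMS = ("Wine", "Pizza")
QUALITIES = ("Good", "Expensive")


@dataclass(frozen=True)
class Statement:
    """Abstract meaning of 'this <item> is <quality>' in any language."""
    item: str
    quality: str


# English: each word has a single form.
ENGLISH_WORDS = {"Wine": "wine", "Pizza": "pizza",
                 "Good": "good", "Expensive": "expensive"}

# French: nouns carry a gender and adjectives must agree with it.
FRENCH_NOUNS = {"Wine": ("vin", "m"), "Pizza": ("pizza", "f")}
FRENCH_ADJECTIVES = {"Good": {"m": "bon", "f": "bonne"},
                     "Expensive": {"m": "cher", "f": "chère"}}


def linearize_english(tree: Statement) -> str:
    return f"this {ENGLISH_WORDS[tree.item]} is {ENGLISH_WORDS[tree.quality]}"


def linearize_french(tree: Statement) -> str:
    noun, gender = FRENCH_NOUNS[tree.item]
    determiner = "ce" if gender == "m" else "cette"
    adjective = FRENCH_ADJECTIVES[tree.quality][gender]
    return f"{determiner} {noun} est {adjective}"


LINEARIZERS = {"eng": linearize_english, "fre": linearize_french}


def parse(sentence: str, language: str) -> Statement:
    """Find the abstract meaning whose rendering equals the sentence."""
    for item, quality in product(ITEMS, QUALITIES):
        tree = Statement(item, quality)
        if LINEARIZERS[language](tree) == sentence:
            return tree
    raise ValueError(f"not covered by the {language} grammar: {sentence!r}")


def translate(sentence: str, source: str, target: str) -> str:
    """Translate by going through the language-independent meaning."""
    return LINEARIZERS[target](parse(sentence, source))


if __name__ == "__main__":
    print(translate("this pizza is expensive", "eng", "fre"))
    print(translate("ce vin est bon", "fre", "eng"))
```

Because every sentence passes through the same abstract meaning, adding a language means writing one more set of word forms and ordering rules rather than one translator per language pair. The price is coverage: a sentence outside the grammar is rejected rather than guessed at.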
The database containing the grammar is called 'resource grammar',
and the idea is to make it very easy for a user to extend the
grammatical content and add new words. One of the main ideas of the
project is that it is open source, meaning that the software shall
be accessible to all.
'The purpose of the EU grant is to enable us to use the MOLTO
technology to create a system that can be used for translation on
the Internet', says Ranta. 'The plan is that producers of web pages
should be able to freely download the tool and translate texts into
several languages simultaneously. Although the technology does exist
already, it is quite cumbersome to use unless you are a computer
scientist. In a nutshell, the EU gives us money to modify the tool
and make it user friendly for a large number of users.'
The project aims at developing the system to suit different areas of
applications. One area is translation of patent descriptions.
Ultimately, people around the world should be able to take advantage
of new technology immediately without having to master the language
in which the patent description is written. A large number of
translators have long had to be engaged in connection with new
patents. Another sub-project aims at meeting the needs of
mathematicians for a precise terminology for translation of
mathematical teaching material, and then there is one sub-project
that concerns descriptions of cultural heritage and museum objects,
with a goal that anybody should be able to access these descriptions
regardless of native tongue.
First Nations gain entree to electronic age
by David Akin - The Hamilton Spectator
http://www.southam.com/calgaryherald/cgi/newsnow.pl?nkey=ch&file=/business/Technology/970922/t0922mt10.html
[ ... English is the lingua franca of the world's software
developers and hardware manufacturers. The core code that runs most
of the world's computing devices was written in English, then
translated into the ones and zeroes that machines can understand.
Which means wherever you want to go today using your computer, you
will likely need to be able to speak and understand English. In
Canada, of course, no manufacturer would be so brazen as to make
something that could operate in only one of our official languages.
Yet, just a decade ago, a French-speaking Quebecois living in
Chicoutimi had to use the English accentless alphabet when sending
e-mail to another French speaker in Trois Rivieres because the only
e-mail programs in existence were written by English-speaking --
usually American -- developers who never thought about incorporating
communication capabilities for those who use other alphabets.
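The accentless workaround described above can be reproduced in a few lines. This is a minimal sketch, assuming the old e-mail programs were limited to the 7-bit ASCII character set; the function names are hypothetical.

```python
import unicodedata


def fits_ascii(text: str) -> bool:
    """Return True if every character belongs to the ASCII set."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def strip_accents(text: str) -> str:
    """Drop accents the way an accentless e-mail writer had to."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    city = "Trois-Rivières"
    print(fits_ascii(city))
    print(strip_accents(city))
```

Stripping accents loses information: once "Rivières" has become "Rivieres", the receiving program has no way to restore the original spelling.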
FRENCH REPRESENTATION
Today, though, most popular software can represent French
characters. But translating a software product from English to
French is not as simple as running sub-titles through a movie or
re-publishing a book. That's because the basic input device for a
computer -- the keyboard -- has been designed and built for people
who use the English alphabet. The French alphabet, of course,
includes more possibilities than the English. There is c and then
there is ç, for instance. Or e and é and even è.
Still, French characters, based as they are on the Latin alphabet,
were close enough to the basic English alphabet that inclusion in
new international standards was easy and quick.
But those who use an alphabet that doesn't rely on Latin letters --
Arabs, Greeks, Russians, and Chinese, to name a few -- can still
come across Internet documents and software programs that require
not only knowledge of a language they don't know but also an
alphabet they've never used.
When Western Internet enthusiasts rave about the ability of
telecommunications to unite the world in one global village, people
of many non-Western cultures fail to see why they should rejoice in
a communications system that marginalizes their language by forcing
them into a homogenous English-only global village.
As a result, the rather narrow, technical issue of incorporating new
computer characters into the machine language computers can
understand has become a highly politicized issue in Canada and
around the world.
PUSH IS ON
Now, the push is on to bring the world's and Canada's aboriginal
cultures into the electronic age, taking what are, in many cases,
societies that were marginalized by an aggressive, dominant white
culture during pre-industrial and industrial times, and giving them a
prominent, participatory role in the new post-industrial digital age.
"It's a form of democratization. It allows smaller groups a voice
at a lot of different levels,"
said educational consultant Dirk Vermeulen.
Vermeulen, who lives in Beamsville and works out of an office in the
back of a native art gallery in Jordan, has developed curriculum and
curriculum materials for Arctic boards of education since the 1970s.
And, just as southern Canadian boards of education are trying to put
more computers in the classroom, so too, are Arctic boards. Most
computers, though, cannot support the phonetic syllabic characters
used to represent Inuktitut in written language.
"We said, well, hold on, if you're going to allow computers into
these schools, we have to make sure they'll work not just in English
but also in Inuktitut and in French, so we went to work at that
point to try and establish the ability of computers to be able to
handle those various scripts.
NOTHING INTERCHANGEABLE
"We quickly found that a lot of other native groups across Canada
that were using syllabics were doing the same thing, but that none
of the data was interchangeable. Everybody had their own method and
their own solution to the problem," Vermeulen said.
In 1992, Industry Canada, with the urging of Canadian aboriginal
groups, called on Vermeulen and others to form the Canadian
Aboriginal Syllabics Encoding Committee, to come up with a proposed
standard for including Canadian aboriginal syllabics into computer
character sets that could be adopted by the International
Organization for Standardization or ISO.
"
The native cultures, at this point, are very ready to take
control as to where their languages or culture is going
,"
said Vermeulen in a recent telephone interview.
Through the Canadian Standards Association, Vermeulen's committee
submitted that standard June 10 to the ISO. The ISO's global
membership has voted in favour of the new standard three times since
then. The fourth and final vote on the standard is expected some
time in the spring.
If the ISO agrees to include Canadian aboriginal syllabics in the
standard, computer manufacturers from California to Singapore will
begin making computers that support that language.
"It doesn't mean they have to make fonts for it, but what it does
mean is that if you buy a font, any computer that you have you will
be able to process syllabics without any problem," said Michael
Everson said.
Everson, born in Arizona but now living in Ireland, is one of
Vermeulen's colleague's on
CASEC
.
The language standard used by computers is known as the Universal
Multiple-Octet Coded Character Set. This set contains 64,000
characters that a computer can be made to understand. So far, though,
just 29,000 characters have been assigned a spot in that set.
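The figures above can be checked with a little arithmetic. The sketch assumes that the 64,000 figure is a rounding of the two-octet (16-bit) form of the character set, which has 65,536 positions; the count of 29,000 assigned characters is the article's own number, and the function names are hypothetical.

```python
BITS_PER_OCTET = 8


def code_positions(octets: int) -> int:
    """Number of distinct values that fit in the given number of octets."""
    return 2 ** (BITS_PER_OCTET * octets)


def assigned_share(assigned: int, octets: int) -> float:
    """Fraction of the available positions that are already in use."""
    return assigned / code_positions(octets)


if __name__ == "__main__":
    # Assumption: two octets per character, as described in the lead-in.
    print(code_positions(2))
    print(f"{assigned_share(29_000, 2):.0%}")
```

On these assumptions a little under half of the positions were taken, which is why there was still room for new scripts.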
Those characters include, for instance, the English alphabet -- in
both capital and small letters -- as well as special characters
such as tildes ( ~ ) or curly brackets { }.
The characters that have already been incorporated in the approved
set also include many characters from Japanese, Chinese, Korean,
Arabic, Hebrew and East Indian alphabets.
The ISO may also soon consider proposals to include important
historical alphabets such as ancient Egyptian hieroglyphics as part
of the approved coded character set.
The computer character set is crucial if people who use writing
systems different from the English alphabet are to communicate in
their own language using modern telecommunications technologies.
"It equalizes a lot of situations," Vermeulen said. "I think that's
very useful and very good. I really stand behind that. What's
interesting in many ways is that the native cultures are at this
point very ready to take control as to where their cultures are
going and where their languages are going."
Setting a standard for which languages computer products will
support is not, to be clear, a matter of translation.
A computer that supports different character sets cannot translate
between languages.
In other words, if an English-speaker types in the word 'Igloo', it
does not show up on the computer screen of an Inuktitut-speaker in
the Canadian Syllabic characters for igloo.
What does happen is that when an English-speaker types i-g-l-o-o,
the computer is programmed to understand that English word in its
hexadecimal numeric language as 0069 0067 006C 006F 006F and act
upon that word.
The proposed new standard would see computer manufacturers assign
the hexadecimal string 1403 14A1 14D7 to the Inuktitut syllabic
symbols for igloo or house.
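The hexadecimal strings quoted above can be reproduced directly. This is a minimal sketch using Python's built-in Unicode support; the function names are hypothetical, and whether the syllabics display correctly depends on the fonts installed on the machine.

```python
import unicodedata


def to_hex_codes(text: str) -> str:
    """Return the four-digit hexadecimal code of each character."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def from_hex_codes(codes: str) -> str:
    """Build a string from space-separated hexadecimal codes."""
    return "".join(chr(int(code, 16)) for code in codes.split())


if __name__ == "__main__":
    print(to_hex_codes("igloo"))

    syllabics = from_hex_codes("1403 14A1 14D7")
    print(syllabics)
    for ch in syllabics:
        # Look up the standard's name for each character, if it has one.
        print(f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

The two strings are unrelated as far as the computer is concerned: nothing in the character set links the English spelling to the syllabic one, which is the article's point that encoding is not translation.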
ENABLING TOOL
The proposed new standard would be an enabling tool, allowing people
to use their own writing systems in digital communications.
"We've been trying to allow the language room to be used in a
variety of situations, including offices and governmental situations
and whatever else, in order to broaden that base of the use of the
language," said Vermeulen.
"
I think Nunavut is a big deal
," said Everson in a telephone interview from his office in Dublin.
Dublin is the home base for Everson Gunn Teoranta, his firm that
'localizes' or re-writes computer software in minority languages
such as
Gaelic
.
"
Nunavut
is really remarkable and amazing and it's going to change things.
These people are getting their own state," Everson said.
"The fact that they're getting their own state is giving them the
impetus to make some amazing technological jumps."
New communications technologies also give the newly empowered state
of Nunavut the ability to better control and direct the education of
its young people, Vermeulen said.
"While there are a lot of pressures on the language from the English
and the French media in Canada, the larger (
aboriginal
) groups are able to actually take advantage of the various media
and promote their language.
"We hope that by including the writing systems into the modern
technologies and into the modern standard it will do two things. One
is that it'll allow people to use these technologies to promote
their own language in whatever way they feel fit.
"The second thing is that it provides
international recognition for those writing systems.
In doing so, nobody can deny them the right to exist. That's a very
important issue politically," Vermeulen said.
Nunavut comes into being April 1, 1999, when the Northwest
Territories is divided, roughly along the tree line, into Nunavut
and a western territory.
Communication technologies could play an important role in Nunavut's
development if only because it, like Canada, must meet the
challenges of serving a tiny population spread over a wide area.
Nunavut encompasses an area more than five times the size of
Germany, yet it has just 20 kilometres of roads. Its 26
settlements are spread across three time zones. CASEC expects the
ISO will formally adopt Canadian Syllabics into the standard some
time in the spring.
The hard work, though, has just begun as Inuktitut speakers take
English versions of popular software packages and re-write them
using the complex and different Inuktitut grammar, syntax, and
alphabet.
The Baffin Divisional Board of Education is already localizing
Macintosh operating system 7.5 to be able to use Canadian syllabic
characters.
"I've looked at the grammar of this and it is a language from hell,"
said Everson, sizing up the job of turning
Apple's elegant English computer code into the phonetic symbols of
written Inuktitut.
"I don't know how this poor woman is doing the translations of this
technical vocabulary into this amazing language. It's a wonderful,
wonderful language, but it is not like English, I'll tell you that."
CASEC estimates that there are about 200,000 people in Canada's
north who use the syllabics system to express themselves in
written form. Most of those people are Cree and some Dene people
who live in Canada's eastern Arctic.
Ironically, the language of those Arctic dwellers had no written
form until Methodist missionaries visited them in the 1830s. Now,
just 160 years after the language first found its way onto
parchment, it is being digitized.
The Methodist missionaries took the oral culture of the Cree and
Dene and imposed a written vocabulary using French shorthand
symbols. Since those first early efforts, syllabic character shapes
have been added to the 'alphabet' while existing ones have been
modified. The approved computer standard set already incorporates
the syllabic characters for several Algonkian and Athapaskan
languages.
"We find a lot of these languages actually strengthening,"
Vermeulen.
"The existence of the phonetic syllabic characters is credited with
helping to sustain and strengthen native culture, by making it easy
for users to read, write and publish in their own language."
RELATED WEB SITES
The Canadian Standards Association submission to the International
Organization for Standardization is titled Proposed pDAM for Unified
Canadian Aboriginal Syllabics. You can find it at:
http://www.evertype.com/standards/sl/n1441-en.html
Universal Declaration of Linguistic Rights -- a statement that argues for the protection and encouragement of minority languages is at www.indigo.ie/egt/udhr/udlr-en.html