Building the Federal Multilingual Infrastructure in Unicode: Foreign Language Dictionary Tools
John J. Kovarik, NSA/CSS Senior Language Technology Authority
Project Goals
• Unite federal foreign language analysts in communities of interest by language to increase the speed and accuracy of multilingual work
• Outgrowth of NSA legacy individual foreign language dictionary tools
• Share the Next Generation tool suite across the federal government in 90 languages
Foreign Language Work in the 1970s
• Manual tools
  • Hardcopy dictionaries (2-10 per person)
  • 3x5 card files for specialized vocabulary
  • Pen and paper only
• Work environment
  • Career analysts, revered as subject matter experts, rule the workplace.
  • College graduates hired right out of school, some with military experience, enter the job.
Foreign Language Challenge I: The classic sparse data problem
• Never enough vocabulary
• Never enough grammar training
• Never enough cultural knowledge
Foreign Language Challenge II: Why it’s a sparse data problem
• Communication is usually spontaneous between two or more people who share a great deal of special knowledge
• Ultimate goals often not explicit
• Ambiguity reigns for outsiders
• No simple rules for filling in the blanks
An example: 女人 去 打敲 竹鋼的 密醫 來 解決 她的 問題 。
• All characters glossed (4 min/char, 17 chars), meaning obscure: “Female people go hit knock bamboo curtain’s secret doctor come untie decide her ask issue.”
• All phrases verified (longest string match, 9), clearer but still uncertain: “A woman goes and knocks on the bamboo curtain’s secret doctor to come resolve her problem.”
• Check for a neologism: go to FBIS recent translations and look to clarify the meaning of the new term “knock bamboo curtain”.
• “Knock on the bamboo curtain for a secret doctor” = “seek out an illegal quack”
• “A woman (must) go seek out an illegal quack to resolve her problem.”
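The "longest string match" step can be illustrated with a greedy left-to-right segmenter. This is only a minimal sketch, not the tool's actual algorithm: the function name, the `max_len` cutoff, and the toy lexicon are all assumptions made for illustration.

```python
def segment_longest_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry.

    `max_len` is an assumed cutoff on entry length, not a value from the slides.
    Characters with no multi-character match are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy lexicon; a real dictionary would hold far more entries.
LEXICON = {
    "女人": "woman",
    "密醫": "unlicensed doctor",
    "解決": "resolve",
    "她的": "her",
    "問題": "problem",
}

sentence = "女人去打敲竹鋼的密醫來解決她的問題"
tokens = segment_longest_match(sentence, LEXICON)
for token in tokens:
    print(token, "->", LEXICON.get(token, "(look up character by character)"))
```

Phrases missing from the lexicon fall back to single characters, which is where the analyst still has to check for a neologism by hand.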
People say, “What’s the big deal with just an on-line dictionary?”
• “I never/seldom use a dictionary!”
  • Native speaker syndrome
  • Vast majority of people must use a dictionary in a second/third language
• “Hardcopy dictionaries are better.”
  • Can’t do wild-card searches by hand
  • Not engineered for 10 sec. avg. response
  • Humans tire; machines do not.
1991: First Generation Dictionary DB Tool
• 200,000 entries from 3x5 cards collected over 20 years
• Wild-card searchable
• Cross-referenced 4 ways in accordance with user requirements
• Displayed in native script
• Can cut and paste queries/responses
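As a rough illustration of what "wild-card searchable" means for a dictionary, the sketch below filters headwords with shell-style patterns using Python's standard `fnmatch` module. The entries and the `wildcard_search` helper are hypothetical; the slides do not describe the tool's query syntax.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, entries):
    """Return (headword, gloss) pairs whose headword matches the pattern.

    `*` matches any run of characters and `?` matches exactly one character.
    """
    return [(hw, gloss) for hw, gloss in entries.items() if fnmatchcase(hw, pattern)]


# Hypothetical sample entries, not taken from the actual database.
ENTRIES = {
    "問題": "problem; issue",
    "問答": "question and answer",
    "難題": "difficult problem",
    "解決": "resolve",
}

print(wildcard_search("問*", ENTRIES))  # headwords beginning with 問
print(wildcard_search("*題", ENTRIES))  # headwords ending with 題 (leading wild card)
```

This linear scan checks every entry for each query, which is simple but slow on a large database, especially for patterns with a leading wild card.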
Reactions to 1st Generation Tool
• Younger analysts used it, liked it, and made great suggestions to improve it
• Senior analysts usually would not use it
1995: 2nd Generation Dictionary DB Tool
• Responses faster on queries with leading wild card
• GUI customized per user input
• Candidate entry system established
• Usership doubled!
• Senior analysts start to use it
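The slides do not say how leading wild-card queries were made faster. One common technique, sketched here purely as an assumption, is to keep a second index of reversed headwords so that a query like `*題` becomes a prefix lookup. The `SuffixIndex` class and sample data are hypothetical.

```python
import bisect


class SuffixIndex:
    """Sorted index of reversed headwords, so suffix queries become prefix scans."""

    def __init__(self, headwords):
        self._reversed = sorted(hw[::-1] for hw in headwords)

    def ends_with(self, suffix):
        """Return all headwords ending in `suffix` without scanning every entry."""
        key = suffix[::-1]
        # Binary search for the first reversed headword >= key.
        start = bisect.bisect_left(self._reversed, key)
        results = []
        for rev in self._reversed[start:]:
            if not rev.startswith(key):
                break  # sorted order: no later entry can match
            results.append(rev[::-1])
        return results


index = SuffixIndex(["問題", "難題", "解決", "問答"])
matches = index.ends_with("題")
print(matches)
```

The binary search jumps straight to the first candidate, so the cost depends on the number of matches instead of the size of the whole dictionary.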
1998: 3rd Generation Dictionary DB Tool
• Database re-encoded in UTF-8
• Simultaneous simplified and traditional Chinese display enabled
• Average of 1,000-3,000 candidate entries approved annually, ’98-’02
• Usership again doubled!
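A small sketch of why re-encoding in UTF-8 makes simultaneous simplified and traditional display possible. It assumes, only for illustration, that the legacy data was stored in Big5 (traditional) and GB2312 (simplified); the slides do not name the earlier encodings, and `reencode_to_utf8` is a hypothetical helper.

```python
def reencode_to_utf8(raw, source_encoding):
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Simulated legacy records (assumed encodings, for illustration only).
traditional_big5 = "問題".encode("big5")
simplified_gb = "问题".encode("gb2312")

# Once both are UTF-8, a single record can carry both forms side by side.
record = {
    "traditional": reencode_to_utf8(traditional_big5, "big5"),
    "simplified": reencode_to_utf8(simplified_gb, "gb2312"),
}

line = b" / ".join([record["traditional"], record["simplified"]]).decode("utf-8")
print(line)
```

With one encoding for the whole database, both script forms can share a field, a query, or a display line without switching code pages.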
Today: Wordscape, the Next Generation Dictionary Tool
• Retains all Chinese capabilities
• Expands to all language collections
• Neologism newswire research tools
• Over 90 languages represented in one Unicode DB unified under one XML schema and one suite of tools
• Under LASER ACTD funding, extending all across the federal government!
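To make "one Unicode DB unified under one XML schema" concrete, the sketch below stores entries from two languages in the same structure and serializes them as UTF-8 XML. The element and attribute names (`lexicon`, `entry`, `headword`, `gloss`, `lang`) are invented for illustration and are not the Wordscape schema.

```python
import xml.etree.ElementTree as ET


def make_entry(lang, headword, gloss):
    """Build one dictionary entry; the layout is hypothetical."""
    entry = ET.Element("entry", {"lang": lang})
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    return entry


lexicon = ET.Element("lexicon")
lexicon.append(make_entry("zh-Hant", "問題", "problem"))
lexicon.append(make_entry("hi", "समस्या", "problem"))

# Serialize as UTF-8 bytes, then read the data back with the same code path.
xml_bytes = ET.tostring(lexicon, encoding="utf-8")
parsed = ET.fromstring(xml_bytes)
headwords = {e.get("lang"): e.findtext("headword") for e in parsed.iter("entry")}
print(headwords)
```

Because every language uses the same entry shape and the same encoding, one set of search and display tools can serve all collections.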
Technology and Standards
• New technology being used
  • Benefits of scale from use of UTF-8 and XML
• Standards adopted: leading change
  • Participating in ISO Technical Committee 37 on terminology and language resources (developing standardized formats for foreign language lexical resources and data exchange)
When do Unicode standards fail? When Unicode standards are not standard!
• 3rd World languages less commonly taught in the United States
  • Hindi (many different script rendering implementations)
  • Mongolian (no standardized spelling; many newswire web sites employ non-standard fonts)
Language Knowledge Services Team/Resources
• John L. George, Program Manager, (301) 688-9133
• Over 20 computer scientists/techs
• Currently deploying Beta version
• Learning from testing with earlier version instantiations at FBI and NSA
• On JWICS now; SIPRNet/NIPRNet next
Contact Information
John J. Kovarik
Senior Language Technology Authority
NSA Representative to LASER ACTD
National Security Agency
9800 Savage Road, Suite 6486, S2
Phone: (301) 688-7198
Kovarik@afterlife.ncsc.mil