NLPA Machine Translation

1. more

1.1. controlled language

1.2. phrase-based statistical translation

2. software and projects

2.1. online

2.1.1. Google Translate (statistical)

2.1.2. Systran (rule-based)

2.1.3. UI: Android and browser extensions

2.2. systems

2.2.1. Giza++

2.2.1.1. open source statistical machine translation

2.2.1.2. Och

2.2.1.3. IBM Models 1-5, HMM word alignment
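
The IBM models that Giza++ implements are trained with EM. A minimal sketch of IBM Model 1 on a toy corpus (hypothetical illustration, not Giza++ code) could look like:

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(f|e).
    corpus: list of (source_sentence_words, target_sentence_words) pairs."""
    # Uniform start: the E-step normalizes, so any constant works for Model 1,
    # which converges to its global optimum regardless of initialization.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # normalizer per target-side word e
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # distribute the observation of f over all candidate alignments
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

On a toy parallel corpus such as ("das haus" / "the house", "das buch" / "the book", "ein buch" / "a book"), the learned table concentrates probability on the correct word pairs ("das"–"the", "buch"–"book") after a few iterations.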

2.2.2. Jane

2.2.2.1. Aachen (Ney)

2.2.2.2. phrase-based statistical machine translation

2.2.3. Joshua

2.2.3.1. Johns Hopkins

2.2.3.2. synchronous context free grammars

2.2.3.2.1. chart parsing

2.2.3.2.2. n-gram language models

2.2.3.2.3. beam / cube pruning

2.2.3.2.4. k-best extraction

2.2.3.2.5. suffix-array grammar extraction

2.2.3.2.6. minimum error rate training

2.2.4. MOSES

2.2.4.1. EU project

2.2.4.2. phrase-based translation

2.2.5. GALE

2.2.5.1. DARPA program (military, intelligence applications)

2.3. evaluation

2.3.1. BLEU score

2.3.2. BLEU = bilingual evaluation understudy

2.3.3. high correlation with human judgments of quality

2.3.4. BLEU scores are between 0 and 1

2.3.5. calculate

2.3.5.1. compute precision for words/n-grams between machine translation and human translation

2.3.5.2. modified precision score (clipped etc.)

2.3.5.3. calculate modified precision for n-grams

2.3.5.4. n=4 correlates well with human quality judgments
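
The calculation steps above can be sketched directly. This is a simplified sentence-level BLEU (real BLEU is computed at corpus level, with the brevity penalty over aggregate lengths):

```python
from collections import Counter
import math

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference translations."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    if not cand:
        return 0.0
    # clip each candidate n-gram count at its maximum count in any single reference
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for ng, c in ref_counts.items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of modified precisions for n = 1..max_n, times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # brevity penalty: penalize candidates shorter than the closest-length reference
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

The clipping step is what makes the precision "modified": a candidate that repeats "the" many times only gets credit for as many occurrences as appear in some reference.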

2.4. data sets

2.4.1. LDC (linguistic data consortium)

2.4.2. EUROPARL

2.4.2.1. European parliament translations

2.4.2.2. aligned sentences

2.5. should you work on it?

2.5.1. +

2.5.1.1. interesting and easy to understand problem

2.5.1.2. great test case for machine learning

2.5.1.3. lots of ideas that haven't been explored/tested

2.5.1.4. fundamental AI questions about learning language, semantics, etc.

2.5.2. -

2.5.2.1. problem is ill-defined

2.5.2.2. practical success may depend less on raw translation quality than on deploying translation well: controlled language, interactive feedback, etc.

2.5.2.3. competing against people with lots of resources

2.5.2.4. current approaches have nothing to do with AI

2.5.2.5. scoring / evaluation in the community is arbitrary and wouldn't catch the interesting improvements

2.5.3. =

2.5.3.1. may be better to work on NLU and text generation separately

2.5.3.2. pick an interesting application like gaming (dormant for years, but imagine being able to talk to your NPCs)

2.5.3.3. or pick a well-defined sub-topic: better tools (including statistical learning) for controlled languages, camera-based translation, etc.

2.5.3.4. lots of smaller topics: POS tagging as feature functions, topic modeling, neural network language modeling, etc.

2.5.3.5. E-prime enforcer?

3. approaches

3.1. rule-based

3.1.1. approach

3.1.1.1. write down rules for mapping source sentences into target sentences

3.1.1.2. use a dictionary to translate specific words

3.1.1.3. use a morphological analyzer to translate word forms

3.1.2. properties

3.1.2.1. creation requires a combination of programming and linguistic expertise, but little data

3.1.2.2. efficient and easy to understand

3.1.2.3. doesn't work very well, unfortunately

3.1.2.4. doesn't scale up, hard to adapt to new languages and domains

3.2. statistical

3.2.1. approach

3.2.1.1. treat language as a sequence of words with statistical relationships (HMMs, etc.)

3.2.1.2. model the statistical relationship between input and output languages

3.2.1.3. generally, Bayesian approach

3.2.1.4. P( English sentence | German sentence)

3.2.1.5. try to factor this probability in some way

3.2.1.6. build models by learning from large corpora
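
The factoring in 3.2.1.4–3.2.1.5 is the classic noisy-channel decomposition; for German-to-English it reads:

```latex
\hat{E} = \arg\max_{E} P(E \mid G)
        = \arg\max_{E} \frac{P(G \mid E)\, P(E)}{P(G)}
        = \arg\max_{E} \underbrace{P(G \mid E)}_{\text{translation model}} \;
                        \underbrace{P(E)}_{\text{language model}}
```

The denominator P(G) is constant over candidate English sentences, so decoding only needs the translation model (learned from aligned bilingual corpora) and the language model (learned from monolingual English text).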

3.2.2. properties

3.2.2.1. requires some kind of performance measure

3.2.2.2. can take advantage of large amounts of online text

3.2.2.3. requires large amounts of training data

3.2.2.4. can be retargeted to different languages with new training data

3.2.2.5. has difficulties with syntactically dissimilar languages

3.2.2.6. has no deep understanding of what it is translating

3.3. interlingual machine translation

3.3.1. approach

3.3.1.1. analyze the source text and translate it into an intermediate language

3.3.1.2. translate the intermediate language into the target

3.3.1.3. the intermediate is an actual natural language (possibly constructed)

3.3.2. properties

3.3.2.1. reduces the N² problem (a system per ordered language pair) to a linear one (a translator to and from the interlingua per language)
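
To make the counting concrete (a small illustrative sketch; the exact "N+1" accounting depends on what you count as a system):

```python
# Translation systems needed for N languages:
#   pairwise:    one system per ordered language pair
#   interlingua: one analyzer (into) and one generator (out of) per language
def systems_needed(n):
    pairwise = n * (n - 1)
    interlingua = 2 * n
    return pairwise, interlingua

# For the EU's 24 official languages:
print(systems_needed(24))  # (552, 48)
```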

3.3.2.2. seemingly attractive but not widely used

3.3.2.3. people love designing the intermediate language

3.3.2.4. you now face word-sense (synset) mismatches twice: source-to-interlingua and interlingua-to-target

3.4. transfer-based

3.4.1. approach

3.4.1.1. analyze the source text and translate it into an intermediate representation

3.4.1.2. translate the intermediate representation into an output representation

3.4.1.3. intermediate representation

3.4.1.3.1. shallow / syntactic

3.4.1.3.2. deep / semantic

3.4.2. ultimately

3.4.2.1. natural language understanding

3.4.2.2. text generation

3.4.3. properties

3.4.3.1. usually rule-based (could be statistical, but that hasn't been explored much)

3.4.3.2. statistical parsers, word sense disambiguation, tagging, etc. slowly being incorporated

4. challenges

4.1. word ambiguities

4.1.1. e.g. "borrow / lend" vs. "leihen"

4.2. syntactic ambiguities

4.2.1. Mary cuts the card with the code.

4.2.2. Mary cuts the card with the scissors.

4.3. anaphora

4.3.1. John sees Jack. His glasses really help.

4.3.2. John sees Jack. He appears to be walking down the street.

4.4. subtle meaning

4.4.1. John is a figurehead for the organization.

4.4.2. John is a representative for the organization.

5. vision

5.1. replace human translators

5.2. kinds

5.2.1. text-to-text

5.2.1.1. technical documents, manuals, web pages, education, medical

5.2.2. speech-to-speech

5.2.2.1. travel, military, conferences, medical, business

5.2.3. image-to-text

5.2.3.1. travel, military

5.2.4. speech-to-signed

5.2.4.1. accessibility

5.2.5. speech-to-neural

5.2.5.1. Star Trek universal translator

5.3. current practice

5.3.1. simultaneous interpretation

5.3.1.1. origin at the Nuremberg trials

5.3.1.2. widely used speech-to-speech

5.3.2. offline translation

5.3.3. artistic recreation (literary translation)

6. existing automatic systems

6.1. image-to-text apps for phones

6.2. speech-to-speech app for phones

6.3. word translation popups for browsers

6.4. text-to-text for web pages (browser plug-in or bookmarklet)

6.5. text-to-text using computer/human combo

6.5.1. full pages

6.5.2. splitting up and recombining

7. utility and simplifications

7.1. image-to-text

7.1.1. foreign scripts are hard even to look up if you don't know them

7.1.2. just words and translations are very useful

7.2. word-by-word translations

7.2.1. can get a sense of what something is about just from the words, without syntax

7.2.2. fairly easy, but still runs into the problem of ambiguities

7.3. translations in an interactive context (e.g., travel)

7.3.1. travel translation and similar settings often allow for feedback

7.3.2. yes/no questions, please point to...

7.3.3. pictionaries, visual feedback

7.4. translations of legal and technical documents

7.4.1. high accuracy required

7.4.2. highly specialized and knowledgeable translators needed

7.4.3. may benefit from controlled language

7.5. artistic translations

7.5.1. often more recreations of the original work "in the spirit of"

7.5.1.1. Now is the winter of our discontent Made glorious summer by this son of York

7.5.1.1.1. Shakespeare (Richard the Third)

7.5.1.2. Nun ward der Winter unsers Mißvergnügens Glorreicher Sommer durch die Sonne Yorks

7.5.1.2.1. Schlegel

7.5.1.3. sun / son