A Brief History of Natural Language Processing (NLP)
In the early 1900s, a Swiss linguistics professor named Ferdinand de Saussure died, and in the process, almost deprived the world of the concept of “Language as a Science.” From 1906 to 1911, Professor Saussure offered three courses at the University of Geneva, where he developed an approach describing languages as “systems.” Within a language, a sound represents a concept – a concept that shifts meaning as the context changes.
He argued that meaning is created inside language, in the relations and differences between its parts; a shared language system is what makes communication possible. Saussure viewed society as a system of “shared” social norms that provides the conditions for reasonable, “extended” thinking, resulting in decisions and actions by individuals. (The same view can be applied to modern computer languages.)
Saussure died in 1913, but two of his colleagues, Albert Sechehaye and Charles Bally, recognized the importance of his concepts. (Imagine the two, days after Saussure’s death, in Bally’s office, drinking coffee and wondering how to keep his discoveries from being lost forever.) The two took the unusual step of collecting “his notes for a manuscript,” along with his students’ notes from the courses. From these, they wrote the Cours de Linguistique Générale, published in 1916. The book laid the foundation for what has come to be called the structuralist approach, starting with linguistics and later expanding to other fields, including computers.
In 1950, Alan Turing wrote a paper describing a test for a “thinking” machine. He stated that if a machine could be part of a conversation through the use of a teleprinter, and it imitated a human so completely there were no noticeable differences, then the machine could be considered capable of thinking. Shortly after this, in 1952, the Hodgkin-Huxley model showed how the brain uses neurons in forming an electrical network. These events helped inspire the idea of Artificial Intelligence (AI), Natural Language Processing (NLP), and the evolution of computers.
Natural Language Processing
Natural Language Processing (NLP) is an aspect of Artificial Intelligence that helps computers understand, interpret, and utilize human languages. NLP allows computers to communicate with people, using a human language. Natural Language Processing also provides computers with the ability to read text, hear speech, and interpret it. NLP draws from several disciplines, including computational linguistics and computer science, as it attempts to close the gap between human and computer communications.
Generally speaking, NLP breaks language down into shorter, more basic pieces, called tokens (words, periods, etc.), and attempts to understand the relationships between those tokens (a minimal tokenizer sketch follows the list below). This process often uses higher-level NLP features, such as:
- Content Categorization: A linguistics-based document summary, including content alerts, duplication detection, search, and indexing.
- Topic Discovery and Modeling: Captures the themes and meanings of text collections, and applies advanced analytics to the text.
- Contextual Extraction: Automatically pulls structured data from text-based sources.
- Sentiment Analysis: Identifies the general mood, or subjective opinions, stored in large amounts of text. Useful for opinion mining.
- Text-to-Speech and Speech-to-Text Conversion: Transforms voice commands into text, and vice versa.
- Document Summarization: Automatically creates a synopsis, condensing large amounts of text.
- Machine Translation: Automatically translates the text or speech of one language into another.
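To make the idea of tokens concrete, here is a minimal sketch of a rule-based tokenizer in Python. The regular expression and the sample sentence are illustrative assumptions; production systems (NLTK, spaCy, and similar toolkits) use far more sophisticated rules.

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens.

    A deliberately simple, rule-based sketch: \\w+ matches runs of
    word characters, and [^\\w\\s] matches single punctuation marks
    such as periods and commas.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP breaks language into tokens."))
# -> ['NLP', 'breaks', 'language', 'into', 'tokens', '.']
```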
NLP Begins and Stops
Noam Chomsky published his book, Syntactic Structures, in 1957. In it, he revolutionized previous linguistic concepts, concluding that for a computer to understand a language, the sentence structure would have to be changed. With this as his goal, Chomsky created a style of grammar called Phrase-Structure Grammar, which methodically translated natural language sentences into a format that is usable by computers. (The overall goal was to create a computer capable of imitating the human brain, in terms of thinking and communicating, or AI.)
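As a rough illustration of the idea, the sketch below encodes a toy phrase-structure grammar as rewrite rules and expands them into a sentence. The rules and vocabulary are invented for this example, not taken from Chomsky's book.

```python
import random

# A toy phrase-structure grammar: each rule rewrites a symbol into a
# sequence of symbols, until only terminal words remain.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["linguist"], ["machine"]],
    "V":   [["studies"], ["imitates"]],
}

def expand(symbol):
    """Recursively apply rewrite rules, starting from a symbol."""
    if symbol not in GRAMMAR:            # terminal: an actual word
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part))
    return words

print(" ".join(expand("S")))  # e.g. "the machine imitates a linguist"
```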
In 1958, John McCarthy released the programming language LISP (short for “LISt Processing”), a computer language still in use today. In 1964, ELIZA was developed: a “typewritten” comment-and-response program designed to imitate a psychiatrist using reflection techniques. (It did this by rearranging sentences and following relatively simple grammar rules; there was no understanding on the computer’s part.) Also in 1964, the U.S. National Research Council (NRC) created the Automatic Language Processing Advisory Committee, or ALPAC for short. This committee was tasked with evaluating the progress of Natural Language Processing research.
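The sentence-rearranging trick is easy to approximate. The sketch below is a hypothetical, pared-down imitation of ELIZA's reflection technique: it swaps pronouns and wraps the user's own words in a question. The rules shown here are invented for illustration; Joseph Weizenbaum's actual script was much larger, though just as mechanical.

```python
import re

# Invented, pared-down reflection rules in the spirit of ELIZA.
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}

def reflect(sentence):
    """Swap first-person words for second-person ones."""
    words = sentence.lower().rstrip(".!?").split()
    return " ".join(REFLECTIONS.get(w, w) for w in words)

def eliza_reply(sentence):
    """Rearrange the user's sentence into a canned question."""
    match = re.match(r"i feel (.*)", sentence.lower().rstrip(".!?"))
    if match:
        return f"Why do you feel {reflect(match.group(1))}?"
    return f"You say: {reflect(sentence)}. Tell me more."

print(eliza_reply("I feel my work is ignored."))
# -> Why do you feel your work is ignored?
```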
In 1966, the NRC and ALPAC initiated the first AI and NLP stoppage by halting the funding of research on Natural Language Processing and machine translation. After twelve years of research and $20 million, machine translations were still more expensive than manual human translations, and there were still no computers that came anywhere near being able to carry on a basic conversation. By 1966, Artificial Intelligence and Natural Language Processing (NLP) research was considered a dead end by many (though not all).
Return of the NLP
It took nearly fourteen years (until 1980) for Natural Language Processing and Artificial Intelligence research to recover from the broken expectations created by extreme enthusiasts. In some ways, the AI stoppage had initiated a new phase of fresh ideas, with earlier concepts of machine translation being abandoned and new ideas promoting new research, including expert systems. The mixing of linguistics and statistics, which had been popular in early NLP research, was replaced with a theme of pure statistics. The 1980s initiated a fundamental reorientation, with simple approximations replacing deep analysis, and the evaluation process becoming more rigorous.
Until the 1980s, the majority of NLP systems used complex, “handwritten” rules. But in the late 1980s, a revolution in NLP came about, the result of both the steady increase of computational power and the shift to Machine Learning algorithms. While some of the early Machine Learning algorithms (decision trees provide a good example) produced systems similar to the old-school handwritten rules, research increasingly focused on statistical models. These statistical models are capable of making soft, probabilistic decisions. Throughout the 1980s, IBM was responsible for the development of several successful, complicated statistical models.
In the 1990s, the popularity of statistical models for Natural Language Processing analyses rose dramatically. Pure-statistics NLP methods became remarkably valuable in keeping pace with the tremendous flow of online text. N-grams became useful for recognizing and tracking clumps of linguistic data numerically (a short sketch follows this paragraph). In 1997, LSTM recurrent neural net (RNN) models were introduced, and found their niche in 2007 for voice and text processing. Currently, neural net models are considered the cutting edge of research and development in NLP’s understanding of text and speech generation.
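Here is a minimal sketch of what n-gram counting looks like in practice, assuming whitespace tokenization and an invented sample sentence. Counts of adjacent word pairs (bigrams) are exactly the kind of “clumps of linguistic data” that a statistical model can turn into probabilities.

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of length n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog chased the cat and the dog barked".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))   # -> [(('the', 'dog'), 2)]

# A bigram language model estimates P(next word | current word)
# directly from such counts:
p_dog_given_the = bigram_counts[("the", "dog")] / tokens.count("the")
print(p_dog_given_the)  # -> 2/3: "the" is followed by "dog" twice out of three
```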
After the Year 2000
In 2001, Yoshua Bengio and his team proposed the first neural “language” model, using a feed-forward neural network. A feed-forward neural network is an artificial neural network whose connections do not form a cycle. In this type of network, the data moves in only one direction: from the input nodes, through any hidden nodes, and on to the output nodes. With no cycles or loops, it is quite different from recurrent neural networks (a minimal sketch follows).
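The sketch below shows only the one-directional data flow the paragraph describes, using assumed layer sizes and random weights; it is not Bengio's actual language model, which additionally learned word representations.

```python
import numpy as np

def feed_forward(x, weights, biases):
    """One forward pass: input -> hidden -> output, with no cycles or
    loops (which is what distinguishes this from a recurrent network)."""
    activation = x
    for w, b in zip(weights, biases):
        # Each layer: a linear map followed by a nonlinearity.
        activation = np.tanh(w @ activation + b)
    return activation

rng = np.random.default_rng(0)
# Layer sizes (3 inputs, 4 hidden units, 2 outputs) are arbitrary.
w1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(feed_forward(np.ones(3), [w1, w2], [b1, b2]))
```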
In the year 2011, Apple’s Siri became known as one of the world’s first successful NLP/AI assistants to be used by general consumers. Within Siri, the Automated Speech Recognition module translates the owner’s words into digitally interpreted concepts. The Voice-Command system then matches those concepts to predefined commands, initiating specific actions. For example, if Siri asks, “Do you want to hear your balance?” it would understand a “Yes” or “No” response, and act accordingly.
By using Machine Learning techniques, the owner’s speaking pattern doesn’t have to match predefined expressions exactly. The sounds just have to be reasonably close for an NLP system to translate the meaning correctly. By using a feedback loop, NLP engines can significantly improve the accuracy of their translations and increase the system’s vocabulary. A well-trained system would understand the words “Where can I get help with Big Data?”, “Where can I find an expert in Big Data?”, or “I need help with Big Data,” and provide the appropriate response (a toy matching sketch follows).
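As a toy illustration of “reasonably close” matching, the sketch below scores utterances against one predefined expression using a simple bag-of-words overlap. The intent string, the 0.4 threshold, and the similarity measure are all assumptions made for this example; real assistants use trained statistical models rather than anything this crude.

```python
def bag_of_words(text):
    """Lowercase the text and keep the set of words in it."""
    return set(text.lower().replace("?", "").split())

def similarity(a, b):
    """Jaccard overlap: 1.0 = same vocabulary, 0.0 = nothing shared."""
    a, b = bag_of_words(a), bag_of_words(b)
    return len(a & b) / len(a | b)

# Hypothetical predefined expression for a single intent.
INTENT = "where can i get help with big data"

for utterance in ["Where can I find an expert in Big Data?",
                  "I need help with Big Data",
                  "Play some music"]:
    # Accept the utterance if it is "reasonably close" to the intent.
    matched = similarity(utterance, INTENT) > 0.4
    print(matched, "->", utterance)
```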
The combination of a dialog manager with NLP makes it possible to develop a system capable of holding a conversation, and sounding human-like, with back-and-forth questions, prompts, and answers. Our modern AIs, however, are still not able to pass Alan Turing’s test, and currently do not sound like real human beings. (Not yet, anyway.)
Reprinted from: https://www.dataversity.net/a-brief-history-of-natural-language-processing-nlp/#