Dear friends,

I’d like to share a part of the origin story of large language models that isn’t widely known.

A lot of early work in natural language processing (NLP) was funded by U.S. military intelligence agencies that needed machine translation and speech recognition capabilities. Then, as now, such agencies analyzed large volumes of text and recorded speech in various languages. They poured money into research in machine translation and speech recognition over decades, which motivated researchers to give these applications disproportionate attention relative to other uses of NLP.

This explains why many important technical breakthroughs in NLP stem from studying translation — more than you might imagine based on the modest role that translation plays in current applications. For instance, the celebrated transformer paper, “Attention Is All You Need” by the Google Brain team, introduced a technique for mapping a sentence in one language to a translation in another. This laid the foundation for large language models (LLMs) like ChatGPT, which map a prompt to a generated response.

Or consider the BLEU score, which is occasionally still used to evaluate LLMs by comparing their outputs to ground-truth examples. It was developed in 2002 to measure how well a machine-generated translation compares to a ground-truth, human-created translation.
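
To see the metric in use, here is a minimal sketch in Python, assuming the NLTK library is installed; the reference and candidate sentences are invented for illustration.

# Minimal BLEU illustration using NLTK's implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()         # human-created ground truth
candidate = "the cat is sitting on the mat".split()  # machine-generated output

# BLEU scores clipped n-gram overlap (here up to 2-grams) between the candidate
# and one or more references, with a brevity penalty for overly short outputs.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.5, 0.5),  # equal weight on 1-gram and 2-gram precision
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")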

A key component of LLMs is tokenization, the process of breaking raw input text into sub-word components that become the tokens to be processed. For example, the first part of the previous sentence may be divided into tokens like this:

/A /key /component /of /LL/Ms/ is/ token/ization
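
In practice, you can inspect such splits with any open-source BPE tokenizer. Here is a small sketch, assuming the tiktoken package and its cl100k_base vocabulary; the exact split depends on the vocabulary used.

# Sketch: see how a GPT-style BPE tokenizer splits text into sub-word tokens.
# Assumes the open-source tiktoken package is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-style BPE vocabulary
text = "A key component of LLMs is tokenization"

token_ids = enc.encode(text)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
          for t in token_ids]

print(token_ids)  # the integer IDs the model actually processes
print(pieces)     # the sub-word strings they correspond to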

The most widely used tokenization algorithm for text today is Byte Pair Encoding (BPE), which gained popularity in NLP after a 2015 paper by Sennrich et al. BPE starts with individual characters as tokens and repeatedly merges tokens that occur together frequently. Eventually, entire words as well as common sub-words become tokens. How did this technique come about? The authors wanted to build a model that could translate words that weren’t represented in the training data. They found that splitting words into sub-words created an input representation that enabled the model, if it had seen “token” and “ization,” to guess the meaning of a word it might not have seen before, such as “tokenization.”
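
The merge loop at the heart of BPE fits in a few lines. Below is a toy sketch on a tiny made-up corpus, not a production implementation; real tokenizers run the same idea over word frequencies from large corpora, typically on byte-level input.

# Toy sketch of the BPE merge loop: start from characters and repeatedly merge
# the most frequent adjacent pair of tokens. Illustrative only.
from collections import Counter

def learn_bpe(corpus, num_merges):
    words = [list(word) for word in corpus]  # each word starts as character tokens
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs across the corpus.
        pair_counts = Counter()
        for tokens in words:
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += 1
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with a merged token.
        for i, tokens in enumerate(words):
            merged, j = [], 0
            while j < len(tokens):
                if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == best:
                    merged.append(tokens[j] + tokens[j + 1])
                    j += 2
                else:
                    merged.append(tokens[j])
                    j += 1
            words[i] = merged
    return merges, words

corpus = ["token", "tokens", "tokenization", "organization"]
merges, segmented = learn_bpe(corpus, num_merges=8)
print(merges)     # learned merge rules, in order
print(segmented)  # how each word is segmented after the merges

Running many more merges on a real corpus is what turns frequent sub-words like “token” and “ization” into single vocabulary entries.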

I don’t intend this description of NLP history as advocacy for military-funded research. (I have accepted military funding, too. Some of my early work in deep learning at Stanford University was funded by DARPA, a U.S. defense research agency. This led directly to my starting Google Brain.) War is a horribly ugly business, and I would like there to be much less of it. Still, I find it striking that basic research in one area can lead to broadly beneficial developments in others. In similar ways, research into space travel led to LED lights and solar panels, experiments in particle physics led to magnetic resonance imaging, and studies of bacteria’s defenses against viruses led to the CRISPR gene-editing technology.

So it’s especially exciting to see so much basic research going on in so many different areas of AI. Who knows, a few years hence, what today’s experiments will yield?

Keep learning!
Andrew Ng