

Scene 1 (0s)

Neural Natural Language Processing (NLP): Pre-Training, Word Embeddings, and Transformers
Sebastian Bayerl - Sequence Learning and Speech Recognition, Summer Term 2025

Scene 2 (9s)

Pre-Training
General idea:
• multiple problems share common features / properties
• transfer knowledge from one domain to another domain
• transfer knowledge from one modality to another modality/domain

Scene 3 (22s)

Terminology
• Pre-training
• Supervised
• Unsupervised
• Semi-supervised
• Self-supervised
• Self-training
• Fine-tuning
• Transfer learning
• Representation learning

Scene 4 (31s)

Terminology
Supervised:
• all training data has labels
• the loss measures the correctness of the model with respect to the labels (categorical and continuous)
Semi-supervised:
• only a small part of the data has labels
• knowledge about the class distribution
• use of clustering
• use of Bayes' rule to assign labels

Scene 5 (46s)

Self-Supervised Learning (SSL)
• artificial neural network method
• between supervised and unsupervised learning
• first solve a task with pseudo-labels (initialization)
• then supervised or unsupervised training
Contrastive SSL:
• contrastive loss
• show positive examples and unlabeled negative samples
• Triplet loss: max(d(a, p) - d(a, n) + α, 0) with anchor a, positive p, negative n, and margin α
Figure: the desired effect on the cosine similarity of extracted neural network representations after training (the anchor is pulled towards the positive and pushed away from the negative).
Image source: https://de.wikipedia.org/wiki/Kontrastives_Lernen#/media/Datei:The-Triplet-loss-in-cosine-similarity.png
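As a minimal sketch of the contrastive idea (using cosine distance as in the figure and a made-up margin value), the triplet loss on this slide can be written directly:

```python
# Minimal sketch: triplet loss on embedding vectors with cosine distance.
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity of two embedding vectors."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0): pull the positive towards the
    anchor, push the negative at least `margin` further away."""
    return max(cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin, 0.0)

# Toy embeddings (hypothetical values, just for illustration).
a = np.array([0.9, 0.1, 0.0])
p = np.array([0.8, 0.2, 0.1])   # same class as the anchor
n = np.array([0.1, 0.9, 0.3])   # different class
print(triplet_loss(a, p, n))    # small value: embeddings already separated
```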

Scene 6 (1m 11s)

Self-Supervised Learning (SSL)
Non-contrastive SSL:
• only positive examples
• needs an extra predictor on the online side
• the gradient is not back-propagated on the target side
• the target network is non-trainable (it only follows the online network)
Image taken from: Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M., 2020. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.
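A rough, framework-free sketch of the non-contrastive setup (single linear layers standing in for the online and target encoders, made-up dimensions; not the exact BYOL architecture) to show the asymmetry between the two sides:

```python
# Online encoder + predictor are trained; the target encoder gets no gradients
# and only follows the online weights as a moving average.
import numpy as np

rng = np.random.default_rng(0)
D, H = 32, 16                       # input and embedding dimensions
W_online = rng.normal(size=(H, D))  # online encoder (trained by backprop)
W_pred   = rng.normal(size=(H, H))  # extra predictor, online side only
W_target = W_online.copy()          # target encoder: no gradients flow here

def normalize(v):
    return v / np.linalg.norm(v)

def byol_loss(view1, view2):
    """Negative cosine similarity between the online prediction for view1
    and the (stop-gradient) target embedding of view2."""
    online = normalize(W_pred @ (W_online @ view1))
    target = normalize(W_target @ view2)    # treated as a constant
    return -float(np.dot(online, target))

x = rng.normal(size=D)
view1 = x + 0.1 * rng.normal(size=D)        # two augmented "views" of x
view2 = x + 0.1 * rng.normal(size=D)
print(byol_loss(view1, view2))

# After each gradient step on W_online / W_pred, the target follows slowly:
tau = 0.99
W_target = tau * W_target + (1 - tau) * W_online
```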

Scene 7 (1m 45s)

Self Training
Train a supervised model with all the labeled data you have.
Diagram: audio recordings with transcripts ("I like cats", "Help me, I don't know anything", "I hate dogs") are used to train a supervised model.

Scene 8 (1m 56s)

Self Training
Get good-quality unlabeled (audio) data and pseudo-label it by running inference with the supervised model.
Diagram: the supervised model transcribes new audio ("I like burgers", "Audio books are great", "Help me, I don't get sequence learning"), producing new training data.

Scene 9 (2m 8s)

Self Training
Train a semi-supervised model on the pseudo-labeled data together with the original labeled training data.
Diagram: the pseudo-labeled audio ("I like burgers", "Audio books are great", "Help me, I don't get sequence learning") and the original labeled data are both used to train the new model.
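A minimal self-training loop, sketched with hypothetical `train` and `predict` helpers (the confidence filter is a common heuristic, not something stated on the slides):

```python
# Step 1: train on labeled data. Step 2: pseudo-label unlabeled data.
# Step 3: retrain on labeled + pseudo-labeled data.

def self_training(labeled, unlabeled, train, predict, confidence=0.9):
    """labeled: list of (x, y); unlabeled: list of x.
    train(pairs) -> model; predict(model, x) -> (label, score)."""
    model = train(labeled)                         # supervised model
    pseudo = []
    for x in unlabeled:                            # pseudo-labeling / inference
        label, score = predict(model, x)
        if score >= confidence:                    # keep only confident labels
            pseudo.append((x, label))
    return train(labeled + pseudo)                 # semi-supervised model
```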

Scene 10 (2m 20s)

Pre-Training: Image Processing
Simonyan and Zisserman, 2015: "Very Deep Convolutional Networks for Large-Scale Image Recognition", https://arxiv.org/pdf/1409.1556.pdf
• deep ConvNet trained on ImageNet (winner 2014!)
• aka OxfordNet, VGG16

Scene 11 (2m 35s)

Pre-Training: Image Processing
He et al., 2015: "Deep Residual Learning for Image Recognition", https://arxiv.org/pdf/1512.03385.pdf
• deep ConvNet with residual connections
• aka ResNet (e.g., ResNet-50)

Scene 12 (2m 49s)

Pre-Training: Image Processing
Dosovitskiy et al., 2021: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", https://arxiv.org/pdf/2010.11929.pdf
• Transformer with patch + position embeddings
• aka ViT (Vision Transformer)

Scene 13 (3m 4s)

Pre-Training: Audio
Snyder et al., 2018: "X-Vectors: Robust DNN Embeddings for Speaker Recognition", https://www.danielpovey.com/files/2018_icassp_xvectors.pdf
• TDNN features and global pooling
• classification of age/sex, medical conditions, etc.
Image source: Chung, J.S., Nagrani, A., Zisserman, A., 2018. VoxCeleb2: Deep Speaker Recognition.

Scene 14 (3m 27s)

Pre-Training: wav2vec 2.0
Baevski, A., Zhou, Y., Mohamed, A., Auli, M., 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
Baevski, A., Hsu, W.-N., Conneau, A., Auli, M., 2021. Unsupervised Speech Recognition.
• Transformer-based contextual audio embeddings
• multi-purpose representations of audio
• audio classification, speech recognition, speaker recognition, ...

Scene 15 (3m 50s)

Pre-Training: Natural Language Processing
Unfortunately, there's no ImageNet equivalent :-(
But there's plenty of text out there: Wikipedia, Stack Overflow, ...
So far:
• TF-IDF
• n-grams
• Bag of Words (BoW)
• lexical analysis
• ontologies
• manual feature engineering
• rule-based systems

Scene 16 (4m 3s)

Word Embeddings

Scene 17 (4m 11s)

Why word embeddings?
Audio input is already numeric:
Waveform: [1, 1, 1, ... -182, -99, ... 133, 1, 1]
Spectral representation:
[[7.39e-02, 1.23e-01, 1.23e-01, ... 1.33e+02, 1.23e-01, 1.23e-01],
 [3.39e-03, 2.43e-01, 1.23e-01, ... 3.33e+02, 5.27e-01, 3.25e-02],
 [...],
 [7.39e-02, 1.23e-01, 1.23e-01, ... 1.33e+02, 1.23e-01, 1.23e-01]]
Output: Trumpet, Saxophone, Violin, Guitar, Piano, Drums, Bass, ...

Scene 18 (4m 44s)

Why word embeddings?
Input: "Very enjoyable nonsense, this movie"
Output: Positive, Negative, Neutral

Scene 19 (4m 54s)

One-Hot Representation
very      → [1 0 0 0 0]
enjoyable → [0 1 0 0 0]
nonsense  → [0 0 1 0 0]
this      → [0 0 0 1 0]
movie     → [0 0 0 0 1]

Scene 20 (5m 4s)

One-Hot Representation
Now the vocabulary also contains "film":
very      → [1 0 0 0 0]
enjoyable → [0 1 0 0 0]
nonsense  → [0 0 1 0 0]
this      → [0 0 0 1 0]
movie     → [0 0 0 0 1]
film      → ?

Scene 21 (5m 13s)

One-Hot Representation
very      → [1 0 0 0 0 0]
enjoyable → [0 1 0 0 0 0]
nonsense  → [0 0 1 0 0 0]
this      → [0 0 0 1 0 0]
movie     → [0 0 0 0 1 0]
film      → [0 0 0 0 0 1]
Problems:
• exploding dimensionality: the size of the input grows with the number of words in the vocabulary
• no relation between words (synonyms, e.g., film/movie)
• missing context
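A short sketch making these problems concrete: the vector length equals the vocabulary size, and every pair of distinct words looks equally unrelated:

```python
# One-hot encoding: dimensionality = |vocabulary|, no similarity structure.
import numpy as np

vocab = ["very", "enjoyable", "nonsense", "this", "movie", "film"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))        # grows with every new vocabulary entry
    v[index[word]] = 1.0
    return v

print(one_hot("movie"))                          # [0. 0. 0. 0. 1. 0.]
print(one_hot("movie") @ one_hot("film"))        # 0.0 -> no similarity encoded
print(one_hot("movie") @ one_hot("nonsense"))    # 0.0 -> same for any word pair
```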

Scene 22 (5m 29s)

Suggestions for improvements?
dog, movie, film
• fixed-size representation
• meaning encoded in the representation
• encoding of relations between words
• could it be done using an ontology?

Scene 23 (5m 40s)

Word Meaning
              dog    movie   film
"moves"       0.9    0.8     0.8
"art"         0.0    0.6     0.6
"us-english"  0.9    0.8     0.2
"is alive"    0.0    0.0     1.0
"noun"        1.0    1.0     0.5
...           ...    ...     ...

DISTRIBUTED REPRESENTATIONS!
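With distributed feature vectors, similarity between words becomes measurable. A small sketch with made-up feature values (illustrating the idea of the table, not its exact numbers) showing that synonyms end up close under cosine similarity:

```python
# Hypothetical feature vectors; dimensions could be "moves", "art", "is alive", ...
import numpy as np

dog   = np.array([0.9, 0.0, 0.0, 1.0, 1.0])
movie = np.array([0.0, 0.9, 0.8, 0.0, 1.0])
film  = np.array([0.0, 0.9, 0.7, 0.0, 1.0])

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine(movie, film))   # ~1.0: the synonyms share almost all features
print(cosine(dog, movie))    # much lower: few shared properties
```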

Scene 24 (5m 54s)

How can we automatically generate distributed representations of words from text?

Scene 25 (6m 3s)

Automatic generation of distributed representations
How?
"A hairy, small Wolpertinger hid behind the tree."
dog: [0.9, 0.0, 0.9, 0.0, 1.0]
Image source: https://de.wikipedia.org/wiki/Wolpertinger#/media/Datei:Wolpertinger.jpg

Scene 26 (6m 20s)

Automatic generation of distributed representations
How?
"A hairy, small dog hid behind the tree."
"A tabby, small dog hid behind the barn."
dog: [0.9, 0.0, 0.9, 0.0, 1.0, ...]

Scene 27 (6m 34s)

Automatic generation of distributed representations
dog: [0.9, 0.0, 0.9, 0.0, 1.0]
"You shall know a word by the company it keeps." (J.R. Firth, A Synopsis of Linguistic Theory 1930-55, 1957)
Image sources: https://de.wikipedia.org/wiki/Wolpertinger#/media/Datei:Wolpertinger.jpg, https://en.wikipedia.org/wiki/Tabby_cat#/media/File:Cat_November_2010-1a.jpg

Scene 28 (6m 57s)

Word2Vec
• learning "word embeddings" using a neural network
• automatically learning meaningful representations of words in context
• use of contextual information (context words)
• skip-gram and negative-sampling methodology
Example sentences:
• I would like a glass of apple juice.
• An apple grows on a tree.
• Yesterday, my father baked an apple pie.
• She drank a glass of orange juice.
• There is an orange tree in the backyard.
• First, peel the orange.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed Representations of Words and Phrases and their Compositionality.

Scene 29 (7m 24s)

Word2Vec
Target word "movie": one-hot [0 1 0 0 0 0 0] → Encode → dense vector [0.1 0.3 0.4 0.9 0.05 0.7]
Context word "see": one-hot [0 0 1 0 0 0 0] → Encode → dense vector
Dot product of the two encodings → score (e.g., 0.7)

Scene 30 (7m 36s)

Word2Vec
Target word "movie": one-hot [0 1 0 0 0 0 0] → Encode → dense vector [0.1 0.3 0.4 0.9 0.05 0.7]
Context word "see": one-hot [0 0 1 0 0 0 0] → Encode → dense vector
Dot product of the two encodings → score (e.g., 0.7)
Dimensions: one-hot input, e.g., dim = 10k (vocabulary size); dense encoding, e.g., dim = 300
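A sketch of this forward pass in plain numpy (hypothetical vocabulary indices; the sigmoid at the end turns the dot-product score into an "is this a real target/context pair?" probability, anticipating the negative-sampling objective later):

```python
# One-hot words are "encoded" by two weight matrices; multiplying a one-hot
# vector with a matrix is just selecting a row.
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 300                               # vocabulary and embedding size
W_target  = rng.normal(scale=0.01, size=(V, D))  # "Encode" for target words
W_context = rng.normal(scale=0.01, size=(V, D))  # "Encode" for context words

target_idx, context_idx = 1, 2          # e.g., "movie" and "see"

target_vec  = W_target[target_idx]      # dense 300-dim embedding of "movie"
context_vec = W_context[context_idx]    # dense 300-dim embedding of "see"

score = target_vec @ context_vec        # dot product, as on the slide
prob  = 1.0 / (1.0 + np.exp(-score))    # sigmoid -> probability of a real pair
print(score, prob)
```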

Scene 31 (7m 51s)

Word2Vec
Where do we get the two encoders (the target and context embedding matrices) from?
They have to be learned.

Scene 32 (8m 0s)

Word2Vec: Skip-gram
• choose context words to define positive samples with relation to the target word
• define a context size, i.e., a window of words around the target word
Example: "Let's go see a movie at the cinema."
target   context   label
movie    see       1
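A small sketch of generating the positive (target, context) pairs; the window size of 2 is an arbitrary example value:

```python
# Every word within `window` positions of the target becomes a context word.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "let's go see a movie at the cinema".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "movie"])
# [('movie', 'see'), ('movie', 'a'), ('movie', 'at'), ('movie', 'the')]
```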

Scene 33 (8m 15s)

Word2Vec: Negative Sampling
• choose random words from the vocabulary (not in the context of the target word!)
• label them as negative samples
• sampling frequency based on the frequency of the word in the training corpus
Example: "Let's go see a movie at the cinema."
target   context    label
movie    see        1
movie    glass      0
movie    dear       0
movie    autonomy   0
movie    where      0
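A sketch of drawing negative samples from a frequency-based distribution; the 3/4 exponent is the heuristic from Mikolov et al. (2013), and the toy corpus is made up:

```python
# Negative samples are drawn from the unigram distribution raised to 3/4.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
counts = Counter("let's go see a movie at the cinema a movie is fun".split())
vocab = list(counts)
freq = np.array([counts[w] for w in vocab], dtype=float)
p = freq ** 0.75
p /= p.sum()

def negative_samples(target, context, k=3):
    """Draw k words that are neither the target nor the true context word."""
    negatives = []
    while len(negatives) < k:
        w = str(rng.choice(vocab, p=p))
        if w not in (target, context):
            negatives.append(w)
    return negatives

print(negative_samples("movie", "see"))
```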

Scene 34 (8m 31s)

Word2Vec Training Process
Target word "movie": one-hot [0 1 0 0 0 0 0] → Encode → [0.1 0.3 0.4 0.9 0.05 0.7]
Context words "see", "dear", "autonomy", "where": one-hot (e.g., [0 0 1 0 0 0 0] for "see") → Encode
Pairs (movie, see), (movie, dear), (movie, autonomy), (movie, where) with labels 1, 0, 0, 0
Calculate the loss (cross-entropy).

Scene 35 (8m 44s)

Word2Vec Training Process
Same setup: encode target and context words, compare the predicted scores with the labels 1, 0, 0, 0, calculate the loss (cross-entropy), and update both encoders via backpropagation.
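Putting the pieces together, a compact framework-free training-step sketch (made-up dimensions and indices; real Word2Vec iterates over many target words and uses the efficiency tricks on the following slides):

```python
# Binary cross-entropy on the dot products of (target, context) pairs with
# labels 1 (true context) and 0 (negative samples), manual gradient steps.
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 7, 6, 0.1
W_t = rng.normal(scale=0.1, size=(V, D))    # target-word encoder
W_c = rng.normal(scale=0.1, size=(V, D))    # context-word encoder

target = 1                                  # "movie"
contexts = np.array([2, 3, 4, 5])           # "see", "dear", "autonomy", "where"
labels = np.array([1.0, 0.0, 0.0, 0.0])

for step in range(2000):
    t = W_t[target]                                   # (D,)
    c = W_c[contexts]                                 # (4, D)
    p = 1.0 / (1.0 + np.exp(-(c @ t)))                # sigmoid of dot products
    loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    grad = (p - labels) / len(labels)                 # d loss / d logits
    W_c[contexts] -= lr * np.outer(grad, t)           # backprop into both
    W_t[target]   -= lr * (grad @ c)                  # encoders
print(round(float(loss), 4), p.round(2))              # p close to [1, 0, 0, 0]

# The rows of W_t (the hidden-layer weights) are the word embeddings we keep.
```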

Scene 36 (8m 56s)

Word2Vec Training Process
Where do we extract the embeddings from in the finished model? (Options A-F mark different parts of the network diagram.)

Scene 37 (9m 6s)

Word2Vec Training Process
Where do we extract the embeddings from in the finished model? Answer: D, the hidden layer of the neural network.
• dimensionality independent of the vocabulary size
• lower dimensionality than the input (the size of the vocabulary)
• extracted from the hidden layer of the neural network
• encodes contextual relations

Scene 38 (9m 21s)

Word2Vec Formalism
Given a sequence of training words w_1, w_2, ..., w_T, the objective of the Skip-gram model is to maximize the average log-probability

    (1/T) * Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

where T is the number of words in the corpus and c is the size of the context window (possibly a function of the center word w_t). A larger c leads to higher accuracy but also to longer training times.
The basic Skip-gram formulation defines p(w_O | w_I) with the softmax

    p(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_{w=1..W} exp(v'_w · v_{w_I})

where v_w and v'_w are the input and output vector representations of w, respectively, and W is the size of the vocabulary. This formulation is impractical because the cost of computing the gradient of log p(w_O | w_I) is proportional to W.

Scene 39 (9m 44s)

Word2Vec Formalism: Hierarchical Softmax
• efficient approximation of the full softmax
• binary tree representation of the vocabulary: each word is a leaf and is represented by the path from the root to that leaf (relative probabilities of the child nodes)
• reduces the complexity of evaluating the softmax from W to about log2(W)
• each word w can be reached by an appropriate path from the root of the tree: n(w, j) is the j-th node on that path and L(w) is the length of the path
• one representation v_w for each word and one representation v'_n for each inner node n

Scene 40 (10m 7s)

Word2Vec Formalism: Negative Sampling
• Noise Contrastive Estimation (NCE): an alternative to the hierarchical softmax
• NCE can be shown to approximately maximize the log-probability of the softmax
• we are only concerned with learning high-quality word vectors, so NCE can be simplified as long as the embedding quality does not suffer
• Objective: every log p(w_O | w_I) term in the Skip-gram objective is replaced by

      log σ(v'_{w_O} · v_{w_I}) + Σ_{i=1..k} E_{w_i ~ P_n(w)} [ log σ(-v'_{w_i} · v_{w_I}) ]

  which distinguishes the target word w_O from words drawn from the noise distribution P_n(w)
• k is the number of negative samples; values of 5-20 work well for small datasets, about 2-5 for large datasets

Scene 41 (10m 30s)

Word2Vec Formalism: Subsampling of Frequent Words
• large corpora can easily contain the most frequent words ('in', 'the', 'a', 'and', ...) hundreds of millions of times
• there is little information gain from these words, while co-occurrences of content words (e.g., "France" and "Paris") are useful for learning the word vectors
• to counterbalance rare and frequent words, each word w_i in the training set is discarded with probability

      P(w_i) = 1 - sqrt(t / f(w_i))

  where f(w_i) is the frequency of the word w_i in the corpus and t is a chosen threshold (typically around 10^-5)
• a heuristic that worked well in practice
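A sketch of this subsampling heuristic with made-up word counts and the typical threshold t = 1e-5:

```python
# Frequent words are discarded with probability 1 - sqrt(t / f(w)).
import random

t = 1e-5
counts = {"the": 500_000, "movie": 1_200, "wolpertinger": 3}
total = sum(counts.values())

def keep(word):
    """Return True if this occurrence of the word survives subsampling."""
    f = counts[word] / total                     # relative frequency f(w)
    p_discard = max(0.0, 1.0 - (t / f) ** 0.5)
    return random.random() >= p_discard

print(sum(keep("the") for _ in range(1000)))          # most occurrences dropped
print(sum(keep("wolpertinger") for _ in range(1000))) # rare word always kept
```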

Scene 42 (10m 55s)

Why word embeddings?
"Very enjoyable nonsense, this movie"
Each word is mapped to a dense vector:
very      → [0.6, 0.01, 0.03, 0.3, 0.01, 0.02]
enjoyable → [0.9, 0.32, 0.88, 0.12, 0, 0.2]
nonsense  → [0.25, 0, 0.25, 0.22, 0.33, 0.8]
this      → [0.1, 0.2, 0.88, 0.65, 0.23, 0.24]
movie     → [0.1, 0.01, 0.23, 0.65, 0.44, 0.9]
Embeddings are mostly pre-trained on large datasets; fine-tuning is used for specific use cases and domain adaptation.
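One common way (not the only one) to use such pre-trained embeddings for the earlier sentiment task is mean pooling over the word vectors; a minimal sketch with the example vectors from this slide:

```python
# Look up each word's vector and average them into a fixed-size sentence vector.
import numpy as np

embeddings = {
    "very":      [0.6, 0.01, 0.03, 0.3, 0.01, 0.02],
    "enjoyable": [0.9, 0.32, 0.88, 0.12, 0.0, 0.2],
    "nonsense":  [0.25, 0.0, 0.25, 0.22, 0.33, 0.8],
    "this":      [0.1, 0.2, 0.88, 0.65, 0.23, 0.24],
    "movie":     [0.1, 0.01, 0.23, 0.65, 0.44, 0.9],
}

tokens = "very enjoyable nonsense this movie".split()
sentence_vec = np.mean([embeddings[w] for w in tokens], axis=0)
print(sentence_vec)          # fixed-size input for a sentiment classifier
```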

Scene 43 (11m 19s)

Visualizing embeddings
Image source: Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed Representations of Words and Phrases and their Compositionality.

Scene 44 (11m 35s)

Visualizing embeddings
• visualization of word embeddings in 2D/3D: PCA, t-SNE, UMAP
• good for understanding the quality of the embeddings
• good embeddings form visible clusters
• good embeddings show semantic relations
Figure: 2D projections of word vectors showing male-female (king/queen), verb-tense (walking/walked, swimming/swam), and country-capital (Spain-Madrid, Germany-Berlin, Turkey-Ankara, Russia-Moscow, Canada-Ottawa, China-Beijing, ...) relations.
Image source: https://www.tensorflow.org/images/linear-relationships.png
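A minimal plotting sketch (random vectors stand in for real trained embeddings; in practice you would load the vectors from a trained model, and t-SNE or UMAP can be swapped in for PCA):

```python
# Project high-dimensional word vectors to 2D with PCA and plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["king", "queen", "man", "woman", "berlin", "germany"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 300))   # stand-in for real embeddings

points = PCA(n_components=2).fit_transform(vectors)
plt.scatter(points[:, 0], points[:, 1])
for (x, y), word in zip(points, words):
    plt.annotate(word, (x, y))
plt.show()
```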

Scene 45 (11m 54s)

Timeline: Word Embeddings
There are many more embedding models, e.g., ELECTRA, RoBERTa, DistilBERT, XLNet, T5, GPT-2/3/4, and many more new models...
Image source: Wang, S., Zhou, W. & Jiang, C. A survey of word embeddings based on deep learning. Computing 102, 717–740 (2020).

Scene 46 (12m 13s)

Attention

Scene 47 (12m 21s)

Attention
"Attention is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether considered subjective or objective, while ignoring other perceivable information." (https://en.wikipedia.org/wiki/Attention)
"Selective auditory attention or selective hearing is a type of selective attention. Selective hearing is characterized as the action in which people focus their attention intentionally on a specific source of a sound or spoken words." (https://en.wikipedia.org/wiki/Selective_auditory_attention)
Image source: Gaspelin, Nicholas & Luck, Steven. (2018). The Role of Inhibition in Avoiding Distraction by Salient Stimuli. Trends in Cognitive Sciences.

Scene 48 (12m 51s)

Attention components
Image source: Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., 2021. An Attentive Survey of Attention Models.
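The components named in the survey figure (queries, keys, values, attention weights) can be illustrated with the standard scaled dot-product formulation; a minimal sketch, assuming this specific and common variant rather than the exact model in the figure:

```python
# weights = softmax(Q K^T / sqrt(d_k)); output = weights V. Toy shapes.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # how much each query attends to each key
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))     # 2 queries, dimension 8
K = rng.normal(size=(5, 8))     # 5 keys
V = rng.normal(size=(5, 16))    # 5 values, dimension 16
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (2, 16), each row of weights sums to 1
```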

Scene 49 (13m 5s)

Attention interpretation
Image source: Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., 2021. An Attentive Survey of Attention Models.

Scene 50 (13m 19s)

Attention: Types of Attention Mechanisms
Taken from: Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., 2021. An Attentive Survey of Attention Models.