Notice: file_put_contents(): Write of 4890 bytes failed with errno=28 No space left on device in /var/www/tgoop/post.php on line 50

Warning: file_put_contents(): Only 12288 of 17178 bytes written, possibly out of free disk space in /var/www/tgoop/post.php on line 50
Machine learning books and papers@Machine_learn P.3440
MACHINE_LEARN Telegram 3440
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

Paper: https://arxiv.org/pdf/2501.13306v2.pdf

Code: https://github.com/aslp-lab/osum

Datasets: LibriSpeech - IEMOCAP



@Machine_learn
2



tgoop.com/Machine_learn/3440
Create:
Last Update:

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

Paper: https://arxiv.org/pdf/2501.13306v2.pdf

Code: https://github.com/aslp-lab/osum

Datasets: LibriSpeech - IEMOCAP



@Machine_learn

BY Machine learning books and papers




Share with your friend now:
tgoop.com/Machine_learn/3440

View MORE
Open in Telegram


Telegram News

Date: |

Image: Telegram. Find your optimal posting schedule and stick to it. The peak posting times include 8 am, 6 pm, and 8 pm on social media. Try to publish serious stuff in the morning and leave less demanding content later in the day. In the “Bear Market Screaming Therapy Group” on Telegram, members are only allowed to post voice notes of themselves screaming. Anything else will result in an instant ban from the group, which currently has about 75 members. Unlimited number of subscribers per channel But a Telegram statement also said: "Any requests related to political censorship or limiting human rights such as the rights to free speech or assembly are not and will not be considered."
from us


Telegram Machine learning books and papers
FROM American