auditus gives you simple access to state-of-the-art audio embeddings. Like SentenceTransformers for audio.
```sh
pip install auditus
```
## Quickstart
The high-level object in auditus is the `AudioPipeline`, which takes an audio file path and returns a pooled embedding.
```python
from auditus.transform import AudioPipeline

pipe = AudioPipeline(
    # Default AST model
    model_name="MIT/ast-finetuned-audioset-10-10-0.4593",
    # PyTorch output
    return_tensors="pt",
    # Resampled to 16 kHz
    target_sr=16000,
    # Mean pooling to obtain a single embedding vector
    pooling="mean",
)

output = pipe("../test_files/XC119042.ogg").squeeze(0)
print(output.shape)
output[:5]
```
Many audio Transformer models only work at a specific sampling rate. With `Resampling` you can resample the audio to the desired sampling rate. Here we go from 32 kHz to 16 kHz.
```python
from auditus.transform import Resampling

resampled = Resampling(target_sr=16000)(audio)
resampled
```
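To make the effect of resampling concrete, here is a minimal standalone sketch using SciPy (not part of auditus, and not necessarily what `Resampling` uses internally): going from 32 kHz to 16 kHz halves the number of samples while preserving the signal's duration.

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 32000, 16000

# One second of a 440 Hz sine tone sampled at 32 kHz.
t = np.linspace(0, 1, sr_in, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)

# Polyphase resampling by the ratio sr_out/sr_in halves the sample count.
resampled = resample_poly(audio, sr_out, sr_in)

print(len(audio), len(resampled))  # 32000 16000
```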
The main transform in auditus is the `AudioEmbedding` transform. It takes an `AudioArray` and returns a tensor. Check out the HuggingFace docs for more information on the available parameters.
```python
from auditus.transform import AudioEmbedding

emb = AudioEmbedding(return_tensors="pt")(resampled)
print(emb.shape)
emb[0][:5]
```
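The embedding is frame-level, which is why `AudioPipeline` applies pooling to reduce it to a single vector. The sketch below illustrates the idea with NumPy on made-up data; the frame count and the 768-dimensional embedding size are assumptions for illustration, not guaranteed by auditus.

```python
import numpy as np

# Hypothetical frame-level embeddings: (num_frames, embedding_dim).
frames = np.random.rand(1214, 768)

# Mean pooling averages over the time axis, collapsing the frames
# into one fixed-size vector, as pooling="mean" does in AudioPipeline.
pooled = frames.mean(axis=0)

print(pooled.shape)  # (768,)
```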