from IPython.display import Audio
Transform
AudioLoader Transform
The AudioLoader transform reads in audio file paths with a given sampling rate. The file is loaded into an AudioArray object, which contains a 1D NumPy array of the audio signal and the sampling rate.
AudioLoader
AudioLoader (sr:int=None)
Load audio files into an AudioArray object.
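For orientation, here is a minimal sketch of the data an AudioArray carries. The field names a and sr come from the repr shown further down; everything else (the class body, the duration helper) is purely illustrative and not the auditus implementation.
# Hypothetical stand-in for auditus.core.AudioArray, based on the fields visible in its repr.
import numpy as np
from dataclasses import dataclass

@dataclass
class AudioArraySketch:
    a: np.ndarray  # 1D audio signal
    sr: int        # sampling rate in Hz

    @property
    def duration(self) -> float:
        return len(self.a) / self.sr  # length in seconds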
Our test files are .ogg files with a sampling rate of 32kHz (32_000).
sr = 32_000
al = AudioLoader(sr=sr)
test_eq(al.sr, sr)
= "../test_files"
test_dir = globtastic(test_dir, file_glob="*.ogg")
file_paths file_paths
(#2) ['../test_files/H02_20230421_190500.ogg','../test_files/XC119042.ogg']
test_path = file_paths[-1]
test_path
'../test_files/XC119042.ogg'
str -> AudioArray
Our test file is a bird song from Xeno Canto of approximately 20 seconds. The length should be nearly \(32000 \times 20 = 640000\) samples.
audio_arr = al(test_path)
test_eq(audio_arr.sr, sr)
test_eq(audio_arr.shape, (632790,))
audio_arr
auditus.core.AudioArray(a=array([-2.64216160e-05, -2.54259703e-05, 5.56615578e-06, ...,
-2.03555092e-01, -2.03390077e-01, -2.45199591e-01]), sr=32000)
audio_arr.audio()
Resampling
The AST (Audio Spectrogram Transformer) model we use requires 16kHz audio. We can use Resampling to get audio with the correct sampling rate.
Resampling
Resampling (target_sr:int)
Resample audio to a given sampling rate.
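Under the hood this is ordinary signal resampling. The sketch below shows what the operation amounts to using librosa; the choice of librosa is an assumption, and auditus may use a different resampling backend internally.
# Sketch of the resampling operation itself, using librosa as an assumed backend
# (auditus may use a different implementation internally).
import librosa
import numpy as np

y_32k = np.random.randn(632_790)                                    # stand-in for audio_arr.a
y_16k = librosa.resample(y_32k, orig_sr=32_000, target_sr=16_000)   # downsample 32kHz -> 16kHz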
target_sr = 16_000
r = Resampling(target_sr=target_sr)
r
Resampling -- {'target_sr': 16000}
(enc:1,dec:0)
The new length is:
\[l_{new} = l_{old} \frac{sr_{new}}{sr_{old}}\]
where \(l\) is the NumPy array length and \(sr\) is the sampling rate.
In our example:
\[632790 \times \frac{16000}{32000} = 632790 \times 0.5 = 316395\]
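The same arithmetic as a tiny helper. This is only a sketch of what _new_length presumably computes; the helper name and the rounding behaviour are assumptions.
# Hypothetical helper mirroring the formula above; the real _new_length may round differently.
def new_length(old_length: int, old_sr: int, new_sr: int) -> int:
    return int(round(old_length * new_sr / old_sr))

new_length(632_790, 32_000, 16_000)  # 316395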
expected_length = 316395
test_eq(r._new_length(audio_arr, target_sr), expected_length)
resampled = r(audio_arr)
test_eq(resampled.sr, target_sr)
test_eq(resampled.shape, (expected_length,))
resampled
auditus.core.AudioArray(a=array([-2.64216160e-05, 5.56613802e-06, -1.35020873e-06, ...,
-2.39605007e-01, -2.03555112e-01, -2.45199591e-01]), sr=16000)
Audio(resampled, rate=target_sr)
AudioEmbedding
AudioEmbedding allows us to use audio models from the HuggingFace Hub as feature extractors. A great baseline model is the Audio Spectrogram Transformer (AST), which is the default in auditus.
AudioEmbedding
AudioEmbedding (model_name:str='MIT/ast-finetuned-audioset-10-10-0.4593', return_tensors:str='np')
Embed audio using a HuggingFace Audio model.
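Roughly speaking, AudioEmbedding wraps a HuggingFace feature extractor and model. The following is a minimal sketch with the raw transformers API; it is an assumption about the internals, not the actual auditus implementation.
# Sketch of the HuggingFace calls that AudioEmbedding presumably wraps (an assumption).
from transformers import AutoFeatureExtractor, AutoModel

model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
fe = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = fe(resampled.a, sampling_rate=16_000, return_tensors="pt")  # AST expects 16kHz audio
hidden = model(**inputs).last_hidden_state                           # (1, 1214, 768) for this clip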
ae = AudioEmbedding()
ae.model
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
/Users/clepelaars/miniconda3/envs/py312/lib/python3.12/site-packages/transformers/audio_utils.py:297: UserWarning: At least one mel filter has all zero values. The value for `num_mel_filters` (128) may be set too high. Or, the value for `num_frequency_bins` (256) may be set too low.
warnings.warn(
ASTModel(
(embeddings): ASTEmbeddings(
(patch_embeddings): ASTPatchEmbeddings(
(projection): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10))
)
(dropout): Dropout(p=0.0, inplace=False)
)
(encoder): ASTEncoder(
(layer): ModuleList(
(0-11): 12 x ASTLayer(
(attention): ASTSdpaAttention(
(attention): ASTSdpaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(output): ASTSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
(intermediate): ASTIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): ASTOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
(layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
NumPy
The default is to return embeddings in NumPy format.
emb = ae(resampled)
test_eq(emb.shape, (1214, 768))
emb[0][:5]
array([-0.5875584 , 0.2830076 , -0.72917604, 0.7644301 , -1.1770165 ],
dtype=float32)
Torch
Optionally, you can return embeddings as PyTorch tensors.
torch_ae = AudioEmbedding(return_tensors="pt")
torch_emb = torch_ae(resampled)
test_eq(torch_emb.shape, torch.Size([1214, 768]))
torch_emb[0][:5]
tensor([-0.5876, 0.2830, -0.7292, 0.7644, -1.1770])
Custom model
Any audio model on the HuggingFace Hub can be used to get audio embeddings. Here we test a custom fine-tuned AST model.
custom_ae = AudioEmbedding(model_name="xpariz10/ast-finetuned-audioset-10-10-0.4593_ft_env_aug_0-2", return_tensors="np")
custom_emb = custom_ae(resampled)
test_eq(custom_emb.shape, (1214, 768))
custom_emb[0][:5]
array([-0.79336447, 0.17551161, -0.95863634, 0.71531856, -1.04658 ],
dtype=float32)
TFAudioEmbedding
TFAudioEmbedding allows us to use audio models from TensorFlow Hub as feature extractors.
TFAudioEmbedding
TFAudioEmbedding (model_name:str)
Embed audio using a Tensorflow Hub model.
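Internally, this presumably amounts to loading a SavedModel with tensorflow_hub and running the waveform through it. The sketch below is an assumption: the infer_tf call mirrors the published usage of the bird-vocalization-classifier model, and other models expose different signatures.
# Rough sketch of a TensorFlow Hub embedding call; the call signature is an assumption
# based on the bird-vocalization-classifier model and may differ for other models.
import numpy as np
import tensorflow_hub as hub

model = hub.load("../test_models/bird-vocalization-classifier-tensorflow2-bird-vocalization-classifier-v8")
waveform = np.zeros((1, 160_000), dtype=np.float32)   # one 5-second window at 32kHz
logits, embeddings = model.infer_tf(waveform)          # embeddings: (1, 1280)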
This local example model only accepts a maximum of 5 seconds of audio. We therefore truncate the clip to 5 seconds (160,000 samples at 32kHz).
five_sec_arr = AudioArray(audio_arr.a[:160000], 32000)
five_sec_arr.audio()
tf_ae = TFAudioEmbedding("../test_models/bird-vocalization-classifier-tensorflow2-bird-vocalization-classifier-v8")
tf_emb = tf_ae(five_sec_arr)
test_eq(tf_emb.shape, (1, 1280))
tf_emb[0][:5]
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1743196317.278296 5032010 service.cc:152] XLA service 0x38f0d7f10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1743196317.278327 5032010 service.cc:160] StreamExecutor device (0): Host, Default Version
2025-03-28 22:11:57.499461: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:39] Ignoring Assert operator jax2tf_infer_fn_/assert_equal_1/Assert/AssertGuard/Assert
I0000 00:00:1743196318.484499 5032010 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
array([0.07468231, 0.0335138 , 0.03465324, 0.02102477, 0.0374587 ],
dtype=float32)
Pooling
Pooling is a convenient way to reduce the token embeddings to a single vector. auditus supports mean and max pooling.
Pooling
Pooling (pooling:str)
Pool embeddings
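On a plain array of token embeddings, this pooling is just a reduction over the first (token) axis. The NumPy sketch below illustrates the idea; the exact output shape of the real transform may differ.
# Pooling over the token (first) axis; roughly what the Pooling transform computes.
import numpy as np

emb = np.random.randn(1214, 768)   # token embeddings, as returned by AudioEmbedding
mean_vec = emb.mean(axis=0)        # (768,) -- pooling="mean"
max_vec = emb.max(axis=0)          # (768,) -- pooling="max"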
= Pooling(pooling="mean")
mean_pooled mean_pooled
Pooling -- {'pooling': 'mean'}
(enc:2,dec:0)
test_emb = np.array([
    [0.1, 0.2, 0.1],
    [0.1, 0.2, 0.9],
    [0.8, 0.6, 0.0]
])
test_emb.shape
(3, 3)
If pooling="mean"
, the mean of each embedding is taken.
= Pooling(pooling="mean")
mean_pooler = mean_pooler(test_emb)
mean_pooled 1/3, 1/3, 1/3]]))
test_eq(mean_pooled, np.array([[ mean_pooled
array([0.33333333, 0.33333333, 0.33333333])
If pooling="max"
, the maximum of each embedding is taken.
= Pooling(pooling="max")
max_pooler = max_pooler(test_emb)
max_pooled 0.8, 0.6, 0.9]]))
test_eq(max_pooled, np.array([[ max_pooled
array([0.8, 0.6, 0.9])
The Pooling transform can handle PyTorch tensors as well.
torch_emb = torch.tensor(test_emb)
torch_pooled = Pooling(pooling="mean").encodes(torch_emb)
test_eq(torch_pooled, torch.tensor([[1/3, 1/3, 1/3]], dtype=torch.float64))
torch_pooled
tensor([0.3333, 0.3333, 0.3333], dtype=torch.float64)
Pipeline
We can now compose a pipeline that loads an audio file with a sampling rate of 32kHz, resamples it to 16kHz, embeds it and max-pools the result.
pipe = Pipeline([al, r, ae, max_pooler])
emb = pipe(test_path)
test_eq(emb.shape, (768,))
emb[:5]
array([2.8618667, 2.7183478, 4.1287794, 2.6301968, 2.2177424],
dtype=float32)
For convenience, we create an AudioPipeline that processes audio end-to-end: from an audio file path to a single vector embedding.
AudioPipeline
AudioPipeline (model_name:str='MIT/ast-finetuned-audioset-10-10-0.4593', return_tensors:str='np', target_sr:int=16000, pooling:str='mean')
A pipeline of composed (for encode/decode) transforms, setup with types
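AudioPipeline presumably composes the same transforms we chained by hand above. The sketch below shows an equivalent manual pipeline mirroring the defaults in the signature; the exact internals (in particular the loader's sampling rate handling) are an assumption.
# Hand-rolled equivalent of AudioPipeline's defaults (an assumption about its internals).
manual_pipe = Pipeline([
    AudioLoader(sr=None),                 # load at the file's native rate
    Resampling(target_sr=16_000),         # AST expects 16kHz audio
    AudioEmbedding(return_tensors="np"),  # (tokens, 768) token embeddings
    Pooling(pooling="mean"),              # collapse to a single 768-dim vector
])
manual_emb = manual_pipe(test_path)       # shape: (768,)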
= AudioPipeline(return_tensors="pt")
pipe = pipe(test_path)
emb 768]))
test_eq(emb.shape, torch.Size([5] emb[:
/Users/clepelaars/miniconda3/envs/py312/lib/python3.12/site-packages/transformers/audio_utils.py:297: UserWarning: At least one mel filter has all zero values. The value for `num_mel_filters` (128) may be set too high. Or, the value for `num_frequency_bins` (256) may be set too low.
warnings.warn(
tensor([0.8653, 1.1659, 0.5956, 0.8498, 0.5322])
To process multiple audio files at once, we can call the AudioPipeline on each file path and stack the results.
# Multiple audio files in Torch
multi_emb = torch.stack([pipe(f).squeeze(0) for f in file_paths])
test_eq(multi_emb.shape, torch.Size([2, 768]))
multi_emb[:, :5]
tensor([[1.1501, 0.5910, 0.4068, 0.6158, 0.5433],
[0.8653, 1.1659, 0.5956, 0.8498, 0.5322]])
# Multiple audio files in NumPy
= AudioPipeline(return_tensors="np")
pipe = np.stack([pipe(f) for f in file_paths])
multi_emb 2, 768))
test_eq(multi_emb.shape, (5] multi_emb[:, :
array([[1.1501473 , 0.5909629 , 0.40684646, 0.61581504, 0.5432954 ],
[0.8652818 , 1.1659273 , 0.5955627 , 0.84978944, 0.53222984]],
dtype=float32)