Transform

from IPython.display import Audio
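
The rest of this notebook also relies on a handful of imports that are not shown in the cells below. The exact module paths are assumptions (adjust them to your installation); the names themselves all appear later on this page.

import numpy as np
import torch
from fastcore.test import test_eq          # equality assertions used throughout
from fastcore.xtras import globtastic      # recursive file globbing (also re-exported via fastcore.utils)
from fastcore.transform import Pipeline    # transform composition
from auditus.core import AudioArray        # container for the signal and its sampling rate
from auditus.transform import (AudioLoader, Resampling, AudioEmbedding,   # path assumed
                               TFAudioEmbedding, Pooling, AudioPipeline)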

AudioLoader Transform

The AudioLoader transform reads an audio file from a given path at a given sampling rate. The file is loaded into an AudioArray object, which holds a 1D NumPy array of the audio signal together with its sampling rate.


source

AudioLoader

 AudioLoader (sr:int=None)

Load audio files into an AudioArray object.
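
Conceptually, the loader just delegates to an audio backend and wraps the result. A minimal sketch of the idea, assuming librosa (which auditus may or may not use internally):

import librosa
from auditus.core import AudioArray

def load_audio(path: str, sr: int = None) -> AudioArray:
    # librosa.load resamples to `sr` when given; sr=None keeps the file's native rate
    y, loaded_sr = librosa.load(path, sr=sr)
    return AudioArray(a=y, sr=loaded_sr)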

Our test files are .ogg files with a sampling rate of 32kHz (32_000).

sr = 32_000
al = AudioLoader(sr=sr)
test_eq(al.sr, sr)
test_dir = "../test_files"
file_paths = globtastic(test_dir, file_glob="*.ogg")
file_paths
(#2) ['../test_files/H02_20230421_190500.ogg','../test_files/XC119042.ogg']
test_path = file_paths[-1]
test_path
'../test_files/XC119042.ogg'

str -> AudioArray

Our test file is a roughly 20-second bird song from Xeno Canto. Its length should therefore be close to \(32000 \times 20 = 640000\) samples.

audio_arr = al(test_path)
test_eq(audio_arr.sr, sr)
test_eq(audio_arr.shape, (632790,))
audio_arr
auditus.core.AudioArray(a=array([-2.64216160e-05, -2.54259703e-05,  5.56615578e-06, ...,
       -2.03555092e-01, -2.03390077e-01, -2.45199591e-01]), sr=32000)
audio_arr.audio()

Resampling

The AST (Audio Spectrogram Transformer) model we use requires 16kHz audio. We can use Resampling to get audio with the correct sampling rate.


source

Resampling

 Resampling (target_sr:int)

Resample audio to a given sampling rate.
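
Under the hood, resampling can be delegated to a standard routine such as librosa.resample. A minimal sketch, again assuming librosa rather than the actual auditus internals:

import librosa
from auditus.core import AudioArray

def resample(audio: AudioArray, target_sr: int) -> AudioArray:
    # Nothing to do if the audio is already at the target rate
    if audio.sr == target_sr:
        return audio
    y = librosa.resample(audio.a, orig_sr=audio.sr, target_sr=target_sr)
    return AudioArray(a=y, sr=target_sr)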

target_sr = 16_000
r = Resampling(target_sr=target_sr)
r
Resampling -- {'target_sr': 16000}
(enc:1,dec:0)

The new length is:

\[l_{new} = l_{old} \frac{sr_{new}}{sr_{old}}\]

where \(l\) is the NumPy array length and \(sr\) is the sampling rate.

In our example:

\[632790 \times \frac{16000}{32000} = 632790 \times 0.5 = 316395\]
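
The _new_length helper tested below presumably applies exactly this formula and rounds to a whole number of samples. A plausible standalone sketch (the new_length function here is hypothetical):

def new_length(old_length: int, old_sr: int, new_sr: int) -> int:
    # l_new = l_old * (sr_new / sr_old), rounded to a whole number of samples
    return round(old_length * new_sr / old_sr)

new_length(632790, 32000, 16000)  # -> 316395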

expected_length = 316395
test_eq(r._new_length(audio_arr, target_sr), expected_length)
resampled = r(audio_arr)
test_eq(resampled.sr, target_sr)
test_eq(resampled.shape, (expected_length,))
resampled
auditus.core.AudioArray(a=array([-2.64216160e-05,  5.56613802e-06, -1.35020873e-06, ...,
       -2.39605007e-01, -2.03555112e-01, -2.45199591e-01]), sr=16000)
Audio(resampled, rate=target_sr)

AudioEmbedding

AudioEmbedding allows us to use audio models from the HuggingFace Hub as feature extractors. A great baseline is the Audio Spectrogram Transformer (AST) model, which is the default in auditus.


source

AudioEmbedding

 AudioEmbedding (model_name:str='MIT/ast-finetuned-audioset-10-10-0.4593',
                 return_tensors:str='np')

Embed audio using a HuggingFace Audio model.
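
Roughly, the transform wraps a HuggingFace feature extractor and model and returns the final hidden states. A minimal sketch of the same idea using transformers directly (the embed helper below is hypothetical, not the actual internals of AudioEmbedding):

import torch
from transformers import AutoFeatureExtractor, AutoModel
from auditus.core import AudioArray

name = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = AutoFeatureExtractor.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(audio: AudioArray):
    # The feature extractor converts the waveform into a log-mel spectrogram
    inputs = extractor(audio.a, sampling_rate=audio.sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Drop the batch dimension -> (sequence_length, hidden_size), e.g. (1214, 768)
    return out.last_hidden_state.squeeze(0).numpy()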

ae = AudioEmbedding()
ae.model
ASTModel(
  (embeddings): ASTEmbeddings(
    (patch_embeddings): ASTPatchEmbeddings(
      (projection): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10))
    )
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (encoder): ASTEncoder(
    (layer): ModuleList(
      (0-11): 12 x ASTLayer(
        (attention): ASTSdpaAttention(
          (attention): ASTSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): ASTSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
        )
        (intermediate): ASTIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): ASTOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
    )
  )
  (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)

NumPy

The default is to return embeddings in NumPy format.

emb = ae(resampled)
test_eq(emb.shape, (1214, 768))
emb[0][:5]
array([-0.5875584 ,  0.2830076 , -0.72917604,  0.7644301 , -1.1770165 ],
      dtype=float32)

Torch

Optionally, you can return embeddings as PyTorch tensors.

torch_ae = AudioEmbedding(return_tensors="pt")
torch_emb = torch_ae(resampled)
test_eq(torch_emb.shape, torch.Size([1214, 768]))
torch_emb[0][:5]
tensor([-0.5876,  0.2830, -0.7292,  0.7644, -1.1770])

Custom model

Any audio model on the HuggingFace Hub can be used to get audio embeddings. Here we test a custom fine-tuned AST model.

custom_ae = AudioEmbedding(model_name="xpariz10/ast-finetuned-audioset-10-10-0.4593_ft_env_aug_0-2", return_tensors="np")
custom_emb = custom_ae(resampled)
test_eq(custom_emb.shape, (1214, 768))
custom_emb[0][:5]
array([-0.79336447,  0.17551161, -0.95863634,  0.71531856, -1.04658   ],
      dtype=float32)

TFAudioEmbedding

TFAudioEmbedding allows us to use audio models from TensorFlow Hub as feature extractors.


source

TFAudioEmbedding

 TFAudioEmbedding (model_name:str)

Embed audio using a TensorFlow Hub model.
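
TF Hub models are typically distributed as SavedModels. A minimal sketch of loading one with tensorflow_hub (the exact inference call depends on the model's serving signature, so this is not the literal implementation of TFAudioEmbedding):

import tensorflow_hub as hub

# Load a SavedModel from a local directory or a tfhub.dev handle
tf_model = hub.load("../test_models/bird-vocalization-classifier-tensorflow2-bird-vocalization-classifier-v8")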

This local example model only accepts a maximum of 5 seconds of audio, so we truncate the signal to the first 5 seconds (160,000 samples at 32kHz).

five_sec_arr = AudioArray(audio_arr.a[:160000], 32000)
five_sec_arr.audio()
tf_ae = TFAudioEmbedding("../test_models/bird-vocalization-classifier-tensorflow2-bird-vocalization-classifier-v8")
tf_emb = tf_ae(five_sec_arr)
test_eq(tf_emb.shape, (1, 1280))
tf_emb[0][:5]
array([0.07468231, 0.0335138 , 0.03465324, 0.02102477, 0.0374587 ],
      dtype=float32)

Pooling

Pooling is a convenient way to reduce a sequence of embeddings to a single vector. auditus supports mean and max pooling.


source

Pooling

 Pooling (pooling:str)

Pool embeddings
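
In NumPy terms, both options reduce over the first (sequence) axis. A minimal sketch of the equivalent operation, with a hypothetical pool function (not necessarily the exact implementation):

import numpy as np

def pool(emb: np.ndarray, pooling: str = "mean") -> np.ndarray:
    # Reduce (sequence_length, hidden_size) -> (hidden_size,) over the sequence axis
    if pooling == "mean":
        return emb.mean(axis=0)
    if pooling == "max":
        return emb.max(axis=0)
    raise ValueError(f"Unsupported pooling: {pooling}")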

mean_pooled = Pooling(pooling="mean")
mean_pooled
Pooling -- {'pooling': 'mean'}
(enc:2,dec:0)
test_emb = np.array([
    [0.1, 0.2, 0.1],
    [0.1, 0.2, 0.9],
    [0.8, 0.6, 0.0]
])
test_emb.shape
(3, 3)

If pooling="mean", the mean of each embedding is taken.

mean_pooler = Pooling(pooling="mean")
mean_pooled = mean_pooler(test_emb)
test_eq(mean_pooled, np.array([[1/3, 1/3, 1/3]]))
mean_pooled
array([0.33333333, 0.33333333, 0.33333333])

If pooling="max", the maximum of each embedding is taken.

max_pooler = Pooling(pooling="max")
max_pooled = max_pooler(test_emb)
test_eq(max_pooled, np.array([[0.8, 0.6, 0.9]]))
max_pooled
array([0.8, 0.6, 0.9])

The Pooling transform can handle PyTorch tensors as well.

torch_emb = torch.tensor(test_emb)
torch_pooled = Pooling(pooling="mean").encodes(torch_emb)
test_eq(torch_pooled, torch.tensor([[1/3, 1/3, 1/3]], dtype=torch.float64))
torch_pooled
tensor([0.3333, 0.3333, 0.3333], dtype=torch.float64)

Pipeline

We can now compose a pipeline that loads an audio file with a sampling rate of 32kHz, resamples it to 16kHz, embeds it and max-pools the result.

pipe = Pipeline([al, r, ae, max_pooler])
emb = pipe(test_path)
test_eq(emb.shape, (768,))
emb[:5]
array([2.8618667, 2.7183478, 4.1287794, 2.6301968, 2.2177424],
      dtype=float32)

For convenience, we create an AudioPipeline that processes audio end-to-end, from an audio file path to a single vector embedding.


source

AudioPipeline

 AudioPipeline (model_name:str='MIT/ast-finetuned-audioset-10-10-0.4593',
                return_tensors:str='np', target_sr:int=16000,
                pooling:str='mean')

A pipeline of composed (for encode/decode) transforms, setup with types
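
Judging by its parameters, AudioPipeline is essentially the composition we built by hand above. A rough hand-rolled equivalent (the exact wiring inside AudioPipeline is assumed, not verified):

# Hypothetical equivalent of AudioPipeline(target_sr=16_000, pooling="mean")
manual_pipe = Pipeline([
    AudioLoader(sr=None),                    # load at the file's native rate (assumed)
    Resampling(target_sr=16_000),            # resample for the AST model
    AudioEmbedding(model_name="MIT/ast-finetuned-audioset-10-10-0.4593",
                   return_tensors="np"),
    Pooling(pooling="mean"),                 # the default pooling
])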

pipe = AudioPipeline(return_tensors="pt")
emb = pipe(test_path)
test_eq(emb.shape, torch.Size([768]))
emb[:5]
tensor([0.8653, 1.1659, 0.5956, 0.8498, 0.5322])

To process multiple audio files at once, we can call the AudioPipeline on each file path and stack the results.

# Multiple audio files in Torch
multi_emb = torch.stack([pipe(f).squeeze(0) for f in file_paths])
test_eq(multi_emb.shape, torch.Size([2, 768]))
multi_emb[:, :5]
tensor([[1.1501, 0.5910, 0.4068, 0.6158, 0.5433],
        [0.8653, 1.1659, 0.5956, 0.8498, 0.5322]])
# Multiple audio files in NumPy
pipe = AudioPipeline(return_tensors="np")
multi_emb = np.stack([pipe(f) for f in file_paths])
test_eq(multi_emb.shape, (2, 768))
multi_emb[:, :5]
array([[1.1501473 , 0.5909629 , 0.40684646, 0.61581504, 0.5432954 ],
       [0.8652818 , 1.1659273 , 0.5955627 , 0.84978944, 0.53222984]],
      dtype=float32)