Didgitaldoo

Didge2Wav¶

It would be very useful to hear the sound of a didgeridoo before building it. This script converts a didgeridoo bore into a wav file; a mathematical explanation follows below. In the generated example audio, the "player" morphs through the vowels a -> e -> i -> o -> u.

In [ ]:
import numpy as np
from scipy.fft import rfft, irfft
from scipy.io import wavfile
from scipy.signal import butter, lfilter, find_peaks
from didgelab import acoustical_simulation, Geo

def get_vowel_params(t, total_duration):
    """
    Interpolates F1 and F2 from 'a' -> 'e' -> 'i' -> 'o' -> 'u' over time.
    Uses standard approximate formant targets (Hz).
    """
    # Normalized progress (0 to 1)
    progress = t / total_duration
    
    # Time anchor points for each vowel transition
    # a(0.0) -> e(0.25) -> i(0.5) -> o(0.75) -> u(1.0)
    points = [0.0, 0.25, 0.5, 0.75, 1.0]
    
    # Typical vowel formants (approx. male vocal tract)
    # a: F1~730, F2~1090
    # e: F1~530, F2~1840
    # i: F1~270, F2~2290
    # o: F1~570, F2~840
    # u: F1~300, F2~870
    f1_vals = [730, 530, 270, 570, 300]
    f2_vals = [1090, 1840, 2290, 840, 870]
    
    f1 = np.interp(progress, points, f1_vals)
    f2 = np.interp(progress, points, f2_vals)
    
    return f1, f2

def generate_dijj_excitation(f, a, duration, fs=44100, p0=100, R=10, a0=1.0):
    """
    Implements: dijj(t) = p0/R - (p0^2 / R^3) / (a0 + a * sin(2πft))^2
    """
    t = np.linspace(0, duration, int(fs * duration), endpoint=False)
    
    # Calculate the denominator (the oscillating lip area)
    # Ensure a < a0 to avoid division by zero or negative area
    lip_area_term = (a0 + a * np.sin(2 * np.pi * f * t))
    
    # The physical excitation signal
    excitation = (p0 / R) - ((p0**2 / R**3) / lip_area_term)
    
    # DC Offset Removal (Physical excitation has a constant flow component, 
    # but audio is AC coupled)
    excitation -= np.mean(excitation)
    
    return excitation


def synthesize_vocalized_didge(freqs, bore_z, f_lips=65, duration=4.0, fs=44100):
    # 1. Generate the raw physics source (dijj formula).
    # A relatively large 'a' produces sharper lip closures and hence
    # stronger high harmonics for the vowel filter to shape.
    source = generate_dijj_excitation(f=f_lips, a=0.85, duration=duration, fs=fs)
    
    # Pre-filter the source (Low Pass)
    b, a_lp = butter(2, 2500 / (fs/2), btype='low')
    source = lfilter(b, a_lp, source)

    # 2. Process in Windows (to allow the filter to change over time)
    window_size = 2048
    hop_size = 512
    output = np.zeros(len(source) + window_size)
    window = np.hanning(window_size)
    
    # Linear frequency grid for the FFT
    fft_freqs = np.linspace(0, fs//2, window_size // 2 + 1)
    # Interpolate bore impedance onto our FFT grid
    z_bore_linear = np.interp(fft_freqs, freqs, bore_z)

    for start in range(0, len(source) - window_size, hop_size):
        end = start + window_size
        chunk = source[start:end] * window
        
        # Get current vowel formants for this timestamp
        current_time = start / fs
        f1, f2 = get_vowel_params(current_time, duration)
        
        # Create Vowel Filter (Lorentzian peaks)
        bw = 100 # Bandwidth
        v_filter = (1 / (1 + ((fft_freqs - f1) / (bw / 2))**2) + 
                    0.8 / (1 + ((fft_freqs - f2) / (bw / 2))**2) + 0.05)
        
        # Combine: Source Spectrum * Bore Impedance * Vowel Filter
        chunk_spec = rfft(chunk)
        processed_spec = chunk_spec * (np.abs(z_bore_linear) ** 1.2) * v_filter
        
        # Inverse FFT and Overlap-Add
        processed_chunk = irfft(processed_spec)
        output[start:end] += processed_chunk * window

    # 3. Final Master and Export
    audio = output[:len(source)]
    audio = audio / (np.max(np.abs(audio)) + 1e-9)
    wavfile.write("didge.wav", fs, (audio * 32767).astype(np.int16))
    
# construct bore
geo = [[0,32], [1500, 43]]
geo = Geo(geo)

# run acoustical simulation
freq_grid = np.linspace(30, 40000, 1000)
z = acoustical_simulation(geo, freq_grid)

# find fundamental frequency
i = find_peaks(z)[0]
f0 = freq_grid[i][0]
print(f'fundamental frequency {f0:.2f} Hz')

# synthesize audio
synthesize_vocalized_didge(freq_grid, z, f_lips=f0, duration=10)

Explanation¶

This script simulates the complex interaction between a human performer (the vocal tract) and the physical instrument (the bore). It treats the didgeridoo as a time-variant linear system where the "input" is a lip-buzz and the "filter" is a combination of the wooden tube and the player's mouth.

1. The Source-Filter Model¶

The core logic follows the source-filter theory of speech production. The total output $Y(f)$ in the frequency domain is defined as:

$$Y(f) = X(f) \cdot Z_{bore}(f) \cdot V(f)$$

Where:

  • $X(f)$: The harmonic excitation from the lips.
  • $Z_{bore}(f)$: The acoustic impedance of the didgeridoo tube.
  • $V(f)$: The time-varying vowel filter (formants).
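The source-filter product can be illustrated on a tiny toy spectrum. The arrays below are made-up placeholder values, not real didgeridoo data; the point is only that the output spectrum is an element-wise product on a shared frequency grid:

```python
import numpy as np

# Toy illustration of Y(f) = X(f) * Z_bore(f) * V(f).
# All values are invented placeholders on a 5-point frequency grid.
f = np.linspace(0, 2000, 5)                # frequency grid (Hz)
X = np.array([0.0, 1.0, 0.5, 0.25, 0.1])   # harmonic source spectrum
Z = np.array([1.0, 3.0, 1.0, 2.0, 1.0])    # bore impedance magnitude
V = np.array([0.1, 1.0, 0.2, 0.8, 0.1])    # vowel filter

Y = X * Z * V   # element-wise product in the frequency domain
```

A harmonic only survives into the output where both the bore impedance and the vowel filter are large at its frequency.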

2. Excitation from Lips¶

The excitation from the lips is modelled according to (Hindle and Posnett, 2017):

$$\mathrm{dijj}(f, a, t) = \frac{p_0}{R} - \frac{p_0^2 / R^3}{(a_0 + a \sin(2\pi f t))^2}$$

Where:

  • $f$: frequency of the lip vibration,
  • $a$: size of the lip opening, where $0 \le a < a_0$,
  • $t$: time,
  • $p_0$: blowing pressure,
  • $R$: acoustic resistance of the instrument tube at the resonant frequency.

We keep $p_0$, $R$, and $a_0$ constant, while $f$ and $a$ may vary over $t$. The output over $t$ is the waveform of the didgeridoo excitation signal.
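A quick sanity check of the formula, using the same illustrative constants as the script ($p_0 = 100$, $R = 10$, $a_0 = 1$): at $t = 0$ the sine term vanishes and the two terms cancel exactly, while near full lip closure ($\sin = -1$) the squared denominator becomes tiny and the signal spikes sharply negative, which is what gives the excitation its rich harmonic spectrum.

```python
import numpy as np

# Evaluate dijj(t) at two characteristic times, with p0=100, R=10,
# a0=1.0, f=65 Hz, a=0.85 (the constants used in the script above).
p0, R, a0 = 100.0, 10.0, 1.0
f, a = 65.0, 0.85

def dijj(t):
    lip_area = a0 + a * np.sin(2 * np.pi * f * t)
    return p0 / R - (p0**2 / R**3) / lip_area**2

# t = 0: p0/R - p0^2/R^3 = 10 - 10 = 0 (the terms cancel)
open_sample = dijj(0.0)

# t = 3/(4f): sin = -1, lip area shrinks to a0 - a = 0.15,
# producing a large negative pressure spike
closed_sample = dijj(3.0 / (4.0 * f))
```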

3. Vowel Interpolation (The $V(f)$ Filter)¶

To create the "a-e-i-o-u" shift, the script uses Lorentzian Oscillators to simulate the resonances of the vocal tract. Each vowel is defined by two primary formants, $F_1$ and $F_2$.

The transfer function for the vowel filter is calculated as the sum of two resonant peaks:

$$V(f) = \sum_{i=1}^{2} \frac{A_i}{1 + \left( \frac{f - F_i}{BW/2} \right)^2} + C$$

  • $F_i$: The center frequency of the formant (e.g., $F_1 = 730$ Hz for 'a').
  • $BW$: The bandwidth (set to 100 Hz in the code), determining how "sharp" the vowel sounds.
  • $A_i$: The amplitude/gain of the formant.
  • $C$: A small constant (0.05) to ensure we don't multiply by zero, preserving some background harmonics.

The script uses np.interp to transition smoothly between these frequencies over time. For example, as you move from "i" ($F_2 \approx 2290$ Hz) to "o" ($F_2 \approx 840$ Hz), the filter's second peak slides down the spectrum, creating the characteristic "wa-wa" sweeping sound.
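The vowel filter itself can be sketched in isolation. The snippet below builds $V(f)$ for the vowel 'a' ($F_1 = 730$ Hz, $F_2 = 1090$ Hz) with the 100 Hz bandwidth and amplitudes ($A_1 = 1$, $A_2 = 0.8$, $C = 0.05$) used in the script:

```python
import numpy as np

# Two-formant Lorentzian vowel filter V(f) for the vowel 'a'.
freqs = np.linspace(0, 4000, 4001)   # 1 Hz resolution
f1, f2, bw = 730.0, 1090.0, 100.0

v = (1.0 / (1 + ((freqs - f1) / (bw / 2))**2)
     + 0.8 / (1 + ((freqs - f2) / (bw / 2))**2)
     + 0.05)

# The filter peaks at the formant centres and decays back to the
# constant floor (~0.05) away from them, so distant harmonics are
# attenuated but never fully silenced.
```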

4. Acoustic Impedance ($Z_{bore}$)¶

The variable bore_z represents the Acoustic Impedance of the instrument. In the code, this is applied as:

$$Z_{effective} = |Z_{bore}|^{1.2}$$
The exponent $1.2$ is a "shaping factor." It non-linearly boosts the resonance peaks (where the tube naturally wants to vibrate) relative to the troughs, making the synthesized sound more "percussive" and realistic.
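The effect of the exponent is easy to see on two toy impedance values (a trough and a peak, not real simulation output): raising the magnitude to a power greater than one stretches the peak-to-trough ratio.

```python
import numpy as np

# Toy illustration of the |Z|^1.2 shaping factor.
z = np.array([1.0, 10.0])                 # trough vs resonance peak
ratio_before = z[1] / z[0]                # 10:1
z_shaped = np.abs(z) ** 1.2
ratio_after = z_shaped[1] / z_shaped[0]   # 10^1.2 ≈ 15.85:1
```

An exponent below 1 would instead flatten the resonances, giving a duller, less "percussive" result.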

5. Short-Time Fourier Transform (STFT)¶

Because the vowels change over time, we cannot apply a single filter to the whole sound. The script uses a sliding window approach:

  1. Windowing: The raw signal is chopped into chunks of 2048 samples. To prevent clicking at the edges, it applies a Hanning Window $w[n]$:
    $$w[n] = 0.5 \left( 1 - \cos\left( \frac{2\pi n}{M-1} \right) \right)$$
  2. FFT: Each chunk is converted to the frequency domain using the Fast Fourier Transform (FFT).
  3. Spectral Processing: The multiplication $X(f) \cdot Z(f) \cdot V(f)$ occurs here.
  4. IFFT & Overlap-Add: The signal is converted back to the time domain. Because the windows overlap (hop size of 512), the script sums the overlapping sections to ensure a smooth, continuous transition between vowels.
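The analysis/synthesis loop above can be verified with an identity filter: if nothing is done to the spectrum, overlap-add should reproduce the input up to a constant gain. With a Hann window applied twice (analysis and synthesis) and a hop of one quarter of the window, that gain is 1.5 in the steady-state region:

```python
import numpy as np
from scipy.fft import rfft, irfft

# Overlap-add round-trip with an all-pass "filter": the output should
# be the input times ~1.5 (the COLA gain of the squared Hann window
# at hop = window_size / 4).
fs = 44100
window_size, hop_size = 2048, 512
x = np.sin(2 * np.pi * 65 * np.arange(fs) / fs)   # 1 s test tone

window = np.hanning(window_size)
out = np.zeros(len(x) + window_size)

for start in range(0, len(x) - window_size, hop_size):
    chunk = x[start:start + window_size] * window
    spec = rfft(chunk)        # spectral processing would happen here
    out[start:start + window_size] += irfft(spec) * window

out = out[:len(x)]
```

The same round-trip gain applies to the didgeridoo synthesis, which is harmless because the result is normalized at the end anyway.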

6. Signal Normalization¶

Finally, to prevent digital clipping, the script normalizes the amplitude:

$$x_{norm}[n] = \frac{x[n]}{\max(|x|) + \epsilon}$$
Where $\epsilon$ ($10^{-9}$) is a "safety" value to prevent division by zero if the audio is silent.
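A minimal sketch of the normalization step, including the silent-input edge case that $\epsilon$ guards against (toy sample values):

```python
import numpy as np

eps = 1e-9

# Peak normalization: the loudest sample maps to (just under) 1.0,
# so the int16 scaling by 32767 cannot clip.
x = np.array([0.2, -1.6, 0.8])
x_norm = x / (np.max(np.abs(x)) + eps)

# Silent input: without eps this would divide by zero; with it,
# silence simply stays silence.
silence = np.zeros(4)
silence_norm = silence / (np.max(np.abs(silence)) + eps)
```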