Application examples

Acoustic scene classifier

This tutorial shows how to build a simple acoustic scene classifier with the utilities available in dcase_util. An acoustic scene classifier application usually contains the following stages:

  • Dataset initialization stage: make sure the dataset is downloaded and ready to be used.

  • Feature extraction stage: extract acoustic features for all audio files in the development dataset and store them to disk for easier access later.

  • Feature normalization stage: go through the training material per cross-validation fold and calculate the mean and standard deviation of the acoustic features, used to normalize the feature data later.

  • Learning stage: go through the training material per cross-validation fold, and learn the acoustic models.

  • Testing stage: go through the testing material per cross-validation fold, and estimate the scene class for each test sample.

  • Evaluation stage: evaluate the system output against the ground truth.

This example uses the acoustic scene dataset published for DCASE2013 (10 scene classes), static MFCCs as features, and GMMs as classifiers. The example shows only the bare minimum of code; a usual development system requires better parametrization to make system development easier.

The full code example can be found in examples/asc_gmm_simple.py.
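Before the stage-by-stage walkthrough, the core classification idea can be shown in isolation: one GMM is trained per class, and a test segment is assigned to the class whose model gives the highest accumulated log-likelihood. A minimal sketch with synthetic data (the class labels and toy features are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy two-class data: each row is one feature frame.
rng = np.random.RandomState(0)
class_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
class_b = rng.normal(loc=5.0, scale=1.0, size=(200, 2))

# Train one GMM per class, as done per scene class in the full example.
models = {
    'a': GaussianMixture(n_components=2, random_state=0).fit(class_a),
    'b': GaussianMixture(n_components=2, random_state=0).fit(class_b),
}

# Classify a test segment by summing frame-wise log-likelihoods
# and picking the class with the highest total.
test_frames = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
scores = {
    label: model.score_samples(test_frames).sum()
    for label, model in models.items()
}
estimated_label = max(scores, key=scores.get)
print(estimated_label)  # 'b'
```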

Dataset initialization

This example uses the acoustic scene dataset published for DCASE2013; the dataset class handling it is delivered with dcase_util: dcase_util.datasets.DCASE2013_Scenes_DevelopmentSet.

The dataset needs to be downloaded first, extracted to disk, and prepared for usage:

import os
import dcase_util
# Setup logging
dcase_util.utils.setup_logging()

log = dcase_util.ui.FancyLogger()
log.title('Acoustic Scene Classification Example / GMM')

# Create dataset object and set dataset to be stored under 'data' directory.
db = dcase_util.datasets.DCASE2013_Scenes_DevelopmentSet(
    data_path='data'
)

# Initialize dataset (download, extract and prepare it).
db.initialize()

# Show dataset information
db.show()
# DictContainer :: Class
#   audio_source                      : Field recording
#   audio_type                        : Natural
#   authors                           : D. Giannoulis, E. Benetos, D. Stowell, and M. D. Plumbley
#   microphone_model                  : Soundman OKM II Klassik/studio A3 electret microphone
#   recording_device_model            : Unknown
#   title                             : IEEE AASP CASA Challenge - Public Dataset for Scene Classification Task
#   url                               : https://archive.org/details/dcase2013_scene_classification
#
# MetaDataContainer :: Class
#   Filename                          : data/DCASE2013-acoustic-scenes-development/meta.txt
#   Items                             : 100
#   Unique
#     Files                           : 100
#     Scene labels                    : 10
#     Event labels                    : 0
#     Tags                            : 0
#
#   Scene statistics
#         Scene label             Count
#         --------------------   ------
#         bus                        10
#         busystreet                 10
#         office                     10
#         openairmarket              10
#         park                       10
#         quietstreet                10
#         restaurant                 10
#         supermarket                10
#         tube                       10
#         tubestation                10

Feature extraction

Usually it is most efficient to extract features for all audio files once and store them on disk, rather than extracting them each time the acoustic features are needed. An example of how to do this:

log.section_header('Feature Extraction')

# Prepare feature extractor
extractor = dcase_util.features.MfccStaticExtractor(
    fs=44100,
    win_length_seconds=0.04,
    hop_length_seconds=0.02,
    n_mfcc=14
)
# Define feature storage path
feature_storage_path = os.path.join('system_data', 'features')

# Make sure path exists
dcase_util.utils.Path().create(feature_storage_path)

# Loop over all audio files in the dataset and extract features for them.
for audio_filename in db.audio_files:
    # Show some progress
    log.line(os.path.split(audio_filename)[1], indent=2)

    # Get filename for feature data from audio filename
    feature_filename = os.path.join(
        feature_storage_path,
        os.path.split(audio_filename)[1].replace('.wav', '.cpickle')
    )

    # Load audio data
    audio = dcase_util.containers.AudioContainer().load(
        filename=audio_filename,
        mono=True,
        fs=extractor.fs
    )

    # Extract features, store them in a FeatureContainer, and save it to disk
    features = dcase_util.containers.FeatureContainer(
        filename=feature_filename,
        data=extractor.extract(audio.data),
        time_resolution=extractor.hop_length_seconds
    ).save()

log.foot()
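
When the script is re-run, extraction can be skipped for audio files whose features are already on disk. A minimal sketch of this caching pattern (the helper and the placeholder file writing are illustrative; in the loop above the check would simply guard the FeatureContainer save):

```python
import os
import tempfile

def extract_if_missing(feature_filename, extract_fn):
    # Run the (possibly expensive) extractor only when the cached
    # feature file does not exist yet.
    if os.path.isfile(feature_filename):
        return 'cached'
    data = extract_fn()
    # In the loop above this would be FeatureContainer(...).save();
    # a plain text file stands in for it here.
    with open(feature_filename, 'w') as f:
        f.write(repr(data))
    return 'extracted'

# Usage with a dummy extractor:
path = os.path.join(tempfile.mkdtemp(), 'a001.cpickle')
first = extract_if_missing(path, lambda: [1.0, 2.0])   # 'extracted'
second = extract_if_missing(path, lambda: [1.0, 2.0])  # 'cached'
```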

Feature normalization

In this stage, the training material is gone through per cross-validation fold, and the mean and standard deviation of the acoustic features are calculated. These normalization factors are used to normalize the feature data before using it in the learning and testing stages.
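
Conceptually, the Normalizer accumulates per-dimension statistics over the training files and later applies z-score normalization. A minimal NumPy sketch of the same idea (toy values; not the dcase_util implementation):

```python
import numpy as np

# Two training "files" of feature data, shaped (feature_dim, frames)
# like the FeatureContainer data in this example.
file_1 = np.array([[1.0, 2.0, 3.0],
                   [10.0, 20.0, 30.0]])
file_2 = np.array([[4.0, 5.0, 6.0],
                   [40.0, 50.0, 60.0]])

# Accumulate statistics over all training frames, per feature dimension.
all_frames = np.hstack([file_1, file_2])
mean = all_frames.mean(axis=1, keepdims=True)
std = all_frames.std(axis=1, keepdims=True)

# Z-score normalization, as applied in the learning and testing stages;
# per-dimension means become ~0 and standard deviations ~1.
normalized = (all_frames - mean) / std
```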

Code:

log.section_header('Feature Normalization')

# Define normalization data storage path
normalization_storage_path = os.path.join('system_data', 'normalization')

# Make sure path exists
dcase_util.utils.Path().create(normalization_storage_path)

# Loop over all cross-validation folds and calculate mean and std for the training data
for fold in db.folds():
    # Show some progress
    log.line('Fold {fold:d}'.format(fold=fold), indent=2)

    # Get filename for the normalization factors
    fold_stats_filename = os.path.join(
        normalization_storage_path,
        'norm_fold_{fold:d}.cpickle'.format(fold=fold)
    )

    # Normalizer
    normalizer = dcase_util.data.Normalizer(filename=fold_stats_filename)

    # Loop through all training data
    for item in db.train(fold=fold):
        # Get feature filename
        feature_filename = os.path.join(
            feature_storage_path,
            os.path.split(item.filename)[1].replace('.wav', '.cpickle')
        )

        # Load feature matrix
        features = dcase_util.containers.FeatureContainer().load(
            filename=feature_filename
        )

        # Accumulate statistics
        normalizer.accumulate(features.data)

    # Finalize and save
    normalizer.finalize().save()

log.foot()

Model learning

In this stage, the training material is gone through per cross-validation fold, and an acoustic model is learned for each scene class and stored.

Code:

log.section_header('Learning')

# Imports
from sklearn.mixture import GaussianMixture
import numpy

# Define model data storage path
model_storage_path = os.path.join('system_data', 'model')

# Make sure path exists
dcase_util.utils.Path().create(model_storage_path)

# Loop over all cross-validation folds and learn acoustic models
for fold in db.folds():
    # Show some progress
    log.line('Fold {fold:d}'.format(fold=fold), indent=2)

    # Get model filename
    fold_model_filename = os.path.join(
        model_storage_path,
        'model_fold_{fold:d}.cpickle'.format(fold=fold)
    )

    # Get filename for the normalizer
    fold_stats_filename = os.path.join(
        normalization_storage_path,
        'norm_fold_{fold:d}.cpickle'.format(fold=fold)
    )

    # Normalizer
    normalizer = dcase_util.data.Normalizer().load(filename=fold_stats_filename)

    # Collect class wise training data
    class_wise_data = {}
    for scene_label in db.scene_labels():
        class_wise_data[scene_label] = []

        # Loop through all training items from specific scene class
        for item in db.train(fold=fold).filter(scene_label=scene_label):
            # Get feature filename
            feature_filename = os.path.join(
                feature_storage_path,
                os.path.split(item.filename)[1].replace('.wav', '.cpickle')
            )

            # Load all features.
            features = dcase_util.containers.FeatureContainer().load(
                filename=feature_filename
            )

            # Normalize features.
            normalizer.normalize(features)

            # Store feature data.
            class_wise_data[scene_label].append(features.data)

    # Initialize model container.
    model = dcase_util.containers.DictContainer(filename=fold_model_filename)

    # Loop though all scene classes and train acoustic model for each
    for scene_label in db.scene_labels():
        # Show some progress
        log.line('[{scene_label}]'.format(scene_label=scene_label), indent=4)

        # Train acoustic model
        model[scene_label] = GaussianMixture(
            n_components=8
        ).fit(
            numpy.hstack(class_wise_data[scene_label]).T
        )

    # Save model to the disk
    model.save()

log.foot()

Testing

In this stage, the testing material is gone through per cross-validation fold, and the scene class is estimated for each test sample.
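
The decision rule used below can be written out directly: frame-wise log-likelihoods are summed per scene class, and the class with the highest total wins. A minimal NumPy sketch with hypothetical values, mirroring the collapse and max-selection steps:

```python
import numpy as np

# Hypothetical log-likelihood matrix: one row per scene class,
# one column per feature frame.
label_list = ['bus', 'office', 'park']
logls = np.array([[-1.0, -2.0],
                  [-0.5, -0.4],
                  [-3.0, -3.0]])

# Collapsing with operator='sum' amounts to a row-wise sum over frames...
accumulated = logls.sum(axis=1)

# ...and max selection picks the label with the highest total.
estimated = label_list[int(np.argmax(accumulated))]
print(estimated)  # 'office'
```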

Code:

log.section_header('Testing')

# Define results storage path
results_storage_path = os.path.join('system_data', 'results')

# Make sure path exists
dcase_util.utils.Path().create(results_storage_path)

# Loop over all cross-validation folds and test
for fold in db.folds():
    # Show some progress
    log.line('Fold {fold:d}'.format(fold=fold), indent=2)

    # Get model filename
    fold_model_filename = os.path.join(
        model_storage_path,
        'model_fold_{fold:d}.cpickle'.format(fold=fold)
    )

    # Load model
    model = dcase_util.containers.DictContainer().load(
        filename=fold_model_filename
    )

    # Get filename for the normalizer
    fold_stats_filename = os.path.join(
        normalization_storage_path,
        'norm_fold_{fold:d}.cpickle'.format(fold=fold)
    )

    # Normalizer
    normalizer = dcase_util.data.Normalizer().load(filename=fold_stats_filename)

    # Get results filename
    fold_results_filename = os.path.join(
        results_storage_path,
        'res_fold_{fold:d}.txt'.format(fold=fold)
    )

    # Initialize results container
    res = dcase_util.containers.MetaDataContainer(filename=fold_results_filename)

    # Loop through all test files from the current cross-validation fold
    for item in db.test(fold=fold):
        # Get feature filename
        feature_filename = os.path.join(
            feature_storage_path,
            os.path.split(item.filename)[1].replace('.wav', '.cpickle')
        )

        # Load all features.
        features = dcase_util.containers.FeatureContainer().load(
            filename=feature_filename
        )

        # Normalize features.
        normalizer.normalize(features)

        # Initialize log likelihoods matrix
        logls = numpy.ones((db.scene_label_count(), features.frames)) * -numpy.inf

        # Loop through all scene classes and get likelihood for each per frame
        for scene_label_id, scene_label in enumerate(db.scene_labels()):
            logls[scene_label_id] = model[scene_label].score_samples(features.data.T)

        # Accumulate log likelihoods
        accumulated_logls = dcase_util.data.ProbabilityEncoder().collapse_probabilities(
            probabilities=logls,
            operator='sum'
        )

        # Estimate scene label based on max likelihood.
        estimated_scene_label = dcase_util.data.ProbabilityEncoder(
            label_list=db.scene_labels()
        ).max_selection(
            probabilities=accumulated_logls
        )

        # Store result into results container
        res.append(
            {
                'filename': item.filename,
                'scene_label': estimated_scene_label
            }
        )

    # Save results container
    res.save()
log.foot()

Evaluation

In this stage, the system output is evaluated against the ground truth delivered with the dataset.
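
sed_eval also provides class-wise and fold-wise breakdowns; the overall accuracy itself reduces to a simple comparison of estimated and reference labels. A minimal sketch with hypothetical labels keyed by filename:

```python
# Hypothetical reference and estimated scene labels, keyed by filename.
reference = {'a.wav': 'bus', 'b.wav': 'park', 'c.wav': 'office'}
estimated = {'a.wav': 'bus', 'b.wav': 'office', 'c.wav': 'office'}

# Overall file-level accuracy: the share of files whose estimated
# label matches the reference label.
correct = sum(1 for f, label in reference.items() if estimated.get(f) == label)
accuracy = correct / len(reference)
print(round(accuracy, 2))  # 0.67
```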

Code:

log.section_header('Evaluation')

# Imports
import sed_eval

all_res = []
overall = []
class_wise_results = numpy.zeros((len(db.folds()), len(db.scene_labels())))
for fold in db.folds():
    # Get results filename
    fold_results_filename = os.path.join(
        results_storage_path,
        'res_fold_{fold:d}.txt'.format(fold=fold)
    )

    # Get reference scenes
    reference_scene_list = db.eval(fold=fold)
    for item_id, item in enumerate(reference_scene_list):
        # Modify data for sed_eval
        reference_scene_list[item_id]['file'] = item.filename

    # Load estimated scenes
    estimated_scene_list = dcase_util.containers.MetaDataContainer().load(
        filename=fold_results_filename
    )
    for item_id, item in enumerate(estimated_scene_list):
        # Modify data for sed_eval
        estimated_scene_list[item_id]['file'] = item.filename

    # Initialize evaluator
    evaluator = sed_eval.scene.SceneClassificationMetrics(scene_labels=db.scene_labels())

    # Evaluate estimated against reference.
    evaluator.evaluate(
        reference_scene_list=reference_scene_list,
        estimated_scene_list=estimated_scene_list
    )

    # Get results
    results = dcase_util.containers.DictContainer(evaluator.results())

    # Store fold-wise results
    all_res.append(results)
    overall.append(results.get_path('overall.accuracy')*100)

    # Get scene class-wise results
    for scene_label_id, scene_label in enumerate(db.scene_labels()):
        class_wise_results[fold - 1, scene_label_id] = results.get_path(
            ['class_wise', scene_label, 'accuracy', 'accuracy']
        )

# Form results table
cell_data = class_wise_results
scene_mean_accuracy = numpy.mean(cell_data, axis=0).reshape((1, -1))
cell_data = numpy.vstack((cell_data, scene_mean_accuracy))
fold_mean_accuracy = numpy.mean(cell_data, axis=1).reshape((-1, 1))
cell_data = numpy.hstack((cell_data, fold_mean_accuracy))

scene_list = db.scene_labels()
scene_list.extend(['Average'])
cell_data = [scene_list] + (cell_data*100).tolist()

column_headers = ['Scene']
for fold in db.folds():
    column_headers.append('Fold {fold:d}'.format(fold=fold))

column_headers.append('Average')

log.table(
    cell_data=cell_data,
    column_headers=column_headers,
    column_separators=[0, 5],
    row_separators=[10],
    indent=2
)
log.foot()

Results:

Scene                | Fold 1   Fold 2   Fold 3   Fold 4   Fold 5 | Average
-------------------- | ------   ------   ------   ------   ------ | -------
bus                  | 100.00   100.00   100.00   100.00   100.00 |  100.00
busystreet           | 100.00    33.33    33.33   100.00    66.67 |   66.67
office               |  66.67   100.00   100.00    66.67   100.00 |   86.67
openairmarket        |  66.67   100.00     0.00    66.67   100.00 |   66.67
park                 |  33.33    33.33     0.00    33.33    33.33 |   26.67
quietstreet          |  66.67   100.00    33.33    66.67    66.67 |   66.67
restaurant           |  66.67     0.00    66.67     0.00    33.33 |   33.33
supermarket          |  33.33     0.00    33.33     0.00    33.33 |   20.00
tube                 | 100.00    33.33    33.33    66.67    66.67 |   60.00
tubestation          |   0.00    66.67    66.67     0.00     0.00 |   26.67
-------------------- | ------   ------   ------   ------   ------ | -------
Average              |  63.33    56.67    46.67    50.00    60.00 |   55.33