Clotho dataset (v2)

captions

Label				Value	Description
General information
	Name			Clotho dataset (v2)	Full dataset name
	ID			captions/clotho_v2	Datalist id for external indexing
	Abbreviation			Clotho	Official dataset abbreviation, e.g. the one used in the original paper
	Provider			TAU
	Year			2019	Dataset release year
	Modalities			Audio	Data modalities included in the dataset
	Collection name			Clotho	Common name for all related datasets, used to group datasets from the same source
	Research domain			Captioning, Tagging, Multi-annotator	Related domains, e.g., Scenes, Mobile devices, Audio-visual, Open set, Ambient noise, Unlabelled, Multiple sensors, SED, SELD, Tagging, FL, Strong annotation, Weak annotation, Multi-annotator
	License			Free
	Download			Download (6.5GB)
	Companion site			Site	Link to the companion site for the dataset
	Citation			[Drossos2019] Clotho: an Audio Captioning Dataset
Audio
Label				Value	Description
	Data
		Data type		Audio	Possible values: Audio \| Features
		File format
			File format type	Constant	Possible values: Constant \| Variable
			File format	wav	Possible value: wav \| aiff \| flac \| mp3 \| aac \| ogg
			Lossy compression	No	Whether the audio is compressed in a lossy manner
			Bit depth	16	Bit depth of audio in bits, possible values: 8 \| 16 \| 24 \| 32
			Sampling rate (kHz)	44.1	Sampling rate in kHz, possible values: 8 \| 16 \| 22.05 \| 32 \| 44.1 \| 48
		Channels
		Material
			Source	Freesound	Possible values: Original \| Youtube \| Freesound \| Online \| Crowdsourced \| [Dataset name]
	Content
	Recording
	Files
		Count		6974 files	Total number of files
		File length		Quasi-constant	Characterization of the file lengths, possible values: Constant \| Quasi-constant \| Variable
		File length (seconds)		15-30	Approximate length of files
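
The audio properties listed above (wav, 16-bit, 44.1 kHz, roughly 15-30 seconds per file) can be spot-checked on a downloaded file. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the strict 15-30 second bounds are assumptions made for illustration, since the card gives the file length only approximately.

```python
import wave

# Values taken from the Audio table above.
EXPECTED_SAMPLE_RATE_HZ = 44100
EXPECTED_BIT_DEPTH = 16
# Assumption: treat the approximate 15-30 s range as hard bounds.
MIN_SECONDS, MAX_SECONDS = 15.0, 30.0


def check_wav(source):
    """Return a list of mismatches between a wav file and the card.

    `source` is a file path or a binary file-like object.
    An empty list means the file matches all three properties.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        seconds = wav.getnframes() / rate

    problems = []
    if rate != EXPECTED_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate is {rate} Hz")
    if bits != EXPECTED_BIT_DEPTH:
        problems.append(f"bit depth is {bits}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(f"length is {seconds:.2f} s")
    return problems
```

Calling `check_wav` on each file in the extracted archive and collecting the non-empty results gives a quick list of files that deviate from the card.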
Meta
Label				Value	Description
	Types			Caption, Tag	List of metadata types provided for the data, possible values: Event, Tag, Scene, Caption, Geolocation, Spatial location, Annotator, Timestamp, Presence, Proximity, etc.
	Scene
	Event
	Caption
		Annotation
		Languages		English	Languages used for annotation
			Source	Crowdsourced	Possible values: Experts \| Crowdsourced \| Synthetic \| Metadata \| Automatic
			Captions per item	5	How many annotations there are available per item (possible multi-annotator setup)
			Validated amount (%)	100	Percentage of the data that has been validated by a human
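
The "Captions per item: 5" entry can be verified against a captions file. The sketch below assumes a CSV layout with a `file_name` column followed by columns named `caption_1` to `caption_5`; this layout, the function names, and the sample rows in the usage note are illustrative assumptions, not something this card specifies.

```python
import csv
import io

# Value taken from the Meta table above.
CAPTIONS_PER_ITEM = 5


def count_captions(csv_text):
    """Map each file name to its number of non-empty captions.

    Assumes a header with `file_name` and `caption_*` columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {}
    for row in reader:
        captions = [
            value
            for key, value in row.items()
            if key and key.startswith("caption_") and value and value.strip()
        ]
        counts[row["file_name"]] = len(captions)
    return counts


def files_with_wrong_count(counts, expected=CAPTIONS_PER_ITEM):
    """Return the sorted file names whose caption count differs from `expected`."""
    return sorted(name for name, n in counts.items() if n != expected)
```

For example, with made-up rows where `b.wav` has one empty caption field, `files_with_wrong_count(count_captions(text))` reports only `b.wav`.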
Cross-validation setup
Label				Value	Description
		Provided		Yes
Baseline
Label				Value	Description
		Download		Download	Link to baseline system source code
Info
Label				Value	Description
		Evaluation campaign		DCASE2021 task6	Evaluation campaigns where the dataset was used.
