| General information | |||||
| Label | Value | Description | |||
|---|---|---|---|---|---|
| Name | Clotho dataset (v2) | Full dataset name | |||
| ID | captions/clotho_v2 | Datalist id for external indexing | |||
| Abbreviation | Clotho | Official dataset abbreviation, e.g., the one used in the original paper | |||
| Provider | TAU | ||||
| Year | 2019 | Dataset release year | |||
| Modalities | Audio | Data modalities included in the dataset | |||
| Collection name | Clotho | Common name for all related datasets, used to group datasets that come from the same source | |||
| Research domain | Captioning, Tagging, Multi-annotator | Related domains, e.g., Scenes, Mobile devices, Audio-visual, Open set, Ambient noise, Unlabelled, Multiple sensors, SED, SELD, Tagging, FL, Strong annotation, Weak annotation, Multi-annotator | |||
| License | Free | ||||
| Download | Download (6.5GB) | ||||
| Companion site | Site | Link to the companion site for the dataset | |||
| Citation | [Drossos2019] Clotho: an Audio Captioning Dataset | ||||
| Audio | |||||
| Label | Value | Description | |||
| Data | |||||
| Data type | Audio | Possible values: Audio | Features | |||
| File format | |||||
| File format type | Constant | Possible values: Constant | Variable | |||
| File format | wav | Possible value: wav | aiff | flac | mp3 | aac | ogg | |||
| Lossy compression | No | Whether the audio is compressed in a lossy manner | |||
| Bit depth | 16 | Bit depth of audio, possible values: 8 | 16 | 24 | 32 | |||
| Sampling rate (kHz) | 44.1 | Sampling rate in kHz, possible values: 8 | 16 | 22.05 | 32 | 44.1 | 48 | |||
| Channels | |||||
| Material | |||||
| Source | Freesound | Possible values: Original | Youtube | Freesound | Online | Crowdsourced | [Dataset name] | |||
| Content | |||||
| Recording | |||||
| Files | |||||
| Count | 6974 files | Total number of files | |||
| File length | Quasi-constant | Characterization of the file lengths, possible values: Constant | Quasi-constant | Variable | |||
| File length (seconds) | 15-30 | Approximate length of files | |||
| Meta | |||||
| Label | Value | Description | |||
| Types | Caption, Tag | List of metadata types provided for the data, possible values: Event, Tag, Scene, Caption, Geolocation, Spatial location, Annotator, Timestamp, Presence, Proximity, etc. | |||
| Scene | |||||
| Event | |||||
| Caption | |||||
| Annotation | |||||
| Languages | English | Languages used for annotation | |||
| Source | Crowdsourced | Possible values: Experts | Crowdsourced | Synthetic | Metadata | Automatic | |||
| Captions per item | 5 | Number of annotations available per item (possible multi-annotator setup) | |||
| Validated amount (%) | 100 % | Percentage of the data that has been validated by a human | |||
| Cross-validation setup | |||||
| Label | Value | Description | |||
| Provided | Yes | ||||
| Baseline | |||||
| Label | Value | Description | |||
| Download | Download | Link to baseline system source code | |||
| Info | |||||
| Label | Value | Description | |||
| Evaluation campaign | DCASE2021 task6 | Evaluation campaigns where the dataset was used. | |||
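
The audio properties listed above (wav, 16-bit, 44.1 kHz, roughly 15-30 seconds per file) can be checked against a downloaded clip. The following is a minimal sketch, assuming the files are plain PCM WAV readable by Python's standard `wave` module. The function name `check_clip` and the constants are hypothetical and are not part of any official Clotho tooling. Channel count is not checked because the table does not state it.

```python
import wave

# Expected values copied from the table above (assumed to hold for every file).
EXPECTED_SAMPLE_RATE_HZ = 44100
EXPECTED_BIT_DEPTH = 16
MIN_SECONDS, MAX_SECONDS = 15.0, 30.0


def check_clip(path):
    """Return a list of mismatches between a WAV file and the listed format."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # bytes per sample -> bits
        duration = wav.getnframes() / rate  # frames / (frames per second)
    if rate != EXPECTED_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz")
    if bit_depth != EXPECTED_BIT_DEPTH:
        problems.append(f"bit depth {bit_depth}")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s")
    return problems
```

An empty list means the clip matches the listed format. Because the file length is described as approximate, the duration bounds may need some tolerance in practice.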