General information

Label | Value | Description
Name | AudioCaption: Listen and Tell | Full dataset name
ID | captions/audiocaption | Datalist id for external indexing
Abbreviation | AudioCaption | Official dataset abbreviation, e.g. the one used in the original paper
Provider | SJTU |
Year | 2019 | Dataset release year
Modalities | Audio | Data modalities included in the dataset
Collection name | AudioCaption | Common name for all related datasets, used to group datasets coming from the same source
Research domain | Captioning | Related domains, e.g., Scenes, Mobile devices, Audio-visual, Open set, Ambient noise, Unlabelled, Multiple sensors, SED, SELD, Tagging, FL, Strong annotation, Weak annotation, Multi-annotator
License | Creative Commons, CC BY 4.0 |
Download | Download (13.8 GB) |
Companion site | Site | Link to the companion site for the dataset
Citation | [Wu2019] Audio Caption: Listen and Tell |

Audio

Label | Value | Description
Data | |
  Data type | Audio | Possible values: Audio, Features
File format | |
  File format type | Constant | Possible values: Constant, Variable
Channels | |
Material | |
  Source | Youku, Iqiyi, Tencent | Possible values: Original, Youtube, Freesound, Online, Crowdsourced, [Dataset name]
Content | |
  Content type | Freefield | Possible values: Freefield, Synthetic, Isolated
Recording | |
Files | |
  Count | 7311 files | Total number of files
  Total duration (minutes) | 1218.5 min | Total duration of the dataset in minutes
  File length | Constant | Characterization of the file lengths; possible values: Constant, Quasi-constant, Variable
  File length (seconds) | 10 sec | Approximate length of files
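
The headline numbers are internally consistent: 7311 files at a constant 10 s each gives 73110 s, i.e. 1218.5 min. A quick sanity check of a local copy can be written in a few lines; the sketch below assumes the audio is unpacked as PCM WAV files under a single directory tree (the path is hypothetical, not part of the release).

```python
# Verify file count and total duration against the table above
# (expected: 7311 files, 1218.5 min). The path is a placeholder.
import wave
from pathlib import Path

AUDIO_DIR = Path("audiocaption/audio")  # hypothetical location of the unpacked audio

total_seconds = 0.0
wav_files = sorted(AUDIO_DIR.rglob("*.wav"))
for path in wav_files:
    with wave.open(str(path), "rb") as w:
        total_seconds += w.getnframes() / w.getframerate()

print(f"{len(wav_files)} files, {total_seconds / 60:.1f} min total")
```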

Meta

Label | Value | Description
Types | Caption | List of metadata types provided for the data; possible values: Event, Tag, Scene, Caption, Geolocation, Spatial location, Annotator, Timestamp, Presence, Proximity, etc.
Scene | |
  Classes | 2 | Number of scene classes
  Classes | |
Event | |
  Annotation | |
  Labeling | |
  Instance | |
Caption | |
  Annotation | |
    Languages | Mandarin Chinese | Languages used for annotation
    Source | Crowdsourced | Possible values: Experts, Crowdsourced, Synthetic, Metadata, Automatic
    Captions per item | 3 | Number of annotations available per item (possible multi-annotator setup)
    Validated amount (%) | 100 % | Percentage of the data that has been validated by a human
    Guidance | Video | Type of guidance annotators were given during annotation, e.g. Video, Image, Tags
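
Since every clip carries exactly three crowdsourced Mandarin captions, a natural first step when consuming the annotations is to group captions by clip. The sketch below assumes a flat CSV with one caption per row; the file name and column names are hypothetical, as the actual release format may differ.

```python
# Group captions by clip: each AudioCaption item has 3 Mandarin captions.
# The file and column names below are placeholders, not the official schema.
import csv
from collections import defaultdict

captions: dict[str, list[str]] = defaultdict(list)
with open("audiocaption_annotations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        captions[row["filename"]].append(row["caption"])

# Per the table above, every item should end up with exactly 3 captions.
assert all(len(caps) == 3 for caps in captions.values())
```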

Cross-validation setup

Label | Value | Description
Provided | Yes |
Folds | 1 |
Sets | Dev, Eval | Set types provided in the split; possible values: Train, Test, Val, Dev, Eval
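
With a single provided fold, consuming the split reduces to reading the two set lists once; there is no fold rotation. A minimal sketch, assuming the Dev and Eval memberships are shipped as plain file lists (the paths are hypothetical):

```python
# Load the provided single-fold Dev/Eval split. Paths are placeholders.
from pathlib import Path

dev_files = Path("splits/dev.txt").read_text(encoding="utf-8").split()
eval_files = Path("splits/eval.txt").read_text(encoding="utf-8").split()

# One fold: the same Dev/Eval partition is used for all experiments.
assert not set(dev_files) & set(eval_files), "Dev and Eval must be disjoint"
```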

Baseline

Label | Value | Description
Download | Download | Link to baseline system source code
Citation | [Wu2019] | Paper to cite for the baseline