## General information

| Label | Value | Description |
| --- | --- | --- |
| Name | AudioCaption: Listen and Tell | Full dataset name |
| ID | captions/audiocaption | Datalist ID for external indexing |
| Abbreviation | AudioCaption | Official dataset abbreviation, e.g. the one used in the original paper |
| Provider | SJTU | |
| Year | 2019 | Dataset release year |
| Modalities | Audio | Data modalities included in the dataset |
| Collection name | AudioCaption | Common name for all related datasets, used to group datasets coming from the same source |
| Research domain | Captioning | Related domains, e.g. Scenes, Mobile devices, Audio-visual, Open set, Ambient noise, Unlabelled, Multiple sensors, SED, SELD, Tagging, FL, Strong annotation, Weak annotation, Multi-annotator |
| License | Creative Commons CC BY 4.0 | |
| Download | Download (13.8 GB) | |
| Companion site | Site | Link to the companion site for the dataset |
| Citation | [Wu2019] Audio Caption: Listen and Tell | |
    
## Audio

| Label | Value | Description |
| --- | --- | --- |
| Data: Data type | Audio | Possible values: Audio, Features |
| Data: File format type | Constant | Possible values: Constant, Variable |
| Data: Material source | Youku, Iqiyi, Tencent | Possible values: Original, Youtube, Freesound, Online, Crowdsourced, [Dataset name] |
| Content: Content type | Freefield | Possible values: Freefield, Synthetic, Isolated |
| Files: Count | 7311 files | Total number of files |
| Files: Total duration (minutes) | 1218.5 min | Total duration of the dataset in minutes |
| Files: File length | Constant | Characterization of the file lengths; possible values: Constant, Quasi-constant, Variable |
| Files: File length (seconds) | 10 sec | Approximate length of files |
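The file statistics above are self-consistent: with a constant file length, the total duration follows directly from the file count. A quick sanity check, assuming every file is exactly 10 seconds as the constant file length suggests:

```python
# Sanity check: 7311 files at a constant 10 s each should give the
# listed total duration of 1218.5 minutes.
file_count = 7311        # "Count" from the Files table
file_length_s = 10       # "File length (seconds)"

total_minutes = file_count * file_length_s / 60
print(total_minutes)     # 1218.5
```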
    
## Meta

| Label | Value | Description |
| --- | --- | --- |
| Types | Caption | List of metadata types provided for the data; possible values: Event, Tag, Scene, Caption, Geolocation, Spatial location, Annotator, Timestamp, Presence, Proximity, etc. |
| Scene: Classes | 2 | Number of scene classes |
| Caption: Languages | Mandarin Chinese | Languages used for annotation |
| Caption: Annotation source | Crowdsourced | Possible values: Experts, Crowdsourced, Synthetic, Metadata, Automatic |
| Caption: Captions per item | 3 | Number of annotations available per item (possible multi-annotator setup) |
| Caption: Validated amount (%) | 100 % | Percentage of the data validated by humans |
| Caption: Guidance | Video | Type of guidance given to annotators during annotation, e.g. Video, Image, Tags |
                    
    
## Cross-validation setup

| Label | Value | Description |
| --- | --- | --- |
| Provided | Yes | |
| Folds | 1 | |
| Sets | Dev, Eval | Set types provided in the split; possible values: Train, Test, Val, Dev, Eval |
    
## Baseline

| Label | Value | Description |
| --- | --- | --- |
| Download | Download | Link to baseline system source code |
| Citation | [Wu2019] | Paper to cite for the baseline |