Fair Human-Centric Image Benchmark (FHIBE) Crowdwork Sheet

Document authors: Wiebke Hutiri, Austin Hoag

Document reviewers: Jerone Andrews, Rebecca Bourke, Aida Rahmattalabi, Jinru Xue, Tiffany Georgievski, Alice Xiang

Last updated: 16 September 2025

Task formulation

What are we asking annotators to do?

1. At a high level, what are the subjective aspects of your task?

Annotation tasks include:

  • Reporting of demographic metadata
  • Reporting observable person attributes, including skin tone, hair type, hair style, eye color and facial hair
  • Reporting environment-level metadata, including weather, illumination and camera position
  • Reporting actions and interactions between image subjects and objects
  • Reporting the time and place of image capture
  • Identifying non-consensual image subjects
  • Providing annotations for segmentation masks, facial landmarks, pose landmarks, facial and person bounding boxes and head pose

All of these annotations present a degree of subjectivity. As far as possible, subjective aspects of annotation tasks have been reduced by:

  • Requiring that all demographic labels of image subjects and data annotators are self-reported.
  • Ensuring that only observable, physical attributes of human bodies are annotated, such as skin tone or hair type.
  • Avoiding the labeling of socially constructed concepts, such as race or ethnicity.
  • Providing visual examples for response options.
  • Providing open-ended response options where possible.

Despite taking care to reduce subjectivity, annotations will always have some variability. For example, pixel-level annotations, such as masks, will not be drawn identically by two different people. Even labels for physical, observable attributes, such as lighting conditions or eye color, may be interpreted differently by different people.

Annotations that were not self-reported are the following:

  • Head pose
  • Keypoints
  • Segmentation masks
  • Facial bounding box
  • Nonconsensual person segmentation mask
  • Nonconsensual person bounding box

2. What assumptions do you make about annotators?

We assumed that vendors would be able to meet Sony’s specifications for annotator diversity.

3. How did you choose the specific wording of your task instructions? What steps, if any, were taken to verify the clarity of task instructions and wording for annotators?

Data collection proceeded in incremental batches, beginning with small batches and gradually scaling up to larger ones. This approach helped us to address issues during annotation, such as questions, misunderstandings, inconsistencies and quality assurance challenges.

We provided vendor QA workers with example images for all annotation types and choice options. We sourced multiple images for each choice of each annotation type (e.g., short head hairstyle), with examples spanning images of people with different skin tones and perceived genders. Where possible, we disambiguated text-based options by including images during the annotation task (e.g., color palettes for skin tone selection).

The following training materials were provided to annotators:

  • Example-based training materials in slides format
  • Training videos on how to use the GUI for QA tasks
  • Open office hours for QA workers, held until workers were sufficiently familiar with the tasks that office hours were no longer needed.

4. What, if any, risks did your task pose for annotators and were they informed of the risks prior to engagement with the task?

This question is also answered in Section 3.13 of FHIBE’s datasheet; the response from the datasheet is reproduced below.

Risk for data subjects

There exists a possibility that upon public release of the dataset, image information or annotations could be:

  1. Used for unauthorized or unintended purposes, such as catfishing or creating deepfakes.
  2. Sold to third parties for various purposes, such as advertising or other commercial or political activities.
  3. Matched to metadata and annotations about data subjects, which can pose a risk to them. For example, in certain jurisdictions, information about someone's self-identified pronouns may be sensitive and create the potential for discrimination on the basis of gender identity.

The likelihood of harm from these activities is comparable to the risks that individuals take in everyday life when posting images and personal information to public websites or on social media platforms. Sony has taken extensive mitigating actions to protect data subjects:

  1. Self-identification: Data subjects could moderate their risk by self-selecting categories for personal and demographic information that is acceptable to their context and circumstances.
  2. Voluntary disclosure: For several self-reported demographic attributes (pronouns, difficulty/disability, pregnancy, sub-continental ancestry) a “Prefer not to say” response option was provided during data collection to make the disclosure of sensitive information voluntary.
  3. Risk awareness: Data subjects were informed of the potential but unlikely risks associated with submitting their data.
  4. Consent revocation: Data subjects can revoke consent to have their personal information removed.
  5. Removal of sensitive metadata: Certain collected metadata will not be made publicly available and only provided in aggregate form. These metadata are: disability, pregnancy, height, weight, country of residence, biologically related subjects.
  6. Secure data storage: All data is stored and hosted securely on an S3 server located in the United States, following established data security protocols of Sony.
  7. Anonymization: Data subjects’ names and email addresses are stored separately from the images and metadata and will not be shared publicly. The unique identifiers with which data subjects can be linked to their names and email addresses have been anonymized.
  8. Terms of use: Individuals accessing the image dataset will be contractually restricted from attempting to re-identify subjects, from using FHIBE for training AI software, algorithms, or other technologies (with the narrow exception of training bias detection or mitigation methods), and from using FHIBE for specific purposes that fall outside of its intended scope.

  9. Data access control: The dataset will be made available on request to users after they agree to our terms of use.

Risk for annotators

  1. Data annotators might be exposed to images that are offensive or triggering.
  2. It is possible that metadata of annotators could be acquired by third parties with malicious intent. However, without access to internal documents, the risk of re-identification and subsequent harm from such activities is very low.

Mitigating actions taken by Sony to protect data annotators:

  1. Restricted image content: Sony explicitly prohibited data subjects from submitting images that contain any offensive content, including, but not limited to: explicit nudity or sexual content; violence; visually disturbing content; rude gestures; drugs; hate symbols; and vulgar text.
  2. Manual checks for offensive content: All images were reviewed manually by Sony's internal QA specialists.
  3. Automated violent content check: Automated checks were performed to detect violent and explicit content.
  4. Consent revocation: Annotators can revoke consent to have their personal data removed.
  5. Secure data storage: All data is stored and hosted securely, following established data security protocols of Sony.
  6. Anonymization: Data annotators’ names and email addresses are stored separately from their demographic information, and will not be shared publicly. The unique identifiers with which data annotators can be linked to their names and email addresses have been anonymized.

Other risks

There exists a possibility that some of the images that appear in the FHIBE dataset were submitted by bad actors who scraped them from the web.

Mitigating actions taken by Sony to protect dataset users and image subjects from these risks:

Sony deployed reverse image matching using Google’s image web search API. While helpful, such automated checks are not fully accurate when applied to person detection.

5. What are the precise instructions that were provided to annotators?

Extensive guidelines were provided to vendors to create or validate the following annotations:

  • Mask segmentation for non-consensual subjects
  • Mask segmentation for consensual subjects
  • Pose landmarks
  • Bounding boxes
  • Head and body pose annotation
  • Camera position annotation
  • Subject-subject and subject-object interaction annotations
  • Hair type, hair style, skin type

To standardize self-reported attributes, Sony provided a document with recommendations on how to report these attributes and any external annotations. The document lists the available choices for attribute values, the range of the values and their format. To avoid mistakes during annotation, attribute values had to follow the enumeration convention provided by Sony.

Selecting annotators

Who is annotating the data?

6. Are there certain perspectives that should be privileged? If so, how did you seek these perspectives out?

The FHIBE dataset collection and annotation requirements have been intentionally designed with diversity in mind. As the purpose of this dataset is to provide a general-purpose consensual human-centric image dataset of human bodies and faces, there were no particular perspectives that were considered more important than others. However, representation of diverse image subjects was prioritized and carefully considered during data collection. Similarly, we strived to obtain a diverse set of annotators by placing diversity requirements on the recruitment of data annotators.

7. Are there certain perspectives that would be harmful to include? If so, how did you screen these perspectives out?

Image subjects were specifically instructed not to submit images that contain any offensive content, including, but not limited to: explicit nudity or sexual content; violence; visually disturbing content; rude gestures; drugs; hate symbols; and vulgar text. Vendors checked for this content during their internal quality assurance process, and our internal quality assurance team conducted additional checks. Offensive content, images with third-party IP such as logos and trademarks, or images that did not meet specification criteria were withdrawn from the data collection. Sony also performed an automated check for child sexual abuse materials (CSAM) against the National Center for Missing & Exploited Children’s hashed database of known CSAM.

In addition, we did not include annotations or images from people under the age of majority, as it is difficult to ensure their informed consent. We also did not obtain images or annotations from residents or nationals of countries sanctioned by the U.S. Office of Foreign Assets Control (OFAC).

8. Were sociodemographic characteristics used to select annotators for your task? If so, please detail the process.

Recruitment

Before starting recruitment, vendors were contractually required to submit any text or images that were used for project-related recruiting activities to Sony for approval. Furthermore, before starting any data collection or annotation, vendors were contractually required to submit a methodology report to Sony, which included the following details on recruitment:

  • how image subjects and data annotators were recruited (e.g., social media/online/print-media advertisements, in-person recruitment, existing crowd-worker pools, existing in-house teams); and
  • the text and images that were used in advertisements and names of platforms or publications they were published in.

We continuously engaged vendors to ensure that reports were submitted. The submitted documents were validated on an ongoing basis to ensure that our requirements were met, and we initiated corrective actions from vendors where necessary.

Diversity requirements

Sony requested that vendors use demographically diverse annotators with regards to their age, pronouns, and ancestry. If a vendor had multiple annotation and quality assurance stages, annotators at each stage were required to be demographically diverse. The following categories were considered for diversity:

  • Age groups: 18–30, 31–45, 46 and older
  • Pronouns: She/her/hers, He/him/his
  • Ancestry: Africa, Americas, Asia, Europe, Oceania

Note regarding pronouns:

  • Annotators who selected neither or both of “She/her/hers” and “He/him/his” contributed to both the “She/her/hers” and “He/him/his” subgroup counts, i.e., we added 1/2 to both subgroup counts.

Note regarding ancestry:

  • If an annotator selected a region but did not select any subregions within that region, then the annotator contributed to the counts of each subregion within the region. For example, consider an annotator who selected the regions "Americas" and "Asia", as well as the subregions "Central Asia" and "Eastern Asia". In this scenario, the annotator contributed to the 4 subregions of the "Americas" and the 2 selected subregions of "Asia". Therefore, 1/6 was added to each of the subregion counts of "Caribbean", "Central America", "South America", "Northern America", "Central Asia", and "Eastern Asia" (see the sketch below).
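The fractional counting convention described in the two notes above can be summarized with a short sketch. The snippet below is illustrative only, not the tooling actually used for FHIBE; the function names and the region-to-subregion lookup (populated only for the regions in the example) are assumptions.

```python
# Illustrative sketch of the fractional counting convention described above.
# Not the actual FHIBE tooling; names and the region lookup are assumptions.
from collections import defaultdict

# Hypothetical region -> subregion lookup, shown only for the example regions.
SUBREGIONS = {
    "Americas": ["Caribbean", "Central America", "South America", "Northern America"],
    "Asia": ["Central Asia", "Eastern Asia", "South-eastern Asia",
             "Southern Asia", "Western Asia"],
}

def pronoun_counts(selected):
    """Distribute one annotator's count over the pronoun subgroups."""
    counts = defaultdict(float)
    binary = {"She/her/hers", "He/him/his"}
    chosen = binary & set(selected)
    if len(chosen) == 1:
        counts[chosen.pop()] += 1.0
    else:
        # Neither or both selected: add 1/2 to each subgroup count.
        for pronoun in binary:
            counts[pronoun] += 0.5
    return counts

def ancestry_counts(regions, subregions):
    """Distribute one annotator's count over the selected subregions.

    A region selected without any of its subregions contributes all of its
    subregions; the annotator's single count is split evenly over the total.
    """
    counts = defaultdict(float)
    covered = []
    for region in regions:
        chosen = [s for s in subregions if s in SUBREGIONS.get(region, [])]
        covered.extend(chosen if chosen else SUBREGIONS.get(region, []))
    if not covered:
        return counts
    for subregion in covered:
        counts[subregion] += 1.0 / len(covered)
    return counts

# Example from the ancestry note: "Americas" without subregions, "Asia" with two
# selected subregions. Each of the six listed subregions receives 1/6 (~0.167).
print(ancestry_counts(["Americas", "Asia"], ["Central Asia", "Eastern Asia"]))
```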

While we strived to obtain a diverse and representative set of annotators across intersectional demographic subgroups, this was not always feasible for operational reasons. We thus relaxed our requirements and accommodated the annotator pools that were accessible to vendors, allowing for a 'best effort' approach from vendors to meet the diversity requirements.

Self-reported annotator details

We collected the following self-reported demographic metadata of annotators:

  • Age (reported in aggregate categories: 18–30 years, 31–45 years, 46 years and older)
  • Pronouns (with option “prefer not to say”)
  • Ancestry
  • Nationality
  • Country of Residence [only released as aggregate statistics or on request]
  • Contact Information [not released and stored separately]
  • Consent Form [not released and stored separately]

Providing self-reported demographic information was a requirement for crowdsourced annotators, as detailed in the project’s informed consent form that was required to join the project as an annotator. For vendor employees who worked as annotators, submitting demographic information was optional.

9. If you have any aggregated sociodemographic statistics about your annotator pool, please describe.

Annotator demographics are provided in Section D of FHIBE’s supplementary materials.

10. Do you have reason to believe that sociodemographic characteristics of annotators may have impacted how they annotated the data? Why or why not?

We collected annotator demographics where possible because we acknowledged a priori that annotator attributes may influence perception and thus also annotation. We strived to obtain a diverse set of annotators, but this was not always feasible for operational reasons. Additionally, we took the following measures to reduce the influence of annotator bias:

  • Annotators only annotated more objective, perceivable attributes
  • We provided examples, trainings, and opportunities for annotators to ask questions
  • We conducted gold standard checks and found that inter-rater reliability was high.

11. Consider the intended context of use of the dataset and the individuals and communities that may be impacted by a model trained on this dataset. Are these communities represented in your annotator pool?

FHIBE is intended to be used as a diverse human-centric dataset to evaluate a variety of computer vision tasks. The dataset has not been created for a specific context of use. However, the individuals represented in the dataset are diverse across multiple demographic axes and the annotators that labeled the dataset are as diverse as possible, thus supporting the use of FHIBE for evaluating human-centric computer vision models in geographically diverse contexts.

Platform and infrastructure choices

Under what conditions are data annotated?

12. What annotation platform did you utilize?

The choice of annotation platform was at the discretion of vendors, provided that it met the specifications that we provided.

13. At a high level, what considerations informed your decision to choose this platform?

The specifications that we provided to vendors described the desired properties of our dataset and associated annotations. The specifications relevant for the vendors’ choice of annotation platform included:

  • Potential use cases for the dataset
  • The ability for annotators to report hours and be compensated for their work
  • The specific annotations required, e.g., face bounding boxes, segmentation masks, head pose, etc.
  • The quality assurance tasks required for reviewing/editing each annotation task
  • The number of annotators required for each annotation and quality assurance task.
  • The ability to deliver annotations in their “raw” form as well as after quality assurance reviews/edits.
  • The ability to collect and provide annotator demographic information.
  • The ability to distinguish between consensual and nonconsensual image subjects.

14. Did the chosen platform sufficiently meet the requirements you outlined for annotator pools? Are any aspects not covered?

Yes. In some minor cases, when a vendor’s annotation platform was unable to meet our specifications, we slightly relaxed our requirements. For example, our specifications required that if an annotator self-selected a pronoun of 'none of the above' or 'prefer not to say', then they were prohibited from selecting an additional pronoun option. One of the vendor platforms could not limit the number of pronoun selections. We allowed the vendor to contact the annotator to resolve ambiguity when this situation arose. 

15. What, if any, communication channels did your chosen platform offer to facilitate communication with annotators? How did this channel of communication influence the annotation process and/or resulting annotations?

During the annotation process, we held weekly FAQ sessions with data vendors to address any questions that arose. Furthermore, the informed consent form included an email address for Sony that data subjects could contact if there were any questions about the project. We closely monitored the email account for any correspondence. 

16. How much were annotators compensated? Did you consider any particular pay standards, when determining their compensation? If so, please describe.

Sony required vendors to pay annotators, at minimum, the legal minimum wage per hour of work. The minimum wage was based on the country in which the annotator resided. For countries with no legal minimum wage, vendors were instructed to propose a rate and obtain Sony’s approval in advance.

Before commencing any annotation work, vendors were obligated to submit a methods report as part of the deliverables. If vendors required updates to their methods, the report had to undergo re-approval before any changes could be implemented. The report documented:

  • How image subjects/annotators will be recruited (e.g., social media/online/print-media advertisements, in-person recruitment, existing crowd-worker pools, existing in-house teams).
  • For social media/online/print-media advertisements, the text and images that will be used in the advertisements must be provided, along with names of platforms/publications they will be published on/in.
  • How much image subjects/annotators located in each country/region will be paid per hour, and how this hourly rate was determined.
  • How many hours of work they will be paid for, and how this was calculated.
  • When and how image subjects/annotators will be informed of the rate per hour or total compensation.

Dataset analysis and evaluation

What do we do with the results?

17. How do you define the annotation quality in your context, and how did you assess quality in your dataset?

We conducted intermittent test-driven consistency and reliability checks during the data collection process, and used the results as a measure of annotation quality. The methods for annotation quality assurance are described in detail in the FHIBE paper’s Methods supplement. The process for testing intra- and inter-vendor annotation reliability is expanded on here:

  1. Sample 70 images.
  2. Distribute sampled images and associated metadata to all vendors. Each vendor should annotate the images three times, with randomly assigned QA workers and annotators reviewing them each time.
  3. The vendors return the annotations so that predefined intercoder reliability checks can be performed. The tests are conducted within each vendor (intra-vendor) and between vendors (inter-vendor). This means that for each image and metadata value, there will be a score quantifying (a) the reliability of annotations generated by QA workers and annotators from each vendor, and (b) the overall reliability and conformity of annotations across vendors.
  4. Evaluate calculated scores and, if necessary, give vendors guidelines to adapt their annotation practices.

For all tests, the calculation of values took place at the final output of the annotation pipeline, i.e., after the last QA round.
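As an illustration of how such scores can be aggregated, the sketch below assumes a generic agreement(a, b) function (for example, one of the metrics described under question 17) and a hypothetical structure in which each vendor contributes a list of repeated, post-QA annotations for a given image and attribute. It is a minimal sketch of intra- and inter-vendor aggregation, not the actual FHIBE QA tooling.

```python
# Minimal sketch of intra-/inter-vendor score aggregation; not FHIBE's tooling.
# `agreement(a, b)` is a placeholder for a metric such as Jaccard, Dice or IoU.
from itertools import combinations
from statistics import mean

def intra_vendor_score(repeats, agreement):
    """Mean pairwise agreement among one vendor's repeated post-QA annotations
    of the same image and attribute."""
    return mean(agreement(a, b) for a, b in combinations(repeats, 2))

def inter_vendor_score(per_vendor_repeats, agreement):
    """Mean pairwise agreement between annotations produced by different vendors."""
    cross_pairs = [
        (a, b)
        for vendor_a, vendor_b in combinations(per_vendor_repeats, 2)
        for a in vendor_a
        for b in vendor_b
    ]
    return mean(agreement(a, b) for a, b in cross_pairs)
```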

Additionally, we compared the vendor-provided annotations with the average annotations from three internal expert annotators on a randomly sampled set of 500 images for each annotation type.

The type of tests that were conducted depended on the type of annotations, as described below.

Self-reported features that QA cannot change

These features are: age, pronoun, nationality, country of residence, ancestry, disability/difficulty, natural skin tone, natural eye color, natural head hair type, natural head hair color, natural face hair color, weight, height, biologically related subjects, pregnancy status.

  • No reliability tests are possible.
  • Checked for systematic errors only.

Self-reported features that QA can update

These features are: nonconsensual person visible, apparent skin tone, apparent eye color, apparent head hair type, apparent head hair color, hairstyle, facial hairstyle, apparent face hair color, facial marks, body pose, subject-object interaction, subject-subject interaction, illumination, one-hour time window of capture, date of capture, weather, scene, camera position.

  • For single-selection labels: calculate the Jaccard similarity coefficient. Values > 0.75 are generally acceptable.
  • For multiple-selection labels: calculate the Jaccard similarity coefficient. Values > 0.75 are generally acceptable.
  • For labels with continuous values: although the values are numeric, we still calculate the Jaccard similarity coefficient, because changes to these values are sparse and continuous correlation metrics do not provide meaningful information. Values > 0.75 are generally acceptable.

Pixel-level annotations that QA can change

These annotations are: keypoints, segmentation masks and face bounding boxes.

  • Segmentation masks: for each mask, we calculate the Sørensen–Dice coefficient across the sets of values received. Values > 0.9 are generally acceptable.
  • Face bounding box: we calculate the mean Intersection over Union (IoU) of the sets of values received. Values > 0.9 are generally acceptable. (Implementations of these agreement metrics are sketched below.)
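The following is a minimal sketch of the three agreement metrics named above: Jaccard similarity for categorical labels, Sørensen–Dice for segmentation masks, and IoU for face bounding boxes. It assumes labels are passed as Python sets, masks as boolean NumPy arrays, and boxes as (x_min, y_min, x_max, y_max) tuples; it is illustrative and not the FHIBE QA implementation.

```python
# Illustrative implementations of the agreement metrics referenced above.
# Not the FHIBE QA tooling; input formats are assumptions (label sets,
# boolean NumPy masks, and (x_min, y_min, x_max, y_max) boxes).
import numpy as np

def jaccard(labels_a, labels_b):
    """Jaccard similarity between two label sets; single selections are
    treated as one-element sets. Values > 0.75 are generally acceptable."""
    a, b = set(labels_a), set(labels_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def dice(mask_a, mask_b):
    """Sørensen–Dice coefficient between two boolean segmentation masks.
    Values > 0.9 are generally acceptable."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return 2.0 * intersection / total if total else 1.0

def iou(box_a, box_b):
    """Intersection over Union of two face bounding boxes given as
    (x_min, y_min, x_max, y_max). Values > 0.9 are generally acceptable."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union else 0.0
```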

18. Have you conducted any analysis on disagreement patterns? If so, what analyses did you use and what were the major findings?

We evaluated the robustness of some of the image/subject attributes as well as the pixel level annotations (face bounding boxes, segmentation masks, and keypoints). The test results are summarized in Section E of the FHIBE paper’s Supplementary Materials.

For pixel-level annotations, agreement between the collected annotations and in-house expert annotations is above 90%, similar to or higher than related works, demonstrating the robustness and quality of the collected annotations. Furthermore, intra- and inter-vendor agreement is also close to or above 90%, confirming the high quality of the collected annotations.

For person and image-specific attributes, intra-vendor agreement is above 80% and inter-vendor agreement is around 70%, which indicates that these labels are noisier.

19. Did you analyze potential sources of disagreement?

Test results were discussed with vendors to provide feedback for quality control during annotation. Our ongoing feedback mechanism with vendors ensured that all vendors passed the intra-vendor annotator agreement tests for keypoint annotations, mask annotations, Intersection over Union of face bounding boxes, and Dice scores of masks. While collecting the data, we corrected disagreement patterns that we observed.

20. How do the individual annotator responses relate to the final labels released in the dataset?

Not assessed.

Dataset release and maintenance

What is the future of the dataset?

21. Do you have reason to believe the annotations in this dataset may change over time? Do you plan to update your dataset?

FHIBE will be updated in connection with consent revocations that impact images in the dataset.

22. Are there any conditions or definitions that, if changed, could impact the utility of your dataset?

With the exception of self-reported demographic labels, all annotations are of physical, observable traits. Culture, language and geographic borders are dynamic. Thus, it is possible that certain labeling choices that were made, such as pronouns, will need to be adapted over time. Similarly, fashion and places change. Thus, clothing, face masks, objects and places of cultural and social significance in this decade may look very different from the same artifacts and places several decades from now. New artifacts may also be invented that are not captured in this dataset. This is to be expected, and the dataset will need to be reassessed against how the world has changed in the future.

23. Will you attempt to track, impose limitations on, or otherwise influence how your dataset is used? If so, how?

Yes, in order to download FHIBE, dataset users must agree to the license and the terms of use. Furthermore, dataset users also need to state their intended use of the dataset in the access form and provide a contact email address. The license and terms of use are available in full at the FHIBE web platform.

The terms of use stipulate, among other restrictions, that FHIBE cannot be used to train machine learning or artificial intelligence software, algorithms, or other technologies. The sole exception is the usage of the data for the explicit development of tools to assess fairness and mitigate biases in machine learning models and implementations.

24. Were annotators informed about how the data is externalized? If changes to the dataset are made, will they be informed?

Yes, each annotator signed an informed consent form as well as other documentation describing the nature of the FHIBE project and how their data will be used. No personally identifiable information of the annotator will be externalized.

25. Is there a process by which annotators can later choose to withdraw their data from the dataset? Please detail.

No personally identifiable information of the annotators will be disclosed in the dataset. However, crowdsourced annotators can revoke their consent with respect to the personal information controlled by us. Such requests will not result in the removal of the annotations themselves, such as keypoints and segmentation masks that they drew or evaluated for quality assurance.