Table of Links
Abstract and 1 Introduction
1.1 The twincode platform
1.2 Pilot Studies
1.3 Other Gender Identities and 1.4 Structure of the Paper
2 Related Work
3 Original Study (Seville Dec, 2021) and 3.1 Participants
3.2 Experiment Execution
3.3 Factors (Independent Variables)
3.4 Response Variables (Dependent Variables)
3.5 Confounding Variables
3.6 Data Analysis
4 First Replication (Berkeley May, 2022)
4.1 Participants
4.2 Experiment Execution
4.3 Data Analysis
5 Discussion and Threats to Validity and 5.1 Operationalization of the Cause Construct — Treatment
5.2 Operationalization of the Effect Construct — Metrics
5.3 Sampling the Population — Participants
6 Conclusions and Future Work
6.1 Replication in Different Cultural Background
6.2 Using Chatbots as Partners and AI-based Utterance Coding
Datasets, Compliance with Ethical Standards, Acknowledgements, and References
A. Questionnaire #1 and #2 response items
B. Evolution of the twincode User Interface
C. User Interface of tag-a-chat
3 Original Study (Seville Dec, 2021)
This section reports the original study carried out at the University of Seville in December 2021, including the experimental settings, most of which are shared with the external replication performed at UC Berkeley in May 2022 and reported in Section 4.
3.1 Participants
In the original study carried out at the University of Seville in December 2021, the participants were third-year students of the Degree in Software Engineering enrolled in any of the three groups of the Requirements Engineering course taught in Spanish[3]. The final number of valid[4] subjects was 92, arranged in 46 pairs. Only 9 students could not finish the study because of technical problems during the tasks. Of the 92 valid subjects, 15 identified as women (16.30%), 1 as non-binary (1.09%), and the rest as men (82.61%) during the registration process.
Note that, although the percentage of women is low, it is above the average percentage in the Degree in Software Engineering at the University of Seville, which unfortunately is close to 11% according to the official statistics for the last academic year [59]. Note also that, because 9 students were dropped for technical reasons, the percentage of women could not be kept the same in the control (6 women, 14.29%) and experimental (9 women, 19.57%) groups as in the sample (16.30%), which was our initial intention.
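The sample composition reported in this section can be re-derived from the raw counts. The following is only a minimal sketch in Python; the variable and function names are illustrative assumptions and are not taken from the study's materials or datasets.

```python
# Counts as reported in Section 3.1 (names are illustrative, not from the study).
valid_subjects = 92
counts = {"woman": 15, "non-binary": 1}

# "The rest" of the valid subjects identified as man.
counts["man"] = valid_subjects - sum(counts.values())


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of each gender identity among the valid subjects.
shares = {identity: percentage(n, valid_subjects) for identity, n in counts.items()}

# Subjects worked in pairs, so the number of pairs is half the sample.
pairs = valid_subjects // 2

for identity, share in shares.items():
    print(f"{identity}: {counts[identity]} subjects ({share:.2f}%)")
print(f"pairs: {pairs}")
```

Under these assumptions the computed shares match the percentages given in the text for the full sample.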
Authors:
(1) Amador Duran, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain;
(2) Pablo Fernandez, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain;
(3) Beatriz Bernardez, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain;
(4) Nathaniel Weinman, Computer Science Division, University of California, Berkeley, Berkeley, USA;
(5) Aslıhan Akalın, Computer Science Division, University of California, Berkeley, Berkeley, USA;
(6) Armando Fox, Computer Science Division, University of California, Berkeley, Berkeley, USA.
[3] There is a fourth group of the Requirements Engineering course, which is taught in English and whose enrolled students are approximately 50% Spanish and 50% Erasmus students coming from other countries in the European Union (EU) or from non-EU countries like Israel or Georgia. They were not invited to participate in the study because their command of Spanish was not good enough to chat with a randomly assigned classmate, who would have undoubtedly identified them as foreign students.
[4] The criteria for considering a subject as valid are strongly dependent on properly performing the experimental tasks, which are described in Section 3.2. The criteria themselves are specified in Section 3.6.