6 Conclusions and Future Work
After performing the original study and an external replication, we conclude that, in the original study, we observed no effect of the gender bias treatment, nor any interaction between the perceived partner’s gender and the subject’s gender, in any of the 45 response variables.
In the external replication, we observed statistically significant effects only within the experimental group, i.e., when comparing how subjects acted when they thought their partner was a man or a woman, and only in four of the 45 dependent variables. One variable was related to changes in behavior (source code deletions), and the other three were related to the relative frequency of different types of chat utterances (informal messages, reflections, and yes/no questions). Subjects deleted more source code characters when they perceived their partner as a woman, whereas the relative frequency of informal messages, reflections, and yes/no questions was higher when they perceived their partner as a man. We also observed a lower effectiveness of the treatment in the replication, which could be due to the changes in the gendered avatars but also to the use of a remote setting instead of a controlled environment, such as a laboratory session, free of distractions and interruptions. That lower effectiveness led to a small number of selected subjects in the experimental group. The replication results must therefore be interpreted with caution, both because of the small sample they are based on and because, when FDR adjustments are applied, only the result for the relative frequency of informal messages remains significant.
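To illustrate why several raw findings can lose significance after an FDR adjustment, the following is a minimal sketch of the Benjamini-Yekutieli procedure [5]. This section does not state which FDR procedure was applied, so the choice of [5] is an assumption based on the reference list. The function name and the p-values are hypothetical and are not taken from the study.

```python
def benjamini_yekutieli(p_values):
    """Return Benjamini-Yekutieli adjusted p-values, in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor for arbitrary dependency: c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for six response variables (NOT study data).
raw = [0.0004, 0.02, 0.03, 0.04, 0.30, 0.75]
adjusted = benjamini_yekutieli(raw)
for p, q in zip(raw, adjusted):
    verdict = "significant" if q < 0.05 else "not significant"
    print(f"raw p = {p:.4f}  adjusted p = {q:.4f}  {verdict}")
```

With these made-up values, four raw p-values fall below 0.05 but only the smallest one stays below 0.05 after adjustment, which mirrors the kind of reduction described above.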
These outcomes have raised a number of potential research questions that we plan to address in the future and that are briefly described below.
6.1 Replication in Different Cultural Background
The cultural differences between Spanish and U.S. students could also have influenced the outcomes of both studies, so we would like to replicate the experiment in other countries and analyze the potential differences caused by cultural background.
6.2 Using Chatbots as Partners and AI-based Utterance Coding
Two other research lines we would like to explore in the future are the use of chatbots as pair programming partners and the use of deep learning to automatically code chat utterances, thus reducing the manual effort of carrying out a replication.
Inspired by current trends in Psychology [4, 24], and taking into account not only the absence of significant differences between groups in the original study and the replication but also the difficulties in recruiting a relevant number of subjects, we are considering changing from a between-groups design to a within-subjects design in which each subject performs the pair programming tasks with a chatbot simulating being a man or a woman, instead of with another human subject. Developing such a chatbot is not a trivial task, but current advances in the area, such as LaMDA [10], BERT [14], or GPT-3 [37], make this approach a technical challenge worth exploring. A very relevant aspect in the development of such a chatbot is avoiding gender bias in the training data, as recently studied in [39].
On the other hand, now that we have a relevant number of coded chat utterances in Spanish and English, we could use that labeled dataset to fine-tune a large language model, similar to those used in chatbots to classify user intents, and apply it to the automatic coding of chat utterances, which is one of the most time-consuming tasks we have had to perform as experimenters in our exploratory study. If the results of such a fine-tuned system were accurate, future replications would require much less effort than the two presented in this article, and experimenter bias would be considerably mitigated.
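One possible way to judge whether an automatic coder is accurate enough is to compare its codes with the human-assigned ones using a chance-corrected agreement measure. The sketch below assumes Cohen's kappa for that purpose; the article does not prescribe a measure here, and the function name, the label strings, and the example codes are all hypothetical.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa between two equally long sequences of categorical codes."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same utterances")
    n = len(codes_a)
    if n == 0:
        raise ValueError("at least one coded utterance is required")
    # Observed agreement: share of utterances that received the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's code frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one and the same single code throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for six utterances (NOT data from the study).
human = ["informal", "reflection", "yes/no question",
         "informal", "reflection", "informal"]
model = ["informal", "reflection", "informal",
         "informal", "reflection", "yes/no question"]
print(f"kappa = {cohen_kappa(human, model):.3f}")
```

A kappa close to 1 would indicate that the automatic coder agrees with the human coders well beyond chance; what threshold counts as acceptable is a decision for the experimenters.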
Datasets
The datasets generated and analyzed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.6783717.
Compliance with Ethical Standards
The authors declare that they have no conflict of interest with any aspect of the reported studies.
The experiment protocols were approved by the Institutional Review Board (IRB) at UC Berkeley. At the University of Seville, only studies involving animal experimentation or biomedical experiments with humans need to be approved by the Ethics Committee on Experimentation, so no approval was required in this case.
Acknowledgements
We would like to thank the students who volunteered to participate in the pilot studies, the original experiment and the first replication at the Universities of Seville (US) and California Berkeley (UCB). We also want to thank David Brincau (undergraduate student at US) for their support in the development of the twincode platform; José Sandoval (Master’s student at US) for developing tag-a-chat, the collaborative tool for tagging chat utterances; and Daewon Kwon and Karim el Refai (undergraduate students at UCB) for their support in the evolutionary changes to the twincode platform and in the experiment execution at UCB. We particularly acknowledge Vron Vance (UCB alumnus, Data Analyst at Google) for their assistance regarding inclusive language around gender identity. Last but not least, we would like to thank the anonymous reviewers for their valuable comments and suggestions that helped us improve the quality and clarity of this article.
This work has been partially supported by grants PID2021–126227NB–C21, PID2021–126227NB–C22 funded by MCIN/AEI/10.13039/501100011033 and “ERDF a way of making Europe”; PYC20 RE 084 US, EKIPMENT-PLUS (P18–FR–2895), US-1264651, MEMENTO (US–1381595) funded by Junta de Andalucía/ERDF, UE; FPU19/00666 funded by MCIN/AEI/10.13039/501100011033 and by “ESF Investing in your future”; and Universidad de Sevilla under the 2021 Grants for the Exchange Mobility of Professors, Researchers, and PhD Students between the University of Seville and the University of California.
References
[1] AAUW (2020) The STEM gap: Women and girls in science, technology, engineering and mathematics. American Association of University Women, URL https://www.aauw.org/resources/research/the-stem-gap/
[2] Akalın A, Weinman N, Stasaski K, et al (2021) Exploring the impact of gender bias on pair programming. In: Proceedings of the 17th ACM Conference on International Computing Education Research, pp 435–437, https://doi.org/10.1145/3446871.3469790
[3] Al-Jarrah A, Pontelli E (2016) On the effectiveness of a collaborative virtual pair-programming environment. In: International Conference on Learning and Collaboration Technologies, pp 583–595
[4] Bendig E, Erb B, Schulze-Thuesing L, et al (2019) The next generation: Chatbots in clinical psychology and psychotherapy to foster mental health – a scoping review. Verhaltenstherapie https://doi.org/10.1159/000501812
[5] Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29(4):1165–1188. URL http://www.jstor.org/stable/2674075
[6] Chaparro EA, Yuksel A, Romero P, et al (2005) Factors affecting the perceived effectiveness of pair programming in higher education. In: Proceedings of the 17th Workshop of the Psychology of Programming Interest Group
[7] Choi KS (2013) Evaluating gender significance within a pair programming context. In: Proceedings of the Hawaii International Conference on System Sciences, pp 4817–4825
[8] Choi KS (2015) A comparative analysis of different gender pair combinations in pair programming. Behaviour & Information Technology 34(8):825–837
[9] Cohen L, Manion L, Morrison K (2018) Research Methods in Education, 8th edn. Routledge
[10] Collins E, Ghahramani Z (2021) LaMDA: our breakthrough conversation technology. Google Research, URL https://blog.google/technology/ai/lamda/
[11] Cruz M, Bernárdez B, Durán A, et al (2022) A model-based approach for specifying changes in replications of empirical studies in computer science. Computing URL https://doi.org/10.1007/s00607-022-01133-x
[12] da Silva Estácio BJ, Prikladnicki R (2015) Distributed pair programming: A systematic literature review. Information and Software Technology 63:1–10
[13] Denzin NK (2006) Sociological Methods: A Sourcebook, 5th edn. Aldine Transaction
[14] Devlin J, Chang MW, Lee K, et al (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp 4171–4186, https://doi.org/10.18653/v1/N19-1423, URL https://aclanthology.org/N19-1423
[15] Dimock M (2019) Defining generations: Where millennials end and generation z begins. URL https://pewrsr.ch/2szqtJz
[16] Durán A, Fernández P, Bernárdez B, et al (2021) Gender bias in remote pair programming among software engineering students: The twincode exploratory study. In: Proceedings of ESEM 2021 – Registered Report Track, URL https://arxiv.org/abs/2110.01962
[17] Eckles D, Kizilcec R, Bakshy E (2016) Estimating peer effects in networks with peer encouragement designs. Proceedings of the National Academy of Sciences 113(27):7316–7322
[18] El-Refai K, Kwon D, Brincau D, et al (2023) Twincode: An instrumented platform for pair programming research. In: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 2, p 1264, https://doi.org/10.1145/3545947.3573239
[19] Falessi D, Juristo N, Wohlin C, et al (2018) Empirical software engineering experts on the use of students and professionals in experiments. Empirical Softw Eng 23(1):452–489. https://doi.org/10.1007/s10664-017-9523-3
[20] Galdo AC, Celepkolu M, Lytle N, et al (2022) Pair programming in a pandemic: Understanding middle school students’ remote collaboration experiences. In: Proceedings of the 53rd ACM Technical Symposium on Computer Science Education V. 1, pp 335–341
[21] Gómez O, Solari M, Calvache C, et al (2017) A controlled experiment on productivity of pair programming gender combinations: Preliminary results. In: Proceedings of the XX Ibero–American Conference on Software Engineering, pp 197–210
[22] GraphPad (2023) What is the difference between ordinal, interval and ratio variables? why should i care? URL https://t.ly/rxCW
[23] Gravetter FJ, Wallnau LB (2004) Statistics for the Behavioural Sciences, 6th edn. Wadsworth/Thompson Learning
[24] Greer S, Ramo D, Chang YJ, et al (2019) Use of the chatbot “vivibot” to deliver positive psychology skills and promote well-being among young people after cancer treatment: Randomized controlled feasibility trial. JMIR Mhealth Uhealth 7(10)
[25] Hanks B, Fitzgerald S, McCauley R, et al (2011) Pair programming in education: A literature review. Computer Science Education 21(2):135–173
[26] Hannay JE, Arisholm E, Engvik H, et al (2010) Effects of personality on pair programming. IEEE Transactions on Software Engineering 36(1):61–80. https://doi.org/10.1109/TSE.2009.41
[27] Hawlitschek A, Berndt S, Schulz S (2022) Empirical research on pair programming in higher education: a literature review. Computer Science Education pp 1–29
[28] Hofer SI (2015) Studying gender bias in physics grading: The role of teaching experience and country. International Journal of Science Education 37(17):2879–2905
[29] Hopper J (2014) How to label your 10-point scale. Versta Research, URL https://verstaresearch.com/blog/how-to-label-your-10-point-scale/
[30] Jarratt L, Bowman NA, Culver KC, et al (2019) A large-scale experimental study of gender and pair composition in pair programming. In: Proceedings of the ACM Conference on Innovation and Technology in Computer Science Education, pp 176–181
[31] Katira N, Williams L, Osborne J (2005) Towards increasing the compatibility of student pair programmers. In: International Conference on Software Engineering, pp 625–626, https://doi.org/10.1109/ICSE.2005.1553618
[32] Kaur Chahal K, Kaur A, Saini M (2021) Research and Evidence in Software Engineering: From Empirical Studies to Open Source Artifacts, Taylor & Francis Group, chap Empirical Studies on Using Pair Programming as a Pedagogical Tool in Higher Education Courses: A Systematic Literature Review, pp 251–287
[33] Kaur Kuttal S, Gerstner K, Bejarano A (2019) Remote pair programming in online cs education: Investigating through a gender lens. In: 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp 75–85, https://doi.org/10.1109/VLHCC.2019.8818790
[34] Kitchenham BA, Pfleeger SL, Hoaglin D, et al (2002) Preliminary Guidelines for Empirical Research in Software Engineering. IEEE Transactions on Software Engineering 28(8):721–734
[35] Korber P, Motschnig R (2021) The effects of pair-programming in introductory programming courses with visual and text-based languages. In: 2021 IEEE Frontiers in Education Conference (FIE). IEEE Press, pp 1–9, https://doi.org/10.1109/FIE49875.2021.9637186
[36] Kaur Chahal K, Kaur A, Saini M (2021) Empirical Studies on Using Pair Programming as a Pedagogical Tool in Higher Education Courses: A Systematic Literature Review. Auerbach Publications
[37] Lim R, Wu M, Miller L (2021) Customizing GPT-3 for your application. OpenAI, URL https://openai.com/blog/customized-gpt-3/
[38] Martell RF, Lane DM, Emrich C (1996) Male-female differences: A computer simulation. American Psychologist 51(2):157–158
[39] McAuliffe A, Hart J, Kuttal SK (2022) Evaluating gender bias in pair programming conversations with an agent. In: 2022 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp 1–4, https://doi.org/10.1109/VL/HCC53370.2022.9833146
[40] Navarro D (2018) Learning statistics with R: A tutorial for psychology students and other beginners (version 0.6). URL https://learningstatisticswithr.com/
[41] Newser (2023) This university has the most stressed-out students. URL https://www.newser.com/story/330315/10-most-least-stressed-college-towns.html
[42] O’Connor C, Joffe H (2020) Intercoder reliability in qualitative research: Debates and practical guidelines. International Journal of Qualitative Methods 19:1–13
[43] Porter AA, Votta LG, Basili VR (1999) Building Knowledge through Families of Experiments. IEEE Transactions on Software Engineering 25(4):456–473
[44] Rodríguez FJ, Price KM, Boyer KE (2017) Exploring the pair programming process: Characteristics of effective collaboration. In: Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, pp 507–512
[45] Runeson P (2003) Using students as experiment subjects – an analysis on graduate and freshmen student data. In: Proceedings 7th International Conference on Empirical Assessment & Evaluation in Software Engineering, pp 95–102
[46] Salleh N, Mendes E, Grundy J, et al (2010) The effects of neuroticism on pair programming: an empirical study in the higher education context. In: Proceedings of the 2010 ACM-IEEE international symposium on empirical software engineering and measurement, pp 1–10
[47] Salleh N, Mendes E, Grundy J (2011) Empirical studies of pair programming for cs/se teaching in higher education: A systematic literature review. IEEE Trans Software Eng 37:509–525. https://doi.org/10.1109/TSE.2010.59
[48] Salleh N, Mendes E, Grundy J (2014) Investigating the effects of personality traits on pair programming in a higher education setting through a family of experiments. Empirical Software Engineering 19(3):714–752
[49] Samara O, Monzon A (2021) Zoom burnout amidst a pandemic: Perspective from a medical student and learner. Therapeutic Advances in Infectious Disease 8. https://doi.org/10.1177/20499361211026717
[50] Sfetsos P, Stamelos I, Angelis L, et al (2009) An experimental investigation of personality types impact on pair effectiveness in pair programming. Empirical Software Engineering 14(2):187–226
[51] STEM Women (2021) Percentages of women in STEM statistics. STEM Women, URL https://www.stemwomen.com/women-in-stem-percentages-of-women-in-stem-statistics
[52] Stevens SS (1946) On the theory of scales of measurement. Science 103(2684):677–680. https://doi.org/10.1126/science.103.2684.677, URL https://www.science.org/doi/abs/10.1126/science.103.2684.677
[53] Stotts D, Williams L, N N, et al (2003) Virtual teaming: Experiments and experiences with distributed pair programming. In: Conference on Extreme Programming and Agile Methods, pp 129–141
[54] Study International (2016) Students at these u.s. universities are under the most stress. URL https://www.studyinternational.com/news/students-mental-health-us-universities-stress/
[55] Syed M, Nelson SC (2015) Guidelines for establishing reliability when coding narrative data. Emerging Adulthood 3(6):375–387
[56] Thomas L, Ratcliffe M, Robertson A (2003) Code warriors and code-a-phobes: A study in attitude and pair programming. In: Proceedings of SIGCSE, pp 363–367
[57] UCLA: Statistical Consulting Group (accessed June 29, 2022) What does cronbach’s alpha mean? URL https://stats.oarc.ucla.edu/spss/faq/what-does-cronbachs-alpha-mean/
[58] University of California, Berkeley (2021) Demographic information (restricted access). URL https://calanswers.berkeley.edu/home
[59] University of Seville (2021) Statistical yearbook 2020–2021. URL https://servicio.us.es/splanestu/WS/Anuario2021/AESY20-21.html, English version starts at page 400
[60] Werner LL, Hanks B, McDowell C (2004) Pair-programming helps female computer science students. J Educ Resour Comput 4(1)
[61] Wohlin C, Runeson P, Höst M, et al (2012) Experimentation in Software Engineering: an Introduction. Springer
[62] Xinogalos S, Satratzemi M, Chatzigeorgiou A, et al (2017) Student perceptions on the benefits and shortcomings of distributed pair programming assignments. 2017 IEEE Global Engineering Education Conference (EDUCON) pp 1513–1521
[63] Ying KM, Martin AC, Rodríguez FJ, et al (2021) Cs1 students’ perspectives on the computer science gender gap: Achieving equity requires awareness. In: 2021 Conference on Research in Equitable and Sustained Participation in Engineering, Computing, and Technology (RESPECT), IEEE, pp 1–9
[64] Ying KM, Rodríguez FJ, Dibble AL, et al (2021) Understanding women’s remote collaborative programming experiences: The relationship between dialogue features and reported perceptions. Proceedings of the ACM on Human-Computer Interaction 4(CSCW3):1–29
Authors:
(1) Amador Duran, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain;
(2) Pablo Fernandez, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain;
(3) Beatriz Bernardez, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain;
(4) Nathaniel Weinman, Computer Science Division, University of California, Berkeley, Berkeley, USA;
(5) Aslıhan Akalın, Computer Science Division, University of California, Berkeley, Berkeley, USA;
(6) Armando Fox, Computer Science Division, University of California, Berkeley, Berkeley, USA.