Unlocking Textual Data: A Beginner’s Journey Through Python, NLTK, and spaCy | HackerNoon

News Room
Last updated: 2025/07/15 at 4:28 PM
News Room Published 15 July 2025

Table of Links

Abstract and 1 Introduction

2 Related Work

3 A Virtual Learning Experience

3.1 The Team and 3.2 Course Overview

3.3 Pilot 1

3.4 Pilot 2

4 Feedback

4.1 Relentless Feedback

4.2 Detailed Student Feedback

5 Lessons Learned

6 Summary and Future Work, Acknowledgements, and References

A. Appendix: Three Stars and a Wish

3 A Virtual Learning Experience

3.1 The Team

Our team is made up of three early career academics at the University of Edinburgh. Two teaching fellows have a background in Natural Language Processing with PhDs in Computational Linguistics. The third teaching fellow has a PhD in Computer Science and frequently teaches programming to different types of audiences, including business students as well as students outside of higher education. The author list of this paper also includes a fourth (last) author who was a participant in our first pilot and is a lecturer herself. She has provided us with useful feedback for future iterations of this course (see Section 4.2).

3.2 Course Overview

In our data-driven society, it is increasingly essential for people throughout the private, public and third sectors to know how to analyse the wealth of information society creates each day. Our TDM course gives participants who have little or no coding experience the tools they need to interrogate data. It is designed to teach non-coders how to analyse textual data using Python as the main programming language, and it takes them through the steps needed to analyse and visualise information in large collections of text documents, or corpora.

The course takes place over three three-hour sessions and each session introduces participants to a new topic through a short lecture. The topics build on the previous sessions and at the end of each session there is time for discussion and feedback. In the first session we start with Python for reading in and processing text and teach how individual documents are loaded and tokenised. We work with plain text files but do raise the issue that textual data can be stored in different formats. However, to keep things simple we do not cover other formats in detail in the practical sessions.
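The first-session steps (reading a plain text file and tokenising it) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the course's own material, which relies on NLTK. The file name, the sample sentence and the token pattern are assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: runs of letters/digits, optionally joined by an
# apostrophe or hyphen (so "isn't" and "step-by-step" stay whole).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def load_text(path):
    """Read one plain-text document into a string."""
    return Path(path).read_text(encoding="utf-8")


def tokenise(text):
    """Split a string into lower-cased word tokens, dropping punctuation."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


# Write a tiny sample document so the sketch runs without external data.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_text("Text mining isn't hard. Take it step-by-step!", encoding="utf-8")
    tokens = tokenise(load_text(sample))

print(tokens)
```

A regular expression keeps the example dependency-free. A library tokeniser would usually handle abbreviations, numbers and other edge cases more carefully.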

In the second session we show how this is done using much larger sets of text and add in visualisations. We used two data sets as examples: the Medical History of British India (National Library of Scotland, 2019), made available by the National Library of Scotland[4], and the inaugural addresses of all American Presidents from 1789 to 2017. We show participants how to create concordance lists, token frequency distributions in a corpus and over time, and lexical dispersion plots, and how to perform regular expression searches using Python. In this session we also explain that textual data can be messy and that a lot of time can be spent on cleaning and preparing data in a way that is most useful for further analysis. For example, we point students to stop words and punctuation in the results and explain how to filter them out when creating frequency-based visualisations.
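The ideas from the second session (frequency distributions with stop words filtered out, concordance lists and regular expression searches) might look roughly like this in plain Python. It is a sketch under stated assumptions: the sample text, the tiny stop-word list and the helper names are invented for illustration, and the course itself uses NLTK for these tasks.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was"}


def tokenise(text):
    """Lower-cased word tokens; punctuation is dropped by the pattern."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_distribution(tokens, stop_words=STOP_WORDS):
    """Count tokens, skipping stop words."""
    return Counter(t for t in tokens if t not in stop_words)


def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with `width` tokens of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Made-up sample text standing in for a document from a corpus.
text = (
    "The fever spread in the city. The report of the fever was sent to "
    "the hospital, and the hospital recorded the fever cases."
)
tokens = tokenise(text)
freq = frequency_distribution(tokens)
print(freq.most_common(2))

for line in concordance(tokens, "hospital", width=2):
    print(line)

# Regular expression search: words ending in "ed".
print(re.findall(r"\b\w+ed\b", text))
```

Without the stop-word filter, "the" would dominate the counts, which is the kind of noise the session teaches participants to remove before plotting.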

During the third session we cover POS-tagging and named entity recognition. This last session concludes with a lesson on visualisations of text and derived data by means of text highlighting, frequency graphs, word clouds and networks (see some examples in Figure 1). The underlying NLP tools used for this course are NLTK 3 and spaCy, which are widely used for NLP research and development. This is also where we put some of the course material in the context of our own research to show how it can be applied in practice in a real project. For example, we mentioned our previous work on collecting topic-specific Twitter datasets for further analysis (Llewellyn et al., 2015), on geoparsing historical and literary text (Clifford et al., 2016; Alex et al., 2019a) and on named entity recognition for radiology reports (Alex et al., 2019b; Gorinski et al., 2019).
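One way derived data can feed a network visualisation is by counting which named entities occur in the same document. The sketch below assumes the entities have already been extracted by an NER tool such as spaCy. The entity lists and the function name are hypothetical, and the course's actual approach to building networks is not described in this section.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: entities already extracted per document by an NER tool;
# these lists are made up for illustration.
doc_entities = [
    ["Edinburgh", "Scotland", "Glasgow"],
    ["Edinburgh", "Scotland"],
    ["Glasgow", "Scotland", "Scotland"],
]


def cooccurrence_edges(entity_lists):
    """Count how many documents mention each unordered pair of entities."""
    edges = Counter()
    for entities in entity_lists:
        # Deduplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        for pair in combinations(sorted(set(entities)), 2):
            edges[pair] += 1
    return edges


edges = cooccurrence_edges(doc_entities)
for (source, target), weight in edges.most_common():
    print(f"{source} -- {target}: {weight}")
```

The resulting weighted pairs can be passed to a graph-drawing library as an edge list, with the weight controlling edge thickness.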

Figure 1: Visualisations of text explorations created by the students.

In the two pilots, we ran this course over three afternoon sessions on Monday, Wednesday and Friday, with an office hour on the days in between to sort out any potential technical issues and answer questions. The main learning outcome is that by the end of the course the participants will have acquired initial TDM skills which they can use in their own research and build on by taking more advanced NLP courses or tutorials. A main goal of this course is to teach the material in a clear step-by-step way, so all Python code and examples are specific to each task and do not go in depth into complicated programming concepts, which we believe would confuse complete novices.

Authors:

(1) Amador Durán, SCORE Lab, I3US Institute, Universidad de Sevilla, Sevilla, Spain;

(2) Pablo Fernández, SCORE Lab, I3US Institute, Universidad de Sevilla, Sevilla, Spain;

(3) Beatriz Bernárdez, I3US Institute, Universidad de Sevilla, Sevilla, Spain;

(4) Nathaniel Weinman, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA;

(5) Aslı Akalın, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA;

(6) Armando Fox, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA.


[4] https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/
