Introduction
In this blog I use a real dataset of exam results to explore how data can help us understand student performance.
The dataset contains information about 200 students, including:
hours_studied per week
sleep_hours per night
attendance_percent in class
previous_scores from earlier assessments
final exam_score in the current exam
By analysing this data, I look at what seems to matter most for exam success. I also use it to show the value of data, how data quality and bias can affect results, the basic data analysis process, and how visualisations make patterns easier to see.
1. The value of data
Data is valuable because it turns guesses into evidence. Instead of arguing about what should help students do well, we can measure it and check.
In education, data can be used to:
Spot students who might be struggling early, based on a mix of attendance, previous scores and recent exam results.
Evaluate whether extra support (like study clubs or tutoring) is actually working.
Help teachers adapt their lessons if a whole class is under-performing in certain topics.
In a wider context, data is valuable in:
Business: companies track sales, website clicks and customer feedback to decide what products to improve or remove.
Healthcare: patient data is used to spot patterns in illnesses and to test if treatments are effective.
Everyday life: apps use our data (for example, screen time or step count) to give feedback and recommendations.
Linked to my dataset
In my dataset, data is valuable because it lets us answer questions such as:
Do students who study more hours actually get higher exam scores?
Does sleeping more help, or is it less important than studying?
How much do previous scores predict current performance?
For example, when I calculated the correlation between hours studied and exam score, I found a strong positive relationship. Students who studied more hours tended to have noticeably higher exam results. In contrast, the link between sleep hours and exam score was much weaker, suggesting that in this dataset, sleep doesn’t explain exam results as strongly as study time.
This shows why data is valuable: it allows us to see which factors matter most in this group of students, instead of just guessing.
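A correlation like the ones mentioned above can be worked out in a few lines of code. The sketch below is one way to do it in plain Python and is not necessarily how the figures in this blog were produced. The function name pearson_r and the sample values are made up for illustration and are not rows from the real dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean (the covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT rows from the real dataset
hours = [2, 4, 6, 8, 10]
scores = [22, 27, 33, 36, 44]
r = pearson_r(hours, scores)
print(round(r, 2))
```

A value close to 1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.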
2. Data quality and data bias
For data to be useful, it has to be good quality and as unbiased as possible.
What makes data high quality?
Good quality data is usually:
Accurate: values are close to the real world (for example, “hours studied” is not randomly made up).
Complete: there are not lots of missing values.
Consistent: the same kind of information is recorded in the same way for every person.
Reasonable: values fall within realistic limits (no one studying 200 hours per week).
Poor quality data can lead to misleading patterns. For example, if half the students forgot to include their study hours, the numbers would not reflect reality.
Bias happens when the data does not represent the wider group properly. For example, only collecting data from very motivated students would overestimate how much the average student studies.
Quality & bias in my dataset
In my dataset:
There are 200 students.
There are no missing values in any of the columns.
The ranges are realistic:
hours_studied goes from 1 to 12 hours per week
sleep_hours ranges between 4 and 9 hours per night
attendance_percent is between 50.3% and 100%
previous_scores range from 40 to 95
exam_score ranges from 17.1 to 51.3
This suggests the data has been cleaned and is fairly consistent.
However, there are still possible bias issues:
The dataset only covers one group of students, so it might not represent all schools or different age groups.
Variables like hours_studied and sleep_hours could be self-reported, which means students might exaggerate how much they study or underestimate how often they stay up late.
It does not include other important factors such as subject difficulty, teacher quality, or personal issues, which may also affect exam scores.
Due to these limitations, the results should be treated with caution, as they may not reflect the abilities of all students.
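The missing-value and range checks described earlier in this section could be sketched roughly as follows. This is a minimal example in plain Python under stated assumptions: the function names summarise_column and out_of_range are hypothetical, the limit of 168 is simply the number of hours in a week, and the sample column is made up rather than taken from the real dataset.

```python
def summarise_column(values):
    """Count missing entries and report the min and max of the rest."""
    present = [v for v in values if v is not None and v != ""]
    numbers = [float(v) for v in present]
    return {
        "missing": len(values) - len(present),
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
    }


def out_of_range(values, low, high):
    """Return the non-missing values that fall outside [low, high]."""
    return [
        float(v)
        for v in values
        if v is not None and v != "" and not (low <= float(v) <= high)
    ]


# Made-up illustrative column, NOT taken from the real dataset
sample_hours = ["3.5", "12", "", "1", "200"]
summary = summarise_column(sample_hours)
suspicious = out_of_range(sample_hours, 0, 168)  # 168 = hours in a week
print(summary, suspicious)
```

Checks like these only catch missing or impossible values. They cannot detect bias, such as students exaggerating their study hours.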
3. The data analysis process
A typical data analysis process can be broken into clear stages:
1. Define the question
o In this project: “Which study habits are most strongly linked to higher exam scores?”
2. Collect the data
o Gather data from surveys, school systems, files, or online sources.
3. Clean and prepare the data
o Remove or deal with missing values
o Fix obvious errors
o Make sure each column has the correct data type (for example, numbers, not text)
4. Explore the data
o Use summary statistics like mean, minimum, maximum and standard deviation
o Look for initial patterns and unusual values
o Create simple graphs to understand the spread of the data
5. Analyse the data
o Use calculations such as correlations, group comparisons or other statistics
o Answer the original question using evidence from the data
6. Visualise and present the findings
o Build charts or graphs to make patterns easier to understand
o Write a conclusion that explains what the data suggests, and any limitations
How I applied this process to my dataset
Question: I focused on how hours_studied, sleep_hours, attendance_percent and previous_scores relate to exam_score.
Collection: The dataset came from Kaggle, a large data publishing website, and was downloaded as a CSV file named student_exam_scores.csv.
Cleaning: I checked for missing values and unrealistic numbers. None were missing, and all values were within sensible ranges.
Exploration: I calculated averages and correlations for each variable, and looked at the minimum and maximum values.
Analysis: I compared exam scores for groups of students (for example, low vs high study hours) and calculated correlation coefficients.
Presentation: I created a scatter plot and then summarised my findings in this blog.
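For readers who want to try the collection and cleaning steps in code, here is a minimal sketch in plain Python. It assumes the CSV has a header row with the six column names listed in this blog. The function name load_rows is hypothetical, and the two sample rows are made up to stand in for the real file.

```python
import csv
import io

NUMERIC_COLUMNS = ["hours_studied", "sleep_hours", "attendance_percent",
                   "previous_scores", "exam_score"]


def load_rows(file_obj):
    """Read CSV rows and convert the numeric columns from text to float."""
    rows = []
    for raw in csv.DictReader(file_obj):
        row = {"student_id": raw["student_id"]}
        for name in NUMERIC_COLUMNS:
            text = raw[name].strip()
            # Keep empty cells as None so they can be counted as missing
            row[name] = float(text) if text else None
        rows.append(row)
    return rows


# Two made-up rows standing in for the real file (values are illustrative only)
sample_csv = io.StringIO(
    "student_id,hours_studied,sleep_hours,attendance_percent,previous_scores,exam_score\n"
    "A1,8.0,7.5,90.0,70,38.5\n"
    "A2,2.5,,65.5,55,24.0\n"
)
rows = load_rows(sample_csv)
missing = sum(1 for r in rows for n in NUMERIC_COLUMNS if r[n] is None)
print(len(rows), "rows,", missing, "missing values")
```

With the real file, the same function could be called as load_rows(f) inside "with open("student_exam_scores.csv", newline="") as f:", assuming the file sits in the working directory.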
4. Data visualisations
Visualisations help turn columns of numbers into something humans can understand quickly.
Using Power BI, I created a graph that shows the trend clearly: the more hours students studied, the higher their scores tended to be.
Here's what I did:
Scatter plot: Hours studied vs exam score
The scatter plot has:
x-axis: hours_studied
y-axis: exam_score
It shows a clear upward trend. Students who study more hours generally achieve higher exam scores. A few students study a lot but still get average scores, and a few study less but still do reasonably well, but the overall pattern is very clear.
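An upward trend like this can also be put into numbers by fitting a straight line through the points. The sketch below uses the standard least-squares formulas in plain Python. It is not something Power BI requires and not how the chart in this blog was made. The function name fit_line and the sample values are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are the same")
    # How x and y vary together
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative values, NOT rows from the real dataset
hours = [1, 3, 5, 7, 9]
scores = [20, 26, 30, 37, 42]
slope, intercept = fit_line(hours, scores)
print(slope, intercept)
```

The slope is the average change in exam score for each extra hour studied, so a positive slope matches an upward trend on the scatter plot.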
5. Chosen dataset
The dataset I used is called “student_exam_scores” and was provided as a CSV file from Kaggle. It contains one row per student and six columns:
student_id
hours_studied
sleep_hours
attendance_percent
previous_scores
exam_score
I chose this dataset because:
It is directly related to education and exam performance, which is familiar and easy to understand.
The variables are things we talk about all the time in real life (study hours, sleep, attendance), so it is interesting to see if the data supports what people usually claim.
It is small enough (200 rows) to analyse on a standard computer, but large enough to see patterns.
6. Analysis of the dataset
Using basic statistics and correlations, I found the following:
6.1 Study hours
hours_studied has a strong positive correlation with exam_score (≈ 0.78).
Low-study students (≤ 3.5 hours) scored about 27.9 on average.
High-study students (≥ 9 hours) scored about 41.2 on average.
This suggests that, in this dataset, more study time is strongly linked to higher exam scores.
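A low vs high comparison like the one for study hours could be computed along these lines. This is a minimal sketch in plain Python that reuses the 3.5 and 9 hour cut-offs quoted above. The function name group_means is hypothetical, and the sample values are made up rather than taken from the real dataset.

```python
def group_means(hours, scores, low_cutoff=3.5, high_cutoff=9):
    """Return the mean score of low-study and high-study students."""
    low = [s for h, s in zip(hours, scores) if h <= low_cutoff]
    high = [s for h, s in zip(hours, scores) if h >= high_cutoff]

    def mean(values):
        # None signals that a group has no students in it
        return sum(values) / len(values) if values else None

    return mean(low), mean(high)


# Made-up illustrative values, NOT rows from the real dataset
hours = [2, 3.5, 6, 9, 11]
scores = [25, 29, 33, 40, 44]
low_mean, high_mean = group_means(hours, scores)
print(low_mean, high_mean)
```

Students between the two cut-offs are left out of both groups, which makes the contrast between low and high study time easier to see.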
6.2 Previous scores
previous_scores has a moderate positive correlation with exam_score (≈ 0.43).
Students with previous scores of 80 or higher averaged about 37.5 in the new exam.
Students with previous scores of 54 or less averaged about 30.1.
This indicates that students who did well before are quite likely to keep performing well, though it is not a perfect prediction.
6.3 Attendance
attendance_percent has a weaker but still positive correlation with exam_score (≈ 0.23).
Students with low attendance (around 62% or less) averaged about 31.3, compared with 36.2 for students with high attendance (at least about 87%).
Attendance seems to matter, but not as strongly as study hours in this dataset.
6.4 Sleep hours
sleep_hours has a small positive correlation with exam_score (≈ 0.19).
Students with low sleep (around 5.3 hours or less) averaged about 32.7.
Students with high sleep (about 8 hours or more) averaged about 35.6.
Students who sleep more score slightly higher on average, but the link is much weaker than the one for study hours.
Overall conclusion from the analysis
Study hours have a much stronger link with exam scores than sleep does. However, this does not prove that studying more causes higher scores, because other hidden factors such as a student's ability, resources or motivation could also affect the outcome.
My blog shows how data can be used effectively to answer questions like "Can studying more benefit me?". By following a clear data analysis process and visualising the data with graphs, it becomes much easier to find out which factors really matter. The data quality and bias issues are a reminder to keep the limits of the dataset in mind and not to over-generalise.