(last edit: October 20, 2025)
Due: Thursday, 20 November 2025, 16:00. Late submissions will not be accepted (the semester is ending). This is an individual project.
Note: The teaching staff will only answer questions (via Ed and consultation hours) regarding this assignment up to the late evening of Wednesday 19/11.
Weights and marking criteria: Total number of points: 100. There are 10 points for handing in according to the hand-in instructions, including a voice recording, neat output, and, very importantly, the GitHub repo. There are 20 points for creativity and originality. The remaining 70 points are for the individual tasks of the project.
Submission format: This project should be submitted via a GitHub repo, a single PDF file, and a voice recording via Blackboard. The GitHub repo should have a Jupyter notebook called analysis.ipynb.
Specific instructions for the GitHub repo are below. It is important that the GitHub repo be made private and the course user name uqMATH2504 be invited as a collaborator prior to submission and ideally as soon as possible.
The PDF file should be a nicely formatted file that has:
Your name, student number, and assignment title (Project 3 - 2025) on the top.
A (clickable) link to your GitHub repo.
A PDF printout of your analysis.ipynb notebook including the graphics from Task 5.
Use a PDF merging utility to create the PDF file. Do not worry about extra white space occurring due to Jupyter notebook PDF printouts.
As with previous assignments, you can comment on questions in this PDF (e.g. when asked to answer things not via code). The Jupyter notebook in the GitHub repo should be "runnable". That is, course staff should be able to download the notebook from your repo and run it after activating the environment in the working directory, based on your Project.toml file, which specifies the dependencies used.
Marking responses will be made by the teaching staff via Blackboard.
Individual work: This is an individual project. Plagiarism will not be accepted. Nevertheless, feel free to consult with friends or classmates via Ed and other means. Also feel free to use any Julia package that you find on the web, as long as it is publicly available.
Marking Criteria: There are 100 points total.
10 points are allocated for following instructions and the GitHub repo. Any deviation from the instructions will imply these points are lost.
20 points are allocated for creativity and originality. These points will be given in full to projects that are deemed creative in terms of their analysis and presentation.
70 points are for individual tasks.
In general, points will be deducted for sloppy coding style. Make sure your code is properly indented, uses sensible and consistent variable names, and is clean and consistent overall. That said, in this project you are mostly writing scripting code for data analysis and ML experimentation, so it can be looser in nature. That is, in contrast to previous projects, the code may be less generic. Still, define functions as needed to avoid duplication (copy-pasting) of code where possible.
Ideally use the same GitHub account you used for previous assignments.
Create a new repo for this assignment. Name the repo exactly as <<FIRST NAME>>-<<LAST NAME>>-2504-2025-PROJECT3. So for example if your name is "Ada Lovelace", the repo name should be Ada-Lovelace-2504-2025-PROJECT3.
Make sure the repo is private.
Invite the course GitHub user, uqMATH2504, as a collaborator. Ideally do this early on while working on the project; you must do it no later than the project due date.
Do not make any changes (commits) to the repo after the project due date.
Create a local clone of the repo. It is recommended that you use the git command line via the shell to make changes/additions to the assignment and to submit them. However, you are free to use any other mechanism (VS Code, GitHub Desktop, etc.).
Your GitHub repo should be formatted exactly as follows:
Have a README.md file which states your name, the assignment title, and has a (clickable) link to the assignment instructions on the course website (to this document).
Have a LICENSE.md file. Choose a license as you wish (for example the MIT license). However keep in mind that you must keep the submission private.
Have a .gitignore file.
Have basic instructions on how to run the code.
Have the main Jupyter notebook as instructed above. In the notebook, have using Pkg and Pkg.activate(".") commands at the top so that you work in an environment of your current directory.
Add dependencies to this environment and this will create a Project.toml file which should also be part of your repo.
A data folder with the housing data and any other data needed.
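As a rough sketch of the environment setup described in the list above, the first notebook cell might look like the following. The package names in the comment are only an assumption about what an analysis could use; add whichever packages you actually need.

```julia
using Pkg

# Work in the environment of the current directory (the repo root),
# so that dependencies are recorded in ./Project.toml.
Pkg.activate(".")

# Run once to add dependencies; this creates or updates Project.toml.
# The list below is an assumed example, not a required set.
# Pkg.add(["CSV", "DataFrames", "GLM", "Plots"])

# Show what the active environment currently contains.
Pkg.status()
```

Someone who clones the repo can then typically run Pkg.instantiate() after activating to install the recorded dependencies.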
You will work with a house price dataset available from Kaggle.
Your overarching purpose is to carry out basic exploratory analysis of this dataset and answer several questions that a data scientist working with such a dataset in the real estate world may wish to answer.
We look at the Melbourne_housing_FULL.csv file. You can download the file from Kaggle and place it in a data folder in your repo. Your scripts should access this file.
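A minimal sketch of loading the file with CSV.jl and DataFrames.jl is given here. The helper name load_housing is hypothetical, and the list of strings treated as missing ("", "NA", "#N/A") is an assumption; inspect the raw file to confirm how missing entries are actually encoded.

```julia
using CSV, DataFrames

# Hypothetical helper: read the housing CSV into a DataFrame.
# The missingstring values are an assumption about the file's encoding.
function load_housing(path::AbstractString = joinpath("data", "Melbourne_housing_FULL.csv"))
    return CSV.read(path, DataFrame; missingstring = ["", "NA", "#N/A"])
end

# Only attempt to load if the file has been placed in the data folder.
if isfile(joinpath("data", "Melbourne_housing_FULL.csv"))
    housing = load_housing()
    println(size(housing))
    # Per-column count of missing entries and element types.
    println(describe(housing, :nmissing, :eltype))
end
```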
Please put extra emphasis on neat graphs, proper choices of axes labels, ranges, colors, legends, etc.
As there are missing entries in this dataset, handle missing values for each of the tasks in the way you see best. Only drop data if you really have to (don't drop all rows with missing entries for all tasks). In certain cases, if you see a sensible way to impute (replace/make up data), feel free to do so and explain your imputation strategy.
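To illustrate the difference between dropping rows per task and imputing, here is a small sketch on made-up data. The toy values and the choice of a within-suburb median are assumptions for illustration only, not a recommended strategy for the real dataset.

```julia
using DataFrames, Statistics

# Toy data (invented) with missing entries in two columns.
toy = DataFrame(
    Suburb   = ["A", "A", "A", "B", "B"],
    Price    = [900_000, missing, 1_100_000, 650_000, 700_000],
    Landsize = [400.0, 450.0, missing, missing, 600.0],
)

# Count missing entries per column before deciding what to do.
n_missing = Dict(name => count(ismissing, toy[!, name]) for name in names(toy))

# Option 1: drop only the rows lacking the column a given task needs.
priced = dropmissing(toy, :Price)

# Option 2: impute Landsize with the median of the same suburb.
# Note: this would fail for a suburb where every Landsize is missing.
imputed = transform(
    groupby(toy, :Suburb),
    :Landsize => (x -> coalesce.(x, median(skipmissing(x)))) => :Landsize,
)
```

Dropping by column keeps rows that are still usable for other tasks, which is the point of not dropping everything up front.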
Task 1 Exploratory data analysis of single variables (10 pts)
Create several summaries and plots (e.g. histograms, or cumulative distributions, bar plots, ...) of the following variables: Rooms, Price, Method, Distance, and Landsize. Make sure to choose appropriate diagram types as different types of data require different types of plots.
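As one possible starting point, the sketch below uses a histogram for a numeric variable and a bar plot of counts for a categorical one, using Plots.jl on made-up data. The toy values, the axis labels, and the assumption that prices are in AUD are all illustrative.

```julia
using DataFrames, Statistics, Plots

# Toy data (invented); the real columns come from the housing CSV.
toy = DataFrame(
    Price  = [650_000, 720_000, missing, 1_450_000, 980_000, 2_300_000],
    Method = ["S", "S", "PI", "SP", "S", "VB"],
)

# Numeric variable: summary statistics and a histogram.
prices = collect(skipmissing(toy.Price))
price_summary = (n = length(prices), median = median(prices))

p_price = histogram(prices ./ 1e6; bins = 10, legend = false,
    xlabel = "Price (million AUD)", ylabel = "Number of sales",
    title = "Sale price")

# Categorical variable: counts per category as a bar plot.
method_counts = sort(combine(groupby(toy, :Method), nrow => :count), :count, rev = true)

p_method = bar(method_counts.Method, method_counts.count; legend = false,
    xlabel = "Sale method", ylabel = "Number of sales", title = "Method")

plot(p_price, p_method; layout = (1, 2), size = (900, 350))
```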
Task 2 Exploratory data analysis comparing variables (10 pts)
Create several plots that present the house price as a function of one or more other variables, e.g. distance to the city, house/land size, rooms, car ports, etc. In presenting these plots, visually search for relationships between variables.
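A possible sketch of one such comparison is a scatter plot of price against distance, grouped by number of rooms, together with a grouped summary table. The data is invented, and the log scale on the price axis is just one assumed design choice that may or may not suit the real data.

```julia
using DataFrames, Statistics, Plots

# Toy data (invented).
toy = DataFrame(
    Rooms    = [2, 2, 3, 3, 4, 4],
    Distance = [3.0, 12.5, 5.2, missing, 8.0, 20.1],
    Price    = [850_000, 520_000, 1_300_000, 900_000, missing, 1_100_000],
)

# Keep only rows where both plotted variables are present.
obs = dropmissing(toy, [:Price, :Distance])

# Price against distance, one marker series per number of rooms.
p = scatter(obs.Distance, obs.Price ./ 1e6; group = obs.Rooms,
    yscale = :log10, markeralpha = 0.6, legendtitle = "Rooms",
    xlabel = "Distance to city (km)", ylabel = "Price (million AUD)")

# A numeric companion to the plot: median price per number of rooms.
by_rooms = sort(
    combine(groupby(obs, :Rooms), :Price => median => :median_price, nrow => :n),
    :Rooms,
)
```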
Task 3 Exploratory data analysis over time (10 pts)
Create several plots that visualize trends in the data over time. Specifically plot the total transaction amount, the number of sales, and the fraction of sales which were houses (Type = "h"). You may aggregate values into months.
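The monthly aggregation could be sketched as follows on made-up data. The date format "d/m/y" is an assumption about how the Date column is written in the CSV, and counting sales with a missing price in the number of sales is a design choice you may want to handle differently.

```julia
using DataFrames, Dates

# Toy data (invented); dates assumed to be day/month/year strings.
toy = DataFrame(
    Date  = ["3/09/2016", "17/09/2016", "4/02/2017", "4/02/2017", "25/02/2017"],
    Price = [1_000_000, missing, 800_000, 1_200_000, 500_000],
    Type  = ["h", "u", "h", "h", "t"],
)

# Parse the date strings and map each sale to the first day of its month.
toy.SaleDate = Date.(toy.Date, dateformat"d/m/y")
toy.Month = firstdayofmonth.(toy.SaleDate)

# Per month: total transaction amount, number of sales, fraction of houses.
monthly = combine(
    groupby(toy, :Month),
    :Price => (p -> sum(skipmissing(p))) => :total_amount,
    nrow => :n_sales,
    :Type => (t -> count(==("h"), t) / length(t)) => :frac_houses,
)
sort!(monthly, :Month)
```

Each column of monthly can then be plotted against Month as a time series.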
Task 4 Linear Regression Models (15 pts)
Review your results of Task 2 and try to fit several linear regression models for predicting house price as a function of variables. Use GLM.jl. See usage examples from Chapter 8 of Statistics with Julia or from elsewhere. (Loosely) assess the quality of the linear regression models either via p-values from the statistical output of the models, or by breaking up the data into a training set and validation set. Determine a linear regression model which you find good for predicting house prices. If you believe variable transformations are needed, carry these out.
Task 5 Location data on map (25 pts)
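As a hedged sketch of the workflow (fit on a training set, inspect the coefficient table, evaluate on a validation set), the example below uses GLM.jl on synthetic data. The data-generating formula, the 80/20 split, the log transformation of the price, and the choice of RMSE as the error measure are all assumptions made for illustration.

```julia
using DataFrames, GLM, Random, Statistics

# Synthetic data (invented): log-price depends linearly on Rooms and Distance.
Random.seed!(1)
n = 200
toy = DataFrame(Rooms = rand(1:5, n), Distance = 30 .* rand(n))
toy.Price = exp.(13 .+ 0.25 .* toy.Rooms .- 0.03 .* toy.Distance .+ 0.2 .* randn(n))

# Random 80/20 split into training and validation sets.
idx = shuffle(1:n)
n_train = round(Int, 0.8 * n)
train = toy[idx[1:n_train], :]
valid = toy[idx[n_train+1:end], :]

# Fit a linear model for log(Price); the coefficient table includes p-values.
model = lm(@formula(log(Price) ~ Rooms + Distance), train)
println(coeftable(model))

# Predict on the validation set, back-transform, and compute the RMSE.
pred = exp.(predict(model, valid))
rmse = sqrt(mean((pred .- valid.Price) .^ 2))
println("Validation RMSE: ", round(rmse; digits = 0))
```

Comparing the validation error of several candidate formulas is one loose way to choose between models.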
Your goal here is to create a quality visualisation presenting location data from the Melbourne housing data. The graphics should use a background metropolitan map and present the houses sold in the Melbourne area in 2017, specifically during the month in which you (!) were born (those born in January, please take the data from 1/2018).
Single house purchases should be represented by circles whose size or color scales with the purchase price. Other features are also possible. Presenting additional summarized statistics alongside the graphics is also possible and encouraged.
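The scaling of circles with price could be sketched as follows with Plots.jl. This covers only the marker scaling, not the background map. The helper marker_sizes, its size range, and the toy coordinates and prices are hypothetical.

```julia
using Plots

# Toy coordinates and prices (invented), roughly in the Melbourne area.
lon   = [144.96, 145.00, 144.90, 145.05]
lat   = [-37.81, -37.85, -37.78, -37.90]
price = [0.6e6, 1.2e6, 0.9e6, 2.4e6]

# Hypothetical helper: linearly rescale prices to a marker size range.
function marker_sizes(p; smin = 3.0, smax = 12.0)
    lo, hi = extrema(p)
    # If all prices are equal, use a single mid-range size.
    hi == lo && return fill((smin + smax) / 2, length(p))
    return smin .+ (smax - smin) .* (p .- lo) ./ (hi - lo)
end

sizes = marker_sizes(price)

# Circles whose size and color both scale with price.
plt = scatter(lon, lat; markershape = :circle, markersize = sizes,
    marker_z = price ./ 1e6, markeralpha = 0.7, label = "",
    xlabel = "Longitude", ylabel = "Latitude",
    colorbar_title = "Price (million AUD)")
```

With heavily skewed prices, scaling by the logarithm or the square root of the price may give more readable circles than a linear scale.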
It is recommended to use Plots.jl and other packages such as Images.jl. You'll need to find out how to make it work without help from the course team.
Finally, include a PDF of the graphics and other output in your PDF submission.