Big Data H
Wed 13 Feb 2019
% Contribution to final
Solo or Group ✓ Solo
Anticipated Hours Submission Instructions
10 per group member
Submit via Moodle one zip or compressed tar file
containing your source code directory (uog- bigdata.tgz or uog-bigdata.zip), plus one README.txt text file
Please Note: This Coursework cannot be Re-Done
Code of Assessment Rules for Coursework Submission
Deadlines for the submission of coursework which is to be formally assessed will be published in course documentation, and work which is submitted later than the deadline will be subject to penalty as set out below.
The primary grade and secondary band awarded for coursework which is submitted after the published deadline will be calculated as follows:
- in respect of work submitted not more than five working days after the deadline
- the work will be assessed in the usual way;
- the primary grade and secondary band so determined will then be reduced by two secondary bands for each working day (or part of a working day) the work was submitted late.
- work submitted more than five working days after the deadline will be awarded Grade H.
Penalties for late submission of coursework will not be imposed if good cause is established for the late submission. You should submit documents supporting good cause via MyCampus.
Penalty for non-adherence to Submission Instructions is 2 bands
You must complete an “Own Work” form via https://webapps.dcs.gla.ac.uk/ETHICS for all coursework UNLESS submitted via Moodle
Big Data (H/SIT) 2018-19
1st Assessed Exercise: HDFS/MapReduce
The goal of this exercise is to familiarize yourselves with the design, implementation and performance testing of Big Data crunching tasks using Hadoop/MapReduce. You will be required to design and implement algorithms for parsing, filtering, projecting, and transforming data, over a relatively large dataset, executing your code on a shared Hadoop cluster.
Dataset and Infrastructure
You will be working on a parsed version of the complete Wikipedia edit history as of January 20081. This is a single large text file (around the 300GB mark in its entirety), in a tagged multi-line format. Each revision history record consists of 14 lines, each starting with a tag and containing a space/tab- delimited series of entries. More specifically, each record contains the following data/tags, one tag per line:
- REVISION: revision metadata, consisting of:
o article_id: a large integer, uniquely identifying each page.
o rev_id: a large number uniquely identifying each revision.
o article_title: a string denoting the page’s title (and the last part of the URL of the page, hence also uniquely identifying each page).
o timestamp: the exact date and time of the revision, in ISO 8601 format; e.g., 13:45:00 UTC 30 September 2013 becomes 2013-09-12T13:45:00Z, where T separates the date from the time part and Z denotes the time is in UTC. (Note: a class that translates such dates into numerical form is provided in the skeleton code).
o [ip:]username: the name of the user who performed the revision, or her DNS-resolved IP address (e.g., ip:office.dcs.gla.ac.uk) if anonymous.
o user_id: a large number uniquely identifying the user who performed the revision, or her IP address as above if anonymous.
- CATEGORY: list of categories this page is assigned to.
- IMAGE: list of images in the page, each listed as many times as it occurs.
- MAIN, TALK, USER, USER_TALK, OTHER: cross-references to pages in other namespaces.
- EXTERNAL: list of hyperlinks to pages outside Wikipedia.
- TEMPLATE: list of all templates used by the page, each listed as many times as it occurs.
- COMMENT: revision comments as entered by the revision author.
- MINOR: a Boolean flag (0|1) denoting whether the edit was marked as minor by the author.
- TEXTDATA: word count of revision’s plain text.
- An empty line, denoting the end of the current record.
To execute and test your implementations, you will be using Hadoop/MapReduce. The dataset is already stored on HDFS, under the path “/user/enwiki/”. For the sake of time, you will be working on random samples of this dataset; however, your code should be able to process the larger file
without any major changes, hence you should try to design for scalability and performance. There are several versions of the datasets under said folder:
- enwiki-20080103-sample.txt: A small random sample (~0.05%) of the complete dataset.
- enwiki-20080103-largersample.txt: A relatively larger (1%) random sample of the complete dataset.
- enwiki-20080103-perftest.txt: A 10% random sample of the complete dataset; this will not be accessible until later this semester.
Try to implement MapReduce programs to process the following types of queries over the dataset:
- Execute a “wordcount” on the whole data file.
- Execute a “wordcount” but only counting occurrences of article_id’s.
- Execute a “wordcount” but only counting how many MAIN outlinks a page has.
These tasks are unassessed and are only provided as ideas for you to practice with MapReduce and the datasets. Do NOT submit source code files for the above.
Your main task is to implement a watered-down version of the PageRank algorithm. Looking at the record you have for each revision, you can pull out:
- The article_title from the REVISION line
- The list of titles of pages linked to (i.e., out-links) from the MAIN line
Now, given a set S of such links, in the form: <source article> <target article>, the PageRank score value for any page u can be expressed as:
PR(u)=0.15 + 0.85 * Sum(PR(v)/L(v)), v: (v,u) S, where L(v) is the number of out-links of page v.
The PageRank score is then computed in rounds. For the first round, all pages have a score of 1.0, and assume any given page has a score of p in a subsequent round. On each round, each page v “contributes” p/L(v) to the PageRank of each of the pages it links to. At the end of each round, the PageRank score of a page u is equal to the sum of the contributions of all pages with an out-link to u, times 0.85, plus 0.15 (= 1 – 0.85). The 0.85 factor is called the “damping factor” of PageRank.
You will be computing PageRank scores at a user-defined point in time. That is, the user of your program will supply (through a command line argument) a date Y; your code should then compute the PageRank score only for article revisions valid at time Y. An article revision is deemed valid iff it is the latest revision for that article that predates Y; i.e., iff (a) the revision’s creation date C predates Y and (b) there is no other revision for the same article with a creation date C’ so that C<C’<Y. You also need to make sure that the graph on which you will execute the PageRank algorithm is a simple graph; i.e., there is only one directed edge between any pair of source and destination nodes (if there are more, simply ignore them) and no self-loops. This means that you will need to filter out self-loops and multiple occurrences of outlinks. Last, treat articles that appear only as the targets of
outlinks, as existing articles with no outlinks; as in this case there will be no revision/date information for these articles, assume that they have a single revision whose creation time equals the time supplied by the user.
You will need to implement the above algorithm using Hadoop MapReduce and to test it out on the provided datasets. Your code should be such that it can be executed against the dataset with a single command, taking five arguments: (a) the path to the input file, (b) the path to the output directory (assumed non-existent), (c) the number X of iterations for the PageRank algorithm (integer >= 1), and (d) the date Y for which the PageRank scores will be computed (in ISO8601 format); i.e., the code should be executable like so:
$ env HADOOP_CLASSPATH=/path/to/your.jar hadoop MainClass \
/path/to/input.txt /path/to/output X Y
Your output files should contain one line of text per result, containing the article title and its PageRank score separated by a single space; i.e.:
Article_1 score1 Article_2 score2 Article_3 score3
Do not worry about sorting/merging of the final output files.
Solutions against the sample input files will be provided closer to the submission deadline to avoid distracting you from your coding. While working on your implementations, do all your testing against the smallest of the three files (also included in the VM’s HDFS). Once you are confident that your code works correctly and efficiently, move up to the next larger sample (on the cluster). Only execute your code against the perftest file iff you are 100% certain that there are no bugs or other problems. Along the same lines, beware of the resources you request through your code. Teams that hog the cluster will be “named and shamed”…
What to hand in
Use Moodle to submit a single zip or compressed tar file with the contents of the uog-bigdata folder post cleaning, plus a separate README.txt file.
For the former, do NOT submit any class files or other binary files (e.g., jar files)! That is, before creating the first submission file, make sure you execute ‘mvn clean’ in your project’s directory (and that Eclipse or other IDE is closed beforehand so as to avoid having the class files rebuilt in the background). To aid in testing and assessing of your code, please make sure that:
- Your submission file is named uog-bigdata.tgz or uog-bigdata.zip
- When uncompressed, your files will be in a folder named uog-bigdata
- No binaries are included in the submission, unless you have used a third-party library (whose use you should document and justify in your report (more on this shortly)
- No version control system-specific directories are included; e.g., if you are using git to track your code changes, don’t include the .git directory in your submission, etc.
Your README.txt file should outline (a) your solution (key-value format for mappers and reducers, as well as any other configuration details for your jobs), (b) any interesting aspects of your solution (e.g., assumptions you’ve made, optimisations that you thought of, etc.), as well as (c) any modifications you may have done to the provided project/maven build files. Please note that if your solution produces correct results, but this is only true under assumptions that were not explicitly stated in your README.txt file, there may be a deduction of marks.
Your submission will be tested for plagiarism with a specialised off-the-shelf tool. Plagiarism cases will be dealt on a case-by-case basis, but suffice to say there will be little tolerance.
How this exercise will be marked
Following timely submission on Moodle, the exercise will be given a numerical mark between 0 (no submission) and 20 (perfect in every way). The numerical marks will then be converted to a band (A5, A4, etc.). The marking scheme is as follows:
- 3 marks for the quality and readability of the code; make sure that you use appropriate names for variables/methods/classes and that your source code is properly structured in packages and classes, with common functionality abstracted out.
- 3 marks for documentation in the code and your README.txt file; make sure that you
document in your source files, at the very least, the basic steps taken and the key-value pair formats. You can use Javadoc-compliant comments or just plain comments. That said, do not make an essay of your code; use your README file to discuss further details.
- 9 marks for a solution that produces the correct results; partial marks will be awarded if your solution is on the right track but produces incorrect results due to minor errors.
- 5 marks for any optimisations you have come up with, e.g., to reduce the number of jobs
and the amount of data read from disk and/or transferred over the network, to improve parallelism, etc.