2019-01-08 Data Analysis in R



For the course assignment, you will be expected to retrieve, clean, and analyse a data set. In this document we provide the primary research questions to be answered, information on the structure and format of the final report, information on code that should be submitted, and a brief overview of the marking criteria. You can find the codebook for the data set and the R script template on LEARN. The data for this assignment come from the “Timed picture naming in seven languages” study (Bates et al., 2003), available as part of the International Picture Naming Project (https://crl.ucsd.edu/experiments/ipnp/).

It can be tempting to over-complicate assessments like this, particularly if you have a long time to complete them. The labs have been designed to prepare you for this assignment: to explore data, to conduct appropriate analyses for given data types, and to make decisions that you can justify. Bear in mind that completing this assessment does not require any knowledge that wasn’t covered in lectures, labs, and readings.

What you need to submit

For your assessment you need to submit two documents: your report and your R code. More instructions on how to submit are below. Here, we provide more detail on what to submit.


You need to produce a report answering the assignment questions below. Your report should include appropriate analyses to provide answers to these questions while describing the process and utilising graphics where necessary to illustrate your points.

  • Your report should clearly identify the decisions you made in analysing the data, as well as summarising what can be concluded from your analysis.
  • Figures and tables should be numbered and captioned, and referred to in the text; important statistical outcomes should be summarised in the text.
  • Reporting should follow APA 6th Edition guidelines for the presentation of tables, figures, and statistical results (see final lecture for more information). Alternative style is acceptable so long as it is clear and consistent.
  • Your report should be a maximum of 4 sides of A4 (including tables and figures), in a standard font, size 12, with normal 1 inch margins.


    Your report must be accompanied by an R script (a text file with the extension .R, the default file type when saving a script from RStudio) which can be used to exactly reproduce the results set out in your submitted report. It should include all steps taken in data cleaning and all analyses. Every answer to the assignment tasks/questions given below must be accompanied by the code used to obtain it. You should provide clear and informative comments within the file describing the steps taken. Please download the script template from LEARN and use it to write your script.

    Important: Do not edit the lines of code in the script template that read in the data sets!

    These lines obtain the data to be used for this assignment from the internet and assign them to data frames.

    We will check that the code runs and produces the results presented in your report.

    Any code copied and pasted or otherwise adapted from internet examples should be cited appropriately in the comments. An appropriate citation should include the URL where the code was found, the name of the website or blog, and the original author’s name. In the absence of a proper name, you can cite the contributor’s nickname or alias.

    You can work on the R-script in small groups (no more than 4 students) if preferred. If you do this, it is important that you take a couple of steps:

    1. At the very start of your script include a comment line (line starting with #) which includes the exam numbers (not the names) of those you worked with. For example:

      # Produced in collaboration with students B045329 and B018429

    2. Within the script point out (again using comments) which blocks of code are shared.
    3. Please ensure that your acknowledgements match those of others in your group (if you say you produced the script in collaboration with B045329, we expect B045329 to acknowledge you).
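A minimal sketch of how these acknowledgements might look inside a script. The exam numbers are the ones from the example above; the block labels and the toy data are hypothetical and only illustrate the commenting pattern.

```r
# Produced in collaboration with students B045329 and B018429

# ---- Shared block (written with B045329 and B018429): data cleaning ----
toy <- data.frame(rt = c(812, 950, NA, 1040))      # hypothetical toy data
toy_clean <- toy[!is.na(toy$rt), , drop = FALSE]   # drop rows with missing RT

# ---- Own work: descriptive statistics ----
mean_rt <- mean(toy_clean$rt)
```

Marking the start of each shared block in this way makes it easy to see which parts of the script were written jointly.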

    Important: While the code can be worked on in small groups, the written report must be produced entirely independently. It is not OK to include sections in the written report that are written collaboratively.

    Submission and Marking

    Submitting your work

    All coursework must be submitted before 12:00 (noon) on Monday the 21st of January 2019 via Turnitin. You can access it by clicking on the “Assessment details and submission” tab of the course page on LEARN. There are two sections there, one for each of the two files you are required to submit. You will be asked to provide your name and submission title. The submission title must be your exam number (and nothing more). Your name will not appear anywhere in the documents accessed by the markers. To ensure that the marking is entirely anonymous, please do not include your name or student number anywhere in either of the submitted files.

    Remember, the files you are required to submit are:

  • Report, as described above. The filename must be your exam number with whatever extension is provided by your chosen word processor (e.g., ‘B045329.docx’). The file you create should have your exam number on each page (e.g., in the header or footer).
  • R script which runs all of the data cleaning and the final analyses reported. The filename must be your exam number with the .R extension (e.g., ‘B045329.R’).

    Please ensure that you name your documents exactly as above. File names such as ‘R Script for B04329.R’ or ‘B044329 Report final.docx’ slow down document matching and marking and will result in loss of marks.

    Please check LEARN for detailed instructions on the submission process prior to submitting.

    Marking Criteria

    The code is worth 30% of the coursework marks, and the report is worth 70% of the coursework marks. Work will automatically fail (max mark of 30%) unless both components are submitted.

    You will be assessed on the following:

    1. Appropriate cleaning of the data set and key variables of interest, making appropriate and justified decisions on the steps you take.
    2. Selection of appropriate statistical tests and variables to answer the primary research question and the justifications provided for your selections.
    3. Interpretation of the results of the selected analyses.
    4. R code that runs without errors all the way through and is clear and appropriately commented.

      For handy tips on writing good code, see http://adv-r.had.co.nz/Style.html (no need to stick religiously to the guidelines but following them does make code nice and tidy).

    5. Last but not least: Clarity of writing and formatting. The report should conform to the APA 6th Edition style guidelines for formatting text, tables, and figures, reporting results of statistical analyses, writing style, etc. However, alternative style is acceptable provided it is comparably clear and consistent.


    You are given four separate data sets:

  • df_e is a data set of 520 pictures and their associated variables in an English language picture naming study

  • df_c, df_h, and df_s are data sets of 173 pictures in a Chinese, Hungarian, and Spanish language picture naming study, respectively.

    Each data set uses a different picture set. The codebook for the data sets can be found on LEARN.

    Assignment Questions

    Question 1

    Is there a relationship between the frequency of a target word in the English corpus and reaction time (RT) on a picture-naming task? Once you are content that the data are appropriately cleaned, run the following model:

    m1 <- lm(rttar ~ lnfreq, data = df_e)

    Question 1.1

    Concisely report and interpret the results of the model.

    Question 1.2

    What is the predicted RT for a word with a frequency of 20?

    Hint: Don’t forget that the frequency variable is log-transformed (see codebook for details) and that, in R, exp() is the inverse function of log().
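The hint can be illustrated with a small sketch on simulated data. The toy data frame below only stands in for df_e, and the sketch assumes the transformation is a plain natural log of the raw frequency; the codebook may specify something different, in which case the conversion line needs to change accordingly.

```r
# Simulated stand-in for df_e; the numbers are arbitrary.
set.seed(1)
toy <- data.frame(lnfreq = runif(50, 0, 6))
toy$rttar <- 1100 - 40 * toy$lnfreq + rnorm(50, sd = 60)

m_toy <- lm(rttar ~ lnfreq, data = toy)

# exp() undoes log(): a raw frequency of 20 sits at log(20) on the model's scale.
raw_freq <- 20
ln_freq  <- log(raw_freq)   # ASSUMPTION: lnfreq = log(frequency); check the codebook
back     <- exp(ln_freq)    # recovers the raw frequency

# Predict at the transformed value, not at the raw frequency.
pred <- predict(m_toy, newdata = data.frame(lnfreq = ln_freq))

# The same prediction computed by hand from the coefficients.
by_hand <- coef(m_toy)[["(Intercept)"]] + coef(m_toy)[["lnfreq"]] * ln_freq
```

Passing the raw value 20 as lnfreq would instead ask the model about a log-frequency of 20, which is a very different word.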

    Question 1.3

    Produce and interpret a diagnostic plot of the model that shows whether or not the model residuals are normally distributed.
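One common way to inspect residual normality is a normal Q-Q plot. The sketch below uses simulated data as a stand-in for df_e, so the shape of the plot says nothing about the real model.

```r
# Simulated stand-in for df_e; the numbers are arbitrary.
set.seed(2)
toy <- data.frame(lnfreq = runif(80, 0, 6))
toy$rttar <- 1100 - 40 * toy$lnfreq + rnorm(80, sd = 60)
m_toy <- lm(rttar ~ lnfreq, data = toy)

res <- resid(m_toy)

# Points close to the reference line suggest approximately normal residuals.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# The built-in lm diagnostic gives a similar plot: plot(m_toy, which = 2)
```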

    Question 2

    Do target word length and the number of synonyms a word has have additional effects on RT above and beyond that of word frequency in the English language data set?

    Question 2.1

    Fit an appropriate model to test this question.
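One possible way to test whether extra predictors add to a model is to compare nested models with an F-test. The sketch below uses simulated data; the column names len and n_syn are hypothetical and do not come from the codebook.

```r
# Simulated data; len and n_syn are HYPOTHETICAL names, not codebook variables.
set.seed(3)
n <- 120
toy <- data.frame(
  lnfreq = runif(n, 0, 6),
  len    = sample(1:4, n, replace = TRUE),
  n_syn  = sample(1:5, n, replace = TRUE)
)
toy$rt <- 1000 - 35 * toy$lnfreq + 25 * toy$len + 30 * toy$n_syn +
  rnorm(n, sd = 60)

m_base <- lm(rt ~ lnfreq, data = toy)                # frequency only
m_full <- lm(rt ~ lnfreq + len + n_syn, data = toy)  # plus the two extra predictors

# F-test for the joint contribution of the added predictors.
cmp <- anova(m_base, m_full)

# Change in R-squared between the two models.
r2_change <- summary(m_full)$r.squared - summary(m_base)$r.squared
```

The comparison is only valid when both models are fitted to exactly the same rows, so missing values need to be handled before fitting.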

    Question 2.2

    Run model diagnostics and, if needed, re-fit the model.
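A rough sketch of what running diagnostics and re-fitting might involve, on simulated data with one planted extreme value. The 4/n cut-off for Cook's distance is a common rule of thumb rather than a fixed rule, and excluding observations is only one of several possible responses; whichever you choose needs to be justified in the report.

```r
# Simulated data with one deliberately extreme observation.
set.seed(4)
toy <- data.frame(x = rnorm(60))
toy$y <- 2 + 3 * toy$x + rnorm(60)
toy$y[1] <- 40

m <- lm(y ~ x, data = toy)

# The four standard diagnostic plots in one panel.
par(mfrow = c(2, 2))
plot(m)
par(mfrow = c(1, 1))

# Flag influential cases (rule of thumb: Cook's distance above 4/n).
cd   <- cooks.distance(m)
keep <- cd <= 4 / nrow(toy)

# Re-fit without the flagged cases and compare the coefficients.
m_refit <- lm(y ~ x, data = toy[keep, ])
rbind(original = coef(m), refit = coef(m_refit))
```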

    Question 2.3

    Report and interpret the results of the final model.

    Question 3

    Do variables entered as predictors in the Question 2 model predict whether or not at least one participant will produce an error response on a picture naming task in the English language data set? Pictures for which there are any invalid or incorrect responses should be coded as containing errors.

    Question 3.1

    Fit, report, and interpret an appropriate model to test this question.
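A binary outcome of this kind is often modelled with logistic regression. The sketch below shows the coding step and the model call on simulated data; the columns n_invalid and n_wrong are hypothetical, and the real data set may record errors in a different form (see the codebook).

```r
# Simulated data; n_invalid and n_wrong are HYPOTHETICAL column names.
set.seed(5)
n <- 200
toy <- data.frame(
  lnfreq    = runif(n, 0, 6),
  n_invalid = rpois(n, 0.3),
  n_wrong   = rpois(n, 0.4)
)

# 1 = at least one invalid or incorrect response, 0 = none.
toy$any_error <- as.integer(toy$n_invalid + toy$n_wrong > 0)

# Logistic regression: log-odds of an error as a function of log-frequency.
m_err <- glm(any_error ~ lnfreq, family = binomial, data = toy)

# Exponentiated coefficients are odds ratios.
odds_ratios <- exp(coef(m_err))
```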

    Question 3.2

    What is the predicted probability of a correct response on a picture whose name has a frequency of 12, is 3 syllables long and has only one form?
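If the model predicts the probability of an error, the probability of a correct response is its complement. The sketch below shows this on simulated data with a single predictor. It assumes that lnfreq is a plain natural log of the raw frequency, which should be checked against the codebook.

```r
# Simulated data with a single predictor; the numbers are arbitrary.
set.seed(6)
n <- 200
toy <- data.frame(lnfreq = runif(n, 0, 6))
toy$any_error <- rbinom(n, 1, plogis(0.5 - 0.4 * toy$lnfreq))
m_err <- glm(any_error ~ lnfreq, family = binomial, data = toy)

# New picture, described on the same scale as the model's predictors.
new_pic <- data.frame(lnfreq = log(12))   # ASSUMPTION: lnfreq = log(frequency)

# type = "response" returns a probability; the default returns log-odds.
p_error   <- predict(m_err, newdata = new_pic, type = "response")
p_correct <- 1 - p_error

# The same probability via the linear predictor and the inverse logit.
eta <- predict(m_err, newdata = new_pic, type = "link")
```

With several predictors, new_pic needs one column per predictor, each named exactly as in the model formula.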

    Question 4

    Does the effect of target word frequency on RT vary significantly between Chinese, Hungarian, and Spanish?

    Hint: You will need to construct a single data frame to answer this question.
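One way to build such a data frame is to add a language label to each data set and stack them. The sketch uses tiny made-up data frames in place of df_c, df_h, and df_s, and assumes all three share the same column names; the values are arbitrary.

```r
# Tiny made-up stand-ins for df_c, df_h, and df_s.
toy_c <- data.frame(rt = c(900, 1010), lnfreq = c(2.1, 0.7))
toy_h <- data.frame(rt = c(870, 990),  lnfreq = c(3.0, 1.2))
toy_s <- data.frame(rt = c(820, 960),  lnfreq = c(2.5, 0.9))

# Label each data set before stacking so the rows stay identifiable.
toy_c$language <- "Chinese"
toy_h$language <- "Hungarian"
toy_s$language <- "Spanish"

# rbind() requires identical column names in all data frames.
df_all <- rbind(toy_c, toy_h, toy_s)
df_all$language <- factor(df_all$language)

table(df_all$language)
```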

    Question 4.1

    Fit an appropriate model to test this question. Run model diagnostics and re-fit the model if needed.
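Whether a slope differs between groups is typically tested with an interaction term. The sketch below compares a model with and without the interaction on simulated data. The column names and the simulated slopes are arbitrary and say nothing about the real data.

```r
# Simulated data with a different frequency slope in each language.
set.seed(7)
n <- 60
df_all <- data.frame(
  language = factor(rep(c("Chinese", "Hungarian", "Spanish"), each = n)),
  lnfreq   = runif(3 * n, 0, 6)
)
true_slopes <- c(Chinese = -20, Hungarian = -40, Spanish = -60)
df_all$rt <- 1100 +
  unname(true_slopes[as.character(df_all$language)]) * df_all$lnfreq +
  rnorm(3 * n, sd = 50)

m_add <- lm(rt ~ lnfreq + language, data = df_all)  # common slope
m_int <- lm(rt ~ lnfreq * language, data = df_all)  # slope varies by language

# F-test of the interaction: do the slopes differ between languages?
cmp <- anova(m_add, m_int)
```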

    Question 4.2

    Report and interpret the results of the final model.

    Question 4.3

    Which language has the weakest effect of frequency on RT? Describe it in terms of the unit change in RT resulting from a unit change in log-frequency.
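In an interaction model with treatment coding, the slope for the reference language is the main-effect coefficient, and each other language's slope is that coefficient plus its interaction term. The sketch below recovers per-language slopes from simulated data. The data, the column names, and the resulting ordering are arbitrary.

```r
# Simulated data; slopes chosen arbitrarily for illustration.
set.seed(8)
n <- 60
df_all <- data.frame(
  language = factor(rep(c("Chinese", "Hungarian", "Spanish"), each = n)),
  lnfreq   = runif(3 * n, 0, 6)
)
true_slopes <- c(Chinese = -20, Hungarian = -40, Spanish = -60)
df_all$rt <- 1100 +
  unname(true_slopes[as.character(df_all$language)]) * df_all$lnfreq +
  rnorm(3 * n, sd = 50)

m_int <- lm(rt ~ lnfreq * language, data = df_all)
b <- coef(m_int)

# Change in RT per one-unit increase in log-frequency, by language.
slopes_hat <- c(
  Chinese   = b[["lnfreq"]],                                   # reference level
  Hungarian = b[["lnfreq"]] + b[["lnfreq:languageHungarian"]],
  Spanish   = b[["lnfreq"]] + b[["lnfreq:languageSpanish"]]
)

# Weakest effect = slope closest to zero.
weakest <- names(which.min(abs(slopes_hat)))
```

Changing the reference level with relevel() changes which coefficients are reported directly but not the implied slopes.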


  • Explore and describe the data.
  • Build appropriate models, evaluate them and their associated assumptions, and interpret the results.
  • Let your models be informed by the research questions they are supposed to address. There is seldom a need for mind-bogglingly complex and borderline uninterpretable 6-way interaction models.