Analysis of Test Results
    Reliability, Validity, and Item Analysis
Learning Content
   Levels of Measurement
   Correlation Coefficient
   Reliability
   Validity
   Item Analysis
Objectives
   1. Determine when to use the different
    ways of establishing an assessment
    tool's validity and reliability.
   2. Become familiar with the different
    methods of establishing an assessment
    tool's validity and reliability.
   3. Assess how good an assessment
    tool is by determining its indices of
    validity, reliability, item
    discrimination, and item difficulty.
Levels of Measurement
   Nominal
   Ordinal
   Interval
   Ratio
Correlation Coefficient
   Relationship of two variables (X &
    Y)
   Direction: positive or negative
   [Scatterplot of X against Y illustrating
    positive and negative relationships]
Degree of Relationship
   0.80 – 1.00   Very high relationship
   0.60 – 0.79   High relationship
   0.40 – 0.59   Substantial/marked relationship
   0.20 – 0.39   Low relationship
   0.00 – 0.19   Negligible relationship
Testing for Significance
   Nominal: Phi Coefficient
   Ordinal: Spearman rho
   Interval & Ratio: Pearson r
   Interval with nominal: Point biserial
   Decision rule:
   If p-value ≤ .05: significant
    relationship
   If p-value > .05: no significant
    relationship
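A minimal sketch of the decision rule using scipy.stats (pearsonr, spearmanr, and pointbiserialr are the SciPy functions for these statistics; the scores are hypothetical):

import numpy as np
from scipy import stats

# Hypothetical interval-level scores
x = np.array([10, 12, 15, 18, 20, 22, 25])
y = np.array([40, 44, 50, 55, 61, 63, 70])

r, p = stats.pearsonr(x, y)              # interval & ratio data
# rho, p = stats.spearmanr(x, y)         # ordinal data
# rpb, p = stats.pointbiserialr(x01, y)  # binary variable with interval data

if p <= .05:
    print(f"r = {r:.2f}, p = {p:.3f}: significant relationship")
else:
    print(f"r = {r:.2f}, p = {p:.3f}: no significant relationship")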
Variance
   R² (coefficient of determination)
   Square the correlation coefficient
   Interpretation: the percentage of the
    variability in Y that is accounted for
    by the variability in X
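For example (with hypothetical numbers): if r = .70 between an aptitude test (X) and job performance (Y), then R² = .70² = .49, so about 49% of the variability in Y is accounted for by X.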
Reliability
   Consistency of scores obtained by
    the same person when retested
    with the identical test or with an
    equivalent form of the test
Test-Retest Reliability
   Repeating the identical test on a second
    occasion
   Measures temporal stability
   Appropriate when the variables are stable,
    e.g., motor coordination, finger dexterity,
    aptitude, capacity to learn
   Correlate the scores from the first test
    and the second test; the higher the
    correlation, the more reliable the test
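A minimal sketch of the procedure (five hypothetical examinees tested twice with the same test):

import numpy as np

# Hypothetical scores: same five examinees, two administrations
first  = np.array([78, 85, 62, 90, 70])
second = np.array([80, 83, 65, 92, 68])

r_tt = np.corrcoef(first, second)[0, 1]  # test-retest reliability
print(f"Test-retest reliability: {r_tt:.2f}")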
Alternate Form/Parallel Form
   Same person is tested with one form on
    the first occasion and with another,
    equivalent form on the second
   Measures equivalence, temporal stability,
    and consistency of response
   Used for personality and mental ability
    tests
   Correlate scores on the first form with
    scores on the second form
Split-Half
   Two scores are obtained for each person
    by dividing the test into equivalent halves
   Measures internal consistency;
    homogeneity of items
   Used for personality and mental ability
    tests
   The test should have many items
   Correlate scores on the odd- and
    even-numbered items
   Convert the obtained half-test correlation
    into a full-test reliability estimate using
    the Spearman-Brown formula, as sketched
    below
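A minimal sketch of the odd-even split with the Spearman-Brown correction, r_full = 2·r_half / (1 + r_half), on a hypothetical binary item matrix:

import numpy as np

# Hypothetical responses: rows = examinees, columns = 8 items (1 = correct)
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])

odd_half  = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)       # Spearman-Brown correction
print(f"Half-test r = {r_half:.2f}, full-test estimate = {r_full:.2f}")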

Kuder-Richardson
(KR-20/KR-21)
   Used when computing reliability for
    binary (e.g., true/false) items
   Measures consistency of responses to all
    items
   Used if there is a correct answer
    (right or wrong)
   Use the KR-20 or KR-21 formula
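A minimal sketch of KR-20, computed as [k/(k-1)] · [1 - Σpᵢqᵢ/σ²], where pᵢ is the proportion passing item i, qᵢ = 1 - pᵢ, and σ² is the variance of total scores (reusing the hypothetical item matrix above):

import numpy as np

# Hypothetical binary item matrix: rows = examinees, columns = items
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])

k = items.shape[1]                   # number of items
p = items.mean(axis=0)               # proportion correct per item
q = 1 - p
total_var = items.sum(axis=1).var()  # population variance of total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")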
Coefficient Alpha
   The reliability that would result if all
    values for each item were
    standardized (z-transformed)
   Measures consistency of responses to all
    items; homogeneity of items
   Used for personality tests with
    multiple-scored items (e.g., Likert-type)
   Use Cronbach's alpha formula
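A minimal sketch of Cronbach's alpha, α = [k/(k-1)] · [1 - Σσ²ᵢ/σ²], on a hypothetical matrix of Likert-type responses:

import numpy as np

# Hypothetical Likert responses (1-5): rows = examinees, columns = items
items = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)        # variance of each item
total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")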
Inter-Item Reliability
   Measures consistency of responses to all
    items; homogeneity of items
   Used for personality tests with
    multiple-scored items
   Each item is correlated with every
    other item in the test
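A minimal sketch of the inter-item correlation matrix, reusing the hypothetical Likert matrix above (np.corrcoef treats rows as variables, hence the transpose):

import numpy as np

items = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
])

# Correlate each item with every other item (items as the variables)
inter_item = np.corrcoef(items.T)
print(np.round(inter_item, 2))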
Scorer Reliability
   Having a sample of test papers
    independently scored by two examiners
   Aims to decrease examiner or scorer
    variance
   Used for clinical instruments employed in
    intensive individual testing, e.g.,
    projective tests
   The two scores obtained from the two
    raters are correlated with each other
Validity
   Degree to which the test actually
    measures what it purports to
    measure
Content Validity
   Systematic examination of the test
    content to determine whether it covers a
    representative sample of the behavior
    domain to be measured
   More appropriate for achievement tests
    & teacher-made tests
   Items are based on instructional
    objectives, course syllabi & textbooks
   Consultation with experts
   Making test specifications
Criterion-Prediction Validity
   Prediction from the test to any
    criterion situation over a time interval
   Examples: hiring job applicants,
    selecting students for admission to
    college, assigning military personnel
    to occupational training programs
   Test scores are correlated with
    other criterion measures, e.g.,
    mechanical aptitude scores with job
    performance as a machinist
Concurrent Validity
   Tests are administered to a group
    on whom criterion data are already
    available
   Used for diagnosing existing status, e.g.,
    entrance exam scores of college
    students with their average grade
    for their senior year
   Correlate the test score with the
    other existing measure
Construct Validity
   The extent to which the test may be said
    to measure a theoretical construct or
    trait
   Used for personality tests and for
    measures that are multidimensional
   Methods:
    Correlate a new test with a similar
     earlier test that measures approximately
     the same general behavior
    Factor analysis
    Comparison of the upper and lower
     groups
    Point-biserial correlation (pass/fail
     with total test score)
    Correlate each subtest with the entire
     test
Convergent Validity
   The test should correlate
    significantly with variables it is
    related to
   Commonly used for personality
    measures
   Assessed through the
    multitrait-multimethod matrix
Divergent Validity
   The test should not correlate
    significantly with variables from
    which it should differ
   Commonly used for personality
    measures
   Assessed through the
    multitrait-multimethod matrix
Item Analysis
   Item Difficulty – The percentage of
    respondents who answered an
    item correctly
   Item Discrimination – Degree to
    which an item differentiates
    correctly among test takers in the
    behavior that the test is designed
    to measure.
Difficulty Index
   Difficulty Index   Remark
   .76 or higher      Easy item
   .25 to .75         Average item
   .24 or lower       Difficult item
Discrimination Index
   .40 and above   Very good item
   .30 – .39       Good item
   .20 – .29       Reasonably good item
   .10 – .19       Marginal item
   Below .10       Poor item
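A minimal sketch computing both indices on a hypothetical binary item matrix. Difficulty is the proportion answering correctly; discrimination here is the difficulty in the upper half minus the lower half of examinees ranked by total score (the classical method uses the top and bottom 27%):

import numpy as np

# Hypothetical binary responses: rows = examinees, columns = items
items = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])

# Item difficulty: proportion of respondents answering each item correctly
difficulty = items.mean(axis=0)

# Rank examinees by total score, then split into upper and lower groups
order = np.argsort(items.sum(axis=1))[::-1]   # highest scorers first
half = len(order) // 2
upper, lower = items[order[:half]], items[order[-half:]]

# Item discrimination: upper-group difficulty minus lower-group difficulty
discrimination = upper.mean(axis=0) - lower.mean(axis=0)

print("Difficulty:     ", np.round(difficulty, 2))
print("Discrimination: ", np.round(discrimination, 2))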
