The role of task type in L2 vocabulary acquisition: a case of Involvement Load Hypothesis

Based on the Involvement Load Hypothesis (LAUFER; HULSTIJN, 2001), the current study examined the effect of involvement load and task type on vocabulary acquisition. Six classes of EFL learners were each assigned to one of six experimental groups with different involvement loads, yielding three groups with receptive tasks and three with productive tasks. Learners read a text and completed 10 vocabulary tasks focused on the target words while time on task was controlled across groups. Knowledge of the target words was tested in two post-tests. As predicted, the findings indicated that tasks with higher involvement loads were more effective for vocabulary learning than tasks with lower involvement loads. Receptive tasks were also compared with productive ones of the same load condition. Contrary to the Involvement Load Hypothesis, productive tasks were more effective than receptive ones. The results also indicate that time on task did not affect task efficacy: more involving tasks remained more effective when time on task was controlled.


Introduction
There is general agreement among researchers that vocabulary is one of the main constituents of a language, and that acquiring L2 vocabulary is a prerequisite of second language learning. In fact, it is through vocabulary that L2 learners are able to perform each of the four skills (RICHARDS; RENANDYA, 2002). The importance of vocabulary knowledge in academic situations (DONLEY; REPPEN, 2001), in reading ability (KITAJIMA, 2001; MEARA; FITZPATRICK, 2000) and in human communication (COADY; HUCKIN, 1997) has been greatly emphasized. Regarding the number of L2 words to be learned, some researchers propose that 5,000 words is the minimum lexical requirement for L2 learners of English to understand general, non-specialized (LAUFER, 1997; NATION, 1990) or unsimplified texts (HIRSH; NATION, 1992). However, understanding specialized and academic texts requires a stock of 7,000 (GROOT, 1994 cited in GROOT, 2000) or 10,000 words (SCHMITT, 2000). Similarly, 5,000 words is the prerequisite for communicative skills in a second or foreign language (NATION, 1993 cited in PRINCE, 1996). Accordingly, the first step for many foreign or L2 learners is to acquire and memorize a large stock of vocabulary. The question, however, is how.
The accepted view among most researchers (e.g., NAGY; HERMAN, 1987; SWANBORN; DE GLOPPER, 1999) is that L2 learners cannot learn such a large stock of vocabulary through explicit vocabulary instruction alone. According to Schmitt (2000), this would be too time-consuming and laborious. In addition, as Krashen (1989, cited in KEATING, 2008) states, the majority of word learning by L2 learners occurs incidentally. Shortly after Nagy and Herman (1985) developed the incidental vocabulary learning hypothesis, a wide variety of studies were carried out to identify the factors that most affect incidental word learning during different kinds of tasks. These studies revealed a wide range of factors that promote incidental word learning. For example, one line of research on incidental vocabulary learning focused mainly on learner factors (PRINCE, 1996; SWANBORN; DE GLOPPER, 2002). Other studies examined the impact of contextual cues such as marginal glosses (HULSTIJN, 1992; WATANABE, 1997), while still others considered dictionary use (KNIGHT, 1994; LUPPESCU; DAY, 1993) as a factor affecting incidental vocabulary learning. Another group of studies investigated the effects of text-based (JOE, 1995, 1998), word-focused (LAUFER, 2001; PARIBAKHT; WESCHE, 1997; WESCHE; PARIBAKHT, 2000) and interactional tasks (ELLIS, 1995; LOSCHKY, 1994; NEWTON, 1995) on incidental vocabulary learning. Additionally, negotiation and interaction (DE LA FUENTE, 2002; ELLIS; HE, 1999) were also considered to be effective factors in the incidental learning of L2 vocabulary.
In each of these studies, one task was superior to another in terms of incidental vocabulary learning. In explaining this superiority, most authors indicated that the more effective task requires a 'deeper level of processing' (CRAIK; LOCKHART, 1972) than the other task. Nonetheless, Craik and Lockhart's (1972) depth of processing has been criticized for lacking a clear-cut and simple definition of the different levels of processing (BADDELEY, 1999; CRAIK; TULVING, 1975; EYSENCK, 1978; LAUFER; HULSTIJN, 2001; NELSON, 1977). Accordingly, Laufer and Hulstijn (2001) formulated the Involvement Load Hypothesis to provide a more clear-cut definition of processing depth.

The Involvement Load Hypothesis
The Involvement Load Hypothesis is a theory of incidental vocabulary learning that formulates criteria explaining why some tasks lead to better vocabulary retention than others. In their hypothesis, the authors proposed the construct of task-induced involvement load, which quantifies how effective a task is for the retention of new L2 vocabulary in an incidental condition. This construct comprises three principal components: 'need', 'search', and 'evaluation'. The need component refers to whether the learner needs to know the meaning of the new words in order to complete the task. Two levels of need were proposed: moderate and strong. Need is moderate when it is externally imposed by the teacher or the task, and strong when it is intrinsically imposed by the learner.
Search, as opposed to need, signifies the attempt to discover the meaning of a new L2 word or the L2 form of a word in L1. Unlike need, search may only be present or absent. When learners attempt to discover the meaning of unfamiliar words to complete a task, search is present; when no such attempt is made, it is absent. Evaluation entails reaching a conclusion about the meaning of a word during tasks, for example, […] a comparison of a given word with other words, a specific meaning of a word with its other meanings, or combining the word with other words in order to assess whether a word does or does not fit its context (LAUFER; HULSTIJN, 2001, p. 14).
Like need, evaluation can also be moderate or strong. Evaluation is moderate when learners are required to compare several lexical items with each other (as in matching tasks), or to compare different meanings of a lexical item in a provided text (as in a homonym). Strong evaluation, however, requires learners to combine new lexical items and create novel sentences.
Combining the three factors with their levels in a task yields the task-induced involvement load. Laufer and Hulstijn (2001) stated that tasks with higher involvement loads promote better vocabulary retention than tasks with lower involvement loads. To compare different tasks numerically, Hulstijn and Laufer (2001) offered the involvement index, which assigns numerical weights such that "[…] absence of a factor is marked as 0, a moderate presence of a factor as 1, and strong presence as 2" (p. 544). Because search is either absent or present, each task can therefore have an involvement index from 0 (lowest index) to 5 (highest index).
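The arithmetic of the involvement index can be sketched in a few lines of Python. This is only an illustration of the weighting scheme quoted above; the function name and the dictionaries are our own labels and do not come from Laufer and Hulstijn (2001).

```python
# Weights follow the quoted scheme: absent = 0, moderate = 1, strong = 2.
# Search is only absent or present, so its weight is 0 or 1.
NEED = {"absent": 0, "moderate": 1, "strong": 2}
SEARCH = {"absent": 0, "present": 1}
EVALUATION = {"absent": 0, "moderate": 1, "strong": 2}


def involvement_index(need, search, evaluation):
    """Sum the weights of the three components (hypothetical helper)."""
    return NEED[need] + SEARCH[search] + EVALUATION[evaluation]


# The two extremes of the scale.
lowest = involvement_index("absent", "absent", "absent")
highest = involvement_index("strong", "present", "strong")
print(lowest, highest)  # 0 5
```

The upper bound of 5 follows directly from the component ranges (2 + 1 + 2).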
In their hypothesis, Laufer and Hulstijn (2001) stated that no particular task type (e.g., output) is considered more effective than another (e.g., input). They asserted that it is only the level of involvement load of a task that determines its efficacy. In other words, an input task and an output task with the same load will have equal effects on vocabulary acquisition. This claimed equivalence across task types (e.g., input vs. output) calls for further research.

Empirical studies on Involvement Load Hypothesis
Right after developing the Involvement Load Hypothesis, Hulstijn and Laufer (2001) investigated the effect of involvement load on short- and long-term retention of 10 unfamiliar words by advanced EFL learners in two different experiments. They compared three learning tasks with varying involvement loads: reading comprehension with marginal glosses (index = 1), reading comprehension plus fill-in (index = 2), and writing a composition using the target words (index = 3). Immediately after treatment, the learners were asked to write L1 translations or L2 definitions for the 10 target words in order to measure their short-term retention of the target words. The same post-test was administered again 1 or 2 weeks later in order to measure the students' long-term retention. The results of the Hebrew-English Experiment, which provided strong support for the Involvement Load Hypothesis, revealed that, on both post-tests, the composition group scored significantly higher than the fill-in group, and the fill-in group scored significantly higher than the reading group. The results of the Dutch-English Experiment, however, provided only partial support for the hypothesis: on both post-tests, the composition group performed significantly better than the fill-in and the reading groups, yet the fill-in group did not perform significantly better than the reading group.
Noting the limitations of Hulstijn and Laufer's (2001) study, Keating (2008) investigated whether low-proficiency learners may also benefit from more involving tasks, and whether learners gain the same word knowledge on passive and active tests. To address these questions, low-proficiency learners of Spanish were randomly assigned to complete one of three tasks with different involvement loads: reading comprehension with marginal glosses (index = 1), reading comprehension plus fill-in (index = 2), and writing original sentences using the target words (index = 3). After task completion and again two weeks later, the learners' knowledge of the target words was assessed through a passive and an active test. Partially confirming the Involvement Load Hypothesis, the results of both the immediate and delayed passive tests showed that Tasks 2 and 3 resulted in higher retention scores than Task 1. However, Task 3 was not more effective than Task 2. On the other hand, the results of the immediate active test, which firmly supported the hypothesis, revealed that learners in Tasks 2 and 3 retained words better than those in Task 1, and learners in Task 3 also performed better than those in Task 2. In the delayed active test, however, learners in Task 3 did not perform much better than those in Task 1 or Task 2. In short, Keating's (2008) study claimed that the Involvement Load Hypothesis may be generalized to low-proficiency learners and may affect learners' passive and active word knowledge similarly.
Following Keating (2008), Kim (2011) tested the Involvement Load Hypothesis in an ESL setting, across different task types and proficiency levels, with a controlled time on task. The first experiment tested the efficacy of three tasks with varying involvement loads within two different proficiency levels. In each proficiency level, learners were randomly assigned to complete one of three tasks: Reading, Gap-fill and Composition. In order to assess the L2 learners' initial learning and retention of target words, an immediate and a delayed post-test were conducted. The results of both post-tests showed that the Composition group (index = 3) yielded significantly higher scores than the Reading (index = 1) and Gap-fill (index = 2) groups. However, the participants in the Gap-fill group gained significantly better scores than those in the Reading group only on the delayed post-test. In a nutshell, the results of the immediate post-test partially supported the Involvement Load Hypothesis while those of the delayed post-test fully supported it. In the second experiment, Kim (2011) investigated whether two tasks with equal involvement loads had similar effects on the learning of target words. The author therefore compared the composition writing task (index = 3) with the sentence writing task (index = 3). The results of both post-tests revealed that these two tasks with identical involvement loads had equal effects on the initial learning and retention of target words across the two proficiency levels, a result consistent with the Involvement Load Hypothesis.
A review of the research on the Involvement Load Hypothesis suggests that time on task has not been adequately considered in these studies. Folse (2006) first claimed that the efficacy of one task over another might be due to the length of time needed for task completion. Similarly, Keating (2008) […]
Further research is also needed to investigate the underlying assumptions of the Involvement Load Hypothesis. For example, Laufer and Hulstijn (2001) claimed that no particular task type, be it input or output, is superior or more effective, and that the only influential factor in task efficacy is the task's level of involvement load. Consequently, more research is needed to examine whether tasks with similar levels of involvement load but of different types (input vs. output) will have similar effects on vocabulary acquisition. To meet these two purposes, the researchers designed three receptive and three productive vocabulary tasks with varying involvement loads. In light of the purposes of the study, the following research questions were posed:
1. Given English receptive vocabulary tasks, will Iranian EFL learners obtain better initial learning and retention of new vocabulary in higher task load conditions compared to lower ones? If so, will the benefits of tasks hold up over time?
2. Given English productive vocabulary tasks, will Iranian EFL learners obtain better initial learning and retention of new vocabulary in higher task load conditions compared to lower ones? If so, will the benefits of tasks hold up over time?
3. Given English receptive and productive vocabulary tasks with the same levels of involvement index, will Iranian EFL learners obtain the same initial learning and retention of new vocabulary on both types of tasks?

Participants
Six intact classes of second-year English major university students, homogenized by the TOEFL exam, were selected for this study. All of them were EFL learners whose first language was Persian. Initially, 179 students took part, but not all of them were present for the delayed post-test because, given the incidental learning design of this study, they were not informed of it in advance.
Moreover, the data from two subjects were excluded from the study because they already knew more than two target words. Accordingly, the final number of students taking part in this study was 162. Each of the six intact classes was randomly assigned to one of the six experimental groups, yielding three groups with receptive tasks and three with productive tasks.

The target words
The 10 target words, which were assumed to be unfamiliar to the learners, were chosen from the reading text 'Coping with Procrastination' from Kim's (2011) study. To make sure that the text would be of an appropriate level for the participants, the researchers modified its length and complexity. The unfamiliarity of the target words was checked in a pilot study with a group of learners who did not participate in the experiment. These learners, who had the same proficiency level as the participants in the main study, were given a list of the 10 target words and asked to translate them. The overall mean score was 0.2 out of 10 target words. Thus, the target words were unfamiliar at this proficiency level. As a final confirmation, the prior knowledge of the participants in the main study was also checked in the immediate post-test. The ten target words chosen from the text are: apprehensive, oration, vexed, spawn, envision, abate, caveat, assiduous, stymie, and divulge. The 10 target words were highlighted in bold face and glossed in L1 (Persian) as well as L2 (English) in the margin of the text.

The graphic organizers
A set of graphic organizers designed by Kim (2011) was also adopted and modified according to the revised text. The graphic organizers, which did not explicitly focus on any of the target words, were used in the study as part of the comprehension activity to control the time on task across the six experimental groups. Thus, the participants in the True-false, Matching and Fill-in-the-blanks task conditions were asked to complete the graphic organizers because, as the pilot study showed, their tasks took less time than those of the other groups.

Vocabulary task conditions
To address the first research question, the researchers designed three receptive vocabulary tasks with varying involvement loads: True-false, involvement = 1; Matching, involvement = 2; and Multiple-choice, involvement = 3.

True-false task condition
Participants assigned to the True-false task condition were asked to read the marginally glossed text and then complete the graphic organizers to control for time on task across all groups. Afterwards, they were given the 10 True-false vocabulary tasks focused on the target words. In terms of the Involvement Load Hypothesis, this task induced a moderate need (the knowledge of target words was relevant to answering the tasks), but neither search nor evaluation. Its involvement index was thus 1 (1 + 0 + 0).

Matching task condition
Participants in the Matching task condition were also asked to read the text and complete the graphic organizers. After that, they were given 10 Matching vocabulary tasks focused on the target words. This task induced moderate need, no search, and moderate evaluation. Evaluation was moderate because the participants had to distinguish among different definitions to answer the Matching vocabulary tasks. Therefore, the involvement index of this task was 2 (1 + 0 + 1).

Multiple-choice task condition
Participants in the Multiple-choice task condition were provided with the same text given to the previous two groups; however, the text was not marginally glossed. The participants' task was to read the text, looking up the target words in a dictionary; afterwards, they were given 10 Multiple-choice vocabulary tasks focused on the target words. This task induced moderate need and moderate evaluation (because the four options in each of the Multiple-choice vocabulary tasks had to be assessed against each other). The search factor was also present here. Therefore, the involvement index of this task was 3 (1 + 1 + 1).
To address the second research question, the researchers designed three productive vocabulary tasks with different involvement loads: Short-response, involvement = 1; Fill-in-the-blanks, involvement = 2; and Sentence writing, involvement = 3.

Short-response task condition
Participants in the Short-response task condition received the same marginally glossed text to read, and then completed the 10 Short-response vocabulary tasks focused on the target words. The involvement index of this task was 1 (1 + 0 + 0). Need was moderate, but search and evaluation were absent.

Fill-in-the-blanks task condition
Participants in the Fill-in-the-blanks task condition were asked to read the same text and then complete the graphic organizers. Afterwards, the learners were required to complete the 10 Fill-in-the-blanks vocabulary tasks focused on the target words with the most suitable word from 15 glossed words (10 target words plus 5 additional words) in the reading text. Consequently, the students could not narrow down the choices as they progressed through the vocabulary tasks simply by eliminating the words that they had already used. This task induced moderate need, no search, and moderate evaluation. The evaluation was moderate because the 15 glossed words had to be assessed against each other. Thus, the involvement index of this task was 2 (1 + 0 + 1).

Sentence writing task condition
Participants in the Sentence writing task condition received the same marginally glossed text, and were asked to read the text. Then, they were required to write L2 (English) sentences using the 10 target words. The involvement index of this task was 3 (1 + 0 + 2), that is, a moderate need, no search, and a strong evaluation. The evaluation was strong because the participants were required to assess the target words within appropriate collocations in order to generate a new context.
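The six task conditions described above can be summarized in a short Python sketch that recomputes each involvement index from its components and pairs the receptive and productive tasks of equal load. The component values are taken from the task descriptions; the names `TASKS`, `pair_by_index` and `PAIRS` are our own and serve only this illustration.

```python
from collections import defaultdict

# (task, task type, need, search, evaluation), as given in the task descriptions.
TASKS = [
    ("True-false", "receptive", 1, 0, 0),
    ("Matching", "receptive", 1, 0, 1),
    ("Multiple-choice", "receptive", 1, 1, 1),
    ("Short-response", "productive", 1, 0, 0),
    ("Fill-in-the-blanks", "productive", 1, 0, 1),
    ("Sentence writing", "productive", 1, 0, 2),
]


def pair_by_index(tasks):
    """Group tasks by involvement index, keyed by task type (hypothetical helper)."""
    pairs = defaultdict(dict)
    for name, kind, need, search, evaluation in tasks:
        pairs[need + search + evaluation][kind] = name
    return dict(pairs)


PAIRS = pair_by_index(TASKS)
for index in sorted(PAIRS):
    print(index, PAIRS[index])
```

Note that the two tasks with an index of 3 reach that value in different ways: the Multiple-choice task through the presence of search, and the Sentence writing task through strong evaluation.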

Vocabulary tests
The present study administered an immediate and a delayed post-test to assess the learners' initial learning and retention of target words, respectively. Upon completion of the tasks, and again three weeks later, we tested the participants' knowledge of the target words without prior notice through a modified version of the Vocabulary Knowledge Scale (PARIBAKHT; WESCHE, 1997) in all six task conditions (Figure 1). The scoring procedure of the modified VKS was as follows:
- 1 point: the word is not familiar at all (category I).
- 2 points: the word is familiar but its meaning is unknown (category II).
- 3 points: a correct synonym or translation is given (category III).
- 4 points: the word is used with semantic appropriateness in a sentence (category IV).
- 5 points: the word is used with semantic appropriateness and grammatical accuracy in a sentence (category IV).
It should be noted that wrong responses in self-report categories III or IV received a score of 2. Overall, the possible test score for both post-tests ranged from 10 to 50. On both post-tests, we gave the learners the 10 target words in the form of the VKS and asked them to complete it. The learners were also asked to indicate whether any of the words had been familiar to them before doing the task. If a learner was previously familiar with more than two words, the data collected from that learner were removed from the analysis.
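The scoring rules above can be expressed as a small Python sketch. It assumes that each response is coded by its self-report category (1 to 4 for categories I to IV) plus two rater judgments; this coding scheme and the function names are our own assumptions and are not part of the VKS itself.

```python
def vks_score(category, correct=False, grammatical=False):
    """Score one target word on the modified VKS (illustrative sketch).

    category: self-report category, 1-4 for categories I-IV.
    correct: the synonym/translation (III) or the sentence use (IV) is
        semantically appropriate, as judged by the rater.
    grammatical: for category IV, the sentence is also grammatically accurate.
    """
    if category not in (1, 2, 3, 4):
        raise ValueError("category must be 1, 2, 3 or 4")
    if category in (1, 2):
        return category
    if not correct:
        return 2  # wrong responses in categories III or IV score 2
    if category == 3:
        return 3
    return 5 if grammatical else 4


def vks_total(responses):
    """Sum the scores of a learner's responses (10 words give 10 to 50)."""
    return sum(vks_score(*response) for response in responses)


# Hypothetical learner: invented responses, not data from the study.
example = [(1,), (2,), (3, True), (3, False), (4, True, False), (4, True, True)]
print([vks_score(*r) for r in example])  # [1, 2, 3, 2, 4, 5]
```

With 10 target words, the minimum of 1 point and maximum of 5 points per word give the 10 to 50 range stated above.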

Procedure
This research was carried out on two separate days. On the first day, we administered the treatment and the immediate post-test. The delayed post-test was carried out three weeks later. On the treatment day, each of the six intact classes was randomly assigned to complete one of the following task conditions: True-false, Matching, Multiple-choice, Short-response, Fill-in-the-blanks, or Sentence writing. In each group, the participants were asked to read the text and complete the 10 vocabulary tasks. To control the time on task, we also added a set of graphic organizers to the True-false, Matching, and Fill-in-the-blanks groups. Each of the six task conditions took 50 minutes to complete. Because the study concerned incidental learning, the participants were not informed of the upcoming immediate or delayed post-tests; according to Hulstijn (2001), announcing a test is an indication of intentional word learning. Accordingly, after task completion, and again three weeks later, the participants were given the unannounced immediate and delayed post-tests in a modified form of the VKS in order to measure the initial learning and retention of target words, respectively.

Data analysis
The first two research questions were posed to assess whether the level of involvement load affected the initial learning and retention of new vocabulary when tasks with different involvement loads were administered. The dependent variable for these two questions was the score on the immediate and delayed post-tests, and the independent variable was the level of involvement load. In order to examine the impact of the independent variable on the dependent variable, the VKS scores of both post-tests were submitted to four one-way ANOVAs. Scheffe post hoc contrasts were then computed to locate significant differences among pairs. Additionally, six paired samples t-tests were performed to investigate whether the benefits of the tasks held up over time. Unlike the first two, the third research question examined whether the type of vocabulary task affected the initial learning and retention of new words when two different types of task (receptive or productive) with the same involvement loads were administered. The dependent variable in this question was the score on both post-tests, and the independent variable was the type of vocabulary task at two levels: receptive and productive. Six independent samples t-tests were performed to compare the receptive tasks with the productive ones of the same load condition. The alpha level for significance was set at 0.05.
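For readers unfamiliar with the one-way ANOVA used here, the F statistic can be computed by hand as in the following Python sketch. The scores are invented toy values chosen only to make the arithmetic easy to follow; they are not data from this study, and the function name is our own. The Scheffe post hoc contrasts are not implemented.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Invented VKS totals for three hypothetical groups (NOT the study's data).
toy_groups = [[20, 22, 24], [24, 26, 28], [30, 32, 34]]
f_value, df_b, df_w = one_way_anova(toy_groups)
print(round(f_value, 3), df_b, df_w)  # 19.0 2 6
```

The same degrees-of-freedom logic (number of groups minus 1, and number of participants minus number of groups) underlies the F values reported for the receptive and productive tasks.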

Data analysis of three receptive tasks
The descriptive statistics of the three receptive vocabulary tasks in Table 1 demonstrate that, on both post-tests, the Multiple-choice group performed better than the Matching group, which, in turn, performed better than the True-false group. To determine if these differences were statistically significant, the scores of each post-test were then submitted to a one-way ANOVA.
Note. The involvement index for each task is indicated in parentheses. The possible VKS scores in all three vocabulary tasks ranged from 10 to 50.
The results of both ANOVAs revealed a main effect for the level of the task's involvement load for both the immediate, F (2, 80) = 162.519, p < 0.001, and the delayed post-test, F (2, 80) = 134.678, p < 0.001. In other words, there was a significant difference among the tasks with different levels of involvement load on both post-tests. The results of two Scheffe post hoc tests also indicated that the Multiple-choice group significantly outscored both the Matching and the True-false groups, and the Matching group also significantly outscored the True-false group.
Comparing the means of the immediate post-test with those of the delayed post-test for each of the three receptive vocabulary tasks, the results of three paired samples t-tests revealed a significant decrease in the mean scores on the delayed post-test for all three receptive vocabulary tasks, that is, for the True-false task [t (28) = 11.471, p < 0.001], for the Matching task [t (26) = 8.980, p < 0.001], and for the Multiple-choice task [t (26) = 12.486, p < 0.001]. In addition to the t-test results, the sharp decline of the lines in Figure 2 also showed that the performance of all three groups declined significantly on the delayed post-test.

Data analysis of three productive tasks
The descriptive statistics of the three productive vocabulary tasks in Table 2 suggested that the mean score of the Sentence writing group was higher than that of the Fill-in-the-blanks and the Short-response groups on both post-tests; however, there was no great difference between the mean scores of the latter two groups on the delayed post-test.
Note. The involvement index for each task is indicated in parentheses. The possible VKS scores in all three vocabulary tasks ranged from 10 to 50.
To determine the statistical differences among groups, two one-way ANOVAs were conducted. The ANOVA results indicated that significant differences were found among the three productive vocabulary tasks on both the immediate, F (2, 76) = 47.780, p < 0.001, and the delayed post-test, F (2, 76) = 65.653, p < 0.001. The results of Scheffe tests also demonstrated that the Sentence writing group performed significantly better than the Fill-in-the-blanks and the Short-response groups on both post-tests, but the Fill-in-the-blanks group performed significantly better than the Short-response group only on the immediate post-test.
Regarding the means of the immediate and delayed post-tests for each of the three productive vocabulary tasks, the t-test results revealed a significant decrease in the mean score on the delayed post-test for the Short-response [t (25) = 11.781, p < 0.001], for the Fill-in-the-blanks [t (27) = 11.588, p < 0.001], and for the Sentence writing group [t (24) = 14.975, p < 0.001]. In addition to the t-test results, the downward lines in Figure 3 also revealed a general decline in the performance of all three groups on the delayed post-test, suggesting that the interval between the two post-tests may be a main reason for the decline in performance of all groups.

The comparison between receptive and productive tasks
Regarding the comparison between the True-false (load = 1) and the Short-response group (load = 1), the results of two independent t-tests showed a significantly better performance for the Short-response group on both the immediate, t (53) = -11.450, p < 0.001, and the delayed post-test, t (53) = -7.084, p < 0.001. In the comparison between the Matching (load = 2) and the Fill-in-the-blanks group (load = 2), the results of the t-tests revealed that the Fill-in-the-blanks group performed significantly better than the Matching group on the immediate post-test, t (42) = -7.134, p < 0.001; however, this advantage of the Fill-in-the-blanks group was not observed on the delayed post-test, t (36) = -1.927, p = 0.062 > 0.05. Unlike the previous two pairs, the t-test results of the comparison between the Multiple-choice (load = 3) and the Sentence writing group (load = 3) revealed no significant difference between these two groups on either the immediate, t (50) = -0.779, p = 0.440 > 0.05, or the delayed post-test, t (50) = -1.534, p = 0.131 > 0.05.
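As a consistency check on the reported statistics, the group sizes can be recovered from the degrees of freedom of the paired samples t-tests reported in the two preceding sections (df = n - 1) and compared with the other reported values. The Python sketch below assumes that every participant in a group contributed to that group's paired t-test; the variable and function names are our own.

```python
# Degrees of freedom of the six paired samples t-tests reported in the text.
PAIRED_DF = {
    "True-false": 28,
    "Matching": 26,
    "Multiple-choice": 26,
    "Short-response": 25,
    "Fill-in-the-blanks": 27,
    "Sentence writing": 24,
}

# For a paired t-test, df = n - 1, so each group size is df + 1.
GROUP_N = {task: df + 1 for task, df in PAIRED_DF.items()}


def pooled_df(task_a, task_b):
    """df of an equal-variance independent samples t-test: n1 + n2 - 2."""
    return GROUP_N[task_a] + GROUP_N[task_b] - 2


TOTAL_N = sum(GROUP_N.values())
print(TOTAL_N)  # 162, the final number of participants
print(pooled_df("True-false", "Short-response"))  # 53
print(pooled_df("Matching", "Fill-in-the-blanks"))  # 53
print(pooled_df("Multiple-choice", "Sentence writing"))  # 50
```

The recovered total of 162 matches the final sample size, and the pooled values match the reported t (53) and t (50). The Matching versus Fill-in-the-blanks comparisons report smaller values, t (42) and t (36), than the pooled 53; one possible explanation is a correction for unequal variances, but the text does not say, so this remains an assumption.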

Discussion
The first two research questions were framed to investigate whether tasks with a higher involvement load achieved better vocabulary scores than tasks with a lower involvement load while time on task was controlled across different groups. The results for the first research question on both post-tests fully supported the Involvement Load Hypothesis in that the Multiple-choice group with the highest involvement load (3) showed better initial learning and retention of target words than the Matching group with the lower involvement load (2), which, in turn, performed better than the True-false group with the lowest involvement load (1). However, the results for the second research question only partly supported the Involvement Load Hypothesis: the Sentence writing group (involvement = 3) performed significantly better than the Short-response (involvement = 1) and the Fill-in-the-blanks group (involvement = 2) on both post-tests, but the Fill-in-the-blanks group performed significantly better than the Short-response group only on the immediate post-test, not on the delayed one.
Unlike the first two, the third research question was constructed to investigate Laufer and Hulstijn's (2001) claim that no particular task type, be it input or output, is superior or more effective, and that the only determining factor in task efficacy is the degree of involvement load that a task induces. To this end, the researchers compared the receptive tasks with the productive ones of the same load condition. Contrary to the predictions of the Involvement Load Hypothesis, the first pair comparison revealed better performance in the Short-response group (a productive task) than in the True-false group (a receptive task) on both post-tests. Similarly, contrary to the Involvement Load Hypothesis, the Fill-in-the-blanks group (a productive task) performed significantly better than the Matching group (a receptive task) on the immediate post-test; however, this advantage of the Fill-in-the-blanks group was not observed on the delayed post-test. Unlike the previous two pairs, the results of the third pair comparison were fully in line with the predictions of the Involvement Load Hypothesis in that the Sentence writing group (a productive task) performed as well as the Multiple-choice group (a receptive task) on both post-tests.
Overall, the results for the first research question on both post-tests, and the results for the second research question on the immediate post-test, were in harmony with those obtained in Hulstijn and Laufer's (2001) Hebrew-English Experiment, Kim's (2011) first Experiment on the delayed post-test, and Keating's (2008) active word recall on the immediate post-test in that they all supported the Involvement Load Hypothesis. Similarly, the results for the second research question on the delayed post-test showed the same pattern as those obtained in Hulstijn and Laufer's (2001) Dutch-English Experiment and Kim's (2011) first Experiment on the immediate post-test. Nevertheless, the results for the third research question were considerably in conflict with the predictions of the Involvement Load Hypothesis. This hypothesis did not predict that any output task would lead to better results than any input task when they both had the same involvement load. On the contrary, we found that, beyond the involvement load induced by the task, the type of task also affected the learning of new words. In other words, two different types of tasks (receptive and productive) with the same level of involvement load might not have the same results in L2 vocabulary retention. This finding lends support to Swain's (1985) Output Hypothesis, which claimed that the act of production demands deeper cognitive effort and can contribute more to word learning than the mere reading of a text, which is an act of reception.
In general, the findings of this study clearly run counter to some of the previous studies (e.g., ELLIS, 1995; FOLSE, 2006; HULSTIJN; LAUFER, 2001; KEATING, 2008; WEBB, 2005) which claimed that controlling for time on task would diminish the effect of more involving tasks on vocabulary learning. However, similar to Kim (2011), we found that even when time on task was controlled across different groups, the more involving tasks yielded better vocabulary scores than the less involving ones. Apart from a significant effect of the task's involvement load, the results of the study also showed a significant decrease in the performance of all six groups on the delayed post-test, as observed in some of the previous studies (e.g., HULSTIJN; LAUFER, 2001; KEATING, 2008; WATANABE, 1997). This finding may be explained by Hulstijn's (2001) claim that a decrease in knowledge over time is natural in the absence of repetitive practice or additional exposure to the newly learned material.

Conclusion
Taken together, it is reasonable to conclude from the results of this study that task-induced involvement load, as applied to tasks of the same type, is the major factor in task efficacy in terms of vocabulary retention. However, when the hypothesis is tested with different types of tasks, the involvement load is not the only factor in task efficacy. Since task type also plays some role in vocabulary retention, it should be given more serious attention in studies of task efficacy in the area of incidental vocabulary learning.

Figure 2. The scores of the immediate and delayed post-tests for the three receptive vocabulary tasks.

Figure 3. The scores of the immediate and delayed post-tests for the three productive vocabulary tasks.
[…] taken into account, the benefits connected to more involving tasks faded. As opposed to Folse (2006) and Keating (2008), Kim (2011) empirically tested the Involvement Load Hypothesis with controlled time on task across groups, and found results in line with the predictions of the Involvement Load Hypothesis. Given these contradictory results about the role of time on task, further research is still needed to test this hypothesis with a controlled time on task from the outset of the study.

Table 1. Descriptive statistics of the immediate and delayed post-tests for the three receptive vocabulary tasks.

Table 2. Descriptive statistics of the immediate and delayed post-tests for the three productive vocabulary tasks.