TELeurope

A place where the discussion on the TEL dictionary can focus on the needs for communication between researchers and all users


A first exploration of TEL research trends by analysing its discourse

Nicolas Balacheff
275 days ago

The recent Alpine Rendez-vous 2011 (http://www.stellarnet.eu/programme/wp3/rendez-vous) hosted a series of workshops, each of which produced a white paper. As an exercise, our intention being to explore large corpora like that of TeLearn, we have carried out an analysis of the terms and expressions of this small set of texts. Altogether there are 8 texts using 38625 terms, of which 5450 are distinct. Unsurprisingly, the most used term is "learning" (score 436). We then selected the terms used at least 10 times, which reduced their number to 621. Within this set we selected the terms which are (in our opinion) related to the TEL research area, that is: 132 terms (by the way, a bit more than 20% of the 621).
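The counting and thresholding steps described above can be sketched roughly as follows. This is only an illustration, not the code actually used for the ARV texts: the tokenizer (lowercased runs of letters), the function names and the hand-made set of TEL-related terms are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequent_terms(texts, min_count=10):
    """Count every term over all texts and keep those used at least min_count times."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {term: n for term, n in counts.items() if n >= min_count}


def select_terms(frequent, domain_terms):
    """Keep only the frequent terms that a human judged relevant to the domain."""
    return {term: n for term, n in frequent.items() if term in domain_terms}


if __name__ == "__main__":
    # Toy input; the real corpus would be the 8 white papers.
    texts = ["learning " * 12 + "recent " * 3, "Learning technology"]
    frequent = frequent_terms(texts, min_count=10)
    print(select_terms(frequent, {"learning", "technology"}))
```

The manual step stays manual here: `domain_terms` stands for the list a person draws up after reading the 621 candidates.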


Since words make sense when associated with other words, we have looked at the context of use of each single word (with a threshold of 50 words before or after -- let us say that they are more or less in the same paragraph). We decided to treat the most used terms (e.g. learning or technology) case by case, because in such short texts they appear in almost all contexts, so their associations are not significant. We then drew a map of the relationships, which you can see at the following URL: http://maps.telearn.org/contexteARV2011/ . To keep the map readable, we have limited the visualisation to the top 3 associations for each term.
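For readers who want to try something similar, here is a minimal sketch of a distance-based co-occurrence count followed by a top-3 cut. It is not the procedure actually run on the ARV texts: the function names are made up, and the choice to count a pair once per pair of occurrences within the window is an assumption.

```python
from collections import Counter, defaultdict


def cooccurrences(tokens, terms, window=50):
    """Count pairs of distinct selected terms lying at most `window` tokens apart."""
    # Positions are in increasing order, so the inner loop can stop early.
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in terms]
    counts = defaultdict(Counter)
    for a, (i, term) in enumerate(positions):
        for j, other in positions[a + 1:]:
            if j - i > window:
                break
            if other != term:
                counts[term][other] += 1
                counts[other][term] += 1
    return counts


def top_associations(counts, k=3):
    """Keep, for each term, its k most frequent neighbours."""
    return {term: [other for other, _ in neighbours.most_common(k)]
            for term, neighbours in counts.items()}


if __name__ == "__main__":
    tokens = "data is a representation of data about learners".split()
    counts = cooccurrences(tokens, {"data", "representation", "learners"}, window=3)
    print(top_associations(counts, k=3))
```

Very frequent terms such as "learning" can simply be left out of `terms` and examined separately, as described above.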

Any comment to share?
Can this approach tell us something we were not aware of?

All the Alpine Rendez-vous 2011 white papers are available at the following URL: http://tinyurl.com/3vka8e7
The treatments have been performed by Emilie Manon with the support of Jérôme Zeiliger and Boris Morel.

Christian
275 days ago

Hi Nicolas, interesting work! We are doing some similar things in TELmap (also using Gephi), and "does it tell us anything new?" is an absolutely important question. At the moment I would say 'No'. But that doesn't mean that it is not useful - it's just not novel in itself as far as the TEL domain is concerned.

I would see it as an analysis of what happened during the ARV - and understanding better what happens during workshops and expert meetings is becoming increasingly important. Actually, it would be interesting to compare this view with Peter's Twitter visualisations ... http://goo.gl/36xSj

However, it's a bit like "a solution in search of a problem" ;)

Cheers, Christian 

 

Fridolin Wild
275 days ago

Would be interesting to see how it clusters when removing terms that are generic for this field, such as 'learning'! Also interesting: the clusters 'data' and 'representation' did not appear in the analysis of the 2008 EDMEDIA papers...

Nicolas Balacheff
275 days ago

Just had a look at the EDMEDIA analysis, one question: is the analysis made on full texts or titles only? How are the dictionaries built? Is it possible to get a more detailed report on this analysis?

Fridolin Wild
274 days ago

It was on the titles only (as this was already a large data set: EDMEDIA has a lot of papers). I mailed you the final article and can mail you the code for R used in the analysis.

Peter Kraker
273 days ago

Hi Nicolas and all,

this is indeed very interesting! Reading through your post I also thought about comparing the ARV and the EDMEDIA analysis which Fridolin has already pointed out to you.

As for the question of the value of such an analysis, I think it is the comparison to other corpora and also the development over time that is most beneficial. For me, it would indeed be interesting to see how your clusters differ from clusters derived from Twitter. We are currently preparing for I-KNOW (and therefore there is little time for that), but afterwards I would really like to follow up on this.

One question for Nicolas: how did you arrive at the different colors?

Best, Peter

Nicolas Balacheff
272 days ago

In my opinion the comparison between what we have done for the ArV and what Fridolin has done for EDMEDIA may not be very relevant. Titles are a very specific species in scientific communication, especially in our field. There is often a serious discrepancy between titles and the actual content of the communication. We could look at the differences with the analysis of Twitter communications, but first it will be important to have a model of the corpora so that we can see whether a comparison makes sense (who the communities are, the context, the objective of the communication). If you have in mind the ArV Twitter corpus, as I guess, then it would be good first to have an idea at least of the sub-community and its distribution across the workshops.

Concerning the colors, Emilie tells me that she could color nodes (Gephi function) and then the related nodes inherited this color. So, she colored the main nodes (e.g. learning) and the rest followed.

Fridolin Wild
272 days ago

Just a side-note: In my experiments I have found that full texts of papers have a couple of similar problems: very often the actual contribution of a paper is hidden in the midst of motivation, methodology, limitations, and the like. When analysing such semantic networks of full papers, the actual research topics get buried within generics about society, methodology, and statistics -- and topics that are otherwise completely unrelated get connected by e.g. methodology! With ARV white papers, this is obviously a bit different and probably not so bad, as they do not follow the typical paratext of scientific papers.

For this reason, many analyses I have seen in different fields use only the title, keywords, and abstract (but then - as in the case of the EDMEDIA papers - a larger number of them).

Experimenting with the ECTEL full texts has shown me that splitting the full texts into bags of words (or - as you did - using a threshold for the number-of-words distance) is a good idea to at least ensure that one full text is not treated as one single unit of analysis - which would result in everything being connected to everything.
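To make the point about the unit of analysis concrete, here is a small sketch of the fixed-size bag variant. It is not the code used for ECTEL or EDMEDIA; the function names and the decision to count each pair at most once per bag are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def split_into_bags(tokens, bag_size=50):
    """Cut a token list into consecutive chunks of at most bag_size tokens."""
    return [tokens[i:i + bag_size] for i in range(0, len(tokens), bag_size)]


def bag_cooccurrences(tokens, terms, bag_size=50):
    """Count, for each pair of selected terms, the number of bags containing both."""
    pairs = Counter()
    for bag in split_into_bags(tokens, bag_size):
        present = sorted(set(bag) & set(terms))
        pairs.update(combinations(present, 2))
    return pairs


if __name__ == "__main__":
    tokens = ["a", "b", "c", "d"]
    terms = {"a", "b", "d"}
    # Small bags: only terms that are close together get linked.
    print(bag_cooccurrences(tokens, terms, bag_size=2))
    # One bag for the whole text: every term is linked to every other term.
    print(bag_cooccurrences(tokens, terms, bag_size=4))
```

Running the same corpus with several values of `bag_size` and comparing the resulting pair counts is one simple way to see how sensitive the map is to that choice.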

Now I'm curious: did you try other context bag sizes as well (not only 50)? Why did you choose 50? And for the manual selection of TEL-related terms (a reduction from 5450 features to 621 and then manually to 132 is quite big): why did you eliminate the others and on what grounds?

The approach chosen seems to work well and seems to produce interesting results. Would be good to externally validate these results by comparing with the results produced with another statistical method ...

Nicolas Balacheff
271 days ago

@ Fridolin, after having read "analysis of the 2008 EDMEDIA papers"

Still, I wonder what bias is introduced by limiting the analysis to the titles. The title is often a compromise between advertising and scientific communication, as well as positioning within the dominant trends as witnessed by the composition of the programme committee and the specific call. So, in my opinion the exploration of full texts is preferable. I recognise that the texts are not perfect, but at least they represent a sincere effort to present the work done. Anyway, I think that the type of analysis presented in your paper is very close to what we did for the ArV texts, except that the latter was performed on a very limited sample and hence the picture may be more contingent. We are currently doing the same with all the documents available in the TeLearn Open Archive. Let us see what we get...

Concerning your paper, here are a few questions:
- did you publish the dictionaries somewhere, so that we can have a look at them?
- the size of the 2004 dictionary is very close to that of the 2008 one; how do their contents compare? What about the comparison between 2000 and 2004?
- The 2006 dictionary is smaller than the 2004, but comparable to the 2000. Any comment on the comparison of their content?
- The sizes of the 2000 and 2006 dictionaries are comparable. Is the content of the same kind? Which changes did you observe?

Concerning the nice graphs, it is difficult to comment much, except that I notice that while "teach" and "learn" are present in the 2000 graph, they disappear from the 2008 one, where we see "student" appearing as a kind of synthesis, I would say (actually "learn" is present in a peripheral way in 2008, associated with "traditional forum", and "teach" is in the "student" circle...).

But the most important question is about your conclusion: "the field clearly can be asserted to be [...] extinguishing certain terminology as interests shift and research has produced solution for some underlying problems". So, among the 21% of terms that disappeared between 2000 and 2008, which are those whose underlying problems have been solved?

Nicolas Balacheff
271 days ago

@ Christian and Peter, after a look at http://goo.gl/36xSj

I am unsure about what to say after a look at the ppt of the ArV's Twitter visualisation. Maybe there is not enough information; is there a more complete paper about it?

Nicolas Balacheff
268 days ago

@ Fridolin, about the terms we eliminated:

The fall from 5450 to 621 terms is due to the choice of the threshold of 10 occurrences for a term to be kept in the list. Terms like "recent", "looking", "features", "facilitate", "fischer", "netherland", "plenary", "perspective" were then still in the list and not that relevant. So we dropped them manually, and indeed many others, to limit the list to those terms which are significant for the TEL research area. In the end, 132 were left.

Peter Kraker
251 days ago

Hi Nicolas and all,

sorry for not replying earlier, I had forgotten to set the notifications for this group. As for the analysis of tweeting activity, we wrote a paper which will be presented at the EC-TEL next week (see http://know-center.tugraz.at/download_extern/papers/science_intelligence.pdf for the preprint). You can also try out the visualizations yourself:  http://stellar.know-center.tugraz.at/vis/ (best viewed with Firefox 4 and up).

At the moment one cannot go back to the Alpine Rendez-vous in the live version, but we are working to get the data back online. There is an image in the paper which shows the relationships between hashtags in the tweets from the Alpine Rendez-vous this year. As you can see, the arv11 hashtag sits in the middle. The hashtags directly related to the arv11 hashtag are the hashtags from the individual workshops, such as dataTEL or 3T. On the next level, there are some hashtags describing the content of these workshops, e.g. "agency" and "PLE". We can do this not only for hashtags, but also for nouns. So, in my opinion, it would be interesting to compare your contextual map from the white papers to the weighted graph of nouns, and see whether there are any overlaps. What do you think about that?
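As a rough illustration of the kind of weighted hashtag graph described here, a sketch could look like the following. It is not the implementation behind the visualisations linked above: the regular expression, the lowercasing of hashtags and the example tweets are all invented for the purpose.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")


def hashtag_graph(tweets):
    """Weight each pair of hashtags by the number of tweets in which both appear."""
    edges = Counter()
    for tweet in tweets:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(tweet)})
        edges.update(combinations(tags, 2))
    return edges


if __name__ == "__main__":
    # Made-up tweets, only to show the shape of the result.
    tweets = [
        "Great session #arv11 #dataTEL",
        "#ARV11 #3T starts now",
        "#dataTEL on #PLE",
    ]
    for (left, right), weight in hashtag_graph(tweets).items():
        print(left, right, weight)
```

The same function would work for nouns if the hashtag extraction were replaced by a noun extractor, which is left out here.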

Best,
Peter

Nicolas Balacheff
212 days ago

I have finally come back to this issue after reading the paper you pointed us to, "On the way to a science intelligence". As you know, since we had an occasion to discuss this idea, it is a bit early to make serious comparisons. Looking at your results and ours, it is not possible to tell much. Still, it is possible to foresee where the biggest issues and challenges might be. Indeed, tweets provide us with a lot of data (words) with some contextual indications (authors, hashtags, etc.) but possibly with more noise than information. Unlike for the white papers of the Alpine Rendez-vous or the Grand Challenge Problems, we don't have very accurate information about the objectives and functions of the tweets -- e.g. sharing information, socialising or warning about events. If we want to make serious progress, we need to model the sample and have a way to assess the contingency of what we get, in order to understand where possible biases might lie. Indeed, too much noise could be misleading. In short, "better data than no data", but it is not enough to get anything. In my opinion, the challenge ahead of the tools (Twitter crawler) is methodological.

Peter Kraker
203 days ago

I completely agree with you that we need to model the sample carefully, and that we need to expose potential biases of such an analysis. I would go even further in that we need to be very cautious with coword analysis, as there is serious doubt that such an analysis can model the development of sciences (Leydesdorff 1996).

On the other hand, tweets offer a lot of timely information, and there is evidence that researchers use them to convey information about their field of expertise (Letierce et al. 2010). Furthermore, citation analysis is biased as well, and there are many potential causes for a citation. Therefore it seems to me that it is worth studying scientific tweets to see what we can learn from them, while keeping their limitations well in mind.

Peter Kraker
181 days ago

@Nicolas @Emilie: Would it be possible to get the Gephi file for the ARV map? We are currently exploring related tweets and it would help us to have your corpus. Of course, we will share our insights afterwards in this forum.

Thanks, Peter

Emilie Manon
181 days ago

Hi Peter!

It's in your mailbox!

Emilie