Wednesday, April 25, 2012

Sharing Data in Archaeology

...I know it has been a long time since my last post, and many, many things have happened. Our CAA session on Uncertainty was great (many thanks to the contributors, and of course to the organisers and all the staff involved in running the conference!), with lots of nice papers on a very wide range of topics, from data collection to dissemination, passing through all stages of analysis and data representation...

One thing that came out of the roundtable discussion and from some of the contributions was very much related to data formats, standards and their representation. This is a big issue in archaeology, where you struggle to find datasets, you are often tied up in social obligations and nasty replies, and even when you do get access to the data you may find them useless, as the data structure is comprehensible only to its creator, and sometimes not even to him/her...

We tend to create data in order to do something with it, and very rarely do we care about others, whether they are able to re-use our dataset, and whether the results can be re-created (see an interesting example of this for ABM here). This is profoundly unscientific, as not allowing others to recreate the experiment or the analysis is essentially cheating. The problem is that creating datasets that can be used by other people is hard, damn hard. Personally I don't think making datasets re-usable for other people is the most exciting part of our research, but I love it when I find datasets that can be easily re-used and understood by anyone.

Then I discovered that my own university had started a great project called REWARD, and that CAA had introduced a recycle award. Both aim to encourage researchers to make their datasets available, and to reuse them either to re-assess existing knowledge or to provide new analyses and possibly new interpretations. But how to do this? The currency of the academic world is citation (alas), so the best way to change the system is to exploit that structure. If people can publish their data as papers, other people using those data have to reference them. That's the price you pay: a citation entry. Data published in this format will need to be re-usable, and by being re-usable they will acquire visibility. People will start to choose these data, instead of asking some mean professor who replies with a nasty email. This will also create a feedback mechanism: the more a data paper is referenced, the more it will be advertised, so scholars will start to submit their own data papers to acquire the same visibility. Fast-forward this and everybody will be submitting data papers, making available a vast amount of data that was previously hidden inside dusty cupboards. This could change the discipline. If only there were a journal doing this...

Ah. I forgot to mention. It turns out that there is now a new journal doing this. The Journal of Open Archaeology Data aims to do exactly what I wrote in the paragraph above:

"The Journal of Open Archaeology Data (JOAD) features peer reviewed data papers describing archaeology datasets with high reuse potential. We are working with a number of specialist and institutional data repositories to ensure that the associated data are professionally archived, preserved, and openly available. Equally importantly, the data and the papers are citable, and reuse will be tracked. While still in beta phase, the journal is now accepting papers. We will also be adding new functionality over the next few weeks, and refining the look and feel."


I published a data paper myself, using the settlement data behind a paper I published (along with Andy Bevan and Mark Lake) in the Journal of Archaeological Science a couple of years ago. The data are slightly different (they have been updated to a new chronology), but I attached an R script to the paper which should allow anybody to re-run the core analyses. The whole experience was very thought-provoking, especially as much of the material was created 5 years ago... Trying to find the right version of each file and checking that everything matched everything else was a tough job. But it was definitely worth it, and I'm looking forward to seeing somebody use my dataset, perhaps proving that I was wrong in my conclusions!

So open your dusty cupboards, look back at your data, fix all the small (and big) errors you'll find in there, and share your data by submitting a paper to the Journal of Open Archaeology Data.


source: http://openarchaeologydata.metajnl.com/about/

Thursday, December 1, 2011

CAA2012: Deadline Extension

Just a short post to remind you of the CAA2012 conference in Southampton. The deadline has been extended to the 7th of December, so you still have one week to submit your paper, and if you have something on uncertainty, go and submit it to the session "Embracing Uncertainty in Archaeology"!


Wednesday, November 9, 2011

CAA 2012 [Call for Roundtable]: Models and Simulations in Archaeology: where we are and where we are headed?

Following the previous post, I just wanted to let you know that we (Mark Lake, Bernardo Rondelli, Xavier Rubio and myself) are also proposing a roundtable (details below). This year's (again, academic year) CAA will have two sessions on simulation ("Archaeological Simulation Modelling as Computational Social Science: Next Steps Forward" and "Artificial Societies in Prehistory and Ancient Times"), so it's a good opportunity for a "hands-on" discussion with a solid basis to start from... Over the last couple of years the number of publications using ABM, and computer simulation in general, has been growing, so we really need to sit back and think carefully about it...


Here are the details of our proposal:



Models and Simulations in Archaeology: where we are and where we are headed?


Computer simulation is a well-established technique which has long provided substantive insights in the physical sciences and increasingly does so in the life sciences. In addition, the field of social simulation has developed rapidly since the mid 1990s. The archaeological application of computer models dates back to the early 1970s, but despite forty years of activity, the impact of simulation has, with a few exceptions, been relatively slight, and it is largely still viewed as a fringe activity (e.g. Lake 2010). However, the first decade of the new millennium has witnessed a resurgence of interest, made manifest in a growing number of publications which provide the potential to change this perspective. In large part this is due to the way that agent-based modelling has captured archaeologists’ imagination, especially given the increased availability of simple software environments (e.g. NetLogo) that do not require advanced programming knowledge and can easily be run on desktop computers.

Since this increasing “democratisation” of simulation seems likely to lead to more widespread application of the technique to substantive archaeological problems, this seems a good moment to take stock and consider whether the appropriate epistemic foundations are in place to support the growth of productive archaeological simulation. It is vitally important to recognise that the availability and accessibility of user-friendly software environments does not per se solve issues arising from the complexity of the model building process and, in particular, the validation and portability of the experiment’s results. Consequently, an epistemological reflection and methodological overview of the use of modelling and simulation in archaeology is very timely. We think that a discussion between specialists and non-specialists, experienced and less experienced users, can stimulate a reflection on where we are, and more importantly, where we are headed in the application of computer simulation to archaeological research questions.

Topics:

Over the last decade the archaeological application of computer simulation has taken two distinct directions. On the one hand, a number of models have been developed to test specific hypotheses. This approach quantitatively or semi-quantitatively compares existing archaeological data with artificial data produced by the simulations. The other direction is the development of abstract models derived from assumptions developed within our discipline or elsewhere (behavioural ecology, evolutionary anthropology, biology etc.), which have been used to generate new theories, or to explore the implications of previously formulated ones in a dynamic and computational environment.

The two directions are not mutually exclusive, but each leads to a series of important questions: How can we validate abstract models? How can we translate archaeological and anthropological theories into computational and/or mathematical algorithms? How do the limits of computational representation affect our model building exercise? How does the scientific audience evaluate extremely complex and realistic models? How do we communicate our models, especially to non-specialists?
By discussing these concrete questions we hope that the roundtable will approach deeper questions, such as: will computer simulation have an impact on our discipline or will it remain a fringe activity? If the former, will increased use of simulation change mainstream archaeology or will it lead to the emergence of a “new” sub-discipline?

We are particularly interested in the following topics:

  • Abstract vs. Realistic Models: an adversative or complementary epistemology? 
  • Mathematical Models vs. Agent Based Models: two faces of the same coin or alternative pathways?
  • Validation and Verification of Computer Models. 
  • Communicating Models: looking for a common protocol or a language? 
  • Potential pitfalls and common problems of modelling in archaeology.
  • Why are computer models still an outsider?


Tuesday, November 8, 2011

CAA 2012 [Call for Papers]: Embracing Uncertainty in Archaeology

So, it's time to submit your paper to CAA2012!!! This year (academic year, I mean) the University of Southampton will host the event, which will be back in Europe after being hosted in China. I'm quite excited about this, since the last time I went was Budapest 2008... And this time we (Andrew Bevan, Eugenio Bortolini, Eleonora Gandolfi, Mark Lake and myself) have also proposed the following session:

Embracing Uncertainty in Archaeology
(Session Code: Theory1)

This session aims to develop greater awareness of approaches to uncertainty in archaeology by bringing together both established experts and young researchers from a range of different fields. Its ultimate goal is to generate broader discussion about how we confront uncertainty in the recovery of archaeological datasets, how we treat it analytically and computationally, and how we incorporate it into our inference building, interpretations and narratives.

Uncertainty is at the core of long-standing debates in a wide variety of modern scientific domains, as testified by the recent inclusion of the topic among the twelve most relevant scientific endeavours listed by the Royal Society. Critical debates concerning the assessment, representation and public understanding of uncertainty are also of widespread interest in the social and political sciences.

The increasing availability of tools capable of solving large computational problems has provided a suitable environment for tackling this issue. Examples of such approaches can be found widely across different realms of our discipline. These include the use of advanced techniques in chronometry (Buck et al 1996), predictive modelling (Ducke et al 2009), spatial analysis (Crema et al 2010), remote sensing (Menze and Ur 2011), phylogenetic analysis (Nicholls and Gray 2006), typology and classification (Hermon and Nicolucci 2002), stratigraphy (De Runz et al 2007) and data visualisation (Zuk et al. 2005).
Despite some explicit, epistemologically-oriented contributions (Wylie, 2008; Lake, 2011), a debate encompassing both practical and theoretical aspects has never emerged, nor has it determined the direction and priorities of mainstream archaeology.


Themes:


Isolated discussions within single subfields (e.g. radiocarbon dating or spatial modelling) can certainly provide grounds for theoretical and methodological advancement, but we need to develop more integrated approaches to uncertainty across all the aspects of the discipline.
We invite original contributions to the following themes: a) the role of uncertainty in archaeological narratives; b) methodological debates about different probabilistic approaches; c) measurement and integration of uncertainty into archaeological analysis; d) appropriate sampling strategies and missing data problems; e) cultural resource management, risk assessment and decision making; f) public understanding and data visualisation.

References


Buck, C.E., Cavanagh, W.G. and Litton, C.D., 1996. Bayesian approach to interpreting archaeological data. Chichester: Wiley.

Crema, E. R., Bevan, A. and Lake, M., 2010, A probabilistic framework for assessing spatio-temporal point patterns in the archaeological record, Journal of Archaeological Science, 37, 1118-1130.

De Runz, D., Desjardin, E., Piantoni, F., and Herbin, M. 2007. Using fuzzy logic to manage uncertain multi-modal data in an archaeological GIS. Proceedings of ISSDQ 2007, Enschede, The Netherlands.

Drennan, R.D. and Peterson, C.E. 2004. Comparing archaeological settlement systems with rank-size graphs: a measure of shape and statistical confidence, Journal of Archaeological Science 31:533-549.

Hermon, S., and Nicolucci, F. 2002. Estimating subjectivity of typologists and typological classification with fuzzy logic. Archeologia e Calcolatori, 12, 217-232.

Lake, M W, 2010. The Uncertain Future of Simulating the Past. In A Costopoulos and M W Lake (eds) Simulating Change: Archaeology Into the Twenty-First Century, Salt Lake City: University of Utah Press. 12-20.

Nicholls, G.K. and Gray, R.D. 2006. Quantifying Uncertainty in a Stochastic Model of Vocabulary Evolution. In: Forster, P. and Renfrew, C. (eds.) Phylogenetic methods and the prehistory of languages. Cambridge: McDonald Institute for Archaeological Research. 161-172.

Wylie, A. 2008. Agnotology in/of Archaeology. In: Proctor, R. N. and Schiebinger, L. (eds.) Agnotology: The Making and Unmaking of Ignorance. Stanford: Stanford University Press. 183-205.

Zuk, T., Carpendale, S. and Glanzman, W.D., 2005. Visualizing Temporal Uncertainty in 3D Virtual Reconstructions. In Proceedings of the 6th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST 2005), 99-106.


How criminology and computational statistics can help archaeology...


I've been working for a while (Crema et al 2010, Crema in press) on the issue of temporal uncertainty in archaeological analysis. My interest in this emerged while I was trying to do the simplest spatial analysis of the distribution of pithouse dwellings. While building my database I found something similar to the following:

ID, Name, Date
001, Pithouse A, Kasori E1 phase
002, Pithouse B, Middle Jomon
003, Pithouse C, transition between Kasori E2 and E3 phases
...

Now... in case you're not an expert on the pottery phases of the Jomon period (you should be, they look good), the Middle Jomon period lasted ca. 1,000 years (ca. 5470-4420 cal BP), while the Kasori E1 phase lasted ca. 60 years (4900-4840 cal BP), the latter being a sub-phase of the former.

You can easily imagine my problem here. In order to do any diachronic analysis I have two options. I can either lump a large portion of my data by choosing the coarsest resolution I have (Middle Jomon in this case, hence virtually dismissing the knowledge I have about pithouse A) and then carry on with my analysis, or I can use only those records satisfying the temporal resolution I am interested in and ignore the rest of the data (thus removing pithouse B in this case and using only A and C).
Both solutions are highly unsatisfactory, and frankly the second one even involves omitting part of the available knowledge...

While looking around, I found some papers by Jerry Ratcliffe, an American criminologist who happened to have very similar problems... Imagine you've left your car at 9 AM, you've worked all day, and when you finish at 5 PM you find that your car has been stolen. You know that the crime happened sometime between 9 AM and 5 PM. Now, if you are a criminologist trying to analyse data on other thefts, you will quickly notice that the great majority of the temporal data consists of intervals within which the crime occurred, rather than the precise time of the event. Ratcliffe calls these intervals time-spans, and notes that you get shorter ones when you have more information (a friend arrived late at work and noticed the car was already missing at 11 AM) and longer ones when you have less. The problem with the spatio-temporal analysis of crime data is that your dataset contains time-spans of widely different lengths... Exactly the same problem WE have...

The solution proposed by Ratcliffe is called aoristic analysis (Ratcliffe 1998). It essentially relies on the "principle of insufficient reason": other things being equal, if we divide a time-span into equally long time-blocks (e.g. decades or hours), the chance that the event occurred in any given block is uniformly distributed across the time-span. In other words, if you don't have any further information, the chance that your car was stolen between 9 and 10 AM is equal to the chance that the crime occurred between 3 and 4 PM. Based on this very simple premise, we can attach probabilistic measures to our "events".
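To make this concrete, here is a minimal sketch in R of how aoristic weights could be computed. The 100-year blocks and the time-spans are purely illustrative assumptions, not the values used in the actual papers:

# Sketch: aoristic weights for an event known only to fall within a
# time-span (start-end, in cal BP, so larger numbers are older).
# Each block's weight = proportion of the time-span overlapping that block.
aoristic.weights <- function(start, end, breaks) {
  block.start <- breaks[-length(breaks)]
  block.end <- breaks[-1]
  overlap <- pmax(0, pmin(start, block.start) - pmax(end, block.end))
  overlap / (start - end)
}

breaks <- seq(5500, 4400, by = -100)  # illustrative 100-year blocks

round(aoristic.weights(4900, 4840, breaks), 3)  # Pithouse A (Kasori E1)
round(aoristic.weights(5470, 4420, breaks), 3)  # Pithouse B (Middle Jomon)

Pithouse A puts all of its weight into a single block, while Pithouse B spreads its weight thinly across the whole Middle Jomon, which is exactly the difference in information content we want to preserve.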

Now, the problem I was facing is that probability weights cannot be fed into standard analyses. You can perhaps enhance the visualisation of the data, and maybe plot the cumulative sum of the probabilities as a time-series, but if you want to do something more sophisticated you need non-probabilistic data, since the majority of available tools are not designed to deal with temporal uncertainty.

Then I came across Monte Carlo simulation. The idea itself is very simple. Based on a probability distribution (in this case given by the aoristic analysis), one can simulate the possible combinations of events, and hence the possible spatio-temporal patterns that might have occurred. Their number is immensely large, but if a sufficient degree of knowledge is available, some patterns will occur more frequently than others. Hence, by simulating n scenarios, one can compute the proportion of them in which a given pattern is observed. This provides a likelihood estimate for that pattern.

Adopting Monte Carlo simulation opens up an entire array of possibilities. One can in fact combine different sources of knowledge, from radiocarbon dates to stratigraphic relations, and explore the range of possible spatio-temporal patterns. One can then assess each of the possible scenarios and compare the distribution of the outcomes in order to make inferences about the past in probabilistic terms...
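As a rough illustration of the idea (again with made-up time-spans and the uniform assumption of aoristic analysis), one could sample possible "true" dates for each event and then ask how often a given pattern occurs across the simulated scenarios:

# Sketch: Monte Carlo treatment of temporal uncertainty.
# Dates are sampled uniformly within each event's time-span (cal BP);
# the spans below are illustrative, not real site data.
set.seed(123)

events <- data.frame(
  id = c("Pithouse A", "Pithouse B", "Pithouse C"),
  start = c(4900, 5470, 4840),
  end = c(4840, 4420, 4700)
)

nsim <- 1000
sims <- replicate(nsim, runif(nrow(events), min = events$end, max = events$start))

# Proportion of simulated scenarios in which Pithouse A is older than C
# (larger cal BP = older):
mean(sims[1, ] > sims[3, ])

In a real application one would replace the simple uniform draws with whatever probability distributions the aoristic analysis, radiocarbon dates or stratigraphic relationships provide, and would summarise full spatio-temporal patterns rather than a single pairwise comparison.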

References


Crema, E. R., Bevan, A. and Lake, M., 2010, A probabilistic framework for assessing spatio-temporal point patterns in the archaeological record, Journal of Archaeological Science,  37, 1118-1130.

Crema, E. R., In press. Aoristic Approaches and Voxel Models for Spatial Analysis. In: Jerem, E., Redő, F. and Szeverényi, V. (ed.) On the Road to Reconstructing the Past. Proceedings of the 36th Annual Conference on Computer Applications and Quantitative Methods in Archaeology.  Budapest: Archeolingua.

Crema, E. R., In press, Modelling Temporal Uncertainty in Archaeological Analysis, Journal of Archaeological Method and Theory (online first). 

Johnson, I., 2004. Aoristic Analysis: seeds of a new approach to mapping archaeological distributions through time. In: Ausserer, K. F., Börner, W., Goriany, M. and Karlhuber-Vöckl, L. (ed.) [Enter the Past] the E-way into the Four Dimensions of Cultural Heritage: CAA2003. BAR International Series 1227. Oxford: Archaeopress, 448-452.

Ratcliffe, J. H. and McCullagh, M. J., 1998, Aoristic crime analysis, International Journal of Geographical Information Science, 12, 751-764.

Ratcliffe, J. H., 2000, Aoristic analysis: the spatial interpretation of unspecified temporal events, International Journal of Geographical Information Science, 14, 669-679.

Tuesday, July 19, 2011

Subsistence Strategies, Uncertainty and Cycles of Cultural Diversity

Our (Mark Lake's and mine) paper on "The Cultural Evolution of Adaptive-Trait Diversity when Resources are Uncertain and Finite" has been accepted for a special issue of Advances in Complex Systems!
We basically extended the work we did for the conference on Cultural Evolution in Spatially Structured Populations (see blog entry), focusing on the dynamics of cultural evolution for traits which are: 1) adaptive (instead of neutral), and hence determine changes in the reproductive rate; 2) characterised by negative frequency dependence (we actually initially explored both positive and negative frequency dependence and a combination of the two, but that's another story/paper); and 3) produce stochastic yields.
In practice, we developed an ABM (written in R) where agents forage based on a specific trait they possess. The yield of the foraging activity is subject to some degree of uncertainty and is constrained by two types of frequency dependence. In the S-mode model we explored scenarios where different traits represent different technologies or behaviours adopted for harvesting a shared resource, while in the I-mode model we explored scenarios where each trait harvests a separate and independent resource (e.g. different prey). We then allowed agents to reproduce, die, innovate and learn (with frequency z) using the payoff-biased (model-biased) transmission proposed by Shennan (2001), and measured the diversity of traits using Simpson's diversity index (a small sketch of this measure follows the list below). The model showed many interesting properties; here are some which I thought were particularly notable:



  • High values of z (frequency of social learning) have negative impacts in both I-mode and S-mode models if there is some degree of stochasticity in the payoff. 
  • When traits share the same resource, the highest rate of cultural evolution occurs at values of z which determine a limit cycle between moments of low and high diversity. 
  • When traits harvest independent resources, the highest rate of cultural evolution occurs at values of z which determine the adoption of the largest number of different traits (highest richness), with patterns similar to the Ideal Free Distribution. When the frequency of social learning is too high, novel traits are lost by the innovators before they are transmitted to the rest of the population.
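For reference, the diversity measure mentioned above can be computed in a couple of lines of R. This is just the standard 1 - sum(p^2) formulation of Simpson's index, not code from our model:

# Sketch: Simpson's diversity index (1 minus the sum of squared trait frequencies).
simpson <- function(traits) {
  p <- table(traits) / length(traits)
  1 - sum(p^2)
}

simpson(c("A", "A", "A", "B", "C"))  # mixed population -> some diversity
simpson(rep("A", 10))                # single trait -> 0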

The negative impact of a high reliance on social learning is perhaps the most interesting outcome, and it relates to what is known as survivorship bias. Suppose a population of n individuals all adopt the same trait A, which yields a normally distributed payoff (with mean μA and standard deviation σA). At a given point in time an individual innovates and adopts a novel trait B, whose payoff is on average higher than that of A (thus μB > μA). With frequency z, individuals copy the most successful (i.e. highest-payoff) individual among k randomly sampled individuals (k being each agent's sample window of observation). If σA = σB = 0 the payoffs are always the same, and thus trait B will always produce a higher yield than A. This means that the novel trait will be adopted by approximately zn agents, and that the innovator will stick to B (which will always be higher than A). However, if σA > 0 and σB > 0 (in other words, if the payoffs have some degree of uncertainty), something different happens. Since the number of individuals adopting trait A is by definition higher than the number adopting B, and since the payoff is stochastic, some lucky individuals with trait A are likely to have a payoff higher than μB. If these individuals are among the k sampled by the innovator, the innovator will switch back, erroneously underestimating his own new trait. If z is low, the innovator is unlikely to do this (he won't rely on social learning), while a proportion zn of the population will have some chance of adopting trait B. If the number of individuals adopting the new trait exceeds a certain threshold, it is unlikely to get lost, and hence it can spread and invade trait A.

Survivorship bias tells a similar story. Suppose you are a businessman and decide to adopt a specific market strategy because you've read in Forbes that some guy was successful with it. You have a model (the guy in Forbes) who is successful, and you explain this success by the strategy he used. However, Forbes won't mention that there may be 10,000 other businessmen who adopted the same strategy and failed. The same applies to the music industry: you see people making a lot of money, so you decide to learn and give it a try, ignoring the fact that hundreds of thousands of people did the same and failed. The pattern is probably stronger here, because the success rate is not normally distributed but much more skewed (in fact it is likely to be a power law, see here). The small tail of very successful individuals is much more visible than the (majority of) other people. In our model the shape of the distribution is different, but nonetheless the few successful individuals are treated as representative of a trait under model-biased transmission.
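The switch-back effect is easy to reproduce in a toy simulation. The sketch below is not the model from the paper; it simply isolates the mechanism with made-up values (100 adopters of A, one innovator with B, normally distributed payoffs, a sample window of k = 10) and asks how often the best of the k observed A payoffs beats the innovator's own B payoff:

# Toy sketch of the switch-back effect under payoff-biased copying.
# All parameter values are assumed for illustration only.
set.seed(42)

n <- 100          # adopters of trait A
muA <- 10         # mean payoff of trait A
muB <- 12         # mean payoff of trait B (better on average)
sigma <- 3        # payoff noise, same for both traits
k <- 10           # sample window of observed individuals
nsim <- 10000

switch.back <- replicate(nsim, {
  payoffA <- rnorm(n, muA, sigma)   # noisy payoffs of the A adopters
  payoffB <- rnorm(1, muB, sigma)   # the innovator's own payoff
  observed <- sample(payoffA, k)    # the k individuals the innovator observes
  max(observed) > payoffB           # a lucky A adopter looks better than B
})

mean(switch.back)  # proportion of cases where the innovator would switch back

Even though trait B is better on average, the innovator frequently observes an A adopter who simply got lucky, which is exactly the survivorship problem described above.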
So what's the moral in all of this? If there is one (though it's an obvious, slightly cheesy, over-mentioned piece of advice), it is to "believe in yourself and not rely too much on copying successful individuals". In most cases they're just lucky, and you might have something bigger in your hands.


Monday, June 27, 2011

Deterministic-Continuous models vs. Stochastic-Discrete models

Recently I've been in several workshops and discussions where people were using continuous numerical models rather than ABM. A very common question from them is along the lines of "why should we use ABM when we have mathematical models?". The argument is a well-known controversy (see McElreath and Boyd 2007 for a typical critique of ABM from a mathematical standpoint). Generally ABM is considered weak because it is computationally far more demanding, cannot be solved analytically, and requires the exploration of a phase space (which sometimes is not done, or not done properly) which is often too large. ABM users defend themselves by pointing out the importance of stochastic components, and the difficulty of representing complex behaviour (easily, at least) with a set of differential equations. My position sits somewhere in the middle, as I know that sometimes the temptation to go for ABM comes from my own mathematical deficiencies as a modeller. Ideally I try to invest in both types of modelling, and not to be epistemologically confined to one or the other. Having said that, one aspect which gives ABM an advantage is the capacity to model dynamics with small population sizes.
While trying to verify the ABM of my doctoral project, I found a striking example of this. In order to test whether my ABM was working properly, I created a mathematical version of the same model, to see if their behaviours were consistent. Of course there were small differences: the mathematical formulation was continuous and non-stochastic. So, for instance, if the population reproductive rate is 0.3, then the new population size will be:

old population size + old population size * 0.3

This means that if the original population size was 7, then the new population size would be 9.1. In the stochastic discrete model the difference is that the population size is always an integer, and that reproduction is modelled through a random draw from a binomial distribution as follows:

old population + Bin(old population, 0.3)

where Bin() stands for a random draw from a binomial distribution with number of trials = old population and success rate 0.3. In this case the expected value will be roughly the same (9.1), but there will be some chance of smaller or larger numbers (you can visualise this in R with hist(7 + rbinom(100, size = 7, prob = 0.3)); notice that the mean of this frequency distribution will be roughly 9.1).
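For what it's worth, the two updating rules can be compared side by side with a few lines of R. This is just pure growth at rate 0.3 from a starting population of 7, without the density-dependence of the actual model:

# Sketch: deterministic-continuous vs. stochastic-discrete growth,
# using the two updating rules given above.
set.seed(1)

tmax <- 20
r <- 0.3
N0 <- 7

# Deterministic, continuous update: N <- N + N * r
det <- numeric(tmax)
det[1] <- N0
for (t in 2:tmax) det[t] <- det[t - 1] + det[t - 1] * r

# Stochastic, discrete update: N <- N + Binomial(N, r), several runs
nruns <- 5
sto <- matrix(NA, nrow = tmax, ncol = nruns)
sto[1, ] <- N0
for (run in 1:nruns) {
  for (t in 2:tmax) {
    sto[t, run] <- sto[t - 1, run] + rbinom(1, size = sto[t - 1, run], prob = r)
  }
}

matplot(sto, type = "l", lty = 1, col = "grey", xlab = "time step", ylab = "population size")
lines(det, lwd = 2)  # deterministic expectation on top of the stochastic runs

With a starting population of 7 the individual runs wander visibly around the deterministic curve; start with a much larger population and the relative deviations become far smaller.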


The first figure shows an example which compares a single run of the ABM (solid line) with the expectation derived from the numerical equation. Although the exact values are slightly different, the qualitative long-term outcomes of the two cases are identical.

The second figure shows the same model with different parameter settings. The deterministic/numerical simulation predicts an equilibrium around 4 individuals (actually 3.824 persons), while the discrete/stochastic model shows a second increase in population size at around timestep 170. The model actually continues this dynamic (not shown here), with further instances of growth and decline.


The third figure shows the very same model, except that this time one of the parameters was set so that the population size was 25 times larger. Although the timing of the deterministic and stochastic models differs slightly, the overall shape, along with the long-term equilibrium, appears to be essentially the same.

So what is the point of all these figures? If you are dealing with small population sizes, the fact that your model is deterministic and continuous matters and, depending on the model, it matters a lot. This is not just a problem of dealing with continuous numbers (e.g. 1.5 persons): the consequences of small fluctuations are also far bigger in smaller populations (the greatest real-world example is probably the effect of genetic drift). If you are dealing with large populations, chance events will be averaged out before their effects determine any qualitative change in the system. If you are dealing with smaller populations, chance events might radically condition the output.

This happens because some basins of attraction are so small that they can be captured only when we deal with non-discrete population sizes. For instance, the second figure had an equilibrium of 3.824. In a discrete model we have either 3 or 4 individuals, and both are likely to be outside such a basin. If you have a model which is scaled up by x25 (as in the third figure), the equilibrium is ca. 95.59 individuals, and the discrete model can get closer to it (95 or 96), allowing it to reach (and stay within) the basin of attraction. If the process we are trying to model occurs at small population sizes, a continuous model becomes unrealistic, as decimal values are no longer an acceptable starting assumption. Of course you could tweak the mathematics and reason in discrete terms, but this becomes complicated and requires additional assumptions on how to convert continuous outputs into discrete values. In that case ABM is a robust alternative which can definitely make a contribution, despite the known drawbacks.