Kieren Diment - The Perl Survey
Title: The Perl Survey - From "Pilot" to Production
Name: Kieren Diment
Grant Manager: Ricardo Signes
Duration: 6 to 7 months
Started: December, 2008
Amount Requested: $2500
In 2007 Kirrily Robert organised and administered the Perl survey (http://perlsurvey.org) to provide a snapshot of the Perl community. In particular she made significant effort to recruit as many people as possible, resulting in a sample size of around 4500 responses.
While the 2007 survey is an excellent starting point for the design of a survey instrument, it can be improved in a number of ways. These are:
- Removal of as many open-ended questions as possible by recoding into closed categories.
- Improvement on existing analyses. A couple of interesting visualisations aside, existing analyses consist of descriptive statistics. A more sophisticated statistical analysis would be useful in order to establish links between different variables - for example looking at programmer seniority or community seniority versus platform and programming knowledge.
- The rich demographic data would be very useful if complemented by an attitude survey. This way more links can be made between individuals' demographic profiles and what they think of issues relating to Perl and the community surrounding it. I propose to rerun the Perl Survey in February 2009 with this included. This will track changes to the community, and provide useful measurements of community attitudes.
Benefits to the Perl Community
Over the past few years, with the rise of other dynamic languages, Perl has often been described as having an "image problem" - misconceptions about the readability, maintainability and general "hackishness" of Perl code are commonplace. A more complete implementation and analysis of the Perl survey should help dispel some of this image problem, and provide a greater insight into the structure of the community. Although the Perl community is internally cohesive, there seems to be a problem with external communication. Based on the aphorism 'know thyself' and the provision of high quality data analysis, this project should be seen partly as an attempt to move the discussion on within the community, and partly as a resource for people and companies that use or want to use Perl.
- Converting the original Perl survey into a low friction replicable instrument, which can be re-administered periodically to track the state of the community (codename: "The Perl Barometer").
- Scripted inferential statistical analysis for the existing Perl survey data, and for analysis of future runs of the survey.
- A written report extending the "official" report (available at http://xrl.us/bjp5z), as well as regular use.perl.org blog posts outlining progress (at http://use.perl.org/~singingfish)
- A better understanding of the community's attitudes towards the Perl language.
- A framework with which to assess people external to the community's attitude to Perl.
- Review of sources of opinion on Perl across the world wide web. Twitter, in particular, is an interesting and current sample of positive and negative opinions.
- BOF on open source communities at OSDC.au to help inform the content of the attitude survey.
- Public svn or git repository for all work performed on this grant.
Stage 1. Cleaning up of the perl survey data file.
A number of questions are represented in the Data::PerlSurvey2007 datafile as arrays. So for example the 'Programming Languages Known' hash key contains an array listing all the programming languages checked by that individual. From a statistical point of view this leads to a data file that is difficult to analyse. The correct practice is to create a dummy variable for every option that could exist. Again in terms of the 'Programming Languages Known' question, this means that for every individual, each language should be stored under its own hash key rather than as an array element, and the value of the key should be 1 if the language is known to the respondent, and 0 if not. There is also significant extra work in folding the 780 responses in "Other programming languages known" back into the main "Programming languages known" question before dummy variable coding. Unfortunately these responses will need to be processed manually, so this stage is labour intensive.
While there are no other obvious problems with the data file other than this, experience suggests that smaller issues will occur. These should be much less significant than the problem with the "other programming languages" question.
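The dummy-variable recoding described in this stage can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python rather than the Perl and R the project would use, the three records are invented, and the function name dummy_code is hypothetical. Only the 'Programming Languages Known' key comes from the description above.

```python
# Hypothetical records; the values are invented for illustration and are not
# taken from the Data::PerlSurvey2007 data file.
responses = [
    {"id": 1, "Programming Languages Known": ["Perl", "C", "Python"]},
    {"id": 2, "Programming Languages Known": ["Perl"]},
    {"id": 3, "Programming Languages Known": ["Perl", "Ruby", "C"]},
]


def dummy_code(records, key):
    """Replace the list stored under `key` with one 0/1 column per option."""
    # Every option that appears anywhere in the data becomes a column.
    options = sorted({opt for rec in records for opt in rec[key]})
    coded = []
    for rec in records:
        known = set(rec[key])
        # Copy every other field unchanged, then add the dummy variables.
        row = {k: v for k, v in rec.items() if k != key}
        for opt in options:
            row[f"{key}: {opt}"] = 1 if opt in known else 0
        coded.append(row)
    return options, coded


options, coded = dummy_code(responses, "Programming Languages Known")
for row in coded:
    print(row)
```

Each respondent ends up with the same set of columns, which is what most statistical packages expect. The manual folding of free-text "other" answers would have to happen before this step.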
Stage 2. Detailed statistical analysis
Once the data file is cleaned, we can then code each variable into the appropriate statistical data type (i.e. continuous, ordinal, nominal or boolean) in preparation for a more detailed analysis. The open source statistical software R (http://r-project.org) will be used for this analysis, and the scripts to generate the analysis will be documented and stored in a public version control repository.
The first step in a serious statistical analysis of the Perl survey data is to assess the best data reduction procedure to use. Two likely candidates are cluster analysis and multidimensional scaling. This work can be time-consuming due to the need to select and evaluate the performance of a variety of distance functions and clustering algorithms. This is worthwhile with the current data set, as there is a large sample of high quality data, which gives us considerable statistical power. An examination of some of the demographic variables (including measurements of Perl community involvement), and of the relationships between these and any patterns found, is likely to prove interesting. For example, we can see how language preferences differ between "old-timers" and newer programmers, and by level of CPAN contribution and other community involvement.
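To give a rough idea of what evaluating a distance function and a clustering algorithm involves, the sketch below clusters a handful of dummy-coded respondents. The choice of Jaccard distance and average linkage is an assumption made for illustration, not a method this proposal commits to. The data and function names are invented, and the real analysis would be scripted in R rather than Python.

```python
from itertools import combinations

# Hypothetical dummy-coded respondents (1 = language known); invented data.
respondents = {
    "r1": [1, 1, 0, 0],
    "r2": [1, 1, 1, 0],
    "r3": [0, 0, 1, 1],
    "r4": [0, 0, 0, 1],
}


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the options either respondent ticked."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(data, k, dist):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[name] for name in data]

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        c1, c2 = pair
        total = sum(dist(data[a], data[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not c1 and c is not c2] + [c1 + c2]
    return clusters


clusters = average_linkage(respondents, 2, jaccard_distance)
print(clusters)
```

Swapping in a different distance function or linkage rule only requires changing the arguments, which is the kind of comparison this stage would carry out on the full data set.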
This detailed analysis will lead to a much better understanding of the structure of the survey, and this in turn will lead to refinements of the questionnaire based on the data. This leads to stage three - refinement and extension of the existing survey.
Stage 3. Refinement of questionnaire
Once we have a clear picture of the structure of the Perl community, we can then refine the existing questions. For example, from my point of view, there are missing questions about the proportion of work time spent on programming and related tasks. Discussion with the community will reveal others, and the data reduction process from stage two will also reveal other useful lines of questioning for future surveys. The job then is to find the shortest questionnaire for maximum benefit.
Stage 4. Development of Attitude Survey and/or Q methodology.
Attitude surveys are frequently used in social science to understand individuals and communities better. With careful design, these can be useful instruments with which to better profile the community. We propose to develop the questionnaire by analysis of the existing corpus of text in mailing lists and on the web with an intelligent search (text mining) strategy, and to read and code (tag with themes) relevant parts of this corpus. Experience suggests that this process should result in a well targeted 5 minute attitude survey.
I also mentioned using Q methodology. This is another clustering technique where participants rate a reasonably large number of statements on a continuum (actually a quasi-normal distribution - see the CPAN module Statistics::QMethod::QuasiNormalDist for more details). Following this, a correlation matrix between individuals is calculated and subjected to principal components analysis and orthogonal rotation. This procedure results in quantifiable clusters of points of view, and the structure of subjective opinions. The decision whether to use this method really rests on the results of text analysis.
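The first two steps of that procedure, correlating individuals with one another and extracting a principal component, can be sketched as below. This is a minimal illustration with invented Q sorts and hypothetical function names. It finds only the first component, using power iteration as a stand-in for a full principal components analysis, and it leaves out the orthogonal rotation step entirely.

```python
from math import sqrt

# Hypothetical Q sorts: each participant places the same six statements on a
# forced quasi-normal grid from -2 to +2. Invented data.
q_sorts = {
    "p1": [-2, -1, 0, 0, 1, 2],
    "p2": [-1, -2, 0, 0, 2, 1],
    "p3": [2, 1, 0, 0, -1, -2],
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# In Q methodology the participants, not the statements, are correlated.
names = list(q_sorts)
corr = [[pearson(q_sorts[a], q_sorts[b]) for b in names] for a in names]


def first_component(matrix, iterations=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix.

    Assumes the all-ones starting vector is not orthogonal to that eigenvector.
    """
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector gives the matching eigenvalue.
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))
    return eigenvalue, v


eigenvalue, loadings = first_component(corr)
for name, loading in zip(names, loadings):
    print(name, round(loading, 3))
```

Participants whose loadings share a sign hold broadly similar points of view on this component, while opposite signs indicate opposing views. In this invented data p1 and p2 fall on one side and p3 on the other.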
Stage 5. The Perl Survey 2009
We plan to run the second Perl Survey in April 2009 on hosting donated by Shadowcat Systems. Data analysis after this should be straightforward and quick based on the pilot work done on the first survey. There are a number of options for delivery of the survey. I may ask Strategic Data in Melbourne, Australia for a donation of their services, use the existing code that Kirrily Robert wrote, or use the survey management software which I am currently writing as part of another project. My own code is meant to be a generic solution to survey delivery, with ease of administration and encouragement of good psychometric practice, which I hope to be able to release under an open source licence.
I'd anticipate that the survey should be run at 18-month to two-year intervals following this.
Stages 1 and 2, Immediate to October 2008
Cleanup and statistical analysis of 2007 Perl survey results. Plan to release report on analysis for July in order to provide basis for stage 3.
Stage 3, October to December 2008
Analysis of web and mailing list texts. Running a BOF at OSDC.au on open source communities.
Stage 4, January 2009
Analysis of text mining/BOF data, and development of attitude survey and/or Q sort.
Stage 5, Run the second Perl survey through February 2009, with reporting and analysis ready by April 2009.
NOTE: This schedule is based on the date the grant proposal was submitted. Months change, as the grant started in January 2007.
I have been using Perl for research data management, visualisation, analysis and collection since 2002. In 2006 I became involved in the Catalyst project, and am currently the documentation manager. I was editor-in-chief for the 2006 and 2007 Catalyst Advent calendars and I was closely involved in the development of the Catalyst tutorial.
I have ten years' experience of questionnaire design and analysis in health care and management research settings. I have tutored statistics at undergraduate and postgraduate level to students in science, psychology, medicine and marketing. A selection of my publications (with full text) is available at: http://xrl.us/bjwd7 (this does not include current publications in review, pre-2003 articles, and some conference papers).
I am currently trying to develop a substantial research project on open source communities, project viability and aspects of commercialisation. This is an extension of the work that I was involved with on Australian cross-sector R&D organisations from 2003 to 2005. The Perl survey is related to, but separate from my larger research programme.