Google Summer of Code (GSoC)

gerryharp
Offline
Joined: 2010-05-15
Posts: 365

Hi Friends

As you may have seen, posted elsewhere:

http://setiquest.org/wiki/index.php/GSoC

SETI is accepting applications for summer interns under the umbrella of Google's Summer of Code. These internships offer $5,000 for a 12-week period of performance, with all the glory that comes from working with Google and SETI, generating open source code which everyone can see (including future employers), and other benefits.

There are many good projects offered, and one is related to SETI algorithms:

http://setiquest.org/wiki/index.php/Open_Sourcing_of_Exploratory_Techniq...

In this posting, I am providing more information for prospective interns and interested people who may take part in the GSoC project as beta testers / users of the software.

The nature of the project is to have an intern perform coding, testing and documentation of novel algorithms for SETI, the search for extraterrestrial intelligence. The analysis software uses setiData and ultimately identifies and classifies signals found in this data. These signals might take the form of a nearly continuous sine wave or a more complicated form where information is encoded in a wide-bandwidth signal. The goal is to provide a simple set of tools that lower the barrier to entry for all setiQuest participants for data analysis. Analyses with these tools can also be compared to analysis with other tools, such as Baudline, etc.

Some simple Octave codes (Octave = a free, largely MATLAB-compatible numerical environment) are attached below. One task for the intern is to translate the Octave code to C, using numerical libraries such as FFTW (FFTW.org) for numerically intensive sections. We already have some example codes in C which use FFTW, and I'll see about making these available in the future.

The overall goal of the project is to create an open source software solution which allows savvy users to download and compile examples of code that performs analysis of SETI data. Users are then encouraged to modify the code to look for other types of signal. For this, the intern is asked to build a very simple (thus easily modified) environment to support compilation and to support early users of the software. At the end of the internship, a self-contained repository of basic algorithms will be accessible to a large audience and serve as a legacy for both the SETI Institute and the intern.

Check back on this thread for more info about the GSoC on SETI algorithms as it develops.

Cheers

Gerry Harp

ps. Note -- the website does not allow upload of arbitrary file types. This is an issue that must be addressed. Meanwhile, I have (arbitrarily) added ".txt" to the end of the filename for the tar/gzipped Octave code file. Please remove the ".txt" from the filename before gunzipping / untar'ing. Gerry

pps. You have to be logged in to the setiQuest website to see the attachment, immediately below. Please sign up and log in, if you haven't already.

Attachment: octave_progs.tgz_.txt (5.42 KB)
gerryharp
non-standard algorithms / SETI communication

One of our SETI scientists, Doug Vakoch, has produced a book on recent research in ETI communication

http://www.amazon.com/Communication-Extraterrestrial-Intelligence-Dougla...

which includes a paper by yours truly on signal types that may be discovered using an autocorrelation algorithm. We have implemented this algorithm at the SETI Institute and I believe it is implemented also in Baudline (?) and other signal processing packages.

There are many good papers in the book; I encourage you to take a look.

Cheers

Gerry Harp

sigblips
Offline
Joined: 2010-04-20
Posts: 732

Can you explain what these lines mean?

"The goal is to provide a simple set of tools that lower the barrier to entry for all setiQuest participants for data analysis. Analyses with these tools can also be compared to analysis with other tools, such as Baudline, etc."

I don't understand. Why spend the development effort creating a simple set of data analysis tools that you can compare with baudline? Why not just use baudline? You could then build from there like I explain here:

http://setiquest.org/wiki/index.php/Talk:Open_Sourcing_of_Exploratory_Te...
 

gerryharp
use baudline

Hi Sigblips

We have no prior bias against Baudline. Also we're not re-creating everything from scratch. The code is to be rather simple, calling into C libraries for hard-core numerics. We weren't initially planning on a GUI. There may be an important role for Baudline in this project, especially with its excellent user interface.

At the same time, our goal is to minimize the barrier to entry for newcomers to code their own ideas for SETI algorithms. The simplicity of basic C is attractive on this account.

One requirement by Google is that the software produced should be released under "an Open Source Initiative approved license" (http://www.opensource.org/licenses/). Our current plan is to release under GPL3, using only compatible libraries / tools. Under what license is Baudline released?

Thanks

Gerry

sigblips

How I proposed using baudline with stdin as a display tool is completely compatible with the GPLv3. It's a tool, use it, build off of it. Your goals description sounded like you wanted to build something comparable.

gerryharp
excellent!

Hi Sigblips

Excellent news. You already know the setiData format (binary complex 8-bit integers: real, imag, real, imag, ...). Is this a format compatible with Baudline's input parsing?
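
For readers following along, here is a minimal sketch of reading that format. It is written in Python for brevity (the project itself targets C), it assumes the 8-bit values are signed two's-complement integers, and the function name parse_setidata is made up for illustration.

```python
import struct


def parse_setidata(raw):
    """Convert interleaved signed 8-bit (real, imag, real, imag, ...) bytes
    into a list of complex samples. A trailing odd byte is ignored."""
    n = len(raw) // 2
    # "b" is the struct code for a signed 8-bit integer (assumed encoding).
    values = struct.unpack("%db" % (2 * n), raw[: 2 * n])
    return [complex(re, im) for re, im in zip(values[0::2], values[1::2])]
```

A whole file can be passed in as `parse_setidata(open(name, "rb").read())`, although for the full-size data files it would be wiser to read and convert one block at a time.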

Thanks

Gerry

sigblips

Yes, here is an example that would do a power autocorrelation on the 8-bit quadrature setiData:

cat *.dat | baudline -stdin -format s8 -channels 2 -operation magnitude -transform autocorrelation

Baudline also supports 16-bit integer and 32-bit float samples via standard input.  I recently added binary and 4-bit samples in my unreleased prototype version.  Baudline can also stream in frequency domain data so you can use it solely as a graphical renderer.

In this wiki talk page:

http://setiquest.org/wiki/index.php/Talk:Open_Sourcing_of_Exploratory_Te...

I suggested creating a seti_filter program that would do your custom processing and then send it into baudline for visualization. This would allow for a very flexible architecture with a command line like this:

cat *.dat | seti_filter | baudline -stdin -format le32f -channels 2 -quadrature

You can imagine stringing multiple different filter programs together with stdin/stdout to create a very complex signal chain.

gerryharp
using Baudline

Hi Sigblips

Thanks for the examples and information on use of Baudline. This looks very handy. Baudline could be a big help for visualizing our results and doing conventional data transforms. Your suggestion of SETI filter programs that operate on data and then pipe to Baudline looks promising.

After we get through the application process  we can revisit this discussion in more depth.

Thanks

Gerry

gerryharp
project schedule

Hi

One applicant asked for more information about the schedule. Here is my response:

The project is scheduled to begin essentially now with the gathering of requirements. This is enabled by you and other prospective interns with your questions that sharpen the details of the project.

When a proposal is accepted by Google/SETI, we will continue conversation with the intern (email and/or skype) to focus the project on an agreeable set of specific goals for both SETI and the intern (general goals are already defined).

When the summer starts (date specified by Google at http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011...) we plan to begin code development and testing (using real astronomy data). As development proceeds, there may be clarifications and/or small changes in requirements. We will also immediately release all code into the open source realm and add the intern's code to this base during development.

By the end of summer, a complete stand-alone product will be available to (and tested by) the open source community. At that point, the intern may wish to continue to manage / develop the code base, or not, depending on the desires of the intern and the SETI Institute.

Gerry

gerryharp
where do I start?

One applicant asks, where do I start?

This link appears to give you a place to start:
http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011...

When you write your application, sell yourself to us. Focus on the skills you have to offer, and document your personal list of successful undertakings. Be sure to explain your previous experience developing code both in and outside the classroom. If you pursue SW/HW projects outside of the classroom, then tell us about those projects. If you have used / compiled from source / participated in development of open source code, then please say so and if possible give us a link to your code.

Besides software development, if you are accomplished in athletics or music, are an Eagle scout, previously worked at full-time/part-time jobs, etc. then adding this type of information is beneficial. These details show you have life experience and the ability to stick with your goals.

It is good to show enthusiasm about parts of the project like SETI, numerical processing, open source, .... Justify your comments with personal anecdotes (stories) of prior activities in the area of interest. If there is a personal statement, please don't begin with "Ever since I was a child, I've been fascinated while looking at the stars..." (imagine me rolling my eyes). This is true for almost anyone. Begin with something more like this: "My career in (computer science / astronomy / whatever) began during my Nth grade of high school, when I carried out ..." I'm trying to say that accomplishments and facts are stronger than idle wishes.

Document your availability for the project. During what times will you be available to work? What times are you available for contact? If you have a concurrent (part-time) job lined up, how will you arrange to spend the appropriate time on GSoC? Having a part-time job doesn't break your proposal as long as you have a plan to mitigate distraction and a work schedule.

The proposal may sound daunting if you haven't written something like this before. It is not difficult. Take a look at the very simple Octave code that is posted. Read the code and try to understand what each source program does. By the way, this is a random sampling of only a fraction of the existing code, and there is more in C.

Get a piece of paper and make a plan. How would you approach this project? You'll need to acquire the following skills and items: (1) A place to work. You need a computer to work on, the GNU C compiler (or Sun Java compiler), and a text editor / code development environment. Download and install the latest version of FFTW (FFTW.org). (2) A familiarity with "configure" and "make" on Linux, or a freeware equivalent. Talk about how you would like to package the open source code and allow for easy download / compilation by users. (3) Talk about documentation. For one thing, you will be responsible for reading and addressing the comments of beta-testers on the setiQuest website. You will want to produce documentation that users can look at for instruction and explanation, and a FAQ.

Explain how your experience will help you with your proposal. You are not expected to know everything from today. On-the-job learning is fine. Block out a schedule that accounts for the startup period (learning), coding / documentation, and generation of code release. The best plans produce documentation during or immediately after the generation of a single piece of source code (that is, all through summer) so there is no documentation backlog at the end of the project. Also, you should plan for a staged release -- meaning that as soon as you are satisfied with implementation of some algorithm, you plan to post it for beta-testers to try right away.

Keep in touch,

Gerry

gerryharp
in addition
  • For this particular project, the successful applicant will know some (have learned) calculus. During the period of performance, you are expected to become familiar with the Fourier Transform
    http://en.wikipedia.org/wiki/Fourier_transform

    and the Convolution Theorem,
    http://en.wikipedia.org/wiki/Convolution_theorem

    both of which can be found on Wikipedia.
     

  • Don't forget to drop us a line via direct email if you are planning to write a proposal for this project. This is an important step! Contact gharp(at)seti(dot)org.
gerryharp
developing new algorithms

Question: Will I be asked to develop new algorithms for the project?

Answer: No, this is not in the main stream of the intern's duties.

Having said that, interns are encouraged to be creative and suggest alternative ways of performing analyses of SETI signals. For example, there are a wide variety of satellite communication methods (amplitude keying or phase shift keying, etc.) that might be examined and ported to the SETI search. The team at SI and the intern will work together on suggesting and implementing several new search algorithms over summer.

gerryharp
example of one algorithm

Question: Can you give me an example of the kind of algorithm one has to deal with in this project?

One of the standard algorithms in SETI is described here:
http://setiquest.org/wiki/index.php/Waterfall_plot

A newer algorithm begins with a very long array of integer values (stored in a file) representing the electric field amplitude (radio signal) as a function of time. This is very much like a WAV sound file, where each byte represents the amplitude of the measured signal at a certain time. If you sent one of these signals directly to your sound card, you would probably hear white noise from the speaker that sounds like an AM radio tuned between stations (kshhhhhhhh). The algorithm I will describe starts with this data and generates "autocorrelation waterfalls" from it.

The radio signal is read from file, one million (actually 2^20) bytes at a time. This number series is passed to a Fast Fourier Transform routine (FFT), which returns one million floating point numbers. Then these numbers are each squared, then fed into an inverse FFT. This produces an "autocorrelation power spectrum." The results of the inverse FFT (one million floats) are scaled, converted to 8-bit integers, and stored. This process is repeated many times (typically N = 100) until there is no more data in the signal file.

The result is a rectangular array of bytes, which we can view as a grayscale  image which is 1 million pixels long and N raster lines high. The image is written out to disk in sections, perhaps 800 pixels in width and 100 pixels high. These images are a new kind of "waterfall" representation of our SETI data. They can be examined by eye or with numerical algorithms that spot certain signal types (like straight lines) in the images. There are many variations on this theme worth consideration and development.
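
As a rough illustration of the steps above, here is a small sketch of producing one raster line. It is in Python for brevity, whereas the project targets C with FFTW. It reads "squared" as taking the squared magnitude of each FFT output, which is an assumption, and the function names and the scaling to 0..255 are made up for illustration. The toy FFT here is only practical for short blocks, not the 2^20-sample blocks described above.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized). len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def autocorrelation_row(samples):
    """One raster line: FFT -> squared magnitude -> inverse FFT -> 8-bit scale."""
    n = len(samples)
    spectrum = fft([complex(s) for s in samples])
    power = [abs(v) ** 2 for v in spectrum]
    # The inverse FFT of the power spectrum is the circular autocorrelation.
    acf = [abs(v) / n for v in fft(power, inverse=True)]
    peak = max(acf) or 1.0  # avoid dividing by zero for an all-zero block
    return bytes(round(255 * v / peak) for v in acf)
```

Calling `autocorrelation_row` on successive blocks of the file and stacking the returned rows gives the rectangular byte array described above. A signal that repeats every 8 samples, for example, produces bright pixels at lags 0, 8, 16, and so on.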

Once implemented, I imagine that there might be a few small programs that pipeline this processing. For example, the first program (FFT) might read 1,000,000 bytes, perform the FFT, output the results to STDOUT, and then read more data from file, repeating until the data is exhausted.

The second program might square the data (SQUARE). The third program might use FFT again but with an "inverse" option, and the results are written to file for later visualization. A command line invocation of these programs might look like:

FFT --infile input_file.dat --length 1000000 | SQUARE --length 1000000 | FFT --stdin --length 1000000 --inverse --outfile autocorr_image.dat**

** Note: the above example is pseudo-code. For an accurate representation of invoking programs represented in the setiQuest C-code source, please see later posting.

Here the input data file has name "input_file.dat", and the length of the FFT (and image raster lines) is specified by --length. The last invocation of FFT uses the --inverse option, reads from STDIN, and sends to the specified output file "autocorr_image.dat." Other programs can then be used to generate images in, say, png format.
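
To make the idea of a pipeline stage concrete, here is a minimal sketch of what a SQUARE-like block could look like, in Python rather than the project's C. The post above does not specify the data format passed between stages, so this sketch assumes interleaved little-endian 32-bit float (real, imag) pairs; the function name square_stage and that format are both assumptions.

```python
import io
import struct


def square_stage(src, dst, length):
    """Read blocks of `length` complex samples from src (binary stream),
    write the squared magnitude of each as a (value, 0.0) pair to dst."""
    block_bytes = length * 8  # two 4-byte floats per complex sample
    fmt = "<%df" % (2 * length)
    while True:
        block = src.read(block_bytes)
        if len(block) < block_bytes:
            break  # end of data; a trailing partial block is dropped
        values = struct.unpack(fmt, block)
        out = []
        for re, im in zip(values[0::2], values[1::2]):
            out.extend((re * re + im * im, 0.0))
        dst.write(struct.pack(fmt, *out))


# Small in-memory demonstration with two complex samples: 3+4j and 1+0j.
demo_in = io.BytesIO(struct.pack("<4f", 3.0, 4.0, 1.0, 0.0))
demo_out = io.BytesIO()
square_stage(demo_in, demo_out, length=2)
print(struct.unpack("<4f", demo_out.getvalue()))  # (25.0, 0.0, 1.0, 0.0)
```

In a real pipeline the same function would be called with `sys.stdin.buffer` and `sys.stdout.buffer`, so that the stage can sit between two pipes on the command line.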

This is one simple example of how the software might work. We will work out the final details once the project is started.

Feel free to ask questions if my meaning is unclear.

Gerry

sigblips

I like the idea of using standard input and output to chain together small processing blocks. It seems flexible and it will allow the creation of some rather complicated signal processing chains.  Here is an example command line of how to take the autocorrelation chain that Gerry described above and pipe that into baudline for real-time graphical visualization:

FFT --infile input_file.dat --length 1000000 | SQUARE --length 1000000 | FFT --stdin --length 1000000 --inverse | baudline -stdin -channels 1 -format s8 -fftsize 2097152 -slizesize 1000000 -transform raster -record

The autocorrelation slices will stream into baudline and it will plot them as a real-time scrolling spectrogram. Some color range scaling and horizontal axis zooming may be desired.  Baudline can be paused, individual slices examined, measurements made, time summations performed, and data can be played out of the computer's speaker for audio listening.

Quiz: If the "-transform raster" in the example above was changed to "-transform fourier" what would we be looking at?

Here are some suggestions for the above command line processing chain. Instead of signed 8-bit samples I would use 16-bit integer or 32-bit float samples for improved resolution and a cleaner signal. I would also use an FFT size that is a power of 2 for improved performance (1048576 instead of 1000000). Here is how baudline can be used to do the same autocorrelation example natively:

cat input_file.dat | baudline -stdin -channels 1 -format s8 -fftsize 1048576 -transform autocorrelation

So the true value of this setiQuest/GSoC project is not going to be doing the autocorrelation since that is trivial. The true value will come from what other steps are introduced into the signal processing chain. Imagine inserting several new and creative signal processing blocks into the command line chains described above.

Quiz: What new and creative signal processing blocks can you think of? What would they do? Why would they be useful?

[note: If you post answers to the quizzes here then everyone will see them. You might want to wait until after April 22nd to do that.]

gerryharp
references to learn more

Question: I have found some books in my university library related to extraterrestrial intelligence communications; however, many of them are from the 70s and 80s. How much have things changed since then? Would it be worthwhile reading these, or should I focus on finding more current papers and books?

The arguments and methods for discovering ET have changed remarkably little since the '80s (until about 2009). The standard references include

Classic and still relevant:
Project Cyclops (http://www.amazon.com/Project-Cyclops-Detecting-Extraterrestrial-Intelli...)

More recent take on how to perform the SETI search:
SETI 2020 (http://www.amazon.com/SETI-2020-Roadmap-Extraterrestrial-Intelligence/dp...)

These references are dense. Feel free to check out explanations at setiQuest.org, and for 3rd party sources of information, check Wikipedia and the setileague.org website.

If you learn by listening, then check out a recent video lecture by Gerry Harp about programs at the SETI Institute (late 2009) at
http://www.youtube.com/watch?v=lEFLFs_xQ8k

In 2010 especially, a number of new theoretical studies have begun to look for alternative means to send signals between stars (that is, instead of sinusoidal tones, sending information-bearing signals).

I will update this message with more info if I can lay hands on video presentations or preprints (apart from my own) that address these ideas.

khrm
Offline
Joined: 2011-03-20
Posts: 39

Well I can't see the file. And I am logged in.

gerryharp
getting files

Hi

If anyone else is still having trouble with this problem, please email me at gharp(at)seti(dot)org.

Gerry

gerryharp
Small C-code library of routines

Hi

Rob has provided us with some source code he has developed. This is written in C and is the prototype for what we will be developing in summer. I'm posting only a small fraction of the code (see below). If you have trouble with access, email gharp(at)seti(dot)org. I have already sent this code to all potential interns who have contacted me directly. If you have not received it by email, you're not on our list so please contact us. Note that Google advises that we do not consider proposals from interns who do not contact us directly before the proposal deadline.

The code bundle contains enough to implement one interesting algorithm. To call this algorithm on some setiData datasets (available for download on this site), one could use the following command line:
 

cat *-8bit-{01,02,03}.dat | sqsmpls -l 65536 | sqwndw -l 65536 | sqfftw -l 65536 | sqpwr -l 65536 | sqsum -l 65536 | sqreal -l 65536 | sqread -c 1 > baseline.dat

If you can't read the files or something goes wrong, check the ftp site:

ftp://ftp.seti.org/gharp/GSoC/c-files.tar.gz

Good luck with your proposals.

Gerry

saksham.bhatla
Offline
Joined: 2011-03-20
Posts: 2

You can see the data at http://setiquest.org/getting-data

To access the raw data (*.dat files) click on the TITLE of any of the datasets listed.
Be prepared for a long wait; each data file is 2 GB. For testing, you need only one file.

gerryharp
where to find setiData

Someone asked: where can I see/download the raw data used in this project?

Here:

http://setiquest.org/getting-data

Gerry

gerryharp
raw data is accessible by clicking on title

Some interns and I have been confused by the appearance of the page at

http://setiquest.org/getting-data

You can download waterfalls from this area and they are clearly linked. What may not be obvious is that if you click on the TITLE of each dataset, you can get to the raw time series data, such as the file with name

2010-08-13-bl0716-714_1435.1072_1-8bit-01.dat

Almost all of these files are large (2 GB). This is more than you need for simple testing, and I usually use the Linux tool "split" to break the data up into smaller files. If you do this, it is recommended that you choose a file size that is a power of 2 in bytes.
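
For anyone without the "split" tool at hand, the same idea can be sketched in a few lines of Python. The function name, the default chunk size of 2^24 bytes, and the output naming scheme are all arbitrary choices made for illustration.

```python
def split_file(path, chunk_bytes=1 << 24, prefix="chunk"):
    """Break a large data file into pieces of chunk_bytes each.
    chunk_bytes must be a power of 2; the last piece may be shorter."""
    if chunk_bytes < 1 or chunk_bytes & (chunk_bytes - 1):
        raise ValueError("chunk_bytes must be a power of 2")
    names = []
    with open(path, "rb") as src:
        index = 0
        while True:
            block = src.read(chunk_bytes)
            if not block:
                break
            name = "%s-%04d.dat" % (prefix, index)
            with open(name, "wb") as dst:
                dst.write(block)
            names.append(name)
            index += 1
    return names
```

A power-of-2 chunk size keeps each piece an exact multiple of any power-of-2 FFT length, and since it is even it also never cuts a (real, imag) byte pair in half.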

Hope that helps you get started with a look at the data.

Gerry

gerryharp
questions about timeline and organization

Question: Is there anything outside of simply applying that we're supposed to do? And on the application, what would you like to see for the schedule part? Given the project description, I'm having trouble thinking of how to lay out a tentative schedule.

There have been questions about detailing the proposal plans. Here are some suggestions to help you get started. If you have already submitted (or written) a proposal, check that these points are addressed and consider adding a short section to the existing proposal with a numbered list of activities along these lines:

  1. Bonding period: Examine the existing code base. Consider if there are any tweaks to I/O (pipes in Linux shell) that should be considered before going forward. Examine preexisting libraries (e.g. fftw) and tools (baudline) that provide utilities necessary for the project. Work with mentors on ideas for new algorithms that will be implemented over summer.
  2. Date 1 to Date 2: Working with mentors, choose subset of code to start with. Write compilation scripts for code base. Become intimate with the code through testing real time series data from setiQuest project website. Write simple tools to generate two kinds of test data (e.g. sine wave generator + noise, repeating signal + noise).
  3. Dates 2-3: Provide basic documentation for base code, compilation. Using test data, design a test sequence that verifies compilation of (several) codes in the base. This becomes the initial release of open source code for this project.
  4. Dates 3-4: Add new algorithms to code base. These can be a rearrangement of initial pipeline or be produced by new code written by intern. Not required, but if you think of an algorithm you'd like to try then mention it. Document new algorithm(s).
  5. Dates 4-5: Take feedback from open source users and where necessary improve code.
  6. Second major release.
  7. Iterate until end of summer.
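
As an example of the kind of test data generator mentioned in step 2, here is a minimal sketch of a "sine wave + noise" tool. It is written in Python for brevity (the project targets C), it assumes the signed 8-bit interleaved (real, imag) layout of setiData, and the function name, default amplitude, and noise level are arbitrary illustrative choices.

```python
import math
import random


def sine_plus_noise(n_samples, freq_bins, amplitude=20.0, noise_sigma=10.0, seed=0):
    """Return interleaved signed 8-bit (real, imag) bytes holding a complex
    tone that completes freq_bins cycles over the block, plus Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the test data is reproducible
    out = bytearray()
    for t in range(n_samples):
        phase = 2.0 * math.pi * freq_bins * t / n_samples
        re = amplitude * math.cos(phase) + rng.gauss(0.0, noise_sigma)
        im = amplitude * math.sin(phase) + rng.gauss(0.0, noise_sigma)
        for v in (re, im):
            clipped = max(-128, min(127, round(v)))
            out.append(clipped & 0xFF)  # store as two's-complement byte
    return bytes(out)
```

Because the tone lands exactly on one FFT bin, a correct FFT/power pipeline should show a single bright bin at freq_bins, which makes this a convenient check that a freshly compiled code base is working.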

This outline, with dates provided by you, is just a crude outline. Creative applicants should not feel constrained by the outline. All proposers should add their own details and ideas. Feel free to rearrange the order of, rephrase, add to, the outline above. We will be looking for students who think this through and have a plan.

Thanks to all the diligent proposers who continue to ask questions. Please keep checking back for more information.

Gerry

gerryharp
related question

Question: I read on the forum that we need to know about the functionality of 'make' and 'configure' on Linux. Are we required to say in our proposal how we are going to package the code we develop?

It would improve your proposal to acknowledge the requirement to provide compilation scripts and package the open source code resulting from this project. If you provide configure and make scripts, and create a tarball of these scripts and all source code, that is sufficient. In the message above, I suggest that you propose to write a couple of test routines to allow the user to verify the code base.

gerryharp
Physics question

Question: The paper talked about the speed of radio waves being lower than light speed in parts of the interstellar medium (ISM). How come light still travels at light speed then, since light is an electromagnetic wave, as are radio waves? Is that due to the duality of light?

You are correct, radio waves and light are essentially the same thing. They're both EM waves and travel at the speed of light, in vacuum.

The problem is that the ISM is full of plasma -- charged particles including electrons. The light / radio waves interact with the electrons in a complex way. Suppose you have linearly polarized light travelling along the z-axis (to the right) with electric field oscillating on the y-axis (vertical). As the oscillating field passes the electrons, they respond by oscillating up and down.

However, the electrons lag the field -- they oscillate with the same frequency, but there is a phase shift (retardation) of the electron oscillation relative to the driving force of the electric field.

So now the electrons are oscillating, which looks very much like a transmission antenna. The electrons re-radiate the light, with equal amounts of absorption and transmission in the steady state. Since the electrons are retarded in motion, the emitted radiation is also retarded. The original light and re-emitted light sum together, forming a new EM field that is slightly retarded in phase relative to the original EM field. This interaction has the overall effect of causing light propagation to slow down when passing through the plasma. This effect is called dispersion.

All EM radiation, from radio to light to x-rays is retarded (slowed down) relative to light speed in vacuum. At optical frequencies the retardation is small (and has a small first derivative w.r.t. frequency). For this reason at optical frequencies it is often OK to ignore dispersion.

At radio frequencies the dispersion is larger. At even lower frequencies light slows down even more until it comes to a complete stop below a critical frequency (omega_p, the plasma frequency, which is of order a kHz in the ISM). To say "light has stopped" is to say "light will not propagate." That is, if you set up an antenna and attempt to transmit radio waves below this frequency, they will not exit the antenna but be reflected back to the transmitter.
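
To put rough numbers on this, here is a small sketch using the standard textbook result for a cold, unmagnetized plasma, where the cutoff is set by the electron plasma frequency and a signal travels at the group velocity c * sqrt(1 - (f_p/f)^2). The coefficient of about 8980 Hz per square root of electron density (in cm^-3) is the usual approximation, and any density you plug in (for example 0.03 cm^-3) is only an illustrative guess for the ISM, not a measured value from this project.

```python
import math


def plasma_frequency_hz(n_e_cm3):
    """Approximate electron plasma frequency for a density in cm^-3."""
    return 8980.0 * math.sqrt(n_e_cm3)


def group_velocity_fraction(freq_hz, n_e_cm3):
    """Signal speed as a fraction of c in a cold, unmagnetized plasma.
    Returns 0.0 at or below the cutoff, where waves do not propagate."""
    fp = plasma_frequency_hz(n_e_cm3)
    if freq_hz <= fp:
        return 0.0
    return math.sqrt(1.0 - (fp / freq_hz) ** 2)
```

At GHz frequencies the result is extremely close to 1, so the slowdown only shows up as a small frequency-dependent delay accumulated over interstellar distances.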

Gerry

gerryharp
Should we have a GUI?

I've received many questions about creating a complex GUI to "simplify" processing.

Firstly, we're open to simple approaches (GUIs) that allow us to visualize data, cf. the discussion with SigBlips.

In my opinion, working up a complex GUI (to e.g. organize code blocks into algorithms) from scratch would probably take longer than the entire summer. However, starting with a firm GUI code base (perhaps using baudline, or another convenient starting point) it would be possible to write software around it that is specific to our task.

The main goal of our written project idea is to write C-code, driven from the command line, with inputs and ultimate outputs to files. The files can then be visualized (GUI'd -- that's a verb! ha ha) with whatever tools we feel comfortable with, including Baudline.

If you are interested in serious GUI work, then I suggest you invent your own idea of how the GUI would work with the C-based software (presumably developed by someone else). Be sure to specify which GUI package you will use for starters. Then we can examine your GUI proposal as a different project alongside the algorithm development project.

Thanks

Gerry

rraf
Offline
Joined: 2011-03-23
Posts: 2
RE: Should we have a GUI?

I have played with baudline the last few days and it's a pretty impressive piece of software; replicating it is no easy task. The actual GUI becomes irrelevant in such software and will probably end up being incredibly slow.

As a do-able project one could build a wrapper on top of baudline+algorithms to be developed in C+other things. This should *only* be regarded as an alternative for people not used to the CLI and be used to construct the pipeline then fork and execute the pipeline or baudline.

I recommend TCL/TK or Perl/TK for such a project.

--
Alin Rus

rraf
RE: Should we have a GUI?

Though, I have to say this will take at most one lazy afternoon and I doubt it could be taken seriously in the GSoC program.

gerryharp
visualizing setiQuest data

Hi all

One important way of visualizing seti data is with so-called waterfall plots. See

http://setiquest.org/wiki/index.php/Enhancement_of_Algorithm_to_Detect_P...

for examples. The "signal" in this data pops out as a connected string of high-intensity white points on a black background. The horizontal axis is frequency, increasing to the right. The vertical axis is time. New times are at top, old times at bottom. The frequency binning on the horizontal is ~1 Hz and the time binning on the vertical is ~1 sec. Notice that we are limited by the Heisenberg uncertainty principle (or Nyquist theorem) in our choice of frequency and time binning.
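
The trade-off between the two binnings can be seen with a little arithmetic. In the sketch below, each raster line is assumed to come from one FFT of fft_length samples with no overlap between lines; the function name and the sample rates used in any call are hypothetical, not values from the setiQuest system.

```python
def waterfall_resolution(sample_rate_hz, fft_length):
    """Return (frequency bin width in Hz, time per raster line in seconds)
    for a waterfall built from non-overlapping FFTs of fft_length samples."""
    freq_bin_hz = sample_rate_hz / fft_length
    time_bin_s = fft_length / sample_rate_hz
    return freq_bin_hz, time_bin_s
```

Whatever values are chosen, the product of the two bin sizes is 1: a longer FFT gives finer frequency bins but coarser time bins, and vice versa. With bins of ~1 Hz, each raster line therefore needs ~1 sec of data.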

Aren't the coordinate axes of the waterfall kind of arbitrary? Yes. Besides frequency versus time, there are an infinite number of ways to represent the SETI data, e.g. autocorrelation versus time (phase modulation search), power autocorrelation versus time (amplitude modulation search), total power versus time (pulse search), chirp power vs chirp rate (dilation, this is essentially a search in dispersion space), matched filter power versus dilation (e.g. search for the binary expansion of Pi), first eigenvalue versus time (a la Maccone), und so weiter.

We enjoy two-dimensional search spaces since they map easily to an image. The human eye processes images efficiently, so this is a good way of displaying results. It is relatively easy to throw an image up on the screen for viewing. In the current SETI system, highly efficient image search algorithms have been developed, but currently they do not address signals like those in Dr. Tarter's writeup.

I am interested in hearing your ideas for algorithmically detecting "wiggly" signals like those displayed in Dr. Tarter's idea. For example, over a short time span (as short as two raster lines), the wiggle is well-approximated by a straight line. Over longer times, it looks like a (possibly filtered) random walk. How would you write a program that identifies such signals? This could be one of the algorithms we address in the "exploratory" project, in collaboration with Dr. Tarter and her intern.
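One possible starting point, offered purely as a sketch and not as anything the current SETI system does: treat the waterfall as a grid and look for the connected path with the largest summed power, where the path may move at most max_step bins between adjacent raster lines (the "locally a straight line" observation). Dynamic programming finds that path in one pass. The function name best_track and its parameters are hypothetical.

```python
def best_track(waterfall, max_step=1):
    """Return (score, path) for the highest-power connected path.

    waterfall: list of rows (oldest first), each a list of bin powers.
    path[i] is the bin chosen in row i; adjacent rows differ by at most
    max_step bins.
    """
    n_bins = len(waterfall[0])
    score = list(waterfall[0])
    back = []  # back[i][b] = best predecessor bin for bin b in row i + 1
    for row in waterfall[1:]:
        new_score = []
        pointers = []
        for b in range(n_bins):
            lo = max(0, b - max_step)
            hi = min(n_bins - 1, b + max_step)
            prev = max(range(lo, hi + 1), key=score.__getitem__)
            new_score.append(score[prev] + row[b])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Walk the back-pointers from the best final bin to recover the path.
    end = max(range(n_bins), key=score.__getitem__)
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[end], path
```

To turn this into a detector one would still have to compare the returned score against what noise alone produces, which this sketch does not attempt.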

I'd like to discuss this topic in the open, so please reply here.

Thanks

Gerry

gerryharp
Offline
Joined: 2010-05-15
Posts: 365
the goal is reference implementations

Question: How about my plan (thus and so) to improve the efficiency of the code?

I thought it would be helpful to comment on the scope of the project. The main goal is to create reference implementations of at least a dozen or so algorithms and test data generators (possibly many more). As we are receiving feedback from prospective interns and as our own work progresses, we are moving solidly in the direction of a C-based software structure.

Having said this, it is straightforward to compile the programs into a library and write wrappers for them in C++ or Java. If you prefer an object oriented language structure, consider this resolution that satisfies all interests.

We expect the algorithms to be pretty efficient, especially if they are cast into a form that employs the fast fourier transform (FFT) for heavy lifting. Efficiency is not a main goal, whereas code readability is highly prized.

The capable intern may complete the project requirements in a short period of time. In this case, the intern is encouraged to consider efficiency improvements, GUIs, or other improvements that are of interest.

The open source user may choose to work with the plain vanilla C-code. We envision that some users will copy and paste algorithm "cores" from our software into their own applications, transliterating to a new language as necessary. For this reason, we look to C as one of the most portable and accessible languages.

I hope that helps,

Gerry

gerryharp
Offline
Joined: 2010-05-15
Posts: 365
"none" in the reply box

Hi All

We've received high-quality proposals for the "Open Sourcing of Exploratory ..." project.

If you have already submitted your proposal (by Thurs morning, PDT) then I have probably looked it over. Since this is preliminary, I have not written comments to anyone with a good quality proposal. When you look for comments, you may find "None" in the comments box. This simply means I had no comment (yet). That's all.

Gerry

sigblips
Offline
Joined: 2010-04-20
Posts: 732
A GSoC applicant asked me

A GSoC applicant asked me some project questions involving baudline in a private conversation. This is not a beneficial use of the limited setiQuest bandwidth. I'm not even a GSoC mentor and this has consumed a fair amount of my time. We really should have had a requirement that these sorts of conversations be public so everyone could benefit. Here goes:

Q: Is baudline's auto drift rate integrator feature based on the DADD algorithm?

A: Baudline's Auto Drift feature was inspired by Project Phoenix's DADD algorithm. It is very similar but I would not say that Auto Drift is based on DADD simply because of the very limited documentation that was available to me many years ago when I created it.
.

Q: What is the "raw parameters" format?

A: There are many different audio file formats such as .wav, .aiff, and .mp3 files. The Raw Parameters window allows the reading of raw data files that don't have a specified format. So basically the "raw parameters" format is a non-format. Raw Parameters is also a powerful tool for exploring unknown data file formats.  For more information see http://baudline.com/manual/open_file.html#raw_parameters and make sure you also check out the very useful Bit View.
.

Q: How do I get the setiData into baudline?

A: There are currently two methods. For smaller files <= 2 GB use the Raw Parameters method mentioned above. The second method is to "cat file | baudline -stdin" which records the standard input (stdin) stream and allows the down mixing.  For more information see http://baudline.blogspot.com/2010/05/setiquest-tutorial.html
.

Q: Can I use baudline to listen to the setiData?

A: Yes. First load the setiData into baudline by using one of the methods mentioned above and then use the Play Deck to play the audio and apply some DSP filtering.  For an example see http://www.youtube.com/watch?v=wGUMcuCp9yY&hd=1
.

Q: If baudline can already play the audio then why even bother with this project?

A: This project is not about baudline. There are all sorts of variations (outside of baudline) to the different processing / analysis steps that can be done.  A more powerful concept would be to add value by doing something like the following command line "your_DSP_program | baudline -stdin". Baudline's -stdout (standard output) might also be useful. You can do stdin and stdout with Unix FIFO's too. The impressive part is what it allows you to create.  Think of it as a building block in a larger and more complicated processing chain.  Baudline is a tool, use it as leverage to do bigger and greater things. This is where the real power is. The only limit is your creativity.
.

Q: The setiData is enormous. Can I analyze it by listening to it?

A: It is not feasible to listen to or to manually analyze all of the data collected by the ATA.  That is why the automated SonATA system is so important for the search.  Listening is a valuable tool for better understanding a signal once it has been detected.
.

Q: How can I incorporate baudline into this GSoC project?

A: Think of baudline as a tool.  What can you do to add value to using this tool?  Some -stdin and -stdout pre-processing in your own code could be a powerful way to leverage that. You can also use baudline's -stdin feature to visualize and test the output of your DSP algorithms.
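As a rough illustration of what such a pre-processing program can look like, here is a minimal stdin-to-stdout filter sketch in Python. It assumes little-endian signed 16-bit samples and simply scales them; the gain step is a placeholder for real DSP, and the names scale_samples and run_filter are invented for this example.

```python
import struct


def scale_samples(raw, gain):
    """Scale little-endian signed 16-bit samples by gain, clipping to range.

    A trailing odd byte, if any, is dropped.
    """
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    scaled = [max(-32768, min(32767, int(round(s * gain)))) for s in samples]
    return struct.pack("<%dh" % count, *scaled)


def run_filter(inp, out, log, gain=0.5, chunk_bytes=4096):
    """Copy inp to out chunk by chunk, applying scale_samples to each chunk."""
    total = 0
    while True:
        chunk = inp.read(chunk_bytes)
        if not chunk:
            break
        out.write(scale_samples(chunk, gain))
        total += len(chunk)
    # Diagnostics go to the log stream (stderr), never into the sample stream.
    print("processed %d bytes" % total, file=log)
    return total


# To use this as a command-line filter, a script would end with:
#   run_filter(sys.stdin.buffer, sys.stdout.buffer, sys.stderr)
```

Saved under a hypothetical name such as scale.py with that last call enabled, it could sit in a pipeline like "cat file | python3 scale.py | baudline -stdin -format le16 -channels 1".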

gerryharp
Offline
Joined: 2010-05-15
Posts: 365
audio

Thanks for the message, Sigblips. Just a few additions:

The "Exploratory" project discussed in this thread is not necessarily about audio playback. It is about data transformation and analysis. Through discussions on this thread, I've become more interested in DaDD and related algorithms to discover signals. These are all "computational" methods to find signals. After the programs are invoked, humans are not involved in the processing.

By contrast, there is a difficult problem of taking discovery results and presenting them to humans. This is an area of human cognition. What color / intensity scales make waterfalls most understandable? Can the brain process radio signals as sound to learn more about them? (yes) Are there other ways of bridging the computer/brain interface? For example, how about converting signals to a representation in the English language. For this to happen, we'll have to programmatically distill a lot of information from the data.

Also, the setiQuest library of codes will very likely leverage baudline and other tools and also be capable of running stand-alone.

Gerry

sigblips
Offline
Joined: 2010-04-20
Posts: 732
I agree Gerry with everything

I agree Gerry with everything you said. The computational methods that can fit into the SonATA code framework (someday) are going to be the key to detecting SETI. There is just too much data. That said, I'm a huge visualization and listening fan.

Here is a very relevant example of a recent DSP debugging accident that was just so cool that I had to share it.

http://twitter.com/baudline/status/56505587424428032

It has aliasing, bifurcations, chaos, and textures. I still don't understand exactly how a bug in the zero_insert filter was able to create such beauty. And it sounds like an otherworldly modem when you listen to it. Here is the command line that created it:

linear_sweep | zero_insert 2 2 | baudline -stdin -record -channels 1 -fftsize 32768

sigblips
Offline
Joined: 2010-04-20
Posts: 732
Someone asked if they could

Someone asked if they could listen to the "otherworldly modem" sound that created the accidental zero insert bug image.  The entire file is about 10 hours in duration but here is a 3 minute excerpt of several different sections:

http://baudline.com/misc/zero_insert_bug_sprintf_stdout_excerpt.wav

If you have a subwoofer make sure you turn it up and listen to the section that begins at 1:20. Note that inserting zeros isn't supposed to sound anything like this. The bug was caused by sprintf'ing numeric text to stdout instead of stderr (standard error) which then became inserted into the sample stream.  It was a trivial bug to fix but this coding accident created something far more interesting than what was expected.

GSoC applicant quiz: What was I expecting to see? Why is the zero_insert filter useful?

Definition:

zero_insert stride zeros
     stride = number of bytes of data
     zeros = number of zeros to insert

cat in_file | zero_insert 2 1 > out_file

This filter takes a stream of samples from stdin, inserts a number of zeros between every stride amount of bytes, and then outputs it to stdout. I'm using the term "filter" loosely here as a program that performs an operation on a stdin / stdout sample stream.

For example the command line "alphabet | zero_insert 2 1" would generate the following stream:

0ab0cd0ef0gh0ij ...
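For anyone who wants to experiment, the definition above can be sketched in a few lines of Python. This is not sigblips's program: it assumes the zeros go in front of each group of stride bytes, as in the example stream, and its handling of a short final group is a guess.

```python
def zero_insert(data, stride, zeros):
    """Insert `zeros` zero bytes in front of every `stride` bytes of data."""
    out = bytearray()
    for start in range(0, len(data), stride):
        out += bytes(zeros)                # bytes(n) is n zero bytes
        out += data[start:start + stride]
    return bytes(out)


# The inserted bytes have the value 0, shown as the character "0" in the
# example stream above.
stream = zero_insert(b"abcdefghij", 2, 1)
```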

Questions:

  • What do the "zero_insert 2 1" parameters do and how might that be useful?
  • What would "zero_insert 2 4" generate and what is significant about it?
  • What would "zero_insert 12345678 1" do and how might you detect it?
  • What else can you do with this simple zero insert filter?

Forum members please refrain from answering publicly until after the April 22 2011 GSoC applicant rating deadline. Thanks.

Clarification: The zero_insert bug is interesting but it is not the topic of this quiz question. The focus is what zero_insert is supposed to be doing when it is working properly (i.e. inserting zeros into the sample stream).  I'm interested in the DSP aspects of what those different zero_insert parameter settings do. The first 3 questions at the end are the important ones, the 4th is a bonus.

sigblips
Offline
Joined: 2010-04-20
Posts: 732
Note that this comment was

Note that this comment was written before I was asked by the SETI Institute to be a GSoC mentor. Reading everyone's proposals and scoring answers to my quiz took an even larger amount of time. I had no idea how time intensive GSoC was. And it has only just begun. (:

sigblips
Offline
Joined: 2010-04-20
Posts: 732
I am pleased to announce that

I am pleased to announce that I have been accepted as a GSoC mentor for the SETI Institute organization. I will be just one of the many setiQuest mentors. The setiQuest project received such a large number of high quality proposals that I would like to ask the GSoC applicants to answer a couple quiz questions.

DSP exploratory tech and pulse / squiggle detector applicants:

SonATA infrastructure / Linux distribution applicants:

  • What is your favorite debugging technique? (gdb is not really an answer) Why is it your favorite?
  • What is your favorite inter process communication (IPC) mechanism? Why is it your favorite?
  • How do you plan to approach the potential problems caused by using different gcc compiler and Linux kernel versions?
  • SonATA uses pthread_setschedparam(SCHED_FIFO). What are your thoughts on this versus sched_setscheduler() and/or SCHED_RR?

Other setiQuest project applicants:

  • I'm not qualified to be a mentor for your project so I don't have any questions for you. You are free to answer any of the above questions if you're feeling left out. (:

Everybody:
I don't want to read any long essays. Creative answers are encouraged. You are welcome to answer questions that are outside of your GSoC project topic above (everyone should be able to answer the debugging question). The answers to each question should be limited to a sentence or two or three.

GSoC applicants please post your answers in your private http://www.google-melange.com proposal comments section. I would like to request that people please refrain from publicly posting answers to the quiz questions here in the forum until after the April 22nd 2011 GSoC proposal review deadline.  Thanks. - sigblips.

Clarifications:
The word "favorite" is used in several of the above questions. I'm not asking for the "best" or "correct" answer.  Most likely your favorite is different than mine and that is perfectly fine. Your explanation why it's your favorite is really the key part of the question. The non-favorite questions are more important and you should focus your spare brain cycles on those.

There is no rush in posting your quiz answers. Post your answers in parts if you like. The mentor reviewing deadline is April 22 so I would get them in at least a couple days before that. If after much idle thought you want to change an already posted answer then feel free to do that.

sigblips
Offline
Joined: 2010-04-20
Posts: 732
Q: I'm busy with exams right

Q: I'm busy with exams right now. When is the deadline for posting my quiz answers?

A: Our GSoC review deadline is April 22. So posting your answers a couple days before that date will give us enough time to review them. The quiz answers were designed to be simple and short. Give them a little thought with your idle brain cycles but please don't spend too much time on them. Keep your answers short, no essays please.

avinash
Offline
Joined: 2010-01-26
Posts: 278
One minor modification - our

One minor modification - our de-dup starts on April 20, so we should have all information finalized by then. Please give us a few days to process your answers and rank proposals. I would say, to be safe, please have your answers in by Monday. After that, there is an increasing chance that the answers won't positively impact your proposal.

khrm
Offline
Joined: 2011-03-20
Posts: 39
SonATA uses

SonATA uses pthread_setschedparam(SCHED_FIFO). What are your thoughts on this versus sched_setscheduler() and/or SCHED_RR?
I answered this wrong. I interpreted this question as though we were trying to replace the threads with processes completely somehow. Under that interpretation only one answer would be right. lol

My first interpretation: put sched_setscheduler() in place of pthread_setschedparam() in task.cpp.
But then I thought all this through in fifteen minutes and answered three questions in 20 minutes (the favorite ones hardly took any time). One question took me a long time.
BTW both interpretations are wrong.
Will update the answers.

For those who are still wondering what I meant: take a process which has 10 threads. I thought we were being asked to remove all the threads and have process scheduling only, and then, if needed, to fork the process.
BTW has anyone else interpreted any question wrong?

sigblips
Offline
Joined: 2010-04-20
Posts: 732
My Quiz Answers

Last week I looked at and scored all the quiz answers that the GSoC applicants posted. The quiz was very helpful in selecting the students for the SETI Institute's 2 GSoC slots. With 58 applications for 2 GSoC slots the competition was fierce.

Except for one question, there weren't any correct answers, but there were some very good answers. The quiz was designed to test general knowledge, experience, and problem solving. Since I asked all of you to take my quiz it is only fair that I take the quiz myself. Here are my answers to the quiz and since there weren't right or wrong answers these are basically my opinions. My hope is that someone finds my answers helpful.

What is your favorite DSP filter process? Examples={FIR, IIR, WOLA, convolution, polyphase, ...} Why is it your favorite?

This question didn't ask for the "best" filter (there is no such thing); it asked for your favorite. The Art of DSP is choosing the best tool from the toolbox for the particular specifics and constraints of the problem you are trying to solve.

A: I am a huge fan of convolution because of its FFT-based O(n log n) speed but it is not my favorite. The polyphase FIR filter is my favorite because I feel the optimizing out of all of the zeros is particularly elegant. Note that baudline's Play Deck uses a polyphase FIR for the speed slider and convolution for the equalization, LPF, and HPF filter options.
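To make the "optimizing out of all of the zeros" concrete, here is a small Python sketch comparing interpolation done the obvious way (insert zeros, then run a FIR) with the polyphase arrangement, which never multiplies by the inserted zeros. The function names and the toy tap values are illustrative only and have nothing to do with baudline's internals.

```python
def fir(taps, signal):
    """Direct-form FIR: y[m] = sum_j taps[j] * signal[m - j]."""
    out = []
    for m in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            if m - j >= 0:
                acc += h * signal[m - j]
        out.append(acc)
    return out


def interpolate_naive(taps, x, factor):
    """Zero-stuff by `factor`, then filter. Most multiplies hit a zero."""
    stuffed = []
    for sample in x:
        stuffed.append(sample)
        stuffed.extend([0.0] * (factor - 1))
    return fir(taps, stuffed)


def interpolate_polyphase(taps, x, factor):
    """Same output, but output phase p only uses the sub-filter taps[p::factor]."""
    phases = [taps[p::factor] for p in range(factor)]
    out = []
    for n in range(len(x)):
        for p in range(factor):
            acc = 0.0
            for k, h in enumerate(phases[p]):
                if n - k >= 0:
                    acc += h * x[n - k]
            out.append(acc)
    return out
```

Both functions produce the same samples; the polyphase version does roughly 1/factor of the multiplies.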

What is your favorite DSP filter taps instance?  Examples={LPF, HPF, bandpass, Hilbert, ...} Why is it your favorite?

The low pass filter (LPF) is the most useful because it is used by decimating, interpolating, polyphase, and WOLA filters but it is not my favorite. I am particularly fond of the Hilbert filter because I am still amazed that its crazy alternating shape can shift a signal's phase by 90°. The Hilbert filter highlights the importance of phase in a DSP world where frequency response gets more emphasis.
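For the curious, the 90° shift is easy to check numerically. The sketch below builds textbook FIR Hilbert taps (2 / (pi n) for odd n, zero for even n) with a Hamming window, feeds in a cosine, and gets approximately a sine back. The tap count, window choice and function names are my own assumptions for the illustration.

```python
import math


def hilbert_taps(half_len):
    """Hamming-windowed FIR Hilbert taps for n = -half_len .. +half_len.

    The ideal response is 2 / (pi * n) for odd n and 0 for even n, which is
    the alternating, antisymmetric shape mentioned above.
    """
    taps = []
    for n in range(-half_len, half_len + 1):
        if n % 2 == 0:
            taps.append(0.0)
        else:
            window = 0.54 + 0.46 * math.cos(math.pi * n / half_len)
            taps.append(window * 2.0 / (math.pi * n))
    return taps


def hilbert_at(taps, signal_fn, n):
    """Filter output at index n for a signal given as a function of index."""
    half_len = len(taps) // 2
    return sum(h * signal_fn(n - k)
               for k, h in zip(range(-half_len, half_len + 1), taps))


taps = hilbert_taps(63)
omega = 0.3 * math.pi
# A cosine goes in; the value that comes out is close to sin(omega * 5).
shifted = hilbert_at(taps, lambda i: math.cos(omega * i), 5)
```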

What do the "zero_insert 2 1" parameters do and how might that be useful?

It would insert one zero byte after every two bytes. I created the zero_insert filter because I wanted to convert some 16-bit audio files to be 24-bit audio files to use as a test for a project I was working on. The "2 1" parameters do this by filling the lower 8 bits with zeros. So instead of writing a single purpose program that did the 16 to 24 bit expansion I decided to write zero_insert with flexible parameters that could be used to do many different things. Then it occurred to me that it would be a great teaching tool and hence the zero_insert quiz question.
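A quick way to convince yourself of the 16 to 24 bit expansion is to do it on a few samples. This Python sketch assumes little-endian data and puts the zero byte in front of each 2-byte sample, so the zero lands in the least significant byte; the helper names are hypothetical.

```python
import struct


def widen_16_to_24(raw16):
    """Turn little-endian 16-bit samples into 24-bit ones, like "zero_insert 2 1".

    One zero byte goes in front of each 2-byte sample. In little-endian order
    the first byte is the least significant, so the zero fills the low 8 bits.
    """
    out = bytearray()
    for start in range(0, len(raw16) - 1, 2):
        out.append(0)
        out += raw16[start:start + 2]
    return bytes(out)


def read_s24le(raw24):
    """Decode little-endian signed 24-bit samples into Python ints."""
    return [int.from_bytes(raw24[i:i + 3], "little", signed=True)
            for i in range(0, len(raw24) - 2, 3)]


raw16 = struct.pack("<4h", 1, -1, 0x1234, -32768)
widened = read_s24le(widen_16_to_24(raw16))
```

Each 24-bit value is the original 16-bit value times 256, i.e. the same waveform at the same relative level.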

Some people interpreted "zero_insert 2 1" to be operating on 8-bit samples and the single zero insertion to be interpolation. What they failed to realize is that this would be an odd form of non-uniform sampling that creates aliasing that cannot be filtered out. This is a bad thing. Here is an example command line and what it would look like:

linear_sweep | zero_insert 2 1 | baudline -stdin -channels 1 -format s8

http://www.baudline.com/misc/linear_sweep_zero_insert_2_1.png

I was surprised that no one got this answer correct. Mentioning the 16 to 24-bit conversion or the non-uniform sampling aspect would have been good answers.

What would "zero_insert 2 4" generate and what is significant about it?

The "2 4" parameters would perform interpolation by 3 on 16-bit samples. The zero insertion creates aliasing that can be filtered away with a LPF. Here is an example command line and what it would look like without the filtering:

linear_sweep | zero_insert 2 4 | baudline -stdin -channels 1 -format le16

http://www.baudline.com/misc/linear_sweep_zero_insert_2_4.png
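The extra spectral copies that zero insertion creates can also be seen in a tiny numerical experiment. This Python sketch works on sample values rather than bytes (inserting 4 zero bytes per 2-byte sample is the same as adding 2 zero samples per sample) and uses a slow textbook DFT; the names are made up for the illustration.

```python
import cmath
import math


def zero_stuff(samples, factor):
    """Insert factor - 1 zero samples after every input sample."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (factor - 1))
    return out


def dft_magnitude(samples):
    """Magnitude of a plain O(N^2) DFT (fine for a tiny illustration)."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples)))
            for k in range(n)]


def peak_bins(samples, threshold=1.0):
    """Bins whose DFT magnitude exceeds threshold."""
    return [k for k, m in enumerate(dft_magnitude(samples)) if m > threshold]


# A real tone in bin 2 of a 16-point block (plus its mirror in bin 14).
tone = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
before = peak_bins(tone)
after = peak_bins(zero_stuff(tone, 3))
```

After stuffing, the original pair of peaks is repeated three times across the wider spectrum. An LPF that keeps only the lowest third removes the copies and completes the interpolation.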

What would "zero_insert 12345678 1" do and how might you detect it?

This zero insert command inserts a periodic impairment every 12 million 8-bit samples. The difficulty of detecting this zero impairment depends on what assumptions are made. Using a sine wave signal source, the single zero_insert is easily visible as a periodic discontinuity. Here is an example command line and what it would look like (notice the periodic horizontal discontinuities):

linear_sweep | zero_insert 12345678 1 | baudline -stdin -channels 1 -format s8

http://www.baudline.com/misc/linear_sweep_zero_insert_12345678_1.png

If the signal source is white noise then the problem becomes much more difficult. Exploiting the fact that zeros are being inserted, a scheme that translates all non-zero samples to unity (1) and then streams the result into a highly decimated baudline that uses autocorrelation would find it. A command line like:

wgn | zero_insert 12345678 1 | clip_non-zeros | baudline -stdin -format s8 -decimateby 8192 -fftsize 32768 -transform autocorrelation
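A scaled-down version of this idea can be tried in Python. The sketch below is only an illustration under several assumptions: it uses a stride of 50 instead of 12345678, uniformly random bytes as a stand-in for wgn, a brute-force autocorrelation over lags instead of baudline, and no decimation. All function names are invented.

```python
import random


def insert_zero_every(data, stride):
    """Put one zero byte in front of every `stride` bytes (like zero_insert N 1)."""
    out = bytearray()
    for start in range(0, len(data), stride):
        out.append(0)
        out += data[start:start + stride]
    return bytes(out)


def clip_non_zeros(data):
    """Map every non-zero byte to 1 and leave zeros at 0."""
    return [1 if b else 0 for b in data]


def find_period(clipped, max_lag):
    """Lag (1..max_lag) at which the zero positions line up with themselves most."""
    zeros = [1 - c for c in clipped]   # 1 marks a zero sample
    best_lag, best_score = 0, -1
    for lag in range(1, max_lag + 1):
        score = sum(a * b for a, b in zip(zeros, zeros[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = random.Random(1)                      # fixed seed, stand-in for "wgn"
noise = bytes(rng.randrange(256) for _ in range(5000))
impaired = insert_zero_every(noise, 50)
period = find_period(clip_non_zeros(impaired), 100)
```

The detected period is stride + 1, because each inserted zero lengthens the group it belongs to by one byte. The zeros that occur naturally in the noise are too few and too scattered to compete with the inserted ones.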

Now if you can't exploit the fact that zeros are being inserted then the problem becomes extremely difficult. This is basically saying that the impairment is unknown; imagine that instead of zeros the sample value of 42 is being periodically inserted. Normal Fourier or autocorrelation analysis, even with an extremely large transform size, will not be able to detect this periodic impairment. Advanced techniques are required. This is out of scope for this quiz answer and I'll write up a future blog post about it.

What else can you do with this simple zero_insert filter?

Conversion between different sample sizes is possible (8, 16, 24, 32 bits). In addition to interpolation, if the number of zeros parameter is made negative, so that it deletes samples instead, then decimation without filtering will be done. The zero_insert filter can also be used to zero pad a chunk of samples up to a power of 2 which would be useful for some follow-on analysis algorithms. If an offset parameter is added then either the upper or lower byte(s) in a sample conversion can be selected. An offset parameter with a cascade of zero_insert filters piped together could also be used to embed impairments of different periodicities or a binary sequence. Example:

wgn | zero_insert 100000 1 0 | zero_insert 100000 1 4 | zero_insert 100000 1 8 | baudline -stdin

Having a parameter that allows any arbitrary sample value to be inserted would enhance this program's impairment testing usefulness. This is what's so great about piping with the Unix command line.
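As a sketch of what those two extensions might look like, here is a hypothetical Python variant with a value parameter and an offset parameter. The meaning I give to offset (copy that many bytes untouched before the first insertion) is my own guess; the posts above do not define it.

```python
def insert_value(data, stride, count, value=0, offset=0):
    """Copy `offset` bytes untouched, then put `count` copies of `value`
    in front of every `stride` bytes of what remains."""
    out = bytearray(data[:offset])
    for start in range(offset, len(data), stride):
        out += bytes([value]) * count
        out += data[start:start + stride]
    return bytes(out)


# 42 is the byte value of "*", which makes the result easy to read.
plain = insert_value(b"abcdef", 2, 1, value=42)
shifted = insert_value(b"abcdef", 2, 1, value=42, offset=1)
```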

I was impressed that some people actually wrote their own zero_insert program and analyzed what it did by feeding in test data and looking at the output with baudline. This shows dedication and how seriously some people took my quiz. I was surprised by this because I expected people to analyze the question in their heads. It never occurred to me that people would use external tools to answer the zero_insert questions.

What would you be looking at if you take the Fourier transform of an Autocorrelation?

You get the power spectrum of the original sequence. This is due to the forward FFT canceling the inverse FFT of the autocorrelation. It is important to understand how the FFT and its inverse transform are reversible. Very few people got this question correct.
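This is easy to verify numerically. The Python sketch below assumes a circular autocorrelation (the kind an FFT-based implementation produces) of a short real sequence and uses a slow textbook DFT; the sequence values are arbitrary.

```python
import cmath
import math


def dft(seq):
    """Plain O(N^2) discrete Fourier transform."""
    n = len(seq)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(seq))
            for k in range(n)]


def circular_autocorrelation(x):
    """r[lag] = sum over t of x[t] * x[(t + lag) mod N], for real x."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
spectrum_of_autocorr = dft(circular_autocorrelation(x))
power_spectrum = [abs(v) ** 2 for v in dft(x)]
```

The two lists agree bin by bin to within rounding error.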

What is your favorite debugging technique? (gdb is not really an answer) Why is it your favorite?

My favorite debugging technique is the lowly printf, well "fprintf(stderr, ...)" to be exact.  It is a quick way to get internal state information and it works well with multi-threaded applications which most debuggers don't handle that well.  In fact I find printf to be such a useful debugging tool that I build debugging flags into all my large programs. For example baudline has several -debug flags.

Another debugging technique that I am a huge fan of is the signal loopback. Baudline has a loopback built in and you can also string together multiple baudline instances with -stdin and -stdout to accomplish the same thing. I find it incredibly valuable to use baudline to analyze baudline.

What is your favorite inter process communication (IPC) mechanism? Why is it your favorite?

I am a fan of the lowly semaphore. It is the most basic atomic operation that can be used to build all the other IPC mechanisms. I often use the semaphore to build custom IPC primitives. As you've seen from this quiz and my posts I really like the power of Unix pipes but I don't consider them an IPC mechanism. I also really like TCP network sockets because of their elegant implementation and networked nature but they are a bit overkill for non-networked applications.
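As one small example of building a primitive out of semaphores, here is a one-slot mailbox sketched in Python. It uses threads for simplicity, which is an assumption of the example; the same two-semaphore handshake works between processes with POSIX semaphores and shared memory. The class and function names are invented.

```python
import threading


class Mailbox:
    """One-slot mailbox built from two counting semaphores."""

    def __init__(self):
        self._empty = threading.Semaphore(1)  # 1 while the slot is free
        self._full = threading.Semaphore(0)   # 1 while the slot holds a value
        self._slot = None

    def put(self, value):
        self._empty.acquire()   # wait until the slot is free
        self._slot = value
        self._full.release()    # tell the receiver a value is waiting

    def get(self):
        self._full.acquire()    # wait until a value is waiting
        value = self._slot
        self._empty.release()   # hand the slot back to the sender
        return value


def demo():
    """Send 0..4 through the mailbox to a consumer thread, in order."""
    box = Mailbox()
    received = []
    consumer = threading.Thread(
        target=lambda: received.extend(box.get() for _ in range(5)))
    consumer.start()
    for i in range(5):
        box.put(i)
    consumer.join()
    return received
```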

How do you plan to approach the potential problems caused by using different gcc compiler and Linux kernel versions?

Compatibility between kernel, compiler, and libraries is a very serious problem in the open source world. It seems as if they don't take backwards compatibility seriously. Every time a new Linux distribution or version is released something really important breaks. This is extremely frustrating. The way I've dealt with this is version tracking and testing. Another angle is to write code that uses the lowest common denominator, don't do any fancy system level stuff, and try to use the POSIX libraries as intended.  I've also found that supporting baudline on a multitude of different operating systems (FreeBSD, Linux, Mac OS X, and Solaris) and different CPU chips (32/64-bit, big/little endian) is a great way to find bugs and improve long term compatibility in general.

SonATA uses pthread_setschedparam(SCHED_FIFO). What are your thoughts on this versus sched_setscheduler() and/or SCHED_RR?

Many people answered this question by reciting the man page. That was the wrong way to approach this question. This is a real-time operating system question designed to determine how much you can understand about kernel scheduling details.

Important points:

pthread_setschedparam() works on threads and sched_setscheduler() works on the main parent process. If work is being done in the main parent process then setting its scheduling may be important. The default of pthread_attr_getinheritsched() is PTHREAD_INHERIT_SCHED which means a thread's scheduling mode is inherited from the parent. So if pthread_setschedparam() is being used then PTHREAD_EXPLICIT_SCHED had better be set or the action will be ignored. It is important to also note that sched_setscheduler() is not supported in Mac OS X and possibly not in BSD either.

A process/thread defaults to SCHED_OTHER which is standard round-robin time-sharing used by most processes. Both SCHED_FIFO and SCHED_RR preempt SCHED_OTHER which makes them good for real-time tasks. The SCHED_RR is like SCHED_FIFO but with time-slicing which may be valuable if many real-time tasks at the same priority are fighting for the CPU.

How this relates to SonATA: in the sig-pkg/sonataLib/src/Task.cpp file, PTHREAD_EXPLICIT_SCHED and SCHED_FIFO are being set, which is good. Looking in the different SonATA include/System.h files it can be seen that multiple priorities are being used by different tasks and the priorities look to be carefully selected. It is possible that multiple threads of the same priority will be running on the same machine so an argument can be made that SCHED_RR is a better choice than SCHED_FIFO for SonATA. I don't have a feeling for whether SonATA's parent processes do any work other than spawning threads; if they do then sched_setscheduler() may be a useful call to use.
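If you want to poke at the scheduling policies without writing C, Python exposes the same system calls on Linux. The sketch below is only an experiment helper, not SonATA code: it asks the kernel for the allowed priority range of a policy and tries to switch the caller to that policy, restoring the old setting afterwards. It assumes a Linux-like system; on an unprivileged account the switch is normally refused, and where the call does not exist the function reports that instead. The function names are hypothetical.

```python
import os


def try_policy(policy, priority):
    """Try to put the caller under a real-time policy, then restore it.

    Returns True on success, False if the kernel refused (typically because
    the process lacks the privilege), or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setscheduler"):
        return None                      # e.g. Mac OS X
    old_policy = os.sched_getscheduler(0)
    old_param = os.sched_getparam(0)
    try:
        os.sched_setscheduler(0, policy, os.sched_param(priority))
    except OSError:
        return False
    os.sched_setscheduler(0, old_policy, old_param)
    return True


def priority_range(policy):
    """(min, max) static priority the kernel allows for a policy."""
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)
```

Calling try_policy(os.SCHED_FIFO, 1) and try_policy(os.SCHED_RR, 1) as root versus as a normal user is a quick way to see the privilege requirement for yourself.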

gerryharp
Offline
Joined: 2010-05-15
Posts: 365
FTP site changed for GSoC file postings

The data once found on Gerry's FTP site, relating to GSoC 2011, can now be found here:

ftp://ftp.seti.org/setiquest_ftp/

Please make a note of this change.

Gerry

gerryharp
Offline
Joined: 2010-05-15
Posts: 365
general feedback to all GSoC proposers

Hi All

I want to thank everyone who submitted proposals associated with our idea, "Open Sourcing of Exploratory Techniques for the SETI Search." As you have probably heard, for the 5 ideas the SETI Institute submitted, we've been allocated 2 slots. There is a good chance that one of the slots will support one applicant to our project above. I'm very sorry that the number isn't larger; it will be extremely difficult to choose among many outstanding proposals.

What should you do now? Mostly wait, patiently. If you haven't answered Sigblips' "quiz" please do so by Monday. Don't sweat the quiz too much. There are no right answers. We just want to hear some words from you about topics.

In the meantime, I've put together comments I've said (or wish I'd said) in response to the many proposals I have reviewed. This is just a review of some things that came up. I'm posting it only for your amusement and perhaps consideration when you propose next year.

For those students in physics, astronomy, etc.
As a GSoC applicant, your hard math and science training gives you a slight disadvantage when compared to people working on a professional degree in CS. You may be smart, but there are people with equal talents and training in CS who can kick your butt. Study hard and prove that you're a talented engineer.

To everyone
1. For the timeline, few people gave enough detail.

2. Although the information we've disseminated is there in the documents sent and on the GSoC thread, I'd like to have a warmer feeling that you have read the information. A fine way to show this is to repeat back the goals of the project in the proposal. Remember, the proposal is supposed to be a stand alone document.

3. Include a mention of chosen coding language. Also, what configuration / compilation tools? What documentation tools? Debugging? Showing a working knowledge with existing tools for these processes is helpful.

4. We won't put off documentation to the end of the project. That's known to be poor practice. It is much better to document while writing the code or just after completion of the first working version. Then the code can be released to the community for testing while new algorithms are started.

5. The successful applicant will be asked to communicate regularly (once a day) in forums or by other means with the community of developers who comment on, use, or contribute to the project. While some of this is necessary, too much is damaging to productivity. Setting a schedule (12-25% of working time and no more) and routinely updating at the same time each day is suggested.

6. Since the applicant is working long distance, most interactions with the mentor will be via email or forums. The applicant should acknowledge this point and always have multiple tasks running with low-cost context switching (in your brain). So if you get stuck on one problem, you can work on another until help arrives.

7. Many proposals could be improved by description of prior experience in FOSS. If you don't have any, get some! If you have not previously worked in open source, then you are at a disadvantage compared to those who have. They don't (usually) teach this in school, so you'll have to join a team and work for free. The good news is that a modest contribution can give you a great deal of experience.

8. If you have one of your projects visible online (or code from previous projects) then give us the link. If not, consider upping your online presence.

9. The most difficult part of maintaining a long-distance project is to keep the motivation levels high when there is minimal or delayed feedback. For this reason, we favor applicants who are highly motivated from the start and get involved in communication with the mentor and other applicants. You may think that helping others puts you at a disadvantage. However, if you help someone in a PUBLIC FORUM, then the mentors see this and assign positive impressions to your behavior, distinguishing you from the pack.

No one knew this until now:
There will be an undergraduate science major involved with testing and application of the developed code. Are you willing to help a smart non-CS type learn the basics required to use the code?