Discussion Forums

Roadmap page

31 replies
Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

 It would be nice to have a roadmap page for the entire project that shows what's next and which milestones to reach when.

avinash
Offline
Joined: 2010-01-26
Posts: 278

We are in the process of updating the roadmap over the next few weeks. We will post it as soon as it is finalized.
Avinash
 

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

Now that SETICon is over, are there any updates on this item?

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

To move forward on this item, Avinash and I have discussed this by e-mail. My main point is that real-time access to the telescope stream should be the driving priority, because this is really the ultimate destiny of setiQuest. Analyzing limited archived data files is nice, but it has many shortcomings:

  1. We can't distinguish alien signals from RFI. Once we find a candidate signal in the data files, we are left in the dark with regard to its origin, and there is not really anything we can do about it, because it was recorded months ago and in all likelihood has long since disappeared from the airwaves.
  2. We can't adjust the quality or parameters of the data. If we have a candidate signal, we can't 'zoom in' on it; the data can be cut off in time as was the case with sigblips' Kepler-4b signal; etc.
  3. We have no way to influence the observation targets or schedules. Even if we can make a program that routinely produces plausible arguments for observing particular targets in the FOV, there is no way to make these arguments be heard by whatever controls the telescope. We can outline an elaborate experiment here on the forums, sure, but it will be based on the very limited data we now have available for less than 20 targets - not on the much richer stream of data going through the ATA 24/7.

Due to these shortcomings, I personally don't see any way by which the setiData program in itself can become the basis for a citizen science operation. At best it will be outreach or education.

At this point, I want to ask then: does everyone agree that the end target is to give the public real-time access to the ATA data stream? (Not necessarily direct access - it can be in the form of applications running on behalf of outside users locally at the ATA.) I pose this question both to SI and other members of the forum. If this is the goal I am utterly convinced that we can find a solution that everyone is happy with.

This is how I imagine it could work:

 

Laying out the roadmap would be a matter of following the chart starting from "open source community" and ending with "citizen science", in turn making client modules and a well-defined API into OpenSonATA the next steps from where we are now.

avinash
Offline
Joined: 2010-01-26
Posts: 278

Anders - You beat me to it. I too have been working on a diagram that describes the concept. Your drawing and mine have a lot of similarities, but there are differences too that we will have to work on resolving. It is a PowerPoint with a fair amount of animation (although not as simple and elegant as yours). A static image is included below. For the PowerPoint, click here.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

You should be able to upload the file from the forum rich editor by clicking 'Insert link' and then the 'Upload' tab. Alternatively, there are services like this one.

avinash
Offline
Joined: 2010-01-26
Posts: 278

Thanks. Modified my original posting to include the PowerPoint.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

What are the differences you see between my diagram and yours? The main one I see is the conspicuous lack of arrows from the community back to SI / ATA :)

avinash
Offline
Joined: 2010-01-26
Posts: 278

There is one major difference. In the current system we were not envisioning real-time access to the entire data-stream (or even part of it). We were thinking only of limited access to stored data for signal detection and real-time access to citizen science applications.

If you are thinking of real-time access only for the citizen science application, then we are aligned.

If you were thinking of a real-time feed for signal detection, then we need to find both bandwidth and storage solutions.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

How were you envisioning the real-time access to citizen science applications would work? How or what would reduce the data-stream in real-time for citizen science?

sigblips
Offline
Joined: 2010-04-20
Posts: 733

I agree that access to the real-time telescope data should be the top priority and ultimate destiny of setiQuest. The main problem is that there is too much data combined with a lack of bandwidth. It is like "trying to squeeze cabbage through a keyhole."  Running an Internet2 feed to the ATA is the ideal solution, but that is expensive and might not be possible due to Hat Creek's remote location.  So I propose this two-tier solution:

1) Access to the full data stream. Allow on-site servers that can tap into the UDP multicast packets on the ATA's local 10 GbE network. The UDP multicast packets are basically the API.  So by utilizing the existing data distribution method that the beamsplitters and SonATA use, no real integration work is required.  All that is required is a network switch, electricity, and some space for the servers.  I envision multiple teams built around hardware sponsorships from companies like IBM, Intel, AMD, HP, Dell, Apple, ...
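As a sketch of what such a passive tap might look like (the multicast group address and port below are placeholders, not the ATA's actual values), a listener only needs to join the group; the sending side requires no code changes at all:

```python
import socket
import struct

def open_multicast_tap(group: str, port: int, iface: str = "0.0.0.0") -> socket.socket:
    """Join a UDP multicast group as a passive, receive-only listener.

    The sender is never contacted: the switch replicates each packet
    to every subscribed port, which is why a tap like this cannot
    disturb the existing SonATA data flow.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))  # listen on the multicast port
    # IGMP join: tell the local network we want this group's traffic.
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

Each call to `recvfrom()` on the returned socket then yields one raw packet from the data bus, ready for archiving or analysis.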

2) Access to a partial data stream. Select one or more narrow chunks of spectrum, possibly the 410 kHz wide UDP multicast packets from the Channelizers, and send them off-site over the existing Internet pipe to Amazon EC2 for archiving to disk.  I envision a wrap-around disk buffer scheme that will hold the last ## hours, depending on storage space. The computing requirements for this transmitting, collecting, and archiving are very low.  The heavy-duty processing will be done on the Amazon cloud.
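A minimal sketch of the wrap-around buffer idea, assuming fixed-size segment files where the oldest segment is overwritten once the buffer is full (the class name and sizes are illustrative, not a real setiQuest component):

```python
import os

class WrapAroundBuffer:
    """Keep roughly the last N segments of a continuous stream on disk.

    num_segments and segment_bytes are tuning knobs: e.g. a 410 kHz
    complex stream at 2 bytes/sample is ~0.82 MB/s, so holding the
    last hour would need about 3 GB spread across the segments.
    """
    def __init__(self, directory, num_segments=8, segment_bytes=4096):
        self.directory = directory
        self.num_segments = num_segments
        self.segment_bytes = segment_bytes
        self.index = 0    # which segment file we are currently writing
        self.written = 0  # bytes written into the current segment
        os.makedirs(directory, exist_ok=True)
        self._fh = open(self._path(0), "wb")

    def _path(self, i):
        return os.path.join(self.directory, f"segment_{i:04d}.raw")

    def append(self, packet: bytes):
        """Append one packet, rotating (and overwriting) when full."""
        if self.written + len(packet) > self.segment_bytes:
            self._fh.close()
            self.index = (self.index + 1) % self.num_segments  # wrap: oldest is overwritten
            self._fh = open(self._path(self.index), "wb")
            self.written = 0
        self._fh.write(packet)
        self.written += len(packet)

    def close(self):
        self._fh.close()
```

Feeding packets from a multicast tap into `append()` gives a rolling archive whose depth is bounded only by `num_segments * segment_bytes`.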

Does anyone know the speed of the ATA's Internet connection?

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

Good suggestions. I've merged them into this page. I assume your option 2 can be implemented as a 'client module'.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

I don't understand your question of "can your option 2 be implemented as a 'client module'?" What do you mean by client module? Client of what? SonATA or Cloudant or something independent? Are you talking about software running at the ATA or in the cloud? My option 2 would require software parts in both places.

Your wiki roadmap page is a good start.  I have some suggestions:

* The wiki roadmap doesn't mention "Open data access" which I believe is key. Having open access to the raw real-time data stream will allow for a whole new dimension of searches that are not possible with the current SonATA architecture. I'm thinking big radical innovative ideas that just won't fit in the current scheme.

* The wiki roadmap doesn't mention "UDP multicast packets" which I also believe is key. They are the flexible data bus of the ATA. They are also the API that individual components use to pass processed data. What is key here is that these UDP multicast packets can be tapped without any action required by the sender.  This read-only activity means zero modifications of existing software systems which in turn greatly reduces the risk of interfering.

* The wiki roadmap doesn't differentiate between SonATA/Cloudant client modules and piggyback solutions. Both are key but they are very different. Certain experimental ideas will fit well within the SonATA/Cloudant infrastructures, but the big innovative radical ideas would break them. Isolated piggyback solutions that utilize the UDP multicast packets are a place for these far-ranging ideas. The seti@home project that piggybacks on the Arecibo data is a good analogy of what I am suggesting here.

* The wiki roadmap in section #1 lumps together raw data access and suggesting new observation targets/schedules. Both are important but they should be separate. The open data access can be done by reading UDP multicast packets and is not invasive or harmful to the current operation of the ATA. Suggesting new observation targets/schedules could be built into the confirmation flow chart but that would require much more integration and trust, so much so that I think it should be a distinct step.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

By client module I mean something that is running locally at the ATA - a client of SonATA.

I don't see the big differences you are referring to. In terms of outside requirements, SonATA and your option 2 seem to do essentially the same things - read the data stream and output a reduced subset. It is likely that they can share much of the same infrastructure. For instance, if someone urgently needs the telescope for something else and wants to shut down everything related to SETI, they shouldn't have to call a million different programs and ask them to kindly shut down. This can be managed through a single channel in SonATA.
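As an illustration of that single-channel idea (all names here are hypothetical, not existing setiQuest software), a sketch in which every client module registers with one control point, so a single call stops everything SETI-related:

```python
class ClientModule:
    """A hypothetical SETI client module running locally at the ATA."""
    def __init__(self, name: str):
        self.name = name
        self.running = True

    def stop(self, reason: str):
        # A real module would flush buffers and release the network here.
        self.running = False

class ControlChannel:
    """Single point of control: one broadcast stops all registered modules."""
    def __init__(self):
        self._modules = []

    def register(self, module: ClientModule):
        self._modules.append(module)

    def shutdown_all(self, reason: str) -> list:
        """Ask every registered module to stop; returns the names stopped."""
        stopped = []
        for m in self._modules:
            m.stop(reason)
            stopped.append(m.name)
        return stopped
```

The point of the design is that the telescope operator talks to one channel, never to the individual community-written programs.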

By integration I don't mean that Jon Richards has to sit and write a million lines of code to get all the damn programs working. As I state in the document, it is the responsibility of the community to prepare the modules. But SI will want to exert some oversight of what is being pushed onto their telescope in any case. If someone uploads a program that fills the network with lolcats at 20 Gbps, for instance, that is not good. Measures to prevent incidents like this are implemented at integration.

Also note that it is a very high level map with just five steps. Each step should be broken into smaller steps, of course, as work progresses and plans become more and more defined. It shouldn't be necessary to lock ourselves into particular protocols and so on at this stage. This is something we do at specification when SI has figured out what it wants to share and how etc.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

The differences are huge.  I'll try to explain why I think this is so important.

My piping-data-to-the-cloud option #2 could run as a client of SonATA, but I was envisioning it being separate for a couple of reasons:

* The SonATA system is already complicated enough.  Do they really want to add another subsystem to SonATA that doesn't improve their primary mission?

* The UDP multicast data packets make this division of the project easy. This is important because it reduces risk and it reduces the SI staff's integration time.

* The SonATA source code required for option #2 won't be released until late 2011 and I would like to be accessing this real-time data today.

* "Open data access" means that what you can do with the data is only limited by your own creativity and talent. Saying that all data must be analyzed in the SonATA context or by Cloudant's setiCloud is extremely limiting and shortsighted. If it wasn't for open data access I could not of done any of this http://baudline.blogspot.com/search/label/SETI

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

> The UDP multicast data packets make this division of the project easy. This is important because it reduces risk and it reduces the SI staff's integration time.

Reduces risk and integration time relative to what? The page makes no assumptions about how the access is implemented. It could well be in the way you describe. But if there are other solutions that SI is more comfortable with that achieve the same access, there is no reason to preclude them.

The page makes no reference to Cloudant or setiCloud, nor does it assume that data must be analyzed in the context of SonATA. I am not sure why you bring this into it.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

> The page makes no reference to Cloudant or setiCloud, nor does it assume that data must be analyzed in the context of SonATA. I am not sure why you bring this into it.

The reason I brought that up is because I personally have experienced this sort of "open data access" lockout here at setiQuest. Here is my story:

Initially the SI staff would post new files to the data-api page; I would download them, analyze them, and post a blog report about it.  Then Cloudant's setiCloud was created with an implementation that is of zero value to me, but I'll save that criticism for a different thread.  What happened next was that new data stopped being posted in the old location and instead was uploaded only to setiCloud. So what did I do? I stopped analyzing the data because it had become inaccessible to me.  I was locked out.

New uploads have since returned to the data-api page but I haven't posted any more analysis reports. I'm not sure if I will. I am feeling a bit of uncertainty about the direction this project is headed. That is why "open data access" is important to me.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

If you thought the ATA had a data bandwidth problem, here is a link to an article about the LHC's computing grid:

http://arstechnica.com/science/news/2010/08/lhc-computing-grid-pushes-pe...

The article's two main topics are very relevant to setiQuest and "open data access":

* Moving data, which discusses their network and storage arrays.

* Supporting users and companies, which discusses how the data is shared and the innovative industry collaboration called CERN openlab:

http://proj-openlab-datagrid-public.web.cern.ch/proj-openlab-datagrid-pu...

The fascinating openlab concept may be something the setiQuest project should think about emulating.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

> Reduces risk and integration time relative to what?

UDP multicast is an IP-level feature that differs slightly from standard point-to-point UDP.  It is an ideal data bus to use in complicated data acquisition systems.

What I meant by reducing risk is that UDP multicast packets are read-only and the source machine is not involved in the multicast distribution. The magic is all done by the switch.  The existing system (SonATA) isn't involved in or disturbed by my data access. This reduces the risk of harm to the existing system.

This doesn't just reduce the SI staff's integration time, it basically eliminates it, since they don't have to do anything special or change any code. This lack of required code modifications again reduces the risk of harm to the SonATA system.

Relative to what? Relative to building a data export system external to SonATA as opposed to being part of SonATA.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

> What I meant by reducing risk is that UDP multicast packets are read-only and the source machine is not involved in the multicast distribution. The magic is all done by the switch.

Which switch? Do you know of a switch at the ATA? Is it set up to not allow your software to cause disruptions on the network?

> Relative to what? Relative to building a data export system external to SonATA as opposed to being part of SonATA.

Yeah, but the page doesn't say that. I hope the subpage I linked above addresses the concerns you have. 

sigblips
Offline
Joined: 2010-04-20
Posts: 733

> Which switch?

A 10 GbE switch.  I don't know any details of their switches, but I suppose it could have a firewall.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

Exactly. So we have to get the details, configure it if it has a firewall, and do something else if it has not. There is a word for that - it's called integration.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

What I meant by "integration" had to do with the SonATA code base.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

sigblips, I've tried to address some of your concerns on this page.

avinash
Offline
Joined: 2010-01-26
Posts: 278

Hat Creek Radio Observatory is connected to the Internet by a 40 Mbps connection that is shared between the University of California, Berkeley, and the SETI Institute. So, for our planning purposes, we should consider it to be 20 Mbps.
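For a rough sense of scale, the 410 kHz chunk proposed earlier in this thread would fit in that budget, assuming 2-byte complex (8-bit I + 8-bit Q) samples; the actual ATA sample format may differ:

```python
# Back-of-the-envelope check: does one 410 kHz channelizer chunk fit
# through a 20 Mbps share of the pipe? The 2-bytes-per-sample figure
# is an assumption for illustration, not the ATA's published format.
SAMPLE_RATE_HZ = 410_000   # chunk bandwidth, complex samples per second
BYTES_PER_SAMPLE = 2       # 8-bit I + 8-bit Q (assumed)
LINK_MBPS = 20             # usable share of the 40 Mbps connection

stream_mbps = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 8 / 1e6
print(f"stream needs {stream_mbps:.2f} Mbps of {LINK_MBPS} Mbps")  # 6.56 Mbps
print("fits:", stream_mbps < LINK_MBPS)                            # fits: True
```

So one such chunk uses about a third of the planning budget, leaving headroom for protocol overhead and other traffic, but a second or third chunk would start to crowd the link.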

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618
Bandwidth

In terms of connectivity, if the current pipe can't be upgraded, this could be something - in three years. Edit: Except that satellite probably isn't great at a radio telescope! Maybe for downlink - don't know.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618
Progress?

I feel that things are going pretty slowly at the moment, so I decided to write this comment on the progress of the project.

sigblips
Offline
Joined: 2010-04-20
Posts: 733

That is a very nice synopsis of the state of the setiQuest project.  I too feel a bit disappointed in the current status and rate of progress.  Here are my thoughts.

I feel that one of the primary problems is a lack of engagement. I joked about the existence of the SETI Institute staff being a sort of Fermi paradox but seriously, this forum should be swarming with them commenting, guiding, steering, ... Jill has had a couple of posts but no sign of Seth or Frank, why? Every single member of the SETI Institute staff, and I don't care if their specialty is deep sea vents, should be on the setiQuest forums posting.  I've had several very informative forum discussions with some of the ATA-related staff but those conversations sort of fizzled out when things started to get interesting. Another example is that my most recent baudline report about the Crab pulsar data generated zero comments from any SETI Institute staff. There is some strange stuff in that data that I'm really curious about. I wouldn't be surprised if they don't like all the artifacts and flaws I keep finding in the data. Maybe they are thinking "if we ignore sigblips then he might go away?"

The second big problem setiQuest has is that it's wandering into uncharted territory. No one before has attempted to do Open Science. Many parallels have been made to open source software but that is only a small aspect of what the setiQuest project is about. What setiQuest is attempting to do is so much bigger and more important than a software development paradigm.  I think the mistake made was that software experts were consulted instead of scientists. What do the scientists think of the idea? What would help them? How could they make setiQuest work better? Maybe they are afraid of what Open Science means for their profession?

I really don't know the answers to these questions but engaging in discussions about them is probably a good idea.

avinash
Offline
Joined: 2010-01-26
Posts: 278

Good points - both of you - Anders, in your blog, and sigblips, here. I can't disagree with what you are saying. Yes, the pace has been slow, and responses have often been missing. (But there has not been any attempt to ignore anyone.)

Our challenge has been discovering things as we have gone along - none of us started as open source experts. In addition, when we looked for advice, as you point out, we solicited advice from software people, not scientists. I did communicate with people doing open science in the January/February timeframe, but those discussions went nowhere. My sense is that open science is not as widespread as open software, but I could be wrong.

Where do we go from here? We had a meeting on the 11th. Karsten Wade attended by phone. The meeting produced action items that require us to rethink some of what we are doing. And there are two reasons why this rethinking has not happened: first, I have been away from the office for almost the entire time since that meeting. But, more important, our office is being rebuilt, and we are all working from home for some time - preventing a face-to-face meeting.

We have also not allocated enough resources to this task. The software team is busy completing the SonATA software, giving only a short amount of time to the open source task. The scientists are continuing their observations in addition to responding on forums.

But, these are all small reasons in the overall scheme of things. Your criticism is valid, and you will see action on it - as soon as we are back in the office (at the end of this month).

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

I absolutely think that open science is not widespread, and SETI Institute has every opportunity to be a pioneer and future authority in this field. But I think you would be ill-advised not to heed the lessons learned by the open source software community of the necessity of openness and transparency. There is a fine line between "Open Innovation" and just tricking people into doing your donkey work, and the latter, I fear, carries a steep PR cost.

avinash
Offline
Joined: 2010-01-26
Posts: 278

Thanks, Anders, for cuing up this topic through the blog.

Let me provide a high-level summary of progress-to-date on setiQuest.  Sorry, it is long, but in the interest of transparency, I am trying to give the complete picture. Hopefully, this will help you understand the challenges, and the perception that things have been stationary. I agree that we have not moved as much as we could have, but we have made tremendous progress.

When the project was started, there were 8 sub-projects within it:

  1. Acquiring standard hardware - so that we could replace the custom hardware, and bring software to a stage that anyone with a home PC can run it. This goal was achieved through significant help from two teams - the Dell / Intel combination and Google. The hardware is on site, and we are in the process of installing it. We have had some networking issues that Dave Hartzell, a community member, was peripherally involved in helping us solve.
  2. Porting the software to run on this standard hardware. There is a lot of progress on this, and the goal is to finish it by the end of this calendar year. However, this also ran counter to our goal 5 - that of open sourcing the software.  At every stage, the team has had to think about what to focus on first - finishing the software, or open sourcing it. This ends up delaying both.
  3. Community building - This involved creating the website, and progressively enriching it with tools on one side, and engaging with the community on the other. The website process has been good and we have made much progress - thanks to all the work LastExit did for us. We also tried to create a sandbox so that community members can try out their proposed changes - but this one I have to admit defeat on. We also relied on many Drupal features - which, I am finding out now, are not the best to use (wiki, mailing lists ...). A new version of the site is being prepared, and this time we will make sure that there is a sandbox.
    On the people side, we have fared average, not great. We have lacked in multiple areas - responding to forum issues, blogging and sending newsletters. The scientists share their time between real observing and helping the community - again a pull-pull situation. Two former SETI Institute scientists have volunteered to help in the forums. Once they are up to speed, hopefully, you will see more communication.
    A recent suggestion has been to create a setiQuest board - independent of SETI Institute. We agree, and we would like to convene a meeting to discuss this (IRC, phone call, ... whatever works best for the team); I hope a few of you will step up to carry the torch until a formal board is in place.
  4. Outreach - this involved two parts - actively evangelizing the program, and giving useful information to people who come to our website. We have been moderately successful in the first part, and are improving in the second part.
  5. Open Sourcing the software. Look also at item 2 above.  With the attorney helping us, we came up with a new license. Based on input from the community, we changed it to Apache 2.0. We also needed to make sure that we don't run afoul of others' licenses. Thankfully, Palamida stepped in to help us. They pointed out 74 issues with the software that needed to be resolved before we could open source it. Some issues have been resolved. Others will be. (Would it help if we made the issues available through Redmine, even if the source code is not open sourced? If people know of open source substitutes for "problem" libraries, we could use the referrals.)  We have a roadmap, but we all agree that it is not aggressive enough. But we are making progress on it. The biggest barrier has been a lack of resources on our side.
  6. Making setiQuest data available to the community. The goal was to have the community look at the data, and use it for whatever they felt was appropriate - from our point of view, development of new algorithms, and of new data visualizations, would be useful. We have put up a fair amount of data (although not in line with our initial promise of one data-set per week). This part has not been very successful - sigblips has analyzed the data, but beyond that there is not much. I am not sure how to proceed on this one.
  7. Encouraging the development of algorithms. We worked with Cloudant to create an environment where people can try out their algorithms. While not perfect, the system works, and if enough people use it, we can lean on Cloudant to continue working with us on further improving it.
  8. Developing apps for citizen scientists. The initial app concepts that we came up with did not generate much enthusiasm, so this was relegated to the bottom of the list. This is something that could easily be taken up again. Maybe there are creative types among you who can define compelling apps.

So, where do we go from here?

  1. Create a board to help provide external leadership, and direct us. Suggested by Karsten, and reiterated by Anders.
  2. Hire - we have authorization to hire one person who can help us do both software, and open source it. If you know of C++ people with open source experience, send them my way. We are already talking to some people. Once this person is on board, expect a higher level of responsiveness.
  3. A new website based entirely on open source tools, and richer in functionality. It is in progress.  It is being modeled after the meld community. If you have thoughts, let us discuss them.
  4. More aggressive open sourcing schedule. By the end of the calendar year, we will have only one focus, but we are looking at what we can do before then. Until things change, however, we will continue open sourcing according to the currently-published roadmap.
  5. Anders has suggested a roadmap for citizen scientists. Jill is looking at it, and will respond soon (hopefully hours and days, not weeks :-)). There are some good ideas, and some ideas that may be difficult to implement.  The discussion, I hope, will lead to an implementable plan that everyone will be happy with.

Did I miss anything that is important to the team? The question of when the community will do SETI is an interesting one. If people can do it on fixed data, that is now. Let us work with Anders and Jill on coming up with the right roadmap for more.

Anders Feder
Offline
Joined: 2010-04-22
Posts: 618

Looks very good, thanks.

I am going to open a separate thread on the upcoming meeting.