Kurtosis for setiCloud data

Anders Feder
Joined: 2010-04-22
Posts: 618

Using setiCloud, I've computed the kurtosis for all data objects (2^26 bytes each) in the cloud (the code is shared, and can be accessed from your own setiCloud account). In all cases, I get a kurtosis around -1.97, far from the expected 0 of Gaussian white noise. What is the explanation for this? Can interference signals not be assumed to be negligible on such a large scale?
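For reference, the expectation mentioned above can be checked on synthetic data. The following is a minimal Python sketch, not the shared setiCloud code; the function name `excess_kurtosis`, the sample count, and the seed are illustrative assumptions. It computes the fourth standardized moment minus 3, which should come out near 0 for Gaussian white noise.

```python
import random

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (0 for a Gaussian distribution)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in samples) / n  # fourth central moment
    return m4 / (m2 * m2) - 3.0

# Synthetic Gaussian white noise (illustrative parameters, not setiQuest data).
rng = random.Random(0)
gwn = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
gwn_kurtosis = excess_kurtosis(gwn)
print(gwn_kurtosis)  # expected to be close to 0 for Gaussian samples
```

A result far from 0 on real data therefore points either to a non-Gaussian signal or to a problem in how the samples were read.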

robackrman
Joined: 2010-04-15
Posts: 235

I ran your code (see below) independent of the CloudAnt environment with the expected kurtosis result ~= 0 for the data file I randomly chose. I expect this result for most setiQuest observations. I looked at your shared setiCloud code. It seems fine to me. I wonder what is going wrong?

rackrman@dev:~/temp/anders$ octave -q
octave:1> [data,status,msg] = urlread("http://s3.amazonaws.com/sq001chop.seti.com/2010-03-26-bllac-8bit-1-of-8.dat.5");
octave:2> length(data)
ans = 67108864
octave:3> x=data(1:length(data))*1.0;
octave:4> size(x)
ans = 1 67108864
octave:5> mean(x)
ans = -0.0012580
octave:6> kurtosis(x)
ans = -0.0088491
octave:7>

Anders Feder

Now that is really weird. Doing the same thing on both my Windows and Linux machines, I get:

octave:1> [data,status,msg] = urlread("http://s3.amazonaws.com/sq001chop.seti.com/2010-03-26-bllac-8bit-1-of-8.dat.5");
octave:2> length(data)
ans = 67108864
octave:3> x=data(1:length(data))*1.0;
octave:4> size(x)
ans = 1 67108864
octave:5> mean(x)
ans = 124.03
octave:6> kurtosis(x)
ans = -1.9801

What version of Octave are you using?

robackrman

On the computer where I ran that experiment: Octave version 3.0.1

There is an inconsistency. I suspect the data in your case is interpreted for some reason (setting? Octave version?) as unsigned (therefore values 0 through 255 rather than -128 through 127). If that is the case, it can be accommodated with additional code.
Please try the following so that we can compare interpretation of individual sample values for our two cases:
octave:1> [data,status,msg] = urlread("http://s3.amazonaws.com/sq001chop.seti.com/2010-03-26-bllac-8bit-1-of-8.dat.5");
octave:2> disp(data(1:100)*1.0)
Columns 1 through 16:
-14 7 -12 2 16 -24 9 -38 -11 -4 -4 -13 3 -15 2 19
Columns 17 through 32:
-12 6 -18 9 -2 7 -18 -7 -6 6 8 -20 -1 8 15 -6
Columns 33 through 48:
-5 -6 -3 -19 -11 15 6 1 14 -17 6 2 16 11 13 -8
Columns 49 through 64:
7 9 9 -10 15 -18 6 7 1 -8 -10 0 0 26 -5 -3
Columns 65 through 80:
-12 2 6 -1 -1 -15 13 -17 -1 1 -9 10 -14 17 -6 -3
Columns 81 through 96:
-22 -11 7 -6 5 5 15 -6 26 7 -6 30 -21 4 -45 -12
Columns 97 through 100:
4 -10 -13 13
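The signed versus unsigned interpretation suspected above can be illustrated with synthetic data. This is a Python sketch under stated assumptions: the noise level `SIGMA` is a hypothetical value chosen for illustration, and the samples are simulated, not read from the setiQuest file. Reading the same bytes as unsigned moves every negative sample up by 256, which splits the distribution into two clusters and drives the excess kurtosis strongly negative.

```python
import random

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (0 for a Gaussian distribution)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0

rng = random.Random(1)
SIGMA = 12.0  # hypothetical noise level in 8-bit counts, for illustration only

# Signed 8-bit samples: rounded Gaussian noise clipped to -128..127.
signed = [max(-128, min(127, round(rng.gauss(0.0, SIGMA)))) for _ in range(100_000)]
# The same bytes read as unsigned: negative values wrap around to 128..255.
unsigned = [v & 0xFF for v in signed]

signed_mean = sum(signed) / len(signed)
unsigned_mean = sum(unsigned) / len(unsigned)
signed_kurtosis = excess_kurtosis(signed)
unsigned_kurtosis = excess_kurtosis(unsigned)

print(signed_mean, signed_kurtosis)      # mean and excess kurtosis both near 0
print(unsigned_mean, unsigned_kurtosis)  # mean pulled toward mid-range, kurtosis strongly negative
```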

Anders Feder

Indeed, I get:

Columns 1 through 13:

242 7 244 2 16 232 9 218 245 252 252 243 3

Columns 14 through 26:

241 2 19 244 6 238 9 254 7 238 249 250 6

Columns 27 through 39:

8 236 255 8 15 250 251 250 253 237 245 15 6

Columns 40 through 52:

1 14 239 6 2 16 11 13 248 7 9 9 246

Columns 53 through 65:

15 238 6 7 1 248 246 0 0 26 251 253 244

Columns 66 through 78:

2 6 255 255 241 13 239 255 1 247 10 242 17

Columns 79 through 91:

250 253 234 245 7 250 5 5 15 250 26 7 250

Columns 92 through 100:

30 235 4 211 244 4 246 243 13

(I use Octave version 3.2.2)

robackrman

Try this (hopefully improved) method to consistently obtain signed values:

rackrman@dev:~$ octave -q
octave:1> [data,status,msg] = urlread("http://s3.amazonaws.com/sq001chop.seti.com/2010-03-26-bllac-8bit-1-of-8.dat.5");
octave:2> double(int8(data(1:10)))
ans =
  -14    7  -12    2   16  -24    9  -38  -11   -4

If you get signed values above, then try the kurtosis function:

octave:3> kurtosis(double(int8(data)))
ans = -0.0088491
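For readers working outside Octave, the same reinterpretation can be sketched in Python. The input values are the first ten unsigned samples quoted earlier in this thread; the helper name `to_signed8` is an illustrative assumption, and the sketch makes no claim about how Octave's `int8` works internally. It treats each byte as a two's-complement signed 8-bit integer.

```python
from array import array

# First ten sample values as printed in the unsigned listing earlier in this thread.
unsigned_values = [242, 7, 244, 2, 16, 232, 9, 218, 245, 252]

def to_signed8(value):
    """Reinterpret an unsigned byte (0..255) as a two's-complement int8."""
    return value - 256 if value >= 128 else value

signed_values = [to_signed8(v) for v in unsigned_values]
print(signed_values)

# Equivalent reinterpretation of a raw byte buffer with the standard library:
# array typecode 'b' means signed 8-bit integers.
signed_from_bytes = array('b', bytes(unsigned_values)).tolist()
print(signed_from_bytes)
```

Both lists match the signed values shown in the Octave output above.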

Anders Feder

 Yep, that works. Getting the same result. Thanks.

robackrman

Outstanding Anders.  Thank you for verifying.

Will you at some point attempt your setiCloud analysis again with the new method?

Anders Feder

 Yes, I've updated the code. It's currently rebuilding the view.

Anders Feder

The view has been rebuilt, and I get nice kurtoses above -0.01 for all data sets except one named "galaxy 19", which I assume is an observation not of a galaxy but of a satellite.

Anders Feder

For the heck of it, I went ahead and tabulated the data and sorted by descending excess kurtosis (it actually fetches the data from setiCloud on the fly when the page is generated). If people are interested in analyzing data with many anomalies relative to GWN, the data sets towards the top of the table should be the best candidates.

hartzell
Joined: 2010-07-28
Posts: 17
Cool!

Good job guys!

I now have access to the Cloud (account problem)...

A quick question:  How is the data organized?  Is it I/Q samples, like the ones posted on the setiQuest website?

Dave

Anders Feder

 Yes, it's the same data, only segmented into 2^26 byte "chops".

robackrman

Interesting results.  Galaxy 19 is a satellite.  The SNR is likely high enough that the sample distribution deviates from Gaussian enough to be seen in the kurtosis test.  Nice work!
In that observation (Galaxy 19) we recorded the unencrypted C-band AMGTV transponder. An interesting DSP challenge would be for someone to attempt to decode and play a few seconds of HDTV (most likely an old TV show or movie) from the samples.
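The effect of a strong signal on the kurtosis test can be sketched with synthetic data. In this Python example a pure tone stands in for the satellite signal, which is a simplifying assumption (the real transponder signal is modulated), and the amplitude, frequency, and seed are hypothetical. A tone with the same power as the noise is enough to pull the excess kurtosis clearly below 0.

```python
import math
import random

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (0 for a Gaussian distribution)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0

rng = random.Random(2)
N = 100_000
TONE_AMPLITUDE = math.sqrt(2.0)  # hypothetical: tone power equal to unit noise power
TONE_FREQ = 0.1234               # cycles per sample, arbitrary choice

noise = [rng.gauss(0.0, 1.0) for _ in range(N)]
tone = [TONE_AMPLITUDE * math.sin(2.0 * math.pi * TONE_FREQ * i) for i in range(N)]
mixed = [a + b for a, b in zip(noise, tone)]

noise_kurtosis = excess_kurtosis(noise)
mixed_kurtosis = excess_kurtosis(mixed)
print(noise_kurtosis)  # near 0 for noise alone
print(mixed_kurtosis)  # negative once the tone is added
```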

gerryharp
Joined: 2010-05-15
Posts: 365
please don't download data multiple times!

Hi Anders and all
 
I'd like to discourage you and anyone else from designing an Octave program that downloads data from the setiCloud (or setiData) websites more than once. We have a very limited budget for data downloads (1 TB/month), and if we exceed this threshold we pay through the nose. The project literally cannot afford this. If we run out of download allowance, we'll have to cut off access to data in the cloud, at least until the next month rolls around.
 
If your code, above, was actually run in the cloud, then notice that you are actually passing through the Amazon firewall twice. That is, we pay not only for one download but for one upload as well. This isn't the preferred method.
 
I apologize that the setiCloud is not better documented, including the correct way to address data so that it doesn't go through the firewall. I can't give you that info off hand, but I believe someone else knows it by heart. I'll try to get that information for you.
 
I suggest that one convenient way to analyze data is first to download it to local disk, once and for all. Then you can perform jillions of analyses without breaking the bank.
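One way to follow this suggestion is a small download-once helper. The Python sketch below is an illustration only: the function name `fetch_once`, the cache directory, and the `.part` suffix are assumptions, not part of any setiQuest tooling. It transfers a file only when no local copy exists and otherwise returns the cached path.

```python
import os
import urllib.request

def fetch_once(url, cache_dir, downloader=urllib.request.urlretrieve):
    """Return a local path for url, downloading only if no cached copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        partial_path = local_path + ".part"
        downloader(url, partial_path)         # the only place a network transfer happens
        os.replace(partial_path, local_path)  # expose the file only once it is complete
    return local_path

# Example call (not executed here, to avoid an unnecessary transfer):
# path = fetch_once(
#     "http://s3.amazonaws.com/sq001chop.seti.com/2010-03-26-bllac-8bit-1-of-8.dat.5",
#     "seti_cache",
# )
```

Every later analysis can then read the file at the returned path without touching the network again.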
 
Meanwhile, if your computer is not overtaxed, it is probably better to download the large 2GB files from the Data website. I'm starting to upload more of our files to setiquest.org so that you will have access to all the data. If there is a file you are especially interested in, then please email me and I'll make it a priority.
 
Thanks for your understanding and patience.
 
Gerry

Anders Feder

Thanks. This has been a mystery for a while. I'll remove the links I have from my website to these data.

We should really find a way to mirror them so we don't have to worry about suffocating the project just by using the data.

By the way, I got the code for loading the data into Octave from one of the examples available in the setiCloud interface. If this is a "bad" method, you may want to edit that example.

gerryharp
hopefully, to clarify

To start at the top: there are two projects in the Amazon cloud, setiData and setiCloud.

setiData is available through the "Getting Data" link under Data/API's on the setiquest.org website.

It is permissible and encouraged to download copies of this data to your local disk. But please download each file only once to save resources. Today I'm adding more data to setiData and this will be your first chance to look at it.

setiCloud is the neat site set up by Cloudant for processing setiData. It has more data than setiData, but the latest data is not available there. Within setiCloud, there are links to "bite sized" splits of the setiData data, up to mid-June, I think. Feel free to access these bite-sized chunks from within the cloud as much as you want. Please do not download these smaller chunks from the cloud (or at least, download them no more than once and store on local disk). Generally, you're better off grabbing data for local use by getting the bigger chunks from setiData.

Sorry for the confusion; we'll continue to provide more info on these issues as time permits.

Gerry

hartzell

Gerry-

Just to clarify, this doesn't apply to the setiQuest Cloudant service, does it?  IIRC, you can move as much data as you want to inside the Amazon cloud?

Dave

Anders Feder

I'd like to hear the answer to that too, but as far as I can tell, it does apply to the Cloudant service. I imagine that we are supposed to access the data by some means that are local to the Amazon cloud - e.g. using the filesystem - rather than passing it out and in through the firewall.

Anders Feder

It seems you're right, Dave:

We charge less where our costs are less. Some prices vary across Amazon S3 Regions and are based on the location of your bucket. There is no Data Transfer charge for data transferred within an Amazon S3 Region via a COPY request. Data transferred via a COPY request between Regions is charged at regular rates. There is no Data Transfer charge for data transferred between Amazon EC2 and Amazon S3 within the same Region or for data transferred between the Amazon EC2 Northern Virginia Region and the Amazon S3 US Standard Region. Data transferred between Amazon EC2 and Amazon S3 across all other Regions (i.e. between the Amazon EC2 Northern California and Amazon S3 US Standard Regions) will be charged at Internet Data Transfer rates on both sides of the transfer. (Source: Amazon S3 FAQ)

Since I have no reason to believe that SI's EC2 and S3 accounts are placed in different regions, I'm going to be a bit arrogant and assume that we can safely import data from within setiCloud itself via HTTP. The people I've talked to say this is the standard practice and I'd also be surprised if Cloudant didn't know what they were doing when they used that method in their examples.

Downloading to computers outside of the Amazon environment is still a no-go, though, apparently.

gerryharp
correct

Late, but... Yes you are correct, Dave.

gerryharp
why kurtosis is not 1

Hi Anders

Taking this discussion in a completely different direction, you may recall that the frequency spectrum of all of our data has a filter rolloff at the edge of the band. This is good since it eliminates artifacts due to band edges not lining up.

However, applying such a filter turns white noise into "pink" noise (the first time I heard of pink noise was at a Pink Floyd concert for the Pigs album). Since it is not white, the kurtosis should be non-unity.

If you cobbled up, by hand, two files, one that is (randomly generated) GWN and another that is the same file passed through a band-narrowing filter (FFT, apply bandpass correction, inverse FFT), then you could compare the kurtosis of the two files and determine how much difference the bandpass should make. The point being that this effect is present in all of our data.
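The comparison described above can be sketched in Python. Everything here is an assumption made for illustration: the pure-Python radix-2 FFT, the raised-cosine shape in `edge_rolloff_gain`, the 0.8 passband fraction, and the sample count. It is not the actual setiQuest bandpass. Note that a linear filter applied to Gaussian input yields Gaussian output, so in this idealized sketch both values would be expected to land near 0; it does not model 8-bit quantization or any other property of the recorded data.

```python
import cmath
import math
import random

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (0 for a Gaussian distribution)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0

def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    conj = fft([z.conjugate() for z in spectrum])
    return [z.conjugate() / n for z in conj]

def edge_rolloff_gain(k, n, passband=0.8):
    """Hypothetical bandpass: flat up to `passband`, raised-cosine rolloff to the band edge."""
    f = min(k, n - k) / (n / 2)  # normalized frequency, 0 (DC) to 1 (band edge)
    if f <= passband:
        return 1.0
    return 0.5 * (1.0 + math.cos(math.pi * (f - passband) / (1.0 - passband)))

rng = random.Random(3)
N = 2 ** 14
gwn = [rng.gauss(0.0, 1.0) for _ in range(N)]

# FFT, apply the bandpass shape, inverse FFT.
spectrum = fft(gwn)
shaped = [edge_rolloff_gain(k, N) * z for k, z in enumerate(spectrum)]
recovered = ifft(shaped)
filtered = [z.real for z in recovered]  # imaginary parts are rounding noise only

gwn_kurtosis = excess_kurtosis(gwn)
filtered_kurtosis = excess_kurtosis(filtered)
print(gwn_kurtosis, filtered_kurtosis)
```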

Gerry