Monday, October 7, 2019

Axis aligned artifacts

Left: Original data distribution. Right: Learned co-displacement, darker is lower.
Notice the echoes around (10, -10) and (-10, 10).


Choosing axis-aligned cuts in RRCF (Robust Random Cut Forest) creates minor artifacts, similar to those noted with IsoForest.
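As a rough illustration of where such echoes come from, here is a minimal sketch that swaps in a plain isolation tree (average path length) for RRCF's co-displacement. The data layout is an assumption: two Gaussian blobs near (-10, -10) and (10, 10), guessed from the echo locations. All function names are illustrative. Note also that this sketch picks the cut axis uniformly, whereas RRCF weights the choice by each axis's range.

```python
import math
import random


def c_factor(n):
    """Expected extra path length for a leaf still holding n points (isolation forest adjustment)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth=0, max_depth=10):
    """Recursively split 2-D points with random axis-aligned cuts."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    axis = rng.randrange(2)  # assumption: uniform axis choice
    lo = min(p[axis] for p in points)
    hi = max(p[axis] for p in points)
    if lo == hi:
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]
    return {
        "axis": axis,
        "cut": cut,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(tree, x):
    """Depth at which query x lands in a leaf, plus the leaf-size adjustment."""
    depth = 0
    while "axis" in tree:
        side = "left" if x[tree["axis"]] < tree["cut"] else "right"
        tree = tree[side]
        depth += 1
    return depth + c_factor(tree["size"])


def mean_depth(data, queries, n_trees=200, sample_size=128, seed=0):
    """Average path length of each query over a forest; deeper means less anomalous."""
    rng = random.Random(seed)
    totals = [0.0] * len(queries)
    for _ in range(n_trees):
        tree = build_tree(rng.sample(data, sample_size), rng)
        for i, q in enumerate(queries):
            totals[i] += path_length(tree, q)
    return [t / n_trees for t in totals]


def two_clusters(n_per_cluster=500, seed=1):
    """Assumed data layout: unit-variance blobs at (-10, -10) and (10, 10)."""
    rng = random.Random(seed)
    centers = [(-10.0, -10.0), (10.0, 10.0)]
    return [
        (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in centers
        for _ in range(n_per_cluster)
    ]


if __name__ == "__main__":
    data = two_clusters()
    queries = [(10.0, -10.0), (0.0, 0.0), (10.0, 10.0)]
    for q, d in zip(queries, mean_depth(data, queries)):
        print(q, round(d, 2))
```

The point at (10, -10) shares its x-range with one blob and its y-range with the other, so axis-aligned cuts take longer to isolate it than the origin, even though the origin is closer to the data. That is one plausible reading of the echoes in the figure.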

Friday, October 4, 2019

State of Astro-informatics

I had a glance through "Realizing the potential of astrostatistics and astroinformatics" by Eadie et al. (2019). While I do not feel qualified or informed enough to comment on the suggestions, I can summarize them quickly. The paper identifies three problems:

  1. Education: Most astronomers are not trained in software development, which results in code that may work well but is fragile. Similarly, most computer scientists don't have the astronomy background or connections.
  2. Funding: Grants for methodology improvement are scarce. I wonder if these things can be funded from the computer science side of things in collaborations. 
  3. Quality: As it stands, astro-informatics lacks support for state-of-the-art methodology.
I was much more interested in the final section about potential themes in research:
  1. Nonlinear dimensionality reduction.
  2. Sparsity.
  3. Deep learning.
I find the last theme incredibly broad and am unclear on exactly what they mean by it. It seems they're most interested in hierarchical representations of data. I would also claim that anomaly detection/clustering is important for reducing the volume of data.

Tuesday, September 17, 2019

Training an autoencoder with mostly noise

I am working on a project where we wish to use anomaly detection to find which image patches have structure and which don't. As an aside, I ran an experiment on MNIST. You have 500 images of fives. You have 5000 images that are pure noise. You train a deep convolutional autoencoder. What you end up with is the following reconstruction:

The top row shows the inputs and the bottom row shows the reconstructions. You find fives in the reconstructions even when the input is pure noise.

Monday, September 16, 2019

Flood

I stumbled upon a game called Flood. It's a simple enough game. You start with a grid of random colors. Then, you repeatedly change the color of the contiguous region that contains the upper left corner until you have flooded the entire grid with one color. I wrote some code and have been tinkering with it.

The most naive solver is a breadth-first search, so I did that. Below you see the solution length for grids of varying size with only three colors.
This search breaks down at large grid sizes because it's so slow. Some kind of heuristic approach would perform better, but can you prove it's within some epsilon of optimal? What is the expected optimal solution length? I think that should be provable theoretically since you just have a uniformly random grid and can constrain the growth rate. I will likely return and do that.
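As a minimal sketch (not the code mentioned above), a breadth-first solver could look like the following. It assumes colors are integers from 0 to n_colors - 1 and that the flooded region is 4-connected. The names `flood`, `is_solved`, and `solve` are illustrative.

```python
from collections import deque
import random


def flood(grid, color):
    """Recolor the region connected to the upper left cell and return the new grid."""
    n_rows, n_cols = len(grid), len(grid[0])
    old = grid[0][0]
    if old == color:
        return grid
    cells = [list(row) for row in grid]
    cells[0][0] = color
    stack = [(0, 0)]
    while stack:
        r, c = stack.pop()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            # Only cells still holding the old color belong to the region.
            if 0 <= rr < n_rows and 0 <= cc < n_cols and cells[rr][cc] == old:
                cells[rr][cc] = color
                stack.append((rr, cc))
    return tuple(tuple(row) for row in cells)


def is_solved(grid):
    """True when every cell has the same color as the upper left cell."""
    first = grid[0][0]
    return all(cell == first for row in grid for cell in row)


def solve(grid, n_colors):
    """Breadth-first search over grid states; returns a shortest list of color moves."""
    start = tuple(tuple(row) for row in grid)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, moves = queue.popleft()
        if is_solved(state):
            return moves
        for color in range(n_colors):
            if color == state[0][0]:
                continue  # picking the current color changes nothing
            nxt = flood(state, color)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, moves + [color]))
    return None


if __name__ == "__main__":
    random.seed(0)
    size, n_colors = 4, 3
    grid = [[random.randrange(n_colors) for _ in range(size)] for _ in range(size)]
    print(len(solve(grid, n_colors)))
```

Because each state branches into up to n_colors - 1 successors, the number of visited states can grow quickly with grid size, which is consistent with the slowdown described above. The `seen` set avoids revisiting grids but does not change that growth.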

Monday, September 9, 2019

Goal of Anomaly Detection in Non-stationary Data

I was explaining anomaly detection in non-stationary data to someone and threw together this crude example figure. The blue points are nominal and represent 90% of the points. The red points are anomalous and represent 10% of the points. In this example, the red data is stationary while the blue data passes through it. Thus, it would be very difficult to differentiate the red and blue points when they overlap. However, even if we only had a few frames of this video, we would like to be able to recognize that there are two dynamics going on.

The code for this is: