Monday, October 7, 2019

Axis aligned artifacts

Left: Original data distribution. Right: Learned co-displacement, darker is lower.
Notice the echoes around (10,-10) and (-10, 10)


There are minor artifacts created by choosing axis-aligned cuts in RRCF, similar to what was noted with IsoForest.
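For reference, a co-displacement surface like this can be computed with the rrcf package roughly as follows; the two clusters near (10, 10) and (-10, -10), the forest size, and the evaluation grid are assumptions for illustration, not the original setup:

    import numpy as np
    import rrcf

    rng = np.random.default_rng(0)
    # Illustrative data: two Gaussian clusters near (10, 10) and (-10, -10)
    X = np.concatenate([
        rng.normal(loc=[10.0, 10.0], scale=1.0, size=(250, 2)),
        rng.normal(loc=[-10.0, -10.0], scale=1.0, size=(250, 2)),
    ])

    # Build a forest of random cut trees on subsamples of the data
    n_trees, tree_size = 50, 128
    forest = []
    for _ in range(n_trees):
        sample = X[rng.choice(len(X), size=tree_size, replace=False)]
        forest.append(rrcf.RCTree(sample))

    def codisp(point):
        # Average collusive displacement of a query point across the forest,
        # computed by temporarily inserting and then forgetting the point
        scores = []
        for tree in forest:
            tree.insert_point(point, index=tree_size)  # unused leaf index
            scores.append(tree.codisp(tree_size))
            tree.forget_point(tree_size)
        return float(np.mean(scores))

    # Evaluate on a grid to visualize the surface and look for echoes
    xs = np.linspace(-20, 20, 81)
    surface = np.array([[codisp(np.array([x, y])) for x in xs] for y in xs])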

Friday, October 4, 2019

State of Astro-informatics

I had a glance through "Realizing the potential of astrostatistics and astroinformatics" by Eadie et al. (2019). While I do not feel qualified or informed to comment on the suggestions, I can summarize them quickly. There are three problems:

  1. Education: Most astronomers are not trained in code development, resulting in code that may be good but is fragile. Similarly, most computer scientists don't have the astronomy background or connections. 
  2. Funding: Grants for methodology improvement are scarce. I wonder if these things can be funded from the computer science side of things in collaborations. 
  3. Quality: As it stands, astro-informatics lacks support for state-of-the-art methodology. 
I was much more interested in the final section about potential themes in research:
  1. Nonlinear dimensionality reduction.
  2. Sparsity.
  3. Deep learning.
I find the last theme incredibly broad and am unclear on exactly what they mean by it. It seems they're most interested in hierarchical representations of data. I would also claim that anomaly detection/clustering is important for reducing the volume of data. 

Tuesday, September 17, 2019

Training an autoencoder with mostly noise

I am working on a project where we wish to use anomaly detection to find which image patches have structure and which don't. As an aside, I ran an experiment on MNIST: take 500 images of fives and 5,000 images of pure noise, then train a deep convolutional autoencoder on the mixture. What you end up with is the following reconstruction:

The top row shows the inputs and the bottom row the reconstructions. You find images of fives even when nothing is present.
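A minimal sketch of this experiment in Keras; the architecture, the uniform noise images, and the training settings are illustrative rather than the exact setup:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models

    # 500 fives plus 5,000 pure-noise images, matching the counts above
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    fives = x[y == 5][:500].astype("float32")[..., None] / 255.0
    noise = np.random.uniform(0, 1, size=(5000, 28, 28, 1)).astype("float32")
    data = np.concatenate([fives, noise])
    np.random.shuffle(data)

    # Small convolutional autoencoder
    inputs = layers.Input(shape=(28, 28, 1))
    h = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    h = layers.MaxPooling2D(2)(h)
    h = layers.Conv2D(16, 3, activation="relu", padding="same")(h)
    h = layers.MaxPooling2D(2)(h)
    h = layers.Conv2D(16, 3, activation="relu", padding="same")(h)
    h = layers.UpSampling2D(2)(h)
    h = layers.Conv2D(32, 3, activation="relu", padding="same")(h)
    h = layers.UpSampling2D(2)(h)
    outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(h)

    autoencoder = models.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    autoencoder.fit(data, data, epochs=20, batch_size=64)

    # Reconstructions of pure-noise inputs still tend to look like fives
    reconstructions = autoencoder.predict(noise[:8])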

Monday, September 16, 2019

Flood

I stumbled upon a game called Flood. It's a simple enough game. You start with a grid of random colors. Then, you repeatedly change the color of the contiguous region growing from the upper left corner until you have flooded the entire grid with one color. I wrote some code and have been tinkering around some. 

The most naive solver is a breadth first search. So, I did that. Below you see the solution length for grids of varying size with only three colors.
This search breaks down at large grid sizes because it's so slow. Some kind of heuristic approach would perform better, but can you prove it's within some epsilon of optimal? What is the expected optimal solution length? I think that should be provable theoretically since you just have a grid of uniformly random colors and can constrain the growth rate. I will likely return and do that. 
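A minimal sketch of the breadth first search solver; the board size, the number of colors, and the random seed are arbitrary:

    import numpy as np
    from collections import deque

    def flood_fill(grid, new_color):
        # Return a copy of grid with the region connected to (0, 0) recolored
        grid = grid.copy()
        old = grid[0, 0]
        if old == new_color:
            return grid
        stack = [(0, 0)]
        while stack:
            r, c = stack.pop()
            if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1] and grid[r, c] == old:
                grid[r, c] = new_color
                stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
        return grid

    def solve_bfs(grid, n_colors):
        # Breadth first search over board states; returns a shortest move sequence
        queue = deque([(grid, [])])
        seen = {grid.tobytes()}
        while queue:
            g, moves = queue.popleft()
            if (g == g[0, 0]).all():
                return moves
            for color in range(n_colors):
                nxt = flood_fill(g, color)
                key = nxt.tobytes()
                if key not in seen:
                    seen.add(key)
                    queue.append((nxt, moves + [color]))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        board = rng.integers(0, 3, size=(5, 5))  # 5x5 grid, three colors
        print(solve_bfs(board, 3))

Each state stores the whole board, so time and memory blow up quickly with grid size, which is the slowdown noted above.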

Monday, September 9, 2019

Goal of Anomaly Detection in Non-stationary Data

I was explaining anomaly detection in non-stationary data to someone and threw together this crude example figure. The blue points are nominal and represent 90% of the points. The red are anomalous and represent 10% of the points. In this example, the red data is stationary while the blue passes through it. Thus, it would be very difficult to differentiate the red and blue points when they overlap. However, even if we only had a few frames of this video, we would like to be able to realize there are two dynamics going on.

The code for this is:
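A minimal sketch of this kind of animation, assuming matplotlib's FuncAnimation and illustrative cluster parameters (the blue nominal cluster drifts through a stationary red anomalous cluster):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.animation import FuncAnimation

    rng = np.random.default_rng(0)
    n_frames, n_points = 60, 500
    n_red = int(0.1 * n_points)              # 10% anomalous, stationary
    n_blue = n_points - n_red                # 90% nominal, drifting

    red = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n_red, 2))
    blue0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n_blue, 2))

    fig, ax = plt.subplots()
    ax.scatter(red[:, 0], red[:, 1], c="red", s=5)
    blue_scatter = ax.scatter([], [], c="blue", s=5)
    ax.set_xlim(-10, 10)
    ax.set_ylim(-3, 3)

    def update(frame):
        # The nominal cluster drifts left to right through the anomalous one
        center = -8 + 16 * frame / (n_frames - 1)
        blue_scatter.set_offsets(blue0 + [center, 0.0])
        return (blue_scatter,)

    anim = FuncAnimation(fig, update, frames=n_frames, interval=100)
    plt.show()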


Friday, June 14, 2019

The Value of a Peer-Reviewed Activity

This week, we have been talking about proof writing in the discrete mathematics course I'm teaching. Yesterday, I started class by having students answer how confident they are about their proof writing skills on a scale of 1 to 10, 1 being "clueless and not sure where to start", 5 being "Okay and ready for homework," and 10 being "I can do most any proof you throw at me with ease." I then had them individually complete three proofs in 15 minutes.




Many students struggled with Problem 2 because they are not comfortable with sequence notation. Some misunderstood Problem 3 and tried proving something completely different than intended. After 15 minutes, they traded papers (with anonymous codes instead of their names so there was no embarrassment) and reviewed someone else's paper as we went over the proofs in class. They wrote some lovely, encouraging comments to each other. For example, one student had no idea what to do on the second problem so they left it blank, and their reviewer wrote, "Bet you can do it now! :D." Others noted where a proof failed and wrote a comment that they too had that difficulty. It's refreshing to see such kindness. Finally, they took the same survey about their proof writing again and there were dramatic changes in confidence.
Many more students were confident in their abilities. I'm not sure who reported a 1 after the exercise; I hope they come to office hours.

This is in no way a rigorous test, but students expressed that they learned more from this exercise because it forced them to think about the material instead of just going through the proofs together on the board. I imagine there is also a psychological benefit to seeing how someone else is doing and being kind to them in written comments. It was also suggested that I work through an example problem in class without proving it beforehand so students could hear my raw thought process firsthand. I'll have to think about that. I like being prepared, but I'm sure I could find some way to do that.

Tuesday, June 11, 2019

In-class assignment collection

I have been quite busy and have not prioritized logging work here; it was fairly new, so it never became a habit. I'm teaching a course in Discrete Mathematics right now, the first class where I've been the instructor. It's exciting. I have many things to learn, but I believe it's going well.

One trick that someone suggested to me is to have students try to solve a problem, write down their solution, and then hand it in with their name on it. It serves two purposes: you get a record of attendance and also get to glance through the work and gauge student progress. I award 3 points of quiz credit regardless of whether they answer correctly, as an incentive to attend class. It's a little thing I might never have thought of, but it seems very effective.

Friday, May 3, 2019

Atypicality Presentation Recap

Yesterday, I gave a presentation introducing the ideas of atypicality to the Monteleoni research group. These are the slides and handwritten notes. I plan to explore this idea further and write up better LaTeX notes, which I will then share as well. For now, the idea of atypicality centers around using two coders: one trained to perform best on typical data and one that is universal and not data specific. A sequence is atypical if its code length under the typical coder is longer than under the universal coder, i.e., it is not favored by the typical coder, indicating the information it carries is somehow unique. 
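A toy sketch of the two-coder criterion for binary sequences; the Bernoulli typical model and the simple plug-in universal coder with a 0.5 log2(n) parameter cost are stand-ins for the more sophisticated coders in the papers:

    import numpy as np

    def typical_code_length(x, p):
        # Bits to encode binary sequence x with a coder tuned to Bernoulli(p)
        x = np.asarray(x)
        k, n = x.sum(), len(x)
        return -(k * np.log2(p) + (n - k) * np.log2(1 - p))

    def universal_code_length(x):
        # Bits under a simple two-part universal coder: ~0.5*log2(n) bits for
        # the empirical rate, then encode x with that rate
        x = np.asarray(x)
        n, k = len(x), x.sum()
        p_hat = np.clip(k / n, 1e-6, 1 - 1e-6)
        return 0.5 * np.log2(n) + typical_code_length(x, p_hat)

    def is_atypical(x, p_typical, tau_bits=0.0):
        # Atypical if the universal coder beats the typical coder by more than
        # tau_bits (the threshold controls false alarms)
        return universal_code_length(x) + tau_bits < typical_code_length(x, p_typical)

    # Example: typical data is Bernoulli(0.1); an all-ones burst is atypical
    rng = np.random.default_rng(0)
    typical_seq = rng.binomial(1, 0.1, size=200)
    burst_seq = np.ones(200, dtype=int)
    print(is_atypical(typical_seq, 0.1))  # expected: False
    print(is_atypical(burst_seq, 0.1))    # expected: True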

An example of the usage of atypicality from Sabeti and Høst-Madsen's 2019 paper "Data Discovery and Anomaly Detection using Atypicality for Real-valued Data."


I presented on Elyas Sabeti and Anders Høst-Madsen's 2016 paper titled "How interesting images are: An atypicality approach for social networks". I think there are lots of opportunities for development in the image space, e.g. using different representations of images, maybe including deep learning, and exploring what made those images interesting by training a supervised classifier on the resulting labels and examining the learned features. I'm concerned that their atypicality could be keying on background features; a lot more investigation is needed to understand the details of this application. I also think the image application needs more rigorous validation. They could have tested against other kinds of images to see if those were also labeled as atypical. One idea suggested by a member of our group (Amit Rege) is using the atypicality idea in a downstream application to speed up stochastic gradient descent by picking atypical examples to learn from. 

A list of atypicality papers, by Sabeti & Høst-Madsen:

Wednesday, April 24, 2019

Ulam–Warburton automaton inquiry: Part 1

The Ulam-Warburton automaton is a simple growing pattern. See Wikipedia or this great Numberphile video for more information. For a more technical treatment, see this paper too.

Ulam-Warburton animation from Wikipedia
I was curious what you'd get under various other versions of it, using the same basic rule of "turn on cells with exactly one neighbor" but with a tweak. For example, what happens if a cell turns off after being active for a few cycles?


You get this beautiful modification. I plan to follow this up more and will make code available then (although it's insanely simple). I would be curious what statements you can make about the periodicity and the number of active cells at any time. 
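A minimal sketch of the dying variant, assuming a square grid with von Neumann neighbors, a single seed, an arbitrary lifespan, and that cells which die are treated like off cells and may later reactivate:

    import numpy as np

    def step(age, lifespan):
        # One synchronous update of the "dying" Ulam-Warburton automaton.
        # age[i, j] == 0 means the cell is off; age > 0 counts cycles alive.
        alive = (age > 0).astype(int)
        # Count live von Neumann neighbors with shifted copies of the grid
        # (periodic boundary via np.roll; keep the grid large enough that the
        # pattern never reaches the edge)
        neighbors = (np.roll(alive, 1, axis=0) + np.roll(alive, -1, axis=0) +
                     np.roll(alive, 1, axis=1) + np.roll(alive, -1, axis=1))
        new_age = age.copy()
        new_age[alive == 1] += 1              # living cells get older
        new_age[new_age > lifespan] = 0       # ...and die past the lifespan
        births = (alive == 0) & (neighbors == 1)
        new_age[births] = 1                   # off cells with exactly one live neighbor turn on
        return new_age

    if __name__ == "__main__":
        n, steps, lifespan = 129, 40, 5
        age = np.zeros((n, n), dtype=int)
        age[n // 2, n // 2] = 1               # single seed in the center
        counts = []
        for _ in range(steps):
            age = step(age, lifespan)
            counts.append(int((age > 0).sum()))
        print(counts)                         # active cell count per step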

For example, empirically it seems that the number of active cells in a "dying" version is always upper bounded by that of the ageless, standard Ulam-Warburton automaton. Now, prove that and derive formulas (or prove it's not possible) for a generalized version. 

Other ideas:
  • How does the total cell count formula change depending on the starting configuration, e.g. more than one active cell? 
  • Are there interesting stochastic versions? 
  • What happens when cells have a regeneration period, a time after they die before they can activate again? That might model disease and other phenomena better, since resources/population have to recover before a new outbreak is successful. 
  • What if the age of a cell is a function of its position on the plane? 
  • Can we generalize to other grid types? 
  • How does this fit into other work? Has it already been done? 
This whole curiosity partially started because I wanted to assign a simple proof about Ulam-Warburton to my summer discrete math class. I also have an affinity for fractals; who doesn't?




Thursday, April 18, 2019

Denoising presentation

Slides [ppt or pdf] for a presentation I gave on denoising images. Noise2Self is amazing.

Check out this video from the related Noise2Void:

Wednesday, March 27, 2019

Training a denoising autoencoder with noisy data

How do you denoise images with an autoencoder if you don't have a clean version to train with? One option is to add more noise to your images! In this experiment, I trained an autoencoder with noisy MNIST data. I began with the MNIST images on the bottom row, the noiseless versions. To simulate observational data, I added Gaussian noise to the images. In reality, we may never have access to these noiseless images. To train a denoising autoencoder, we normally need a noisy input set and a noiseless output set so the autoencoder can learn the denoising procedure. However, an autoencoder can potentially also learn the denoising procedure if we give it extra-noisy images as input and the less noisy observations as output. To simulate this, I added more Gaussian noise to the observations to arrive at the top row. Then, the top row is the input and the second row is the output for training. When we want to denoise observations, we use this trained network with the observations as input and take the denoised row as our output.
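A minimal sketch of this setup in Keras; the noise level, the small architecture, and the training settings are illustrative:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models

    # "Observations": MNIST plus Gaussian noise; pretend the clean images do not exist
    (x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
    x_train = x_train.astype("float32")[..., None] / 255.0
    x_test = x_test.astype("float32")[..., None] / 255.0
    sigma = 0.3
    observed = x_train + np.random.normal(0, sigma, x_train.shape)
    observed_test = x_test + np.random.normal(0, sigma, x_test.shape)

    # Add a second round of noise to make the training inputs
    noisier = observed + np.random.normal(0, sigma, observed.shape)

    # Small convolutional network
    inputs = layers.Input(shape=(28, 28, 1))
    h = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    h = layers.Conv2D(32, 3, activation="relu", padding="same")(h)
    outputs = layers.Conv2D(1, 3, padding="same")(h)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")

    # Learn the map from extra-noisy inputs back to the noisy observations
    model.fit(noisier, observed, epochs=5, batch_size=128)

    # At inference time, feed the observations themselves to estimate clean images
    denoised = model.predict(observed_test)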

I am not sure how sensitive this is to having an accurate noise model or to the amount of noise added. In the solar extreme ultraviolet setting, we suffer more from shot/Poisson noise than Gaussian noise. I am unsure how well this approach works in that setting.

An arguably more elegant approach to this problem is the "Blind Denoising Autoencoder" by Majumdar (2018). It does not require this noise addition or noiseless images.

Direction specific errors and granularity

In solar image segmentation, we identify many categories of structures on the Sun: coronal hole, filament, flare, active region, quiet sun, prominence. In our use case, some mistakes are more egregious than others. For example, mistaking a filament for a coronal hole is not too bad, nowhere close to as bad as calling it a flare. Assume we have a gold standard set for evaluation. (In reality, even this gold standard set may have errors, but we can ignore that for now.) It has a region labeled as filament. Ideally, we want our trained classifier to also call that a filament. However, if it calls it quiet sun, we would be okay. Calling it coronal hole is also acceptable. Any other category is wrong, with the most egregious being if we call it outer space or flare. Now suppose another portion of the Sun is labeled quiet sun in the gold standard. It is not okay for the classifier to then call it filament. In this way, it is acceptable to mistakenly label a filament as quiet sun but unacceptable to label quiet sun as anything else. The error depends on the direction of the mistake.
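One way to encode this is an asymmetric cost matrix indexed by (gold label, predicted label); the class list follows the one above, but the cost values are made-up placeholders:

    import numpy as np

    # Rows = gold label, columns = predicted label; costs chosen only to show
    # that cost(filament -> quiet sun) is small while the reverse is not
    classes = ["coronal_hole", "filament", "flare", "active_region",
               "quiet_sun", "prominence", "outer_space"]
    i = {name: k for k, name in enumerate(classes)}
    C = np.ones((7, 7)) - np.eye(7)            # default: every mistake costs 1
    C[i["filament"], i["quiet_sun"]] = 0.2     # forgivable direction
    C[i["filament"], i["coronal_hole"]] = 0.3  # also acceptable
    C[i["filament"], i["flare"]] = 2.0         # egregious direction
    C[i["filament"], i["outer_space"]] = 2.0
    C[i["quiet_sun"], i["filament"]] = 1.5     # the reverse mistake is not okay

    def weighted_error(gold, pred):
        # Average asymmetric cost over pixels; gold and pred are index arrays
        return C[np.asarray(gold), np.asarray(pred)].mean()

    # Calling a filament quiet sun is cheaper than the reverse
    print(weighted_error([i["filament"]], [i["quiet_sun"]]))   # 0.2
    print(weighted_error([i["quiet_sun"]], [i["filament"]]))   # 1.5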

Similarly, in our current evaluation we evaluate errors on a pixel-by-pixel basis. In reality, we do not care about this granularity. We want coherent labeling. Small boundary disagreements are okay. We need a more robust evaluation metric.

TSS versus f1-measure





The above movie shows how accuracy, TSS, and the f1-measure change under the assumption that a classifier produces no false positives until it has classified all of a class correctly. The vertical grey line marks the actual percentage of examples belonging to the class, while the horizontal axis shows what percentage of the examples the model labels as that class. For example, if the true class percentage is 0.1, as shown below, an aggressive classifier, one that prefers creating false positives, is punished much less by accuracy and TSS than by the f1-measure. If the model classifies 20% of the examples as positive, the accuracy and TSS are around 0.9 while the f1-measure drops to around 0.65. Selecting your metric is very important, depending on whether you prefer false positives or false negatives.
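A small sketch of the quantities behind the movie, under the same assumption of no false positives until the whole class is captured; TSS is taken as true positive rate minus false positive rate:

    def metrics(prevalence, frac_labeled_positive):
        # Accuracy, TSS, and F1 when the classifier makes no false positives
        # until it has captured every true example
        p, f = prevalence, frac_labeled_positive
        tp = min(p, f)
        fp = max(f - p, 0.0)
        tn = 1.0 - p - fp
        accuracy = tp + tn
        tss = tp / p - fp / (1.0 - p)          # true skill statistic: TPR - FPR
        precision = tp / f if f > 0 else 0.0
        recall = tp / p
        f1 = 2 * precision * recall / (precision + recall) if tp > 0 else 0.0
        return accuracy, tss, f1

    # Aggressive classifier example from the text: 10% prevalence, 20% flagged
    print(metrics(0.1, 0.2))   # accuracy ~ 0.9, TSS ~ 0.89, F1 ~ 0.67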


Thursday, March 21, 2019

Motivation for Denoising Solar images

I am beginning a project to denoise solar images. Here is a motivating example from March 8th, 2019. Off the limb of the Sun, faint features can be seen. These are hard to study without denoising. I am also interested in using dictionary learning for the denoising so that I can exploit the learned atoms as a mechanism for classifying the solar features too. Solar denoising relies on a Poisson noise model, different from the commonly used additive or impulse models.

Wednesday, March 13, 2019

Boulder Solar Day

I attended the second half of Boulder Solar Day yesterday and gave a talk on some of what I've worked on and plan to work on. The slides are available here.

Monday, February 25, 2019

Generate Thematic Maps from Heliophysics Event Knowledgebase

The below script will allow you to generate thematic maps from the valuable Heliophysics Event Knowledgebase (HEK). I have written it to take a SUVI thematic map as input and output only the Spatial Possibilistic Clustering Algorithm (SPoCA) coronal hole and bright region patches from HEK, but I would be willing to help others modify the script as needed.
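For reference, the HEK query step might look roughly like the following with sunpy's HEK client; the time range, the coronal hole event type, and the frm_name filter are illustrative assumptions, and the rasterization onto the SUVI thematic map grid is omitted:

    from sunpy.net import hek

    client = hek.HEKClient()
    results = client.search(hek.attrs.Time("2019/02/25 00:00", "2019/02/25 06:00"),
                            hek.attrs.EventType("CH"))   # coronal hole detections

    # Keep only SPoCA detections; each carries its boundary as a polygon in
    # helioprojective coordinates (hpc_boundcc), which can then be rasterized
    # onto the SUVI thematic map grid
    spoca = [row for row in results if "SPoCA" in row["frm_name"]]
    for event in spoca:
        print(event["event_starttime"], event["frm_name"])
        print(event["hpc_boundcc"][:80], "...")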

Expert labeled map
SPoCA map from HEK



The script can be found below and is bundled with smachy, my solar image segmentation toolkit.