Sieve of Eratosthenes

I’m using this post to introduce the Sieve of Eratosthenes, which is currently my favorite algorithm. See the Java implementation below:

// Sieve of Eratosthenes

public int sieve_of_eratosthenes(int input)
{
    boolean [] is_prime = new boolean [input];
    int number_of_primes = 0;
    
    // Set all values to true
    Arrays.fill(is_prime, Boolean.TRUE)
    
    
    // Sift
    for (int i = 2; i <= sqrt(input); i++)
    {
        if (is_prime[i])
        {
            for (int j = pow(i, 2); j <= input; j++)
            {
                if ((j % i) == 0)
                {
                    is_prime[j] = false;
                }
            }
        }
    }
    
    // Collect number of primes
    for (int i = 2; i <= input; i++)
    {
        if (is_prime[i])
        {
            number_of_primes++;
            System.out.println("%d %n", is_prime[i]);
        }
    }
    
    return number_of_primes;
}

The Sieve of Eratosthenes is a 1900 year old algorithm that can efficiently find all of the prime numbers under a given number. The work of finding prime numbers is no trivial task in the world of computing: RSA cryptosystems, for instance, rely on the factorization of one large number into two prime numbers. Prime numbers also play a role in the generation of psuedo-random numbers. Such examples underscore the importance of quickly generating primes in computing, and the Sieve of Eratosthenes does just that. To describe the algorithm’s process, I will refer to the inputted upper boundary as :

  1. Every number between 2 and is listed.
  2. Iterate through each number
  3. If the number is less than or equal to and is prime, label all of its multiples under as composite (not prime).
  4. After all numbers have been iterated through, the ones not labeled as composite are prime.

The GIF below is incredibly helpful for those with a more visual predilection. When watching it, consider the quote

Sift the Two’s and Sift the Three’s, The Sieve of Eratosthenes. When the multiples sublime, The numbers that remain are Prime.

GIF of Sieve of Erastosthenes

Most research states that this particular implementation of the sieve is the most efficient algorithm for determining primes under .

Given the fact that this algorithm is , the code completes amazingly quick for inputs under the specified range. As the input gets larger, however, it becomes necessary to switch to other sieves due to memory constraints. Even so, other sieves use a similar approach to this one.

I’ve used the above code to optimize my solutions to Project Euler problems, one of which asked to find the sum of all the primes below two million. The solution was calculated in less than five seconds.

That, in essence, is what makes this algorithm so great. In less time than it takes to drink a couple ounces of coffee, a wild number of prime numbers can be calculated.

On Unkept Promises & Productivity

I have an irrational fear of appearing hypocritical. This fear is best demonstrated in my inability to complete what I said I would:

Every important algorithm or data structure that I come across in [Robert Sedgewick’s] Algorithms, I will write about right here on this blog.

I wrote that three months ago. Pathetic that I haven’t posted anything about Algorithms, right? Probably.

Let me explain to you how this promise went unkept, and you’ll learn a couple of morsels relating to productivity.

In the post where I told all of you that I would write about algorithms and data structures, I said that I would do the necessary reading on the bus ride home from my internship at TMC Bonds in New York.

It became crystal clear to me within the first few days of my internship that the stamina required to do that was something I simply did not have. Eight hour work days were new to me, and so I decided that reading Sedgewick would come on the weekend.

It became clearer within the first few weeks of my internship that Saturday and Sunday were necessary to recharge from the week that was; reading a textbook did not fit into this idea well. Thus, I was about a month out from where I had started, and no progress had been made on my conquering of the textbook.

But why? A feeling of paralysis as a result of not starting the textbook when I was supposed to. As the summer progressed, the less possible it seemed to do the task at all and fulfill the promise that I had kept. That’s one problem.

An additional problem is the wide expanse of things I’m interested in doing at any given time. Since middle school, I’ve kept an agenda of assignments and goals in my head. I’ll chalk it up to changes in my neurochemistry that prohibit me from keeping the agenda as well I used to, now that I’m at the ripe old age of 19. As a result of this change, I am implementing a trick from a friend at Michigan: the Post-It system. Make a Post-It note for everything you need to do. Once it’s done, get rid of it.

It doesn’t require much of an explanation. I’ll let you know if I’m still doing it once I’m at school.

What can we learn from this boring, self-centered post? Two things:

  1. The longer a task goes on the back burner, the harder it becomes to start it.
  2. Trying to keep all of your ideas organized in your head is damn pointless.

You would think that after 19 years on this planet I would have figured this all out by now, but it is clear that I have work to improve on. We all do, I imagine.

A Neat Mathematical Proof

Here’s a quick mathematical proof that I came across on Quora. Rather than simply giving you the link, I’m going to repeat it with some of my own commentary because I like it so much:

Prove that there are two irrational numbers and  such that  is rational.

Proof: Consider two cases for the instance in which : either is rational, or it is irrational.

The first case is straight forward: if , then both  and are irrational. This case goes ahead and states that  is rational, so our proof suffices for this specific case.

The real fun, however, comes in the second case, in which  is irrational. Where do we go from here?

[Aside: About eight months ago, I likely would have been bewildered at this point in the proof and would have promptly given up. But to quote Shakespeare’s _Cymbeline_, I will tell my past self to “Fear no more” (Act IV, Scene 2), for the next part of this proof is damn nifty.]

Think about it this way: in the second case, we suppose that  is irrational. If we set and set , we can say that . This, in turn, means that 

As a result, we can say that if , then either  is rational, or  is rational. And because in either case two irrational numbers  and  yield a rational number , our proof is complete! Go on, celebrate with your friends and relatives. A Fields Medal is in your future. You could be the next Will Hunting for all I know.

It took me a long time to realize the beauty of mathematical proofs, largely because of how frustrating I first thought them to be. Enrich yourselves with simple proofs like these, and you’ll learn to appreciate the simplicity and logic that dictates higher mathematics. You’ll become better with proof writing as a consequence.

Summer Work & Algorithmic Series

I feel the need to make this somewhat of a monumental announcement, as this blog may finally get updated more than once every four months. I recently decided that I would be interning with TMC Bonds this summer. TMC is a company based in Manhattan that creates electronic trading platforms for fixed-income securities. Bank of America Merrill Lynch, Citi Global Markets, and Morgan Stanley all have a stake in TMC.

About half of the company is focused on software development in one form or another. I think I will learn quite a bit in my 10 week stint there.

Now, the reason I bring this up is not simply to declare the fact that I’m working. The one hour bus ride to and from my hometown of Caldwell means that I will have plenty of time to do some reading. I’ve decided I will spend that time reading the fourth edition Robert Sedgewick and Kevin Wayne’s classic textbook, Algorithms. I chose this text over Thomas Cormen’s Introduction to Algorithms because of the fact that Cormen’s work focuses more heavily on explanations with discrete mathematics. I find that algorithms are best explained with code (in the case of Sedgewick’s text, Java) and plenty of English. Additionally, Sedgewick’s work comes with an extensive library of online materials that I will very happily toy with once I get back from work.

Every important algorithm or data structure that I come across in Algorithms, I will write about right here on this blog. The greatest way to learn about something is to teach others about it, and so I believe that writing about the data structures and algorithms I come across will prove to be a fruitful exercise.

One auxiliary benefit is that I expect this process of reading and writing will provide great training for the rite-of-passage in the EECS department at the University of Michigan: EECS 281. Though they use Cormen’s work as the required textbook, I’m sure Sedgewick will be a fine replacement.

Check back in for my enlightening discussion (yeah, right) on everything from mergesort, to quicksort, to heapsort, to priority queues, to everything in between.

P vs. NP

In the following post, I’ll show you how a problem in theoretical computer science relates to a cure for cancer.

About a month ago, I had the blessing to study vs. for my discrete mathematics class. While we only spent about five minutes discussing my first taste of theoretical computer science, I was desperate to have a greater understanding of the topic. While studying the topic independently, I came across this phenomenal introductory video:

While summarizing this video to everyone in the hallway of my dormitory, my good friend came back with a book on the subject: The Golden Ticket by Lance Fortnow, chair of the School of Computer Science at Georgia Tech.

If you chose not to watch the video, fear not. I’ll give you a very rudimentary understanding of the topic without getting too technical (the author of the video did it far better than me):

  • is defined as the set of all problems that can be verified by a computer about as quickly as they can be solved. Examples of include practically any type of mathematical operation: addition, subtraction, multiplication, division, and even some factorization.  The stands for “polynomial time”; meaning that the efficiency of the algorithm used to solve the problem is bounded by the size of the problem’s input. (For those with some discrete mathematics background, means that the number of steps required to complete the algorithm is , where is the size of the problem’s input, and is some nonnegative integer.
  • is a little bit trickier: it is the set of all problems that can be quickly verified by a computer, but cannot quickly be solved. Many truly intriguing problems are NP-problems: the travelling salesperson problem, the job shop scheduling problem, finding solutions to Hamiltonian paths, and finding prime factors of huge numbers (like 232-digit numbers). stands for “nondeterministic polynomial time,” meaning that the problem can be solved by a nondeterministic Turing machine (NTM). From what I understand, problems that can be solved by a NTM can be solved by checking every single possible solution.

For simplicity, stick the definitions I provide in the first sentence of each bullet, for that is what  and  truly boil down to according to Cobham’s thesis. From here, we must make note of a couple important observations:

  1. Any  problem is an  problem. This is because any problem also satisfies the definition of an problem. Think about it: an is pretty broad: it is any problem that can be solved by checking every single possible solution.  problems can naturally be solved this way, and so we can say that (that is, the set  is a subset of the set ).
  2. Some problems have first been classified as , only to be discovered to actually be part of  as well: finding gigantic prime numbers is the most highlighted example.

The fact that some problems are later discovered to actually be part of  begs the question: does ? That is, is every problem with an easily verifiable solution also easily solvable? I’ll give you a couple of examples of -complete problems: the hardest of all  problems that computers haven’t been able to solve. The nature of -complete problems is that if we can find a fast and reliable solution to one, we can theoretically find solutions to any problem. Take a look:

  1. The travelling salesperson problem. It asks if there is an optimal way to travel to a variable number of cities. It is easy to figure out the best (shortest) way to travel between a small number of cities, but the difficulty of the problem grows at a hyper exponential rate as the number of cities needed to travel to gets greater.
  2. Sudoku. Being able to check that a solution to a sudoku puzzle is correct is simple enough: all you have to do is check that each row, column, and sub-grid contain the digits from 1 to 9 once and only once. With a standard nine by nine grid, sudoku can be solved and verified by a computer with ease. But if the sudoku grid becomes 1000 by 1000, you can bet that a computer will have a tough time solving the puzzle in a reasonable amount of time.
  3. Tetris: you know, that game you played in middle school when you should have been perfecting your skills in typing class? As Fortnow describes in The Golden Ticket: in “traditional Tetris you don’t know what shape will come next. But even if you knew the order of the shapes ahead of time, playing Tetris optimally is…NP-complete” (63).
  4. Factoring large numbers. If you’re anything like me, you might have trouble realizing that . I didn’t choose this factoring problem randomly: if you noticed, both  and are prime numbers. As it turns out, finding  prime factors of numbers that are hundreds of digits long is difficult. So difficult, in fact, that it requires research teams multiple years to find the factors of just one astronomically large number. If it took more than a half a dozen Ph.Ds armed with computers more than two years to figure out one of these factoring problems, imagine how difficult it would be for one computer to do it by itself. (Keep in mind that it is a good thing that computers have such a difficult time doing this task. Finding the prime factors of large numbers is the basis for RSA encryption, which is the premier cryptographic method for things like online banking).

As I already mentioned, the problem regarding  and is determining whether the two equal one another. If you get mesmerized by graphs, this is how  and breaks down:

Graphical representation of P vs. NP

That’s all well and good, but I think the world-at-large would have a better understanding of the gravity of  and  if they understood the problem’s consequences, and that is where is Fortnow’s The Golden Ticket comes into play. Before I describe Fortnow’s interpretations of a  world, consider a quote from Scott Aaronson, a theoretical computer scientist at MIT:

If , then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found.

What I like about Aaronson’s quote is that it bridges the divide between a layman’s understanding of problem solving and a theoretical computer scientist’s understanding of problem solving. There is also the incredibly powerful message that the quote delivers as well: the fact that in a  world, so long as you can recognize that a solution to a problem is correct, you can also find a way to solve that problem quickly.

Fortnow, too, paints a picture of what the world would be like if . A picture so broad, in fact, that he devotes an entire chapter of The Golden Ticket to it! In particular, Fortnow applies the results of a world to a woman’s trip to the doctor’s office. In Fortnow’s story, the woman is diagnosed with pancreatic cancer through a simple blood test. The real kicker, however, comes in regards to treatment. As the doctor explains,

Much has changed in the last decade. General approaches to cancer have had limited success. We realized that one needs an individual approach. We can examine a person’s DNA as well as the mutated DNA of the cancer cells and develop proteins that will fold in just the right way to effectively starve the cancer cells without causing any problems for the normal cells. The dead cancer cells just wash out of the body (15).

Fortnow is describing what the Hackerdashery YouTube video calls “protein folding”: one of the most promising theoretical cures to cancer. A  world would allow for the creation of such individualized proteins for the treatment of cancer through efficient and fast algorithms that could find just the type of protein needed for the job. Were we to prove that , we could effectively find a cure for cancer that involves protein folding. That’s how theoretical computer science relates to cancer. (See? I did what I promised!).

Don’t get your hopes up, however. The vast majority of theoretical computer scientists believe that . The overarching significance of this scenario is that computers just can’t solve everything. In a time where tech start-ups are claiming they can do just about everything for you, there is a race in academia to figure out whether they can theoretically back up that claim. That race is  versus , and you now have the understanding of it to annoy others about it over a meal. Go tell your mother about it!

Better yet, go develop a proof for it. I’ll wait here patiently.