vinithavs

Smitten by Python

“Programs are meant to be read by humans and only incidentally for computers to execute.”
― Donald Knuth

Python, a programming language that supports multiple programming paradigms and was first released in 1991 (yes, earlier than Java!), has gained wider acceptance through its contributions to easier programming in scientific computing. This wonderful language has been my companion throughout my life as a research student, thanks to its rich libraries and ease of use, especially the REPL.

I was excited to attend EuroSciPy 2018, which took place in Trento, a beautiful city in a valley in Italy. I attended the main conference, held on the 30th and 31st of August 2018. It showcased the use of Python in different scientific applications, with talks by people from academia as well as industry who use Python to get their job done. It was interesting to observe that most of the talks mentioned the use of Python in machine learning and data science. Also, various Python libraries that people initially developed for use within their research groups or companies have been made open source, so that others can contribute to their further development.

A few talks stood out to me, and I would like to mention some points about them.

One of these was about a Python library called imbalanced-learn, presented by Guillaume Lemaitre. This library is used to make more accurate predictions when the training data set is skewed, with some classes having far fewer samples than others. For example, in problems such as cancer cell detection, solar wind records, and car insurance claims, the ratio of data samples across classes can be as high as 26:1. The approaches used to address this involve a combination of unsupervised learning (outlier detection), semi-supervised learning (novelty detection) and supervised learning (resampling).
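To make the resampling idea a little more concrete, here is a minimal sketch of random over-sampling written with only the Python standard library. It is not the imbalanced-learn API, and it is not taken from the talk; the function name `random_oversample` and the toy data are my own assumptions, chosen only to illustrate what "balancing" a 26:1 data set means.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data with a 26:1 imbalance: 26 "majority" samples, 1 "minority" sample.
samples = list(range(27))
labels = ["majority"] * 26 + ["minority"]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

After resampling, both classes contain 26 samples. Plain duplication like this is the simplest possible strategy and can encourage overfitting on the repeated samples, which is one reason dedicated libraries offer more careful methods.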

Various researchers from the biomedical community were also present at the conference, explaining how Python and its libraries are used to solve interesting problems in the biomedical field, such as named entity recognition (using a library called OGER), dimensionality reduction in neuroscience (using techniques like Tensor Component Analysis and demixed PCA, in addition to standard PCA), and determining phase space dynamics in biomedical applications (using Chaosolver).

Another interesting talk, which I found particularly pragmatic, was titled ‘How not to screw up with Machine Learning in Production’. It focused on the parts of a machine learning system that are essential for production in addition to the core models, such as handling training/serving skew and data validation. The talk suggested using existing solutions (tools such as TensorFlow Serving, Clipper, Apache PredictionIO, SeldonCore/KubeFlow) or a hybrid approach instead of building the entire ML ecosystem from scratch.

This was my first time attending an international tech conference, and it has given me many valuable experiences and insights. I am sure that I can make good use of the open-source Python libraries introduced to me at the conference. Getting to know the speakers and discussing my interests and challenges with them has also widened my horizons. Now that I am back from the conference, all that remains are the memories of the kind people I met in Trento and some wise words and thoughts from the speakers.

Beginning, not the end

“When you reach the end of what you should know, you will be at the beginning of what you should sense.”

–Kahlil Gibran

Officially, Outreachy 15 has come to an end. Yet this is more like a beginning for me than an end. The knowledge I have gained in these months has given me more confidence and courage to explore deeper and wider. This blog is mainly for those girls who are still doubtful about applying to Outreachy or who have to face the fears of a beginner. I am going to tell you about some fears you might face, which I too had in the beginning. I got an opportunity to be mentored by awesome mentors. I hope you get to meet yours too.

When I heard of Outreachy I wanted to try my best and get selected, but I also had the fear of exposing my ignorance. There was always a fear of “what if I do something wrong?”. For everyone who is in this position, the only words you need to remind yourself of are “It is alright to be wrong.” We have always been told to do right, but here you have an excuse. Writing wrong code while learning to code has not harmed anyone. Nobody started out perfectly. The more mistakes you make, the more you know what not to do. You only need to take care that you don’t repeat the same mistakes. Try to make new ones! And after making a lot of mistakes, you gain enough confidence: confidence to make many more mistakes fearlessly. During this process, you learn how to write code which works.

How to begin? There’s nothing wrong with asking for help about how to begin. But something else would be an even better way. Identify the bugs in the project’s issue tracker which are tagged as ‘easy to solve’ and try fixing them. You may not know the programming language used and you may fail miserably. But you have made the first move in the right direction. Try to understand what the code is doing. Search online to check anything you don’t understand. This can be very intimidating in the beginning. Trust me when I say that you are still on the right path. The more time you spend with the code, the better you will understand it. If you are able to fix the bug, well and good. As a beginner, you will mostly take more time. If you are not able to fix it, then start asking specific questions like “I tried to do <this> and it is failing <here>. Can I get some help?”. Mostly you will get help at this stage. If you don’t, there is nothing to worry about. Everything is still fine. Keep trying to tweak the code and keep asking. The more effort you put in, the better the chances that you are heard.

If there is one skill that I think is good to have before starting, it is a knowledge of git. This is because to start fixing bugs, you need to clone the code and start working on it. And the good news is that it won’t take you much time to get the basics of git. Initially, you only need to get the code onto your system. And there are many online tutorials, blogs and videos which can help you get started.

There is no right time or wrong time to start. If you have some time, start by checking the participating organizations and find the project that excites you. Don’t give up. The hard work is totally worth it.

I am my data…

They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.

–Benjamin Franklin

Finally, the data collection module is successfully deployed. To build a system which can identify which of the accounts registered are humans and which are bots, we need information about both bot behavior and human behavior. What data should you collect? The data which can mark the differences in bot and human behavior. There is a bigger question. What data can you collect, rightfully?

Collecting data without raising privacy concerns takes far more thought. The problem we are trying to solve differs from other projects that build bot-preventing CAPTCHAs. How so? It is because of the way the privacy of the common person is taken care of. I could not be more proud to be part of an organization which does this. In this age, when we keep hearing news of private data being misused, this organization deserves great applause for taking steps to ensure that all data collected maintains the anonymity of the user, and that no data is collected which could cause concern at any time.

There are many reasons why people are concerned about the privacy of data. The most important one is that possessing important data gives power: the power to know more about an individual than the person herself wants to reveal. And there are consequences when power is in the wrong hands. So the safest solution is not to collect any such data, rather than swearing that it will not be used for anything bad.

As the late Nobel laureate Gabriel García Márquez said, “All humans have three lives: public, private and secret.” Let’s not mix that up 🙂

Reaching out

Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.
– Marie Curie

Yet another two weeks have gone by. I am on track. And like anyone else, when I got stuck while coding or when I got some error message which made me look like an idiot, I wasted no time in googling my issue and finding the cause. With the internet at my fingertips, I am powerful (evil laugh). This was not the case a few years back. Living in one of the villages in India, I have struggled with low-speed internet connections and poor response to requests for repairing or replacing my ‘lightning-struck modem’. Why am I talking about internet connections in this blog? Access to information is one of the main contributing factors to progress in this digital era. When there are 45 million rural households in India where electricity has not reached, what can I say about the internet?

It is not just unfortunate, but also sinful to deprive people, especially students, of internet access. I cannot imagine anyone getting to know about open source projects or the wonderful world of programming without an internet connection and a laptop/PC. Though internet connectivity using mobile phones has gone up, we should ask ourselves if that is going to benefit the student community enough. Can they successfully use MOOCs, get to know about the (life-changing) opportunities, and enjoy coding on a 5-inch screen? Of course, something is better than nothing.

Another interesting fact is that even among those fortunate few who get access to the internet, the women/girls have to take another step back. The ridiculous reasons behind this are that girls are expected to stay at home and also that they do not have control over their own finances.

Every day of this internship, I am blessed in many ways. These include very supportive organizations, especially MediaWiki, my mentors and, of course, electricity and an internet connection. This internship is not just an opportunity for self-development; the good intention behind it, to bring about inclusive development, is also a good life lesson. Let this good deed breed more good deeds.

Halfway through

“If you are not willing to learn,

no one can help you.

If you are determined to learn,

no one can stop you.”

–Zig Ziglar

It has been a wonderful journey so far. It’s almost like I got access to a new world which I didn’t know existed before. And today I am writing code, asking questions and, more importantly, learning. I can confidently say that I have walked past the stage where I felt like a total stranger in the open source world. I have the responsibility of doing a task, making efforts to take care of all pitfalls which may occur later, and discussing the progress and roadblocks with mentors. I am confident and happy with the way this journey has been progressing.

The last two weeks involved creating an extension to capture data which will be used in later stages. Digging deep inside the code requires a lot of patience and curiosity. The moment you take a first good look at the code, you cannot expect to have love at first sight. This love only develops when you and the code spend more time together. The more time you spend with the code, the better you know it and the readier you become to modify or add to it. 🙂

I learnt to observe the behavior of code closely and take care of all edge cases which need to be handled. Even then, the code, when reviewed, revealed more instances where care was needed. One among many things I learnt was how to use $.throttle to capture event-linked data at regular intervals. I also learnt how to include third-party code, check the license and give credit where due. Even with regular discussions and meetings, we still might miss some things in code. The code reviews have been extremely helpful in making sure that nothing falls through the cracks. 🙂
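For readers who have not met throttling before, here is a small sketch of the idea in Python. It is only an illustration: $.throttle itself is a jQuery (JavaScript) helper, and this sketch does not reproduce its exact behavior or options. The names `throttle`, `record_position` and the simulated clock are my own assumptions. The sketch simply drops any event that arrives less than `interval` seconds after the last one that was accepted.

```python
import time


def throttle(interval, clock=time.monotonic):
    """Decorator: let the wrapped function run at most once per `interval` seconds.

    Hypothetical helper for illustration. `clock` is injectable so the
    example below can use simulated time instead of real time.
    """
    def decorator(func):
        last_call = None

        def wrapper(*args, **kwargs):
            nonlocal last_call
            now = clock()
            if last_call is not None and now - last_call < interval:
                return None  # too soon after the last accepted call: drop it
            last_call = now
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Simulated clock so the example is deterministic.
fake_time = [0.0]
recorded = []


@throttle(interval=1.0, clock=lambda: fake_time[0])
def record_position(x, y):
    """Stand-in for an event handler that stores event-linked data."""
    recorded.append((x, y))


# Ten simulated mouse-move events, 0.25 seconds apart.
for step in range(10):
    fake_time[0] = step * 0.25
    record_position(step, step)

print(recorded)  # [(0, 0), (4, 4), (8, 8)]
```

Out of ten events, only the ones at 0 s, 1 s and 2 s are recorded, so the amount of captured data stays bounded no matter how fast the events fire.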

Overall, the fortnight went well, coding happily. But now it’s time for me to buckle up and shift gears. 🙂

Step by step

“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all – in which case, you fail by default.”

– J. K. Rowling

I learned two important lessons last week.

One is that organizing your tasks can help you do things in a systematic manner. I had many small tasks to do and the details of these tasks were mostly in emails. But referring to emails every time to check details of tasks is very tedious and inefficient. However, there is a much better way – put everything in Phabricator tasks and link information as necessary. Using the Phabricator workboard to organize my tasks, everything was visible to me in one place. I could arrange tasks as per their priority and dependencies. Earlier, I used to spend a lot of time thinking about what I had to do and should not forget. This would go on repeatedly in my mind, though not consciously. Now I am freer to focus on doing the actual work.

The second crucial lesson is about communicating ideas clearly, especially to people who do not have any prior context. Since I have been thinking about the project most of the time, I (unknowingly) assume that everyone else has some basic idea about it. This can lead to unfortunate misunderstandings. There is no point blaming Murphy when there is something we can do to improve the situation. In communication, the sender should not assume anything about the receiver and should make the ideas to be communicated clear. This was a great tip I received from my mentors, who explained the XY problem to me. Also, it is always good to talk about the problem as clearly and explicitly as possible. The receiver, too, should not assume anything and should ask questions if something is unclear.

This week, I also learned about using virtual machines as a safe sandbox for testing software without affecting the host system. I also explored some methods for outlier detection, trying out different features and reading more about how to improve accuracy.

I feel this is just the tip of the iceberg. There are many more things to do. And a lot more to explore, learn and implement.

Code it to know it

It has been two weeks since the Outreachy internship period officially began. So far, this is what has happened. I looked at the problem, tried to understand the issues, discussed them with mentors and made plans about how to get the task done. Now is the coding phase.

The implementation phase is always the most interesting phase. You code the ideas you had listed and then it comes. You find issues which you had never thought of. Earlier I had thought that if I had a clear plan, I would just need to work on it and write code, and there, you have the solution. Clearly it is not going to be like that all the time. You realize everything about your ideas when you sit and code. Sometimes new ideas kick in. You try hard to stick to the timeline you have. Sometimes you miss it. You struggle to get back in the saddle and ride ahead. You might even feel you are not good enough for this job. At times you might experience stress because things don’t move at the speed you wanted them to. But I think it’s alright. One thing that can hurt you is trying to make everything perfect. I think we should move ahead when things are good enough. You can always come back when you have extra time. All these tasks are new to me. And I know I have to spend some time learning new things. If you knew everything beforehand, you could have saved your time, but there is no fun without learning or doing anything new!!

Last week I did find it difficult to get some things done which I thought I could do very easily. This, I feel, must be a common feeling among people who have just begun to code. There is every chance that our estimates go wrong; some tasks can take more time and some less than expected. The key is not to panic and to have enough patience to solve the issues. Sometimes taking some time off work and coming back with a relaxed head helps 🙂

Recently I came across a post about the importance of mentors. The Outreachy program is a blessing for those who struggle alone with code. I am blessed to have wonderful mentors who, in spite of their usual work, take out time to meet every week, discuss issues, suggest solutions, read my code, suggest edits, provide papers and references to read and also tolerate my bad internet 😉 Getting mentors is a critical component of growth. When they give their precious time, I know I should put in every effort to value it.

It is just the beginning and there’s a lot to learn and unlearn. Perseverance is the key.

Time to bond..

The woods are lovely, dark and deep,

But I have promises to keep,

And miles to go before I sleep..

-Robert Frost

Meeting people has always been an exciting feeling.. This time it is twice as exciting. I got to meet the mentors who took out their time to review the code I had written and suggested edits. I am so moved by their simplicity and care. I am also sending mails and talking to people I do not know at all. But I believe that because they are part of an open source community, they will be willing to help if they have time. The world seems more beautiful, connecting with more people. Getting selected is an amazing feeling, but what is more amazing is having some great people to work with..

Another thing I am doing is more research about the task at hand. My mentors have given me some ideas and I think that, unlike other projects, my task requires coming up with effective ideas and reading about what has already been done. This period is so exciting because I am reading to actually know more and implement an effective system. Knowing more about what to do made me think about other possible ways to implement the tasks. I am confident that I will come up with some idea which will make it all work :).

The internship period will begin soon. I will be actually writing code which people can use. The journey has already begun.. Let it bear the best of all fruits 🙂

Outreachy..Got selected!!!

November 09th, 2017, 9:30 PM IST. Outreachy results are announced. And I am happy to be selected as an intern for Outreachy Dec 2017 to Mar 2018. More than the joy, my belief that hard work will pay off is strengthened. During the application period, there was nothing in my mind but to give everything for the project I loved to do. It is a MediaWiki project for automatically detecting spambot registration using machine learning, like invisible reCAPTCHA. I was always intrigued by the concept of captchas, but never took the time to know more about them. Every time Google came up with a new captcha I would wonder what was wrong with the old one. For the project proposal, I did extensive research on various captchas. Learning more about captchas made me even more curious.

I have heard and read about people discussing open source projects, but I never knew what to do or how to begin. Outreachy provides a good platform where you can take your first steps into this new world. The mentors and others in the project are patient and very supportive in clearing the doubts you post in the forums. Earlier, I used to read about contributing to open source projects, but I never had the courage to really do anything. The tasks provided by MediaWiki for the interested applicants were systematic and organized. Completing one task and moving to the next gave me immense joy. Instead of worrying about the whole project and my ability to work on it, focussing on one task at a time made things easier.

Now it is the community bonding period. I am very excited to meet the people behind these projects. I am a bit anxious too 🙂 . Again, I hold on to my belief that taking small steps each day will make it all work in the end… Wishing myself all the best 😀

Vinitha V S

I am a research student at IIIT-Hyderabad, India. My research work includes post-processing of optical character recognition output in Indic scripts. I am interested in applying machine learning to areas such as Natural Language Processing (NLP), Computer Vision, etc. I am intrigued by the way the human brain works and I am fascinated by the progress of machine learning towards achieving artificial intelligence systems. I am also interested in developing systems which people can make use of in their daily lives. Recently, I have started contributing to open source projects, as I feel that by doing this I can do my part to make the world a better place 🙂

Below are my publications so far:

Error Detection in Indic OCR, DAS’16 (oral)
An Empirical Study of Effectiveness of Post-processing in Indic Scripts, MOCR’17 (accepted)