Amethix Technologies - Medium

How Data Science and AI are Killing Software Engineering Best Practices (And Why We Need to Fight…

Fra Gadaleta — Tue, 21 Feb 2023 06:01:43 GMT

How Data Science and AI are Killing Software Engineering Best Practices (And Why We Need to Fight Back!)

In recent years, the field of data science and artificial intelligence has experienced explosive growth, with a corresponding surge in the development of data-driven products and services. While these advancements have transformed the landscape of business and technology, there is a growing concern that the fundamentals of software engineering are being overlooked.

Many people see data-driven products as being fundamentally different from traditional software products. While it’s true that they require a different set of skills and tools, data products are still software at their core, and they need to follow the best practices of software engineering in order to be successful.

One of the most concerning trends in this field is the tendency to abandon tried-and-true software engineering practices in favor of newer, more data-centric approaches. Agile development, for example, has become something of a buzzword in the world of data science and AI, but many experts argue that it’s not an ideal methodology for these types of products.

Agile development is often seen as a quick-and-dirty approach to software development that emphasizes speed and flexibility over process and structure. While this may work well for some types of software products, it’s not necessarily the best approach for data products that require careful planning and execution. In fact, some argue that Agile development can be counterproductive for data products, as it can lead to a lack of clarity around project goals, poor communication between team members, and a lack of focus on critical details.

So, what are the best practices of software engineering that we need to be paying more attention to in the world of data science and AI?

Project planning

First and foremost, we need to focus on the importance of proper project planning and requirements gathering. This means taking the time to fully understand the needs of the user, as well as the technical constraints and opportunities of the project. Without this foundation, it’s impossible to build a successful data product that meets the needs of both the user and the organization.

Testing and quality assurance

Another key principle of software engineering that we need to emphasize is the importance of testing and quality assurance. In the world of data science and AI, this means putting in place rigorous testing protocols that ensure the accuracy and reliability of the data being used. This is especially important in fields such as healthcare, where inaccurate data could have serious consequences for patients, and in automotive, where more and more control is delegated to automated systems that are destined to become autonomous.
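
As a minimal sketch of what such a testing protocol could look like in code, the checks below validate a small batch of records before it reaches a model. The field names (`patient_id`, `heart_rate`), the plausible range and the `validate_records` helper are all hypothetical and chosen only for illustration; a real project would derive its rules from the requirements gathered during planning.

```python
def validate_records(records, required=("patient_id", "heart_rate"),
                     valid_range=(20, 250)):
    """Return a list of human-readable problems found in the records.

    Hypothetical rules: every record needs the required fields, ids must be
    unique, and heart_rate must be a number inside valid_range.
    """
    problems = []
    seen_ids = set()
    low, high = valid_range
    for index, record in enumerate(records):
        # Missing fields make the other checks meaningless, so stop early.
        missing = [name for name in required if record.get(name) is None]
        if missing:
            problems.append(f"record {index}: missing {', '.join(missing)}")
            continue
        if record["patient_id"] in seen_ids:
            problems.append(f"record {index}: duplicate id {record['patient_id']}")
        seen_ids.add(record["patient_id"])
        rate = record["heart_rate"]
        if not isinstance(rate, (int, float)) or not low <= rate <= high:
            problems.append(f"record {index}: heart_rate {rate!r} out of range")
    return problems


batch = [
    {"patient_id": "a1", "heart_rate": 72},
    {"patient_id": "a2", "heart_rate": 640},   # implausible value
    {"patient_id": "a1", "heart_rate": 80},    # duplicate id
    {"patient_id": "a3"},                      # missing measurement
]
for problem in validate_records(batch):
    print(problem)
```

Checks like these can run in the same continuous integration pipeline as ordinary unit tests, so a bad batch fails the build instead of silently degrading a model.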

Security and privacy

We also need to pay close attention to issues of security and privacy when it comes to data products. As data becomes an increasingly important asset for organizations, it’s critical that we take steps to protect it from unauthorized access and misuse. This means building secure systems, implementing proper access controls, and adhering to industry best practices around data privacy.

Communication and collaboration

Finally, we need to focus on the importance of communication and collaboration between team members. Data products are often complex projects that require input from multiple stakeholders, including data scientists, software engineers, product managers, and business analysts. By emphasizing the importance of open communication and collaboration, we can ensure that everyone is working together towards the same goals, and that the final product is something that truly meets the needs of the user.

All the principles above, in fact, come from software engineering, way before data science and AI became buzzwords.

As the rush towards artificial intelligence continues to accelerate, many companies are racing to jump on the AI bandwagon in an effort to stay competitive. However, it’s becoming increasingly clear that simply having an AI strategy is not enough. The companies that will survive and thrive in this new landscape are the ones that not only grow organically on a topic that is unstable and experimental by definition, but also apply core software engineering principles to their workflows.

At the heart of this approach is the understanding that AI is not a magic bullet that can be applied to any problem and automatically solve it. Instead, AI is a tool that must be carefully integrated into existing workflows and processes in order to be effective. This requires a deep understanding of software engineering principles such as project planning, requirements gathering, testing, and collaboration.

Companies that apply these principles to their AI workflows are able to avoid many of the pitfalls that plague less disciplined approaches. By carefully planning out their projects, for example, they can ensure that they are solving the right problem and that their AI models are meeting the needs of their users. Similarly, by implementing rigorous testing and quality assurance protocols, they can ensure the accuracy and reliability of their AI models.

In the end, it’s clear that the rush towards artificial intelligence is not a sprint, but a marathon.

In the rapidly evolving world of AI, it’s easy to get caught up in flashy claims of revolutionary products that often fall short of expectations. But the real key to long-term success lies in striking the perfect balance between innovation and disciplined software engineering. Companies that master this art are poised to make the greatest impact and reap the greatest rewards.

By taking an organic approach to growth and staying true to the core principles of software engineering, these companies can create AI products and services that genuinely meet the needs of their users, rather than simply hyping up the latest buzzwords. And with a deep understanding of the underlying technology, they’ll be able to drive their businesses forward with unparalleled insight and effectiveness.

In short, while other companies may be chasing after the latest AI fads, those that focus on solid engineering principles and a user-centered approach are the ones that will truly shine in the long run.

Imagine a construction company that has spent decades building bridges and skyscrapers, perfecting their techniques and becoming masters of their craft. However, as technology advances and the world changes, they find themselves facing a new challenge — building skyscrapers over bridges, or space elevators or underwater cities or any other sci-fi equivalent.

Such a new task is definitely innovative and exciting, but also fraught with risks and uncertainties. The company must apply their traditional construction principles to this new challenge, even though the context has changed. They must carefully consider the unique characteristics of the new project, such as the weight and stress distribution, the impact of wind and weather, and the materials and technologies needed to achieve the desired outcome.

Despite the novelty of the task, the company cannot abandon their traditional construction principles. They must still adhere to the same standards of quality and safety that they have always followed. They must still plan their projects carefully, gather and analyze data, and collaborate closely with other stakeholders to ensure a successful outcome.

In this way, the construction company can leverage their existing expertise and experience to tackle this new challenge, while also adapting to the changing landscape of the industry. By combining the old and the new, they can build even more impressive structures that push the boundaries of what is possible.
Why should we not apply this same approach to the world of AI?

By focusing on careful planning, rigorous testing, security and privacy, and open communication, we can build successful data products that meet the needs of both the user and the organization.

Are you ready to take your data-driven products to the next level? At Amethix, we’re experts in developing innovative data solutions that adhere to the best practices of software engineering. Visit our website amethix.com to learn more and discover how we can help your business thrive in the age of AI and data science.

How Data Science and AI are Killing Software Engineering Best Practices (And Why We Need to Fight… was originally published in Amethix Technologies on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why is Scrum a terrible idea for machine learning

Fra Gadaleta — Wed, 09 Dec 2020 03:30:11 GMT

Scrum, a definition

Let’s start with the definition of Scrum and its peculiarities. Scrum is an agile framework for developing products by breaking them down into small chunks that can be time-boxed to between 2 and 4 weeks. Such time boxes are called sprints. Progress is tracked in meetings called daily scrums. From this definition it looks like something that works, right? Wrong.

Scrum makes sense in very complex environments in which the product has been defined in its entirety. The time-boxed nature of sprints helps to identify blockers and deal with requirements or market changes. In such a context, Scrum is merely an execution tool, not a philosophy.

The most common scenarios

Let’s see what happens in the majority of scenarios, in which the product is not entirely defined and must be continually renegotiated with stakeholders. In such cases, sprints create the urge to finalize a piece of the product within a time box. The intrinsic problem with such a goal is the impossibility of defining 1) the size of said chunks and 2) the duration of the sprints. The success of the entire process would depend on two of the most critical variables of project management. In 100% of the cases I have seen, such variables were estimated simply by flipping a coin. Another important issue I see with Scrum is that it requires the entire team not only to be disciplined but also to organize and prioritize their work beforehand. That is yet another critical variable, usually estimated by watching the stars.

There’s more. Stakeholders are not supposed to participate in sprints. I have seen entire teams fail miserably when they had the brilliant idea of bringing stakeholders into their conversations. Stakeholders usually ended up derailing objectives toward whatever they felt about the product at that very moment, or toward their own understanding of it (which was usually very different from what engineers thought).

How does all this play with data science?

The realm of data science is one of the most dynamic and empirical environments in which to build products. The biggest misconception about Scrum in data science projects is that it can set a structure and track it. That is like treating an adult lion from the jungle as a pet on the couch (and pretending the lion will listen). To start with, agile is a terrible idea for data science, and Scrum is a terrible execution of a terrible idea.

The nature of data science is to define problems together with stakeholders and solve them in a creative, rather than hyper-structured, way. This is a key difference from how engineering projects typically work. The number of unknowns in data science usually stays high even when the machine learning algorithm is robust enough to go to production. Applying Scrum to a data science project is the equivalent of disassembling a car and pretending to reassemble it as a plane. As a result, Scrum in data science typically ends up micromanaging scientists and pissing them off (or worse, making them feel frustrated). Engineers and data scientists are not the same breed and they should not receive the same treatment.

From a more practical perspective, it is quite normal for a data scientist to change her priorities because data are not available or sufficient for the use case, because data quality is lower than initially expected, and so on. A data engineer who is changing her priorities is probably thinking of changing her job. Both scenarios would make a Sprint goal completely useless, but the former is orders of magnitude more common than the latter.

Ok, Scrum sucks. So what?

Probably the best way to manage and structure data science projects in not-so-complex teams is to use kanban. My personal preference is, of course, Trello (though many other tools do the job). The typical setup I use is based on four lists:

Setting up a deadline on cards that stay In progress is the equivalent of the Sprint duration, though much less strict and more flexible. As a matter of fact, new insights from data can change many priorities and trash the entire Sprint. In contrast, with kanban, new priorities would affect only a limited set of cards.

Takeaway

Don’t try to time box tasks that do not depend on time.

Statistical Analysis of Phenomena that Smell like Chaos

Fra Gadaleta — Sun, 01 Nov 2020 20:29:52 GMT

Is the market really predictable? How do stock prices increase? What are their dynamics?

How many have asked such questions already? How many have answered? How many believe in those answers? Here is what I think about the magic and the reality of predictions applied to markets and the stock exchange.

I am not saying anything terribly new when I speak about prediction as a cool task but also an extremely difficult one. Scientists want to predict everything, from the weather to the outcome of political elections, from disease prognosis to the economy. There is a very long list of predictions that turned out to be poor, if not completely wrong. Some of them are even ridiculous. But scientists keep predicting things. Or at least they keep trying to.

The good thing is that a lot of math is usually involved even when the problem seems to be solvable with simpler models. What makes a prediction method good is often the set of assumptions that one starts with. Keeping these assumptions as close as possible to what really happens in the physical world usually leads to a more complicated model.

The other extreme is relaxing these assumptions excessively, which can lead to oversimplifying the model and therefore limiting its power or misinterpreting the results.

I recently attended a speech by JP Bouchaud about “The (unfortunate) complexity of economic systems”, which aroused my interest in the topic.

Despite the applicability of methods borrowed from physics, the attempt to explain the complexity of our economy with science is challenging. It is even more so when the assumptions nobody starts with concern the substantial human factor in the (stochastic?) process of price change. The questions that Bouchaud and many others are trying to answer are “how do stock prices increase?” and “what are their dynamics?”

Answering such questions can be as fundamental as becoming super rich (for a market speculator), eventually mitigating economic crashes or, for those who still care about ethics, keeping the quality of life at a decent level for as many people as possible.

What Bouchaud confirmed in his speech is that the

erratic dynamics of markets are mostly endogenous and not exogenous, as one might expect.

This literally translates to the fact that no big news should be needed to change stock prices and determine huge profits. A sign of the complexity of the system is given by the very high sensitivity to small changes.

This in particular reminds me of catastrophic systems, in which a small change in some parameter leads to substantial changes within the system (due to the transition from one equilibrium to another).

An interesting observation is that while exogenous driving forces are stable, regular and steady, the resulting system dynamics is complex and intermittent. Intermittent phenomena are another sign of what mathematicians like Prof. Strogatz would define as Chaos.

Another observation by Bouchaud that made me curious concerns the collective nature of decisions taken by traders:

If each trader is also influenced by what the rest of the community is doing, the overall system will jump from optimistic — buy — to pessimistic — sell — behavior (or the other way) even in the case of regular exogenous factors
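
To make the idea of a collective jump concrete, here is a minimal sketch of a textbook mean-field imitation model. It is not Bouchaud’s model and not taken from his talk: the update rule, the `herding` and `news` parameters and all numeric values are assumptions made for illustration. The average mood of the traders (from -1, everybody sells, to +1, everybody buys) is repeatedly updated as `tanh(herding * mood + news)`, while the external news improves slowly and smoothly.

```python
import math


def settle(mood, news, herding, steps=200):
    """Iterate mood -> tanh(herding * mood + news) until it settles."""
    for _ in range(steps):
        mood = math.tanh(herding * mood + news)
    return mood


def sweep(herding, news_levels, start=-1.0):
    """Raise the news level gradually, carrying the mood from step to step."""
    mood = start
    path = []
    for news in news_levels:
        mood = settle(mood, news, herding)
        path.append((news, mood))
    return path


def largest_jump(path):
    """Largest change in mood between two consecutive news levels."""
    return max(abs(b[1] - a[1]) for a, b in zip(path, path[1:]))


# External news improves smoothly from -1 to +1 in steps of 0.05.
news_levels = [i / 20 for i in range(-20, 21)]

weak = sweep(herding=0.5, news_levels=news_levels)    # traders mostly independent
strong = sweep(herding=2.0, news_levels=news_levels)  # traders imitate each other

print("largest mood jump, weak herding:  ", round(largest_jump(weak), 3))
print("largest mood jump, strong herding:", round(largest_jump(strong), 3))
```

With weak imitation the mood follows the news gradually. With strong imitation the pessimistic state survives long after the news has turned positive and then collapses into the optimistic one within a single small step of the news level, which is the kind of endogenous jump described in the quote.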

An explanation of the current crisis leads back to the years before 2007, when banks were piling debt on debt. According to the efficient market theory, this system should have corrected itself. But as we all know, that was not the case.

One possible reason for such a phenomenon could be that collective euphoria concealed the negative aspects of what was going on and brought the system to a state in which a tiny, even irrelevant, piece of news would give rise to a global crash. And it did. Rather than statistics, I see the footprints of catastrophe theory again.

Another study that confirms Bouchaud’s views via a numerical model is titled “Unstable price dynamics as a result of information absorption in speculative markets”.

The authors of the work state that when the system is close to the point of perfect balance or critical point — that is when prices are converging to a stable value — noise can lead to instability. Susceptibility to noise increases dramatically near the critical point. This usually occurs when no news, capable of driving price oscillations, is left to be exploited. Basically, if the price is too low, traders will increase buy orders, until the price of the stock begins to rise.

As long as traders are searching for patterns in the price dynamics, they assume that the market will respond to available information. If there is no such information left apart from noise, as is the case near equilibria, traders will search for patterns in the noise. This will lead them to react to random fluctuations and make decisions that might cause significant price variations. With the aforementioned numerical model, the occurrence of such behavior becomes quite evident: as the market price approaches an equilibrium, once all predictable information has been exploited by speculators, a little perturbation of the price (or noise) can lead to an unforeseen price change. Since the market is not well adapted to this new state, extremely large price changes can appear very frequently.

Another key observation about the stylized agent-based model is that large returns are caused by endogenous information states that appeared less recently. Such a phenomenon could be interpreted as the presence of unexpected news. Despite the time dependency of endogenous information, it might be interesting to explain the complexity of stock price fluctuations with chaos theory.

For instance, what if this sensitivity to noise were just a strange attractor? Surely, price fluctuations would still appear random, but within an attractor.
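
A tiny experiment shows what “random-looking but within an attractor” means. The sketch below uses the logistic map, a standard toy example of deterministic chaos. It is not a market model, and its attractor is simpler than the strange attractors of higher-dimensional systems; the starting values and the parameter `r = 4` are assumptions picked for illustration. Two series start one part in ten billion apart and are iterated with exactly the same rule.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a standard toy model of chaos."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


# Two "price" series whose starting points differ by one part in ten billion.
a = logistic_orbit(0.2, 100)
b = logistic_orbit(0.2 + 1e-10, 100)

gaps = [abs(x - y) for x, y in zip(a, b)]
print("gap after 10 steps:", gaps[10])
print("largest gap within 100 steps:", max(gaps))
print("range visited:", min(a), max(a))
```

The two series stay close for a while and then become completely different, yet neither ever leaves the interval between 0 and 1. No noise is involved: the apparent randomness comes entirely from the sensitivity to the initial condition.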

In light of these insights, approaching the problem of price or trend prediction with a merely statistical approach can limit the reliability of the results.

Provided that the market can be predicted (and there are already serious doubts about that), mathematical statistics might not be sufficiently powerful for this purpose.

Before you go

If you enjoyed this post, you will love the newsletter at datascienceathome.com. It’s my FREE digest of the best content in Artificial Intelligence, data science, predictive analytics and computer science.

Get ready for AI winter

Fra Gadaleta — Sun, 01 Nov 2020 20:23:36 GMT

Today I am having a conversation with Filip Piękniewski, a researcher working on computer vision and AI at Koh Young Research America.
His adventure with AI started in the 90s, and since then a long list of experiences at the intersection of computer science and physics has led him to the conclusion that deep learning might be neither sufficient nor appropriate to solve the problem of intelligence, specifically artificial intelligence.
I read some of his publications and got familiar with some of his ideas. Honestly, I have been attracted by the fact that Filip does not buy the hype around AI and deep learning in particular.
He doesn’t seem to share the vision of folks like Elon Musk, who claimed that we are going to see an exponential improvement in self-driving cars among other things (he actually said that before a Tesla drove over a pedestrian).

Listen to the podcast episode

I hope you enjoy!

Decentralized Machine Learning and the proof-of-train

Fra Gadaleta — Sun, 01 Nov 2020 20:23:23 GMT

In an attempt to democratize machine learning, data scientists should be able to train their models on data they do not necessarily own or see. A model that is privately trained should be verified and uniquely identified across its entire life cycle, from its random initialization to the setting of the optimal values of its parameters.
How does blockchain allow all this? Fitchain is the decentralized machine learning platform that provides models with an identity and a certification of their training procedure, the proof-of-train.

Listen to the podcast episode
I hope you enjoy!

Predicting the weather with deep learning

Fra Gadaleta — Sun, 01 Nov 2020 20:23:04 GMT

Episode 37: Predicting the weather with deep learning

Predicting the weather is one of the most challenging tasks in machine learning because physical phenomena are dynamic and rich in events. Moreover, most traditional approaches to climate forecasting are computationally prohibitive.
It seems that joint research between Earth System Science at the University of California, Irvine and the Faculty of Physics at LMU Munich offers an interesting improvement in the scalability and accuracy of climate predictive modeling.
The solution is… superparameterization and deep learning.

Listen to the podcast episode
I hope you enjoy!

Neural networks that can reason

Fra Gadaleta — Sun, 01 Nov 2020 20:22:52 GMT

Listen to the podcast episode here

Thanks for tuning in and welcome to the Data Science at Home podcast, where we talk about technology, machine learning and algorithms.
Today’s episode will be about deep learning and reasoning. There has been a lot of discussion about the effectiveness of deep learning models and their capability to generalize, not only across domains but also to data that such models have never seen. But there is a research group from the Department of Computer Science at Duke University that seems to be onto something with deep learning and interpretability in computer vision.

I am here with Chaofan Chen, Oscar Li, Alina Barnett, Cynthia Rudin from Duke and Jonathan Su from MIT Lincoln Laboratory. Hi everyone. How are you?

Great! Thanks for having us!

What type of research do you guys do at Duke University? Is it mainly AI and deep learning oriented?

Cynthia: I run the prediction analysis lab at Duke, which is mostly focused on interpretable machine learning. We have a lot of projects that are not deep learning actually. Computer vision is something new for us, but we were doing it to prove a point. A lot of people assume that there is a tradeoff between accuracy and interpretability. So that if you want a really accurate model, they think you have to create a black box. But that really isn’t true. So our goal here was to create an interpretable model for computer vision that is as accurate as any black box computer vision model.

The project we are referring to in this episode is described in a paper titled “This Looks Like That: Deep Learning for Interpretable Image Recognition”. Let’s start by describing the project in more detail. When we look at something, we are used to making comparisons and saying things like “this animal looks like that other animal whose name I don’t recall” or “that face really reminds me of that other guy”. How do humans interpret images?

Cynthia: Well, let’s think about situations where people need to describe to others how they identify some object in an image. Often what they do is take the image and put arrows all over it, pointing out what aspects of the picture we’re supposed to be looking at and why. In bird watching, for example, people have to identify what type of bird they are looking at, so they label the different parts of the bird: “this looks like a goldfinch’s beak”, “this looks like the crown of a woodpecker”, or whatever. You can see these kinds of images with arrows for identifying the architectural style of a house; they do this in radiology for analyzing medical images, and they do this for microanalyzing facial expressions. It’s how humans often explain to each other why this thing is what it is: because it looks like that other thing that I’ve seen before.

What is the main goal of this project?

Cynthia: We wanted the algorithm first to be accurate, but also to explain its classification of images the same way humans do: by picking out parts of the image and telling us that it’s classifying something in a particular way because it thinks this looks like something it’s seen before, and showing us what that is.

Chaofan: The model will automatically learn a set of image patches whose presence or absence is most indicative of the presence or absence of a certain class. This set of prototypical image patches will serve as the “cases” each image will be compared against. This classification process resembles case-based reasoning.

How different is this approach from a more traditional one in which the goal — in the case of a classifier — is to find the best input that maximises a specific class?

Oscar: Are you talking about activation maximization? That’s a posthoc method, where you first train a deep learning model and figure out afterwards what it’s doing. In contrast, our network actually explains its own reasoning process. It says “I think these parts of the image look like these parts of images I’ve seen before, and I’m going to use that information to make my prediction.” The network itself is actually interpretable, which isn’t posthoc.

Jonathan (follow-on): This approach builds in interpretability from the start instead of trying to come up with an approximate explanation afterwards. It modifies the network architecture and the objective function to support interpretability, so the explanation one gets describes what the network is really doing. I suppose one could say that the difference is that instead of making up an explanation for the network, we’re making the network explain itself.

Can you explain the idea of having prototypes and matching parts of images to such prototypes to perform classification?

Jonathan: During training, the network learns prototypical image parts, and it learns to use a sparse combination of parts to contribute to each class. The parts are spread out across the training set so they cover it, and each part matches some part of a training image. For example, the network might learn parts like a tire, windshield, door, wing, and jet engine. The sparse combinations mean the network doesn’t use every part for every class, and the presence of a part can encourage or discourage a class. For example, the network might learn to use the tire, windshield, and jet engine for the car class, and it might learn to use the tire, wing, and jet engine for the airplane class. Now, what does a jet engine have to do with a car? Not much, but the network can learn that the presence of a jet engine part discourages the car class and encourages the airplane class.
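
The car and airplane example can be sketched in a few lines. This is only an illustration of a sparse, signed combination of parts, not the authors’ implementation: the weights in `CLASS_WEIGHTS`, the similarity values and the helper names are all made up for this sketch.

```python
# Hypothetical learned weights: how much each detected part supports a class.
# A missing entry is a zero weight (the "sparse" part); a negative weight
# means the part counts as evidence against the class.
CLASS_WEIGHTS = {
    "car":      {"tire": 1.0, "windshield": 1.0, "jet engine": -2.0},
    "airplane": {"tire": 0.5, "wing": 1.5, "jet engine": 2.0},
}


def class_scores(part_similarity, weights=CLASS_WEIGHTS):
    """Weighted sum of part similarities for each class."""
    return {
        label: sum(w * part_similarity.get(part, 0.0) for part, w in parts.items())
        for label, parts in weights.items()
    }


def explain(part_similarity, label, weights=CLASS_WEIGHTS):
    """List each part's contribution to one class, largest magnitude first."""
    contributions = {
        part: w * part_similarity.get(part, 0.0)
        for part, w in weights[label].items()
    }
    return sorted(contributions.items(), key=lambda item: -abs(item[1]))


# Made-up similarities (0 = absent, 1 = strong match) for one image.
image = {"tire": 0.9, "windshield": 0.2, "wing": 0.8, "jet engine": 0.7, "door": 0.1}

scores = class_scores(image)
prediction = max(scores, key=scores.get)
print(prediction, scores)
print(explain(image, prediction))
```

Because every score is a plain weighted sum, the explanation is simply the list of terms in that sum: the detected jet engine pushes the car score down and the airplane score up.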

Alina (add-on): For example, the model learns prototypes like the red wing of the red-winged blackbird or the yellow throat of a yellow-throated warbler. Then when it classifies a new image, it looks for patches similar to the prototypes. So the network can tell you something like “this throat patch from the new image looks like that yellow-throated warbler patch from the training set.”

In attention-based mechanisms, a model is able to focus on a certain region of an image with “higher resolution” while perceiving the rest in “lower resolution”, such that only specific regions are actually considered for classification or prediction in general. How does your approach differ from attention mechanisms in deep learning?

Chaofan: There are indeed connections between our model and attention models. When classifying an unseen image, our network computes a similarity map between its patches and each of the prototypes, and this similarity map can be interpreted as a coarse-grained attention map, where the highly activated region on this similarity map indicates something similar to the prototype has been detected on the original image.

Oscar: When making the final decision, our network will only use the strength of the most similar patches as the supporting evidence. In this way we can also view our network as focusing its attention only on regions which it thinks are the most helpful.
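
The two answers above (a similarity map per prototype, then only the strongest match used as evidence) can be sketched as follows. This is a simplified illustration, not the code from the paper: the feature grid, the prototype vector and the `1 / (1 + distance)` similarity are assumptions made for this sketch.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def similarity_map(patch_features, prototype):
    """Similarity of every patch to one prototype, as a 2-D grid.

    The 1 / (1 + distance) form is an assumption made for this sketch: any
    function that decreases with distance would play the same role.
    """
    return [
        [1.0 / (1.0 + squared_l2(patch, prototype)) for patch in row]
        for row in patch_features
    ]


def strongest_match(sim_map):
    """Global max pooling: the best similarity and where it occurs."""
    return max(
        (value, (r, c))
        for r, row in enumerate(sim_map)
        for c, value in enumerate(row)
    )


# A made-up 2 x 3 grid of patch feature vectors (2 numbers per patch).
features = [
    [[0.0, 0.1], [0.9, 1.1], [0.2, 0.0]],
    [[0.1, 0.3], [0.5, 0.4], [0.0, 0.0]],
]
prototype = [1.0, 1.0]  # hypothetical learned prototype, e.g. a "yellow throat"

sim = similarity_map(features, prototype)
score, location = strongest_match(sim)
print("evidence used for this prototype:", round(score, 3), "at patch", location)
```

The grid `sim` plays the role of the coarse attention map, and the location returned by `strongest_match` is the patch the model would point at when it says “this looks like that”.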

Cynthia: It’s sort of like k-nearest neighbors, but more like nearest parts of special neighbors, which are prototypes.

Another common ground I found is the one with generative models. In such models one can generate a “similar” input and then project it to a lower dimensional space for classification. We could briefly say that a combination of unsupervised and supervised approaches is used in that case. How does your approach differ from generative models?

Alina: We aren’t generating any similar inputs, it’s solely a discriminative model. As for supervised and unsupervised, we would generally call this a supervised approach because our training data is all labeled.

Cynthia (follow-on): This paper isn’t generative, but some of the past work we’ve done in this area solves a similar problem using prototypes or parts of neighbors, does use generative models, and can be used for unsupervised learning. I have a paper with Been Kim and Julie Shah from NIPS 2014 called the Bayesian Case Model that does this sort of thing for categorical data without neural networks, and there’s a new follow-up paper with Ramin Moghaddass on this, called Bayesian Patchworks, where each new observation is modeled as if it were generated from parts of other parent observations. We can generate just x without y, so it can be supervised or unsupervised.

Speaking about the network architecture, you have implemented this approach with convolutional neural networks. Does the approach generalize to other architectures too?

Cynthia: Yes, the approach definitely generalizes! We have actually several papers on this basic idea of comparing a new test observation to parts of past observations that don’t involve neural networks even. I mentioned the work we’ve done on Bayesian Case Method and Bayesian Patchworks, but also there’s a paper with Tong Wang that identifies crime series by looking at whether some subset of features of a crime define a modus operandi of a criminal that we’ve seen before. So you can do this type of model for pretty much any setting.

Chaofan: Speaking of generalizing to other architectures, we could attach our prototype layer to any deep neural network. For example, we could include our prototype layer in a fully-connected network, and treat prototypes as one-dimensional vectors in the latent space, that the output of a fully-connected layer can compare against.

How did you evaluate this approach? How would you claim it improves over traditional CNNs?

Oscar: Because our special prototype layer can substitute a fully connected layer in the final stage of a standard CNN, we have conducted multiple ablation studies that compare our model with standard CNNs which have the same architecture except for the last layer. What we have found is that in general our model performs about equally well while providing that extra level of explanation.

Follow-on [Jonathan]: In addition, CNNs are often highly over-parameterized, so there’s potentially room to include other constraints without sacrificing accuracy. In our case, the added constraints could help shape the latent space as well as reduce overfitting.

Do you think this method could be applied to other types of data and predictions, not just images? For instance, would it be possible to consider it for language models?

Alina: This method uses L2 distances, prototypes and patches. Classifying using L2 distance in the latent space can be readily applied to a variety of data types. No matter the input, if there’s a latent space, we can calculate L2 distances. Deciding what a “patch” or “part” is and how to isolate it is just different for non-image inputs. For NLP you can just use parts of sentences. We haven’t done that yet though.
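To make the idea from the interview more concrete, here is a minimal sketch of comparing latent patches against a prototype with L2 distance and keeping only the strongest match. It is not the authors' implementation: the function names, the toy vectors and the distance-to-similarity mapping `1 / (1 + d)` are all assumptions chosen for illustration.

```rust
/// Squared L2 distance between two latent vectors of equal length.
fn squared_l2(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Hypothetical mapping from distance to similarity (an assumption, not the
/// paper's formula): 1.0 for identical vectors, approaching 0 as distance grows.
fn similarity(distance: f64) -> f64 {
    1.0 / (1.0 + distance)
}

/// Similarity of every patch to one prototype: a coarse "similarity map".
fn similarity_map(patches: &[Vec<f64>], prototype: &[f64]) -> Vec<f64> {
    patches
        .iter()
        .map(|p| similarity(squared_l2(p, prototype)))
        .collect()
}

/// Index and score of the most similar patch, i.e. the only evidence kept
/// for the final decision in this sketch.
fn best_match(map: &[f64]) -> Option<(usize, f64)> {
    map.iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    // Toy latent patches and prototype (made-up numbers).
    let patches = vec![vec![0.0, 0.0], vec![1.0, 2.0], vec![3.0, 1.0]];
    let prototype = [1.0, 2.0];

    let map = similarity_map(&patches, &prototype);
    let (idx, score) = best_match(&map).expect("at least one patch");
    println!("similarity map: {:?}", map);
    println!("best patch: {} (score {:.2})", idx, score);
}
```

For text, the same functions would apply unchanged as long as each "patch" (for instance a part of a sentence) is first embedded as a vector in some latent space.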

What’s in the future of “this looks like that”?

Cynthia: We’ve been working with some Duke undergraduates to apply it to medical images and some other high stakes decision making problems, but mostly medical images. We also have a lot of ideas up our sleeve on how to build on our framework; we’ve been working on this problem for a year and a half now so we’re really getting into it! It’s a fun problem.

If you want, you can share your contacts so that people and researchers can reach out.

Cynthia: Our group has a website, where people can find our contact information: https://users.cs.duke.edu/~cynthia/lab.html

Originally published at datascienceathome.com on August 5, 2018.

Neural networks that can reason was originally published in Amethix Technologies on Medium, where people are continuing the conversation by highlighting and responding to this story.

EU Regulations and the rise of Data Hijackers

Fra Gadaleta — Sun, 01 Nov 2020 20:22:19 GMT

Courtesy of www.boxcryptor.com

This podcast episode is about a recent regulation being considered by the EU that needs the attention of experts in data analytics and machine learning.

Data scientists are cool!
And data science, like machine learning, is an awesome field that, especially in the last few years, has often achieved results that seem more magical than real.
The improvements data science has brought to several domains are just impressive.

We’re not far from the first reliable self-driving vehicle (a car, to begin with) that can deal with realistic urban scenarios. Then there will be self-driving cabs, buses, trains and, who knows, maybe planes and boats.

We can already predict quite accurately whether we might be interested in purchasing this or that book, or in connecting with this or that person and engaging in amazing conversations because they share our personal interests.

We can already support the decisions of medical doctors by predicting certain diagnoses (and in some cases algorithms can be scarily accurate, think of fMRI or cancer imaging). I was personally involved in Non-Invasive Prenatal Testing, which detects chromosomal aberrations of the fetus from DNA data analysis. For specific genetic disorders and under specific conditions the accuracy can be higher than 97%. This is what I mean by scarily accurate!

Maybe the chances that you are protecting the interests of a financial institution are not that high, but wouldn’t you like to know in advance whether families can pay back their loans, in order to minimize the risks for both sides?

All of this could become illegal in less than 2 years.

According to an EU regulation that will take effect in May 2018, algorithms that make decisions based on user-level predictors and that “significantly affect” users should be regulated. We will get into the details of what “significantly affect” means.

In addition, they also introduce the “right to explanation” for users who can ask for an explanation of an algorithmic decision that was made about them.

The first part of the regulation effectively outlaws many important algorithms that are currently in use, e.g. recommendation systems, algorithms for credit and insurance risk assessment, computational advertising, and social networks.

The second part is close to impossible to guarantee. Explaining a decision is usually equivalent to fully understanding an algorithm, which is possible only in a few cases.

What about deep learning and neural networks? How can we explain what is notoriously known to be a black box? How about probabilistic approaches that make decisions that are only probabilistically stable/valid? Think about methods that use sampling, or MCMC methods in which there is an important random component.

Maybe the only algorithms that can be easily explained are linear regression and decision trees, to name a few, that are heavily based on the concept of correlation and trend analysis. This still does not explain causality however. We will get back to this concept later.

Here is an excerpt of the General Data Protection Regulation (GDPR)

Article 11. Automated individual decision making

Member States shall provide for a decision based solely on automated processing, including profiling, which produces an adverse legal effect concerning the data subject or significantly affects him or her, to be prohibited unless authorised by Union or member State law to which the controller is subject and which provides appropriate safeguards for the rights and freedoms of the data subject, at least the right to obtain human intervention on the part of the controller.

Paragraph 1 prohibits any “decision based solely on automated processing, including profiling” which “significantly affects” a data subject.
What does that mean? Trying to quantify something with qualitative terminology never ends well. Let’s assume we are studying an individual who is 80% healthy, and an algorithm, which might observe many more variables than a medical doctor can, assesses that the individual is a bit less than 70% healthy. Is this 10% difference significantly affecting the person?

Now let’s get to the word profiling.

Profiling is “a form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person” (a trivial correlation analysis, or variable selection procedure that focuses on certain variables related to the single person, can be considered profiling)

Decisions referred to in paragraph 1 of this Article shall not be based on special categories of personal data referred to in Article 10, unless suitable measures to safeguard the data subject’s rights and freedoms and legitimate interests are in place.

OK, basically this paragraph prohibits automated processing “based on special categories of personal data”.

Profiling that results in discrimination against natural persons on the basis of special categories of personal data referred to in Article 10 shall be prohibited, in accordance with Union law.

And about discrimination we could start a long discussion, because this is where infeasibility kicks in.

Discrimination in very general terms can be defined as the unfair treatment of an individual because of his or her membership in a particular group, belonging to a specific race, having specific gender, or political view or religion etc.

Whenever we use algorithmic profiling for the allocation of resources, we are implicitly discriminating. Actually, that is exactly what we want. Think about clustering or classification. We want to understand whether an object or a person belongs to one group or another.

I don’t say anything disruptive here if I claim that machine learning depends upon data that has been collected from physical phenomena or society, and as long as society contains inequality, or traces of discrimination, so too will the data.

Discrimination in machine learning is an asset, not a problem.

We are searching for discriminative algorithms. And when this becomes difficult we transform the data in such a way that the same algorithm becomes discriminative. Think about Support Vector Machines and kernel methods with non-linear data.

According to the General Data Protection Regulation, sensitive data includes:

personal data revealing racial or ethnic origin,
political opinions,
religious or philosophical beliefs,
or trade-union membership,
and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation…

In the third paragraph of Article 11, they specifically address discrimination from profiling of sensitive data.

But there are cases in which it would be possible to get to the same conclusion by analyzing other types of data. For example, if a certain geographic region has a high number of low income or minority residents, an algorithm utilizing geographic data can determine loan eligibility based on race and income, due to analogy or what we data scientists define as common patterns.

In clinical data analysis we might have exactly the same scenario. The same goes for politics and social media (by analyzing how a user is connected and to whom, one can predict his or her interests, opinions or religious views). There was a research paper in which an algorithm could even guess the girlfriend/boyfriend of a Facebook user just by analyzing his or her ego-network, that is, how his/her friends were connected.

So what would be the solution to avoid discrimination with data analysis? De-correlate data? Or use only data that are not correlated with sensitive information? Who defines the correlation measure? Is a Pearson coefficient of 0.3 a good threshold? This starts to sound a bit ridiculous.
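To show how arbitrary such a cut-off is in practice, here is a minimal sketch that computes the Pearson coefficient between a sensitive attribute and a couple of candidate features, then flags whatever exceeds a threshold. The feature names (`region_a`, `shoe_size`), the toy numbers and the 0.3 threshold are all made up for illustration; none of them comes from the regulation or from real data.

```rust
/// Pearson correlation coefficient. Returns None when the inputs differ in
/// length, have fewer than two points, or one of them has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    let mut cov = 0.0;
    let mut var_x = 0.0;
    let mut var_y = 0.0;
    for (a, b) in x.iter().zip(y) {
        let dx = a - mean_x;
        let dy = b - mean_y;
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}

/// Names of the features whose absolute correlation with the sensitive
/// attribute exceeds `threshold`, i.e. potential proxies for it.
fn proxies<'a>(
    features: &[(&'a str, Vec<f64>)],
    sensitive: &[f64],
    threshold: f64,
) -> Vec<&'a str> {
    features
        .iter()
        .filter(|(_, values)| {
            pearson(values, sensitive).map_or(false, |r| r.abs() > threshold)
        })
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    // Hypothetical sensitive attribute (1.0 = member of a protected group).
    let sensitive = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0];
    // Hypothetical non-sensitive features.
    let features = vec![
        ("region_a", vec![1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
        ("shoe_size", vec![42.0, 38.0, 38.0, 42.0, 38.0, 42.0, 42.0, 38.0]),
    ];

    for (name, values) in &features {
        if let Some(r) = pearson(values, &sensitive) {
            println!("{}: r = {:.2}", name, r);
        }
    }
    println!("flagged as proxies: {:?}", proxies(&features, &sensitive, 0.3));
}
```

Move the threshold up or down and a different set of features gets flagged, which is exactly the problem: somebody has to pick that number.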

As data grows, detecting correlated (or uncorrelated) variables becomes extremely difficult if not impossible. This is actually still an open research problem.

If correlation with sensitive data is the enemy here, one solution would be to operate only on uncorrelated data, which can turn out to be poor in informative power and useless for making predictions.

Whoever wrote this regulation should be aware that correlation is not causation, and it doesn’t even imply it!

Maybe they were worried about preserving confidentiality and protecting privacy from predictions of causality.

But with regression or classification or machine learning in general I cannot say if race, ethnicity and political views of a person will cause that person to be unreliable in paying back a loan. That would mean predicting causality. Which is not correlation.

I have the impression that politicians and apparently philosophers are trying to regulate something that just cannot be regulated, a bit like the Internet or any other massively adopted technology that at some point in human history became a fundamental part of everyday life and is impossible to limit or prevent.

I expect that real data scientists supported those philosophers, making sure they understood what is already happening, what is feasible and what never will be.

I also see an analogy with the field of software security some years ago, in which lawyers and politicians tried to regulate transmissions of bits and bytes to mitigate the problem of copyright infringement with written rules and regulations that were then interpreted in the different countries of the EU.

What happened? Hackers tried to circumvent those restrictions in technology. And to download movies and music for free, we got Tor anonymizer networks, proxies, and decentralized protocols to share information, with or without encrypted connections.

Having worked in software security for more than 5 years, I expect figures similar to those we have already seen to emerge. Maybe the best name for them would be data hijackers: people who try to circumvent those restrictions, much as has already happened for the Internet in general.
Those who are trying to stop the storm of data coming from the IoT are getting my attention, because this is going to be really, really challenging.

I am just thinking about a few of these circumventions by Data Hijackers.

For instance, due to the ease of allocating computing resources in the cloud and crunching data of any kind, a black market of data science could grow, in which no model would need to be explained.

Even in, let’s say, legally recognized analytic procedures, some variables cannot be used because they belong to the blacklist of sensitive variables?

Fine! I’m gonna search for collinear variables or groups of variables that are not in such lists and still get my prediction pretty accurately.

So, folks, do you think this is going to be difficult with big data? Not at all. We are doing this already.

EU Regulations and the rise of Data Hijackers was originally published in Amethix Technologies on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ahem detector with deep learning

Fra Gadaleta — Sun, 01 Nov 2020 20:22:04 GMT

Do you know why you can’t hear the ugly ahem sounds on the podcast Data Science at Home?

Because we remove them. Actually not us. A neural network does.

Let me introduce the ahem detector, a deep convolutional neural network that is trained on transformed audio signals to recognize “ahem” sounds. The network has been trained to detect such signals on the episodes of Data Science at Home, the podcast about data science at podcast.datascienceathome.com

Project description

Slides and technical details are provided here. Before getting to the details, a few concepts should be clarified.

Two sets of audio files are required, very similarly to a cohort study:

a negative sample with clean voice/sound and
a positive one with “ahem” sounds concatenated

While the detector works for the aforementioned audio files, it can be generalized to any other audio input, provided enough data are available. The minimum required is ~10 seconds for the positive samples and ~3 minutes for the negative cohort. The network will adapt to the training data and can perform detection on different speaking voices.

How do I get set up?

Once the training audio files are provided, just load the training set and train the network with the code in the ipython notebook. Make sure to create the local folder that has been hardcoded in the script files below. Build the training/testing sets before running the notebook by first executing

% python make_data_class_0.py
% python make_data_class_1.py

A GPU is recommended because, under the conditions specific to this example, at least 5 epochs are required to obtain ~81% accuracy.

How do I clean a new dirty audio file?

A new audio file must be transformed in the same way as the training files. This can be done with

% python make_data_newsample.py

Then follow the script in the ipython notebook, which is commented well enough to proceed without particular issues. The whole project is on GitHub.

Enjoy!

Originally published on datascienceathome.com on November 2, 2016

Ahem detector with deep learning was originally published in Amethix Technologies on Medium, where people are continuing the conversation by highlighting and responding to this story.

5 reasons why Rust is the future

Fra Gadaleta — Sun, 01 Nov 2020 20:15:15 GMT

If you are looking for some kind of metal panel business idea, allow me to be clear: the Rust I am referring to is a programming language.

Still there?

When I started learning programming languages I was 8 years old, the world was a different place and computers were more like romantic and magical boxes than the tools people TikTok with today.

GW-Basic and C were my first shots in computer science during the time in which memory was directly accessible — for the fun of many and the profit of others. It was so easy to perform brutal attacks on the operating system kernel via some legit constructs that those languages provided.

If this picture makes you nostalgic you are old enough to continue reading. If not, keep reading :)

My years in academia were characterized by the massive use of C (much less of C++) for back-end systems and for implementing common and more exotic algorithms in computer science. Jumping ahead, the years of my Ph.D. were characterized by the massive use of C and by instructing compilers to mitigate the nasty issue that affects all low-level programming languages: direct access to memory. Issues like freeing memory that has already been freed (double free), reading/writing beyond the limits of an array (buffer overflow), pointing to and accessing invalid memory, etc. have led to some of the most serious attacks we know of in the history of computer science so far.

So what does this have to do with Rust?

As a matter of fact, Rust has literally wiped off most of my Ph.D. material in one new programming paradigm. And I am not pissed at all.

My most ambitious objective (and not only mine) during my Ph.D. was to build a compiler that fixed the double free issues, buffer overflows and invalid pointers, automatically (sometimes without even informing the developer who ended up repeating that nasty bug over and over — how mean?).

Rust has done just that: it has moved many of the responsibilities from the developer to the compiler (for Python programmers, the compiler is that thing you give up your flexibility and performance to, every time you write well… Python) by introducing a new paradigm of programming.

In such a paradigm it is not possible to have those bugs ever. The compiler will just refuse to continue and generate a buggy program. The only side effect of the more pedantic Rust compiler is that it will definitely frustrate the developer. But I don’t have a solution for that. Time for another Ph.D.?

Rust provides this level of awesomeness by leveraging five fundamental concepts in programming language design. While some are tightly related, let me briefly go through each of them.

The borrow checker

Rust’s borrow checker makes sure that references (and pointers!) do not outlive the data they point to. All memory unsafety bugs? Gone. “Yeah but dude… I am losing flexibility” you say? Worry not, you can still use unsafe Rust. If you feel like you deserve that responsibility (and you better know WTF you are doing) go ahead, make your code unsafe and unlock some super-powers. The good thing is that when you have a bug (because you will, dude), you will know exactly where to look. Remember that unsafe block you wrote and felt so proud of? Maybe you wanna look there.
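A minimal sketch of what "references do not outlive the data they point to" means in practice. The function `longest_word` and the strings are made up for illustration; the commented-out part is the kind of code the compiler rejects, so it is left as a comment to keep the example compiling.

```rust
/// Returns a slice of `text`. The lifetime `'a` tells the compiler that the
/// returned reference is only valid for as long as `text` is.
fn longest_word<'a>(text: &'a str) -> &'a str {
    text.split_whitespace()
        .max_by_key(|word| word.len())
        .unwrap_or("")
}

fn main() {
    let owner = String::from("fearless systems programming");
    // `word` borrows from `owner`, which lives until the end of `main`: accepted.
    let word = longest_word(&owner);
    println!("{}", word);

    // The following is rejected at compile time (error E0597: the borrowed
    // value does not live long enough), because `short_lived` is dropped at
    // the end of the inner block while `dangling` would still point into it:
    //
    // let dangling;
    // {
    //     let short_lived = String::from("temporary");
    //     dangling = longest_word(&short_lived);
    // }
    // println!("{}", dangling);
}
```

In C the equivalent of the commented-out block compiles happily and reads freed memory at run time; here it never becomes a program in the first place.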

Ownership

If you don’t want to tidy up your room, someone else (mum?) will have to. And that will come at a price (don’t let me go there). There is a reason why languages based on a garbage collector tend to be slower than those without. Guess what? Ownership is a paradigm of programming that allows Rust to keep track of memory (and free it when no longer in use) at zero cost. How? By changing the way programmers are used to thinking (and believe me, the way Rust forces you to think is the right way). A trivial

let y = SomeType { field: String::from("hello") };
let x = y; // ownership of the value moves from y to x
println!("{:?}", y); // This will fail to compile. We no longer own y

moves ownership of y to x. This means that y cannot be used after that statement. This simple concept allows the compiler to manage memory for you and keep your room in an impeccable state.
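For completeness, here is a compiling variant of the same idea, showing the three options you have when you still need the value: borrow it, clone it, or move it. `SomeType` is the type from the snippet above with an assumed definition; `inspect` and `consume` are hypothetical helpers.

```rust
// Assumed definition of the type used in the snippet above.
#[derive(Debug, Clone)]
struct SomeType {
    field: String,
}

/// Borrows the value: the caller keeps ownership and can keep using it.
fn inspect(value: &SomeType) -> usize {
    value.field.len()
}

/// Takes ownership: the value is dropped (and its memory freed) when this
/// function returns, with no garbage collector involved.
fn consume(value: SomeType) -> usize {
    value.field.len()
}

fn main() {
    let y = SomeType { field: String::from("hello") };

    let len = inspect(&y); // borrow: y is still usable afterwards
    let copy = y.clone();  // explicit deep copy with its own allocation
    let x = y;             // move: from here on, y can no longer be used

    println!("{} {:?} {:?}", len, x, copy);

    let consumed = consume(x); // x is moved into the function and freed there
    println!("{}", consumed);
}
```

The clone is the only place where extra memory gets allocated, and it is spelled out in the code rather than happening behind your back.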

Concurrency

This is a big one, especially given the general trend of distributing computation across multiple cores. Not only does Rust make concurrency easy (alright, relatively easy: sometimes the syntax is a b!tch), but the combination of ownership and locking mechanisms also makes concurrency fearless. Channels enforce thread isolation, and data are protected by locks and can be accessed only while the lock is held. This prevents one from accidentally sharing state. In safe Rust, data races are not possible (the compiler will simply refuse to generate code that might lead to data races).
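A minimal sketch of the two mechanisms mentioned above, using only the standard library: a counter shared through `Arc<Mutex<_>>`, and worker threads reporting results over an `mpsc` channel. The function names and the workload are made up for illustration.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Each worker adds 1..=per_worker to a shared counter. The data lives inside
/// the Mutex, so it cannot be touched without holding the lock.
fn locked_sum(workers: usize, per_worker: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for i in 1..=per_worker {
                    *total.lock().unwrap() += i;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}

/// Each worker sends the square of its id over a channel. Ownership of the
/// sent value moves to the receiver, so no state is shared between threads.
fn channel_squares(workers: u64) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    for id in 0..workers {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(id * id).unwrap();
        });
    }
    drop(tx); // drop the original sender so the receiving loop can end
    let mut results: Vec<u64> = rx.iter().collect();
    results.sort(); // arrival order depends on thread scheduling
    results
}

fn main() {
    println!("locked sum: {}", locked_sum(4, 100));
    println!("squares: {:?}", channel_squares(4));
}
```

Try removing the `Mutex` and mutating the counter directly from the threads: the program will not compile, which is the whole point.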

Portability

The Rust compiler is built on LLVM, which in turn can generate machine code for a plethora of target platforms. While this level of portability is still not as broad as that of C/C++, please remember that Rust is only 10 years old (just a bit older than myself when I started programming, how cute?)

Speed and security

If you want a compiler to generate safe code, be prepared to give up performance. The low-level security community is well aware of how much code instrumentation techniques can slow your software down (to the point that many prefer unsafe but performant code over safe but slow code). With Rust, you can have both. Rust is a compiled language. The machine code that is generated can be optimized (and will be optimized even more as the compiler gets smarter) as much as the one coming from C/C++ and other compiled languages.

As for speed, a decent Rust vs C comparison can be found here and a more detailed explanation here. Just consider that comparisons should be performed between idiomatic Rust and idiomatic C. Needless to say, sh!tty code = sh!tty performance, and that has nothing to do with the language of choice.

Of course, this post does not do Rust justice, as there are many more amazing things Rust has to offer. While this is a language that should definitely be considered for system programming, the community is growing at an incredible pace, populating crates.io, the repository for packages and libraries in pure Rust. I have to say that there is a lot of duplication and many libraries seem abandoned. I believe that’s a result of the initial enthusiasm from many developers and engineers who have been putting Rust to the test.

Despite Rust making some of my publications obsolete (especially those about low-level countermeasures to mitigate buffer overflows), I strongly believe Rust is the language of the future. I am also contributing to a project that uses Rust for data processing (the stuff that people must do before copy-pasting TensorFlow models :) ) and I am looking forward to publishing it soon.

In any case, follow me on GitHub.

5 reasons why Rust is the future was originally published in Amethix Technologies on Medium, where people are continuing the conversation by highlighting and responding to this story.