Mayflower team - Medium

Artificial Intelligence in Media & Entertainment: Redefining the Rules of Regulation

Nikolas Christofidis — Tue, 28 Nov 2023 12:29:54 GMT

In a world increasingly captivated by the breakthroughs and buzz surrounding AI, there is a cohort of professionals for whom these developments are more than technological trends: they are challenges that demand deep understanding and analysis. Hey! I am Nick, a lawyer at the IT company Mayflower. My daily work revolves around the laws and regulations governing technology, but recent news about AI has sparked a new curiosity in me. How is artificial intelligence reshaping the legal landscape? That question has propelled me to explore AI from a legal perspective, and I am ready to share the results of that exploration in this article.

Media and Entertainment

The media and entertainment industry has witnessed a remarkable transformation with the integration of AI. AI has become a powerful force in game development, movie production, and advertising, revolutionizing creative processes. It is driving strategic investments and fulfilling the ever-growing demands of viewers. AI’s impact on this industry is substantial. Companies are leveraging AI to enhance their operations, improve the consumer experience, and create personalized content. For instance, chatbots are used for customer service, voice recognition technology enables hands-free control of entertainment experiences, and personal assistants like Alexa and Google Assistant are integrated into entertainment systems.

AI also brings efficiency to the media and entertainment sector. AI-driven tools enhance 3D animation and character modeling, resulting in more realistic visuals. AI-powered music composition aids composers in creating original soundtracks. Moreover, AI automates tasks such as video editing, proofreading, and ad copy generation, leading to cost savings and increased productivity. AI and machine learning have revolutionized the capabilities of entertainment companies, enabling them to analyze vast amounts of data and deliver personalized content recommendations and targeted advertising.

As this technology advances, we anticipate more companies adopting personalized strategies to engage consumers and boost revenue. AI-generated avatars and virtual news anchors are expected to become commonplace, providing lifelike news presentations tailored to different demographics and languages. In the digital age, AI-powered content moderation tools will play a crucial role for media platforms. Acting as gatekeepers, these tools swiftly detect and filter out inappropriate or harmful content, helping to keep the online environment safe and the use of digital media responsible.

In the realm of live broadcasting, AI automation will take center stage. It effortlessly manages real-time tasks like closed captioning and enhances live content with dynamic graphics and informative overlays for sports events, news programs, and live shows, enhancing both the quality and accessibility of the content.

Lastly, with the advancing deepfake technology, AI becomes indispensable for developing sophisticated detection tools. These tools are crucial in identifying manipulated or fabricated media content, preserving the authenticity and trustworthiness of the media landscape.

Consequently, the regulation of online content transmission is a crucial concern for the media and entertainment industry, given the wide array of content available, including offensive material. Governing authorities recognize the need for strict control in this area, and AI has emerged as a vital tool for detecting and filtering objectionable content. AI can effectively determine user demographics, such as age and gender, thereby ensuring the appropriate delivery of content.

Key legal issues to consider

Data privacy is a major legal concern regarding the use of AI by companies. AI systems require large amounts of data to improve their algorithms, so organizations need to ensure that the data they collect is handled in accordance with applicable privacy laws. Organizations must be transparent with their members about how their data will be used and protected and obtain consent to use and share sensitive information. It’s important to note that once data is inputted into an AI system, it may no longer be confidential and will be subject to the system’s terms of use. Therefore, companies should not allow personal, confidential, or privileged data to be inputted into an AI system by staff or other agents.

Intellectual property is another legal issue for organizations using AI. AI systems can generate new works, so organizations must have the necessary rights and licenses to use and distribute these works. It’s important to be transparent about the creator of these works.

Companies must also consider potential tort liability issues that may arise from using AI. If an AI system produces harmful results due to inaccuracies, negligence, or biases, the organization may be held responsible for any resulting damages. Organizations should ensure that their AI systems are reliable and accurate, and carefully vet any work product that may affect industry or professional standards for accuracy and truthfulness.

In conclusion

In today’s technology-driven world, the impact of AI on our media and entertainment experiences cannot be ignored. This vibrant industry, constantly shaped by rapid technological advancements and evolving consumer desires, has undergone a significant transformation. From the emergence of new streaming platforms that revolutionize content consumption to a growing emphasis on diversity and representation, the future appears bright, brimming with limitless opportunities.

As we move forward, it becomes crucial for both creators and audiences to recognize and appreciate the media’s profound influence in shaping our perspectives, values, and cultural landscape. The media carries a great responsibility, not only in engaging in ethical discussions and promoting inclusiveness but also in providing a platform for sharing diverse narratives and facilitating a rich exchange of cross-cultural stories and insights. Let us navigate this diverse and dynamic media landscape, fully aware of its impact, responsibilities, and the multitude of unique stories it has the power to convey within our global community.

#artificialintelligence #privacy #law #technology

Artificial Intelligence in Media & Entertainment: Redefining the Rules of Regulation was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

Bugs backlog automation. RICE for bugs

glebsarkisov — Fri, 03 Nov 2023 11:34:28 GMT

How to make your bugs backlog nicely organized and always relevant

Hi there, it’s Gleb again.

Let’s talk about the bugs backlog. Imagine: it is your first day at a new job (you are a test lead, a QA manager, maybe a test engineer or even a product/project manager). You meet your new colleagues, start learning the company processes, open JIRA, and face a backlog of 300 bugs! Your reaction is nothing but confusion, despair, and pain.

“OK,” your inner voice says, “that’s exactly why they hired me. I will fix this!”

You dive into the problem and see the actual state of things:

the oldest open bug ticket was reported 3 years ago;
the majority of the open bugs are medium priority, 100+ are high, and the rest are low;
nobody knows exactly what the bugs are about or what they affect; product managers in particular are unaware of at least half of them;
a project manager says there is usually room for only 3–4 high bugs per sprint, and everything else sits in the backlog untouched;
a typical QA does not understand why a high bug they reported 6 months ago is still not fixed.

You come up with questions like:

Is there any point in dealing with the backlog? Maybe it is better to close all existing tickets and create bugs from scratch when we catch them?
How many open bugs in the backlog are OK, and how many are too many?
How do we get product managers to notice the backlog’s current state? Should we keep them informed about it?

This is exactly what this article is about: how to bring order to your bugs backlog and make it a useful and well-organized space.

Disclaimer #1:

Even though the Zero Bug Policy is not the topic of the article, I will refer to its philosophy “we do not fix it now — we will not fix it in the future”.

Disclaimer #2:

In the article I am not talking about critical bugs, which are not a part of the backlog and have to be fixed ASAP.

A few words on Zero Bug Policy

“If we do not want to fix it right now, this problem is not important to us” is the main idea of the Zero Bug Policy. The approach can be extreme: if we are not going to fix the bug now, the bug should be closed. The most obvious advantage of the approach is that there is no backlog at all.

It is worth noting that a product manager would hardly agree to switch to this flow right away: they would first need to deal with the 300 existing bugs. They would also need to take real user data into account: how many users are affected by each problem and how badly.

Keeping the idea of the Zero Bug Policy in mind, let’s talk about bugs backlog clarity: what it is and how to achieve it.

What is wrong with the 300 bugs backlog?

How bad is it to have 300 open bugs in your backlog? Here are my points.

Pros:

QAs and other team members found these bugs and reported them — this is great! The team looks for bugs and creates tickets in JIRA.

Cons:

A Wild Unknown Territory. No one actually knows what is in these 300 tickets (this is not an issue if you have some kind of bug review process). There is always a chance that the backlog hides serious problems: bugs that affected only a few users when they were reported but have since grown huge, while still not being critical (and so not being fixed ASAP).
Backlog’s Heterogeneity. Try to sort 100 high bugs by the impact of the problem: there is no easy way to do this. It is tough to decide what has to be fixed now and what can wait. Given that we are always tight on resources, we have to decide which bugs to spend them on.
Bugs Backlog Growth. Now we have 300 bugs in the backlog, and in 1–2 years the number will be around 1000, maybe even more. What then? This sounds very concerning.
Additional efforts to re-validate bugs. You need to guarantee that your bugs are valid and still relevant, so you need to check them from time to time. This, of course, costs additional time and resources to reproduce each bug or close the ticket.

Some might say we should do technical sprints twice a year when we try to fix everything we can. But just imagine how resource-consuming this is and how that shifts focus from the business goals. And even 2 tech sprints would not be enough to fix everything.

Automated bugs backlog: focus on important, ignore everything else

A static backlog did not work for our team. We started looking for an automated solution that would bring focus on the really important bugs and get rid of everything else (you still remember the ZBP idea?).

Our automated system contains 4 elements:

Prioritization. An approach to prioritizing bugs that combines priority and severity and makes it possible to compare bugs with each other.
Accounting. Collecting the users’ feedback.
Bug lifecycle automation. Automatically lowering a bug’s priority after a certain period of time and, later, closing the ticket.
Information. Automatically informing everyone involved in the process about the bugs’ status.

The first element. Prioritization

Priority and severity

If you want an automatic backlog cleaning system, you have to come up with an approach to lower priority. First, let’s talk about the meaning of priority.

In my project, when the QA / support team creates a bug ticket, they also choose an appropriate priority. That field combines the actual business priority of the problem (exactly what is called priority in the testing literature) and the way the problem affects the functionality of the system (this is called severity in the literature).

So, for us the priority field is a hybrid one that takes both parameters into account. To decide how important a bug is and to compare it with the others, we use the RICE framework. A product manager scores the bug with a RICE value, which reflects both priority and severity. This makes it possible to compare bugs with each other, as I will explain in the next section.

Of course, you might have your own process and work with priority and severity differently.

RICE for bugs

We had already been using the RICE framework for product and technical tasks, so we decided to apply it to the bugs backlog as well. Our variation of RICE has a few modifications, but the point is still the same: to have a benchmark for comparing the importance of two different tickets, which helps us prioritize the backlog.

The RICE for bugs in our reading is:

R stands for Reach: how many users are affected by the problem;
I for Impact: how serious the problem is for the user experience from the functionality point of view;
C for Confidence: the level of confidence in the chosen Impact and Ease values;
E for Ease: how easy and how quick it is to fix the problem.

Every parameter besides Ease has a range of 1 to 5, where 1 is the lowest value (the smallest number of affected users, the lowest impact on the user experience, etc.) and 5 is the highest. The Ease parameter is calculated differently, from the time estimate for the fix: the longer it takes to fix the problem, the lower the Ease.

Once all the parameters are multiplied, we have the final RICE for a bug.

Let’s look at these two bugs:

Bug 1: Reach(1) x Impact (5) x Confidence (5) x Ease (3) = 75

Bug 2: Reach(1) x Impact (4) x Confidence (5) x Ease (5) = 100

You can see that fixing bug #2 is more important than fixing bug #1. So, both product and project managers know exactly which bug to plan for the next sprint.
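As a minimal sketch, the RICE arithmetic above can be written in a few lines of Python. The class and field names are illustrative assumptions rather than part of our tooling, and Ease is taken as an already chosen number because its mapping from the time estimate is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiceScore:
    """RICE parameters for one bug (hypothetical structure for illustration)."""

    reach: int
    impact: int
    confidence: int
    ease: int

    def __post_init__(self) -> None:
        # Reach, Impact and Confidence use the 1 to 5 scale described above.
        for name in ("reach", "impact", "confidence"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")
        # Assumption: Ease only has to be a positive number here.
        if self.ease < 1:
            raise ValueError(f"ease must be positive, got {self.ease}")

    @property
    def value(self) -> int:
        """The final RICE: all four parameters multiplied together."""
        return self.reach * self.impact * self.confidence * self.ease


bug_1 = RiceScore(reach=1, impact=5, confidence=5, ease=3)
bug_2 = RiceScore(reach=1, impact=4, confidence=5, ease=5)

# Highest RICE first: this is the order in which bugs would be planned.
ranked = sorted(
    [("Bug 1", bug_1), ("Bug 2", bug_2)],
    key=lambda item: item[1].value,
    reverse=True,
)
```

With the two bugs from the example, `ranked` puts Bug 2 (RICE 100) ahead of Bug 1 (RICE 75).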

The scoring process

This is how the bugs scoring process works:

QA/support reports a bug and creates a ticket in JIRA with high/medium/low priority (we do have a classification and an agreement for what we call a high/medium/low based on functionality, the platform on which the bug is reproduced, etc.);
The product managers review the high-priority bugs and set RICE for these tickets;
The medium and low-priority bugs are not scored with RICE. For now, our main goal is to deal with the high bugs. As long as we have a stream of new high bugs, we will keep working on them specifically. Once there is an opportunity to manage other priorities, we will use RICE for them as well.

The second element. Accounting and collecting the users’ feedback

We are developing a high-load streaming service with a billion monthly visits and more than 100 million users. In order to better understand the scale of a problem, we need to keep track of the user feedback.

That is why we introduced the ‘number of reports’ field, which tracks user reports about a specific problem and is updated by our support team based on the data in Zendesk. The field helps our product managers to set proper RICE values (especially Reach and Impact).

You might think — what if there was no report from users when a bug was created, but now there are some? Shouldn’t we reconsider our RICE score, given the current reports?

Of course, we should. This is how:

On our project, we agreed on report thresholds: the first threshold is 10 or more reports (but fewer than 30), and the second is 30 or more reports;
As soon as a threshold is reached for any bug, the product manager of the corresponding functionality is automatically informed about the number of reports and has to decide whether RICE needs to change.

I want to emphasize that thresholds at 10 and 30 reports were picked empirically. If, at some point, we find that these thresholds, in most cases, do not lead to priority changes, we will reconsider them.
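The threshold check itself is simple. Here is a rough sketch of how an update to the ‘number of reports’ field could be turned into a notification trigger; the function name and the idea of comparing the previous and current counts are my assumptions for illustration, not a description of how Automaton does it.

```python
from typing import Optional

# Thresholds from the article; they were picked empirically and may change.
REPORT_THRESHOLDS = (10, 30)


def crossed_threshold(previous_reports: int, current_reports: int) -> Optional[int]:
    """Return the highest threshold newly reached by this update, or None.

    A threshold counts as reached when the previous count was below it
    and the current count is at or above it.
    """
    crossed = [
        threshold
        for threshold in REPORT_THRESHOLDS
        if previous_reports < threshold <= current_reports
    ]
    return max(crossed) if crossed else None
```

Comparing against the previous count means the product manager is pinged once per threshold rather than on every new report after it.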

But what should we do if there are fewer than 10 reports on a bug, or if a product manager simply forgot to update RICE after a threshold was reached? To cover this, we implemented an automated RICE reset for every high bug every 3 months. A product manager will notice the bug without a score and will set it. Otherwise, a project manager or I, as the process owner, will ping the product manager.

That is how we keep track of real users’ feedback and how it affects the sprint workload: which bug is to be fixed in the upcoming sprint and which bug will go through the priority-lowering process, which I will explain now.

The third element. Automation of a bug life cycle

We agreed that if a bug is not fixed within a year, either we have lost sight of the bug or it is not important to product managers and our users, so there is no point in fixing it.

This is what the whole lowering process looks like:

If a high bug is not fixed within 6 months after it was created, it is not really a high bug, and it can be changed to medium priority (unless a product manager objects);
If a medium bug is not fixed within 3 months after it was created or transitioned to medium priority, it is not really a medium bug, and it can be changed to low priority (unless a product manager objects);
If a low bug is not fixed within 3 months after it was created or transitioned to low priority, it is not even a low bug, and it can be closed (unless a product manager objects).
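The three rules above can be sketched as a small lookup table plus a date check. This is only an illustration under my own assumptions: the names are hypothetical, whole calendar months are used for the intervals, and a product manager’s objection is reduced to a single flag.

```python
import calendar
from datetime import date
from typing import Optional

# Months a bug may stay at a priority, and where it goes afterwards.
LOWERING_RULES = {
    "high": (6, "medium"),
    "medium": (3, "low"),
    "low": (3, "closed"),
}


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's end."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_step(
    priority: str, since: date, today: date, pm_objects: bool = False
) -> Optional[str]:
    """Return the new priority or status if the bug is due, else None.

    `since` is the date the bug was created or moved to its current priority.
    """
    months, target = LOWERING_RULES[priority]
    if pm_objects or today < add_months(since, months):
        return None
    return target
```

The intervals in the table add up to 12 months, which is where the year-long life cycle comes from.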

The whole priority-lowering process is automated with Automaton, our in-house solution, which is integrated into our Slack. Automaton is our internal tool for all sorts of automation: it holds all the automation logic and communicates with JIRA via its API, appearing as a bot in a ticket’s history.

Summing up all the intervals, we have a year-long bug life cycle. The set of rules above is our workflow for all bugs except critical ones. There are some exceptions to this workflow: a product manager might disagree with lowering the priority, and the life cycle can then be extended. Let’s talk about them.

Case №1

A product manager reviews the list of planned-to-be-lowered medium bugs or planned-to-be-closed low bugs (a week before the lowering)

If there is a need to keep a bug from being lowered or closed, a product manager simply sets RICE. I want to highlight that we also have a numeric RICE threshold that separates medium and high priority. I will explain this a bit later.

If the calculated RICE is at or above the threshold for high priority, the priority is set to high and the bug enters the lowering process again in 6 months.

If the calculated RICE is below the threshold, the scheduled change goes ahead: a medium bug is lowered to low, and a low bug is closed.
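Case №1 boils down to a single comparison. Below is a hedged sketch of that decision; the function name is made up, the value 90 is the high-priority threshold explained later in the article, and treating a missing RICE as "no objection" is my reading of the flow.

```python
from typing import Optional

# Threshold for high priority; the article settles on 90 further down.
HIGH_RICE_THRESHOLD = 90


def review_outcome(current_priority: str, rice: Optional[int]) -> str:
    """Outcome for a medium or low bug on the planned-to-be-lowered/closed list.

    `rice` is None when the product manager did not score the bug.
    """
    if current_priority not in ("medium", "low"):
        raise ValueError(f"Case 1 covers medium and low bugs, got {current_priority!r}")
    if rice is not None and rice >= HIGH_RICE_THRESHOLD:
        # Promoted to high; it re-enters the lowering process in 6 months.
        return "high"
    # Below the threshold (or not scored): the scheduled change goes ahead.
    return "low" if current_priority == "medium" else "closed"
```

For example, a medium bug scored at 100 becomes high, while the same bug scored at 75 is lowered to low as planned.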

Case №2

A product manager reviews the list of planned-to-be-lowered high bugs (a week before lowering)

If there is a need to hold a bug at its current priority, a product manager blocks the lowering for the next 6 months. We deliberately offer this option to our product managers, but we monitor how frequently they use it; so far, only one bug has been blocked from lowering. The block is enabled by ticking a Jira checkbox that has a 6-month timer on it. As soon as 6 months have passed, the bug is added to the planned-to-be-lowered list again.

If you are still feeling uncomfortable with the flow of lowering and closing bugs, you have not yet absorbed the Zero Bug Policy philosophy: if we are not fixing it now, we will not fix it later, so there is no point in keeping it open in the backlog.

The RICE threshold for priorities

As I mentioned earlier, we introduced a numeric RICE threshold to separate high and medium priority bugs.

Look at these two bugs:

Bug 1: Reach(1) x Impact (5) x Confidence (5) x Ease (3) = 75

Bug 2: Reach(5) x Impact (5) x Confidence (5) x Ease (4) = 500

Observation №1

When we asked our product managers to score the existing 100 high bugs from our backlog, a portion of the bugs got a significantly lower RICE score than the others. As you can see in the example above, the RICE of bug №1 is almost 7 times lower than the RICE of bug №2.

Observation №2

Among those 100 high bugs, some had been waiting for a fix for a long time, so it is incorrect to call them high: they had not been taken into sprints for a while, and they also had significantly lower RICE scores than the other high bugs.

The solution is to have a threshold that filters out low-scored bugs. We intuitively chose a threshold of RICE = 90: if a bug scores 90 or higher, it is high, and if it scores lower, it is not high.

You can approach this more rigorously: calculate the 70th percentile of all RICE-scored high bugs and compare it with the same percentile for the bugs that have been waiting to be fixed. However, you can also pick the value intuitively and correct it later if you wish.
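To make the percentile idea more concrete, here is a small sketch using the nearest-rank definition of a percentile (one of several common definitions; I chose it because it is easy to check by hand). The scores in both lists are made up purely for illustration and are not our real data.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[int], p: int) -> int:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical RICE scores, invented for this example only.
all_scored_high_bugs = [40, 60, 75, 90, 100, 125, 200, 300, 400, 500]
long_waiting_high_bugs = [40, 60, 75, 90, 100]

p70_all = percentile_nearest_rank(all_scored_high_bugs, 70)
p70_waiting = percentile_nearest_rank(long_waiting_high_bugs, 70)
```

With these invented numbers, the 70th percentile is 200 for all scored high bugs and 90 for the long-waiting ones. A gap like that is the kind of signal that would support placing the cut-off near the lower figure.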

The fourth element. Keeping everyone informed

An automatic notification system is a must-have. You would only need to monitor the filters and check the bugs from time to time, while the system would keep everyone informed on every important action.

This is how we set up the system — and what I would suggest you do:

Create a separate channel in Slack (or maybe you use a different messenger) for notifications. Add all the process actors — product managers, project managers, QA managers.
Enable notifications for every part of the process in this channel:

A week before lowering, post the list of planned-to-be-lowered and planned-to-be-closed bugs and tag the corresponding product manager;
As soon as a batch of bugs is lowered/closed, automatically post these bugs with their current priority/status;
When 3 months have passed since the RICE was set, post a notice about the RICE reset for these bugs;
Automatically notify the product manager about bugs whose number of reports is at or above your threshold.
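As an illustration of the first notification, the sketch below groups upcoming changes by product manager and builds one message per person. Everything here is hypothetical: the ticket keys, the handles, and the message format are invented, and actually sending the text to a messenger is left out on purpose.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PlannedChange:
    key: str     # ticket key (made up in this example)
    owner: str   # handle of the product manager to tag
    action: str  # e.g. "be lowered to medium" or "be closed"
    due: date    # when the automatic change will happen


def weekly_digest(
    changes: Iterable[PlannedChange],
    today: date,
    notice: timedelta = timedelta(days=7),
) -> Dict[str, str]:
    """Build one message per product manager for changes due within the notice period."""
    upcoming = [c for c in changes if today <= c.due <= today + notice]
    by_owner: Dict[str, List[PlannedChange]] = {}
    for change in sorted(upcoming, key=lambda c: (c.owner, c.due, c.key)):
        by_owner.setdefault(change.owner, []).append(change)
    return {
        owner: f"@{owner}: "
        + "; ".join(f"{c.key} will {c.action} on {c.due.isoformat()}" for c in items)
        for owner, items in by_owner.items()
    }


changes = [
    PlannedChange("QA-101", "alice", "be lowered to medium", date(2023, 9, 5)),
    PlannedChange("QA-102", "bob", "be closed", date(2023, 9, 7)),
    PlannedChange("QA-103", "alice", "be closed", date(2023, 10, 1)),
]
digest = weekly_digest(changes, today=date(2023, 9, 1))
```

Only the changes due within the next week end up in `digest`, so QA-103 is left out until its own notice period starts.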

Automaton, mentioned earlier, is also responsible for informing the process participants. Long story short: automate everything.

The whole process scheme

A bug prioritization approach + collecting users’ feedback + automated priority lowering + notifications approach = the bugs backlog management system. All the parts of the process are shown in the scheme below, and you can also see how they are connected with each other.

The fate of 300 bugs

You have probably already noticed that I explained the design of the process — but have not yet told you what we did about the existing 300 bugs.

We coped with it. First, we put the oldest bugs through the product managers’ review (10 bugs a week). When there were only 30 bugs left, we started the automatic process with all the rules mentioned above.

We started going through the bugs backlog back in August 2022.

The numbers

We closed 297 bugs, of which:

24 were high bugs: they became medium, then low, and were later closed;
267 were medium bugs: they became low and were then closed;
6 were low bugs, which were closed.

I must admit we started moving the high bugs through the lowering process in August 2023 (a year after we kicked off the backlog lowering). It took us some time to convince ourselves and the product managers to apply the lowering workflow to high bugs.

The conclusion

A transparent backlog is a result of many working processes, automation, and a mindset. The way you work with the bugs backlog affects many things — the morale of your team, the sprint workload — you name it.

Your efforts in cleaning up the backlog will lead you to better prioritization and focus since you will be taking into account the real users’ feedback combined with product managers’ reviews.

I wish you a clear bugs backlog and effective processes!

P.S. As always, kudos to Rita Kind-Envy for editing!

Bugs backlog automation. RICE for bugs was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

Head of QA: The Beginning

glebsarkisov — Tue, 22 Aug 2023 08:32:26 GMT

Overcoming crises as a team leader: my first year as Head of QA

Hi everyone, I’m Gleb.

In 7 years of working in QA, I have tried a variety of roles:

– tester at a startup;

– test lead at an agency and at a corporation;

– and recently it has been a year since I started working as Head of QA at Mayflower.

It is not only my role that has been changing, but also the number of people I am responsible for. A few years ago I managed a team of two testers; now I am responsible for a testing department of almost 30 people. In this article I want to share my experience in the head role. It may be useful for those who plan to grow in this direction but have some nagging inner doubts.

On fears

A small disclaimer about what I mean by fear in this article: of course, it is not a state of panic or an urge to run away. Rather, it is doubt about my own expertise, impostor syndrome, a feeling that is very familiar to many people in IT.

A year ago I agreed to take the position of head of testing. On the one hand, I was very happy about the new challenge; on the other hand, some fears were wandering around in my head. At first I even found it hard to work out what exactly scared me and why. After a while these fears took shape, and I was able to structure them into clear points:

1. People

At my previous company I was the lead for seven engineers across several teams. Seven is a great number: that is exactly how many items you can hold in your head. Now I was given 15 people from the start (and over the year that became almost 30). I was worried about how I would manage to find common ground with so many people on the team, whether that was even worth doing, what difficulties awaited me (and them), and how I would overcome them.

2. Moving into a new role

At Mayflower, the Head of QA role is just one level below the C-level in the organizational structure. I had never worked directly with such senior people before, so at first I certainly had some inner jitters. What does working with the C-level look like? Are there any generally accepted rules that I do not know yet? Will I cope with this format?

3. Expertise

Another difficult topic is making the right decisions about testing processes. How well would I manage to build a loop of analyzing problems, finding solutions, implementing changes, and observing the results? Would I propose the best changes, and how would I determine where I was wrong?

With these starting conditions I took on the new role. Then came a year of work, during which many things became easier, some fears evaporated, and some things turned out to be harder than I had thought. I will share the inside story of those fears, the actions I took against them, and the conclusions I reached. My goal is to help and support those who are just starting out on this path.

People

I had no experience managing such a large team, so naturally I was afraid of being a bad leader for them. Unfortunately, Wikipedia does not define the term “bad leader”, so I will share my own definition of who I definitely did NOT want to become for my team.

A bad leader:

– cannot balance responsibility and trust: for example, they may pull all decision-making onto themselves without discussing it with or delegating to leads and engineers, or, on the contrary, dump all problems and necessary changes on their employees;

– does not express support where it is needed or deserved, or overpraises an employee;

– is not someone you can take as an example in solving problems and moving towards a goal;

– cannot control their temper, and someone gets hurt because of it;

– does not see the big picture of what is going on, cannot conclude what is good and what is bad, and cannot act as a visionary.

So I had a portrait of the anti-leader. What did I do to avoid turning into one?

Of course, I first got to know the team in one-on-one meetings and worked out the general process of their work. It is worth noting here that all 27 engineers are embedded, in groups of two or three, in separate, fully staffed product teams (each with its own project manager, product manager, analyst, developers, etc.). I had to use different approaches with completely different people, and there were many of them. On top of that, they work in different teams, each with its own atmosphere and specifics. I understood that the path to finding common ground and to trust lay through resolving crisis situations:

– acute situations of individual team members;
– difficult cases concerning the department’s work processes.

These crises really do arise and can drag on; they do not go away by themselves. Both sides, my QAs and I, take responsibility, try to understand the causes of the problem through dialogue with me and the leads, through retros within the teams, and through personal analysis, and then follow what we agreed on.

The “captain’s” recipe for getting out of a crisis, which I try to follow every time:

Analyze the problem and highlight the key points of tension that are its source.
Communicate to everyone involved who did what wrong. Ideally, everyone clearly understands their role, responsibility, and result in the given problem, situation, or process.
State and record expectations (as goals, a retro block, etc. for the person or team). It should be clear what has to be done, by whom, why, and by what deadline.
Agree on the format of syncs with the person or team on the status of the solution.
Monitor how the problem is being solved.
Sum up the results when the agreed deadline for solving the problem arrives.

A year later, I can see that I have built trusting relationships with most of them.

The range of cases, though, was very wide: some people brought in a great testing approach and we celebrated their success, while others were demoralized by various situations on the project and I tried to help them. That is how we got to know each other: the QAs understood where I could be useful and on which matters they could trust me.

One thought once struck me: the head of a QA department (and this applies to any other department) cannot draw conclusions about everything solely from dashboards, metrics, automated notifications, and the impressions of their leads. They need to build a connection with each employee and listen to the problems those employees themselves talk about, rather than rely on consolidated feedback served on a platter.

The more people on the team, the harder it is to keep track of everything that is happening, and regular one-on-one communication with each engineer happens no more often than once every 4–5 months. Still, if the situation calls for it, you have to make exceptions and meet more often.

Do not be afraid of the size of the department: you can always invent a format that provides enough communication. It is important not to lose the direct connection with employees: only then will you really understand their pains, the value of engineers’ specific achievements, and the actual impact of your changes on the delivery process.

Coming back to leadership and how I see the result so far: I am still looking for the right balance. Sometimes I find it hard to draw a line between my personal attitude towards a person and the demands that come with my responsibility. This balance gets better with every case I take part in, although emotionally it is not easy for me (then again, nobody promised it would be).

Moving into a new role

Besides the soft-skills side of working with people, I was anxious about starting to work in direct contact with the C-level.

The questions in my head were roughly these:

What if I know little and they know a lot, so their decisions will be better and more applicable?
Will I be given freedom in decision-making, or will I have to act strictly as instructed?
Will we manage to reach mutual understanding and trust about the decisions being made? Will I be able to sell my position on various issues clearly enough?

Over the year, all these questions came up at different moments, and they can still come up now and then, but that is the nature of working closely with the C-level.

My observations after a year:

Management may indeed have stronger expertise in managerial decisions. Instead of doubting yourself, it is far more effective to try to adopt their global thinking and ability to see the big picture through their recommendations, advice, and questions.
The biggest challenge at the start is to realize that you yourself determine where the testing department is heading, at what speed, and for what purposes.
You are hired precisely because the company needs a good manager who takes responsibility for the department and is ready to look for good solutions to existing problems. At the start, it is important to agree on expectations about the level of freedom in decision-making. That is why you need to build an honest dialogue with the CTO, COO, etc., even if it seems difficult at first. As soon as the first fruits of your work appear, the dialogue with the C-level immediately becomes more comfortable and clearer.

Expertise

The third element that raised questions was my professional expertise and its applicability. It, in turn, breaks down into process tuning and managing QA tooling.

Process tuning

As for tuning processes, I worried that:

it would be hard for me to see anything at all from my position, since I am not part of any product team;
I would not understand how to steer the development of the testing toolset, or which solutions to propose for which problems.

So what did I do, and did I manage to fix everything? I set up communication with all the holders of the delivery process. I set up syncs with the lead of the project managers and with the QA tech leads, and set up metrics collection and analysis (see my other article, «Плотность дефектов “со звездочкой”», that is, Defect Density “with an asterisk”). I introduced postmortems for every critical bug at the level of the frontend, backend, and QA leads, and I plan to move this inside the teams in the near future. At our postmortem meetings we discuss critical bugs in detail. This process lets us not only patch newly exposed holes in our processes faster and more precisely, but also act preventively.

Any decision to change the delivery process should go through the project managers and be discussed with the QA team, taking their comments and suggestions into account. You can judge whether a change was useful by metrics, by the subjective impressions of the teams and their team leads, and by information from retrospectives. With this approach, unhelpful processes die off on their own, while the needed ones stay and become natural.

Managing QA tooling

By QA tooling I mean frameworks and their development, writing automated tests, working with the checklists used to test the application, and so on.

For context: in our department’s structure, above the engineers there are testing tech leads who are responsible for frameworks and testing infrastructure. My requests regarding automation live at the level of processes and numbers:

Are we keeping up with writing tests? What is the quality of the tests we write? Are there people who need to be brought up to the required baseline level?
How well does the current solution help us solve our tasks? Is the chosen framework “enough” for us?
Which backlog items should we spend resources on first? What do we expect from six months to a year of working through the focus tasks in the backlog?
What do we expect in terms of test run speed? How many flaky tests do we have now, and how many do we want to have?

I put together a set of metrics, our expectations of the infrastructure and frameworks, and the constraints. Decisions on specific framework changes, test monitoring, and helping and developing testers in these areas are the responsibility of the QA tech leads.

Everything that concerns the tools themselves (mobile devices for testing, a separate application, and the rest) and artifacts (checklists, test cases, and so on) we discuss with the tech leads and the department.

How exactly you develop your department depends directly on its structure and goal-setting. For example, if you have someone who once worked a bit with load testing, let them build an MVP, prove that it works, and then they can claim the role of “expert”. With this approach, the development of a technical direction is not smeared across everyone but assigned to a specific person. Later they can look for padawans inside the department and bring them into supporting and developing the framework, sharing their goals with them.

As head of the department, you will be filing requests to purchase licenses, and if your company controls budgets, you will need solid arguments for exactly that number of users, exactly that license type, and its term. So the final responsibility for the decision to buy software or hardware lies with you.

Conclusion

The role of head of a testing department is hard, and you can never fully prepare for it. Of course, all the difficulties can be overcome. One amusing observation: within the department it may be hard to explain what exactly you do, because the range of responsibility is huge and there is no fixed set of tasks you work on every day. You have to live with this element of uncertainty, and you need to become a beacon for your department, highlighting why we are all here in the first place, why we test the way we do, and where we want to get to in the coming years.

To everyone thinking about gradually moving into head or lead roles, I also recommend finding a mentor for the early days. This helps a lot with focusing on problems and finding solutions faster, and it gives you moral support.

Courage, patience, and good luck!

Head of QA: начало was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

Head of QA: Year One

glebsarkisov — Fri, 11 Aug 2023 12:00:10 GMT

Recipe for dealing with crises as a team leader.

Hi, Gleb here.

Over 7 years in QA, I have worked as:

– a QA engineer in a startup

– a test lead in an agency and in a corporation

– and, for a year now, Head of QA at Mayflower.

The difference is not just in the title itself but also in the number of people I am responsible for. Just a few years ago, I managed a team of 2 QA engineers; now, there are almost 30 engineers in my department. I want to share my experience working as Head of QA in this article. It might be helpful for those who are interested in growing into a head of department but still have some doubts about it.

Let’s talk fears

A tiny disclaimer on what I call “a fear” in the article: it’s not a panic attack, or trying to run away from a problem, but rather a doubt about my own expertise, my imposter syndrome — many people in IT live with this.

A year ago I was offered a Head of QA position. On the one hand, I was very happy about the new challenge; on the other hand, I was really anxious. At first it was hard to understand exactly what I was worried about and why. Later I realized that it was a set of fears, and I could name what it was made of.

1. People

In my previous company, I was a test lead for 7 engineers in separate teams. Seven is a great number. That is exactly how many elements one can hold in his head. When I started out at Mayflower, I was given 15 engineers (a year has passed, and now we are at around 30). I was worried about how I could find common ground with that many people, whether I should even try, what kind of issues I might face, and how I could deal with them.

2. Transition to a new role

At Mayflower, the Head of QA role is one level below the C-level. I had never really worked with such serious people before. So yeah, I regularly had jitters at first. How do you work with the C-level? Are there any rules I am not aware of? Will I manage this?

3. Expertise

Another important point is making proper decisions when modifying QA processes. Will I be able to build a system to analyze issues, find solutions, integrate changes, and review the results? Will I be suggesting valid things, and if not, how will I know I am wrong?

This is what I started with in this new position. In the past year, many things have become more manageable and some fears have vanished, but some things turned out to be way harder than I thought. I will try to share these feelings with you, along with what I did to deal with them and what I understood. I hope everything I am going to share will help and support you on your way to being a manager of a department.

People

I had never managed such a big team, so of course I was afraid of becoming a bad leader for these people. Unfortunately, Wikipedia has no definition of “a bad leader”, so I will show you exactly what I would not want to be for my team.

A bad leader

– cannot find a balance between his own responsibility for everything and trust in people: for example, he may insist on making every decision himself, without discussing or delegating anything to his engineers, or, the other way around, he may hand all the issues over to his employees;

– cannot give support where this is important — or always overpraises his employees;

– can hardly be called a role model in decision-making and achieving goals;

– cannot deal with his own temper;

– cannot see the whole picture, cannot define what is good and bad, and is not a great visionary.

I had 1–1 meetings with each engineer to get to know each other and understand the work process from their point of view. An important detail: all 27 engineers work in groups of 2–3 in different product teams (where there is also a project manager, product manager, BA, developer, etc.). I needed to have a different approach for all these people, and I understood that our way to build proper relationships is through overcoming various crises:

– some specific person-related cases;

– resolving issues within QA department processes.

These crises do exist, and they never go away by themselves. Instead, it is me and my QAs taking responsibility, trying to understand the reasons for the problems through communication, open discussion with QA tech leads, through retrospective meetings within teams.

“Captain Obvious” recipe for dealing with crisis

Analyze the issue and list the exact reasons for the issue.
Make it transparent to everyone who did wrong — ideally, everybody understands his role, responsibility, and the outcome of the problem/situation/process.
Talk through expectations (expected result in the expected timeframe) and write them down as a goal/action point after the team’s retrospective for a person/team.
Agree on how you are going to sync on the issue status with a person/team.
Monitor how the issue is handled.
Once the deadline comes, ensure everything is done.

Many things happened: one person was able to apply a new awesome approach in testing, and we were really happy with the results. Another person was worried too much about things on the project, and I was trying to help him. That is how we got introduced to each other and built good relationships within the team.

At some point, I heard that the head of a department (QA or any other) cannot draw conclusions based on just dashboards, metrics, notification systems, and his leads’ feelings. Instead, he has to build a connection with each employee, listen to their issues, and never rely on consolidated feedback handed to him on a silver platter.

The more people on your team, the harder it is to keep an overview of everything: your 1–1 with each employee will most likely happen once every 4–5 months. Still, you may want to meet more often when the situation really calls for it.

Do not be afraid of your department size: you can always find a suitable format that gives you enough communication with your team. Keeping that direct line matters: it is how you learn what issues people are experiencing, how valuable their achievements are, and how the changes you introduce affect the product delivery process.

Back on the leadership topic: I am still in search of the perfect balance. Sometimes it is tough to draw the line between my relationship with an employee and the need to be demanding as a manager. It seems to get better with every case I get into, though emotionally, this is still not easy for me (nobody promised me this is going to be easy lol).

Transition to a new role

Besides the soft skills part of working with people, I was quite worried about working with C-level.

Here are a few examples of questions in my head:

Maybe I know nothing, and they know everything, so their solutions will be smarter and way better?
Will I be given the proper degree of freedom in decision making or will I need to work as I am told to?
Can we build mutual understanding and trust in whatever I am planning to do? Can I sell my point of view on different issues?

Over the past year, these questions have come and gone in my head, and they still do, but this is the reality of working with the C-level.

Summing up the observations on this topic:

Your management might happen to have better expertise in management decisions. Instead of self-doubts, it is way more useful to adopt that global mindset, the ability to see the whole picture based on recommendations, advice, and questions.
The biggest challenge from the start is to understand it is your duty to define QA department plans for the future — where you are going and why.
You are hired because the company needs a good manager who takes responsibility for the department and is willing to find better solutions for existing problems. It is important to agree on the level of your freedom in decision-making. That is exactly why you need to be transparent with the CTO, COO, etc., even though this might be complicated at first. As soon as the first results of your work appear, the dialog between you and the C-level gets easier and more comfortable.

Expertise

The third element I was worried about was my own professional expertise. I would like to split that topic into 2 things: optimizing processes and QA tools management.

Process optimization

I was afraid that

it would be hard to get a good overview of what is going on from my position, since I am not a part of any product team;
I would not understand how to control the development of QA tools, or what to suggest for which problems.

In the end, what have I done, and was I able to fix this? I built communication with all holders of the product delivery process. I set up syncs with the project manager lead and my QA tech leads (more on that role later), prepared a set of metrics, and started analyzing them (read my other article on Defect Density with a twist). I introduced the postmortem process for every critical bug. For now, this is handled by backend, frontend, and QA tech leads, but I am planning to move that process inside the product teams. Through the analysis of critical issues, we agree on what to fix right now and how to prevent similar issues in the future.

Any decision to change the delivery process goes through discussion with the project management team and the QA team, taking into consideration their suggestions and thoughts. To understand the outcome of the introduced changes, you track the metrics, the subjective feelings of the teams and their team leads, and the conversations at retrospective meetings. At some point, unnecessary processes vanish, and important ones become natural.

Managing QA tools

For me, everything related to writing autotests, automation framework development, checklists preparation, and so on is a topic of managing QA tools.

In my department structure, besides QA engineers, there are QA tech leads who are responsible for testing frameworks and infrastructure. My interest in automation is in the field of processes and numbers:

Do we have enough time for writing tests? What is the quality of our tests? If anyone’s skill in writing tests is not sufficient, how can we mentor and help?
Is our framework good enough for our purposes?
What tasks from our QA backlog should we do first? While working with the backlog, what are our expectations for the next 6–12 months?
How time-consuming are our test runs? How many flaky tests do we have? How many can we have?

I collect a set of metrics, our expectations of the infrastructure and frameworks, and our limitations. Deciding what exactly to change in the frameworks, monitoring the quality of written tests, and helping and developing the engineers are the QA tech leads’ responsibility.

Everything else, from test devices and test applications to testing artifacts, is discussed with the QA tech leads and the department.

The way you are going to grow your department is connected to its structure and goal-setting approach. For example, if there is someone who has some experience in load testing, he might build an MVP as a proof of concept and then become the expert on it. This way, one person is responsible for driving a new technical direction instead of everyone in the department. Later that engineer can look for padawans, mentor and guide them, sharing his goals with them.

As a head of the department, you will be making license requests — if there is any budget monitoring in your company, you will have to provide explanations for exactly that number of users, that license type, and that license term. That basically means the responsibility for acquiring new software/hardware is solely yours.

Conclusion

While writing this article, I am also trying to summarize everything that happened during the past year and see how I managed that new role.

The head of the department role is not an easy one, and you are never going to be fully prepared for it. Some interesting observations: it might be pretty hard for you to explain your duties to your own engineers, since your area of responsibility is so wide and you cannot list the exact tasks you do on a day-to-day basis. You need to accept some level of uncertainty. You have to become sort of a lighthouse for your department, showing why we are here, why we have that testing approach, and where we want to be in the next few years.

If you are thinking about becoming lead/head at some point in your career, I strongly recommend you look for a mentor — especially at the start. This will help you focus on the problems, find solutions quicker, and get good support.

Good luck!

P.S. As usual, kudos to Rita Kind-Envy for editing!

Head of QA: Year One was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

Controlling Influence Between Groups in A/B Testing — Interrupted Time Series Design

Sergei Sergeev — Wed, 12 Jul 2023 08:16:01 GMT

Hello everyone, my name is Sergei, I am a Data Scientist at Mayflower. I build recommendation and personalisation systems. Most, if not all, of these systems and their improvements require thorough online testing before going to production. But because of the nature of the tasks and data, a standard A/B test may not be enough and may even be misleading.

So, I want to discuss the so-called Interrupted Time Series (ITS) design.

It’s one of the ways to measure treatment effects (e.g., in A/B testing). This approach might be especially useful if you suspect the treatment group (the group where you test your new feature) might affect the control group.

It’s also useful for measuring the magnitude of such influence between groups. The idea behind this ITS design is simple and intuitive. In the simplest case, you compare the past, before the intervention, and the present. In other words, you are using your past data as a control group.

1. When to expect influence?

Before diving into the details of the method, let’s review the problem of influence between groups.

One of the most obvious cases is social network data. If we want to test the effect of the intervention on one group and select users independently at random, the groups will not be independent. This is because users in both groups interact with each other, and changes in behavior in one group might significantly affect behavior in the other.

But similar network effects might emerge even if there is no explicit interaction between users.

For example, in online retail, we may say that users who buy the same item are connected. In many cases this connection is not a problem, but sometimes it is. Say we want to test a new sorting algorithm in a catalog, and in both the old and the new algorithm an item’s position is somehow correlated with its number of purchases, among other things. Suppose the new model finds and predicts very relevant items that the old one could not, and users in the treatment group buy these items. These newly predicted items may then rise to the top positions of the catalog in both groups, and the measured effect will have a negative bias. So even if users were picked independently, interactions with items make the groups dependent.
Another example is food delivery services. We are testing a new feature that makes delivery faster and customers love it. There may be an increase in the number of orders in the treatment group, which may lead to an overload of restaurants. This will affect both groups, so in this case, customers are connected through the restaurants when they make orders. Because the treatment group has better delivery conditions, fewer users from this group will change their minds and cancel orders. As a result, there will be a positive bias in the measured effect.

On the other hand, there is obviously no influence of the treatment on the past. So by measuring the user behavior in time before and after the intervention, we can measure the effect.

Moreover, we can measure not only the effect itself but also how it changes over time. This matters because it allows us to distinguish an actual treatment effect from a novelty effect.

2. A/B Test as Linear Regression

Let’s discuss how it all works and how it relates to the usual A/B testing.

Suppose we are testing how revenue per user changes if we show them a new shiny button instead of an old boring one. Often we want to know, say, whether the average user will bring us more money, so ideally we want to compare the expected values of revenue when all users see one button or the other.

In reality, we can’t compare them. One of the reasons for it is that we can’t show both buttons simultaneously to a given user. Thus, we compare the next best things — estimates of expectations, namely averages.

So we collect data from both groups with cool and boring buttons, average results, and compare them using some statistical tests. Instead, we could do something a bit different.

We can use the Ordinary Least Squares (OLS) algorithm to fit a linear regression of the form:

y = 𝛼 + βx

Remember that when we use linear regression in the form above, what we get is an estimate of the expected y given x. That is, if we collect data and fit a model with OLS, then plugging in a particular value of x gives us the average y for that x.

In linear regression, this number x represents some property of a user. So we get an average y (metric) for a user with this particular property. But this is exactly what we want from A/B testing. We want to get averages for users with properties that are in control or treatment groups. The easiest way is to denote x = 0 for the control group and x = 1 for the treatment group.

In this case, we get y = 𝛼 for the control group.

And y = 𝛼 + β for the treatment group.

The effect is the difference between the averages, which is exactly the coefficient β.

As a bonus of using the well-known and well-understood OLS, we get confidence intervals and p-values for all the coefficients, including β, for free.
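A minimal sketch of this equivalence, using only the Python standard library. The revenue numbers and the helper `fit_simple_ols` are made up for illustration; the point is that the OLS slope on a 0/1 treatment indicator coincides with the difference between group averages.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Closed-form OLS for y = alpha + beta * x with a single regressor."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical revenue per user; x = 0 marks control, x = 1 marks treatment.
control = [10.0, 12.0, 9.0, 11.0]
treatment = [13.0, 12.0, 15.0, 14.0]
x = [0] * len(control) + [1] * len(treatment)
y = control + treatment

alpha, beta = fit_simple_ols(x, y)
diff_in_means = mean(treatment) - mean(control)
```

Here `alpha` equals the control average and `beta` equals `diff_in_means`, which is what a plain comparison of averages would report.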

And that’s not all! Since we established that we can view the treatment effect as a coefficient of regression, we can use all the power of OLS and its generalizations to analyze testing.

One of the most useful applications is that we can use other properties of users within the same framework, not just the property of being in a treatment or control group. For example, the user’s age, income, education, country, or any other available data.

The only restriction is that these properties must be independent of treatment. Why is this useful? Because it increases the power of your tests. Let’s see how it works in the context of regression.

Suppose again that we are measuring the average money we get from groups A and B. As before, the usual A/B test is equivalent to a regression with just one factor.

But if we have additional data, we can incorporate it into the regression as well, in the form of additional variables.

If these new variables are good predictors of the outcome, they will explain away a lot of variance. And the lower the intrinsic variance of the metric, the easier it is to detect changes in this metric that were produced by the intervention.

Let’s see intuitively why this happens.

Suppose an additional variable is the user’s income. It is reasonable to expect that users with greater income pay more. But at each particular level of income the variance of the metric is lower because a lot of variance of a metric is due to variance of income.

What happens is that you effectively compare the treatment and control groups at each level of income. Because the variance is lower at each level, it is easier to detect changes due to the intervention. By adding new variables, you effectively create new, smaller, and more homogeneous subgroups, each with an even smaller variance.

By the way, if you are familiar with variance reduction techniques such as CUPED or post-stratification, this is essentially it.
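The variance argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the author’s code: the data are synthetic, the true effect (1.0) and the income coefficient (2.0) are invented, and the adjusted estimate is computed by residualising both revenue and treatment on income (the Frisch-Waugh-Lovell approach), which gives the same treatment coefficient as a regression on treatment and income together.

```python
import random
from statistics import mean

def slope_and_residuals(x, y):
    """Fit y = a + b * x by OLS; return the slope b and the residuals."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    return b, [yi - a - b * xi for xi, yi in zip(x, y)]

def effect_with_se(x, y, n_params):
    """Slope of y on x and its standard error; n_params is the total number of fitted parameters."""
    b, resid = slope_and_residuals(x, y)
    x_bar = mean(x)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    sigma2 = sum(e * e for e in resid) / (len(x) - n_params)
    return b, (sigma2 / s_xx) ** 0.5

rng = random.Random(0)
n = 400
treat = [i % 2 for i in range(n)]               # alternating assignment, unrelated to income
income = [rng.gauss(50, 10) for _ in range(n)]  # synthetic covariate
revenue = [2.0 * inc + 1.0 * t + rng.gauss(0, 1) for inc, t in zip(income, treat)]

# Plain A/B test: revenue regressed on treatment only.
effect_plain, se_plain = effect_with_se(treat, revenue, n_params=2)

# Adjusted: strip income out of both revenue and treatment, then regress the residuals.
_, revenue_resid = slope_and_residuals(income, revenue)
_, treat_resid = slope_and_residuals(income, treat)
effect_adj, se_adj = effect_with_se(treat_resid, revenue_resid, n_params=3)
```

In this setup `se_adj` comes out much smaller than `se_plain`, because most of the spread in revenue is explained by income rather than left in the noise.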

It’s also worth noting that the connection between hypothesis testing and linear regression is much deeper than described above. An A/B test is just the simplest example of causal inference. There are many situations where we cannot properly divide the population into treatment and control groups (especially in economic and social sciences) but still want to measure the effect of treatment, i.e. the causal impact of our intervention. For these cases there are many quasi-experimental techniques, a lot of which are based on linear regression. For more details and insights I highly recommend the excellent online book Causal Inference for The Brave and True¹.

3. Time for time!

Ok, so far, so good. We have our metric and variables describing users’ properties. The metric depends on these variables. We know how and why to use these variables to calculate the treatment effect. Now it’s time to bring time to the picture. It will be just one more generalisation.

When we consider a process in time, we have to assume that the outcome depends not only on external variables but also on its own past. For example, if today is Thursday, it is reasonable to expect that today’s revenue is somehow correlated with yesterday’s revenue and with the revenue of the previous Thursday. We might end up with something like:

yₜ = a₁yₜ₋₁ + a₇yₜ₋₇ + εₜ

This is called autoregression because it is a regression of the outcome on itself (in the past).

The outcome may also depend on the random noise in the past; this is called a moving average process.

I will not go deep into detail here, but the intuition is that we can treat these temporal factors as a kind of variable in the regression. To get correct error bounds for the estimates, however, some generalisations have to be made.
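To make the “temporal factors as variables” intuition concrete, here is a minimal sketch: a synthetic series where each value depends on the previous one, and a plain regression of the series on its own lag. The coefficient 0.7 and the series length are arbitrary assumptions, and this shows only the point estimate, not the corrected error bounds mentioned above.

```python
import random
from statistics import mean

rng = random.Random(1)
phi_true = 0.7  # hypothetical dependence of today's value on yesterday's
series = [0.0]
for _ in range(2000):
    series.append(phi_true * series[-1] + rng.gauss(0, 1))

# Treat the lagged value as an ordinary regressor: y_t ~ y_{t-1}.
lagged, current = series[:-1], series[1:]
x_bar, y_bar = mean(lagged), mean(current)
phi_hat = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(lagged, current))
    / sum((x - x_bar) ** 2 for x in lagged)
)
```

With this much data `phi_hat` lands close to `phi_true`, which is the sense in which the past of the series behaves like one more explanatory variable.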

This generalization of linear regression that takes into account these kinds of dependencies on the past as well as independent variables is called the SARIMAX model and is also very well studied and understood. And it can also produce rigorous confidence intervals and p-values.

There are, however, a few notable and useful differences:

We have to take the trend into account. For example, revenue might increase with time just because of inflation
We make comparisons with the past, so the treatment variable becomes 0 before the intervention and 1 after
Now we can detect not only the effect on average but the effect on trend as well. So we have to add a new variable for it
As data points are not users anymore but periods (e.g., days), the interpretation of independent variables also changes. These are now properties of periods as well. For example, instead of user income, we may consider the average income of users who visited our site each day. Or even weekday dummy variables for each day.

Here is an example of a process in time that abruptly changes after the intervention:

Image from a review² of different methods of Interrupted Time Series studies

So, just as linear regression may (and frankly should) be considered a base for A/B testing, SARIMAX is a base for detecting and measuring the influence of an intervention in time. This approach of comparing metrics over different time intervals, before and after the intervention, is called the Interrupted Time Series design.

The formula that describes this quasi-experimental setup looks like this:

y = β₀ + β₁t + β₂x + β₃[t − T]x + ε

Before the intervention, when x = 0, the behavior is mostly defined by the level β₀ and the trend β₁. After the intervention at the moment T, when x = 1, the new level becomes β₀ + β₂ and the new trend becomes β₁ + β₃. The [t − T] factor allows us to start counting time intervals from the moment of the intervention.
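The setup described above can be sketched as a design matrix with four columns: a constant, the time index t, the 0/1 intervention indicator x, and the time since intervention (t − T) multiplied by x. The example below is an illustration only: the intervention day, the coefficients, and the helper names are invented, the series is noise-free, and the fit is plain OLS rather than SARIMAX, so there is no autoregressive part.

```python
def solve(a, b):
    """Solve the linear system a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def ols(rows, y):
    """OLS coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def its_row(t, T):
    """Design row [1, t, x, (t - T) * x], where x = 1 from the intervention day T onward."""
    x = 1 if t >= T else 0
    return [1.0, float(t), float(x), float((t - T) * x)]

T = 30                           # hypothetical intervention day
true_b = [100.0, 0.5, 8.0, 0.3]  # level, trend, level change, trend change
rows = [its_row(t, T) for t in range(60)]
y = [sum(b * v for b, v in zip(true_b, row)) for row in rows]  # noise-free series

b0, b1, b2, b3 = ols(rows, y)
```

On this noise-free series the fit recovers the level change `b2` and the trend change `b3` that were put in. On real data the same columns would be passed as regressors alongside the autoregressive and seasonal terms.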

Before going to concrete cases, one last thing I want to discuss briefly is how to select features (or exogenous variables) and parameters for the SARIMAX model.

As was mentioned before, the main condition for exogenous variables is they have to be independent of treatment. As for SARIMAX parameters, you do not want to overfit your model. It means you don’t want to construct a model so complicated and powerful that it will explain everything in your observed data, including random noise.

In that case, if you use your model for predictions, you will get poor results. And even if our aim is not to predict anything, it is obvious that a model with better predictive power is more trustworthy.

One of the methods to detect overfitting is cross-validation, leave-one-out (LOO) cross-validation in particular. For each data point, we construct a model using all data points except this one and use the model to predict it.

By measuring the average error on the whole dataset, we estimate the model’s generalisation ability. Although this method has some cons, one nice thing about it is that for linear regression, we can easily construct an exact analytical estimate without needing to refit the model for each point in the dataset.
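As an illustration of that analytical shortcut: for ordinary least squares, the leave-one-out prediction error of point i equals its ordinary residual divided by (1 − hᵢ), where hᵢ is the leverage of that point. The sketch below checks this on a tiny made-up dataset for a one-variable regression, where hᵢ = 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)². The data and helper names are assumptions for the example.

```python
from statistics import mean

def fit(x, y):
    """OLS intercept and slope for y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b * x_bar, b

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]  # hypothetical data

# Brute force: refit without point i, then predict it.
loo_brute = []
for i in range(len(x)):
    a_i, b_i = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo_brute.append(y[i] - (a_i + b_i * x[i]))

# Shortcut: a single fit on all data, residuals rescaled by the leverage.
a, b = fit(x, y)
x_bar = mean(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
loo_fast = [
    (yi - (a + b * xi)) / (1 - (1 / len(x) + (xi - x_bar) ** 2 / s_xx))
    for xi, yi in zip(x, y)
]
loo_mse = mean(e * e for e in loo_fast)
```

Both lists hold the same leave-one-out errors, but the second needs only one model fit.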

Things become more complicated when dealing with temporal data, but fortunately, there is a method that is asymptotically equivalent to LOO. It is called the Akaike Information Criterion (AIC).

Roughly speaking, the lower its value for a model on the same data, the better the model. So when you try to select how many exogenous, past, or seasonal factors to consider, look at this criterion value.
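A small sketch of how such a comparison might look for least-squares models with Gaussian errors, where AIC can be written (up to an additive constant that is the same for all models on the same data) as n·ln(RSS/n) + 2k, with k the number of fitted parameters. The synthetic series and the two candidate models are assumptions for illustration; a SARIMAX implementation would normally report AIC directly.

```python
import math
import random
from statistics import mean

def gaussian_aic(residuals, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    return n * math.log(rss / n) + 2 * n_params

rng = random.Random(2)
t = list(range(50))
y = [10.0 + 0.4 * ti + rng.gauss(0, 1) for ti in t]  # hypothetical metric with a trend

t_bar, y_bar = mean(t), mean(y)

# Model 1: constant level only.
resid_const = [yi - y_bar for yi in y]

# Model 2: level plus linear trend.
slope = (
    sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    / sum((ti - t_bar) ** 2 for ti in t)
)
resid_trend = [yi - (y_bar + slope * (ti - t_bar)) for ti, yi in zip(t, y)]

aic_const = gaussian_aic(resid_const, n_params=1)
aic_trend = gaussian_aic(resid_trend, n_params=2)
```

Here `aic_trend` is lower than `aic_const`: the extra parameter costs 2 points of penalty but removes far more unexplained variance, so the trend model is preferred.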

4. Examples

Ok, now let’s look at some cases.

First, we used this approach after the fact to check for influence between groups during a test of the recommender system. The idea to check for this influence came to us when we were planning the design of an A/B test of the Multi-Armed Bandits (MAB) algorithm.

In short, this is a kind of sorting algorithm that will affect the whole site. Our current sorting method is related to popularity, so the influence during testing is expected.

It is hard to estimate such influence in advance, so we decided to measure the influence between groups during the test of the recommender algorithm. The recommender is a smaller, though very popular, section of our site. The idea was that if we can detect a significant influence in this case, such influence in a larger-scale test of MAB is inevitable.

4.1 Basic example with one group

First, let’s look at a basic example.

This is just a treatment group before and after the intervention, which means with old and new recommender algorithms.

For this and the next examples, I will use two metrics: one is related to users with a desired behavior and I will call it the “number of users”. The other is related to user spending and I will call it “money”.

It is interesting to note that one of the additional differences of this approach is that we can compare not only averages (like money spent by users) but also counts and amounts such as the daily number of users and daily amount of money.

As you can see, the metric for the number of users has a much smaller variance. This is intuitively expected because the uncertainty of the money metric comes from the number of users (which is more or less the variance of the user metric) and the amount of money a user spends.

We can expect that it would be harder to detect an effect on this metric due to its higher variance.

You have probably already noticed that there may be some change in behavior after the intervention. To be sure, we have to test for the statistical significance of this change. We have to account for the dependence of the metric on itself, that is, autoregression. It is also better to reduce some variance by introducing exogenous factors. In this case, they are simply dummy factors for weekdays and a quadratic factor for the day of the month.
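As an illustration of what such exogenous factors can look like, here is a minimal sketch that builds one row of regressors for a given day. The field names, the choice of Sunday as the reference weekday, and the dates are assumptions made for this example; the actual feature set used in the analysis may differ.

```javascript
// Build the exogenous regressors for one daily observation.
// All names here are hypothetical and only illustrate the idea.
function buildFeatureRow(date, interventionStart) {
  const weekday = date.getUTCDay(); // 0 = Sunday ... 6 = Saturday
  const dayOfMonth = date.getUTCDate();
  const row = {
    // Step dummy for the effect: 0 before the intervention, 1 from its start.
    effect: date >= interventionStart ? 1 : 0,
    dayOfMonth,
    dayOfMonthSquared: dayOfMonth ** 2,
  };
  // Six weekday dummies; Sunday is the reference level so that the
  // dummies are not collinear with the intercept.
  for (let d = 1; d <= 6; d++) {
    row["weekday" + d] = weekday === d ? 1 : 0;
  }
  return row;
}

const interventionStart = new Date(Date.UTC(2023, 0, 16));
const rowBefore = buildFeatureRow(new Date(Date.UTC(2023, 0, 10)), interventionStart);
const rowAfter = buildFeatureRow(new Date(Date.UTC(2023, 0, 18)), interventionStart);
```

Rows like these, one per day, would then be passed as exogenous variables to the time series model alongside the autoregressive terms.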

What we are seeing here are metrics for users and money, respectively, when controlled for all factors, including time-related, apart from effect.

By the way, in this case, and the next one, the trend does not change after intervention. I also want to stress that factors are divided into effect and others just for visualization purposes.

There are no actual differences during calculations. It is important because otherwise, you risk getting very wrong estimations of your effect.

Now, we see that for the user metric, the 95% confidence interval does not include zero (or the estimate of the metric without intervention). As for the money metric, the interval does contain zero, so the effect is not significant. A possible reason for this, apart from an actual absence of effect, is that because of the high variance, the method is not powerful enough.

4.2 Intergroup influence

Ok, so it was an example of the most basic usage of this approach. Now, let’s see a more complex one.

We have two groups, test and control, both changing with time. If there is an influence, we can expect that the metrics of the control group will change after the start of the intervention, maybe even statistically significantly. Let’s look at the treatment group first.

This is the picture of the evolution of metrics before the intervention (orange), during the test (blue), and after it (green), when we decided to use the treatment variant for all users. We can see that there is a statistically significant change after the experiment, i.e., after the exposure of the control group to the new recommender algorithm. This means that intergroup influence is significant.

This can be confirmed if we look at the control group.

Indeed, during the experiment (blue), the number of users increased significantly. Because the SARIMAX model was used to estimate the effects of the intervention, we can be pretty sure that this is not a random temporal fluctuation in the metric, but a causal effect of the intervention.

One important consequence is that in the usual A/B test, we measure the difference between groups, but this is not the actual effect. There is a negative bias in this case, because the control group is also affected by the intervention. The actual effect can be estimated as the difference between post- and pre-experiment levels.

Another notable fact is that estimations for both groups are basically the same, which means that the method is pretty consistent. After the test, both groups became equal again, as expected.

This is also an illustration of conducting multiple tests simultaneously using this method. It can be done as long as you start your tests at different times.

This is a property of linear regression. As long as your factors are not totally collinear, you can estimate their coefficients. At the first point in time, we have our first intervention, which is turning on the new algorithm for group B. Because the test was successful, we did not turn it off, so the intervention continued. It is present during the second intervention, which is turning on a new algorithm for group A.

Here we are measuring one effect in the presence of another. So, in conclusion, for this case, we see that we can measure the influence by measuring the effect of the intervention on a control group.

4.3 Switchback

As a final example, I want to show you one of the different experiment designs where this approach might be used. It is an example from econometrics: the implementation of some policy. I will not go into detail about this policy. I want to show another way to find and measure the effect.

Image from a chapter about ITS of causal inference book³

Here, the policy was implemented at some point in time, then canceled, and finally reimplemented. So there are several points where we can detect and measure the effect to get a more reliable estimate. It may also be very useful in situations when an intervention changes the metric gradually but its cancellation has an abrupt, easily detectable effect on it. For example, the MAB algorithm may take some time to converge after implementation.

That’s why we decided to use this particular method for our MAB test.

5. Conclusion

That was a review of the Interrupted Time Series design, a method that can be used as an addition or sometimes an alternative to other methods of measuring treatment effects, such as usual A/B testing, synthetic control, etc. In conclusion, I want to mention two things.

First, the main risks that you should be aware of when using this method.

You have to have enough observation points. The absolute minimum is 8 before and 8 after the intervention, but it is usually recommended to have at least 100 in total.
And in contrast with the usual A/B test, increasing the sample size of users will not lead to an increase in power. You can try to divide your time intervals into smaller ones, but this might increase the variance and the number of autoregressive factors, and your model may become more prone to overfitting. Or not. It may actually be a viable option, but you have to spend a lot of time researching. In short, this is usually hard to automate.
Another risk is that there might be an unexpected event of unknown duration that will affect the results. This is especially risky if there is only a treatment group. So you may need to perform a separate analysis to detect such events.
And the last one is that even though we can conduct multiple tests simultaneously, the more tests we conduct, the harder it is to detect their effects. This is again a property of linear regression. The effect and trend factors for different tests are partially correlated, and so the variance of their coefficients increases.

And the final point I want to discuss is alternative approaches to deal with the influence between groups caused by network effects. In short, the idea is that we find clusters in a network and use these clusters as units in randomization.

Image from an article⁴ by Meta Research about dealing with network effects

This is a very general and robust approach and I think this is what you should aim for when building a testing platform to incorporate network effects. But it is expensive and takes a lot of time to implement. So if you suspect that there is a significant influence between groups in your tests, you may want to start implementing this method.

In the meantime, you can use the Interrupted Time Series design as a quick, cheap, and pretty reliable alternative.

References

Flavio Regis de Arruda (2022). Interrupted Time Series (ITS) in Python. xboard.dev

[1] Matheus Facure Alves (2022). Causal Inference for the Brave and True. matheusfacure.github.io

[2] Turner, S.L., Karahalios, A., Forbes, A.B. et al. Comparison of six statistical methods for interrupted time series studies: empirical evaluation of 190 published series. BMC Med Res Methodol 21, 134 (2021). https://doi.org/10.1186/s12874-021-01306-w

[3] Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton, Mifflin and Company. (apa.org)

[4] Brian Karrer, Liang Shi, Monica Bhole (2021). Testing product changes with network effects. Meta Research (facebook.com)

Controlling Influence Between Groups in A/B Testing — Interrupted Time Series Design was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

In search of the best EcmaScript version for the website assembly

Yoskutik — Thu, 22 Jun 2023 13:52:14 GMT

Hello everyone, my name is Dima. I am a Frontend Developer at Mayflower. Recently I found out that choosing the ES version for building a web application, as well as organizing the build itself, can be a difficult task. Especially if you want to make this choice based solely on evidence. In this article, I will address the following points on this topic:

How does the compilation of code for ES5 affect the performance of the site?
Which tool generates the most efficient code — TypeScript Compiler, Babel or SWC?
Does modern syntax affect the speed of JavaScript code parsing by browser?
Is it possible to achieve a real reduction of bundle’s size, taking into account the use of Brotli or GZIP, if you compile the code in a higher version of ES?
Is it really necessary to build sites in ES5 in 2023?
And also how we implemented the transition to a higher version of ES, and how our metrics have changed.

To answer questions 1–3, I even created a full-fledged benchmark, and I decided to test the fourth question on our real project with a large code base.

Is compiling into ES5 bad?

ECMAScript features are updated every year, which really helps developers reduce the code base of projects and increase the readability of the code. To be able to use the latest version of ES in the source code, developers just need to configure the build process: set up compilation and add some polyfills.

Just a quick reminder for those who have forgotten why it is necessary to configure the build. For example, the function Array.prototype.at appeared only in ES2022, and Chrome versions below 92 do not know that such a function exists. Therefore, if you use it without ensuring backward compatibility, users of older versions of Chrome will not be able to fully use your site.

Let me give you a couple of short examples on backward compatibility. First, you can add polyfills.

// After adding such polyfills
import "core-js/modules/es.array.at.js";
import "core-js/modules/es.array.find.js";

// You can safely use these functions
[1, 2, 3].at(-1);
[1, 2, 3].find(it => it > 2);
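To show roughly what a polyfill does under the hood, here is a simplified sketch of the feature-detection idea: use the native method when the browser has it and fall back to an ES5-compatible implementation otherwise. The helpers atFallback and atIndex are hypothetical names for this illustration; they are not part of core-js, which patches Array.prototype itself and handles more edge cases.

```javascript
// Simplified ES5-compatible fallback for Array.prototype.at.
function atFallback(list, index) {
  // Truncate towards zero without relying on Math.trunc (ES2015).
  var n = index < 0 ? Math.ceil(index) : Math.floor(index);
  if (n !== n) { n = 0; } // NaN (for example, a missing index) becomes 0
  var k = n < 0 ? list.length + n : n;
  return k >= 0 && k < list.length ? list[k] : undefined;
}

// Feature detection: prefer the native implementation when it exists.
function atIndex(list, index) {
  return typeof Array.prototype.at === "function"
    ? list.at(index)
    : atFallback(list, index);
}

var lastItem = atIndex([1, 2, 3], -1);
```

This is also why polyfills alone cost little in modern browsers: the detection succeeds and the native code runs.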

And second, you can use a compiler that will convert modern syntax code into code that is supported by older browsers:

// For example, this code
const sum = (a, b) => a + b;

// Using Babel or any other compiler can be converted into this code
var sum = function sum(a, b) {
  return a + b;
};

Well, I’ve never really liked the need for that backward compatibility. After all, it implies the mandatory generation of additional code, which in turn leads to an increase in the bundle’s size, higher memory use, and possibly a performance degradation of the application. And all this while most clients (at least in our case) have a relatively recent version of the browser, which means that for them backward compatibility is potentially harmful.

That’s why it became interesting for me to answer the questions that I indicated at the beginning of the article. I decided to start my research by creating a benchmark. Its purpose is: isolated evaluation of the performance of features in assemblies compiled for ES5 by different tools (TypeScript, Babel, SWC), as well as in an assembly without compilation.

The experiment was performed only on features that require compilation, such as classes or asynchronous functions. I decided not to test features that depend on polyfills, because if the browser already has an implementation of a specific feature, polyfills try not to replace it with their own.

Benchmark description: parsing speed and performance test

As I wrote above, I’m going to evaluate each possible compiler separately, because the results of code generation of each compiler may differ. Therefore, in the benchmark, to test each feature, I created bundles compiled using TypeScript, SWC and Babel. You may object that it would be nice to check ESBuild as well, but at the time of writing, it was not capable of generating ES5 code, so I did not consider it.

Example of generated code difference:

// Such code
const sum = (a = 0, b = 0) => a + b;

// Babel will compile into this
var sum = function sum() {
  var a = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : 0;
  var b = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : 0;
  return a + b;
};

// And TypeScript into this
var sum = function (a, b) {
    if (a === void 0) { a = 0; }
    if (b === void 0) { b = 0; }
    return a + b;
};
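Although the generated code looks different, both outputs are meant to preserve the behavior of the original function. A quick sketch of a check for that, with the three variants from above renamed so they can coexist in one file (the names sumModern, sumBabel and sumTs are only for this illustration):

```javascript
// Original ES2015 source.
const sumModern = (a = 0, b = 0) => a + b;

// Babel output shown above.
var sumBabel = function sum() {
  var a = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : 0;
  var b = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : 0;
  return a + b;
};

// TypeScript output shown above.
var sumTs = function (a, b) {
  if (a === void 0) { a = 0; }
  if (b === void 0) { b = 0; }
  return a + b;
};

// Defaults apply to missing and undefined arguments, but not to null.
const argumentLists = [[], [1], [1, 2], [undefined, 2], [null, 2]];
const allAgree = argumentLists.every((args) => {
  const expected = sumModern(...args);
  return sumBabel(...args) === expected && sumTs(...args) === expected;
});
```

So the compilers differ in how they express the same semantics, and this is exactly what can lead to different performance.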

In addition to these 3 builds, I created another one in which the code of the feature under test remained intact. I will continue to call this one “modern” in the text.

I was also interested to check how different features work in different browsers. After all, browsers may have different engines or at least a different set of optimizations. This means that the benchmark results may potentially differ from one browser to another. And just to automate the collection of metrics in different browsers, I created a small HTTP server on NodeJS.

Each test involves opening the generated HTML file N times with a delay between runs. Each launch was performed in a new browser tab in private mode. Upon opening the HTML file, the browser runs the JavaScript code, and after its execution sends a request to the HTTP server with the result of the test iteration run. I tried to get metrics that would be maximally correlated with the metrics of First Paint, Last Visual Change and others similar to them.
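The measurement part of a single iteration can be sketched roughly like this. It is a simplified illustration and not the code from the benchmark repository: the function name measure and the sample workload are made up, and the step where the browser sends the result to the HTTP server is left out.

```javascript
// Use the high-resolution timer when available, otherwise fall back to Date.
const now = typeof performance !== "undefined" && typeof performance.now === "function"
  ? () => performance.now()
  : () => Date.now();

// Run the code under test several times and summarize the timings.
function measure(fn, runs) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    samples.push(now() - start);
  }
  samples.sort((a, b) => a - b);
  return {
    runs,
    min: samples[0],
    median: samples[Math.floor(runs / 2)],
  };
}

// Hypothetical workload standing in for the feature under test.
const result = measure(() => {
  let total = 0;
  for (let i = 0; i < 100000; i++) total += i;
  return total;
}, 5);
```

Reporting the median rather than the mean makes a single slow run, for example one interrupted by garbage collection, less influential.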

Benchmark process visualization

First of all, I created the benchmark to determine the performance of the features, but it was also interesting to look at the impact of the features on the parsing speed. Then, to evaluate the parsing speed, I created 4 additional builds, in which I simply multiplied the code from the assemblies to measure performance. And then I just measured how long it takes the browser to read the contents of the script element.

Benchmark results: not everything is so clear

We gradually came to the section with the results. Here, I made a bar chart for each version of the ES as well as for each syntax feature. Each graph shows the code execution speed for each of the builds in each of the browsers. The longest line on the graph means that the build worked the fastest.

Be careful — there are a lot of tests and graphs in this block!

Performance evaluation of ES features

ES2015 (ES6)

Arrow functions. As it turned out, there is a difference in the speed of executing the normal and arrow functions. However, only for Chrome, Opera and other V8 browsers. There, the arrow functions work 15% slower. Apparently, in these browsers, controlling the context in which the function was created is more difficult than using your own context for each function.

Test source code.

Classes. In this test, there is a huge gap between the results of different compilers. The modern and TypeScript configurations showed significantly faster results. Overall, the modern configuration is the most productive, except that Safari worked better with TypeScript. The code generated by Babel and SWC ran 2–3 times slower.

Test source code.

In the test of using default parameters, the results are absolutely the opposite. SWC and Babel show similar results and work out the fastest. The slowest was the TypeScript build. The modern one has not gone far from TypeScript, but still shows itself a little more effective.

Test source code.

Iteration using the for...of construction. TypeScript is breaking all records again. Next comes the modern assembly, SWC and at the end is Babel.

Test source code.

Generators. Babel showed the fastest result among the compilers. With the modern assembly, not everything is so clear. In Safari it proved to be faster than Babel, but in Firefox it is the slowest. Apparently, the Firefox developers did not think much about optimizing generators. But if we do not take this browser into account, then I would say that the modern assembly shares first place with Babel, and SWC and TypeScript together stand in second.

Test source code.

In the test of using enhanced object literals, the situation is also ambiguous. In general, TypeScript and modern builds are the most productive, in Firefox and Safari TS is the one who takes precedence, in V8 browsers it is modern. According to the chart, Babel turned out to be the slowest, but I think this was due to some side effect, and in a real project, the results of SWC and Babel would be the same.

Test source code.

Extremely unambiguous results came out in the test of using the rest parameters. The most productive configuration is modern, the slowest is TypeScript.

Test source code.

Spread operator. Definitely the modern assembly showed itself the fastest. In Chrome and Opera, the difference was as much as 4 times. The rest of the configurations showed themselves at about the same level, but in Firefox TypeScript worked slightly slower.

Test source code.

Template strings — again, the modern assembly has definitely shown itself to be more productive. There is no difference in the assemblies with different tools.

Test source code.

ES2016

Exponentiation operator. Absolutely no difference, everything is within the margin of error.

Test source code.

ES2017

Asynchronous functions. Modern assembly is again in the first place. The largest margin in Safari — it is up to 20%. There is a slight difference between other configurations, but it will not be possible to draw unambiguous conclusions — in Chrome and Opera, Babel is the slowest build, and in Firefox the fastest.

Test source code.

ES2018

Formally speaking, only two syntactic features appeared this year — rest and spread operators in objects. However, I thought that 2 tests might not be enough. And all because, depending on how these operators were used, different tools generate code in different ways.

Here is a link to the sandboxes of the selected assemblers, if you want to look at the variety of generated code:

Let’s start with a simple one. To evaluate the rest operator, I created 2 tests — in one I just copy the object, and in the other I take several properties from the object.

In the first case, the rest operator showed quite interesting results. Browsers seem to be divided into 2 camps. Chrome and Opera are optimized for working with TypeScript code, the modern build comes next in terms of speed, and Babel and SWC trail at the end. In Firefox and Safari the situation is the exact opposite: TypeScript works the slowest, and the results for the rest of the builds are almost the same.

In the second case, in all the same Safari and Firefox, the modern configuration wins everyone. But in Opera and Chrome, it is the slowest. Of the compilers, TypeScript was again a little slower than the rest of the assemblies.

Now let’s speak about the spread operator. I have written 4 tests using the spread operator in different configurations. But regardless of how I used the operator, the benchmark results turned out to be similar to the results for the rest operator — modern and TS builds work fast in Safari and Firefox, but just as slowly in Chrome and Opera.

In all tests, there is approximately such a picture. But if you are interested in looking at all the results, you can study them in the repository.

ES2018 Bonus

A funny fact that I discovered while writing the benchmark. If you have already looked at the source code of the tests, you may have noticed that I used the values 'a' + i as keys. And I did it on purpose! Because, as it turned out, if you use a number as a key in an object, then for some reason unknown to me, in Chrome and Opera the modern assembly begins to work incredibly quickly. And not just faster than other builds in the same browsers, but even faster than in Firefox or Safari, although they showed their superiority in the tests above.

Test source code.

ES2019

Private fields in classes. Again, an unconditional victory for the modern assembly. TypeScript also shows good results, apart from the tests in Safari. But you shouldn’t rely on them anyway: TypeScript, unlike the other compilers, is not able to compile private fields to ES5.

Test source code.

ES2020

Nullish coalescing operator. Again, an unconditional victory for the modern configuration. And Babel proved to be the worst.

Test source code.

Optional chaining operator. TypeScript performed worse than other assemblies, but otherwise there is no difference.

Test source code.

ES2021

Logical operators. I was interested to check individually how they work, when assignment is applied and when not.

In the first case, the modern build shows itself slightly less productive in Chrome, but more productive in Safari. There is no difference between the compilers.

Test source code.

And in the second case, a modern assembly paired with TypeScript shows its superiority over other assemblies.

Test source code.

ES2022

Private methods in classes. The results are the same as in the class usage test. And TypeScript is still not able to use private modifiers in ES5. But in ES6, the ratio of results remains the same.

Test source code.

Parsing speed evaluation

Optimizing parsing speed was a popular trend back in the days of OptimizeJS. A lot of time has passed since then: the developer of this library himself marked it obsolete, and the V8 developers described the practices used in it as destructive. So front-end developers no longer chase a couple of milliseconds of parsing time. And I wasn’t going to either, of course. But still, I was wondering whether using modern syntax could affect the speed at which the browser reads JavaScript code.

I ran the test, and got a couple of interesting results. For example, it turned out that Safari reads arrow functions slower than usual ones, despite the fact that the file with arrow functions has the smallest size.

And Firefox processes code with private fields in the class for quite a long time. And it’s funny that it reads private methods without that much difficulty.

This is where the interesting facts end. In other cases, the benchmark results show a clear correlation of time and the number of characters in the generated code, which means that in other cases, parsing of a modern assembly proved to be the most effective. If you want to see the results in detail, here is the link.

A brief summary of the benchmark

The entire text described above can be summarized with 3 main ideas.

1. The modern assembly does not have absolute superiority over ES5 and often even performs slower. However, it is the fastest in most cases.

2. There is no ideal tool for building the most productive code in ES5. At least because different browsers have different optimizations. But you can choose for yourself the best ratio of pros and cons. For example, if suddenly there’s a huge amount of generators in your application, Babel will be a very obvious choice, and if there are a lot of classes, it’s worth looking towards TypeScript.

I would say that TypeScript often performs better than other tools. However, it upsets me that in some places where it feels good in Safari, in Chrome it is able to show the worst result. Especially considering the fact that the majority of users are on Chrome.

3. We can conclude that not all browsers have paid attention to optimizing work with modern syntax. Firefox works terribly with generators, Chrome has not completely organized the work of spread in objects, etc. However, it seems to me that if browsers are engaged in under-the-hood optimizations, they are more likely to pay attention to modern syntax. So who knows, maybe in a couple of years the modern assembly will definitely be the fastest.

And what about the bundle size?

The favorite phrase of developers still compiling for ES5 sounds like this:

“Well, what’s the point of chasing a reduction in the size of the bundle? Compression tools will level out all this difference anyway.”

And whether they are right in their reasoning, we will find out now.

I decided to check this point on my working project, because compression is a rather complex process, and therefore it would not be entirely fair to conduct an assessment separately for each feature.

During the tests, I removed the polyfills from the assembly. Then I compiled our project with each of these tools, compressed the output using GZip and Brotli, and calculated the total volume of the created chunks of the application. And these are the results I got.

           | Raw     | GZip    | Brotli
Modern     | 6.58 MB | 1.79 MB | 1.74 MB
TypeScript | 7.07 MB | 1.82 MB | 1.86 MB
Babel      | 7.71 MB | 1.92 MB | 1.86 MB
SWC        | 7.60 MB | 1.94 MB | 1.86 MB

We may be surprised that Brotli showed worse results on TypeScript than GZip. This happened because I was running Brotli with a compression level of 2 (the maximum is 11). I decided to choose this compression level because it is as close as possible to the settings used in Cloudflare by default.
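A comparison like this can be reproduced in Node.js with the built-in zlib module. The sketch below is only an illustration under assumptions: the helper name compressedSizes and the sample input are made up, GZip runs with the Node.js default level (the level used for the table above is not stated), and Brotli runs with quality 2 as described above.

```javascript
const zlib = require("zlib");

// Return the raw, GZip and Brotli sizes (in bytes) of a piece of source code.
function compressedSizes(source) {
  const input = Buffer.from(source, "utf8");
  const gzip = zlib.gzipSync(input); // default compression level
  const brotli = zlib.brotliCompressSync(input, {
    params: { [zlib.constants.BROTLI_PARAM_QUALITY]: 2 },
  });
  return { raw: input.length, gzip: gzip.length, brotli: brotli.length };
}

// Hypothetical input: a real run would read the generated chunks from disk.
const sample = "var sum = function sum(a, b) { return a + b; };\n".repeat(200);
const sizes = compressedSizes(sample);
```

Summing these numbers over all chunks of each build gives a table of the same shape as the one above.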

And what do we see? The size of the project has really decreased: by about 7–15% in the raw version and by about 2–8% in the compressed versions. And here the decision is up to you. For someone such a difference will be insignificant, and for others, on the contrary, it will seem significant. For ourselves, we decided that this difference is big enough to try to use a modern assembly in production.

It turns out that the modern assembly gets another victory.

Well, along with this, the table shows how TypeScript shows its superiority in terms of the volume of generated code over other libraries.

Is 4% so important?

From everything described above, a simple conclusion can be drawn. Users will get a nicer UX if your product is compiled in a higher ES version. Your web application will become more productive and also your bundle will be smaller.

However, at the same time, you need to understand that according to Browserslist, only 96% of users worldwide currently have ES2015 support, 95% have ES2017, and higher versions have even lower support.

Therefore, the conclusion can be made as follows:

If these 4% of users with outdated browsers are not so important to you, then it would be more logical to build a site in a fresh version of ES. For example, in ES2018.
If they are still important, but your project is not very large, or a gain in quality metrics is not very important to you, you can build for ES5. Performance will not suffer critically from this.
But if users with outdated browsers are also important to you, and even a slight increase in performance, you should think about creating two assemblies — modern and ES5 — and think about how to deliver the right assembly to the user. That’s exactly what we did in our company.

Our experience of using modern assembly

In general, the idea of separating assemblies in our product appeared long before I joined Mayflower; I just improved it a little. Now we build our application twice: one assembly is compiled to ES5 with all the required polyfills, and another to ES2018 with a very limited set of polyfills.

To the question of why we stopped at ES2018. The higher we looked at the version of the standard, the less the difference between the builds of different versions was felt. We chose ES2018 as a kind of edge at which 95% of users will get a fast website, and at which the advantages of a modern build will be used to the maximum. We don’t keep private fields in the class, so the only difference between ES2018 and ES2022 is a small performance loss when using the nullish coalescing operator and, possibly, the logical operator. For sure we’ll get over this loss.

And now about how we implemented it. Especially for this article, I decided to create another repository, just to show how the assembly of the application can be organized taking into account the separation of assemblies. There I implemented a simplified implementation of our working variant. However, it still shows how you can organize not only the separation of JavaScript code assemblies, but also CSS. If you open the developer tools in the assembled site, you can see that even on this small project, you can get a reduction of files by 120 KB, which was 30% in my case. You can use the deployed assembly from this repository at this link.

And if you don’t want to look at the repository, then I will briefly describe how we determine on the client side which assembly needs to be downloaded. We simply check the browser’s ability to handle asynchronous functions, as well as the presence of several polyfilled APIs. And then, using the window.LEGACY flag, we add a script with the desired address to the head of the document.

try {
  // Polyfills check
  if (
    !('IntersectionObserver' in window) ||
    !('Promise' in window) ||
    !('fetch' in window) ||
    !('finally' in Promise.prototype)
  ) {
    throw {};
  }

  // Syntax check
  eval('const a = async ({ ...rest } = {}) => rest; let b = class {};');
  window.LEGACY = false;
} catch (e) {
  window.LEGACY = true;
}
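The second step, adding the script for the chosen assembly, could look roughly like the sketch below. The bundle paths and the function names are hypothetical and only illustrate the idea. The code is deliberately written in ES5, because this loader has to run in legacy browsers too.

```javascript
// Hypothetical bundle addresses; real file names will differ.
function pickBundleUrl(isLegacy) {
  return isLegacy ? "/static/app.es5.js" : "/static/app.es2018.js";
}

// Append a script element for the selected assembly to the document head.
function loadBundle(doc, isLegacy) {
  var script = doc.createElement("script");
  script.src = pickBundleUrl(isLegacy);
  doc.head.appendChild(script);
  return script;
}

// In a browser, after the detection snippet has run:
// loadBundle(document, window.LEGACY);
```

Passing the document as a parameter keeps the function easy to test outside a browser.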

Real metrics

Of course, metrics in a vacuum are a good thing, but what about real metrics? In the end, we deployed a strict separation of ES5 and ES2018 assemblies to production. Here is the difference in Sitespeed.io metrics that we observed between the builds:

First Paint — 13% faster
Page Load Time — 13% faster
Last Visual Change — 8% faster
Total blocking time — 13% less
Speed Index — 9% faster

For the most part, this difference was achieved due to the smaller size of the downloaded files. But in any case, the transition to ES2018 could slightly affect the metrics for the better. And the best part is that this gain was obtained almost without touching the source code.

The end

Thank you for your time. I hope you, like me, were interested in learning about performance, parsing speed, and the metrics obtained.

I highly recommend looking at the benchmark repository. In the article, I described the conclusions only on the performance of assemblies, but in the benchmark I also wanted to look at the difference in the performance of browsers in different OS and architectures. For example, you can find out if Microsoft’s assurances that Edge is faster than Chrome are true or not.

I will also once again give a link to the repository with an example of organizing the separation of assemblies not only JavaScript but also CSS code. And in addition to it, a link to GH pages with the deployment of this assembly.

And that’s it. Write your thoughts in the comments, ask questions. Bye!

In search of the best EcmaScript version for the website assembly was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

In search of the best EcmaScript version for the build

Yoskutik — Thu, 22 Jun 2023 13:51:11 GMT

In search of the best EcmaScript version for building a website

Hello everyone, my name is Dima! I am a Frontend Developer at Mayflower. I recently found out that choosing the ES version for building a web application, as well as organizing the build itself, can be quite a difficult task. Especially if you intend to make this choice based solely on evidence. In this article, I will try to answer the following questions that came up during my investigation of the topic:

How does compiling code to ES5 affect site performance?
Which tool generates the most performant code: TypeScript Compiler, Babel or SWC?
Does modern syntax affect how fast the browser reads JavaScript code?
Is it possible to achieve a real reduction in bundle size, given the use of Brotli or GZIP, by compiling the code to a higher ES version?
Is it really necessary to build sites for ES5 in 2023?
And also how we implemented the transition to a higher ES version, and how our metrics changed.

To answer questions 1–3, I even created a full-fledged benchmark, and I decided to check the fourth question on our real project with a large code base.

Is compiling to ES5 bad?

The features added to EcmaScript every year help developers shrink their code bases and make code more readable. By setting up a build process with compilation and polyfills, developers get to use the latest ES version in their source code.

For those who have forgotten why the build needs to be configured, here is a quick reminder. Take the Array.prototype.at function: it only appeared in ES2022, and Chrome versions below 92 know nothing about it. So if you use it without thinking about backward compatibility, users of old Chrome versions will not be able to use your site fully.

Here are a couple of short examples of ensuring backward compatibility. First, you can add polyfills.

// After adding these imports
import "core-js/modules/es.array.at.js";
import "core-js/modules/es.array.find.js";

// You can use these functions without fear
[1, 2, 3].at(-1);
[1, 2, 3].find(it => it > 2);

And second, you can use a compiler that turns modern-syntax code into code supported by old browsers.

// For example, this code
const sum = (a, b) => a + b;

// Can be turned into this with Babel or another compiler
var sum = function sum(a, b) {
  return a + b;
};

I never really liked the need for this backward compatibility. It means generating extra code, which in turn means a bigger bundle, more memory use, and possibly lower application performance. And all this even though most clients (at least in our case) have a fairly recent browser version, so for them the backward-compatibility machinery may be actively harmful.

That is why I wanted to answer the questions I listed at the start of the article. I decided to begin my research by building a benchmark. The goal: an isolated performance evaluation of features in builds compiled to ES5 by different tools (TypeScript, Babel, SWC), and in a build with no compilation.

The experiment covered only features that require compilation, such as classes or asynchronous functions. I decided not to test features that rely on polyfills, because if the browser already has an implementation of, say, Array.prototype.at, polyfills try not to replace it with their own.
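The guard that makes a polyfill a no-op in modern browsers can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual core-js code: installAtPolyfill is a hypothetical helper, and its at implementation is deliberately simplified.

```javascript
// Hypothetical helper (not part of core-js): installs a simplified `at`
// only when the target prototype does not already provide one.
function installAtPolyfill(proto) {
  if (typeof proto.at === 'function') {
    return false; // an existing implementation is kept untouched
  }
  Object.defineProperty(proto, 'at', {
    value: function at(index) {
      const length = this.length >>> 0;
      const relative = Math.trunc(Number(index)) || 0; // NaN becomes 0
      const k = relative >= 0 ? relative : length + relative;
      return k >= 0 && k < length ? this[k] : undefined;
    },
    writable: true,
    configurable: true,
  });
  return true;
}

// In a browser that already ships Array.prototype.at this call changes nothing.
installAtPolyfill(Array.prototype);
```

Because the native method wins whenever it exists, a benchmark run in a modern browser would mostly measure the browser's own implementation, which is why such features were left out.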

Benchmark description: a test of parsing speed and performance

As I wrote above, I evaluate each tool separately, since the code one tool generates can differ from another's. So for every feature in the benchmark I created builds produced by TypeScript, SWC, and Babel. You might object that ESBuild should be tested too, but at the time of writing it could not generate ES5 code, so I did not consider it.

An example of the difference in generated code:

// This code
const sum = (a = 0, b = 0) => a + b;

// Babel compiles into this
var sum = function sum() {
  var a = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : 0;
  var b = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : 0;
  return a + b;
};

// And TypeScript into this
var sum = function (a, b) {
    if (a === void 0) { a = 0; }
    if (b === void 0) { b = 0; }
    return a + b;
};

Besides these three builds, I created one more, in which the code of the tested feature was left untouched. From here on I will call it the modern build.

I was also curious how different features behave in different browsers. Browsers can have different engines, or at least different sets of optimizations, so benchmark results can vary from one browser to another. To automate metric collection across browsers, I wrote a small HTTP server in Node.js.

Each test opens a generated HTML file N times with a delay between runs. Every run happened in a new browser tab in private mode. On opening the HTML file the browser runs the JavaScript code and, once it finishes, sends the HTTP server a request with the result of that iteration. This way I tried to get metrics that correlate as closely as possible with First Paint, Last Visual Change, and similar metrics.

Visualization of how the benchmark works

I built the benchmark mostly to measure feature performance, but I was also curious about the effect of features on parsing speed. So to evaluate parsing speed I created 4 additional builds, in which I mostly just multiplied the code from the performance builds. Then I simply measured how long the browser needs to read the contents of the script element.

Benchmark results: it's not that simple

We have gradually arrived at the results. For each version of the ES standard and for each syntax feature I made a chart. Each chart shows the code execution speed for every build in every browser. The longest bar on a chart means that build ran fastest.

Be warned: there is a lot of text and charts in this section!

Performance of ES features

ES2015 (ES6)

Arrow functions. As it turns out, there really is a difference in call speed between regular and arrow functions. It shows up only in Chrome, Opera, and other V8 browsers, though, where arrow functions run 15% slower. Apparently, in these browsers tracking the context in which a function was created is harder than using each function's own context.

Test source code.

Classes. This test shows a huge gap between compilers. The modern and TypeScript configurations were faster. The modern configuration is mostly the fastest of all, although Safari did better with TypeScript. Babel and SWC generated code that is 2–3 times slower.

Test source code.

In the default parameters test the outcome is the exact opposite. SWC and Babel show similar results and run fastest. The TypeScript build was the slowest. The modern build is not far ahead of TypeScript, but is still slightly more efficient.

Test source code.

Iteration with for .. of. Again TypeScript breaks all records. Next come the modern build and SWC, with Babel at the end.

Test source code.

Generators. Among the compilers, Babel was fastest. The modern build is less clear-cut: in Safari it beat Babel, but in Firefox it is the slowest. Apparently the Firefox developers have not put much effort into optimizing generators. Setting that browser aside, I would say the modern build shares first place with Babel, with SWC and TypeScript together in second.

Test source code.

The computed object properties test is also ambiguous. Overall, the TypeScript and modern builds are the most performant: TS leads in Firefox and Safari, the modern build in V8 browsers. Judging by the chart Babel was slowest, but I think that is due to some side effect, and in a real project SWC and Babel would perform the same.

Test source code.

The rest parameter test gave very clear results. The fastest configuration is the modern one, the slowest is TypeScript.

Test source code.

Spread operator. The modern build was clearly faster. In Chrome and Opera the difference was as much as 4x. The other configurations were roughly level, although in Firefox TypeScript was slightly slower.

Test source code.

Template strings: again, the modern build was clearly more performant. There is no difference between the builds produced by the different tools.

Test source code.

ES2016

Exponentiation operator. The difference is so small that it is hard to notice. Everything is within the margin of error.

Test source code.

ES2017

Asynchronous functions. The modern build is in first place again. The biggest lead is in Safari, up to 20%. There is a small difference between the other configurations, but no clear conclusion can be drawn: in Chrome and Opera Babel is the slowest build, and in Firefox the fastest.

Test source code.

ES2018

Formally, only 2 syntax features appeared that year: the rest and spread operators for objects. However, I felt 2 tests might not be enough, because the tools generate different code depending on how these operators are used.

Here are links to the playgrounds of the chosen tools if you want to look at the variety of generated code:

Let's start simple. For the rest operator I created two tests: in one I simply copy an object, and in the other I take several properties out of an object.

In the first case the rest operator gave rather interesting results. The browsers seem to split into two camps: Chrome and Opera are optimized for TypeScript's code, the modern build comes next, and Babel and SWC trail behind; in Firefox and Safari the situation is the exact opposite, with TypeScript the slowest and almost no difference between the other builds.

In the second case, in the same Safari and Firefox, the modern configuration blows everyone away. In Opera and Chrome, however, it is the slowest. Among the compilers, TypeScript was again slightly slower than the others.

Now the spread operator. I wrote 4 tests using the spread operator in different configurations. Regardless of how I applied the operator, the results were similar to those for rest: the modern and TS builds are quick in Safari and Firefox, but just as slow in Chrome and Opera.

All the tests show roughly this picture. If you want to see all the results, you can study them in the repository.

ES2018 Bonus

A fun fact I discovered while writing the benchmark. If you have already looked at the test sources, you noticed that I used the values 'a' + i as keys. That was no accident! If you use a number as an object key, then for a reason unknown to me the modern build starts running incredibly fast in Chrome and Opera. Not just faster than the other builds in those browsers, but even faster than in Firefox or Safari, even though those showed their superiority in the tests above.
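To make the two key styles concrete, here is a small sketch. It is a hypothetical reconstruction for illustration only, not the benchmark's actual source; buildObject and the object size of 100 are assumptions.

```javascript
// Hypothetical reconstruction of the two key styles; not the benchmark's real code.
function buildObject(size, makeKey) {
  const obj = {};
  for (let i = 0; i < size; i++) {
    obj[makeKey(i)] = i;
  }
  return obj;
}

const withStringKeys = buildObject(100, (i) => 'a' + i); // keys 'a0', 'a1', ...
const withNumericKeys = buildObject(100, (i) => i);      // keys 0, 1, ... (integer-like)

// The operation under test: copying an object with the spread operator.
const copyOfStrings = { ...withStringKeys };
const copyOfNumbers = { ...withNumericKeys };
```

One possible factor is that V8 stores integer-indexed properties ("elements") separately from named properties, so the two objects take different internal paths. Whether that explains the measured difference was not verified here.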

Test source code.

ES2019

Private class fields. Again an unconditional win for the modern build. TypeScript shows decent results, apart from the Safari tests, but you should not rely on them: unlike the other tools, TypeScript cannot compile private fields to ES5.

Test source code.

ES2020

Nullish coalescing operator. Again an unconditional win for the modern configuration. Babel did worst.

Test source code.

Optional chaining operator. TypeScript did worse than the other builds; otherwise there is no difference.

Test source code.

ES2021

Logical assignment operators. I wanted to check separately how they behave when the assignment happens and when it does not.

In the first case the modern build is slightly worse than the others, and there is no difference between the compilers.

Test source code.

In the second case the modern build, together with TypeScript, outperforms the other builds.

Test source code.

ES2022

Private class methods. The results are the same as in the classes test. TypeScript still cannot use private modifiers in ES5, but in ES6 the ratio of results stays the same.

Test source code.

Parsing speed

The trend of improving parsing speed was popular back in the OptimizeJS era. A lot of time has passed since then: the library's author has marked it as deprecated, and the V8 developers have described the practices it used as harmful. So frontend developers no longer chase a couple of milliseconds here. Neither did I plan to, of course. Still, I was curious whether modern syntax can affect how fast the browser reads JavaScript code.

I ran the test and did get a couple of interesting results. For example, it turned out that Safari reads arrow functions more slowly than regular ones, even though the file with arrow functions is the smallest.

And Firefox takes quite a while to process code with private class fields. Amusingly, it reads private methods without any trouble.

That is where the interesting facts end. In all other cases the results show a clear dependence of time on the number of characters in the generated code, which means that parsing the modern build was the most efficient. If you want to study the results in detail, here is the link.

Benchmark summary

Everything above can be summed up in three main ideas.

First, the modern build has no absolute superiority over ES5 and is often even slower. It is, however, the fastest in most cases.

Second, there is no ideal tool for producing the most performant ES5 code, if only because different browsers have different optimizations. But you can pick the best trade-off for yourself. For example, if your application is packed with generators, Babel is the obvious choice, and if it has a lot of classes, look toward TypeScript.

I would say TypeScript often does better than the other tools. It bothers me, though, that in some tests where it does well in Safari it can show the worst result in Chrome, especially since most users are on Chrome.

And third, we can conclude that not all browsers have paid attention to optimizing modern syntax. Firefox handles generators terribly, Chrome's object spread is far from perfect, and so on. Still, I suspect that if browsers do work on under-the-hood optimizations, they are more likely to focus on modern syntax. So who knows, maybe in a couple of years the modern build will be unambiguously the fastest.

What about file size?

The favorite line of developers who still compile to ES5 goes like this:

“So what's the point of chasing a smaller bundle? Compression will cancel out the difference anyway.”

Let's find out whether they are right.

I decided to check this on my work project, because compression is a fairly complex process, so evaluating each feature separately would not be quite fair.

For the tests I removed polyfills from the build. Then I built our project with each of the tools, compressed the output with GZip and Brotli, and summed the size of the generated application chunks. Here are the results:

           | Raw     | GZip    | Brotli
Modern     | 6.58 MB | 1.79 MB | 1.74 MB
TypeScript | 7.07 MB | 1.82 MB | 1.86 MB
Babel      | 7.71 MB | 1.92 MB | 1.86 MB
SWC        | 7.60 MB | 1.94 MB | 1.86 MB

You may be surprised that for TypeScript Brotli did worse than GZip. That is because I ran Brotli at compression level 2 (the maximum is 11). I chose this level because it is as close as possible to Cloudflare's default settings, and Cloudflare is the CDN we use in our product.

So what do we see? The project really did get smaller: by about 7–15% in raw form and, going by the table, by roughly 2–8% once compressed. How you see that is up to you: for some the difference will be negligible, for others significant. In our company we decided the difference is big enough to try shipping a more modern build to production.
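The relative savings can be checked directly from the table. The sketch below copies the table's figures; the names sizes and savingPercent are mine, introduced only for this illustration.

```javascript
// Sizes in MB, copied from the table above.
const sizes = {
  Modern:     { raw: 6.58, gzip: 1.79, brotli: 1.74 },
  TypeScript: { raw: 7.07, gzip: 1.82, brotli: 1.86 },
  Babel:      { raw: 7.71, gzip: 1.92, brotli: 1.86 },
  SWC:        { raw: 7.60, gzip: 1.94, brotli: 1.86 },
};

// Percentage by which the modern build is smaller than `tool`'s ES5 build.
function savingPercent(tool, format) {
  const es5 = sizes[tool][format];
  const modern = sizes.Modern[format];
  return ((es5 - modern) / es5) * 100;
}

for (const tool of ['TypeScript', 'Babel', 'SWC']) {
  const row = ['raw', 'gzip', 'brotli']
    .map((format) => `${format}: ${savingPercent(tool, format).toFixed(1)}%`)
    .join(', ');
  console.log(`${tool} -> ${row}`);
}
```

With these figures the raw savings land at roughly 7–15%, while the compressed savings are smaller: about 2–8% for GZip and about 6.5% for Brotli at level 2.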

So the modern build scores another win.

The table also shows that, in raw and GZip form, TypeScript generates less code than the other tools.

Do the 4% matter?

A simple conclusion follows from all of the above. Users get a better UX if your product is compiled to a higher ES version. Your web application becomes faster and lighter.

At the same time, keep in mind that according to Browserslist only 96% of users worldwide currently support ES2015, 95% support ES2017, and support for higher versions is lower still.

So the conclusion is:

Situations differ, and if those 4% of users with outdated browsers are not that important to you, it makes more sense to build the site in a recent ES version, for example ES2018.
If they do matter, but your project is not very large or a gain in quality metrics is not very important to you, you can build for ES5. Performance will not suffer critically.
But if you care about both users with outdated browsers and even a slight performance gain, consider creating two builds, modern and ES5, and think through how to deliver the right one to each user. That is exactly what we did in our company.

Our experience with the modern build

The idea of splitting builds in our product appeared long before I joined Mayflower; I just developed it a little. We now build our application twice: one build in ES5 with all the required polyfills, and another in ES2018 with a very limited set of polyfills.

As for why we settled on ES2018: the higher the version of the standard we looked at, the smaller the difference between builds of different versions. We chose ES2018 as the boundary at which 95% of users get a fast site and the advantages of the modern build are used to the fullest. We do not use private class fields, so the only difference between ES2018 and ES2022 is a performance loss with the nullish coalescing operator and, possibly, the logical assignment operators. We can live with that.

Now, how we implemented it. Specifically for this article I created another repository showing how an application build with split builds can be organized. It contains a simplified version of our build. Even so, it shows how to split not only JavaScript builds but CSS as well. If you open the developer tools on the built site, you can see that even on this small project the files shrink by 120 KB, which in my case was 30%. You can try out a deployment of the build from this repository via this link.

If you would rather not look at the repository, here is briefly how we determine on the client which build to download. We simply check whether the browser can handle asynchronous functions and whether it natively has several APIs that would otherwise need polyfills. Then, based on the window.LEGACY flag, we add a script with the right URL to the document head.

try {
  // Polyfills check
  if (
    !('IntersectionObserver' in window) ||
    !('Promise' in window) ||
    !('fetch' in window) ||
    !('finally' in Promise.prototype)
  ) {
    throw {};
  }

  // Syntax check
  eval('const a = async ({ ...rest } = {}) => rest; let b = class {};');
  window.LEGACY = false;
} catch (e) {
  window.LEGACY = true;
}
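The final step, adding the script according to the window.LEGACY flag, might look like the sketch below. The article does not show this part, so the bundle URLs and the names BUNDLES, pickBundleUrl, and loadBundle are hypothetical. The loader is written in ES5 syntax on purpose, since it also has to run in the legacy browsers it detects.

```javascript
// Hypothetical bundle URLs; the real paths are not given in the article.
var BUNDLES = {
  legacy: '/static/app.es5.js',
  modern: '/static/app.es2018.js'
};

// Picks the bundle URL from the flag set by the detection snippet above.
function pickBundleUrl(isLegacy) {
  return isLegacy ? BUNDLES.legacy : BUNDLES.modern;
}

// Appends a <script> element for the chosen bundle to <head>.
// `doc` and `win` are parameters so the function can be exercised outside a browser.
function loadBundle(doc, win) {
  var script = doc.createElement('script');
  script.src = pickBundleUrl(win.LEGACY);
  doc.head.appendChild(script);
  return script;
}
```

In a browser this would be called as loadBundle(document, window) right after the detection code has run.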

Real metrics

Metrics in a vacuum are nice, but what about real ones? In the end we did roll out a hard split into ES5 and ES2018 builds to production. Here is the difference in Sitespeed.io metrics we got between the builds:

First Paint: 13% faster
Page Load Time: 13% faster
Last Visual Change: 8% faster
Cumulative Layout Shift: 42% less
Total Blocking Time: 13% less
Speed Index: 9% faster

For the most part, this difference comes from the smaller size of the downloaded files. Either way, the move to ES2018 did nudge the metrics in the right direction. And the best part is that we got this gain almost without touching the source code.

The end

Thank you for your time. I hope you found it as interesting as I did to learn about performance, parsing speed, and the resulting metrics.

I highly recommend looking at the benchmark repository I mentioned in the article. Besides the charts included here, it also has box-and-whisker plots. I also did not describe every conclusion from the benchmark in the article. For example, I was curious whether browser performance differs depending on architecture and operating system, so I ran it not only on macOS but also on Windows and Android. There I also checked, for example, Microsoft's claims about having the fastest browser by comparing Edge and Chrome.

I will also once again link the repository with an example of splitting builds for both JavaScript and CSS code, plus a link to GH Pages with a deployment of that build.

That's all. Share your thoughts in the comments and ask questions. Bye.

В поисках лучшей версии EcmaScript для сборки was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

Defect density “with a twist”: quality, speed and volume in one QA metric

glebsarkisov — Tue, 04 Apr 2023 07:16:42 GMT

Hi everyone, my name is Gleb! I am Head of QA at Mayflower. Over the last few years I have become interested in QA metrics, especially ones that help find problems in processes, negotiate with the business, show the value of testing for the project, and serve as KPIs.

Working at various companies I saw different approaches to these tasks, and among the many metrics I focused on defect density. After studying it, I customized it and built my own DD “with a twist”. If you are also looking for a metric that accounts for release cleanliness, volume, and speed, my article may be useful to you.

Classically, defect density is the number of defects attributed to a module during an iteration or release, counted per thousand lines of code. The idea is to determine the ratio of defects in your code to its size and gradually reduce it. The idea is admittedly great, but the nuances of adopting the metric can make it rather inconvenient to use.

If your project is written in several languages and has many modules and separate services, the mechanism for calculating this metric will be hard to bolt on.
Interpreting the values can be difficult: relating bugs to the number of lines may seem inconvenient, illogical, or inapplicable to some, for example in outsourced testing, where there may be no access to the code at all but you still want data on its quality.

We want to take the best of this metric and modify it for convenience and informativeness. Starting from the idea and adding team productivity and the criticality of different defects, we can compute defect density “with a twist”: the ratio of production defects of various priorities to the actual value the team delivered in the sprint. This accounts for both the cleanliness of testing within the sprint and the delivery speed, via the delivered volume of tasks and bug fixes. Such an indicator is easy to explain to the business, and it can be tied to the quality of the release process, both for a single team and for the whole product.

Calculating the metric

Defect density “with a twist” can be calculated with the formula:

D: defect density “with a twist”

d: defect of the corresponding priority p1, p2, pn

k: coefficient for the corresponding priority p1, p2, pn

t: ticket (task/bug fix) of the corresponding level of complexity c1, c2, cn

h: coefficient for the corresponding level of complexity c1, c2, cn

The numerator is the weight of production defects; the denominator is the weight of delivered tickets. The lower the value, the better the sprint.
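The formula itself did not survive in this text. Based on the definitions above, it can be reconstructed as the following ratio. The summation indices, and the use of n and m for the number of priority and complexity levels, are notation assumed here rather than taken from the original.

```latex
D = \frac{\sum_{i=1}^{n} k_{p_i} \, d_{p_i}}{\sum_{j=1}^{m} h_{c_j} \, t_{c_j}}
```

In this reading, d_{p_i} is the number of production defects with priority p_i, and t_{c_j} is the number of delivered tickets with complexity level c_j.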

Disclaimer: what follows includes coefficients I picked that work specifically in my case. I suggest each of you find suitable weights, number of complexity levels, and so on by experiment.

How to work with the metric

The process of working with the metric will probably seem obvious to some: collect data, find the “norm”, and change processes to reach a better value. Below I describe each part of the process in more detail.

Finding the current state of the delivery process

To find the current “norm” it is important to collect a meaningful number of values; analyzing at least 10 sprints will do. Naturally, the more sprints you count, the better. You do not have to wait 10 sprints, though: you can calculate the metric for already closed sprints (provided the data is available and developers and QA logged time in tickets).

Defining boundaries and focus

Once you have at least 10 metric values, it makes sense to:

Define an upper bound beyond which you consider the delivery process to be in a bad state. You can use the 100th percentile, add the standard deviation to the 100th percentile, or choose some other way.
Calculate the 95th, 90th, and other percentiles for this set of values to mark the levels you want to reach and hold later (the lower the percentile, the more ambitious the goal).
Speaking of KPIs: these percentiles are exactly what you can choose as the goal for the project team, QA lead, and so on for a period, depending on your setup.
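The bounds and targets described above can be computed with a few lines of code. This is a minimal sketch that assumes the nearest-rank definition of a percentile (other definitions interpolate between values); the sprint values are invented purely for illustration.

```javascript
// Nearest-rank percentile: the smallest value such that at least p% of the
// data is less than or equal to it. This definition is an assumption.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no values');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical metric values for 10 sprints (lower is better).
const sprints = [0.42, 0.31, 0.55, 0.28, 0.61, 0.37, 0.49, 0.33, 0.45, 0.52];

const upperBound = percentile(sprints, 100); // the worst sprint observed so far
const targets = [95, 90, 75].map((p) => ({ p, value: percentile(sprints, p) }));
```

A team could then treat upperBound as the line that must not be crossed and pick one of the targets as the goal for the next period.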

Metrics do not work for you until you start committing to them and changing processes accordingly.

Analysis within a sprint

Obviously, defects of critical priority carry the most weight in the numerator, while poor team productivity (the number of hours logged in tickets delivered to production, from the denominator) worsens the value.

In your case it may be useful to introduce a root-cause analysis for missed critical defects, or to stabilize team productivity through additional hiring or by stabilizing individual team members. Working with the metric consists of forming a hypothesis about the source of the problem, changing the process, and observing the resulting values.

Nuances of adopting the metric

You will have to get your developers and QA used to logging time in tasks accurately and promptly, if they do not already. Also explain to your team that metrics are for measurable results, not for finding someone to blame; this is a classic mindset problem in teams.
Make sure tickets move through your flow and change statuses on time, so that a ticket delivered in the last sprint does not fall out of your sprint sample because it was not moved to delivered / closed / etc. (depending on your flow).

Conclusion

When working with metrics I often found it hard to describe a clear link between one side of the process and another. When project leads or managers came to me and said QA tests slowly, I showed the number of bugs we find before going to production and thought I was clearly explaining the level of quality we provide. At the same time, I wanted to understand for myself how stable my team is across releases, and whether we really consistently deliver this many new features at this level of quality.

Defect density “with a twist” let me capture the volume of releases, their frequency, and their cleanliness in a single number. The metric will not solve your problems for you, but it will help highlight them and show the results of your efforts to fix the process.

A few parting words:

Automate the collection of the metric to remove the human factor from data gathering;
Make the metric's visualization publicly available and explain its mechanics to your team;
If you can influence the team's goal setting, this metric is well suited for setting KPIs for a quarter or reporting period in your company's flow;
Add other metrics that expose flaws in the current process at different stages and in different areas of responsibility: defect leakage, the prod/testing bug ratio, and so on.

Плотность дефектов «со звёздочкой»: качество, скорость и объём в одной QA метрике was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Way to Effectiveness: Release Quality, Volume and Speed in 1 QA Metric

glebsarkisov — Tue, 14 Mar 2023 16:42:11 GMT

Hello everyone, my name is Gleb. I am Head of QA at Mayflower.

During the last few years, I got obsessed with metrics that help to:

Find issues in QA/delivery processes
Get better in negotiations with business owners
Show real benefits of QA for the project
Measure KPIs

While working at various companies, I looked at different approaches to the tasks above. The defect density metric interested me the most. After researching it, I modified the metric and created my own defect density “with a twist”.

If you are looking for a metric that combines quality level, release volume, and release cycle speed in a single number, this article is for you.

The problems of classic defect density

Classic defect density is the number of defects per 1,000 lines of code in a software module or the whole product during an iteration or release. The goal is to find the ratio of defects to code size and then lower it. It sounds good in theory, but rolling out the metric and getting comfortable with it raises some issues.

The problems with classic defect density:

Say your project’s technology stack includes multiple programming languages, various modules, and separate services: it would be hard to come up with a way of calculating the metric for all of them.
Some may find that specific ratio hard to read or interpret, while others may have no access to the code at all but still want to know the level of quality.

The solution

What if we keep only the idea behind the metric and modify it until it is easier and more informative to work with?

A recipe for defect density “with a twist”: it’s the ratio of defects of different priorities in production to the actual user benefit the team delivered during the sprint.

We can take into account both the “cleanliness” of testing during the sprint and the delivery speed through the delivered volume of tasks and bug fixes.

This indicator can be explained to the business, and it can be used as a measure of the quality of the release process — both at the level of the team and at the level of the entire product.

Calculating defect density “with a twist”

Defect density “with a twist” calculation:

D: defect density “with a twist”

d: number of production defects of the corresponding priority p1, p2, …, pn

k: coefficient for the corresponding priority p1, p2, …, pn

t: number of delivered tickets (tasks/bug fixes) of the corresponding level of complexity c1, c2, …, cn

h: coefficient for the corresponding level of complexity c1, c2, …, cn

The numerator is the weight of production defects in the sprint; the denominator is the weight of all tickets delivered in the sprint. The lower the number, the better the sprint.
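
Based on the definitions above, the calculation can be written as a single ratio. The summation indices i and j, and the counts n (priority types) and m (complexity levels), are notation assumed here for illustration:

```latex
D = \frac{\sum_{i=1}^{n} k_{p_i} \, d_{p_i}}{\sum_{j=1}^{m} h_{c_j} \, t_{c_j}}
```

Here d with index p_i is read as the number of production defects of priority p_i, and t with index c_j as the number of delivered tickets of complexity c_j.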

Disclaimer: the approach in the article describes coefficients which I picked for myself — I suggest you define proper coefficients and number of levels of complexity specifically for your case.

Recommendations on numerator

To account for different levels of defect priority, we introduce the concept of a defect priority coefficient.

We divide defects into the 5 priority types that we use (you might have more or fewer types in your priority approach):

p1 for critical
p2 for major
p3 for medium
p4 for minor
p5 for trivial

Critical defects are multiplied by 5, so that these defects feel significant; we are also giving corresponding coefficients k to other types of defects (again, these numbers can be different in your case — the main point here is to add more value to what is important and not to consider different types of defects equal):

kp1 = 5
kp2 = 2
kp3 = 1
kp4 = 0.5
kp5 = 0.1

As a result, our numerator is a weighted sum of the production defects of all priorities found during the sprint: everything we were unable to prevent while testing the product.
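
As a rough illustration, the numerator can be computed as a weighted sum. This is a minimal sketch in Python: the coefficients are the ones listed above, while the function name defect_weight and the sample defect counts are hypothetical.

```python
# Priority coefficients k from the article; adjust them to your own scheme.
PRIORITY_COEFFICIENTS = {"p1": 5, "p2": 2, "p3": 1, "p4": 0.5, "p5": 0.1}


def defect_weight(defect_counts):
    """Weighted sum of production defects found during one sprint."""
    return sum(PRIORITY_COEFFICIENTS[p] * n for p, n in defect_counts.items())


# Hypothetical sprint: 1 critical, 2 medium and 4 trivial production defects.
sprint_defects = {"p1": 1, "p3": 2, "p5": 4}
numerator = defect_weight(sprint_defects)  # 5*1 + 1*2 + 0.1*4
```

In this example a single critical defect outweighs all the other defects combined, which is the effect the coefficients are meant to produce.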

Recommendations on denominator

You might want to apply the same “different priority with values” logic for tasks and bug fixes released during the sprint. Here are a few reasons not to do it.

Don’t confuse priority with severity for product tasks. If you count tasks by priority, the final number depends heavily on a product manager’s decision. For example, a task can be marked with critical priority simply because PMs want to have it done asap. Yes, combining severity and priority sounds unprofessional, but let’s face it: this is a very common thing in IT companies.

More importantly, we want to count all the effort invested in development and testing in the same formula. Different tickets take different amounts of time, and that is what I call levels of complexity c in the denominator. For instance, the tickets that cost us the most effort can be marked as Extreme complexity, cheaper tickets as Huge, Long, and so on. How do we define these levels of complexity?

Let’s look at all tickets delivered over the last 4 sprints: tasks, bug fixes, etc. Slice them into 5 percentile layers based on how much time developers and QA logged in each ticket:

100p — Extreme complexity, c1
95p — Huge complexity, c2
90p — Long complexity, c3
75p — Normal complexity, c4
50p — Quick complexity, c5

Note: You can take as many sprints of data as you want and as many percentile layers as you need for more levels of complexity.

Once you have the percentiles for each of the 4 sprints, calculate the arithmetic mean of each percentile layer across the 4 sprints. This gives you the limits for every complexity level.
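
The percentile step can be sketched as follows. This is an illustration under assumptions, not a prescribed implementation: it uses the nearest-rank percentile method (other methods interpolate and give slightly different limits), treats each averaged percentile as the upper limit of its level, and uses two hypothetical sprints instead of four to keep the data short. The names percentile, complexity_limits and classify are made up for this example.

```python
import math
from statistics import mean

# Percentile layers from the article, ordered from the cheapest level up.
LAYERS = {"c5": 50, "c4": 75, "c3": 90, "c2": 95, "c1": 100}


def percentile(values, p):
    """Nearest-rank percentile (an assumed method; others interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def complexity_limits(sprints):
    """Average each percentile layer across sprints: upper limits in hours."""
    return {
        level: mean(percentile(hours, p) for hours in sprints)
        for level, p in LAYERS.items()
    }


def classify(hours, limits):
    """Return the first (cheapest) level whose limit covers the logged hours."""
    for level in LAYERS:
        if hours <= limits[level]:
            return level
    return "c1"  # above every averaged limit: treat as Extreme


# Hypothetical logged hours per delivered ticket, two sprints for brevity.
sprint_a = list(range(1, 21))  # 20 tickets, 1..20 hours
sprint_b = list(range(3, 23))  # 20 tickets, 3..22 hours
limits = complexity_limits([sprint_a, sprint_b])
level = classify(17, limits)  # 17 hours falls under the c3 (Long) limit
```

With this sample data the limits come out as 11, 16, 19, 20 and 21 hours for c5 through c1. A ticket with more logged time than the averaged 100th percentile is still counted as Extreme here, which is a design choice of the sketch.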

We also want to establish the value of the tickets delivered to production by multiplying the number of tickets of a certain complexity level by the coefficient h of that complexity level. You can define these coefficients yourself, for example:

hc1 = 8
hc2 = 5
hc3 = 3
hc4 = 2
hc5 = 1

Again, it is fine if your coefficients are chosen empirically: they will still let us rank tickets by weight.

Our denominator sums up the value of all delivered tickets during the sprint — every bit of user benefit.
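
Putting the denominator and the final ratio together might look like the sketch below. The coefficients are the example values listed above; the function names, the sample ticket counts and the defect weight of 7 are hypothetical.

```python
# Complexity coefficients h from the article; tune them for your team.
COMPLEXITY_COEFFICIENTS = {"c1": 8, "c2": 5, "c3": 3, "c4": 2, "c5": 1}


def delivered_weight(ticket_counts):
    """Weighted sum of tickets delivered in one sprint, by complexity level."""
    return sum(COMPLEXITY_COEFFICIENTS[c] * n for c, n in ticket_counts.items())


def defect_density(defect_weight, ticket_counts):
    """Ratio of production defect weight to delivered ticket weight."""
    denominator = delivered_weight(ticket_counts)
    if denominator == 0:
        raise ValueError("nothing was delivered in this sprint")
    return defect_weight / denominator


# Hypothetical sprint: 1 Extreme, 4 Long and 8 Quick tickets delivered.
sprint_tickets = {"c1": 1, "c3": 4, "c5": 8}
denominator = delivered_weight(sprint_tickets)  # 8*1 + 3*4 + 1*8 = 28
# Hypothetical defect weight of 7 (e.g. one critical and one major defect).
density = defect_density(7, sprint_tickets)  # 7 / 28 = 0.25
```

A sprint with no delivered tickets has no meaningful density, so the sketch raises an error instead of dividing by zero.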

Now that we have the values and we’ve calculated our defect density “with a twist”, let’s talk about how to understand and treat that number.

How to work with the metric

Working with the metric might look obvious to some. You will need to collect data, find your current state of things, change processes so that you have a better value of the metric. I will try to decompose the process and define its stages.

Finding the current state of things

We need to collect a significant amount of data: an analysis of at least 10 sprints should work. It goes without saying that the more sprints you analyze, the better. Note that you do not have to wait for 10 more sprints: you can look into already closed sprints, provided their tickets have time logged by devs and QA.

Finding control limits and focus area

As soon as you have at least 10 values for the metric, it makes sense:

To define the upper control limit, so that when the metric crosses it you know your delivery process is in trouble. You can use the 100th percentile of your collected metric values, or the 100th percentile plus one standard deviation;
To calculate the 95th and 90th percentiles of your metric values and use them as target levels for the delivery process (the lower the percentile target, the more ambitious the goal).
The target can also serve as a KPI for a QA lead or the whole development team.
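
The limits above can be derived from the collected values roughly like this. The sketch assumes Python's statistics module, the sample standard deviation, and interpolated percentiles (the "inclusive" method); the function name control_levels and the ten metric values are hypothetical.

```python
from statistics import quantiles, stdev


def control_levels(values, add_stdev=False):
    """Upper control limit and percentile targets from past metric values."""
    if len(values) < 10:
        raise ValueError("collect at least 10 sprint values first")
    # 99 cut points: cuts[i - 1] is the i-th percentile of the data.
    cuts = quantiles(values, n=100, method="inclusive")
    upper = max(values)  # the 100th percentile
    if add_stdev:
        upper += stdev(values)  # looser limit: maximum plus one deviation
    return {
        "upper_control_limit": upper,
        "target_p95": cuts[94],
        "target_p90": cuts[89],
    }


# Hypothetical metric values for 10 closed sprints.
history = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.30, 0.40]
levels = control_levels(history)
loose = control_levels(history, add_stdev=True)
```

Whether to add the standard deviation depends on how much headroom you want before the limit counts as crossed; with only 10 values, a single bad sprint dominates both variants.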

Metrics cannot help you until you commit and change your processes accordingly.

Analysis during the sprint

Critical-priority defects contribute the most to your numerator. Poor team performance (measured through the hours logged in delivered tickets) lowers the denominator and will also worsen your metric.

In such cases, it makes sense to analyze why critical issues are slipping through, or to stabilize the team’s performance by hiring more engineers or optimizing your requirements. Come up with a hypothesis for the origin of the problem, change the process, monitor the outcome: you know what to do.

Some things to note while implementing the metric

You need to instruct your devs and QA on how to properly log the work time into the tickets in case they are not doing that already. Make sure to explain to your team that you need metrics to have a measurable result, not to look for the guilty ones.
Ensure the tickets go through your workflow and change statuses on time. You do not want to have a ticket which for example was delivered to production last sprint, but was not caught by your sprint query just because it was not transitioned to delivered / closed status.

Conclusion

When working with metrics, I always found it hard to see a direct connection between one part of a process and another. When other managers told me that my QAs were testing too slowly, I used to show the number of bugs we found before rolling out to production; I thought this was self-explanatory from the point of view of quality. At the same time, I really wanted to know how stable the pace of releases was, and whether we were delivering the expected amount of new features at the expected quality.

Defect density “with a twist” lets me see everything combined in one value — the release volume, the quality and the speed. No single metric is a solution to all of your problems, but whatever might help you identify them and monitor the changes applied comes in handy.

A few things to think over:

Automate metrics collection, so that there is no human error
Make the metric transparent to every member in the team by visualizing it on a dashboard
If you are in charge of the team’s goals, the metric can be used as one of the KPIs for the quarter/period
Experiment with other metrics also — defect leakage, prod/test bugs ratio, etc. — whatever helps you make the state of things visible to everyone, define issues faster and fix them

P.S. Kudos to Rita Kind-Envy for editing!

A Way to Effectiveness: Release Quality, Volume and Speed in 1 QA Metric was originally published in Mayflower team on Medium, where people are continuing the conversation by highlighting and responding to this story.