README
Segment Tagging
How to: Processes that can be done by any dev
Adding an event/view or activating an existing property for an event/view can be partially done by any developer. The flow consists of modifying the views_and_events.csv
file and then submitting it to Team Data to verify and commit the changes into the tagging plan.
1. Add new event/view
This section describes how to add a new event/view, assuming no new properties. If the new event/view also contains new properties, you will need to ask Team Data to do it.
Adding a new event/view (a view is a page on web or a screen on iOS and Android) requires you to modify the views_and_events.csv
file.
- Add the new event/view as a row at the end of the file.
- Some of the columns represent properties. The value "excluded" indicates that this property should never be set for the corresponding event/view. The value "obligatory" indicates that you must attempt to set this property. You must decide the excluded and obligatory properties of the new event/view yourself. Often the new event/view occurs in the same data-context as an existing event/view. For example, the events Item Reply Stated and Item Reply Submitted both happen in the data-context of an ad. Therefore they should have the same obligatory properties, like itemRegion, itemLastPublishedOn, and similar. If your new event shares this data-context with an existing event, you can simply copy over the excluded/obligatory values without going over them one by one, which saves time.
- Note that non-property columns still need a review even when the new event shares property settings with an existing event. For example, if the new event applies only to iOS, then the web, backend, and android columns should be set to false.
- Inform Team Data of your changes and send them the CSV file with the name of the views/events you added, and they will take care of the following steps.
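The copy-from-an-existing-event approach above can be sketched in a few lines of Python. This is only an illustration under assumptions: the column name "name", the platform column names, and the "true"/"false" spelling are hypothetical and must be checked against the real header of views_and_events.csv.

```python
import csv
import io

def add_event_like(csv_text, new_name, template_name, platforms, name_column="name"):
    """Append a row for new_name that copies template_name's property settings.

    name_column and the platform column names below are assumptions, not the
    real layout of views_and_events.csv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows = list(reader)
    template = next(r for r in rows if r[name_column] == template_name)
    new_row = dict(template)  # copies the excluded/obligatory values as they are
    new_row[name_column] = new_name
    # Non-property columns still need a review, so set the platform flags explicitly.
    for platform in ("ios", "android", "web", "backend"):
        new_row[platform] = "true" if platform in platforms else "false"
    rows.append(new_row)  # the new event/view goes at the end of the file
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Made-up sample content, only to show the mechanics.
sample = (
    "name,ios,android,web,backend,itemRegion,itemLastPublishedOn\n"
    "Item Reply Submitted,true,true,true,false,obligatory,obligatory\n"
)
updated = add_event_like(sample, "My Event", "Item Reply Submitted", platforms={"ios"})
print(updated)
```

The result still has to be sent to Team Data for validation, as described above.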
Warning: The steps below are for Team Data, after the modified CSV is received and its changes are validated:
- Create a new branch: git checkout -b new-events --track origin/new-events
- Run update-tagging-from-csv.py views_and_events.csv to update the taggingdb. Note that you can pass any CSV file as the argument.
- Change directory to taggingdb and run backup.sh to back up the database.
- Run regenerate-master-files.py to regenerate the JSON master files (views_and_events.csv will be regenerated as well, now including the new event/view).
- Make a single commit with the message "Add new event My Event" and, in the summary section of the commit message, provide a more detailed description.
- Create a pull request and add Team Data members as reviewers (so that they get a notification).
2. Enable a property for an event
- Open the file views_and_events.csv and, in the relevant row, set "obligatory" in the column corresponding to the property that must be enabled.
- Inform Team Data of your changes and send them the CSV file with the names of the views/events you modified, and they will take care of the following steps.
Warning: The steps below are for Team Data, after the modified CSV is received and its changes are validated:
- Run update-tagging-from-csv.py views_and_events.csv to update the taggingdb. Note that you can pass any CSV file as the argument.
- Change directory to taggingdb and run backup.sh to back up the database.
- Run regenerate-master-files.py to regenerate the JSON master files (views_and_events.csv will be regenerated as well).
- Make a single commit with the message "Enable prop myProp for 2 events" and, in the summary section of the commit message, provide a more detailed description, including the event names.
How to: Processes that need to be done by Team Data
1. Add a new property
To add a new property you need to use taggingdb for now. Here are the steps for adding the new property suggestedSubCategory; they apply to any other new property.
- Insert the new property into table public.properties:
  insert into properties (property, data_type, alive, reserved, documentation) values ('suggestedSubCategory', 'text', true, false, 'The sub-category that was suggested during item insertion.');
- Exclude this new property from all events and views:
  insert into excluded_properties_views (property, view) select 'suggestedSubCategory', view from views;
  insert into excluded_properties_events (property, event) select 'suggestedSubCategory', event from events;
- If the property has fixed values, insert these into table public.property_values. In the case of the suggestedSubCategory property, we know it has the same fixed values as itemSubCategory, so the query becomes:
  insert into property_values (property, value, alive) select 'suggestedSubCategory', value, true from property_values where property = 'itemSubCategory';
  If instead you need to insert property values one by one, the query becomes:
  insert into property_values (property, value, alive) values ('suggestedSubCategory', 'Cats', true);
  Then replace "Cats" with the next property value, and so on.
- Back up the database with the script taggingdb/backup.sh (run from within the taggingdb directory).
- Regenerate master files with regenerate-master-files.py.
- Make a single commit with the message "Add new prop suggestedSubCategory".
2. Rename an event/view
Renaming is not supported because we use the event/view name as the natural primary key. To rename, first delete the event/view and commit, then create the new event/view and commit again.
3. Delete an event/view
To delete an event/view you need to use taggingdb. Log in to taggingdb and perform the following steps:
- Delete the rows corresponding to the event/view to be deleted from excluded_properties_events or excluded_properties_views, e.g. delete from excluded_properties_events where event = 'My Redundant Event';
- Now that the event/view to be deleted is no longer referenced in any other table, delete it from events or views, e.g. delete from views where view = 'My Redundant View';
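The reason for the ordering of the two delete steps can be tried out locally. The sketch below uses an in-memory SQLite database instead of the real PostgreSQL taggingdb, and a simplified, assumed schema (only the table and column names seen in the queries above are taken from this README).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    create table views (view text primary key);
    create table excluded_properties_views (
        property text not null,
        view text not null references views (view)
    );
    insert into views values ('My Redundant View');
    insert into excluded_properties_views values ('itemRegion', 'My Redundant View');
""")

name = "My Redundant View"

# Deleting the view first fails because it is still referenced.
try:
    conn.execute("delete from views where view = ?", (name,))
    deleted_too_early = True
except sqlite3.IntegrityError:
    deleted_too_early = False

# Correct order: remove the referencing rows, then the view itself.
conn.execute("delete from excluded_properties_views where view = ?", (name,))
conn.execute("delete from views where view = ?", (name,))
conn.commit()

remaining = conn.execute("select count(*) from views").fetchone()[0]
print(deleted_too_early, remaining)
```

The real taggingdb may have further tables that reference events and views, so treat this only as an illustration of the principle.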
Adding custom dimension to Google Analytics
This section describes how to add a custom dimension to Google Analytics.
- Go into Google Analytics and determine the next available numeric ID for the new custom dimension.
  - Go to any GA property, e.g. Web, then go to the Admin section, then to Property Settings, Custom Definitions, then Custom Dimensions.
  - Go to the last page of the table showing the custom dimensions.
  - The numeric ID of the new custom dimension will be the index number of the last existing dimension plus one.
  - Suppose the index number of the last dimension is 44. This means 45 will be the numeric ID of the new custom dimension.
- Insert a row into table custom_dimension_ids in taggingdb.
  - Run insert into custom_dimension_ids (custom_dimension_id, property) values (45, 'someProperty');
  - Note that the field custom_dimension_ids.property is a foreign key to the properties.property field.
  - Back up the taggingdb database using the taggingdb/manage.sh backup script and commit the changes.
- In Segment, go to the Google Analytics destination configuration page of each source: ANDROID - DEV, ANDROID - LIVE, IOS - DEV, IOS - LIVE, WEB - DEV, WEB - LIVE. On the configuration page, add the new custom dimension in the "Custom Dimensions" section.
- Go back to Google Analytics. Add the custom dimension for each of these properties: Android, Android (dev), iOS, iOS (dev), Web, and Web (dev). The custom dimension scope is "hit".
- Also add the custom dimension for the rollup properties Android + iOS + Web and Android + iOS + Web (dev). For the rollups there is an extra step: you name the custom dimension and then select, from the dropdown of existing custom dimensions defined in the underlying properties, the dimension it refers to. The rollup dimension and the dimension it refers to should always have the same name. That is, if you name the custom dimension in the rollup property myDim, then it should reference the myDim custom dimension from all three underlying properties. It all sounds complicated, but it is fairly simple once you see the page where you add the custom dimension.
Adding a custom metric to Google Analytics
To add a custom metric instead of a custom dimension to GA, follow the instructions for adding a custom dimension, but use the GA section Custom Definitions / Custom Metrics and the taggingdb table custom_metric_ids.
Design
The tagging plan lives in a relational database, PostgreSQL. See the taggingdb
directory. It is easy to run a local copy of the tagging database, assuming a PostgreSQL cluster is already set up locally.
The platform developers (Android, iOS, web) are given JSON master files that are built using this database. Using these master files to implement the tagging on a platform drastically reduces the probability of the following issues:
- Typos in names of events, properties, pages, and screens.
- Typos in values of properties that take on values from some finite set, e.g. {let, buy, rent, sell}.
- Sudden appearance of new names of events, screens, pages, and properties not present in tagging database.
- Wrong property data types, e.g. putting "twelve" into an integer-valued property.
- Omitting obligatory properties and traits.
But there is no silver bullet. The following issues are still just as likely to occur:
- Failing to fire an event/screen/page
- Firing an event/page/screen multiple times instead of only once
Terminology
view: a page or a screen.
action: an event or a view.
tagging: deciding which events, views, properties, and traits need to be tracked and creating names for them.
item: an ad (the one created by a user).
Concepts
- Segment is a data pipeline; it has no analytics capabilities or similar. All it does is route data from point A (source) to point B (destination), which is exactly what we use it for. This relieves the data team from building connections between the tools we use.
- Segment is a tracking tool that replaces the need to put any other tags into code. Segment can forward the tracked events, pageviews, and user profiling to any supported tool. This relieves the developers from touching the code when we need a new tool.
Tagging process
We assume this repo is used by two roles: tagging creators, and tagging implementers.
Tagging creators create and maintain events/pages/screens. As of February 2018 we have as tagging creators Cliff and Dmitrii.
Tagging implementers are the platform developers: Jakub, Cip, Dmitry, Marko. Tagging implementers have read-only access to the tagging. One of the reasons for keeping the creator and implementer roles separate is that in the past, when we allowed tagging implementers to implement new events, we ended up with different names for the same events/pages/screens across platforms. This is because tagging implementers are naturally platform-myopic. Tagging creators, on the other hand, are aware of all the events/views/pages across all platforms and are therefore in the best position to tag them.
Anyone at tutti.ch (PMs, Marketing, Sales, CEO, Happiness, Devs) can request tagging of an event, view or page, or identify. Every request will be answered, but not every request will result in a tag being assigned.
Master files
Each platform has a corresponding master file in JSON format: tagging-ios.json, tagging-android.json, and tagging-web.json. The master file contains definitions of events, views, properties, and traits. On top of the definitions, the master file contains metadata, like API keys, lookup tables, version, and documentation.
Tagging is evolving with time. Things get added, deleted, renamed, etc. Therefore there can be multiple versions of each master file. We use semantic versioning for the master files, for example:
- Schema of the master file changed. In this case we bump the major: 1.0.0 -> 2.0.0.
- New keys are added (for example new events), we bump the minor: 1.0.0 -> 1.1.0.
- Lookup values changed (e.g. category code 1220 referred to cats, and is now dogs), we bump the patch number: 1.0.0 -> 1.0.1.
The versions are tagged using annotated git tags. To access version 1.1.0 of the iOS master file, you would go to https://github.com/tutti-ch/segment-tagging/blob/1.1.0/tagging-ios.json
.
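The versioning rules above can be written down as a small helper. This is a minimal sketch; the change labels "schema", "new_keys" and "lookup_values" are names made up for this example, and the URL pattern is simply the one shown above.

```python
def bump(version, change):
    """Return the next master file version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "schema":          # schema of the master file changed
        return f"{major + 1}.0.0"
    if change == "new_keys":        # e.g. new events were added
        return f"{major}.{minor + 1}.0"
    if change == "lookup_values":   # e.g. a category code now means something else
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown kind of change: {change}")

def master_file_url(version, platform):
    """Build the GitHub URL of a tagged master file (platform: ios, android, web)."""
    return (
        "https://github.com/tutti-ch/segment-tagging/blob/"
        f"{version}/tagging-{platform}.json"
    )

print(bump("1.0.0", "new_keys"))
print(master_file_url("1.1.0", "ios"))
```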
Naming convention
Names for events and views should adhere to the object-action framework.
Names of events, views, properties, and traits are formatted as follows:
- Names of events and views consist of two or more words, with each word capitalized. E.g.: Item Detail, Item Shared Through Facebook.
- Properties and traits are a single token in camel case, e.g.: itemListMinPrice, name.
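The two formatting rules above are simple enough to check mechanically. The following is a sketch of such a check, not something that exists in this repo; the regular expressions are one possible reading of "capitalized words" and "camel case" (for instance, they allow digits inside a name, which the convention does not mention).

```python
import re

# Two or more space-separated words, each starting with an uppercase letter.
EVENT_OR_VIEW_NAME = re.compile(r"[A-Z][A-Za-z0-9]*( [A-Z][A-Za-z0-9]*)+")
# A single camel-case token starting with a lowercase letter.
PROPERTY_OR_TRAIT_NAME = re.compile(r"[a-z][A-Za-z0-9]*")

def is_valid_event_or_view_name(name):
    return EVENT_OR_VIEW_NAME.fullmatch(name) is not None

def is_valid_property_or_trait_name(name):
    return PROPERTY_OR_TRAIT_NAME.fullmatch(name) is not None

print(is_valid_event_or_view_name("Item Shared Through Facebook"))
print(is_valid_property_or_trait_name("itemListMinPrice"))
```

The object-action part of the convention is a matter of wording and cannot be verified by a pattern like this.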
Properties
The master files list the expected properties for each event and view. The tagging implementer should have logic that attempts to set each listed property. If setting a listed property is not possible, then this property should be set to null or not set at all (which is considered to be the same).
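As an illustration of the rule above, here is a minimal sketch of how an implementer might build the properties for one event. The function name and the shape of its inputs are assumptions made for this example; the real master file format is not shown in this README.

```python
def build_properties(expected, available):
    """Attempt to set every expected property; leave out those without a value.

    expected:  property names listed for the event/view (assumed to come from
               the master file)
    available: values the platform currently knows, keyed by property name
    """
    properties = {}
    for name in expected:
        value = available.get(name)
        if value is not None:  # null and "not set at all" are treated the same
            properties[name] = value
    return properties

expected = ["itemRegion", "itemLastPublishedOn"]
available = {"itemRegion": "bern", "itemLastPublishedOn": None, "notListed": 1}
print(build_properties(expected, available))
```

Properties that are not listed for the event are never sent, even if a value happens to be available.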
Q & A
Q: Can we have ignored IPs?
A: ???
Q: How to deal with multi-valued properties, e.g. color = [blue, red]?
A: Brantley suggested just passing an array, say ['blue', 'red'], which will get converted to the string "blue,red" or similar in Redshift. Alternatively we could pass a string '["blue", "red"]' in JSON format, which will arrive in Redshift in the same format. In the end both approaches result in text columns in Redshift that are easy to parse: either with split_part or json_extract_array_element_text. I would advise passing multi-valued attributes as valid JSON strings. There may always be a destination that allows you to parse JSON in a text field into an array, but it is less likely that a destination will allow you to parse delimited text into an array.
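A short sketch of the recommended approach, using only the Python standard library. The property name color is taken from the question; how the value is then handed to a Segment SDK is left out on purpose. Note that valid JSON uses double quotes around strings.

```python
import json

colors = ["blue", "red"]

# Serialize the list once, at the point where the property is set.
color_property = json.dumps(colors)
print(color_property)

# Any consumer that understands JSON can get the list back.
restored = json.loads(color_property)
```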
Q: The iOS SDK has a flushAt parameter. Unless the number of events/screens reaches this value, they will not be sent to Segment. How do we avoid receiving events/views 3 months after the fact because the user did not reach flushAt events during their previous session(s)?
A: Brantley will enquire and report back.
Q: In GA > Audience > Geo > Language, the language is not detected. Why?
A: Brantley mentioned that this is because we are using GA in cloud mode. We should try device mode.
Q: What happens in Redshift when you keep sending itemPrice as integer property then suddenly send it a text value?
A: Unknown value types will be dropped (=Null). If we change it long-term, we shall tell Segment, so they can replay the data in the new type.
Q: Are we allowed to compress the POST payload? E.g. will Segment libraries generate a Content-Encoding: gzip header under the hood?
A: Nope. Only uncompressed. (client: max. 15kb | server: max. 32kb)
Q: What is the difference between saying "Redshift": true and leaving out Redshift entirely from the integrations object?
A: Doesn't change anything in the Redshift case. All other destinations will respond to this as expected. Redshift needs to be managed in the interface (--> selective sync).
Q: Is there an API-way to manage destinations? Especially GA custom dimensions?
A: No.
Q: When you have two Redshift destinations, how can you refer to a specific one in the integrations object?
A: You can't. See above, the integrations object has no effect on Redshift. Manage selective sync in the GUI.
Q: For the integration specification sent through the API to have any effect, must there be a dark grey line from the source to the integration (aka destination) in the Dashboard?
A: Need to ask Segment.
Q: What happens when wrong integration name is provided in the API call?
A: It fails. Ensure that you use the exact casing seen here - https://segment.com/docs/destinations/
Q: Events are assumed to be triggered by users. What about non-user-triggered events? For example, we would like to have an event "Marked for NPS survey".
A: You could use a dedicated source for a job that sends only a subset to Delighted. For example, a Python source that sends events of these users only, based on whatever criteria.
Q: Are we able to set app version using Analytics.js?
A: Yes. Apps set it automatically, web can set it. However, it's better to add a property, since not all destinations support context mappings. That way we'd always have it available. Probably best to set both (context app.version and a custom property).
Q: Is the new_visitor property tracked automatically by Segment?
A: Need to ask Segment.
Q: What are possible values for the item.paramters?
A: Need to ask backend devs.
Q: How is the item.highlight flag defined? Is it true when the highlight applies?
A: Need to ask backend devs.
Q: How is item.epoch_time defined? Is it list time? Creation time?
A: The epoch_time is defined as the latest list_time of an ad. Note that only published ads can have a list_time.
Q: What is the payload size limit for a Segment API call?
A: The limit is 25 KB or 30 KB, Segment's Brantley was not sure. This limit is not enforced in the SDK but on Segment's backend. Most likely you will get an HTTP error, which is probably handled and transformed by the Segment SDK.
Q: Should we include personal data like name, phone number, address, city, and email in the properties?
A: Segment informed us that they will be GDPR compliant before the deadline. Segment told us it is safe to send them personal data and that many of their clients already do so. Segment will allow for easy data deletion, as required by GDPR. Segment will not pass private individuals' data to Google Analytics, a special case because Google does not want that data.
Q: Is there a limit on the number of properties per API call?
A: No, but there is a payload limit. See another question.
Q: How are properties that are null treated? Is it treated the same as simply not passing the property?
A: Segment told us that nulls do not need to be sent. However, if we always send null, no corresponding Redshift table column will be created until at least one non-null value is sent to Segment.
Q: Are nested properties allowed?
A: Nested properties are allowed, but how they are treated depends on the destination. For Redshift as destination, arrays will be stringified (e.g. ['a', 'b', 'c'] will be put into a single varchar column), while nested objects will be flattened, with column names built from the keys of all levels separated by underscores. For example {'a': {'foo': 'bar'}} will be a column a_foo with value bar in Redshift.
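To get a feeling for what ends up in Redshift, the flattening described in this answer can be imitated locally. This is only a sketch of the described behaviour, not Segment's actual implementation; in particular, the exact text Segment produces for stringified arrays is an assumption here (JSON is used).

```python
import json

def flatten(properties, prefix=""):
    """Imitate the described Redshift treatment of nested properties."""
    columns = {}
    for key, value in properties.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Nested objects: build the column name from the keys of all levels.
            columns.update(flatten(value, name))
        elif isinstance(value, list):
            # Arrays: stored as text in a single column.
            columns[name] = json.dumps(value)
        else:
            columns[name] = value
    return columns

print(flatten({"a": {"foo": "bar"}, "tags": ["a", "b", "c"]}))
```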
Q: Does Segment cover all of GA or just subset? Anything to look out for?
A: Segment covers most of the functionality of Google Analytics. The coverage depends on whether you send the "specced" events like Product Viewed, which will show up in the Ecommerce section in GA. For functionality not covered, a feature request can be made.
Q: Can you override common fields?
A: You can override common fields. For client-side sources, though, this should not be necessary. For data originating on the backend you may want to change the timestamp common field to send data from the past. Note that Google Analytics does not accept data with timestamps older than 4 hours.
Q: An anonymous user views 3 pages and then logs in. Will their 3 page views get the updated user ID of the logged-in user?
A: Yes, as per our call with Segment.
Q: Is it possible to update events and pages after the fact?
A: No, at least not in a programmatic way. Updating would be possible in special cases by contacting Segment and explaining the issue.
Q: Is it possible to delete data in Segment after the fact?
A: No, at least not in a programmatic way. Deleting would be possible in special cases by contacting Segment and explaining the issue.
Q: What is Segment's opinion on using the track method to log tutti's releases so that we can visualize the effect of releases on visits, sessions, ad insertions, etc.? We would also need to track these events under some special user ID.
A: We asked Segment but they are not aware of this. They will ask around internally and we may need to ask again.
Q: Does Segment identify presence of ad blockers?
A: No. Detecting ad blocker usage has to be done by clients.
Random mumbles
- Would it be better to have default values like "Unknown" instead of null for properties? Need to think this one through carefully.
- Marko suggested adding error messages to the master file. First we will need to define failure modes (e.g. non-existent property being set, non-existent event name, wrong property data type).
- We assume that if a property is not available on one platform, it is not available on any other platform. Is that perhaps something we should not assume?
- If my experience counts for anything, then we should think about pages we want to tag as a connected graph.
- Connected, undirected, graph implies that we should not have any numbering in page names like "ad insertion page 1", "ad insertion page 2". Such naming would be more appropriate if we could think of tagging as a tree (data structure).
- The page method takes a category as its first argument.
  - What do we put there? Level-2 site? If we put level-2 site there, then we should be aware that there will be funnels that span multiple categories. Is that OK?
  - Should we think of the category as a partition of the connected graph? In other words, can a page, say premium features selection, belong to more than one category, say "ad insertions" and "promote the ad"?
- Creating the JSON tagging file by hand is tedious and error prone. It could result in errors, other than typos, that could go undetected, even when reviewed by another person.
- Ideally we should generate JSON tagging file programmatically, with tagging stored in a relational database. An RDBMS provides extra layer of data quality through primary keys, foreign keys, triggers, and what not.
- Do we make properties nullable or do we specify exactly the reason for the absence of value? E.g. region property could be "unknown", "not applicable", or "indeterminable" rather than just null.
- For properties that are like enumerations (region, category), should the Segment SDK wrappers create callables? E.g. to populate region Bern into a property, call region('bern') or region.bern. We want the call to fail if an invalid property value is used. Avoid storing garbage.
  - Failures should be monitored with tools like Sentry.
  - Ideally some invalid tagging should already be detected at compile time, though some values are only known at runtime, like the ad list id on the VI or the region on the LI.
  - So there will be errors which emerge only at runtime.
- The tagging plan should be created with the thought that it is going to be mutated constantly. So deleting, updating, and adding new tags should be a smooth sailing.
- Should we use site-wide unique page names or is it OK just to have uniqueness within each category? E.g. a start page in category ad_insertion and in category ad_promotion may cause confusion in the tools downstream if these tools disregard the category. Ignoring the category will result in two distinct pages under the single name start.
  - If we use site-wide unique page names, do we use a common prefix like ai for related pages? What about a suffix instead, for more user-friendliness?
- How do we become aware automatically of new properties being added to tutti API that we can subsequently pass on to Segment?
- Assume by default that all properties must be passed for each pagescreen or event. Then provide a list of excluded properties. This way it is much harder to forget to include a property because that would require you to consciously add it to the excluded list. Having a list for each pagescreen that gives the included properties is more dangerous because it is easier to forget to add a property to the list (requires no conscious effort).
- Since we are going for unique pagescreen and event names site-wide, not just category-wide, pages from the same coherent flow will inevitably end up with common parts, like the "ai" in ai_start_page, ai_preview_page. What if we made the prefix more distinct, like "[ai]_start_page", "[ai]_preview_page"? The names look a bit uglier this way, though.
- The backend gives back category id, parent category id, and the names. How do we implement same names on backend and what we have in DWH? Do we assume backend category names are the truth and rename the ad_categories_map in DWH to comply with that? As far as I remember the backend category names contains some minor errors. Alternatively we could provide a category code to name map in the tagging.json.
- The object location_info contains keys area and subarea. What are those?
- What happens if Segment is blocked by the client, perhaps because of an ad blocker?
Lookup values
The source of truth for some lookup objects like car brand, car color, etc., is the /conf/bconf directory of the tutti repo. This directory contains the configuration files with the desired key-value lookup pairs, albeit some data munging is needed.
Car brand
To extract the car brand lookup values, get the file bconf.txt.cargroup. This file has Latin-1 encoding, which my grep struggles with. Convert it to UTF-8 with: iconv -f iso-8859-1 -t utf-8 bconf.txt.cargroup > bconf.txt.cargroup-utf8
Then extract the lookup key-values with:
grep "^\*\.\*\.cargroup\.[0-9]\+\.name" bconf.txt.cargroup-utf8 | grep -o "[0-9]\+.*" | sed 's/.name=/ /' | tr '[:upper:]' '[:lower:]'
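If grep's encoding handling gets in the way, the same extraction can be approximated in Python, which can read Latin-1 directly and so makes the iconv step unnecessary. This is a sketch under assumptions: the line format in the sample below (`*.*.cargroup.<id>.name=<brand>`) is inferred from the grep pattern above, and the sample lines themselves are made up.

```python
import re

# Stricter than the grep above: requires ".name=" right after the numeric id.
CARGROUP_NAME = re.compile(r"^\*\.\*\.cargroup\.(\d+)\.name=(.*)$")

def extract_car_brands(lines):
    """Return (id, lowercased brand name) pairs from bconf-style lines."""
    pairs = []
    for line in lines:
        match = CARGROUP_NAME.match(line.rstrip("\n"))
        if match:
            pairs.append((match.group(1), match.group(2).lower()))
    return pairs

def read_car_brands(path):
    """Read the Latin-1 encoded file directly, without converting it first."""
    with open(path, encoding="iso-8859-1") as handle:
        return extract_car_brands(handle)

# Made-up sample lines, only to show the mechanics.
sample = [
    "*.*.cargroup.12.name=Alfa Romeo\n",
    "*.*.cargroup.12.sort=3\n",
    "*.*.other.1.name=Ignored\n",
]
print(extract_car_brands(sample))
```

Calling read_car_brands("bconf.txt.cargroup") would then give the pairs for the real file.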
Car type
To extract the car type lookup values, fetch the file bconf.txt.language. This file has Latin-1 encoding, which my grep struggles with. Convert it to UTF-8 with: iconv -f iso-8859-1 -t utf-8 bconf.txt.language > bconf.txt.language.utf8
Subsequently extract the key-values with:
grep "chassis\..*\.en.*\.value" bconf.txt.language.utf8 | tr '[:upper:]' '[:lower:]' | grep "chassis.*" -o | sed 's/chassis.//' | sed 's/.en.value=value:/ /'
Car color
To extract the car color lookup values, fetch the file bconf.txt.car_params. This file has Latin-1 encoding, which my grep struggles with. Convert it to UTF-8 with: iconv -f iso-8859-1 -t utf-8 bconf.txt.car_params > bconf.txt.car_params.utf8
Subsequently extract the key-values with:
grep "color.*name.de" bconf.txt.car_params.utf8 | grep "[0-9]\+.*" -o | sed 's/.name.de=/ /' | tr '[:upper:]' '[:lower:]'
Squared meters
You can decode a squared meters value using one of two schemes. Which decoding scheme to use depends on the sub-category. For example, sub-category Land uses the decoding scheme with higher values of squared meters, while apartments use the scheme with lower values. The decoding schemes are stored in bconf.txt.adparams
. This file has Latin-1 encoding which my grep
struggles with. Convert it to utf-8 with: iconv -f iso-8859-1 -t utf-8 bconf.txt.adparams > bconf.txt.adparams.utf8
. Subsequently extract the key-values with grep "sizelist.*list" bconf.txt.adparams.utf8 | grep "[0-9]\{1,2\}=.*