
@prologic Done. Also, I went ahead and made two changes: changed hexadecimal to base64 for hashes (wasn't sure if anyone objected), and changed "MUST follow the chain" to "SHOULD follow the chain".
Sorry, you're right, I should have used numbers!
I don't understand what "preserve the original hash" could mean other than "make sure there's still a twt in the feed with that hash". Maybe the text could be clarified somehow.
I'm also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But it's not universal; e.g. as a jenny user I just see the plain text.
@prologic Thanks for writing that up!
I hope it can remain a living document (or sequence of draft revisions) for a good long time while we figure out how this stuff works in practice.
I am not sure how I feel about all this being done at once, vs. letting conventions arise.
For example, even today I could reply to twt abc1234 with "(#abc1234) Edit: ..." and I think all you humans would understand it as an edit to (#abc1234). Maybe eventually it would become a common enough convention that clients would start to support it explicitly.
Similarly we could just start using 11-digit hashes. We should iron out whether it's sha256 or whatever, but there's no need to get all the other stuff right at the same time.
I have similar thoughts about how some users could try out location-based replies in a backward-compatible way (append the replyto: stuff after the legacy (#hash) style).
However I recognize that I'm not the one implementing this stuff, and it's less work to just have everything determined up front.
Misc comments (I haven't read the whole thing):
Did you mean to make hashes hexadecimal? You lose 11 bits that way compared to base32. I'd suggest gaining 11 bits with base64 instead.
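To put numbers on that (a quick sanity check, assuming each character carries the full log2 of its alphabet size):

```python
import math

# Bits of hash carried by an 11-character identifier in each encoding.
for name, alphabet in [("hex", 16), ("base32", 32), ("base64", 64)]:
    print(f"{name}: {11 * math.log2(alphabet):.0f} bits")
# hex: 44 bits, base32: 55 bits, base64: 66 bits
```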
"Clients MUST preserve the original hash" --- do you mean they MUST preserve the original twt?
Thanks for phrasing the bit about deletions so neutrally.
I don't like the MUST in "Clients MUST follow the chain of reply-to references...". If someone writes a client as a 40-line shell script that requires the user to piece together the threading themselves, IMO we shouldn't declare the client non-conforming just because they didn't get to all the bells and whistles.
Similarly I don't like the MUST for user agents. For one thing, you might want to fetch a feed without revealing your identity. Also, it raises the bar for a minimal implementation (I'm thinking again of the 40-line shell script).
For "who follows" lists: why must the long, random tokens be only valid for a limited time? Do you have a scenario in mind where they could leak?
Why can't feeds be served over HTTP/1.0? Again, thinking about simple software. I recently tried implementing HTTP/1.1 and it wasn't too bad, but 1.0 would have been slightly simpler.
Why get into the nitty-gritty about caching headers? This seems like generic advice for HTTP servers and clients.
I'm a little sad about other protocols being not recommended.
I don't know how I feel about including markdown. I don't mind too much that yarn users emit twts full of markdown, but I'm more of a plain text kind of person. Also it adds to the length. I wonder if putting it in a separate document would make more sense; that would also help with the length.
@prologic Wikipedia claims sha1 is vulnerable to a "chosen-prefix attack", which I gather means I can write any two twts I like, and then cause them to have the exact same sha1 hash by appending something. I guess a twt ending in random junk might look suspicious, but perhaps the junk could be worked into an image URL. If that's not possible now maybe it will be later.
git only uses sha1 because they're stuck with it: migrating is very hard. There was an effort to move git to sha256 but I don't know its status. I think there is progress being made with Game Of Trees, a git clone that uses the same on-disk format.
I can't imagine any benefit to using sha1, except that maybe some very old software might support sha1 but not sha256.
@movq Agreed that hashes have a benefit. I came up with a similar example when I twted about an 11-character hash collision. Perhaps hashes could be made optional somehow. Like, you could use the "replyto" idea and then additionally put a hash somewhere if you want to lock in which version of the twt you are replying to.
There's a simple reason all the current hashes end in a or q: the hash is 256 bits, the base32 encoding chops that into groups of 5 bits, and 256 isn't divisible by 5. The last character of the base32 encoding just has that left-over single bit (256 mod 5 = 1).
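If you want to see it happen, here's a quick demonstration. I'm using sha256 only as a convenient source of 256-bit values; the point is the encoding arithmetic, not the particular digest:

```python
import base64, hashlib, os

for _ in range(5):
    digest = hashlib.sha256(os.urandom(16)).digest()  # any 256-bit value will do
    b32 = base64.b32encode(digest).decode().rstrip("=").lower()
    # 256 bits / 5 bits per character = 51 full characters plus one leftover bit,
    # so the 52nd (final) character can only encode 0 ('a') or 16 ('q').
    print(b32[-8:])
```

Every suffix printed ends in a or q.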
So I agree with #3 below, but do you have a source for #1, #2 or #4? I would expect any lack of variability in any part of a hash function's output would make it more vulnerable to attacks, so designers of hash functions would want to make the whole output vary as much as possible.
Other than the divisible-by-5 thing, my current intuition is it doesn't matter what part you take.
1. Hash Structure: Hashes are typically designed so that their outputs have specific statistical properties. The first few characters often have more entropy or variability, meaning they are less likely to have patterns. The last characters may not maintain this randomness, especially if the encoding method has a tendency to produce less varied endings.
2. Collision Resistance: When using hashes, the goal is to minimize the risk of collisions (different inputs producing the same output). By using the first few characters, you leverage the full distribution of the hash. The last characters may not distribute in the same way, potentially increasing the likelihood of collisions.
3. Encoding Characteristics: Base32 encoding has a specific structure and padding that might influence the last characters more than the first. If the data being hashed is similar, the last characters may be more similar across different hashes.
4. Use Cases: In many applications (like generating unique identifiers), the beginning of the hash is often the most informative and varied. Relying on the end might reduce the uniqueness of generated identifiers, especially if a prefix has a specific context or meaning.
Maybe I’m being a bit too purist/minimalistic here. As I said before (in one of the 1372739 posts on this topic – or maybe I didn’t even send that twt, I don’t remember 😅), I never really liked hashes to begin with. They aren’t super hard to implement but they are kind of against the beauty of the original twtxt – because you need special client support for them. It’s not something that you could write manually in your twtxt.txt file. With @sorenpeter’s proposal, though, that would be possible.
Tangentially related, I was a bit disappointed to learn that the twt subject extension is now never used except with hashes. Manually-written subjects sounded so beautifully ad-hoc and organic as a way to disambiguate replies. Maybe I'll try it some time just for fun.
@falsifian You mean the idea of being able to inline # url = changes in your feed?
Yes, that one. But @lyse pointed out that it suffers a compatibility issue, since currently the first listed url is used for hashing, not the last. Unless your feed is in reverse chronological order. Heh, I guess another metadata field could indicate which version to use.
Or maybe url changes could somehow be combined with the archive feeds extension? Could the url metadata field be local to each archive file, so that to switch to a new url all you need to do is archive everything you've got and start a new file at the new url?
I don't think it's that likely my feed url will change.
@prologic Brute force. I just hashed a bunch of versions of both tweets until I found a collision.
I mostly just wanted an excuse to write the program. I don't know how I feel about actually using super-long hashes; could make the twts annoying to read if you prefer to view them untransformed.
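For the curious, the search looks roughly like this (a sketch rather than the exact program I used; hash_of is just a stand-in truncated digest, and the small n keeps the demo fast, whereas a full 11-character collision takes many millions of attempts per side):

```python
import base64, hashlib, os

def hash_of(text, n):
    digest = hashlib.sha256(text.encode()).digest()  # placeholder digest
    return base64.b32encode(digest).decode().rstrip("=").lower()[-n:]

def find_collision(text_a, text_b, n=6):
    # Birthday-style search: keep appending random junk to both texts and
    # remember every truncated hash seen on each side until the two sides meet.
    seen_a, seen_b = {}, {}
    while True:
        a = text_a + " " + os.urandom(4).hex()
        b = text_b + " " + os.urandom(4).hex()
        ha, hb = hash_of(a, n), hash_of(b, n)
        seen_a[ha], seen_b[hb] = a, b
        if ha in seen_b:
            return seen_a[ha], seen_b[ha]
        if hb in seen_a:
            return seen_a[hb], seen_b[hb]

print(find_collision("first twt", "second twt"))
```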
@lyse This looks like a nice way to do it.
Another thought: if clients can't agree on the url (for example, if we switch to this new way, but some old clients still do it the old way), that could be mitigated by computing many hashes for each twt: one for every url in the feed. So, if a feed has three URLs, every twt is associated with three hashes when it comes time to put threads together.
A client still needs to choose one url to use for the hash when composing a reply, but this might add some breathing room if there's a period when clients are doing different things.
(From what I understand of jenny, this would be difficult to implement there since each pseudo-email can only have one msgid to match to the in-reply-to headers. I don't know about other clients.)
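Here's a sketch of what I mean; the twt_hash helper is only my reading of the recipe at https://dev.twtxt.net/doc/twthashextension.html (blake2b over url, timestamp and text, base32, last 7 characters), so treat those details as assumptions and check the spec:

```python
import base64, hashlib

def twt_hash(url, created, text):
    # My reading of the Twt Hash recipe; see dev.twtxt.net for the
    # authoritative digest and field formatting.
    payload = f"{url}\n{created}\n{text}".encode()
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    return base64.b32encode(digest).decode().rstrip("=").lower()[-7:]

def index_feed(feed_urls, twts):
    """Map every candidate hash to its twt: one hash per URL the feed has used."""
    index = {}
    for created, text in twts:
        for url in feed_urls:
            index[twt_hash(url, created, text)] = (created, text)
    return index

# Threading then just looks up an incoming (#hash) subject in this index,
# whichever URL the replying client happened to hash against.
```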
@movq Another idea: just hash the feed url and time, without the message content. And don't twt more than once per second.
Maybe you could even just use the time, and rely on @-mentions to disambiguate. Not sure how that would work out.
Though I kind of like the idea of twts being immutable. At least, it's clear which version of a twt you're replying to (assuming nobody is engineering hash collisions).
@movq @prologic Another option would be: when you edit a twt, prefix the new one with (#[old hash]) and some indication that it's an edited version of the original tweet with that hash. E.g. if the hash used to be abcd123, the new version should start "(#abcd123) (redit)".
What I like about this is that clients that don't know this convention will still stick it in the same thread. And I feel it's in the spirit of the old pre-hash (subject) convention, though that's before my time.
I guess it may not work when the edited twt itself is a reply, and there are replies to it. Maybe that could be solved by letting twts have more than one (subject) prefix.
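To illustrate with made-up hashes and timestamps (assuming the original reply was itself a reply to abcd123 and hashed to def4567):

```
2024-09-25T10:00:00Z	(#abcd123) Original reply text.
2024-09-25T10:05:00Z	(#abcd123) (#def4567) (redit) Corrected reply text.
```

Clients that know the convention could treat the second line as an edit of def4567, while clients that don't would still file it under the abcd123 thread.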
But the great thing about the current system is that nobody can spoof message IDs.
I don't think twtxt hashes are long enough to prevent spoofing.
@prologic One of your twts begins with (#st3wsda): https://twtxt.net/twt/bot5z4q
Based on the twtxt.net web UI, it seems to be in reply to a twt by @cuaxolotl which begins "I’ve been sketching out...".
But jenny thinks the hash of that twt is 6mdqxrq. At least, there's a twt in their feed with that hash that has the same text as appears on yarn.social (except with ' instead of ’).
Based on this, it appears jenny and yarnd disagree about the hash of the twt, or perhaps the twt was edited (though I can't see any difference, assuming ' vs ’ is just a rendering choice).
I just manually followed the steps at https://dev.twtxt.net/doc/twthashextension.html and got 6mdqxrq. I wonder what happened. Did @cuaxolotl edit the twt in some subtle way after twtxt.net downloaded it? I couldn't spot a diff, other than ' appearing as ’ on yarn.social, which I assume is a transformation done by twtxt.net.
@prologic How does yarn.social's API fix the problem of centralization? I still need to know whose API to use.
Say I see a twt beginning (#hash) and I want to look up the start of the thread. Is the idea that if that twt is hosted by a yarn.social pod, it is likely to know the thread start, so I should query that particular pod for the hash? But what if no yarn.social pods are involved?
The community seems small enough that a registry server should be able to keep up, and I can have a couple of others as backups. Or I could crawl the list of feeds followed by whoever emitted the twt that prompted my query.
I have successfully used registry servers a little bit, e.g. to find a feed that mentioned a tag I was interested in. Was even thinking of making my own, if I get bored of my too many other projects :-)
@movq Thanks, it works!
But when I tried it out on a twt from @prologic, I discovered jenny and yarn.social seem to disagree about the hash of this twt: https://twtxt.net/twt/st3wsda . jenny assigned it a hash of 6mdqxrq but the URL and prologic's reply suggest yarn.social thinks the hash is st3wsda. (And as a result, jenny --fetch-context didn't work on prologic's twt.)
@prologic Yes, fetching the twt by hash from some service could be a good alternative, in case the twt I have does not @-mention the source. (Besides yarnd, maybe this should be part of the registry API? I don't see fetch-by-hash in the registry API docs.)